Introduction
RDF data is specified in terms of strings.
AllegroGraph converts the string representation of a URI into a 12-byte universal part identifier (UPI). These 12 bytes can either be an encoded representation of the string or a hash that is used to save and retrieve the string's contents in a string dictionary. Because they do not require the lookup and I/O costs of string retrieval, encoded parts can be much more efficient.
AllegroGraph automatically encodes strings representing most numeric formats, dates, times, and date times. Geospatial information can also be encoded if the particular format is registered with AllegroGraph. AllegroGraph can now encode strings representing identifiers using the API described below.
This lets URIs like
http://franz.com/identifiers/person@@156-agh-4562
be saved and manipulated without requiring a trip to the string dictionary.
Note the '@@' separating the prefix and the id. This is used in place of the normal '#' to indicate that an encoded id is intended.
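The string form described above can be illustrated with a small sketch that splits an encoded-id string at the '@@' separator. This is only an illustration of the notation: the class EncodedIdString and its parse method are hypothetical names, not part of any AllegroGraph API, and splitting at the last '@@' is an assumption made here for simplicity.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not part of any AllegroGraph API. */
final class EncodedIdString {
    static final String SEPARATOR = "@@";

    final String prefix; // e.g. http://franz.com/identifiers/person
    final String id;     // e.g. 156-agh-4562

    private EncodedIdString(String prefix, String id) {
        this.prefix = prefix;
        this.id = id;
    }

    /**
     * Splits the string at the last "@@" (an assumption of this sketch).
     * Returns empty when the separator is missing or either part is empty.
     */
    static Optional<EncodedIdString> parse(String uri) {
        int at = uri.lastIndexOf(SEPARATOR);
        if (at <= 0 || at + SEPARATOR.length() >= uri.length()) {
            return Optional.empty();
        }
        return Optional.of(new EncodedIdString(
                uri.substring(0, at),
                uri.substring(at + SEPARATOR.length())));
    }
}
```

A string that uses the normal '#' separator yields an empty result, which corresponds to a URI that is stored as an ordinary string.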
Setup
To use Encoded IDs, a string prefix and an associated ID format are registered with AllegroGraph (using the Java API):
registerEncodableNamespace(
    "http://franz.com/identifiers/person",
    "[0-9]{3}-[a-z]{3}-[0-9]{1,4}")
This tells AllegroGraph to automatically convert any string that has this prefix and a suffix matching the ID format into an encoded UPI rather than saving it as a string.
Note that our example has variable-width IDs. (The {1,4} at the end means from one to four digits 0 to 9, so 121-abc-1, 121-abc-11, and 121-abc-1234 are all acceptable IDs.) Variable-width IDs cannot use automatic generation of IDs, so their namespace prefix is not suitable as an argument to next-encoded-upi-for-prefix. A fixed-width id template, whose prefix is suitable, would be, for example,
"[0-9]{3}-[a-z]{3}-[0-9]{2}"
Once prefixes and id templates have been registered, you can define encoded IDs by specifying them with a string like
"http://franz.com/identifiers/person@@121-abc-1"
When the template is fixed-width, you can also get IDs with next-encoded-upi-for-prefix.
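The difference between variable-width and fixed-width templates can be sketched as follows. The two templates shown in this section happen to be valid java.util.regex patterns, so this sketch uses java.util.regex as a stand-in for AllegroGraph's own template matching; that substitution is an assumption, as are the names IdTemplateSketch, matches, and isFixedWidth. The width check only understands the {n} and {m,n} repeat forms used in this document.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration; java.util.regex stands in for the real template matcher. */
final class IdTemplateSketch {
    // Finds range repeats such as {1,4}; group 1 is the minimum, group 2 the maximum.
    private static final Pattern RANGE = Pattern.compile("\\{(\\d+),(\\d+)\\}");

    /** True when the whole id string matches the template. */
    static boolean matches(String template, String id) {
        return Pattern.matches(template, id);
    }

    /** True when no {m,n} repeat in the template has differing bounds. */
    static boolean isFixedWidth(String template) {
        Matcher m = RANGE.matcher(template);
        while (m.find()) {
            int min = Integer.parseInt(m.group(1));
            int max = Integer.parseInt(m.group(2));
            if (min != max) {
                return false;
            }
        }
        return true;
    }
}
```

Under these assumptions, "[0-9]{3}-[a-z]{3}-[0-9]{1,4}" accepts 121-abc-1 and 121-abc-1234 but is reported as variable-width, while "[0-9]{3}-[a-z]{3}-[0-9]{2}" is reported as fixed-width.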
Uses of Encoded IDs
Encoded IDs are not designed to encode information. Instead, they are analogous to blank nodes, that is, placeholder nodes which themselves link to information.
Suppose, for example, you have data on employees (name, address, salary, date of hire, etc.). You register an encoded id prefix
"http://www.mycompany.com/identifiers/employees"
and an id template
"[0-9]{3}-[a-z]{3}-[0-9]{3}"
Then, for each employee, create an encoded id, say
"http://www.mycompany.com/identifiers/employees@@103-ayt-928"
and add triples
"http://www.mycompany.com/identifiers/employees@@103-ayt-928" "name" "John Smith"
"http://www.mycompany.com/identifiers/employees@@103-ayt-928" "salary" "50000"
and so on.
Encoded IDs are preferable to blank nodes because you can refer to them without referencing another node (you can only get a handle on a blank node by finding another node which points to it).
Federation
Since the prefixes are registered with a particular store, federations of stores may have different ideas about which encoding a given prefix should use. For example, store A may think that http://franz.com/identifiers/person should be encoded with id 45 where store B thinks that it is id 46,012. If the stores are federated, additional bookkeeping would need to happen so that the data is correctly processed. This bookkeeping is not supported in the current release. You should only register a prefix in a single store.
ID generation
You can generate ids automatically using next-encoded-upi-for-prefix, which takes the prefix string and a upi as its two required arguments (the upi is modified to be the next id in sequence). AllegroGraph keeps track of a counter for each prefix in a manner similar to that used to generate fresh blank node IDs. Multiple processes can request the next identifier for a given prefix and are guaranteed to get unique IDs back (but see the note on uniqueness in the event of a crash below: uniqueness of uncommitted IDs after a recovery is not guaranteed).
Or you can specify a UPI using the prefix followed by @@ followed by a value which follows the definition template.
For a specific prefix, you should either always generate IDs automatically with next-encoded-upi-for-prefix or always specify IDs yourself. Do not do both as that may result in accidentally using the same id when two separate ids are needed.
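To make the idea of a per-prefix counter concrete, here is a minimal sketch of how a counter value could be turned into an id for a fixed-width template. Everything in it is an assumption for illustration: the class SequentialIdSketch, the choice to start at the lowest id, and the ordering in which the rightmost position varies fastest. The actual sequence produced by next-encoded-upi-for-prefix is not described here and may differ. The sketch only accepts templates built from [x-y]{n} groups and literal characters, and rejects anything else, including variable-width templates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of counter-based id generation; not AllegroGraph code. */
final class SequentialIdSketch {
    // A token is either a range with a fixed repeat count, e.g. [0-9]{3}, or one literal character.
    private static final Pattern TOKEN = Pattern.compile("\\[(.)-(.)\\]\\{(\\d+)\\}|(.)");

    // One {low, high} pair per output position; low == high for literal characters.
    private final List<char[]> slots = new ArrayList<>();
    private final AtomicLong counter = new AtomicLong();

    SequentialIdSketch(String fixedWidthTemplate) {
        Matcher m = TOKEN.matcher(fixedWidthTemplate);
        while (m.find()) {
            if (m.group(4) != null) {
                char c = m.group(4).charAt(0);
                if ("[]{}".indexOf(c) >= 0) {
                    throw new IllegalArgumentException(
                            "only fixed-width templates of the form [x-y]{n} are supported");
                }
                slots.add(new char[] {c, c});
            } else {
                int repeat = Integer.parseInt(m.group(3));
                for (int i = 0; i < repeat; i++) {
                    slots.add(new char[] {m.group(1).charAt(0), m.group(2).charAt(0)});
                }
            }
        }
    }

    /** Mixed-radix conversion of n, with the rightmost position varying fastest. */
    String idFor(long n) {
        char[] out = new char[slots.size()];
        for (int i = slots.size() - 1; i >= 0; i--) {
            char[] range = slots.get(i);
            int radix = range[1] - range[0] + 1;
            out[i] = (char) (range[0] + n % radix);
            n /= radix;
        }
        if (n != 0) {
            throw new IllegalStateException("template exhausted");
        }
        return new String(out);
    }

    /** Returns a fresh id on every call, including calls from multiple threads. */
    String next() {
        return idFor(counter.getAndIncrement());
    }
}
```

With the template "[0-9]{3}-[a-z]{3}-[0-9]{2}", this sketch yields 000-aaa-00, then 000-aaa-01, and so on. It also suggests why mixing generated and hand-specified IDs for one prefix is risky: the counter has no knowledge of IDs that were written by hand.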
ID uniqueness in the event of a crash
Internal data structures keep track of IDs used and the next available ID for generation. Whenever a commit is done (with, say, commit-triple-store), the state of that internal structure is saved. Until a commit, however, if there is a crash of some sort, the next id after recovery may be the same as one provided after the last commit but before the crash. Consider, for example, the following sequence (using the Lisp API):
(register-encoded-id-prefix
 "http://franz.com/employees"
 "[0-9]{3}-[a-z]{3}-[0-9]{2}")
(setq em-id1
      (next-encoded-upi-for-prefix "http://franz.com/employees" (make-upi)))
(commit-triple-store)
(setq em-id2
      (next-encoded-upi-for-prefix "http://franz.com/employees" (make-upi)))
;; This registration is never committed, so it is lost in the crash.
(register-encoded-id-prefix
 "http://franz.com/clients"
 "[0-9]{2}-[a-z]{3}-[0-9]{2}")
(setq cli-id1
      (next-encoded-upi-for-prefix "http://franz.com/clients" (make-upi)))
;; <server crash>
;; <close database>
;; <server restart>
;; <reopen database>
(setq cli-id1
      (next-encoded-upi-for-prefix "http://franz.com/clients" (make-upi)))
;; -> ERROR, that prefix is unknown
(setq em-id3
      (next-encoded-upi-for-prefix "http://franz.com/employees" (make-upi)))
;; The value of em-id3 may be the same as the value of em-id2 before the server crash.
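The behavior in the example above can be modeled with a small in-memory simulation. This is a sketch of the described semantics only, not of AllegroGraph's implementation: the class CounterDurabilitySketch and its methods are hypothetical, counters are plain numbers rather than UPIs, and a crash is modeled as discarding everything since the last commit.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of per-prefix counters that are only durable at commit time. */
final class CounterDurabilitySketch {
    private Map<String, Long> live = new HashMap<>();      // current in-memory state
    private Map<String, Long> committed = new HashMap<>(); // state saved by the last commit

    void register(String prefix) {
        live.putIfAbsent(prefix, 0L);
    }

    /** Returns the next counter value for the prefix and advances the live counter. */
    long next(String prefix) {
        Long current = live.get(prefix);
        if (current == null) {
            throw new IllegalStateException("unknown prefix: " + prefix);
        }
        live.put(prefix, current + 1);
        return current;
    }

    /** Saves the registrations and counters, as a commit would. */
    void commit() {
        committed = new HashMap<>(live);
    }

    /** Models a crash followed by recovery: only committed state survives. */
    void crashAndRecover() {
        live = new HashMap<>(committed);
    }
}
```

Replaying the sequence above against this model gives the same outcome: after recovery the clients prefix is unknown, and the next employees id repeats the one handed out between the commit and the crash. Committing after generating IDs that matter avoids the repeat.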
Lisp API
The following functions are defined:
- register-encoded-id-prefix
- unregister-encoded-id-prefix
- encoded-id-prefix-registered-p
- next-encoded-upi-for-prefix
- encoded-id-prefix-in-use-p
- collect-encoded-id-prefixes
- map-encoded-id-prefixes
See the documentation for those functions for more information.
HTTP API
- (service "r" :get store ("encodedIds" "prefixes") ()
- (service "W" :post store ("encodedIds" "prefixes")
- (service "W" :post store "encodedIds" ((prefix :string) (amount :integer 1))
Java API
- AGRepositoryConnection#registerEncodableNamespace(String namespace, String format)
- generateURI(String encodableNamespace)
Adding triples
Once a prefix has been registered, triples added that use the format will be automatically converted. The current implementation will not recode existing triples to use the new prefix.
Re-registration
Once triples that use a particular prefix have been added to a store, it will not be possible to remove that prefix (without also dropping all of those triples or converting all of their IDs into strings). The first version of this feature will disallow the changing of registered prefixes after they have been committed.
Upgrading from 4.2
When Encoded IDs were introduced in version 4.2, they were identified with @# in their string representations. Starting in release 4.2.1a, the identifier is @@ and @# will no longer work. If you used Encoded IDs, please change @# to @@ wherever necessary.
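If the old separator appears in data files or application strings, the change can be sketched as a simple text rewrite. The helper below is a hypothetical illustration, not a Franz-provided tool: it replaces @# with @@ only where it directly follows one of the prefixes you pass in, so that an unrelated "@#" elsewhere in the text is left alone. The class and method names are assumptions.

```java
import java.util.List;

/** Hypothetical migration helper; not part of AllegroGraph. */
final class SeparatorMigrationSketch {
    /** Rewrites prefix@# to prefix@@ for each of the given registered prefixes. */
    static String migrate(String text, List<String> registeredPrefixes) {
        String result = text;
        for (String prefix : registeredPrefixes) {
            result = result.replace(prefix + "@#", prefix + "@@");
        }
        return result;
    }
}
```

For example, with the prefix http://franz.com/identifiers/person, the string http://franz.com/identifiers/person@#156-agh-4562 becomes http://franz.com/identifiers/person@@156-agh-4562.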
Future features
We welcome comments about this feature. We are considering future enhancements including
- Unicode support in id-formats
- checking for ambiguous id-formats
- updating triples when new prefixes are registered
- updating triples when registrations are dropped or changed