Introduction
RDF data is specified in terms of strings.
AllegroGraph converts the string representation of a URI into a 12-byte universal part identifier (UPI). These 12 bytes can either be an encoded representation of the string or a hash that is used to save and retrieve the string's contents in a string dictionary. Because encoded parts avoid the lookup and I/O costs of string retrieval, they can be much more efficient.
AllegroGraph automatically encodes strings representing most numeric formats, dates, times, and date times. Geospatial information can also be encoded if the particular format is registered with AllegroGraph. AllegroGraph can now encode strings representing identifiers using the API described below.
This lets URIs like
http://franz.com/identifiers/person@@156-agh-4562
be saved and manipulated without requiring a trip to the string dictionary.
Note the '@@' separating the prefix and the id. This is used in place of the normal '#' to indicate that an encoded id is intended.
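As a minimal illustration of the '@@' convention, the following Python sketch splits such a string into its prefix and id. The function name split_encoded_id is hypothetical and is not part of any AllegroGraph API; it only shows how the separator divides the string.

```python
from typing import Tuple

def split_encoded_id(uri: str) -> Tuple[str, str]:
    """Split 'prefix@@id' into (prefix, id).

    Hypothetical helper for illustration only; raises ValueError when
    the string does not contain a non-empty prefix, '@@', and an id.
    """
    prefix, sep, ident = uri.partition("@@")
    if not sep or not prefix or not ident:
        raise ValueError("not an encoded-id string: %r" % uri)
    return prefix, ident

print(split_encoded_id("http://franz.com/identifiers/person@@156-agh-4562"))
# ('http://franz.com/identifiers/person', '156-agh-4562')
```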
Setup
To use Encoded IDs, a string prefix and an associated ID format are registered with AllegroGraph (using the Java API):
registerEncodableNamespace(
"http://franz.com/identifiers/person",
"[0-9]{3}-[a-z]{3}-[0-9]{1,4}")
This tells AllegroGraph to automatically convert any string that has this prefix and a suffix matching the ID format into an encoded UPI rather than saving it as a string. The characters + and * are not allowed in the template string (so, e.g., the string "[0-9]+" is illegal and will fail).
Note that our example has variable-width IDs: the {1,4} at the end means from one to four digits 0 to 9, so 121-abc-1, 121-abc-11, and 121-abc-1234 are all acceptable IDs. Variable-width IDs cannot be generated automatically (so their namespace prefix is not suitable as an argument to next-encoded-upi-for-prefix). A fixed-width template whose prefix is suitable would be, for example,
"[0-9]{3}-[a-z]{3}-[0-9]{2}"
Once prefixes and id templates have been registered, you can define encoded IDs by specifying them with a string like
"http://franz.com/identifiers/person@@121-abc-1"
When the template is fixed-width, you can also get IDs with next-encoded-upi-for-prefix.
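The template rules in this section can be checked locally before registering a prefix. The Python sketch below is an approximation under stated assumptions: it uses Python's re module as a stand-in for AllegroGraph's own template matcher, and it only understands the constructs that appear in the examples here (character classes, literal characters, {n} and {m,n}). The helper names are hypothetical.

```python
import re

def check_template(template):
    """Reject the quantifiers the text says are illegal in a template."""
    if "+" in template or "*" in template:
        raise ValueError("'+' and '*' are not allowed in an ID template: %r" % template)

def is_fixed_width(template):
    """True when every {m,n} repetition in the template has m == n."""
    check_template(template)
    return all(int(lo) == int(hi)
               for lo, hi in re.findall(r"\{(\d+),(\d+)\}", template))

def matches(template, ident):
    """Check an ID against a template (Python's re is an assumed stand-in)."""
    check_template(template)
    return re.fullmatch(template, ident) is not None

variable = "[0-9]{3}-[a-z]{3}-[0-9]{1,4}"
fixed = "[0-9]{3}-[a-z]{3}-[0-9]{2}"

print(matches(variable, "121-abc-1"))      # True
print(matches(variable, "121-abc-12345"))  # False: five trailing digits
print(is_fixed_width(variable))            # False: {1,4} varies in width
print(is_fixed_width(fixed))               # True
```

Only a template for which is_fixed_width returns true would be a candidate for automatic ID generation.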
Encoded-ids in multi-master replication cluster instances
When a repository is an instance of a multi-master replication cluster (see Multi-master Replication), encoded-id templates must define exactly 2^60 unique values. While it is possible to convert a repository with registered prefixes whose templates do not define exactly that number of possibilities, calling next-encoded-upi-for-prefix on such a prefix will cause an error. (Existing UPIs are not a problem.)
The special template "plain" will always work. It simply goes through 2^60 integers converted to strings. "plain" is the recommended template when the goal is to create unique URIs.
Here are templates that meet the 2^60 unique strings requirement:
[a-p]{15}
[0-7]{10}-[0-7]{10}
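To see why these templates qualify, it helps to count the strings a template can produce. The Python sketch below does that arithmetic for the simple template syntax used in this document (character classes with ranges, literal characters, {n} and {m,n}). The function names are hypothetical and the parser is an assumption about the syntax, not AllegroGraph's implementation.

```python
import re

# One element: a character class or a literal, with an optional {n} or {m,n}.
ELEMENT = re.compile(r"(\[[^\]]+\]|[^\[\]{}])(?:\{(\d+)(?:,(\d+))?\})?")

def class_size(element):
    """Number of distinct characters an element can match."""
    if not element.startswith("["):
        return 1  # a literal such as '-'
    body = element[1:-1]
    chars = set()
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # a range such as a-p
            chars.update(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.add(body[i])
            i += 1
    return len(chars)

def count_strings(template):
    """Count the distinct strings a template can generate."""
    total = 1
    pos = 0
    while pos < len(template):
        m = ELEMENT.match(template, pos)
        if m is None:
            raise ValueError("unsupported template syntax: %r" % template)
        size = class_size(m.group(1))
        lo = int(m.group(2)) if m.group(2) else 1
        hi = int(m.group(3)) if m.group(3) else lo
        total *= sum(size ** k for k in range(lo, hi + 1))
        pos = m.end()
    return total

for t in ("[a-p]{15}", "[0-7]{10}-[0-7]{10}", "[0-9]{3}-[a-z]{3}-[0-9]{2}"):
    print(t, count_strings(t), count_strings(t) == 2 ** 60)
```

The first two templates give 16^15 and 8^20, both equal to 2^60 (1,152,921,504,606,846,976). The third gives 1000 × 17,576 × 100 = 1,757,600,000, which is why it is unsuitable in a cluster instance.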
Here is an example showing the behavior of next-encoded-upi-for-prefix with an unsuitable template and a suitable template before and after a repository has become a multi-master replication cluster instance:
triple-store-user(96): (create-triple-store "encoded-id-example"
:port 20641)
#<triple-db encoded-id-example:20641, 0, open @ #x10006d9d882>
triple-store-user(97): (register-encoded-id-prefix
"http://franz.com/test-1"
; Template is not suitable for
; calling next-encoded-upi-for-prefix
; in a cluster instance
"[0-9]{3}-[a-z]{3}-[0-9]{2}")
3
t
triple-store-user(98): (register-encoded-id-prefix
"http://franz.com/test-2"
; Template is suitable for
; calling next-encoded-upi-for-prefix
; in a cluster instance
"[a-p]{15}")
4
t
;; We commit the triple store to make the encoded id templates
;; permanent
triple-store-user(99): (commit-triple-store)
;; The repository is not yet a cluster instance so both
;; templates work:
triple-store-user(100): (next-encoded-upi-for-prefix
"http://franz.com/test-1" (make-upi))
{test-1@@000-aaa-00}
triple-store-user(101): (next-encoded-upi-for-prefix
"http://franz.com/test-2" (make-upi))
{test-2@@aaaaaaaaaaaaaaa}
;;
;; When we try to make the repository a cluster instance with
;; the command:
;; bin/agtool repl create-cluster --if-exists use \
;; http://test:xyzzy@localhost:20641/repositories/encoded-id-example
;;
;; it fails:
;; Warning: there are encoded id prefixes that will not
;; be able to be used
;; with next-encoded-upi-with-prefix when this repository is converted
;; to a replication instance.
;; The prefixes are:
;; http://franz.com/test-1
;;
;; If you wish to do the conversion despite the warning
;; put --force true on the command line
;;
;; create-cluster not performed.
;;
;; We re-enter the command with '--force true':
;;
;; bin/agtool repl create-cluster --if-exists use --force true \
;; http://test:xyzzy@localhost:20641/repositories/encoded-id-example
;;
;; Now the encoded id with a suitable template works but
;; the one with an unsuitable template signals an error
;; when next-encoded-upi-for-prefix is called:
;;
triple-store-user(102): (next-encoded-upi-for-prefix
"http://franz.com/test-2" (make-upi))
{test-2@@aaaeaaaaaaaaaab}
triple-store-user(103): (next-encoded-upi-for-prefix
"http://franz.com/test-1" (make-upi))
Error: In order to use next-encoded-upi-for-prefix the pattern must generate
1,152,921,504,606,846,976 strings but the pattern for
"http://franz.com/test-1" generates 1,757,600,000 strings
Restart actions (select using :continue):
0: Return to Top Level (an "abort" restart).
1: Abort entirely from this (lisp) process.
[1] triple-store-user(104):
Note that in later releases of AllegroGraph, the number of strings may be changed from 2^60.
Uses of Encoded IDs
Encoded IDs are not designed to encode information. Instead, they are analogous to blank nodes: placeholder nodes which themselves link to information.
Suppose, for example, you have data on employees (name, address, salary, date of hire, etc.). You register an encoded id prefix
"http://www.mycompany.com/identifiers/employees"
and an id template
"[0-9]{3}-[a-z]{3}-[0-9]{3}"
Then, for each employee, create an encoded id, say
"http://www.mycompany.com/identifiers/employees@@103-ayt-928"
and add triples
"http://www.mycompany.com/identifiers/employees@@103-ayt-928" "name" "John Smith"
"http://www.mycompany.com/identifiers/employees@@103-ayt-928" "salary" "50000"
and so on.
Encoded IDs are preferable to blank nodes because you can refer to them without referencing another node (you can only get a handle on a blank node by finding another node which points to it).
Federation
Since prefixes are registered with a particular store, the stores in a federation may disagree about which encoding a given prefix should use. For example, store A may think that http://franz.com/identifiers/person should be encoded with id 45 where store B thinks that it is id 46,012. If the stores are federated, additional bookkeeping would be needed so that the data is processed correctly. This bookkeeping is not supported in the current release. You should only register a prefix in a single store.
ID generation
You can generate ids automatically using next-encoded-upi-for-prefix, which takes the prefix string and a upi as its two required arguments (the upi is modified to be the next id in sequence). AllegroGraph keeps track of a counter for each prefix in a manner similar to that used to generate fresh blank node IDs. Multiple processes can request the next identifier for a given prefix and are guaranteed to get unique IDs back (but see the note on ID uniqueness in the event of a crash below: uniqueness of uncommitted IDs after a recovery is not guaranteed).
Or you can specify a UPI using the prefix followed by @@ followed by a value which follows the definition template.
For a specific prefix, you should either always generate IDs automatically with next-encoded-upi-for-prefix or always specify IDs yourself. Do not do both as that may result in accidentally using the same id when two separate ids are needed.
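The idea of a per-prefix counter can be sketched in Python as follows. This is an illustration only: the class PrefixCounter and the helper position_alphabets are hypothetical, the parser handles just single-range classes with {n} plus literal characters, and the ordering (rightmost position varies fastest) is an assumption. The order in which AllegroGraph actually hands out IDs is not specified here.

```python
import re
import threading

# A single-range class with a fixed repeat count, or one literal character.
ELEMENT = re.compile(r"\[(.)-(.)\]\{(\d+)\}|([^\[\]{}])")

def position_alphabets(template):
    """Expand a fixed-width template into one alphabet per character position."""
    alphabets = []
    pos = 0
    while pos < len(template):
        m = ELEMENT.match(template, pos)
        if m is None:
            raise ValueError("unsupported template: %r" % template)
        if m.group(4) is not None:
            alphabets.append(m.group(4))  # literal: alphabet of size one
        else:
            chars = "".join(chr(c) for c in range(ord(m.group(1)), ord(m.group(2)) + 1))
            alphabets.extend([chars] * int(m.group(3)))
        pos = m.end()
    return alphabets

class PrefixCounter:
    """Hands out successive IDs for one prefix; the lock keeps them unique."""

    def __init__(self, prefix, template):
        self.prefix = prefix
        self.alphabets = position_alphabets(template)
        self.next_value = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            n = self.next_value
            self.next_value += 1
        chars = []
        # Mixed-radix conversion, rightmost position first (assumed ordering).
        for alphabet in reversed(self.alphabets):
            n, digit = divmod(n, len(alphabet))
            chars.append(alphabet[digit])
        if n:
            raise OverflowError("template exhausted for " + self.prefix)
        return self.prefix + "@@" + "".join(reversed(chars))

counter = PrefixCounter("http://franz.com/test-1", "[0-9]{3}-[a-z]{3}-[0-9]{2}")
print(counter.next_id())  # http://franz.com/test-1@@000-aaa-00
print(counter.next_id())  # http://franz.com/test-1@@000-aaa-01
```

Mixing such a counter with hand-written IDs is what creates the collision risk described above: the counter has no knowledge of IDs it did not issue.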
ID uniqueness in the event of a crash
Internal data structures keep track of IDs used and the next available ID for generation. Whenever a commit is done (with, say, commit-triple-store), the state of that internal structure is saved. Until a commit, however, if there is a crash of some sort, the next id after recovery may be the same as one provided after the last commit but before the crash. For example, consider the following sequence (using the Lisp API):
(register-encoded-id-prefix
"http://franz.com/employees"
"[0-9]{3}-[a-z]{3}-[0-9]{2}")
(setq em-id1
(next-encoded-upi-for-prefix "http://franz.com/employees" (make-upi)))
(commit-triple-store)
(setq em-id2
(next-encoded-upi-for-prefix "http://franz.com/employees" (make-upi)))
(register-encoded-id-prefix
"http://franz.com/clients"
"[0-9]{2}-[a-z]{3}-[0-9]{2}")
(setq cli-id1
(next-encoded-upi-for-prefix "http://franz.com/clients" (make-upi)))
<server crash>
<close database>
<server restart>
<reopen database>
(setq cli-id1
(next-encoded-upi-for-prefix "http://franz.com/clients" (make-upi)))
-> ERROR, that prefix is unknown (its registration was never committed)
(setq em-id3
(next-encoded-upi-for-prefix "http://franz.com/employees" (make-upi)))
;; The value of em-id3 may be the same as the value of em-id2 before the server crash.
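The same sequence can be modeled without a server. The Python sketch below is a toy simulation, not AllegroGraph code: the class ToyStore is hypothetical, IDs are plain integers, and a crash is modeled by discarding all state that was not saved by a commit.

```python
class ToyStore:
    """Toy model: counter state survives a crash only if it was committed."""

    def __init__(self):
        self.committed = {}  # prefix -> next counter value, as saved at commit
        self.live = {}       # in-memory state, lost on a crash

    def register(self, prefix):
        self.live[prefix] = 0

    def next_id(self, prefix):
        if prefix not in self.live:
            raise KeyError("unknown prefix: " + prefix)
        n = self.live[prefix]
        self.live[prefix] = n + 1
        return n

    def commit(self):
        self.committed = dict(self.live)

    def crash_and_recover(self):
        self.live = dict(self.committed)

store = ToyStore()
store.register("http://franz.com/employees")
em_id1 = store.next_id("http://franz.com/employees")
store.commit()
em_id2 = store.next_id("http://franz.com/employees")  # handed out, not committed
store.register("http://franz.com/clients")            # registration not committed
store.crash_and_recover()

try:
    store.next_id("http://franz.com/clients")
except KeyError as err:
    print("error:", err)  # the uncommitted prefix is unknown after recovery

em_id3 = store.next_id("http://franz.com/employees")
print(em_id2 == em_id3)   # True: the uncommitted ID is handed out again
```

In this model, committing immediately after registering a prefix or generating IDs that matter closes the window in which a duplicate can be issued.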
Lisp API
The following functions are defined:
- register-encoded-id-prefix
- unregister-encoded-id-prefix
- encoded-id-prefix-registered-p
- next-encoded-upi-for-prefix
- encoded-id-prefix-in-use-p
- collect-encoded-id-prefixes
- map-encoded-id-prefixes
See the documentation for those functions for more information.
HTTP API
- Register an Encoded ID
- Unregister an Encoded ID
- Get a batch of new Encoded ID URIs
- Find out if there are any registered Encoded IDs
- Get a list of registered Encoded IDs
- Check if an Encoded ID prefix is registered
- Find out if any triples have been added using a given Encoded ID
Java API
- AGRepositoryConnection#registerEncodableNamespace(String namespace, String format)
- generateURI(String encodableNamespace)
Adding triples
Once a prefix has been registered, triples added that use the format will be automatically converted. The current implementation will not recode existing triples to use the new prefix.
Re-registration
Once triples that use a particular prefix have been added to a store, it will not be possible to remove that prefix (without also dropping all of those triples or converting all of their IDs into strings). The first version of this feature will disallow the changing of registered prefixes after they have been committed.
Upgrading from 4.2
When Encoded IDs were introduced in version 4.2, they were identified with @# in their string representations. Starting in release 4.2.1a, the identifier is @@ and @# will no longer work. If you used Encoded IDs, please change @# to @@ wherever necessary.
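If the old separator appears in data files or application code, the change can be applied mechanically. The Python sketch below rewrites @# to @@ only in strings that start with a known registered prefix, so other strings that happen to contain @# are left alone. The function name and the list of prefixes are hypothetical.

```python
def upgrade_encoded_id(value, prefixes):
    """Rewrite 'prefix@#id' to 'prefix@@id' for known prefixes only."""
    for prefix in prefixes:
        old = prefix + "@#"
        if value.startswith(old):
            return prefix + "@@" + value[len(old):]
    return value  # not an old-style encoded id for a known prefix

registered = ["http://franz.com/identifiers/person"]
print(upgrade_encoded_id("http://franz.com/identifiers/person@#156-agh-4562", registered))
# http://franz.com/identifiers/person@@156-agh-4562
```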
Future features
We welcome comments about this feature. We are considering future enhancements including
- Unicode support in id-formats
- check for ambiguous id-formats
- update triples when new prefixes are registered
- update triples when registrations are dropped or changed