Overview
This introduction provides an overview of AllegroGraph: what it does and how to use it. It is as language-neutral as possible; you will find language-specific details in the Java, Lisp, and HTTP tutorials and reference guides.
Though this introduction is aimed at using it as an RDF repository, AllegroGraph is much more and can also be used as a general graph database. In the cases when we present examples or usage-patterns that are beyond RDF, we will make it clear that we are doing so.
Required RDF knowledge. We assume that you are familiar with RDF (Resource Description Framework), RDFS (RDF Schema), and OWL (Web Ontology Language). If you are not, then we suggest that you start out with the Wikipedia entries for OWL, RDF and RDFS or go to the source:
- RDF: http://www.w3.org/TR/rdf-primer/,
- RDFS: http://www.w3.org/TR/rdf-schema/, and
- OWL: http://www.w3.org/TR/owl-guide/.
For a gentle introduction you can also read A Semantic Web Primer by Grigoris Antoniou and Frank van Harmelen (2001, Cambridge MA, MIT Press).
Accessing a triple-store
There are several ways to work with a triple-store:
Java. The Java client interface implements most of the Sesame and Jena interfaces for accessing remote RDF repositories. Because AllegroGraph provides functionality not found in other triple-stores, we have implemented extensions where applicable. When working in Java you will always be in client/server mode. Multiple Java clients can access the same triple-store, and a single Java client can access multiple triple-stores.
HTTP. It is now possible for web developers and programmers alike to interact with AllegroGraph 2.2.5 completely using a RESTful HTTP protocol (using GET, PUT, POST) to add and delete triples, to query for individual triples and to do SPARQL and Prolog selects. We didn't invent this interface ourselves. We implemented the Sesame 2.0 HTTP-interface with some extensions (see the HTTP protocol reference). The HTTP-interface can be used from any language that knows how to make HTTP client requests. This means that you can use AllegroGraph (functioning as a server) from Ruby, Python and many other languages.
Lisp. Lisp programmers can open one or more triple-stores from within Lisp. Lispers can create applications in the same image in which the AllegroGraph server is running, which gives them very fine-grained control over the behavior of the triple-store. Used this way, however, AllegroGraph runs stand-alone. The Lisp client API allows an application to run in client/server mode instead. In that mode multiple clients can access the same collection of triple-stores and each client can access multiple triple-stores. You can also use AllegroServe and extend the provided HTTP-interface if that is what your application requires.
TopBraid Composer. This is an advanced tool for examining and building ontologies. You can connect TopBraid Composer to AllegroGraph and visually inspect your data.
In the rest of the introduction we will regularly refer to 'all modes'. When we do so we primarily refer to the Java, HTTP and Lisp modes.
A bird's-eye view of AllegroGraph
We will start with a quick summary of AllegroGraph's features. Most of these will be covered in greater detail below.
Triples
For conventional reasons we call AllegroGraph a triple-store, but it actually stores quints. A triple is a structure with five slots: the first three are the usual subject (s), predicate (p), and object (o); in addition, a triple has a named-graph slot (g) and a unique, AllegroGraph-assigned ID (i). If you are not familiar with named-graphs or their usage, then the W3C documentation discussing them is a great place to start. You may also want to look at the paper "Named Graphs, Provenance and Trust" (PDF) by Carroll et al., where they were introduced. The graph is primarily used to store information about where the triple originated. The ID slot is used for internal administrative purposes, but triples can also refer to these IDs directly. As an example, suppose we have defined a namespace fr that resolves to http://www.franz.com/ont#. Example triples would look something like this:
(s)          (p)           (o)        (g)                        (i)
fr:person12  rdf:type      fr:Person  file://internal/rdf-file1  1200
fr:person12  fr:firstname  jans       file://internal/rdf-file1  1201
String Dictionary
In reality of course we do not store strings (representing resources and literals) directly in a triple's slots. Instead the slots contain hash-codes for s, p, o, and g that point to entries in dictionaries. We call these hash-codes Unique Part Identifiers (UPIs) and we'll discuss them later in this document.
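To make the dictionary idea concrete, here is a minimal Python sketch. It is a toy model and not AllegroGraph's implementation: the StringDictionary class, its intern and lookup methods, and the use of a truncated SHA-1 digest as a 12-byte identifier are all assumptions made for illustration.

```python
import hashlib


class StringDictionary:
    """Toy string table: maps fixed-size hash codes to the strings they stand for."""

    def __init__(self):
        self._strings = {}

    def intern(self, text: str) -> bytes:
        # Assumption for illustration: the identifier is the first 12 bytes
        # of a SHA-1 digest. The real hashing scheme is not described here.
        upi = hashlib.sha1(text.encode("utf-8")).digest()[:12]
        self._strings[upi] = text
        return upi

    def lookup(self, upi: bytes) -> str:
        return self._strings[upi]


dictionary = StringDictionary()

# The slots of the stored triple hold identifiers, not strings.
triple = tuple(
    dictionary.intern(part)
    for part in (
        "http://www.franz.com/ont#person12",
        "http://www.franz.com/ont#firstname",
        "jans",
        "file://internal/rdf-file1",
    )
)

# Looking the identifiers up again recovers the original strings.
decoded = [dictionary.lookup(upi) for upi in triple]
```

Because every slot has the same fixed size, triples can be compared and sorted without touching the strings at all; the dictionary is only consulted when a value has to be shown to the user.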
Creating a Triple-Store and connecting to it
You create a triple-store by calling the function create-triple-store with the directory name that AllegroGraph will use to store all of the triple-store's persistent data. Create-triple-store takes additional options to specify:
- how to behave if a triple-store already exists in the directory (if left unspecified, AllegroGraph will overwrite the existing store);
- which indices you want to use (if left unspecified, the triple-store will use all of the possible indices); and
- the likely number of unique resources and literals you expect to store. This value helps AllegroGraph optimize its data structures and can make bulk-loading more efficient.
An example of how to create a triple-store and connect to it can be found in the files AGEx1a.java and AGEx1b.java. These are located in the AllegroGraph examples/java-examples/ directory. An example of how to do this with the HTTP-interface can be found in the Sesame HTTP protocol reference.
Loading triples
In all modes you load data from RDF files. Currently we support the N-Triples and RDF/XML formats as input. If you have another format (e.g. Turtle), then we advise you to use the open source tool rapper (the Raptor RDF parser utility; see footnote 1) to convert it to N-Triples.
Programmatically adding and deleting triples
All modes support adding and deleting triples programmatically. When triples are added, a default named-graph will be provided if you don't specify one.
Cursors
A cursor in AllegroGraph is a data structure you can use to navigate through the results of a query. Cursors support only a very small public interface: next, row and is-next?. AllegroGraph also supports several common idioms like map-cursor, count-cursor and collect-cursor. You will see many examples of creating and using cursors below and in the rest of the AllegroGraph documentation.
Retrieving triples
In all modes we provide a low level function called 'get-triples' to retrieve triples from the triple-store. Our SPARQL, Prolog, and AllegroGraph RDFS++ reasoner all compile down to get-triples calls. The function takes a pattern as input. Say we have the triple 'jans isa man' from the file 'file1'. Then for example:
get-triples jans, nil, nil, nil
  -> returns a cursor that will generate all triples
     with subject 'jans'.
get-triples jans, nil, man, file1
  -> returns a cursor that will generate all triples with
     subject 'jans' and object 'man' from the file 'file1'.
Get-triples returns a cursor, which is memory and time efficient because it accesses the triple-store only as much as necessary. Sometimes, however, it is more convenient to get all of the triples you need at once. For this, you can use get-triples-list. Warning: retrieving every matching triple can be convenient, but it may also require more time, memory and bandwidth than you have available! Using get-triples is almost always safer!
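The difference between the lazy and the eager call can be sketched in a few lines of Python. This is a toy in-memory model, not the AllegroGraph API: the Triple type, the STORE list and the Python function names get_triples and get_triples_list are stand-ins chosen to mirror the text, with None playing the role of nil.

```python
from typing import Iterator, NamedTuple


class Triple(NamedTuple):
    s: str
    p: str
    o: str
    g: str


# Hypothetical sample data; only the first triple comes from the text above.
STORE = [
    Triple("jans", "isa", "man", "file1"),
    Triple("jans", "age", "38", "file1"),
    Triple("gary", "isa", "man", "file2"),
]


def get_triples(s=None, p=None, o=None, g=None) -> Iterator[Triple]:
    """Lazily yield triples matching the pattern; None is a wildcard."""
    pattern = (s, p, o, g)
    for triple in STORE:
        if all(want is None or want == have for want, have in zip(pattern, triple)):
            yield triple


def get_triples_list(s=None, p=None, o=None, g=None) -> list:
    """Eagerly collect every match; convenient, but holds all results in memory."""
    return list(get_triples(s, p, o, g))


cursor = get_triples(s="jans")   # nothing has been scanned yet
first = next(cursor)             # the scan runs only as far as the first match
everything = get_triples_list(s="jans", o="man", g="file1")
```

The generator does work only when the caller asks for the next row, which is why the cursor form stays cheap even when a pattern matches millions of triples.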
Indices
AllegroGraph builds indices so that any query can find its first match in a single I/O operation. We can abbreviate the index flavors using s for subject, p for predicate and so on. What matters with an index is the sort order of the triples. For example, the spogi index first sorts on subject, then predicate, object, graph, and finally, id.
Suppose we have the triple 'jans isa man' from the file 'file1'. On disk, this triple is represented as
s := jans
p := isa
o := man
g := file1
i := 213863271
There are many queries that will return this triple. Here is a table that lists each possible query and the index flavor that AllegroGraph will use to optimize it.
spogi   get-triples jans, ---, ---, ---
        get-triples jans, isa, ---, ---
        get-triples jans, isa, man, ---
posgi   get-triples ---, isa, ---, ---
        get-triples ---, isa, man, ---
ospgi   get-triples ---, ---, man, ---
        get-triples jans, ---, man, ---
gspoi   get-triples jans, ---, ---, file1
        get-triples jans, isa, ---, file1
        get-triples jans, isa, man, file1
gposi   get-triples ---, isa, ---, file1
        get-triples ---, isa, man, file1
gospi   get-triples ---, ---, man, file1
        get-triples jans, ---, man, file1
Out of the box, AllegroGraph builds all six indices in the background as triples are added. Of course, you can customize which indices are built, when they are built and how they are updated. For example, if you never use named-graphs then the three indices that start with g will never be used, and you could remove them to save both disk space and processing time. It's also relatively rare to need the ospgi index (because there are other ways to find all predicates or all predicates for a particular subject).
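The table above follows a simple pattern: the chosen flavor is the one whose sort order begins with the slots that the query binds. The Python sketch below reproduces the table under that reading. It is an illustration only; the function names and the tie-breaking rule (prefer a g-first flavor when the graph is bound) are assumptions inferred from the table, not a description of AllegroGraph's actual query planner.

```python
# g-first flavors are listed first so that they win ties when the graph is bound.
FLAVORS = ["gspoi", "gposi", "gospi", "spogi", "posgi", "ospgi"]


def bound_prefix_length(flavor: str, bound: set) -> int:
    """Count how many leading slots of the flavor's sort order the query binds."""
    count = 0
    for slot in flavor:
        if slot not in bound:
            break
        count += 1
    return count


def choose_index(s=None, p=None, o=None, g=None) -> str:
    """Pick the flavor whose sort order starts with the most bound slots."""
    bound = {name for name, value in zip("spog", (s, p, o, g)) if value is not None}
    # max() returns the first flavor among equals, so list order breaks ties.
    return max(FLAVORS, key=lambda flavor: bound_prefix_length(flavor, bound))


def sort_key(triple: dict, flavor: str) -> tuple:
    """Sort key of a triple under a flavor, e.g. spogi sorts on s, then p, o, g, i."""
    return tuple(triple[slot] for slot in flavor)


example = {"s": "jans", "p": "isa", "o": "man", "g": "file1", "i": 213863271}
posgi_key = sort_key(example, "posgi")
chosen = choose_index(s="jans", o="man")
```

Because all matching triples sit next to each other in the chosen sort order, the first match can be located with a single seek and the rest read sequentially.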
Indexing strategies
If your triple-store is not too large and the insertion rate is moderate, then you can rely on AllegroGraph's standard indexing strategy and your triples will be automatically indexed shortly after they have been added. On the other hand, if your triple-store is large or the insertion rate is high, then you will want to think about your indexing strategy.
The standard strategy
When you add a triple to AllegroGraph it goes into an unindexed log. The standard indexing strategy is to build indices whenever the number of unindexed triples exceeds a threshold (this is tunable). Each batch of triples indexed in this way forms a single index chunk (per index flavor). Each chunk contributes to the total cost of querying a triple-store, so AllegroGraph also merges chunks automatically when there are too many (this is another tunable parameter). AllegroGraph provides many functions to inspect the current index state and to update that state to achieve better performance. For example:
- indexing-needed-p: tells you if there are unindexed triples.
- index-new-triples: indexes all the currently unindexed triples but does not merge the new indices with the currently existing indices.
- index-all-triples: first calls index-new-triples and then merges all indices together.
- average-index-fragment-count: returns the average number of index chunks (smaller is better).
- index-coverage-percent: returns the average proportion of triples that are indexed for each flavor (closer to 100 is better).
- merge-new-triples: merges all the indices that were built since the last full index-all-triples. Usually this will happen automatically for you, but you can speed up your application by calling this in your application code.
(Look in the indexing section of the reference guide for details.)
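The interplay of the unindexed log, the indexing threshold and the merge threshold can be simulated with a short Python toy model for a single index flavor. Everything here is hypothetical: the ToyIndexedStore class, its method names (loosely echoing the functions listed above) and the threshold values are illustrative assumptions and do not reflect AllegroGraph's internals.

```python
import heapq


class ToyIndexedStore:
    """Toy model of one index flavor: an unindexed log plus sorted chunks."""

    def __init__(self, index_threshold=2, merge_threshold=2):
        self.index_threshold = index_threshold  # log size that triggers indexing
        self.merge_threshold = merge_threshold  # chunk count that triggers a merge
        self.log = []     # triples added but not yet indexed
        self.chunks = []  # each chunk is a sorted list of triples

    def add_triple(self, triple):
        self.log.append(triple)
        if len(self.log) > self.index_threshold:
            self.index_new_triples()
            if len(self.chunks) > self.merge_threshold:
                self.merge_chunks()

    def indexing_needed(self):
        return bool(self.log)

    def index_new_triples(self):
        """Turn the log into one new sorted chunk; older chunks are untouched."""
        if self.log:
            self.chunks.append(sorted(self.log))
            self.log = []

    def merge_chunks(self):
        """Merge all sorted chunks into a single sorted chunk."""
        if self.chunks:
            self.chunks = [list(heapq.merge(*self.chunks))]

    def index_all_triples(self):
        self.index_new_triples()
        self.merge_chunks()


store = ToyIndexedStore()
for n in range(7):
    store.add_triple((f"s{n}", "p", "o"))

chunks_before = len(store.chunks)         # chunks built by automatic indexing
pending_before = store.indexing_needed()  # one triple is still in the log
store.index_all_triples()                 # index the rest, then merge everything
```

Every query has to consult each chunk, so fewer chunks mean cheaper lookups; the price is the extra work of merging, which is why the thresholds are worth tuning.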
A Common Scenario
In our experience, the most common usage pattern for AllegroGraph is to start with a large initial load of triples from a set of RDF files, CSV files or a relational database. Once the data has been loaded (and indexed), the triple-store goes into interactive mode with default indexing behavior. Here is an example in pseudocode:
Initial load
- create a new database
- for every file in file list:
    load-ntriple-file file
- index-all-triples
Daily use
- load many triples
- index-new-triples
- repeat
So imagine this use case: you are working with a really big triple-store in a day-to-day production environment. You loaded the first several hundred million triples in bulk mode (first loading, then indexing). Then you switch over to interactive mode, where multiple clients might be adding triples. If query performance deteriorates too much, you can lower the indexing and merging thresholds or call index-new-triples yourself. Don't do this too often, because otherwise the indices get too fragmented. A quick optimization is to call merge-new-triples so that the newest indices get merged. Then, at night or maybe once a week, you will want to do a full index-all-triples.
The RDFS++ reasoner
Description logic or OWL reasoners are good at handling (complex) ontologies. They are usually complete (they give all the possible answers to a query) but totally unpredictable with respect to execution time once the number of individuals grows beyond millions.
AllegroGraph's RDFS++ reasoning supports all the RDFS predicates and some of OWL's. It is not complete but it has predictable and fast performance. Here are the supported predicates:
- !rdf:type and !rdfs:subClassOf
- !rdfs:range and !rdfs:domain
- !rdfs:subPropertyOf
- !owl:sameAs
- !owl:inverseOf
- !owl:TransitiveProperty
The reasoner tutorial provides a quick introduction of how each predicate behaves and describes the reasoner in more detail. We suggest reviewing the tutorial before delving into the W3C documentation.
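As a small illustration of the kind of inference involved, the Python sketch below applies the standard RDFS rule for the first item in the list: if x has rdf:type C and C is an rdfs:subClassOf D, then x also has rdf:type D, with subClassOf treated as transitive. The data, class names and function names are hypothetical, and the code says nothing about how AllegroGraph's reasoner is implemented.

```python
def superclasses(cls, subclass_of):
    """All classes reachable from cls by following rdfs:subClassOf transitively."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(individual, asserted_types, subclass_of):
    """Asserted rdf:type values plus every superclass of each of them."""
    types = set(asserted_types.get(individual, ()))
    for cls in list(types):
        types |= superclasses(cls, subclass_of)
    return types


# Hypothetical ontology: Man is a subclass of Person, Person of Mammal.
subclass_of = {"ex:Man": {"ex:Person"}, "ex:Person": {"ex:Mammal"}}
asserted_types = {"ex:jans": {"ex:Man"}}

types = inferred_types("ex:jans", asserted_types, subclass_of)
```

Only the single asserted triple is stored; the two additional types are derived at query time.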
SPARQL and twinql
SPARQL is the query language of choice for modern triple-stores. AllegroGraph's SPARQL sub-system is twinql: it adheres to the proposed W3C standard, includes a query optimizer, and has full support for named-graphs. For more information on using SPARQL with AllegroGraph, see the tutorial and twinql reference guide.
You can use SPARQL in all modes. In Java and HTTP mode you send SPARQL queries to the AllegroGraph server; in stand-alone Lisp mode you can use the SPARQL query string syntax or use the more Lispy s-expression version.
Prolog
Prolog is an alternative query mechanism for AllegroGraph. With Prolog, you can specify queries declaratively. In the Prolog tutorial, we provide an introduction to using Prolog and AllegroGraph together. Prolog is an integral part of Lisp, so for Lispers the combination of Lisp, Prolog, and AllegroGraph is a natural triad. In Java mode and with the HTTP-interface you can send Prolog select queries to the server. You send them as a string and you get the bindings back as a list of values. See the HTTP protocol and Javadocs for the specifics. You can run through the Prolog tutorial in Java using the file AGPrologTutorial.java. This file is included along with the rest of the example code.
AllegroGraph: Advanced topics
Here, we cover advanced topics that are mostly relevant for programmers who want to work with the triple-store at a lower level.
The ! reader macro
When programming directly in Lisp or when sending Prolog selects from Java or an HTTP client you will see the following notation.
(select (?x) (q ?x !ex:has !ex:fun))
Suppose you used register-namespace to map ex to "http://www.yourcompany.com/ont#". Then AllegroGraph will expand !ex:has to the URIRef <http://www.yourcompany.com/ont#has> at the last possible moment. It will also cache the computations involved in the expansion and part lookup so that future uses of !ex:has will be faster and use less memory.
For Java and HTTP-client use: please remember that if you do not register a namespace before using it an error is returned.
For Lisp users: please look in the reference guide and tutorial to learn how the reader macro actually returns a structure that we call a future-part. Future-parts allow you to write Lisp and Prolog rules that can refer to namespaces that will be resolved sometime in the future.
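The expansion and caching behavior described above can be modeled in a few lines of Python. This is only a sketch: the NAMESPACES table and the Python functions register_namespace and expand are hypothetical stand-ins, and raising KeyError for an unregistered prefix merely mimics the error mentioned for Java and HTTP clients.

```python
from functools import lru_cache

NAMESPACES = {}  # prefix -> namespace URI


def register_namespace(prefix: str, uri: str) -> None:
    NAMESPACES[prefix] = uri
    # Cached expansions may be stale once a prefix is (re)registered.
    expand.cache_clear()


@lru_cache(maxsize=None)
def expand(qname: str) -> str:
    """Expand 'ex:has' to '<namespace-uri + has>'; unknown prefixes raise KeyError."""
    prefix, local = qname.split(":", 1)
    return f"<{NAMESPACES[prefix]}{local}>"


register_namespace("ex", "http://www.yourcompany.com/ont#")
uri = expand("ex:has")  # computed once; later calls are served from the cache
```

Because the lookup happens when expand is called rather than when the text is read, a rule can mention a prefix before it has been registered, which is the idea behind future-parts.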
UPIs
This section is highly relevant to Lisp programmers and Java programmers who want to work with low-level triple data in order to avoid the extra bandwidth and memory consumption of strings. These details matter a good deal less, of course, for HTTP-client users.
When we discussed triples above, we explained that a triple has five slots, where the s, p, o, and g slots are UPIs (Unique Part Identifiers) and the last slot is an ID. Internally, each UPI is a 12-octet array and the ID is a 64-bit integer, so a triple is really a 56-octet array (4 × 12 + 8), more like a C data type than a Lisp structure.
When working with string data (resources or literals), the contents of the UPI are just hash codes. The mapping from a UPI to the string for which it stands is stored on disk in a string-table. For some other data types the UPIs contain immediate values that retain the data type's sort order. We support a whole list of data types. More information on many of them can be found at the W3C's XML Datatypes web page. As of this writing, the supported data types are:
:byte - signed byte number
:date - UTC date
:date-time - UTC date and time
:double-float - IEEE double-float
:gyear - general year
:int - signed integer (32-bit)
:latitude - double-float representing a latitude
:literal-short - short (10-bytes or less in length) string
:long - signed long (64-bit)
:longitude - double-float representing a longitude
:short - signed short (16-bit)
:single-float - IEEE single-float
:telephone-number - telephone number
:time - UTC time
:triple-id - reference other triples (for non-RDF reification)
:unsigned-byte - unsigned-byte number
:unsigned-int - unsigned-int (32-bit)
:unsigned-long - unsigned-long (64-bit)
:unsigned-short - unsigned-short (16-bit)
This list will grow in the future. (You can use the supported-types function to see exactly which types your version of AllegroGraph supports.)
We have accessors to get the parts out of triples, so let us look at them in more detail. The following explanation uses Lisp:
triple-store-user(35): (create-triple-store
                        "/tmp/example.db"
                        :if-exists :supersede)
#<db.agraph::triple-db /tmp/example.db, open @ #x10024f1fd2>
triple-store-user(36): (register-namespace "ex" "http://short#"
                        :errorp nil)
"http://short#"
Note how we tell AllegroGraph to encode the value 38 as an unsigned long. AllegroGraph returns 1 because this is the first triple that we have added.
triple-store-user(37): (add-triple !ex:jans !ex:age
                        (value->upi 38 :unsigned-long))
1
triple-store-user(38): (get-triples-list :s !ex:jans)
(#(70 10 50 194 4 135 131 89 83 11 ...))
nil
triple-store-user(39): (setf triple (first *))
#(70 10 50 194 4 135 131 89 83 11 ...)
triple-store-user(40): (pprint triple)
#(70 10 50 194 4 135 131 89 83 11 217 0 6 0 66 39 72
227 180 163 243 20 30 0 38 0 0 0 0 0 0 0 0 0 0 17 0
0 0 0 0 0 0 0 0 0 0 31 1 0 0 0 0 0 0 0)
; yes, 56 bytes
Let us look at the subject:
triple-store-user(41): (setf s (subject triple))
#(70 10 50 194 4 135 131 89 83 11 ...)
You can compare UPIs:
triple-store-user(43): (upi= (subject triple) (object triple))
nil
You can get the original values back:
triple-store-user(46): (upi->value (subject triple))
"http://short#jans"
0
nil
triple-store-user(47): (upi->value (object triple))
38
17 ;this is the type code
nil
The second return value, 17, is the type-code and corresponds to :unsigned-long. The functions type-code->type-name and type-name->type-code let you work with type-codes in your programs:
triple-store-user(47): (type-code->type-name 17)
:unsigned-long
triple-store-user(48): (type-name->type-code :unsigned-long)
17
Now let us add two more triples...
triple-store-user(48): (add-triple !ex:gary !ex:age
                        (value->upi 32 :unsigned-long))
2
triple-store-user(60): (add-triple !ex:kevin !ex:age
                        (value->upi 28 :unsigned-long))
3
and then do a range query: we will find all triples where the age is between 30 and 40, using print-triples to display the information more nicely.
triple-store-user(63): (print-triples
                        (get-triples-list
                         :p !ex:age
                         :o (value->upi 30 :unsigned-long)
                         :o-end (value->upi 40 :unsigned-long)))
<http://short#jans> <http://short#age> 38 .
<http://short#gary> <http://short#age> 32 .
How does AllegroGraph support range queries?
Range encoding and Range queries
Most, if not all, triple-stores (including the original AllegroGraph 1.2) store every subject, predicate, object and graph as pointers to strings in a string-dictionary. The only way to do a range query in these triple-stores is to go through all the values for a particular predicate. This is fine if everything fits in memory, but if your predicate has millions of triples that won't work! In AllegroGraph 2.2.5, Unique Part Identifiers (UPIs) can contain immediate values that are sortable.
To be clear, you can stick with strings and be in RDF-land:
> (add-triple !ex:kevin !ex:age !"28")
or you can use AllegroGraph's UPI encodings:
> (add-triple !ex:kevin !ex:age (value->upi 28 :unsigned-long))
Range queries are only possible in the latter case.
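Why sortable immediate values matter can be shown with a small Python sketch. It is a toy model under stated assumptions: the 8-byte big-endian encoding, the in-memory sorted list standing in for an index, and the extra "baby" entry are all made up for illustration and are not AllegroGraph's actual UPI layout.

```python
import bisect
import struct


def encode_unsigned_long(value: int) -> bytes:
    """Big-endian 8-byte encoding: comparing the bytes compares the numbers."""
    return struct.pack(">Q", value)


# Ages from the transcript above, plus one hypothetical single-digit value.
ages = {"jans": 38, "gary": 32, "kevin": 28, "baby": 4}

# A sorted list of (encoded value, subject) stands in for an index ordered by object.
index = sorted((encode_unsigned_long(age), name) for name, age in ages.items())


def range_query(low: int, high: int) -> list:
    """Subjects whose value lies in [low, high], found with two binary searches."""
    start = bisect.bisect_left(index, (encode_unsigned_long(low), ""))
    stop = bisect.bisect_right(index, (encode_unsigned_long(high), "\uffff"))
    return [name for _, name in index[start:stop]]


matches = range_query(30, 40)

# The same range over plain strings compares character by character,
# so "4" wrongly falls between "30" and "40".
string_matches = sorted(s for s in map(str, ages.values()) if "30" <= s <= "40")
```

With the encoded values the matching triples form one contiguous run in the index; with strings the only reliable approach is to decode and test every value for the predicate.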
Type and property mapping
In the above code we told AllegroGraph programmatically how to encode certain values. You can also do this while loading an RDF file, but you have to specify the mapping for a predicate or data type. Each triple-store maintains a set of mappings between predicate URIrefs or data type URIrefs and a supported AllegroGraph type. For example, here is how to specify that the data type xsd:double maps to an AllegroGraph :double-float and that the predicate http://www.example.com/predicate#age maps to an :unsigned-byte:
> (setf (datatype-mapping "<http://www.w3.org/2001/XMLSchema#double>")
        :double-float)
:double-float
> (setf (predicate-mapping "<http://www.example.com/predicate#age>")
        :unsigned-byte)
:unsigned-byte
Now when you load an RDF file, AllegroGraph will examine each triple to see if it satisfies the mappings (see footnote 2). When it does, an encoded triple will be added to the triple-store. Depending on your needs, you can even tell AllegroGraph to load only encoded triples and not worry about strings at all. This can provide tremendous space savings and also gives you the benefit of range queries (see above).
Clustering & Indexing in the background
On a single-processor system, if you load a large set of data, roughly 60% of the time is spent loading triples and 40% is spent indexing them. If you run this on a multiple-processor system or a cluster of independent machines, you can do nearly all indexing in parallel with the loading process. And while running interactively, the indexing of newly added triples can be done in the background too.
AllegroGraph uses Franz's clustering technology which is itself built on a powerful RPC mechanism. There are two possible opportunities for using clustering: one is running on a single computer with either multiple CPUs or a single multiple-core CPU; the second is running on a more traditional cluster of independent systems. From AllegroGraph's point of view, the setting makes very little difference. All you need to do is make sure that Allegro Common Lisp is installed on each machine you want to use and that you can access the machines over the network (details of the network setup are beyond the scope of this guide). Then, you tell AllegroGraph to add the machines:
> (add-indexing-host "localhost" :max-tasks 1)
> (add-indexing-host "www.other-machine.com" :max-tasks 8)
Once you've done this, AllegroGraph will automatically make use of the extra processing power to speed indexing and merging operations. See the reference-guide for additional details.
RDF input and output & Using rapper
In the introduction we mentioned that we support two input formats: N-Triples and RDF/XML. For any other input format, please use 'rapper' to transform it into N-Triples. As far as we know, it runs on every platform that AllegroGraph runs on. See http://librdf.org/raptor/rapper.html for information on downloading and using it.
Other uses for the Named-Graph slot
The W3C proposal is to use the g or 'named-graph' slot for clustering triples. For example, you load a file with triples and use the filename as the named-graph. This way, if there are changes to the triple file, you can just delete every triple that came from the original file and then load the new file.
However, you can put anything you want into the named-graph slot, including numeric values. You can use it for weights, trust factors, time, provenance information and so on. For example, if you want to store a distance between objects or a geographic position you can do:
(add-triple !ex:SanFrancisco !ex:distance
            !ex:NewYork (value->upi 3000 :unsigned-long))
(add-triple !ex:Berkeley !ex:position
            (value->upi 37.871666 :latitude)
            (value->upi -122.27167 :longitude))
The advantage of this approach is that you can reduce the total number of triples in the store and, even more importantly, dramatically reduce query time because a single query can retrieve more data.
Pointing to other triples
Every triple has a unique ID. This allows triples to point to other triples. This makes reification (making statements about a triple) very efficient, i.e. less space and time is consumed than with the original RDF model of reification (see the RDF Semantics document for all the details of this model).
Warning! RDF does not support this. So if you start doing this you live in (RDF) sin. However, in 1.2.6 many customers wanted this because it is far, far more memory efficient than classical RDF reification.
(let ((id (add-triple !ex:Jans !ex:is !"28")))
  (add-triple !ex:Steve !ex:believes (value->upi id :triple-id)))
Text Indexing in AllegroGraph
AllegroGraph 2.2 and beyond support freetext indexing. Here is a summary of the current features. See the reference guide for more details and the Java tutorial for Java specific information on freetext indexing.
AllegroGraph provides freetext indexing for predicates that are registered to be 'freetext-indexed'. A fairly natural choice is to index comments and labels, but the AllegroGraph user can 'free-text-index' any predicate he or she wants. Note: you should register your predicates for indexing before you load a triple file or add triples programmatically. Registering will not re-index triples that are already in the triple-store. We will relax this constraint in a future version.
AllegroGraph does freetext indexing on literals. A future version of AllegroGraph might support free-text-indexing on resources too, but we are still thinking about how we should tokenize resources and whether or not to include common words like http, www, etc.
- We support boolean expressions in search queries.
- We support Unix-style wild cards (*) and blanks (?) in search expressions. Wild cards and blanks can appear anywhere, but if you put one at the beginning of a word the searches will be slower.
- We support phrase searches (but you cannot put wild cards and blanks in phrases).
- There are four types of information you can get back from the freetext indexer (see the reference guides or tutorial):
  - the triple-ids for the triples that match,
  - a cursor that will return all triples that match,
  - a list of all the triples that match, and
  - all the unique triple-subjects that match a particular query.
- Case information: we down-case all strings before they are put in the text indexer, so use lower case when doing queries. However, when you do phrase searches you have to use the exact case.
- Performance warning: expect a slowdown in loading triples of anywhere from 5 to 25%, depending on how many predicates you register for freetext indexing.
- The current tokenizer is not user-settable. Internally the indexer takes a tokenizer function, but we don't provide a user API for that yet. Currently we break on white-space characters but not on '-', '_' and '.' if these are in words. Contact us if you have special tokenizing needs.
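The tokenizing and wild-card rules above can be approximated with Python's standard library. This is a rough sketch rather than the indexer's real behavior: the function names are hypothetical, '?' is assumed to match exactly one character, and fnmatch additionally understands bracket expressions such as [abc], which the text does not mention.

```python
import fnmatch


def tokenize(text: str) -> list:
    """Down-case, then split on white space only, so '-', '_' and '.' stay in words."""
    return text.lower().split()


def matching_tokens(pattern: str, text: str) -> list:
    """Tokens matching a pattern where * is any run of characters and ? is one character."""
    return [token for token in tokenize(text) if fnmatch.fnmatchcase(token, pattern)]


# Hypothetical literal value used only for this illustration.
comment = "The Semantic-Web uses RDF_Schema and www.w3.org"

tokens = tokenize(comment)
prefix_hits = matching_tokens("sem*", comment)
single_char_hits = matching_tokens("r?f_schema", comment)
```

A pattern with a fixed beginning such as sem* can be answered from a sorted token list by prefix, whereas a leading wild card forces a scan of every token, which is consistent with the slower searches noted above.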
Summary
AllegroGraph is a modern graph database with support for high-performance RDF triple-stores and much, much, more!
Footnotes
1. The Raptor RDF tools are part of Redland, a set of free software libraries that provide support for the Resource Description Framework (RDF). See the project website for more details and additional tools.
2. Currently, only AllegroGraph's N-Triples parser handles data-type and predicate mappings.