Overview
AllegroGraph is a database and application framework for building Semantic Web applications. It can store data and meta-data as triples; query these triples through various query APIs like SPARQL (the proposed standard W3C query language) and Prolog; and apply RDFS++ reasoning with its built-in reasoner. If more powerful reasoning is required, AllegroGraph can integrate with Racer and its full description logic. Another alternative is to connect AllegroGraph to the ontology builder TopBraid Composer and do reasoning with this GUI-based tool. AllegroGraph 3 provides a number of exciting new features: Federation, Social Network Analysis, Geospatial capabilities and Temporal reasoning. All of these are described in more detail further on in this document.
The Semantic Web
Since AllegroGraph is a database, we'll start by looking at the kind of data it is designed to store. The figure below represents some information about Jans and his pets. Like much of the data on the web, there are explicit relations like "Robbie is the petOf Jans" and implicit or common-sense relations such as "petOf is an inverse relation to hasPet" and "Dog is a subClassOf Mammal".
Though there are many ways to store this information, the W3C has standardized on the Resource Description Framework (RDF). RDF breaks knowledge into assertions of subject predicate object (like the three sentences above). For obvious reasons, these assertions are called triples. If we have many triples from different contexts, we can append an additional slot to each assertion; we call this slot a named graph. Even though these assertions are now quads, we'll still call them triples. Here are the assertions from above rewritten slightly to fit them into the triple-framework:
subject predicate object graph
jans Type Human jans's home page
robbie petOf jans jans's home page
petOf inverseOf hasPet english grammar
Dog subClassOf Mammal science
The Semantic Web vision is one where web pages contain enough self-describing data that machines will be able to navigate them as easily as humans do now. This will let computers better assist us in answering questions and managing our ever more complicated world. AllegroGraph is a high-performance database built to hold this information, query it, and reason with it. For more information on the Semantic Web, RDF and all the rest, see the following resources:
- RDF: http://www.w3.org/TR/rdf-primer/,
- RDFS: http://www.w3.org/TR/rdf-schema/, and
- OWL: http://www.w3.org/TR/owl-guide/.
For more information on the above topics, see the Suggested Reading for recommended introductory texts.
One thing to note is that AllegroGraph doesn't restrict the contents of its triples to pure RDF. In fact, we can represent any graph data-structure by treating its nodes as subjects and objects, its edges as predicates and creating a triple for every edge. The named-graph slot can be used to hold additional, application-specific, information. Used this way, AllegroGraph becomes a powerful graph-oriented database.
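As a minimal sketch of this idea, the following Python snippet turns a small labelled graph into quads, one per edge. The names Quad and edges_to_quads are illustrative only and are not part of any AllegroGraph API.

```python
from typing import NamedTuple, Optional


class Quad(NamedTuple):
    """One assertion: subject, predicate, object, plus an optional graph slot."""
    subject: str
    predicate: str
    object: str
    graph: Optional[str] = None


def edges_to_quads(edges, graph=None):
    """Create one quad per (source, label, target) edge."""
    return [Quad(src, label, dst, graph) for src, label, dst in edges]


# Hypothetical edge list based on the pets example above.
edges = [
    ("robbie", "petOf", "jans"),
    ("petOf", "inverseOf", "hasPet"),
    ("Dog", "subClassOf", "Mammal"),
]
quads = edges_to_quads(edges, graph="example")
for quad in quads:
    print(quad.subject, quad.predicate, quad.object, quad.graph)
```

The graph slot is free-form here, which mirrors the point that it can carry application-specific information.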
A Block diagram of an AllegroGraph database
Now that we've seen what kind of data AllegroGraph can manage, we can take a look at how it keeps track of it in a bit more detail:
- In RDF-land, an assertion is a statement that subject predicate object (in the context of graph).
- The bulk of an AllegroGraph triple-store is composed of assertions. Though called triples for historical reasons, each assertion has five fields:
- subject (s)
- predicate (p)
- object (o)
- graph (g)
- triple-id (i)
All of s, p, o, and g are strings of arbitrary size. Of course, it would be very inefficient to store all of the duplicated strings directly so we associate a special number (called a Unique Part Identifier or UPI) with each unique string. The string dictionary manages these strings and UPIs and prevents duplication.
To speed queries, AllegroGraph creates indices which contain the assertions plus additional information.
AllegroGraph can also perform freetext searching in the assertions using its freetext indices.
Finally, AllegroGraph keeps track of deleted triples.
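The role of the string dictionary can be pictured as an interning table. The sketch below is an assumption-laden toy: it uses small sequential integers where AllegroGraph uses UPIs, and the class and method names are hypothetical.

```python
class StringDictionary:
    """Toy interning table: each unique string gets exactly one integer id."""

    def __init__(self):
        self._id_of = {}      # string -> id
        self._string_of = []  # id -> string

    def intern(self, text):
        """Return the id for text, allocating a new one on first sight."""
        if text not in self._id_of:
            self._id_of[text] = len(self._string_of)
            self._string_of.append(text)
        return self._id_of[text]

    def lookup(self, identifier):
        """Return the string that an id stands for."""
        return self._string_of[identifier]


dictionary = StringDictionary()
# Two assertions share the strings "robbie" and "jans", so they share ids too.
triple = tuple(dictionary.intern(part) for part in ("robbie", "petOf", "jans"))
other = tuple(dictionary.intern(part) for part in ("jans", "hasPet", "robbie"))
print(triple, other)
```

Because each assertion then holds only fixed-size ids, duplicated strings cost nothing extra to store.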
Triple-data generally comes into AllegroGraph as strings, either in RDF/XML (see example) or in the more verbose but simpler N-Triples format (see example). The programmer API also makes it easy to import data from RDBMSs, CSV or any other custom data format.
The N-Triples data format
<http://www.franz.com/simple#Animal> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://www.franz.com/simple#Mammal> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://www.franz.com/simple#Animal> .
<http://www.franz.com/simple#Mammal> <http://www.franz.com/simple#eyes> "two" .
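To show how simple the format is, here is a rough Python sketch that splits one N-Triples line into its parts. It is deliberately incomplete: it assumes only IRIs and plain literals, and ignores blank nodes, typed or language-tagged literals, escapes and comments. The function name is hypothetical.

```python
import re

# One simple N-Triples line: <subject> <predicate>, then <IRI> or "literal", then "."
LINE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.\s*$')


def parse_line(line):
    """Return (subject, predicate, object, is_literal) or raise ValueError."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unsupported N-Triples line: {line!r}")
    subject, predicate, iri, literal = match.groups()
    if iri is not None:
        return subject, predicate, iri, False
    return subject, predicate, literal, True


sample = '<http://www.franz.com/simple#Mammal> <http://www.franz.com/simple#eyes> "two" .'
print(parse_line(sample))
```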
Sample data in RDF/XML format
<?xml version="1.0"?>
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xml:base="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:ns1="http://www.franz.com/simple#">
<Description rdf:about="http://www.franz.com/simple#Animal">
<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
</Description>
<Description rdf:about="http://www.franz.com/simple#Mammal">
<rdfs:subClassOf rdf:resource="http://www.franz.com/simple#Animal"/>
<ns1:eyes>two</ns1:eyes>
</Description>
</RDF>
You should notice that although the above diagram shows a triple-store through an RDF-lens, there is nothing that constrains AllegroGraph to the RDF world. In fact, AllegroGraph contains many features that are outside of pure RDF and make it a true graph database.
Graph databases
We've already seen how AllegroGraph can help manage Semantic Web data and noticed how any graph can be viewed as a collection of triples representing its edges. Because AllegroGraph can store anything in the subject, predicate, object and graph fields of its triples, it can be used to efficiently model things that would crush a pure RDF database. The main point to note is that we can add one more quality to the list above:
- In graph-database land, an assertion says that node s is connected to node o via edge p with additional data g.
For many applications, graph databases can be both more flexible and faster than RDBMSs because:
- You add new predicates without changing any schema.
- One-to-many relations are directly encoded without the indirection of tables.
- You never think about what to index because everything is indexed.
AllegroGraph has the ability to encode values directly into its triples (thus bypassing the string dictionary completely). This allows for both more efficient data retrieval and extremely efficient range queries. We take advantage of this data representation in the add-on libraries for geospatial reasoning, temporal reasoning and social network analysis.
Triple-store operations
You can manipulate data in triple-stores via many different interfaces and languages including Java, HTTP and Lisp. Each language provides mechanisms to create and open triple-stores; load them with data in bulk-mode or programmatically; manage indices; enable RDFS++ reasoning; query for triples that match simple or complex constraints; serialize triples in many formats; and understand and manage server performance.
Adding triples
AllegroGraph has facilities to bulk load from N-Triples, N-Quads, TriX and RDF/XML files [1]. You can create freetext indices while loading triples by specifying which predicates should be indexed. Additionally, AllegroGraph supports a wide array of encoded data-types such as numbers, dates, and geospatial coordinates. Using these data-types not only shrinks the size of your triple-store (because the string data need not be saved) but also provides for both super-fast range queries and geospatial queries.
Of course, you can also load triples into AllegroGraph programmatically. This can be used to import custom data formats, or to build a triple-store incrementally. Triples can be added using RDF syntax or AllegroGraph's special encoded data-types. Programmatically added triples can also make use of AllegroGraph's triple-id to perform super-efficient reification.
Getting Triples back
Query Patterns
AllegroGraph provides numerous methods for getting triples back out of a triple-store. The simplest is to ask for triples matching a pattern of subject, predicate, object and graph. Each part of the pattern can be an exact match, a range specifier, or a wild card (don't care). For example, the pattern
subject : <http://www.example.com/people/gwking>
predicate: wild
object : wild
graph : <http://www.example.com/context/initial>
would retrieve all triples about the person gwking from the graph named initial. We could learn about all of someone's phone numbers using:
subject : <http://www.example.com/people/gwking>
predicate: <http://www.example.com/telephone/>
object : wild
graph : wild
and learn about everyone born in the first half of 1964 with:
subject : wild
predicate : <http://www.example.com/birthm/>
object-start : "1964-01-01"^^<http://www.w3.org/2001/XMLSchema#date>
object-end : "1964-06-30"^^<http://www.w3.org/2001/XMLSchema#date>
graph : wild
You can use these pattern-based queries in your own programs to query triple-stores at the bare-metal level. In fact, AllegroGraph's other query interfaces such as SPARQL, Prolog and the RDFS++ reasoner all ground out in patterns exactly like these.
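The semantics of such patterns (exact match, wild card, range) can be sketched in a few lines of Python. This is only a model of the matching rules: it scans a list linearly, whereas AllegroGraph answers patterns from its indices, and the ex: names and function names are made up for illustration.

```python
WILD = None  # a wild card slot matches anything


def matches(quad, subject=WILD, predicate=WILD, obj=WILD, graph=WILD, obj_range=None):
    """True if the (s, p, o, g) tuple satisfies every non-wild slot of the pattern."""
    s, p, o, g = quad
    if subject is not WILD and s != subject:
        return False
    if predicate is not WILD and p != predicate:
        return False
    if obj is not WILD and o != obj:
        return False
    if graph is not WILD and g != graph:
        return False
    if obj_range is not None:
        low, high = obj_range  # inclusive range on the object slot
        if not (low <= o <= high):
            return False
    return True


def get_triples(store, **pattern):
    """Return every quad in the store that matches the pattern."""
    return [quad for quad in store if matches(quad, **pattern)]


# Hypothetical data; ISO dates are kept as strings, which sort chronologically.
store = [
    ("ex:gwking", "ex:telephone", "555-0100", "ex:initial"),
    ("ex:gwking", "ex:birth", "1964-03-15", "ex:initial"),
    ("ex:other", "ex:birth", "1964-09-02", "ex:later"),
]
about_gwking = get_triples(store, subject="ex:gwking", graph="ex:initial")
born_early_1964 = get_triples(store, predicate="ex:birth",
                              obj_range=("1964-01-01", "1964-06-30"))
print(len(about_gwking), born_early_1964)
```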
Range Queries
In addition to strings, AllegroGraph can store many data-types directly in its triples. This lets it perform range queries in a single operation. Suppose, for example, that weights were stored as strings like
"158"^^<http://www.w3.org/2001/XMLSchema#long>
If you wanted to find all people whose weight was greater than 200, then you would need to scan every triple in the store, look up the string, parse it and then do your comparison. Ouch! With AllegroGraph, the value is encoded directly in the raw triple data (using a special kind of UPI). A range query involves immediate data lookup and comparison and is therefore as fast as a search for an individual triple.
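The difference can be illustrated with a small Python sketch. It assumes the weights are held as numbers in a sorted list standing in for an index, so a range query becomes a binary search instead of a scan-and-parse; the data and names are invented, and this is not AllegroGraph's encoding.

```python
import bisect

# String comparison does not follow numeric order, so strings cannot be range-searched.
assert "158" > "1000"

# (encoded weight, subject) pairs kept in sorted order, as a sorted index would keep them.
entries = sorted([(158, "ex:alice"), (204, "ex:bob"), (231, "ex:carol"), (187, "ex:dave")])
keys = [weight for weight, _ in entries]


def heavier_than(threshold):
    """Return subjects whose weight is strictly greater than threshold."""
    start = bisect.bisect_right(keys, threshold)  # first entry with weight > threshold
    return [subject for _, subject in entries[start:]]


print(heavier_than(200))
```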
Cursors
When AllegroGraph is given a query pattern, it responds with a cursor that iterates over the triples that match the pattern. Programs can use functions like Java's cursorNext() or Lisp's cursor-next to move through a cursor or use higher-level constructs like map-cursor.
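The cursor idea can be modelled in Python with a lazy generator. The Cursor class and map_cursor function below are loose analogues written for this sketch; their signatures are assumptions and do not reproduce the Java or Lisp API.

```python
class Cursor:
    """Minimal forward-only cursor over the triples that satisfy a test."""

    def __init__(self, triples, test):
        # The generator is lazy: nothing is matched until next() is called.
        self._matches = (triple for triple in triples if test(triple))

    def next(self):
        """Return the next matching triple, or None when the cursor is exhausted."""
        return next(self._matches, None)


def map_cursor(function, cursor):
    """Apply function to each remaining triple and collect the results."""
    results = []
    triple = cursor.next()
    while triple is not None:
        results.append(function(triple))
        triple = cursor.next()
    return results


store = [("jans", "Type", "Human"), ("robbie", "petOf", "jans"), ("robbie", "Type", "Dog")]
cursor = Cursor(store, lambda triple: triple[0] == "robbie")
first = cursor.next()
rest = map_cursor(lambda triple: triple[2], cursor)
print(first, rest)
```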
Querying Quickly: Indices
AllegroGraph builds indices so that any query can find its first match in a single I/O operation. We can abbreviate the index flavors using s for subject, p for predicate and so on. What matters with an index is the sort order of the triples. For example, the spogi index first sorts on subject, then predicate, object, graph, and finally, id. If we ignore the triple ID, there are 24 different index flavors running from spogi through gopsi. Fortunately, we don't need every possible flavor in order to produce fast queries. Suppose, for example, that one of the triples we saw above is in our triple-store as
id subject predicate object graph
21445 jans Type Human jans's home page
There are many queries that will return this triple but the best flavor to use is the same for many of them. For example, we can use the spogi flavor for any of these queries:
- subject = <jans>
- subject = <jans> and predicate = <Type>
- subject = <jans> and predicate = <Type> and object = <Human>
(You can find more details on how AllegroGraph picks an index flavor in the reference guide).
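Why sort order matters can be seen in a short Python sketch: once triples are sorted in spogi order, any query that fixes a leading prefix of (subject, predicate, object) is a binary search followed by a short walk. The in-memory lists and the prefix_lookup helper are assumptions made for illustration.

```python
import bisect

# (subject, predicate, object, graph, id) tuples; the data is invented.
triples = [
    ("robbie", "petOf", "jans", "home", 21446),
    ("jans", "Type", "Human", "home", 21445),
    ("robbie", "Type", "Dog", "home", 21447),
]

spogi = sorted(triples)                                   # sorted by s, p, o, g, i
posgi = sorted((p, o, s, g, i) for s, p, o, g, i in triples)  # sorted by p, o, s, g, i


def prefix_lookup(index, prefix):
    """Return all index entries whose leading fields equal prefix."""
    start = bisect.bisect_left(index, prefix)  # a shorter tuple sorts before its extensions
    end = start
    while end < len(index) and index[end][:len(prefix)] == prefix:
        end += 1
    return index[start:end]


print(prefix_lookup(spogi, ("jans",)))
print(prefix_lookup(spogi, ("robbie", "Type")))
print(prefix_lookup(posgi, ("Type",)))  # predicate-only queries need a p-first flavor
```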
Out of the box, AllegroGraph builds six index flavors in the background as triples are added. Of course, you can customize which indices are built, when they are built and how they are updated. For example, if you never use named-graphs then you can drop the three g indices to save both disk space and processing time.
Query APIs
SPARQL and twinql
SPARQL is the query language of choice for modern triple-stores. AllegroGraph's SPARQL sub-system is twinql: it adheres to the proposed W3C standard, includes a query optimizer, and has full support for named-graphs. For more information on using SPARQL with AllegroGraph, see the tutorial and twinql reference guide.
RDFS++ Reasoning
Description logic or OWL reasoners are good at handling (complex) ontologies. They are usually complete (they give all the possible answers to a query) but have completely unpredictable execution times when the number of individuals increases beyond millions.
AllegroGraph's RDFS++ reasoning supports all the RDFS predicates and some of OWL's. It is not complete but it has predictable and fast performance. Here are the supported predicates:
- rdf:type and rdfs:subClassOf
- rdfs:range and rdfs:domain
- rdfs:subPropertyOf
- owl:sameAs
- owl:inverseOf
- owl:TransitiveProperty
The reasoner tutorial provides a quick introduction to how each predicate behaves and describes the reasoner in more detail. AllegroGraph also includes an optional hasValue reasoning module.
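To make the meaning of a few of these predicates concrete, here is a naive Python sketch covering rdfs:subClassOf (transitivity and type inheritance) and owl:inverseOf. It forward-chains to a fixed point purely for illustration and says nothing about how AllegroGraph's reasoner is actually implemented; the infer function is hypothetical.

```python
def infer(triples):
    """Apply subClassOf and inverseOf rules repeatedly until nothing new appears."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            if p == "rdfs:subClassOf":
                for s2, p2, o2 in facts:
                    if p2 == "rdfs:subClassOf" and s2 == o:
                        new.add((s, "rdfs:subClassOf", o2))  # transitivity
                    if p2 == "rdf:type" and o2 == s:
                        new.add((s2, "rdf:type", o))         # instances inherit superclasses
            elif p == "owl:inverseOf":
                for s2, p2, o2 in facts:
                    if p2 == s:
                        new.add((o2, o, s2))                 # s2 s o2 implies o2 o s2
                    if p2 == o:
                        new.add((o2, s, s2))                 # and the other direction
        if new <= facts:
            return facts
        facts |= new


facts = infer([
    ("robbie", "rdf:type", "Dog"),
    ("Dog", "rdfs:subClassOf", "Mammal"),
    ("Mammal", "rdfs:subClassOf", "Animal"),
    ("petOf", "owl:inverseOf", "hasPet"),
    ("robbie", "petOf", "jans"),
])
print(("robbie", "rdf:type", "Animal") in facts, ("jans", "hasPet", "robbie") in facts)
```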
Prolog
Prolog is an alternative query mechanism for AllegroGraph. With Prolog, you can specify queries declaratively. In the Prolog tutorial, we provide an introduction to using Prolog and AllegroGraph together. Prolog is an integral part of Lisp, so for Lispers the combination of Lisp, Prolog, and AllegroGraph forms a natural triad. In Java mode and with the HTTP interface, you can send Prolog select queries to the server: you send them as a string and get the bindings back as a list of values. See the HTTP protocol and Javadocs for the specifics.
Managing Massive Data - Federation
The block diagram we saw above is abstract: it can be implemented in many different ways. AllegroGraph 3.0 uses that same programming API to connect to local triple-stores (either on-disk or in-memory), remote-triple-stores and the entirely new federated triple-store. A federated store collects multiple triple-stores of any kind into a single virtual store that can be manipulated as if it were a simple local-store. Federation provides three big benefits:
- it scales,
- it makes triple-stores more manageable, and
- it makes data archiving almost trivial.
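The essence of federation, a virtual store that exposes the same query interface as its leaves and consults each of them, can be sketched in Python as follows. The class names, the get_triples signature and the monthly data are assumptions for this example only.

```python
class LeafStore:
    """A simple in-memory store holding (subject, predicate, object) triples."""

    def __init__(self, name, triples):
        self.name = name
        self.triples = list(triples)

    def get_triples(self, subject=None, predicate=None, obj=None):
        """Yield triples matching the pattern; None acts as a wild card."""
        for s, p, o in self.triples:
            if subject in (None, s) and predicate in (None, p) and obj in (None, o):
                yield (s, p, o)


class FederatedStore:
    """Presents several stores (leaf or federated) as one virtual store."""

    def __init__(self, *stores):
        self.stores = stores

    def get_triples(self, subject=None, predicate=None, obj=None):
        # Every leaf is consulted in turn, using the same interface as a leaf.
        for store in self.stores:
            yield from store.get_triples(subject, predicate, obj)


january = LeafStore("january", [("order1", "placedBy", "jans")])
february = LeafStore("february", [("order2", "placedBy", "jans"),
                                  ("order3", "placedBy", "gwking")])
recent = FederatedStore(january, february)
orders = list(recent.get_triples(predicate="placedBy", obj="jans"))
print(orders)
```

Because FederatedStore has the same get_triples method as LeafStore, a federation can itself be a member of another federation.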
Federation: Scalable triple-stores
Since federation provides a natural mechanism to join disparate triple-stores, we can use separate instances of AllegroGraph to load data on multiple CPUs and then combine them at query time [2]. Loading triples parallelizes extremely well: using N CPUs decreases the total time by roughly a factor of N. Experiments on the LUBM-8000 dataset show that we can easily load and index a billion triples on a four-CPU machine in less than 10 hours.
Federation: Data Management
AllegroGraph's federation mechanism and flexible triple-store architecture combine to make it easy to connect multiple stores together and treat them as one. For example, we can combine DBpedia, the USGS Geonames database and Census information into a single virtual store and explore the interconnections between these datasets without worrying about where the triples originate. Even better, we can keep different kinds of triples separate and combine them as needed. For example, we can keep known facts, inferred triples, provenance information, ontologies, metadata and deleted triples in separate, easily manageable stores and combine and re-combine the data as necessary.
Federation: Data warehousing
Enterprise data volumes are growing without bound, making it essential to enable the accumulation and archiving of many billions of triples. Federation lets you segment your data into usable chunks that can be swapped in and out as needed.
The figure illustrates how an enterprise data center can use federation to easily work with the three most current months of data. Since federated data stores can be built and changed easily, it is just as simple to look at historical data whenever that is necessary.
AllegroGraph and the Network
Sesame 2.0 HTTP interface
AllegroGraph's Sesame 2.0 HTTP Server runs in either the Java server or in any Lisp image with AllegroGraph loaded. The parameters to the Java server are described in the Server Installation document. Once running, clients interact with the server via HTTP requests. The Sesame libraries make it easy for users to create requests and interpret the results.
We have extended the Sesame HTTP protocol (as described on the OpenRDF website) to expose additional AllegroGraph features such as indexing and encoded data-types. These extensions are described in the HTTP protocol documentation.
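As a rough illustration of what such a request looks like, the Python sketch below builds (but does not send) a SPARQL select request. The URL layout follows the general shape of the Sesame 2 protocol, a repositories/<id> path with query and queryLn parameters; the base URL and repository name are assumptions, and none of the AllegroGraph extensions are shown. Consult the HTTP protocol documentation for the authoritative details.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical server location and repository name; substitute your own.
BASE_URL = "http://localhost:8080/sesame"
REPOSITORY = "example"


def build_sparql_request(query):
    """Build a GET request for a SPARQL select query without sending it."""
    params = urlencode({"query": query, "queryLn": "sparql"})
    url = f"{BASE_URL}/repositories/{REPOSITORY}?{params}"
    # Ask for the standard SPARQL results format in JSON.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 5")
print(request.get_method(), request.full_url)
```

Sending the request (for example with urllib.request.urlopen) requires a running server and is left out of the sketch.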
Clustering
On a single-processor system, loading a large set of data spends roughly 60% of the time loading triples and 40% indexing them. On a multiple-processor system or a cluster of independent machines, nearly all of the indexing can run in parallel with the loading process. When running interactively, newly added triples can also be indexed in the background.
AllegroGraph uses Franz's clustering technology which is itself built on a powerful RPC mechanism. AllegroGraph clustering works the same whether you are running on a computer with multiple cores or on a more traditional cluster of independent systems. All you need to do is make sure that AllegroGraph is installed on each machine you want to use and that you can access them over the network. Then once you tell your running AllegroGraph to use the machines and how many tasks it can allocate on each of them, it will automatically make use of the extra processing power to speed indexing and merging operations.
Specialized Datatypes
AllegroGraph supports several specialized datatypes for efficient storage, manipulation, and search of Social Network, Geospatial and Temporal information.
Social Network Analysis
By viewing interactions as connections in a graph, we can treat a multitude of different situations using the tools of Social Network Analysis (SNA). SNA lets us answer questions like:
- How closely connected are any two individuals?
- What are the core groups or clusters within the data?
- How important is this person (or company) to the flow of information?
- How likely is it that this person and that person know one another?
The field is full of rich mathematical techniques and powerful algorithms. AllegroGraph's SNA toolkit includes an array of search methods, tools for measuring centrality and importance, and the building blocks for creating more specialized measures.
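Two of the simplest such measures, the number of hops between two individuals and degree centrality, can be sketched in Python over a handful of triples. The functions below are generic textbook versions written for this example, not AllegroGraph's SNA toolkit, and the data is invented.

```python
from collections import deque


def build_graph(triples):
    """Build an undirected adjacency map from (subject, predicate, object) triples."""
    graph = {}
    for s, _, o in triples:
        graph.setdefault(s, set()).add(o)
        graph.setdefault(o, set()).add(s)
    return graph


def distance(graph, start, goal):
    """Breadth-first search: number of hops from start to goal, or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    others = len(graph) - 1
    if others < 1:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / others for node, neighbours in graph.items()}


graph = build_graph([
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "dave"),
])
print(distance(graph, "alice", "dave"), degree_centrality(graph)["bob"])
```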
Geospatial Primitives
AllegroGraph provides a novel mechanism for efficient storage and retrieval of geospatial data [3]. Support is provided both for Cartesian coordinate systems (i.e., a flat plane) and for spherical coordinate systems (e.g., the surface of the earth or the celestial sphere).
Coordinates in two dimensions are encoded into a single UPI. Once data has been encoded this way, AllegroGraph can perform the following sorts of queries in either Cartesian or spherical coordinates very quickly:
- bounding-box: return a cursor that iterates over all triples within a given rectangular region.
- center/radius: return a cursor that iterates over all triples within a circular region.
AllegroGraph's geospatial application also supports defining polygons and quickly:
- determining whether a point lies inside or outside a given polygon,
- determining whether two polygons overlap, and
- retrieving all triples that lie inside a given polygon.
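The geometric tests behind these queries are easy to state for the Cartesian case. The Python sketch below shows plain versions of the bounding-box, center/radius and point-in-polygon checks (the last using the standard ray-casting method); it does not model the UPI encoding or the spherical case, points exactly on a polygon edge are not treated specially, and all names are hypothetical.

```python
import math


def in_bounding_box(point, x_min, y_min, x_max, y_max):
    """True if the point lies within the rectangle (edges included)."""
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


def within_radius(point, center, radius):
    """True if the point lies within the circle (edge included)."""
    return math.hypot(point[0] - center[0], point[1] - center[1]) <= radius


def in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a rightward ray from the point crosses."""
    x, y = point
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        if (y1 > y) != (y2 > y):  # the edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(in_bounding_box((2, 2), 0, 0, 4, 4),
      within_radius((3, 4), (0, 0), 5),
      in_polygon((2, 2), square),
      in_polygon((5, 2), square))
```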
Temporal Primitives
AllegroGraph supports efficient storage and retrieval of temporal data including datetimes, time points, and time intervals:
- datetimes in ISO8601 format: "2008-02-01T00:00:00-08:00"
- time points: ex:point1, ex:h-hour, ex:when-the-meeting-began, etc
- time intervals: ex:delay-interval (say, from point ex:point1 to ex:h-hour)
Once data has been encoded, applications can perform queries involving a broad range of temporal constraints on data, including relations between:
- points and datetimes
- intervals and datetimes
- two points
- two intervals
- points and intervals
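A few of these relations can be sketched with Python's standard datetime module, treating a time point as a datetime and an interval as a (start, end) pair. The relation names and the sample times are assumptions for illustration and are not AllegroGraph's temporal vocabulary.

```python
from datetime import datetime


def point_before(point, other):
    """Relation between two points."""
    return point < other


def point_during(point, interval):
    """Relation between a point and an interval (end points included)."""
    start, end = interval
    return start <= point <= end


def interval_before(first, second):
    """Relation between two intervals: first ends no later than second starts."""
    return first[1] <= second[0]


def intervals_overlap(first, second):
    """Relation between two intervals: they share more than a single instant."""
    return first[0] < second[1] and second[0] < first[1]


# ISO 8601 datetimes with a UTC offset, as in the example above.
h_hour = datetime.fromisoformat("2008-02-01T00:00:00-08:00")
point1 = datetime.fromisoformat("2008-01-31T22:30:00-08:00")
delay_interval = (point1, h_hour)
meeting = (datetime.fromisoformat("2008-01-31T23:00:00-08:00"),
           datetime.fromisoformat("2008-02-01T01:00:00-08:00"))

print(point_before(point1, h_hour),
      point_during(h_hour, meeting),
      interval_before(delay_interval, meeting),
      intervals_overlap(delay_interval, meeting))
```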
Freetext Indexing
AllegroGraph can build freetext indexes of the strings of the objects associated with a set of predicates that you specify. Given a freetext index, you can search for text using:
- boolean expressions ("market" AND "housing")
- wild cards ("science*" OR "math*")
- phrases ("Semantic Web search")
Of course, freetext indexing slows the rate at which you can insert triples. Our experiments suggest that you'll see a decrease somewhere between 5 and 25% depending on the number of predicates involved and the kinds of string data in your application.
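The basic mechanism, an inverted index built only for the predicates you choose, can be sketched in Python as below. Boolean AND and OR map onto set intersection and union and trailing wild cards onto prefix matching; phrase search is not covered. The class is a toy written for this example and makes no claim about AllegroGraph's index format.

```python
import re
from collections import defaultdict


class FreetextIndex:
    """Toy inverted index over the object strings of selected predicates."""

    def __init__(self, indexed_predicates):
        self.indexed_predicates = set(indexed_predicates)
        self.postings = defaultdict(set)  # word -> ids of triples containing it

    def add(self, triple_id, predicate, text):
        """Index text only if its predicate was selected for indexing."""
        if predicate not in self.indexed_predicates:
            return
        for word in re.findall(r"\w+", text.lower()):
            self.postings[word].add(triple_id)

    def word(self, word):
        """Ids of triples containing exactly this word."""
        return set(self.postings.get(word.lower(), set()))

    def prefix(self, prefix):
        """Ids of triples containing any word that starts with prefix (a trailing wild card)."""
        prefix = prefix.lower()
        found = set()
        for word, ids in self.postings.items():
            if word.startswith(prefix):
                found |= ids
        return found


index = FreetextIndex(["ex:comment"])
index.add(1, "ex:comment", "The housing market cooled")
index.add(2, "ex:comment", "Science and mathematics")
index.add(3, "ex:name", "housing")  # ignored: ex:name is not an indexed predicate

both = index.word("market") & index.word("housing")      # "market" AND "housing"
either = index.prefix("science") | index.prefix("math")  # "science*" OR "math*"
print(both, either)
```

The extra work done in add for every indexed string is the source of the insertion slowdown mentioned above.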
AllegroGraph's internal architecture
Internally, an open AllegroGraph triple-store is an instance of one of the classes depicted below. Most of the time, you won't need to be concerned with this class implementation because AllegroGraph will manage it transparently. We're depicting them here because they also serve to illustrate many of AllegroGraph's capabilities.
Let's look at each of these in turn.
An abstract-triple-store defines the main interfaces a triple-store must implement. This class has four main subclasses:
- concrete-triple-stores manage actual triples, whereas the other three function as wrappers between real triples and the store.
- federated-triple-stores provide mechanisms to group and structure arbitrary collections of other triple-stores.
- encapsulated-triple-stores let us add new behaviors to existing stores in a controlled and easily optimized fashion. The best example of an encapsulated-store is a reasoning-triple-store, which endows triple-stores with RDFS++, rule-based or other reasoning engines.
- Finally, a remote-triple-store lets AllegroGraph use triple-stores being served by other processes either locally or anywhere on the network. These triple-stores can be other AllegroGraph stores or connections to Oracle and Sesame ones [4].
By combining these four classes, you can build a triple-store composed of leaf stores from anywhere, implementing differing reasoning rules, from entirely different architectures and treat them as if they comprise a single unified store living on your desktop.
Programming with AllegroGraph
AllegroGraph comes in multiple flavors and works with multiple programming languages and environments.
Java. The Java client interface implements most of the Sesame and Jena interfaces for accessing remote RDF repositories. Because AllegroGraph provides functionality not found in other triple-stores, we have implemented extensions where applicable. See the pre-release Jena page for information on our Jena support.
HTTP. Web developers and programmers alike can now interact with AllegroGraph 3.3 entirely through a RESTful HTTP protocol (using GET, PUT, POST) to add and delete triples, to query for individual triples and to do SPARQL and Prolog selects, using the Sesame 2.0 HTTP interface with some extensions.
Lisp. Lisp programmers can open and use triple-stores from within Lisp. Lispers can create applications in the same image that the AllegroGraph server is running or use remote-triple-stores to access data in client/server mode.
TopBraid Composer is an advanced tool for examining and building ontologies. You can connect TopBraid Composer to AllegroGraph and visually inspect your data.
Getting Started
We've included a rich set of tutorials to explain and expand on AllegroGraph's unique features and capabilities. These include the language specific Java Tutorial and Lisp Tutorial as well as tutorials on:
The reference guide includes details on the AllegroGraph design, its architecture and the complete Lisp API, whereas the Javadocs cover the Java API. Java programmers (and others!) can learn a great deal about using and developing with AllegroGraph in the on-line learning center.
Finally, you'll want to check the rest of the Franz Semantic Technologies website for additional resources and ideas. Support is always available at [email protected]. If you send in a bug report or query to support, please include the following information:
- The version of AllegroGraph you are using.
- The operating system version on which you are using AllegroGraph.
- The client language and version you are using (e.g., Java version 1.6.0).
- The text of any error messages.
Footnotes
1. Parsers for N3, Turtle and other standard file formats are planned for the future. In the meantime, you can use the open source tool rapper to convert these formats into one that AllegroGraph can use.
2. Since a query needs to look in each leaf store, federated queries will be somewhat slower than ones which only need to look in a single, local store. Future versions of AllegroGraph will help work around this problem by keeping better track of which leaf stores contain which different kinds of data.
3. Though we use the shorthand geospatial, AllegroGraph actually supports any two-dimensional data duples whether these specify locations on a silicon chip, the sphere of the earth or some other, user-defined notion. Unless there is danger of ambiguity, we will continue to use geospatial for all of these notions.
4. The Oracle and Sesame connections are in development; please contact us for details.