Overview of Indexing

NOTE: the information on this page pertains to AllegroGraph version 3.x. There is no corresponding 4.x page at this time.

AllegroGraph builds indices so that any query can find its first match in a single I/O operation. Index flavors are abbreviated with one letter per field: s for subject, p for predicate, o for object, g for graph and i for id. What matters with an index is the sort order of the triples. For example, the spogi index sorts first on subject, then on predicate, object, graph and, finally, id.

Suppose we have the triple 'jans isa man' from the file 'file1'. On disk, this triple is represented as

s := jans  
p := isa  
o := man  
g := file1  
i := 213863271    
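To make the idea of a sort order concrete, here is a minimal Python sketch. It is only an illustration: the two extra triples and their ids are made up, and plain strings are compared, whereas the real store sorts on its own internal encoding of each field.

```python
from collections import namedtuple

# One record per triple, using the same five fields as the example above.
Triple = namedtuple("Triple", ["s", "p", "o", "g", "i"])


def sort_key(flavor):
    """Return a key function that orders triples by the fields named in flavor."""
    return lambda t: tuple(getattr(t, field) for field in flavor)


# The first triple comes from the text; the other two are hypothetical.
triples = [
    Triple("jans", "isa", "man", "file1", 213863271),
    Triple("anna", "isa", "woman", "file1", 213863272),
    Triple("jans", "likes", "anna", "file2", 213863273),
]

# The same triples, ordered the way two different index flavors would hold them.
spogi = sorted(triples, key=sort_key("spogi"))
posgi = sorted(triples, key=sort_key("posgi"))
```

In the spogi ordering all triples with the subject jans sit next to each other, while in the posgi ordering all isa triples do. That adjacency is what lets a query jump to its first match and then read sequentially.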

There are many queries that will return this triple. Here is a table that lists each possible query and the index flavor that AllegroGraph will use to optimize it. In each query, --- marks a field that is left unspecified.

index    query (subject, predicate, object, graph)  
spogi    get-triples jans, ---, ---, ---  
         get-triples jans, isa, ---, ---  
         get-triples jans, isa, man, ---  
posgi    get-triples ---,  isa, ---, ---  
         get-triples ---,  isa, man, ---  
ospgi    get-triples ---,  ---, man, ---  
         get-triples jans, ---, man, ---  
gspoi    get-triples jans, ---, ---, file1  
         get-triples jans, isa, ---, file1  
         get-triples jans, isa, man, file1  
gposi    get-triples ---,  isa, ---, file1  
         get-triples ---,  isa, man, file1  
gospi    get-triples ---,  ---, man, file1  
         get-triples jans, ---, man, file1  
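The table can also be read as a lookup from the set of specified fields to an index flavor. The Python sketch below encodes exactly the rows listed above. The function name index_flavor is hypothetical and not part of AllegroGraph, and patterns the table does not list (for example, a query with no fields specified) simply return None here rather than guessing at what the query planner does.

```python
# Keys name the fields that are specified in the query, in s, p, o, g order.
# Every entry corresponds to one row of the table; nothing else is assumed.
FLAVOR_BY_BOUND_FIELDS = {
    "s": "spogi", "sp": "spogi", "spo": "spogi",
    "p": "posgi", "po": "posgi",
    "o": "ospgi", "so": "ospgi",
    "sg": "gspoi", "spg": "gspoi", "spog": "gspoi",
    "pg": "gposi", "pog": "gposi",
    "og": "gospi", "sog": "gospi",
}


def index_flavor(s=None, p=None, o=None, g=None):
    """Return the index flavor the table lists for this pattern, or None."""
    bound = "".join(
        name for name, value in zip("spog", (s, p, o, g)) if value is not None
    )
    return FLAVOR_BY_BOUND_FIELDS.get(bound)


flavor_subject_only = index_flavor(s="jans")
flavor_subject_object = index_flavor(s="jans", o="man")
flavor_with_graph = index_flavor(p="isa", o="man", g="file1")
```

Note the row with subject and object specified but no predicate: it uses ospgi because that flavor sorts on object and then subject, so the specified fields form a prefix of the sort order.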

Out of the box, AllegroGraph builds all six indices in the background as triples are added. You can customize which indices are built, when they are built and how they are updated. For example, if you never use named graphs, the three indices that start with g will never be used, and you can remove them to save both disk space and processing time. It is also relatively rare to need the ospgi index (because there are other ways to find all predicates or all predicates for a particular subject).

Indexing strategies

If your triple-store is not too large and the insertion rate is moderate, you can rely on AllegroGraph's standard indexing strategy, and your triples will be indexed automatically shortly after they have been added. On the other hand, if your triple-store is large or the insertion rate is high, you will want to think about your indexing strategy.

The standard strategy

When you add a triple to AllegroGraph it goes into an unindexed log. The standard indexing strategy is to build indices whenever the number of unindexed triples exceeds a threshold (this is tunable). Each batch of triples indexed in this way forms a single index chunk (per index flavor). Each chunk adds to the total cost of querying a triple-store, so AllegroGraph also merges chunks automatically when there are too many (this is another tunable parameter). AllegroGraph provides many functions to inspect the current index state and to update it to achieve better performance. For example:

  • indexing-needed-p: tells you if there are unindexed triples.
  • index-new-triples: indexes all the currently unindexed triples but does not merge the new indices with the existing indices.
  • index-all-triples: first calls index-new-triples and then merges all indices together.
  • average-index-fragment-count: returns the average number of index chunks (smaller is better).
  • index-coverage-percent: returns the average proportion of triples that are indexed for each flavor (closer to 100 is better).
  • merge-new-triples: merges all the indices that were built since the last full index-all-triples. Usually this happens automatically, but you can speed up your application by calling it from your application code.
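The interplay between the unindexed log, the chunks and the two thresholds can be sketched with a toy model in Python. This is not AllegroGraph code: the class ToyStore, its method names (chosen to echo the functions above) and the threshold values are all assumptions made for illustration, and the model tracks a single index flavor only.

```python
class ToyStore:
    """Toy model of one index flavor: an unindexed log plus sorted chunks."""

    def __init__(self, index_threshold=4, max_chunks=3):
        self.index_threshold = index_threshold  # hypothetical tunable
        self.max_chunks = max_chunks            # hypothetical tunable
        self.log = []     # triples added but not yet indexed
        self.chunks = []  # each chunk is one sorted batch of triples

    def add_triple(self, triple):
        self.log.append(triple)
        # Index automatically once the log grows past the threshold.
        if len(self.log) > self.index_threshold:
            self.index_new_triples()
            # Merge automatically once there are too many chunks.
            if len(self.chunks) > self.max_chunks:
                self.merge_chunks()

    def indexing_needed(self):
        return bool(self.log)

    def index_new_triples(self):
        # Turn the whole log into one new chunk; existing chunks are untouched.
        if self.log:
            self.chunks.append(sorted(self.log))
            self.log = []

    def merge_chunks(self):
        if len(self.chunks) > 1:
            self.chunks = [sorted(t for chunk in self.chunks for t in chunk)]

    def index_all_triples(self):
        self.index_new_triples()
        self.merge_chunks()

    def coverage_percent(self):
        indexed = sum(len(chunk) for chunk in self.chunks)
        total = indexed + len(self.log)
        return 100.0 if total == 0 else 100.0 * indexed / total


store = ToyStore()
for n in range(12):
    store.add_triple((f"s{n:02d}", "isa", "thing"))

before = (len(store.chunks), len(store.log))  # two chunks, two unindexed triples
store.index_all_triples()
after = (len(store.chunks), len(store.log))   # one merged chunk, empty log
```

With these made-up settings, twelve inserts leave two chunks and two unindexed triples behind, and a full index-all-triples collapses everything into a single chunk with full coverage.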

A Common Scenario

In our experience the most common pattern of usage for AllegroGraph is to start with a large initial load of triples from a set of RDF files, CSV files or a relational database. Once the data has been loaded (and indexed), the triple-store goes into interactive mode with default indexing behavior. In pseudocode, the pattern looks like this:

  • Initial Load

    1. create a new database
    2. for every file in file list
      load-ntriple-file file
    3. index-all-triples
  • Daily use

    1. load many triples
    2. index-new-triples

So imagine this use case: you are working with a really big triple-store in a day-to-day production environment. You loaded the first several hundred million triples in bulk mode (first loading, then indexing). Then you switched over to interactive mode, where multiple clients might be adding triples. If query performance deteriorates too much, you can lower the indexing and merging thresholds or call index-new-triples yourself. Don't do this too often, because otherwise the indices become too fragmented. A quick optimization is to call merge-new-triples so that the newest indices get merged. Then at night, or perhaps once a week, you will want to do a full index-all-triples.
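One way to turn this advice into a routine check is sketched below in Python. It is only an illustration: the function choose_maintenance_step and the thresholds min_coverage and max_fragments are hypothetical, and the right values depend on your data and workload. The two inputs correspond to what index-coverage-percent and average-index-fragment-count report.

```python
def choose_maintenance_step(coverage_percent, fragment_count, off_peak,
                            min_coverage=90.0, max_fragments=8):
    """Return the name of the maintenance call suggested by the advice above.

    min_coverage and max_fragments are hypothetical thresholds.
    """
    if off_peak:
        # At night or once a week: do the full index and merge.
        return "index-all-triples"
    if coverage_percent < min_coverage:
        # Too many triples are still sitting in the unindexed log.
        return "index-new-triples"
    if fragment_count > max_fragments:
        # Too many small chunks: merge the newest indices.
        return "merge-new-triples"
    # Otherwise leave the work to the automatic indexing behavior.
    return None


nightly = choose_maintenance_step(99.0, 2, off_peak=True)
low_coverage = choose_maintenance_step(70.0, 2, off_peak=False)
fragmented = choose_maintenance_step(99.0, 20, off_peak=False)
healthy = choose_maintenance_step(99.0, 2, off_peak=False)
```

Checking coverage before fragmentation reflects the order in the text: unindexed triples hurt queries first, and merging is the cheaper follow-up once they have been indexed.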


Copyright © 2014 Franz Inc., All Rights Reserved | Privacy Statement