Triple Indices

Introduction

When AllegroGraph is given a query pattern, for example in the form of a get-triples call or as a triple pattern in a SPARQL query, it consults indices to efficiently locate the matching triples. This document describes the structure and usage of these indices.

The most important point is that optimizing indices will improve query performance.

Triple components

Before diving into indices, let's first look at the components of a triple in AllegroGraph:

subject (s) - an IRI or blank node
predicate (p) - an IRI
object (o) - an IRI, literal (like a number of string), or blank node
graph (g) - an IRI, or the default graph
triple id (i) - a unique number within the repository
triple attributes (optional)

The triple id is for internal use, though it also accessible via the function triple-id and the tripleId magic property. In a distributed or federated repository there may be multiple triples from different repositories that share the same triple id.

Triple attributes are not indexed.

Index flavors

An index name, also called a flavor, is a permutation of the letters s, p, o, g, i which are the triple component abbreviations described above. Not all possible permutations are supported by AllegroGraph, as the triple id is special in that it is unique. The supported index flavors are:

index i;
indices that start with spog in any order, ending with i (for example spogi and posgi).

Index Structure

An index is a data structure that contains all the triples in the repository. The index is stored on disk.

The order of the letters in an index name indicate how the triples are grouped and sorted.

For example in a spogi index:

the triples are first sorted by subject (s); all triples with the same subject are grouped together;
within such a group the triples are sorted by predicate (p);
within each subject/predicate combination the triples are sorted by object (o);
then graph (g);
and finally triple id (i).

If multiple triples have the same subject, predicate, object and graph values, then the id, which is always unique, finally determines a deterministic sort order. (If duplicate triples are undesired, see deleting duplicate triples.)

Similarly in the gspoi index the triples are sorted first by graph, then subject, predicate, object and finally id.

The index i is an exception, in that the sort order is determined by only one triple component, its unique triple id.

Literal and numeric values

Values that are strings in nature (like literal strings and IRIs) are stored once in a string table and every triple index only contains an identifier that refers to the string table entry.

Values that are numeric in nature (like integers, floats, booleans, dates, geospatial) are encoded in a way that allows efficient range queries.

Default Indices

The seven indices that are created by default for new repositories are: spogi, posgi, psogi, ospgi, gspoi, gposi, and i. With this set most query patterns can be handled efficiently.

Index storage space

The rule of thumb is that each triple needs about 100 bytes of disk storage space in total when using the default indices.

The required space grows less than linearly with the number of indices, as certain data (like the string table that contains the full name of literals and IRIs) is stored only once and shared by all indices.

Adding and Dropping Indices

A repository must have at least one index. Apart from that restriction indices can be added and dropped freely.

Dropping unused indices frees up disk and CPU resources. For example if the graph component of triples is never used, indices beginning with g are not useful and can be removed without influencing query performance. When using the default indices, this means dropping indices gspoi and gposi.

To modify indices:

Webview: Manage triple indices
Java: AGRepositoryConnection.addIndex and dropIndex methods
Lisp: function add-index and drop-index
Python: RepositoryConnection.addIndex and dropIndex
HTTP: /repositories/[name]/indices

Optimizing Indices

As triples are added and deleted, the index is incrementally updated and the data structure becomes less efficient over time. This means that locating the matching triples takes more and more time. The efficiency of an index is expressed as an Oscore which can be checked in WebView. The optimal value is 1.000 and higher values indicates worse performance.

An index can be optimized which involves rewriting the index in a more compact way that allows faster triple retrieval. During this optimization process the index can still be consulted by the query engine, so optimization is not a blocking operation. After optimization the Oscore will be lower.

AllegroGraph automatically optimizes indices in the background when their efficiency is below an internal threshold. In order to not take too many resources from the query handling or other concurrent actions, the background optimization does limited effort which results in good but not optimal index efficiency.

The agtool auto optimize tool starts a process, which in the default runs forever (that is, until the server or server restarts) which automatically optimizes indices whenever it detacts optimization is needed and the index seems idle. Along with the agtool command, there is also an AGWebView link.

Aside from the automatic optimizer, optimization can be initiated through the various client APIs. These operations take a level argument that can be 2 or 3, where 3 takes the most time but results in an optimal index. If automatic optimization is enabled, these tools should be needed only rarely if at all.

WebView: "Optimize the repository" option in WebView (and the Start background automatic optimizer link described above).
agtool: agtool optimize
Java: the AGRepositoryConnection.optimizeIndices method
Lisp: function optimize-indices
Python: RepositoryConnection.optimizeIndices
HTTP: /repositories/[name]/indices/optimize

In most cases, the best approach is to run optimization with regular intervals, for example using the cron utility on the machine running AllegroGraph server. This way cron has access to agtool utility provided with AllegroGraph.

Assuming AllegroGraph is running as agraph user, the following line in the system-wide crontab (/etc/crontab)

0 2,14 * * * agraph /path/to/agtool optimize cat:repo --level 2

will run the level 2 optimization of all indices of the repository repo in the catalog cat every day at hours 2:00 (2:00 AM) and 14:00 (2:00 PM).

Note that because this agtool call is run as the same user as the server (agraph), there is no need to specify credentials, only the catalog and repository. See Repository Specification for more information about triple store specification syntax.

In order to test that the cron job is congigured correctly, use the following command to check if the expected number of optimization cycles has happened:

grep 'completed at level 2' /path/to/agraph.log

The log file should contain a line of this format per triple index per optimization cycle.

Deleted triples and their effect on indices and query performance

When a triple is deleted from a repository, it is not automatically repoved from storage or memory. Instead, it is marked as deleted (in a bitmap which identifies live and deleted triples in an index). But, as said, it continues to use storage and, if there are a lot of deleted triples, this can affect query performance.

Deleted triples can be purged from storage. There is a Purge deleted triples button on the Manage triple indices page. There is also an HTTP interface. See the Purging Deleted Triples document for more information. As noted in that document, index optimizations will purge deleted triples that are encountered. Automatic background optimization may purge some triples but will miss most older deleted triples. Level 2 index optimization will do much better and level 3 index optimization will purge all deleted triples but it can be quite slow.

The approximate fraction of deleted triple among all triples in an index is displayed by the agtool storage-report command. Follow the link to see an example of the output of that command.

Query performance

Let's look at how the query engine uses indices for the query:

(get-triples :s <ex:person1> :p <ex:name>)

or the similar SPARQL pattern:

{ <ex:person1> <ex:name> ?name }

The spogi index is an efficient index for these queries, as the fixed subject <ex:person1> and fixed predicate <ex:name> directly lead to the index section with all matching triples, and these triples can be efficiently returned. Any index flavor that starts with sp... or ps... (like spgoi or psogi) is optimal.

What if no sp... or ps... index is active? In that case the query engine will look for an active index that starts with just s.... or p..... For example, assume sopgi is available. As the subject in the query pattern is fixed, the highest level lookup efficiently locates all triples with subject <ex:person1>. But within this section the triples are ordered first by object, second by predicate. To find the triples with the correct predicate, all triples in this section (with subject <ex:person1>) must be consulted, and for each the predicate must be compared to <ex:name>. Only the triples that pass this check will be returned. It could well be that the number of triples consulted is much larger than the number of triples that pass the check, so potentially a lot of work is done for very few returned triples, compared to the required effort in an optimal index.

In general for every query pattern there are certain optimal indices, and if such an index is active that is what the query engine will use. A suboptimal index can also be used, and the query results will still be correct.

In case of uncommon query patterns it could help to add a matching index that is not a default index. For example the SPARQL pattern:

graph <ex:g> {  
  <ex:person1> ?p ?o  
}  
filter ((?o > 10) && (?o < 20))

with fixed subject and graph, and a numeric range of object values, will run most efficiently if index gsopi or sgopi is active:

the graph and subject are fixed, so the index should start g and s;
in the gso.. and sgo.. indices the section for a given graph and subject contains triples ordered by object, so the matches for the range query for values between 10 and 30 are stored adjecent.

Neither index gsopi nor sgopi is enabled by default. But the default gspoi index already gives two levels of optimal sorting (graph and subject) and adding an extra index will probably not impact the query performance significantly.

In general, in case of unsatisfying query performance, optimize the indices first, and contact [email protected] if that did not resolve the problem.

Index Styles

AllegroGraph supports two different triple index data structures, style 1 and style 2, with different performance characteristics. In practice style 1 has shown the best performance, and it is the default. Style 2 has now been deprecated and might be removed in a future release.

AllegroGraph 9.0.1 Triple Indices