Introduction
When AllegroGraph is given a query pattern, for example in the form of a get-triples call or as a triple pattern in a SPARQL query, it consults indices to efficiently locate the matching triples. This document describes the structure and usage of these indices.
The most important point is that optimizing indices will improve query performance.
Triple components
Before diving into indices, let's first look at the components of a triple in AllegroGraph:
- subject (s) - an IRI or blank node
- predicate (p) - an IRI
- object (o) - an IRI, literal (like a number of string), or blank node
- graph (g) - an IRI, or the default graph
- triple id (i) - a unique number within the repository
- triple attributes (optional)
The triple id is for internal use, though it also accessible via the function triple-id and the tripleId magic property. In a distributed or federated repository there may be multiple triples from different repositories that share the same triple id.
Triple attributes are not indexed.
Index flavors
An index name, also called a flavor, is a permutation of the letters s
, p
, o
, g
, i
which are the triple component abbreviations described above. Not all possible permutations are supported by AllegroGraph, as the triple id is special in that it is unique. The supported index flavors are:
index
i
;indices that start with
spog
in any order, ending withi
(for examplespogi
andposgi
).
Index Structure
An index is a data structure that contains all the triples in the repository. The index is stored on disk.
The order of the letters in an index name indicate how the triples are grouped and sorted.
For example in a spogi
index:
- the triples are first sorted by subject (s); all triples with the same subject are grouped together;
- within such a group the triples are sorted by predicate (p);
- within each subject/predicate combination the triples are sorted by object (o);
- then graph (g);
- and finally triple id (i).
If multiple triples have the same subject, predicate, object and graph values, then the id, which is always unique, finally determines a deterministic sort order. (If duplicate triples are undesired, see deleting duplicate triples.)
Similarly in the gspoi
index the triples are sorted first by graph, then subject, predicate, object and finally id.
The index i
is an exception, in that the sort order is determined by only one triple component, its unique triple id.
Literal and numeric values
Values that are strings in nature (like literal strings and IRIs) are stored once in a string table and every triple index only contains an identifier that refers to the string table entry.
Values that are numeric in nature (like integers, floats, booleans, dates, geospatial) are encoded in a way that allows efficient range queries.
Default Indices
The seven indices that are created by default for new repositories are: spogi
, posgi
, psogi
, ospgi
, gspoi
, gposi
, and i
. With this set most query patterns can be handled efficiently.
Index storage space
The rule of thumb is that each triple needs about 100 bytes of disk storage space in total when using the default indices.
The required space grows less than linearly with the number of indices, as certain data (like the string table that contains the full name of literals and IRIs) is stored only once and shared by all indices.
Adding and Dropping Indices
A repository must have at least one index. Apart from that restriction indices can be added and dropped freely.
Dropping unused indices frees up disk and CPU resources. For example if the graph component of triples is never used, indices beginning with g
are not useful and can be removed without influencing query performance. When using the default indices, this means dropping indices gspoi
and gposi
.
To modify indices:
- Webview: Manage triple indices
- Java: AGRepositoryConnection.addIndex and dropIndex methods
- Lisp: function add-index and drop-index
- Python: RepositoryConnection.addIndex and dropIndex
- HTTP: /repositories/[name]/indices
Optimizing Indices
As triples are added and deleted, the index is incrementally updated and the data structure becomes less efficient over time. This means that locating the matching triples takes more and more time. The efficiency of an index is expressed as an Oscore which can be checked in WebView. The optimal value is 1.000
and higher values indicates worse performance.
An index can be optimized which involves rewriting the index in a more compact way that allows faster triple retrieval. During this optimization process the index can still be consulted by the query engine, so optimization is not a blocking operation. After optimization the Oscore will be lower.
AllegroGraph automatically optimizes indices in the background when their efficiency is below an internal threshold. In order to not take too many resources from the query handling or other concurrent actions, the background optimization does limited effort which results in good but not optimal index efficiency.
The agtool auto optimize tool starts a process, which in the default runs forever (that is, until the server or server restarts) which automatically optimizes indices whenever it detacts optimization is needed and the index seems idle. Along with the agtool command, there is also an AGWebView link.
Aside from the automatic optimizer, optimization can be initiated through the various client APIs. These operations take a level
argument that can be 2
or 3
, where 3
takes the most time but results in an optimal index. If automatic optimization is enabled, these tools should be needed only rarely if at all.
- WebView: "Optimize the repository" option in WebView (and the Start background automatic optimizer link described above).
- agtool: agtool optimize
- Java: the AGRepositoryConnection.optimizeIndices method
- Lisp: function optimize-indices
- Python: RepositoryConnection.optimizeIndices
- HTTP: /repositories/[name]/indices/optimize
Deleted triples and their effect on indices and query performance
When a triple is deleted from a repository, it is not automatically repoved from storage or memory. Instead, it is marked as deleted (in a bitmap which identifies live and deleted triples in an index). But, as said, it continues to use storage and, if there are a lot of deleted triples, this can affect query performance.
Deleted triples can be purged from storage. There is a Purge deleted triples button on the Manage triple indices page. There is also an HTTP interface. See the Purging Deleted Triples document for more information. As noted in that document, index optimizations will purge deleted triples that are encountered. Automatic background optimization may purge some triples but will miss most older deleted triples. Level 2 index optimization will do much better and level 3 index optimization will purge all deleted triples but it can be quite slow.
The approximate fraction of deleted triple among all triples in an index is displayed by the agtool storage-report command. Follow the link to see an example of the output of that command.
Query performance
Let's look at how the query engine uses indices for the query:
(get-triples :s <ex:person1> :p <ex:name>)
or the similar SPARQL pattern:
{ <ex:person1> <ex:name> ?name }
The spogi
index is an efficient index for these queries, as the fixed subject <ex:person1>
and fixed predicate <ex:name>
directly lead to the index section with all matching triples, and these triples can be efficiently returned. Any index flavor that starts with sp...
or ps...
(like spgoi
or psogi
) is optimal.
What if no sp...
or ps...
index is active? In that case the query engine will look for an active index that starts with just s....
or p....
. For example, assume sopgi
is available. As the subject in the query pattern is fixed, the highest level lookup efficiently locates all triples with subject <ex:person1>
. But within this section the triples are ordered first by object, second by predicate. To find the triples with the correct predicate, all triples in this section (with subject <ex:person1>
) must be consulted, and for each the predicate must be compared to <ex:name>
. Only the triples that pass this check will be returned. It could well be that the number of triples consulted is much larger than the number of triples that pass the check, so potentially a lot of work is done for very few returned triples, compared to the required effort in an optimal index.
In general for every query pattern there are certain optimal indices, and if such an index is active that is what the query engine will use. A suboptimal index can also be used, and the query results will still be correct.
In case of uncommon query patterns it could help to add a matching index that is not a default index. For example the SPARQL pattern:
graph <ex:g> {
<ex:person1> ?p ?o
}
filter ((?o > 10) && (?o < 20))
with fixed subject and graph, and a numeric range of object values, will run most efficiently if index gsopi
or sgopi
is active:
the graph and subject are fixed, so the index should start
g
ands
;in the
gso..
andsgo..
indices the section for a given graph and subject contains triples ordered by object, so the matches for the range query for values between 10 and 30 are stored adjecent.
Neither index gsopi
nor sgopi
is enabled by default. But the default gspoi
index already gives two levels of optimal sorting (graph and subject) and adding an extra index will probably not impact the query performance significantly.
In general, in case of unsatisfying query performance, optimize the indices first, and contact [email protected] if that did not resolve the problem.
Index Styles
AllegroGraph supports two different triple index data structures, style 1 and style 2, with different performance characteristics. In practice style 1 has shown the best performance, and it is the default. Style 2 has now been deprecated and might be removed in a future release.