AllegroGraph 8.3.0 Performance Tuning
This guide explains how to use AllegroGraph's tunable parameters to maximize database performance. These settings are applied in the AllegroGraph server configuration file (agraph.cfg) and in the individual database parameter files (parameters.dat). All parameter settings are documented on the Server Configuration page.
Hardware Selection
If you plan to work with billions of RDF triples, we encourage you to purchase as much hardware as your budget permits. We can offer purchase references ([email protected]) for hardware systems containing multiple CPU cores with 96 GB of RAM for ~$7,000 or less.
An ideal system for 2-3 billion triples would include 8 CPU cores, 96 GB of RAM, and solid state drives (SSDs). This type of system will optimize query performance as well as provide rapid ingest speeds. We also work with smaller hardware in our development and release testing and get very satisfactory results.
Your results will vary with application-specific factors, such as the number and size of unique strings found in the triples.
Please contact us if you have any questions related to hardware selection - [email protected]
Disable transparent hugepages (THP)
Transparent hugepages (THP) is a mechanism that allows Linux to automatically use memory pages that are much larger than the standard 4096 bytes. (See, e.g., this description in Red Hat documentation.) THP may be enabled by default in some Linux distributions.
Having THP enabled can result in poor AllegroGraph performance, particularly for large databases. The following command disables THP on the machine (the change does not persist across reboots, so the command must be run each time the machine boots):
echo never > /sys/kernel/mm/transparent_hugepage/enabled
That command requires root privileges.
You can usually make the change persistent by editing a system startup file. On Red Hat Enterprise Linux 6, for example, edit the file /etc/rc.local, adding the line above (echo never > ...) to it. Other Red Hat releases and other versions of Linux likely have similar means of disabling THP. Check the OS documentation for further information.
If THP is enabled, AllegroGraph logs a warning into agraph.log on startup.
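A deployment script can check the current THP mode before the server is started. The sketch below is only an illustration and is not part of AllegroGraph: the helper names parse_thp_mode and current_thp_mode are hypothetical, and it assumes the usual Linux sysfs convention in which the active mode is the bracketed token, as in "always madvise [never]".

```python
import re
from pathlib import Path
from typing import Optional

# Path taken from the echo command shown above.
THP_ENABLED_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> Optional[str]:
    """Return the active THP mode, i.e. the token wrapped in [brackets]."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def current_thp_mode(path: Path = THP_ENABLED_PATH) -> Optional[str]:
    """Read the sysfs file; return None if it is missing or unreadable."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    mode = current_thp_mode()
    if mode is None:
        print("THP status unavailable on this system")
    elif mode == "never":
        print("THP is disabled")
    else:
        print(f"THP mode is '{mode}'; consider disabling it")
```

Reading the file does not require root privileges; only writing to it does.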
Adjust the Backends Setting
The Backends setting in agraph.cfg determines the maximum number of (non-dedicated) back-ends that the server may spawn. On a machine with many CPU cores, heavy concurrent access can be faster when more back-ends are allowed to be spawned.
As a rule of thumb, set the Backends setting to the number of CPU cores on your machine. agraph.cfg and configuration settings are discussed in the Server Configuration and Control document.
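As a rough sketch of that rule of thumb, the following helper renders a Backends line from the local CPU count. The function name backends_directive is hypothetical, and the "Name value" line format is assumed from the other directives shown in this guide. Note that os.cpu_count() reports logical CPUs, which can exceed the number of physical cores on machines with hyperthreading, so treat the result as a starting point.

```python
import os
from typing import Optional


def backends_directive(cores: Optional[int] = None) -> str:
    """Render a 'Backends N' line, defaulting N to the local CPU count."""
    if cores is None:
        # os.cpu_count() may return None; fall back to 1 in that case.
        cores = os.cpu_count() or 1
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return f"Backends {cores}"


if __name__ == "__main__":
    # Print a line that could be pasted into agraph.cfg after review.
    print(backends_directive())
```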
Use Dedicated Sessions
When using non-dedicated back ends, all communications between the client and back ends are funneled through the AllegroGraph service daemon, adding overhead. To maximize client/server communication speed, clients should use dedicated sessions instead.
However, be aware that each dedicated back end uses resources on the server. Dedicated back ends increase performance up to an optimum, after which resource issues begin to degrade performance again.
Catalog Parameters that Improve Performance
You can design catalogs in agraph.cfg to improve disk-access performance. Such a catalog description might resemble this one (the values shown are to illustrate the appearance of the file contents and may not be appropriate for a particular catalog):
<Catalog fast>
ExpectedStoreSize 2000000
CheckpointInterval 10m
Main /var/lib/ag4/fast
TransactionLogDir /mnt/disk2/ag4/fast
StringTableDir /mnt/disk3/ag4/fast
StringTableSize 128m
</Catalog>
Catalog parameters and directives are discussed in the Server Configuration and Control document. In the descriptions below, each parameter is linked to its description in that document.
ExpectedStoreSize
Load performance can be improved by using the ExpectedStoreSize catalog parameter. The value should be the maximum number of triples you expect to add to the triple store. During normal operation, AllegroGraph may resize data structures as triples are added to the triple store. Using the ExpectedStoreSize setting allows AllegroGraph to pre-size certain data structures, reducing the number of resizes required and thereby improving overall load performance.
CheckpointInterval
Load performance can be improved by setting the CheckpointInterval catalog parameter. By default, checkpoints occur every 5 minutes. While a checkpoint is operating, commits are blocked. On a large database or on a system with slow disks, checkpoints can take several tens of seconds to complete. Setting the CheckpointInterval to a longer interval reduces how often checkpoints occur, thereby reducing the impact on commits.
Note that increasing the CheckpointInterval may increase the amount of time it takes to recover after an unclean database shutdown.
Unused space in the datafile is not made available for reuse until a checkpoint completes, so increasing the time between checkpoints also increases the potential for datafile growth.
TransactionLogDir
Load performance can be improved by setting TransactionLogDir to a directory on a filesystem that is physically separate from the filesystems holding the other database directories. This separates the writes to the transaction log file, which occur at every commit, from the write activity of the merger processes, which operate in the background.
StringTableDir
By using the StringTableDir directive, it is possible to locate the string table files on a separate filesystem so that accesses to the string tables do not interfere with the write activity of mergers.
StringTableSize
The StringTableSize directive allows specification of the size of the string table, which determines the minimum number of slots to use for the hash table used to map UPIs to their corresponding strings. Increasing the number of slots may result in better insert and lookup performance for triple stores with a lot of unique strings. The increase in performance comes with the following costs:
Increased memory use when the database is open. Each slot requires 4 bytes of memory.
Longer checkpoints. The information stored in the slots is recorded in the transaction log during checkpoints.
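The memory cost can be estimated with simple arithmetic from the 4 bytes per slot stated above. The sketch below is illustrative only: the function names are hypothetical, and reading a value such as 128m as 128 * 2**20 slots is an assumption (check the Server Configuration document for the exact meaning of the suffix). Because the setting determines the minimum number of slots, the result is a lower bound.

```python
BYTES_PER_SLOT = 4  # per-slot memory cost stated in the text above


def slot_memory_bytes(slots: int) -> int:
    """Lower-bound memory, in bytes, used by the hash table slots."""
    if slots < 0:
        raise ValueError("slots must be non-negative")
    return slots * BYTES_PER_SLOT


def format_mib(num_bytes: int) -> str:
    """Format a byte count as mebibytes with one decimal place."""
    return f"{num_bytes / 2**20:.1f} MiB"


if __name__ == "__main__":
    # Assumption: the 'm' suffix means 2**20, so 128m is 134,217,728 slots.
    slots = 128 * 2**20
    print(format_mib(slot_memory_bytes(slots)))  # prints "512.0 MiB"
```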
Catalog Parameters that Trade Performance for Space
StringTableCompression
The StringTableCompression parameter specifies whether string tables should be compressed. Compressing a string table will usually have a significant effect on the space used by the string table, but at the cost of slower access (since strings have to be uncompressed in order to be used). This is a catalog parameter that is inheritable (meaning it can be specified as a top-level parameter which will apply to all catalogs except those which have a different value specified in their definitions).
The parameter value provides the default for new repositories in a catalog. This value can be overridden when creating a new repository with agtool load (see the --parameters option), using the REST interface (with the stringTableCompression parameter to the repository creation HTTP service), or using the :params keyword argument to create-triple-store.
If StringTableCompression is unspecified in a catalog specification (whether directly or through inheriting from the top-level parameter), it defaults to none, meaning do not compress the string table. For other possible values, see the parameter description in the Server Configuration and Control document.
Once a repository is created, its string table compression method is set and cannot be changed. To have a repository with the same contents but a different string table compression method, you must write out the triples or quads (perhaps using Repository Export) and read them back (perhaps using agtool load) into a new repository whose StringTableCompression setting is what is desired. (Backing up and restoring will not work for this purpose. The restored repository will have the same string table compression method as the backed up repository.)
Minimize non-encoded typed literals
If a triple store contains non-encoded typed literals, then AllegroGraph's SPARQL engine cannot blithely translate numeric range queries such as
SELECT * {
?s ex:foo ?o .
FILTER (?o > 10)
}
into fast range queries over AllegroGraph's encoded datatypes because typed literals would be missed. However, AllegroGraph's default behavior is to encode numeric and other literals automatically, which means that it is almost always appropriate to trust the encoded datatypes when querying. In rare circumstances (e.g., if you have turned off type-mapping) where non-encoded typed numeric (or date/dateTime) literals are in the store, you may need to tell AllegroGraph not to trust the encoded datatypes in order to get correct results from range filters.
You can force AllegroGraph to make these additional queries by setting the trustEncodedDatatypesForRangeQueries PREFIX option to no.
This can be accomplished either in the AllegroGraph configuration file using
QueryOption trustEncodedDatatypesForRangeQueries=no
or on a per-query basis using the PREFIX notation as in
PREFIX franzOption_trustEncodedDatatypesForRangeQueries: <franz:no>
Generally speaking, this setting should not be needed, and because it reduces performance it should be used only when range filters would otherwise miss results.
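For the per-query form, client code can prepend the PREFIX line to an existing query string. The following is a minimal sketch, not an AllegroGraph API: the function name distrust_encoded_datatypes and the ex: namespace IRI in the usage example are hypothetical. It relies only on the fact that a SPARQL prologue may contain several PREFIX declarations.

```python
# Per-query option line, copied from the PREFIX notation shown above.
OPTION_PREFIX = (
    "PREFIX franzOption_trustEncodedDatatypesForRangeQueries: <franz:no>"
)


def distrust_encoded_datatypes(query: str) -> str:
    """Prepend the option line unless the query already sets this option."""
    if "franzOption_trustEncodedDatatypesForRangeQueries" in query:
        return query
    return f"{OPTION_PREFIX}\n{query}"


if __name__ == "__main__":
    # The ex: namespace below is a placeholder for illustration only.
    query = """PREFIX ex: <http://example.org/>
SELECT * {
  ?s ex:foo ?o .
  FILTER (?o > 10)
}"""
    print(distrust_encoded_datatypes(query))
```

Applying the option per query keeps the performance cost limited to the queries that actually need it, rather than setting it for the whole server.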