Table of Contents

AllegroGraph 4.7 Performance Tuning

Hardware Selection

Adjust the Backends Setting

Use Dedicated Sessions

Catalog Parameters that Improve Performance

ExpectedStoreSize

CheckpointInterval

TransactionLogDir

StringTableDir

Minimize non-encoded typed literals

Use plain encoding for strings

Tell the SPARQL engine to ignore typed literals

AllegroGraph 4.7 Performance Tuning

This guide explains how to use some of AllegroGraph's tunable parameters to maximize database performance. These settings are applied in the AllegroGraph server configuration file (agraph.cfg) and in the individual database parameter files (parameters.dat). All parameter settings are documented on the Server Configuration page.

Hardware Selection

If you plan to work with billions of RDF triples, we encourage you to purchase as much hardware as your budget permits. We can offer purchase references ([email protected]) for hardware systems containing multiple CPU cores with 64 GB of RAM for ~$7,000 or less.

An ideal system for 2-3 billion triples would include 8 CPU cores, 64 GB of RAM, and fast hard drives (15K RPM, arranged in RAID arrays employing striping). This type of system will optimize query performance as well as providing rapid ingest speeds. We also work with smaller hardware in our development and release testing and get very satisfactory results.

Your results will vary based on factors associated with your application, such as the number and size of unique strings found in the triples, etc.

Please contact us if you have any questions related to hardware selection - [email protected]

Adjust the Backends Setting

The Backends setting in agraph.cfg determines the maximum number of (non-dedicated) back-ends that the server may spawn. On a machine with many CPU cores, heavy concurrent access can be faster when more back-ends are allowed to be spawned.

As a rule of thumb, set the Backends setting to the number of CPU cores on your machine.
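
As a rough illustration of this rule of thumb, the Python sketch below reads the core count of the current machine and prints a matching line for agraph.cfg. The helper name `backends_line` is hypothetical, and the `Backends <n>` line syntax is an assumption based on the `Name value` form of the catalog directives shown in this guide.

```python
import os


def backends_line(cores=None):
    """Return a 'Backends <n>' line sized to the number of CPU cores.

    Hypothetical helper: the directive syntax is assumed, not taken
    from the Server Configuration page.
    """
    if cores is None:
        # os.cpu_count() returns None when the count cannot be determined.
        cores = os.cpu_count() or 1
    if cores < 1:
        raise ValueError("core count must be at least 1")
    return f"Backends {cores}"


if __name__ == "__main__":
    print(backends_line())
```

Run it on the server itself, since the count reported is that of the machine executing the script.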

Use Dedicated Sessions

When using non-dedicated back ends, all communications between the client and back ends are funneled through the AllegroGraph service daemon, adding overhead. To maximize client/server communication speed, clients should use dedicated sessions instead.

However, be aware that each dedicated back end uses resources on the server. Dedicated back ends increase performance up to an optimum, after which resource issues begin to degrade performance again.

Catalog Parameters that Improve Performance

You can design catalogs in agraph.cfg to improve disk-access performance. Such a catalog description might resemble this one:

<Catalog fast>  
  ExpectedStoreSize 2000000  
  CheckpointInterval 1h     
  Main /var/lib/ag4/fast  
  TransactionLogDir /mnt/disk2/ag4/fast   
  StringTableDir /mnt/disk3/ag4/fast  
</Catalog> 
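
If you maintain several catalogs, a block like the one above could be generated from a script rather than written by hand. The following Python sketch is one possible way to do that; `render_catalog` and its argument names are hypothetical, and only the directive names come from this guide.

```python
def render_catalog(name, main, expected_store_size=None,
                   checkpoint_interval=None, transaction_log_dir=None,
                   string_table_dir=None):
    """Render a <Catalog> block; optional directives are omitted when None."""
    directives = [
        ("ExpectedStoreSize", expected_store_size),
        ("CheckpointInterval", checkpoint_interval),
        ("Main", main),
        ("TransactionLogDir", transaction_log_dir),
        ("StringTableDir", string_table_dir),
    ]
    lines = [f"<Catalog {name}>"]
    for key, value in directives:
        if value is not None:
            # Two-space indent, matching the example in this guide.
            lines.append(f"  {key} {value}")
    lines.append("</Catalog>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_catalog(
        "fast",
        main="/var/lib/ag4/fast",
        expected_store_size=2000000,
        checkpoint_interval="1h",
        transaction_log_dir="/mnt/disk2/ag4/fast",
        string_table_dir="/mnt/disk3/ag4/fast",
    ))
```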

ExpectedStoreSize

Load performance can be improved by using the ExpectedStoreSize catalog parameter. The value should be the maximum number of triples you expect to add to the triple store. During normal operation, AllegroGraph may resize data structures as triples are added to the triple store. Using the ExpectedStoreSize setting allows AllegroGraph to pre-size certain data structures, reducing the number of resizes required and thereby improving overall load performance.

CheckpointInterval

Load performance can be improved by setting the CheckpointInterval catalog parameter. By default, checkpoints occur every 5 minutes. While a checkpoint is operating, commits are blocked. On a large database, checkpoints can take several tens of seconds to complete. Setting the CheckpointInterval to a longer interval reduces how often checkpoints occur, thereby reducing the impact on commits.

Note that increasing the CheckpointInterval may increase the amount of time it takes to recover after an unclean database shutdown.

Note that certain background database operations trigger checkpoints regardless of the CheckpointInterval setting. In effect, the CheckpointInterval setting places an upper bound on the amount of time that may elapse before a checkpoint occurs.
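
To get a feel for the trade-off described above, the Python sketch below estimates how long commits are blocked by interval-driven checkpoints during a load. The 2-hour load and the 30 seconds per checkpoint are illustrative assumptions, not measurements, and the function names are hypothetical. The sketch also assumes that a checkpoint takes the same time regardless of the interval, which may not hold in practice.

```python
def scheduled_checkpoints(load_seconds, interval_seconds):
    """Number of interval-driven checkpoints during a load of the given length."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return load_seconds // interval_seconds


def blocked_seconds(load_seconds, interval_seconds, checkpoint_seconds):
    """Estimated time commits are blocked by interval-driven checkpoints."""
    return scheduled_checkpoints(load_seconds, interval_seconds) * checkpoint_seconds


if __name__ == "__main__":
    load = 2 * 60 * 60   # assumed: a 2-hour load
    duration = 30        # assumed: 30 s per checkpoint
    for label, interval in (("5m", 5 * 60), ("1h", 60 * 60)):
        count = scheduled_checkpoints(load, interval)
        print(label, count, blocked_seconds(load, interval, duration))
```

Under these assumptions, the default 5-minute interval gives 24 checkpoints and about 720 seconds of blocked commits, while a 1-hour interval gives 2 checkpoints and about 60 seconds. Because background operations can trigger additional checkpoints, treat these counts as lower bounds.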

TransactionLogDir

Load performance can be improved by pointing TransactionLogDir at a directory on a filesystem that is physically separate from the filesystems holding the other database directories. This separates the writes to the transaction log file, which occur at every commit, from the write activity of the merger processes, which operate in the background.

StringTableDir

By using the StringTableDir directive, it is possible to locate the string table files on a separate filesystem so that accesses to the string tables do not interfere with the write activity of mergers.
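
Before relying on such a layout, it can be worth checking that the directories really are on different filesystems. The Python sketch below compares the device IDs reported by `os.stat`; the function name is hypothetical. A distinct device ID means a distinct mounted filesystem, which does not by itself guarantee a physically separate disk (two partitions of one drive also differ), so treat the result as a necessary check rather than a sufficient one.

```python
import os
from collections import defaultdict


def directories_sharing_a_device(dirs):
    """Group directive names whose directories live on the same device.

    `dirs` maps a directive name (for example "Main") to an existing
    directory path. Returns a sorted list of sorted name lists, one per
    device used by more than one directory. An empty list means every
    directory is on its own device.
    """
    by_device = defaultdict(list)
    for name, path in dirs.items():
        by_device[os.stat(path).st_dev].append(name)
    return sorted(sorted(names) for names in by_device.values() if len(names) > 1)


if __name__ == "__main__":
    # Paths taken from the example catalog; adjust to your own layout.
    layout = {
        "Main": "/var/lib/ag4/fast",
        "TransactionLogDir": "/mnt/disk2/ag4/fast",
        "StringTableDir": "/mnt/disk3/ag4/fast",
    }
    existing = {k: v for k, v in layout.items() if os.path.isdir(v)}
    print(directories_sharing_a_device(existing))
```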

Minimize non-encoded typed literals

In order to generate results in full compliance with the SPARQL specification, AllegroGraph's SPARQL engine cannot simply translate numeric range queries such as

SELECT * {  
  ?s ex:foo ?o .  
  FILTER (?o > 10)  
} 

into solely fast range queries over AllegroGraph's encoded numeric datatypes. It must also laboriously examine the contents of the typed literal portion of the indices. This is expensive, and becomes increasingly so as the number of non-encoded typed literals in the store grows.

Fortunately, the default behavior of AllegroGraph 4 is to encode numeric and other literals automatically, which greatly reduces the number of typed literals interned in the store. However, you might still be adding a large number of typed literals which have no direct mapping to encoded types: xsd:string, for example, or custom datatypes.

Two actions mitigate this cost. The first reduces the number of non-encoded typed literals; the second avoids querying them. We recommend using both.

Use plain encoding for strings

RDF makes an unhelpful distinction between "Hello" and "Hello"^^xsd:string. In SPARQL queries, AllegroGraph treats these two values as identical, but storing the additional type information is both expensive and redundant. We suggest you set a plain datatype mapping for xsd:string. This will cause xsd:strings added to the database to be stored as plain RDF literals.

In the Lisp client, you can evaluate

(setf (datatype-mapping (resource "string" "xsd")) :plain) 

Alternatively, using the HTTP repository interface (assuming your repository is named "test"), send:

PUT /repositories/test/mapping/type?type=%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23string%3E&encoding=%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23string%3E HTTP/1.1 

There are equivalent operations in the Java and Python clients.
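
The percent-encoded request line shown above is tedious to write by hand. As a minimal sketch, the Python standard library can build it from the unencoded datatype URI. The helper name `type_mapping_path` is hypothetical; the path and the `type` and `encoding` parameters are copied from the HTTP example in this section, and the repository name is assumed to need no escaping.

```python
from urllib.parse import urlencode

# The xsd:string datatype URI, in the angle-bracket form used by the request.
XSD_STRING = "<http://www.w3.org/2001/XMLSchema#string>"


def type_mapping_path(repository, datatype, encoding):
    """Build the request path for a datatype mapping PUT."""
    # urlencode percent-encodes '<', ':', '/', '#', and '>' in the values.
    query = urlencode({"type": datatype, "encoding": encoding})
    return f"/repositories/{repository}/mapping/type?{query}"


if __name__ == "__main__":
    print("PUT", type_mapping_path("test", XSD_STRING, XSD_STRING), "HTTP/1.1")
```

The script only prints the request line; sending it requires an HTTP client and the address and credentials of your own server.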

Tell the SPARQL engine to ignore typed literals

If you know that all of your numeric and date/dateTime values will be encoded (which by default is true), you can force the SPARQL engine to avoid querying those values entirely.

In your init file (described in the HTTP protocol guide), add the line

(setf sparql.algebra::*sparql-assumes-range-queries-will-suffice-p* t) 

This change applies globally to the server. Ensure that you restart any dedicated backends if you apply this change via the HTTP interface.