SPARQL Exploration in AllegroGraph
This document explores how AllegroGraph's SPARQL engine processes queries and describes steps you can take to understand and optimize them.
It assumes you know how to start AllegroGraph server and run AllegroGraph in the Lisp enviroment. See the installation guide and the Lisp Quick-Start guide for more details.
We'll start by running Allegro Common Lisp, loading AllegroGraph and setting up a good working environment:
shell> mlisp
...
lisp: (require :agraph) ;load AllegroGraph
lisp: (enable-!-reader) ;turn on the !-reader
lisp: (in-package :db.agraph.user) ;switch to the AllegroGraph package
lisp: (enable-print-decoded t) ;print UPIs and triples readably
Once this is done, we can get started by creating a new triple-store:
lisp: (create-triple-store "ag-test")
#<db.agraph::triple-db ag-test, 0, open @ #x1008aa39e2>
For the purposes of this example, let’s load some triples from the web and try a query.
triple-store-user(6): (load-ntriples
"http://www.w3.org/2000/10/rdf-tests/rdfcore/ntriples/test.nt")
30
{G}
triple-store-user(7): (sparql:run-sparql "
SELECT DISTINCT ?o {
?s <http://example.org/property> ?o .
FILTER (isIRI(?o))
}"
:results-format :sparql-json)
{
"head" : {
"vars" : ["o"]
},
"results" : {
"bindings" : [
{
"o":{"type":"uri","value":"http://example.org/resource2"}
}
]
}
}
t
:select
(?o)
You can see four results printed in JavaScript Object Notation (JSON). Three additional values are also returned: t
(true, because the results have been printed), :select
(the SPARQL verb used in this query), and (?o)
, the list of variables selected by the query. If you need to programmatically process the output of a query, these values will be useful.
Seeing the final algebraic expression
There are several intermediate stages between your textual query and the results being generated. The first is a parse tree, which is translated into an algebraic expression in prefix notation. This algebraic expression is simplified, undergoes some optimizations, and is then used to construct cursors which ultimately build the result set.
The first step in your investigation, then, is to look at the optimized algebra expression which AllegroGraph executes. The algebra expression relates much more closely to the execution than does the SPARQL surface syntax, and so looking at the algebra expression gives more hints about the cost of a query.
There are two ways to do this. The most direct is to use the function sparql->algebra.
triple-store-user(8): (sparql.algebra:sparql->algebra "
SELECT DISTINCT ?o {
?s <http://example.org/property> ?o .
FILTER (isIRI(?o))
}")
(:filter (:bgp #(?s !<http://example.org/property> ?o))
(sparql.sop:funcall-sparql-uri !isIRI ?o))
:select
:distinct
(?o)
nil
triple-store-user(9):
The function returns a number of values: the algebra expression itself, and also some metadata about the query, such as whether DISTINCT
is used.
The expression that AllegroGraph will actually execute is:
(:filter
(:bgp
#(?s !<http://example.org/property> ?o))
(sparql.sop:funcall-sparql-uri !isIRI ?o))
This is a prefix expression; execution proceeds from the bottom up. In English, this form reads as "the Basic Graph Pattern (BGP) consisting of a single triple pattern (as shown), subject to a filter which executes the function named by isIRI
on the binding of ?o
".
This is executed much as it reads: a triple-pattern cursor is instantiated to evaluate the triple pattern, and this is wrapped in a filtered-cursor which executes the appropriate filter.
You can also explore which algebraic optimizations apply to a particular query.
sparql->algebra takes two relevant keyword arguments, :simplifyp
and :optimizep
; these are both true by default. specifying nil
will cause the appropriate phase to be skipped, showing you the pre-optimization form of a query.
You can see some of the optimizations that occur with a more complex query by comparison:
triple-store-user(9): (sparql.algebra:sparql->algebra "
PREFIX ex: <http://example.org/>
SELECT * {
?s ex:property ex:resource2 .
OPTIONAL {
?s2 ex:property 'chat'@fr .
}
?s3 ex:property 'chat'@en .
}" :optimizep nil)
(:join
(:left-join
(:bgp
#(?s !<http://example.org/property>
!<http://example.org/resource2>))
(:bgp #(?s2 !<http://example.org/property> !"chat"@fr)) t)
(:bgp #(?s3 !<http://example.org/property> !"chat"@en)))
...
triple-store-user(10): (sparql.algebra:sparql->algebra "
PREFIX ex: <http://example.org/>
SELECT * {
?s ex:property ex:resource2 .
OPTIONAL {
?s2 ex:property 'chat'@fr .
}
?s3 ex:property 'chat'@en .
}" :optimizep t)
(:left-join
(:bgp
#(?s !<http://example.org/property> !<http://example.org/resource2>)
#(?s3 !<http://example.org/property> !"chat"@en))
(:bgp #(?s2 !<http://example.org/property> !"chat"@fr)) t)
...
triple-store-user(11):
If you look closely, you can see that the outermost :join
has been collapsed inside the :left-join
, and then the resulting join of two BGPs has been simplified into a single BGP. The optimized version contains one :left-join
and two BGPs; the unoptimized version contains one :join
, one :left-join
, and three BGPs.
Similar rewriting occurs for Negation-As-Failure (the !bound()
idiom with which you are familiar), amongst others.
There is another way to see the optimized algebra expression during the actual execution of a query, using Common Lisp’s tracing features. Simply evaluate:
triple-store-user(11): (trace sparql.algebra::optimize-algebra)
(sparql.algebra::optimize-algebra)
and then run input number 7 again. Typing :7
will save you copying and pasting the earlier query; the REPL will echo back the "input" it is reusing.
triple-store-user(12): :7
(db.agraph.sparql:run-sparql "
SELECT DISTINCT ?o {
?s <http://example.org/property> ?o .
FILTER (isIRI(?o))
}" :results-format :sparql-json)
0[2]: (sparql.algebra::optimize-algebra
(:filter (:bgp #(?s !<http://example.org/property> ?o))
(sparql.sop:funcall-sparql-uri !isIRI ?o)))
0[2]: returned
(:filter (:bgp #(?s !<http://example.org/property> ?o))
(sparql.sop:funcall-sparql-uri !isIRI ?o))
{
"head" : {
"vars" : ["o"]
},
"results" : {
"bindings" : [
{
"o":{"type":"uri","value":"http://example.org/resource2"}
}
]
}
}
t
:select
(?o)
As you can see, the algebra expression was already quite simple; no optimization was needed.
This is a common approach to figuring out how a particular operation is executed during interactive Common Lisp development; any function in the system can be traced at runtime to observe its inputs and outputs. We needn’t limit ourselves to looking at the algebra expression. For example, we can observe the behavior of the SPARQL isIRI
predicate
1 , perhaps to deduce why some particular result isn’t being returned:
triple-store-user(13): (enable-print-decoded t) ; To print UPIs nicely.
t
triple-store-user(14): (untrace)
nil
triple-store-user(15): (trace sparql.sop:isIRI)
(sparql.sop:isIRI)
triple-store-user(16): :7
(db.agraph.sparql:run-sparql "
SELECT DISTINCT ?o {
?s <http://example.org/property> ?o .
FILTER (isIRI(?o))
}" :results-format :sparql-json)
{
"head" : {
"vars" : ["o"]
},
"results" : {
"bindings" : [
0[2]: (sparql.sop:isIRI {resource2})
0[2]: returned t
0[2]: (sparql.sop:isIRI {resource2})
0[2]: returned t
0[2]: (sparql.sop:isIRI {_:anon2})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {resource2})
0[2]: returned t
0[2]: (sparql.sop:isIRI {resource2})
0[2]: returned t
0[2]: (sparql.sop:isIRI {resource2})
0[2]: returned t
0[2]: (sparql.sop:isIRI {resource2})
0[2]: returned t
0[2]: (sparql.sop:isIRI {simple literal})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {backslash:\})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {dquote:\"})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {newline:\n})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {return\r})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {tab:\t})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {resource2})
0[2]: returned t
0[2]: (sparql.sop:isIRI {x})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {_:anon2})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {é})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {€})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {})
0[2]: returned nil
0[2]: (sparql.sop:isIRI { })
0[2]: returned nil
0[2]: (sparql.sop:isIRI {x})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {\"})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {<a></a>})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {a <b></b>})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {a <b></b> c})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {a\n<b></b>\nc})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {chat})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {chat})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {chat})
0[2]: returned nil
0[2]: (sparql.sop:isIRI {abc})
0[2]: returned nil
{
"o":{"type":"uri","value":"http://example.org/resource2"}
}
]
}
}
t
:select
(?o)
As you can see, isIRI
returns nil
(false) for non-resource UPIs, as expected. (UPIs are AllegroGraph’s internal representation for the parts of a triple: literals, blank nodes, and resources.)
What to watch for in algebraic expressions
Generally, left-joins where the right-hand side is very complex (containing nested FILTER
s and OPTIONAL
s) cannot be executed with shared bindings, and will be expensive. (You can see this for certain by looking at the cursors; see the next section.)
In general, "front-heavy" queries are preferred to "tail-heavy" queries: the iterative joins used for SPARQL queries cause the right-hand side of joins to be executed to completion for each step of the left-hand side. If the right-hand side is very expensive (e.g., traversing an entire type hierarchy), or the left-hand side will produce very many results, then it is probably worthwhile to rearrange your query such that the left-hand side produces fewer results (e.g., by tying a FILTER
closely to those patterns), or the right-hand side is executed elsewhere in the query (or even in a second query).
If you are attempting to use SPARQL’s idiom for negation (using OPTIONAL
and FILTER
) you should ensure that the algebra optimizer has recognized the idiom and converted it to a :not
expression. For example, with the optimization disabled:
triple-store-user(18):(let ((sparql.algebra::*rewrite-naf-p* nil))
(sparql.algebra:sparql->algebra "
SELECT * WHERE {
?s ?p ?o .
OPTIONAL {
?s <u:foo> ?baz
}
FILTER (!bound(?baz))
}"))
(:filter
(:left-join
(:bgp #(?s ?p ?o))
(:bgp #(?s !u:foo ?baz)) t)
(funcall-sparql-uri !fn:not (bound ?baz)))
as opposed to the behavior with the optimization enabled:
triple-store-user(17): (sparql.algebra:sparql->algebra "
SELECT * WHERE {
?s ?p ?o .
OPTIONAL {
?s <u:foo> ?baz
}
FILTER (!bound(?baz))
}")
(:join
(:bgp #(?s ?p ?o))
(:not
(:bgp #(?s !u:foo ?baz))))
This can significantly boost performance when the OPTIONAL would generate lots of intermediate results, as well as reducing the number of algebra expressions that need to be executed.
Cursors
Algebra expressions are interpreted to produce a bindings-cursor, which lazily computes result rows.
You can get direct access to these cursors by using the :cursor
results-format. Observing the printed representation of the cursor can be very informative, and stepping across the cursor one row at a time provides an opportunity for measurement and profiling.
Let’s try it.
triple-store-user(23): (sparql:run-sparql "
PREFIX ex: <http://example.org/>
SELECT * {
?s ex:property ex:resource2 .
OPTIONAL {
?s2 ex:property 'chat'@fr .
}
?s3 ex:property 'chat'@en .
}" :results-format :cursor)
#<shared-left-join:
#<shared-join:
#<triple: ?s {property} {resource2} nil>
#<triple: ?s3 {property} !"chat"@en nil>>
with #<triple: ?s2 {property} !"chat"@fr nil>>
:select
(?s ?s2 ?s3)
You can see that this is executed much as you would expect from the algebra expression: two triple cursors are joined, and then a left-join merges those with a third triple pattern.
Of note is the word "shared": both joins share bindings. This means that during iterative execution bindings generated by the left-hand side are propagated to the right-hand side, being used to narrow down the solution space. In this query there are no variables shared across these joins; in most queries there is some overlap.
Shared joins are almost always possible, but sometimes such sharing is impossible given the semantics of SPARQL (which specify bottom-up evaluation). Their absence will make a left-join much more expensive: the right-hand side must run without any prior bindings, which usually leads to a much larger intermediate result set.
If you find that your query contains a non-shared left-join, you probably have a nested OPTIONAL
in your query that itself contains a FILTER
which makes reference to a variable in an enclosing scope. If performance is unacceptable, you must rewrite your query to avoid the back-reference.
You can grab the first result from this cursor, timing if you wish. "*" in the REPL means "the last result".
triple-store-user(24): (time (sparql.algebra::yield-bindings *))
; cpu time (non-gc) 0.000000 sec user, 0.000000 sec system
; cpu time (gc) 0.000000 sec user, 0.000000 sec system
; cpu time (total) 0.000000 sec user, 0.000000 sec system
; real time 0.000981 sec
; space allocation:
; 6 cons cells, 3,072 other bytes, 0 static bytes
#({resource1} {resource30} {resource31})
If you’re noticing slow query response, it might be worthwhile applying conventional Common Lisp profiling techniques at this point, as described in the Runtime Analyzer documentation.
This is not the first port of call for query optimization, because the profiler does not operate at the same level of abstraction as your query: knowing the the query executor spends most of its time in cursor operations is not useful for the purposes of optimizing the query. However, looking through the profiler output can provide clues: for example, a significant portion of the runtime spent in a SPARQL FILTER
function (e.g., sparql.sop:...) might suggest that you relocate that FILTER
; lots of time spent in array operations indicates join activity, as opposed to time spent reading octets from the disk (which suggests that index usage is more significant).
Finally, using the time
macro (as above) or the profiler can indicate how much time is spent actually doing work (CPU time) versus waiting for IO (real time minus CPU time). If a large percentage of runtime is spent waiting for IO, you might consider making more RAM available, or rearranging your query to cause different IO behavior.
As with most performance optimization of complex systems, general guidelines are hard to formulate. These tools can provide clues and raw data, but interpreting the results and deciding on a course of action is not always easy. Experimentation is always worthwhile. Feel free to contact support for assistance.
Execution optimizations to watch for
Most optimizations performed by the query engine are desirable most of the time. However, because many are encapsulated heuristics, it is sometimes helpful to be able to observe their operation, and force AllegroGraph to avoid their use.
The most important such heuristic is the inlining of constraints. This process entails a FILTER
being eliminated, and the appropriate constraint (such as a numeric range limitation, or a regular expression check) being applied directly to a triple cursor.
In most cases, this reduces the number of intermediate results being generated and avoids a SPARQL function invocation, speeding up the query. In a small number of queries, the number of intermediate results is naturally reduced by other triple patterns, making it more efficient to use a traditional "filter-last" approach.
You can determine if inlining has occurred by examining the cursor output. Certain FILTER
constraints will disappear in the cursor output — i.e., you see a :filter
in the algebra expression, but no filtered-cursor is produced — which indicates that these have been inlined into a triple-pattern cursor, either as a range query or as a direct filter, or that they have been attached to a left-join (which includes a filtering stage).
The most straightforward way to defeat the SPARQL engine’s inlining is to add an intermediate function to the FILTER
. For example, turning
FILTER (regex(?s, "foo"))
into
PREFIX xs: <http://www.w3.org/2001/XMLSchema#>
...
FILTER (regex(xs:string(?s), "foo"))
(In the future the regex constraint extractor might elide SPARQL’s built-in str() function, which is why this example explicitly uses the XML Schema constructor function.)
The output changes from:
triple-store-user(25): (sparql:run-sparql "
SELECT ?s ?o {
?s <http://example/p> ?o .
FILTER (regex(?o, 'activity.*'))
}"
:results-format :cursor)
#<triple: ?s {p} ?o nil>
:select
(?s ?o)
to
triple-store-user(26): (sparql:run-sparql "
PREFIX xs: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?o {
?s <http://example/p> ?o .
FILTER (regex(xs:string(?o), 'activity.*'))
}"
:results-format :cursor)
#<filter-cursor#<triple: ?s {p} ?o nil>>
:select
(?s ?o)
You can see that the former is a single triple cursor, whilst the latter adds a filter-cursor to implement the regex filtering.
In this simple example the former is always better (indeed, in some circumstances the regex can be translated into operations on the free-text index), but if the FILTER
encloses a more complex pattern then undoing this optimization could be worthwhile.
A similar constructor-based approach can be used for range queries.
Footnotes
- isIRI is a standard SPARQL function, defined in http://www.w3.org/TR/rdf-sparql-query/#func-isIRI. It returns true for URIs (and IRIs, their internationalized variants), and false otherwise. ↩