As in the other tutorials, you should read this while running an interactive Lisp session. The forms should be evaluated one after another from the top of the tutorial to the bottom.
Freetext indices provide a way to quickly retrieve triples containing a given word, allowing queries similar to those understood by Web search engines.
Setting Up
We start by creating a triple store with a text index:
> (create-triple-store "fti")
> (create-freetext-index "index1")
And add a few triples:
> (enable-!-reader)
> (register-namespace "ex" "http://www.franz.com/example#")
> (add-triple !ex:Nietzsche !ex:quote
              !"We have art in order not to die of the truth.")
> (add-triple !ex:Picasso !ex:quote
              !"Art is the lie that enables us to realize the truth.")
> (add-triple !ex:Wilde !ex:quote !"All art is quite useless.")
Basic Queries
The words in the object of each of these triples are indexed. We can now use freetext-get-triples to query them. This function returns a cursor, from which we collect just the subjects:
> (collect-cursor (freetext-get-triples "art") :transform 'subject)
({Nietzsche} {Picasso} {Wilde})
Here, the query is simply "art". The index is case-insensitive, so the triple where the word 'art' is capitalized is also found, and a query of "ART" would have returned the same results.
More complicated queries can be made by combining queries with boolean operators:
> (collect-cursor (freetext-get-triples '(and "art" "truth"))
                  :transform 'subject)
({Nietzsche} {Picasso})
> (collect-cursor (freetext-get-triples '(or "useless" "die"))
                  :transform 'subject)
({Nietzsche} {Wilde})
To match an exact phrase, a query can use the phrase operator, as in:
> (collect-cursor (freetext-get-triples '(phrase "art is"))
                  :transform 'subject)
({Picasso} {Wilde})
Note that the 'and' and 'or' operators allow any kind of query to appear as their arguments, so (and (phrase "art is") "truth") is a valid query.
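To make the nesting rules concrete, here is a toy evaluator in plain Common Lisp that applies and, or, and phrase queries to ordinary strings. It is only a sketch of the semantics described above, not a description of how AllegroGraph evaluates queries; the names tokenize, matches-p, *quotes*, and matching-names are hypothetical helpers invented for this illustration.

```lisp
;; Toy model only: none of these names are part of the AllegroGraph API.

(defun tokenize (text)
  "Split TEXT into lowercase words, treating non-alphanumerics as separators."
  (let ((words '()) (current '()))
    (flet ((flush ()
             (when current
               (push (coerce (reverse current) 'string) words)
               (setf current '()))))
      (loop for ch across text
            do (if (alphanumericp ch)
                   (push (char-downcase ch) current)
                   (flush)))
      (flush))
    (nreverse words)))

(defun matches-p (query text)
  "True if TEXT satisfies QUERY, which is a string or an and/or/phrase form."
  (let ((words (tokenize text)))
    (cond ((stringp query)
           (member (string-downcase query) words :test #'string=))
          ((string-equal (first query) "and")
           (every (lambda (q) (matches-p q text)) (rest query)))
          ((string-equal (first query) "or")
           (some (lambda (q) (matches-p q text)) (rest query)))
          ((string-equal (first query) "phrase")
           ;; A phrase matches when its words occur consecutively.
           (search (tokenize (second query)) words :test #'string=))
          (t (error "Unknown query operator: ~S" (first query))))))

(defparameter *quotes*
  '(("Nietzsche" . "We have art in order not to die of the truth.")
    ("Picasso" . "Art is the lie that enables us to realize the truth.")
    ("Wilde" . "All art is quite useless.")))

(defun matching-names (query)
  "Names from *QUOTES* whose quote satisfies QUERY."
  (loop for (name . text) in *quotes*
        when (matches-p query text) collect name))

(matching-names '(and (phrase "art is") "truth")) ; => ("Picasso")
```

Unlike a real index, this model ignores stop words and minimum word lengths, so it is useful only for reasoning about how the operators combine.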
Wild-card and Fuzzy Matching
Freetext indices support wild-card matching with the match operator:
> (collect-cursor (freetext-get-triples '(match "reali?e"))
                  :transform 'subject)
({Picasso})
The string given to match can contain question marks to match single characters, or asterisks to match any number of characters. Be aware that when an asterisk appears early in the string, as in "*tastic", it is impossible for AllegroGraph to make effective use of the text index, and the query will be slower.
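The pattern rules can be illustrated with a small matcher in plain Common Lisp. This is a sketch of the ? and * semantics only, applied to a single word; wildcard-match-p is a hypothetical name, and the code says nothing about how AllegroGraph searches its index.

```lisp
;; Illustration of the pattern semantics; not AllegroGraph code.
(defun wildcard-match-p (pattern word &optional (p 0) (w 0))
  "True if WORD matches PATTERN, where ? is one character and * is any run."
  (cond ((= p (length pattern))
         ;; Pattern exhausted: succeed only if the word is exhausted too.
         (= w (length word)))
        ((char= (char pattern p) #\*)
         ;; Let * absorb zero or more characters of WORD.
         (loop for k from w to (length word)
               thereis (wildcard-match-p pattern word (1+ p) k)))
        ((= w (length word)) nil)
        ((or (char= (char pattern p) #\?)
             (char-equal (char pattern p) (char word w)))
         (wildcard-match-p pattern word (1+ p) (1+ w)))
        (t nil)))

(wildcard-match-p "reali?e" "realize")   ; => T
(wildcard-match-p "reali?e" "realist")   ; => NIL
(wildcard-match-p "*tastic" "fantastic") ; => T
```

The comparison uses char-equal so that, like the index, the sketch ignores case.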
Another form of non-exact matching can be done with the fuzzy operator. It matches all words within a given Levenshtein distance (or edit distance) of the given term. The Levenshtein distance between two strings is the number of edits needed to change one into the other, where an edit is the insertion or deletion of a single character, or the replacement of one character with another. For example, (fuzzy "realise" 1) will match realise, realize, realist, and so on. If the distance argument is omitted, one fifth of the word length (rounded up) is used.
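For readers who want to check distances by hand, the following plain Common Lisp sketch computes the edit distance with the standard dynamic-programming method and applies the default-distance rule stated above. The function names levenshtein and default-fuzzy-distance are hypothetical, and nothing here is taken from AllegroGraph's implementation.

```lisp
;; Standard two-row dynamic programming; names are illustrative only.
(defun levenshtein (a b)
  "Edit distance between strings A and B (insert, delete, replace)."
  (let ((prev (make-array (1+ (length b)))))
    ;; Distance from the empty prefix of A to each prefix of B.
    (dotimes (j (1+ (length b))) (setf (aref prev j) j))
    (loop for i from 1 to (length a)
          do (let ((cur (make-array (1+ (length b)))))
               (setf (aref cur 0) i)
               (loop for j from 1 to (length b)
                     do (setf (aref cur j)
                              (min (1+ (aref prev j))        ; deletion
                                   (1+ (aref cur (1- j)))    ; insertion
                                   (+ (aref prev (1- j))     ; replacement
                                      (if (char= (char a (1- i))
                                                 (char b (1- j)))
                                          0 1)))))
               (setf prev cur)))
    (aref prev (length b))))

(defun default-fuzzy-distance (word)
  "One fifth of the word length, rounded up, as the tutorial describes."
  (values (ceiling (length word) 5)))

(levenshtein "realise" "realize")  ; => 1
(levenshtein "realise" "reality")  ; => 2
(default-fuzzy-distance "realise") ; => 2
```

The comparison here is case-sensitive, so downcase both strings first to mirror the case-insensitive index.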
Indexing By Predicate
The create-freetext-index function accepts many more arguments than just a name. For example, an index can be restricted to triples that have certain predicates:
> (create-freetext-index "name-index" :predicates (list !ex:name))
> (list-freetext-indices)
("name-index" "index1")
This new index will not index the triples with the !ex:quote predicate, but it will index these:
> (add-triple !ex:Nietzsche !ex:name !"Friedrich Wilhelm Nietzsche")
> (add-triple !ex:Picasso !ex:name !"Pablo Diego José Francisco de Paula
Juan Nepomuceno María de los Remedios Cipriano de la Santísima Trinidad
Ruiz y Picasso")
freetext-get-triples takes an :index keyword argument that restricts the search to a single index. If it is not given, all existing indices in the store are used.
> (count-cursor (freetext-get-triples "art"))
3
> (count-cursor (freetext-get-triples "art" :index "name-index"))
0
> (count-cursor (freetext-get-triples "Pablo" :index "name-index"))
1
Ignored Words
Several of our quotes contain the word "is", yet when we search for it, we get nothing:
> (collect-cursor (freetext-get-triples "is") :transform 'subject)
()
To prevent the index from containing useless information, by default all words shorter than three characters are ignored. On top of that, there is a stop-word list, the content of which is also ignored. By default this list contains a number of common English words. If we are interested in short and common words, we can create an index that indexes them:
> (create-freetext-index "index2" :min-word-size 2 :stop-words ())
> (collect-cursor (freetext-get-triples "is") :transform 'subject)
({Picasso} {Wilde})
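The effect of the two settings can be modeled in plain Common Lisp. The sketch below filters the words of a text by minimum length and by a stop-word list; split-words and indexable-words are hypothetical names, and the default stop-word list used here is a made-up stand-in, since the actual default list is not shown in this tutorial.

```lisp
;; Toy model of word filtering; not AllegroGraph code.
(defun split-words (text)
  "Split TEXT into lowercase runs of alphanumeric characters."
  (loop with start = nil
        with words = '()
        for i from 0 to (length text)
        for wordp = (and (< i (length text)) (alphanumericp (char text i)))
        do (cond ((and wordp (null start)) (setf start i))
                 ((and (not wordp) start)
                  (push (string-downcase (subseq text start i)) words)
                  (setf start nil)))
        finally (return (nreverse words))))

(defun indexable-words (text &key (min-word-size 3)
                                  ;; Assumed stand-in for the real stop list.
                                  (stop-words '("the" "that" "and" "not")))
  "Words of TEXT that pass the length and stop-word filters."
  (remove-if (lambda (word)
               (or (< (length word) min-word-size)
                   (member word stop-words :test #'string=)))
             (split-words text)))

(indexable-words "All art is quite useless.")
;; => ("all" "art" "quite" "useless")
(indexable-words "All art is quite useless." :min-word-size 2 :stop-words '())
;; => ("all" "art" "is" "quite" "useless")
```

With the default settings "is" is dropped for being too short; lowering the minimum size and emptying the stop list keeps it, which mirrors the behavior of "index2".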
Indexing Subject, Predicate, and Graph Fields
create-freetext-index also allows control over which parts of a triple are indexed. The :index-fields argument defaults to (:object), meaning only the object is indexed, but may contain any of :subject, :predicate, :object, and :graph. Our subjects and predicates contain resources, not literals, though, and by default those are not indexed. We can set :index-resources to T when creating an index to fix this. Another possible value is :short, which will cause only the part of the resource after the last / or # to be indexed.
> (create-freetext-index "sp-index"
                         :index-fields '(:subject :predicate)
                         :index-resources :short)
> (collect-triples (freetext-get-triples "Nietzsche" :index "sp-index"))
(<Nietzsche quote We have art in order not to die of the truth.>
<Nietzsche name Friedrich Wilhelm Nietzsche>)
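As a rough picture of what :short keeps from a resource, the following plain Common Lisp helper extracts the text after the last / or #. The name short-resource-name is hypothetical, and returning the whole string when neither character occurs is an assumption of this sketch rather than documented behavior.

```lisp
;; Illustration only; not AllegroGraph code.
(defun short-resource-name (uri)
  "Part of URI after the last / or #, or the whole URI if neither occurs."
  (let ((cut (position-if (lambda (ch) (member ch '(#\/ #\#)))
                          uri :from-end t)))
    (if cut (subseq uri (1+ cut)) uri)))

(short-resource-name "http://www.franz.com/example#Nietzsche") ; => "Nietzsche"
(short-resource-name "http://www.franz.com/example/quote")     ; => "quote"
```

This is why a search for "Nietzsche" in "sp-index" finds the triples whose subject is ex:Nietzsche: only the local part of the URI is treated as indexable text.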
A similar setting, :index-literals, controls the indexing of literals. It defaults to T, causing all literals to be indexed, but can be set to nil to index no literals, or to a list of resources to index only literals with the given types.
> (create-freetext-index "typed-literals"
                         :predicates (list !ex:test)
                         :index-literals (list !ex:indexme))
> (add-triple !ex:A !ex:test !"hello")
> (add-triple !ex:B !ex:test !"hello"^^ex:indexme)
> (collect-cursor (freetext-get-triples "hello" :index "typed-literals")
                  :transform 'subject)
({B})
Indices can be deleted at any time with drop-freetext-index:
> (drop-freetext-index "typed-literals")
> (list-freetext-indices)
("sp-index" "name-index" "index1")
Searching from SPARQL
The AllegroGraph SPARQL engine defines a 'magic' predicate, <http://franz.com/ns/allegrograph/2.2/textindex/match>, which can be used to generate bindings for the subjects of triples that match a given freetext query.
> (sparql:run-sparql
   "PREFIX fti: <http://franz.com/ns/allegrograph/2.2/textindex/>
    SELECT ?x WHERE { ?x fti:match 'remedios' }"
   :results-format :lists)
(({Picasso}))
:select
(?x)
This matches only ex:Picasso because the only triple in which "remedios" occurs is Picasso's name.
Textual Query Syntax
To express more complicated queries, the fti:match predicate understands a simple language in which words placed side by side are combined with 'and', and a pipe character means 'or'.
> (sparql:run-sparql
   "PREFIX fti: <http://franz.com/ns/allegrograph/2.2/textindex/>
    SELECT ?x WHERE { ?x fti:match '(art | truth) useless' }"
   :results-format :lists)
(({Wilde}))
:select
(?x)
Furthermore, double quotes around a piece of text can be used to express the phrase operator, which matches only triples that contain the whole phrase. Wild-card matching can be done simply by including question marks and asterisks in words, and fuzzy matching is done by appending a tilde (~) to a word, optionally followed by a maximum edit distance.
The function ag.text-index:parse-query transforms such a textual query into an S-expression.
> (ag.text-index:parse-query
   "allegro* \"freetext index\" (fuzzy | levenshtein~2)")
(:and (:match "allegro*") (:phrase "freetext index")
      (:or "fuzzy" (:fuzzy "levenshtein" 2)))