Lisp Freetext Indexing Tutorial

As in the other tutorials, you should read this while running an interactive Lisp session. The forms should be evaluated one after another from the top of the tutorial to the bottom.

Getting setup

First, we'll get AllegroGraph ready to run:

(require :agraph)                    ; load Agraph  
 
(in-package :triple-store-user)      ; go in the right package  
 
(enable-print-decoded t)             ; make printing look nicer  
 
(enable-!-reader)                    ; make typing easier  
 
(make-tutorial-store)                ; create a triple-store

Tell AllegroGraph the predicates that you want to index: comments and labels.

> (register-freetext-predicate  
    !<http://www.w3.org/2000/01/rdf-schema#comment>)  
t  
 
> (register-freetext-predicate  
    !<http://www.w3.org/2000/01/rdf-schema#label>)  
t

Make sure AllegroGraph understood:

> (freetext-registered-predicates)  
("http://www.w3.org/2000/01/rdf-schema#label"                                                      
 "http://www.w3.org/2000/01/rdf-schema#comment")

The basics: adding and querying

Now we add some triples:

> (add-triple !"Jans" !rdfs:comment  
              !"Born in Amsterdam in the Netherlands")  
1  
 
> (add-triple !"Gary" !rdfs:comment  
              !"Born in Springfield in the USA")  
2  
 
> (add-triple !"Steve" !rdfs:label  
              !"Born in New Amsterdam in the USA")  
3

Free-text indexing happens automatically as the triples are added. Here are some simple examples of using the free-text index:

; return all triple-ids that match "amsterdam"  
> (freetext-get-ids "amsterdam")        
(3 1)  
 
; a boolean expression  
> (freetext-get-ids '(and "amsterdam" "usa"))  
(3)  
 
; a phrase, note the quotation marks  
> (freetext-get-ids "\"in the USA\"")  
(3 2)
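To make the boolean queries above more concrete, here is a minimal sketch of how an and/or expression can be resolved against a word index. It is a toy model in plain Common Lisp, not AllegroGraph's implementation: the names *toy-index*, toy-index-text and toy-get-ids are hypothetical, words are split on spaces only, and phrases and wildcards are not handled.

```lisp
;; Toy model only: a hash table mapping a lowercase word to a list of triple ids.
(defparameter *toy-index* (make-hash-table :test #'equal))

(defun toy-index-text (id text)
  "Record every space-separated word of TEXT under triple id ID."
  (let ((start 0)
        (lowered (string-downcase text)))
    (loop
      (let* ((begin (position-if-not (lambda (c) (char= c #\Space))
                                     lowered :start start))
             (end (and begin (position #\Space lowered :start begin))))
        (unless begin (return))
        (pushnew id (gethash (subseq lowered begin end) *toy-index*))
        (unless end (return))
        (setf start end)))))

(defun toy-get-ids (query)
  "Evaluate a word, or an (and ...) / (or ...) expression, against the toy index."
  (cond ((stringp query)
         ;; copy so callers may sort the result without touching the index
         (copy-list (gethash (string-downcase query) *toy-index*)))
        ((string-equal (first query) "and")
         (reduce #'intersection (mapcar #'toy-get-ids (rest query))))
        ((string-equal (first query) "or")
         (reduce #'union (mapcar #'toy-get-ids (rest query))))
        (t (error "Unsupported query: ~s" query))))

;; The same three comments as in the tutorial, keyed by their triple ids.
(toy-index-text 1 "Born in Amsterdam in the Netherlands")
(toy-index-text 2 "Born in Springfield in the USA")
(toy-index-text 3 "Born in New Amsterdam in the USA")
```

With this toy index, (toy-get-ids '(and "amsterdam" "usa")) intersects the id lists for the two words, which leaves only triple 3, as in the real query above. The order of ids returned by an or query is not specified, so sort the result if order matters.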

We can also return a cursor:

> (freetext-get-triples '(and "usa" "born"))  ; return a cursor  
#<db.agraph::triple-id-list-cursor @ #x10034e1232>

We can use the cursor in the usual way. For example, we can bind it to the variable cursor:

> (setf cursor (freetext-get-triples '(and "usa" "born")))  
#<db.agraph::triple-id-list-cursor @ #x10ebac82>

Then we loop over the cursor with the handy iterate-cursor macro:

> (iterate-cursor (triple cursor)  
    (print triple))  
<2: "Gary" rdfs:comment "Born in Springfield in the USA">  
<3: "Steve" rdfs:label "Born in New Amsterdam in the USA">

Sometimes it is handy to get them in a list:

> (freetext-get-triples-list '(and "usa" "born"))  
(<triple 3: "Steve" rdfs:label Born in New Amsterdam in the USA default-graph>  
 <triple 2: "Gary" rdfs:comment Born in Springfield in the USA default-graph>)

And sometimes you only want the subjects back:

> (freetext-get-unique-subjects '(and "netherlands" "born"))  
({"Jans"})

A silly but more interesting example

We'll register our own namespace ex and use it to select the triples to index.

> (register-namespace "ex" "http://www.franz.com/simple#")  
"http://www.franz.com/simple#"

First we add some new triples to our open triple-store. Note that the object of each new triple is a long string filled with random numbers (in English). We're going to add the triples in a somewhat roundabout fashion:

first we'll create an N-Triples file of our data,
and then we'll use load-ntriples to load this file.

Here is the code:

(defun fill-dummy-ntriple-file-and-load-it (count)
  ;; Write COUNT subjects with 5 random rdfs:comment triples each to
  ;; sample.ntriples, then load and index the file.
  (let ((list '("one " "two " "three " "four "
                "five " "six " "seven " "eight " "nine " "ten ")))
    (with-open-file (out "sample.ntriples" :direction :output
                         :if-exists :supersede)
      (dotimes (i count)
        ;; a string literal gives the same subject in any reader case mode
        (let ((subject (string+ "<subject-" i "> ")))
          (dotimes (j 5)
            (let ((predicate "<http://www.w3.org/2000/01/rdf-schema#comment> ")
                  (object (apply 'triple-store::string+
                                 (let ((li nil))
                                   ;; between 1 and 8 random number words
                                   (dotimes (k (1+ (random 8)))
                                     (push (nth (random 10) list) li))
                                   li))))
              (format out "~a~a~s .~%" subject predicate object))))))
    (load-ntriples "sample.ntriples")
    (index-all-triples)))

Let's try it out:

> (fill-dummy-ntriple-file-and-load-it 10)

And look at some triples:

> (dolist (e (get-triples-list))  
    (print e))

Now we want to play with this data, so let us write a little test function:

(defun print-freetext-triples (query)  
  (print-triples (freetext-get-triples query) :format :terse))  
 
;; an easier to type version!  
(defun pft (query)  
  (print-freetext-triples query))

Querying simple expressions

Since the triples are generated randomly, your results will not match ours! They will, however, be correct for the query.

> (pft "eight")  
<30: subject-5 rdfs:comment "two two four eight ">  
<22: subject-3 rdfs:comment "five three ten four eight ten four one ">  
<20: subject-3 rdfs:comment "nine four eight three six five one ">  
<51: subject-9 rdfs:comment "five nine eight two five three ten ">  
...  
 
> (pft '(and "ten" "eight"))  
<32: subject-5 rdfs:comment "five nine ten ten four eight nine three ">  
<36: subject-6 rdfs:comment "seven five ten nine eight ">  
<10: subject-1 rdfs:comment "eight ten one ">  
<45: subject-8 rdfs:comment "nine ten eight four five seven nine seven ">  
...  
 
> (pft '(and "ten" "eight" (or "three" "four")))  
<51: subject-9 rdfs:comment "five nine eight two five three ten ">  
<48: subject-8 rdfs:comment "six five eight three ten five three four ">  
<32: subject-5 rdfs:comment "five nine ten ten four eight nine three ">  
...  
 
> (pft '(or (and "five" "one")  
    (and "ten" "eight" (or "three" "four"))))  
<45: subject-8 rdfs:comment "nine ten eight four five seven nine seven ">  
<32: subject-5 rdfs:comment "five nine ten ten four eight nine three ">  
<48: subject-8 rdfs:comment "six five eight three ten five three four ">  
<51: subject-9 rdfs:comment "five nine eight two five three ten ">  
...

Querying wildcards

The freetext query grammar uses * and ? as wildcard characters. As is traditional:

A * matches zero or more occurrences of anything
A ? matches exactly one character

The * is not allowed to occur in phrases.

> (pft "?i*") ; e.g., five six nine  
<52: subject-9 rdfs:comment "ten two six ten two ">  
<51: subject-9 rdfs:comment "five nine eight two five three ten ">  
<50: subject-9 rdfs:comment "nine three ten ">  
...  
 
> (pft "?i?e") ; only five or nine  
<51: subject-9 rdfs:comment "five nine eight two five three ten ">  
<50: subject-9 rdfs:comment "nine three ten ">  
<48: subject-8 rdfs:comment "six five eight three ten five three four ">  
...  
 
> (pft  
   '(or (and "fiv*" "on*")  
        (and "te*" "eigh*" (or "th*ree" "fo*ur" "\"one five\""))))  
<51: subject-9 rdfs:comment "five nine eight two five three ten ">  
<48: subject-8 rdfs:comment "six five eight three ten five three four ">  
<32: subject-5 rdfs:comment "five nine ten ten four eight nine three ">  
<45: subject-8 rdfs:comment "nine ten eight four five seven nine seven ">  
...
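The wildcard rules described in this section can be sketched in plain Common Lisp. This is only an illustration of the matching semantics for a single word, not the code AllegroGraph uses; the function name wild-match-p is hypothetical, and case-insensitive matching is an assumption based on the earlier "amsterdam" example.

```lisp
(defun wild-match-p (pattern word &optional (p 0) (w 0))
  "True when PATTERN, which may contain * and ?, matches all of WORD.
P and W are the current positions in PATTERN and WORD."
  (cond ((= p (length pattern))
         ;; pattern used up: succeed only if the word is used up too
         (= w (length word)))
        ((char= (char pattern p) #\*)
         ;; * may absorb zero or more characters of WORD
         (loop for next from w to (length word)
               thereis (wild-match-p pattern word (1+ p) next)))
        ((= w (length word))
         nil)
        ((or (char= (char pattern p) #\?)               ; ? takes exactly one character
             (char-equal (char pattern p) (char word w)))
         (wild-match-p pattern word (1+ p) (1+ w)))
        (t nil)))
```

For instance, (wild-match-p "?i*" "five") and (wild-match-p "th*ree" "three") are true, while (wild-match-p "?i?e" "six") is false because the pattern requires exactly four characters.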

A larger example with realistic data

Finally, here is an example of a large file, filled with weapon systems, terrorists, and a lot of common knowledge from the Cyc database (available on request: please mail [email protected]).

We include this non-trivial example because it will allow us to do some select queries. You'll need to change the path in the read-gov function to match the location of the N-Triples file that you have.

(defun read-gov ()  
  (format t "~%Add triples")
  ;; make sure that the comments are indexed  
  (register-freetext-predicate  
      !<http://www.w3.org/2000/01/rdf-schema#comment>)  
  (load-ntriples "/path/to/the/gov/data/Gov.ntriples")  
  (format t "~%Index-all-triples")  
  (index-all-triples))  
 
(time (read-gov))

To make working with the dataset easier, we'll register another namespace.

> (register-namespace "c" "http://www.cyc.com/2002/04/08/cyc#")  
"http://www.cyc.com/2002/04/08/cyc#"

Let's take a quick look at the data:

> (get-triples-list :p !rdfs:comment)  
(<132748: c:situationInvolving rdfs:comment  
    "(#$situationInvolving ?Entity ?Situation) returns  
        all situations involving ?Entity, successively binding each  
        situation fort to ?Situation. A situation is here considered  
        any fort that relates to ?Entity via a preActors (or a  
        spec-pred of this) slot.">  
 <1117: c:ActiveVBarTemplate-NPGap rdfs:comment  
    "This is a #$ParsingTemplateCategory for parsing  
    active voice VBar-level constituents which contain some NP gap.">  
 <195777: c:UIA-Clothing-DemoEnvironmentMt rdfs:comment  
    "The #$ApplicationContext for UIA demonstrations  
    regarding clothing and fashion.">  
 ...

Querying Gov

The pft (print-freetext-triples) function was defined above. Let's start by looking at the difference between queries for triples that contain both words "collection" and "people" versus the triples that contain the exact phrase "collection of people":

> (pft '(and "collection" "people"))  
<16388: c:ChileanPerson rdfs:comment  
    "The collection of people who are #$citizens of  
    #$Chile, or participate in its #$NationalCulture..">  
<65570: c:OmanPerson rdfs:comment  
    "The collection of people who are #$citizens of  
    #$Oman, or participate in its #$NationalCulture.">  
<16434: c:ChinesePerson rdfs:comment  
    "The collection of people who are #$citizens of  
    #$China-PeoplesRepublic, or participate in its #$NationalCulture.">  
...  
 
> (pft "\"collection of people\"")  
<12284: c:BrazilianPerson rdfs:comment  
    "This is the collection of people who are  
    #$citizens of #$Brazil, or participate in its #$NationalCulture.">  
<86522: c:SouthAmericanCitizenOrSubject rdfs:comment  
    "The collection of people who are citizens of, or  
    otherwise by status under the jurisdiction of, a country or  
    other #$GeopoliticalEntity whose territory is located in the  
    #$ContinentOfSouthAmerica.  A person native to an area of  
    South America not under the jurisdiction of any  
    #$GeopoliticalEntity would also be an instance of #$SouthAmericanPerson.">  
...
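The difference between the two queries can be sketched in plain Common Lisp: an and query only needs every word to appear somewhere in the text, while a phrase query needs the words to appear next to each other and in order. The helper names below are hypothetical, and the sketch splits on spaces only, so it ignores punctuation and whatever tokenization rules the real index applies.

```lisp
(defun split-words (text)
  "Split TEXT on spaces into a list of words."
  (let ((words '())
        (start 0))
    (loop
      (let ((begin (position-if-not (lambda (c) (char= c #\Space))
                                    text :start start)))
        (unless begin (return))
        (let ((end (position #\Space text :start begin)))
          (push (subseq text begin end) words)
          (unless end (return))
          (setf start end))))
    (nreverse words)))

(defun contains-all-words-p (words text)
  "True when every word in WORDS occurs somewhere in TEXT (like an and query)."
  (let ((tokens (split-words text)))
    (every (lambda (word) (member word tokens :test #'string-equal))
           words)))

(defun contains-phrase-p (phrase text)
  "True when the words of PHRASE occur consecutively in TEXT (like a phrase query)."
  (search (split-words phrase) (split-words text) :test #'string-equal))
```

A made-up comment such as "A collection of events involving people" would satisfy (contains-all-words-p '("collection" "people") ...) but not (contains-phrase-p "collection of people" ...), whereas "The collection of people who are citizens" satisfies both.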

Querying from Prolog

We can combine the Prolog select queries with freetext indexing. We'll use the same freetext queries that we did above (i.e., one for "collection" and "people" and one for the phrase "collection of people") but we'll add the requirement that each subject found must also be a subclass of !c:AsianCitizenOrSubject. ¹

> (select (?person)  
    (lisp ?list  
          (freetext-get-unique-subjects  
            '(and "collection" "people")))  
    (member ?person ?list)  
    (q- ?person !rdfs:subClassOf !c:AsianCitizenOrSubject))  
(("http://www.cyc.com/2002/04/08/cyc#AzerbaijaniPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#GeorgianPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#TurkmenistanPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#YemenPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#ArmenianPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#HongKongPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#QatarPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#PakistaniPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#AfghanPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#TajikistaniPerson") ...)  
 
> (select (?person)  
    (lisp ?list (freetext-get-unique-subjects  
                  "\"collection of people\""))  
    (member ?person ?list)  
    (q- ?person !rdfs:subClassOf !c:AsianCitizenOrSubject))  
(("http://www.cyc.com/2002/04/08/cyc#HongKongPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#QatarPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#ChinesePerson")  
 ("http://www.cyc.com/2002/04/08/cyc#TaiwanesePerson")  
 ("http://www.cyc.com/2002/04/08/cyc#UnitedArabEmiratesPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#TajikistaniPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#PakistaniPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#TurkmenistanPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#AzerbaijaniPerson")  
 ("http://www.cyc.com/2002/04/08/cyc#AfghanPerson") ...)

Using Free-Text Indexing from SPARQL

You can refer to the contents of the free-text index from within your SPARQL queries by using one of the two "magic" predicates:

fti:match (full URI ).
fti:matchExpression (full URI ).

Use fti:match when you want to match simple strings and phrases. Use fti:matchExpression if you need to handle more complex text matching expressions (e.g., ones with ands and ors in them).

A triple pattern such as

?x fti:match "baseball"

will generate bindings for ?x, where each binding is the subject of a matching triple. "Matching" means that the predicate of the triple is registered with the free-text indexing system and the object of the triple matches the query (in this case, "baseball"). For example:

> (sparql:run-sparql  
    "PREFIX fti: <http://franz.com/ns/allegrograph/2.2/textindex/>  
     SELECT ?x WHERE { ?x fti:match \"baseball\" }"  
     :results-format :fitted-table)  
--------------------------------  
| x                            |  
================================  
| Translation-Complete         |  
| SportsEvent                  |  
| BaseballDelivery             |  
| facets-Covering              |  
| BaseballInning               |  
| firstSubEvents               |  
...

You can use all of your normal free-text patterns here, and you can use multiple fti:match triple patterns in your queries. Recall that strings in SPARQL expressions can use single quotes, which greatly reduces the number of characters you need to escape.

Phrase Searches (note the mix of double and single quotes):

> (sparql:run-sparql  
    "PREFIX fti: <http://franz.com/ns/allegrograph/2.2/textindex/>  
     SELECT ?x WHERE { ?x fti:match '\"collection of people\"' }"  
     :results-format :table)     
----------------------------  
| x                        |  
============================  
| BelgianPerson            |  
| OmanPerson               |  
| EgyptianPerson           |  
| MoroccanPerson           |  
| CzechRepublicPerson      |  
| VietnamesePerson         |  
...

The results of this next query will vary depending on whether or not you are using a reasoning triple-store (see the reference guide for details on AllegroGraph's RDFS++ reasoner). The table shown below is displaying only ground triples; no reasoning is involved.

> (sparql:run-sparql  
      "PREFIX fti:  <http://franz.com/ns/allegrograph/2.2/textindex/>  
       PREFIX c:    <http://www.cyc.com/2002/04/08/cyc#>  
       PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>  
       SELECT ?x WHERE {  
         ?x fti:match '\"collection of people\"' .  
         ?x rdfs:subClassOf c:PersonWithOccupation  
       }"  
    :results-format :fitted-table)  
------------------------------  
| x                          |  
==============================  
| OrganizedCrimeProfessional |  
| Hitperson                  |  
------------------------------

Multiple fti:match predicates in a single query (here we use single quotes instead of double):

> (sparql:run-sparql  
    "PREFIX fti:  <http://franz.com/ns/allegrograph/2.2/textindex/>  
     PREFIX c:    <http://www.cyc.com/2002/04/08/cyc#>  
     PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>  
     SELECT ?x WHERE {  
        ?x fti:match 'people' .  
        ?x fti:match 'murder' .  
      }"  
   :results-format :fitted-table)  
------------  
| x        |  
============  
| Murderer |  
------------  
t  
:select  
(?x)

And, finally, an example of fti:matchExpression:

> (sparql:run-sparql  
    "PREFIX fti:  <http://franz.com/ns/allegrograph/2.2/textindex/>  
     PREFIX c:    <http://www.cyc.com/2002/04/08/cyc#>  
     PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>  
     SELECT ?x WHERE {  
        ?x fti:matchExpression  
           '(and (or \"people\" \"person\") \"murder\")' .  
      }"  
   :results-format :fitted-table)  
------------------------  
| x                    |  
========================  
| Murderer             |  
| AttemptedMurder      |  
| AssassinatingSomeone |  
------------------------  
t  
:select  
(?x)
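If you already have a boolean query as a Lisp list, as in the earlier sections, the query string for fti:matchExpression can be assembled programmatically. The following is a minimal sketch in plain Common Lisp; render-fti-expression and fti-expression-query are hypothetical helper names, not part of AllegroGraph, and the sketch does not escape quote characters inside the search terms.

```lisp
(defun render-fti-expression (expr)
  "Render a query such as (and \"a\" (or \"b\" \"c\")) as text,
with lowercase operators and double-quoted terms."
  (if (stringp expr)
      (format nil "\"~a\"" expr)
      (format nil "(~(~a~)~{ ~a~})"
              (first expr)
              (mapcar #'render-fti-expression (rest expr)))))

(defun fti-expression-query (expr)
  "Build a SPARQL SELECT that passes EXPR to fti:matchExpression.
The rendered expression is wrapped in single quotes, as in the example above."
  (format nil "~a~%~a '~a' }"
          "PREFIX fti: <http://franz.com/ns/allegrograph/2.2/textindex/>"
          "SELECT ?x WHERE { ?x fti:matchExpression"
          (render-fti-expression expr)))
```

The resulting string could then be passed to sparql:run-sparql in the same way as the hand-written query above, for example with (fti-expression-query '(and (or "people" "person") "murder")).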

Why a magic predicate?

The motivation for providing a magic predicate is that SPARQL FILTERs cannot generate new bindings. In many cases generating new bindings is unnecessary:

SELECT ?name {  
    ?x foaf:name ?name .  
    FILTER (regex(?name, "John", "i"))  
}

but this is not always true. There is also precedent for the magic predicate approach in other implementations.

¹ Note that we are using q-, not q, so RDFS++ reasoning is not used and the subjects returned must be direct subclasses.

AllegroGraph 3.3 Freetext Indexing Tutorial

Table of Contents

Getting setup

The basics: adding and querying

A silly but more interesting example

Querying simple expressions

Querying wildcards

A larger example with realistic data

Querying Gov

Querying from Prolog

Using Free-Text Indexing from SPARQL

Why a magic predicate?
