A text indexer using B-trees

Customers have asked about text-indexing software. We now have a text-indexer which stores associatons between words and numbers. As we show in this note, one example of the use of this tool is to index all words (3 characters or longer, and excluding a designated list of words to ignore) in a collection of files. The indexing uses AllegroCache and B-trees.

The example uses four files with are normally put in a directory named text-index/. A zipped tar file named text-index.tar.gz can be downloaded (click on the filename). Extract that file into a convenient directory, such as the Allegro 8.1 directory (where Allegro CL 8.1 was installed). A subdirectory named text-index/ will be created with the files listed below.

The file included are:

  • text-index.cl: the text-indexing software.
  • string-hash.cl: an auxiliary file needed by text-index.cl.
  • text-index.txt: documentation for the text-indexing functionality.
  • tutorial.cl: forms to evaluate to illustrate the text-indexing functionality.

The first step in creating a text index is to create an index file. This is done with the function open-text-index and it returns an text-index instance. Then you add to this index strings and locations. The following transcript illustrates this:

;;  First we set the value of user::*test-index-dir*. Where
;;  we are running Lisp, it is the text-index/ subdirectory of the
;;  current directory:

cl-user(3): (setf user::*text-index-dir* (pathname "text-index/"))
#P"text-index/"

;;  Now we compile and load the file test-index.cl. The file
;;  string-hash.cl should also be compiled and loaded:

cl-user(4): :cload text-index/text-index.cl
;;; Compiling file text-index/text-index.cl
;;; ...
;;; Compiling file text-index/string-hash.cl
;;; ...
; Fast loading text-index/text-index.fasl

;;  We use the text-index package so we do not have to qualify
;;  variable and operator names:

cl-user(10): (use-package :text-index)
t

;;  We defined a filename for the index and we create (or open, if
;;  it exists) the index:

cl-user(14): (defparameter testfname "text-index/tutorial.ti")
testfname
cl-user(15): (defparameter ind (open-text-index testfname :if-does-not-exist :create))
ind
cl-user(16): 

;;  Now things are set up, we add items to the index:

cl-user(16): (index-word ind "jans" 1)
nil
cl-user(17): (index-word ind "ahmon" 1)
nil
cl-user(18): (index-word ind "fritz" 2)
nil
cl-user(19): (index-word ind "kevin" 2)
nil
cl-user(20): 

The index associates strings with a value. This value should be an integer, one that can be represented by a (signed-byte 32). We have indexed four names, jans, ahmon, fritz, and kevin, and placed two in location 1 and two in location 2.

So far, this may seem simple and unimpressive. The power of the text indexer is that it can place very, very many words in an index very quickly and extract index information very quickly. So far we haven't done anything that could not as easily be done with hashtables (the key is the string and the value the location). But hashtables are hard to make persistent (they do not have a easy representation storable in files) and they tend to work less well when you have tens or hundreds of millions of entries.

We are not in this example going to have millions of entries because there is no convenient suitable collection of files which all users reading this note will have access to. All we can do it show how it can be used to index a collection of files and then you can try it out on some collection of your own.

Anyway, we have a small and simple index. Let us demonstrate a few operations on it:

;;  You find index entries with LOOKUP-WORD:
cl-user(20): (lookup-word ind "jans")
(1)

;; Wildcards like * and ? are supported:
cl-user(24): (lookup-word ind "jan*" :wild t)
(1)
cl-user(25): (lookup-word ind "j?ns" :wild t)
(1)

;;  LOOKUP-TEXT looks up several words and returns objects that
;;  contains all:
cl-user(26): (lookup-text ind "jans ahmon")
(1)
cl-user(27): (lookup-text ind "jans kevin")
nil

;;  Even though both jans and kevin are in the index, jans is indexed
;;  at location 1 and kevin at location 2 so no location contains
;;  them both.

;;  You can insert multiple words with INDEX-STRING:
cl-user(28): (index-string ind "jans is away for a week" 1)
nil

;;  Each word is extracted and added to the index at the
;;  specified location (1 in the example):
cl-user(36): (lookup-word ind "away")
(1)
cl-user(37): (lookup-text ind "week for")
(1)

;;  But "is" is not there:
cl-user(38): (lookup-word ind "is")
nil

;;  The indexer (as implemented) does not index works less than
;;  3 characters nor more than *max-word* characters, nor any of
;;  a defined list of stopwords (the value of *stop-words*) which
;;  includes words like "was", "their", "and", "not", and so on.

;;  We are done with the simple example. Let us close the index and 
;;  then delete it:
cl-user(39): (close-text-index ind)
nil
cl-user(40): (delete-text-index testfname)
t

Now let us do a more complicated example. We will index all appropriate files in a directory. (By "appropriate", we mean text files of interest.) We need a way to associate integers (which will be used as locations) with filenames. We use a hashtable whose keys are integers are whose values are filenames.

cl-user(41): (defparameter filetable (make-hash-table))
filetable
cl-user(42): (setf ind (open-text-index testfname :if-does-not-exist :create))
#<text-index @ #x500dca2>

The function do-one-dir takes a directory name as an argument and finds all appropriate files in that directory. It ignores subdirectories, emacs temp names (which have a # pre- and post-pended):

(defun do-one-dir (dir)
  (let ((c 0))
    (dolist (e (directory dir))
      (let ((name (namestring e)))
	(print name)
	(when (not (position #\# name)) ; filter out emacs temps
	  (when (not (excl:file-directory-p name)) ; filter out subdirs
	    (setf (gethash c filetable)  name)
	    (index-file ind name c)	    
	    (incf c)))))))

The work of do-one-dir is the (index-file ind name c) form. Everything else is just setup to get at the files to be indexed. index-file finds all words in a file and then adds them to the index. Each file is thus a location. See text-index.txt for a description of other indexing options.

We load do-one-dir into Lisp and use it on a directory. We use it on the directory text-index/ (which presumably users trying out the code in this note will have):

cl-user(47): (compile 'do-one-dir)
do-one-dir
nil
nil
cl-user(48): (do-one-dir "text-index/*.cl")

"text-index/string-hash.cl" 
"text-index/text-index.cl" 
"text-index/tutorial.cl" 
nil
cl-user(49): (do-one-dir "text-index/*.txt")

"text-index/text-index.txt" 
nil

We indexed the three .cl files and the one .txt file. Now let us look for some words:

cl-user(56): (lookup-word ind "defun")
(2 1 0)
cl-user(57): (lookup-word ind "jans")
(2)

;;  Well, defun appears in the Lisp source files but not in the txt file
;;  and jans appears in the tutorial file (2) because it contains the
;;  examples we have been running. But the numeric indexes are not
;;  that useful. Here is a function that converts the numeric designations
;;  to filenames:

cl-user(58): (defun get-files (lis)
	       (let (res)
		 (dolist (i lis)
		   (push (gethash i filetable) res))
		 (nreverse res)))
get-files
cl-user(59): (get-files (lookup-word ind "defun"))
("text-index/tutorial.cl"
 "text-index/text-index.cl"
 "text-index/text-index.txt")
cl-user(60): 

;;  Suppose we wanted to find the definition of LOOKUP-WORD.
;;  The file must contain "defun lookup-word". LOOKUP-TEXT is not
;;  quite right, because it finds the files that contains both words
;;  but not necessarily together:
cl-user(60): (get-files (lookup-text ind "defun lookup-word"))
("text-index/text-index.txt"
 "text-index/text-index.cl"
 "text-index/tutorial.cl")
cl-user(61): 

;;  No real help there. But the function LOOKUP-PHRASE does what
;;  we want. It takes an index and a phrase (just like LOOKUP-TEXT) 
;;  but it also take a function that takes a index location and returns
;;  a string to be searched for the desired phrase (using the Lisp
;;  SEARCH function). Now we will find the definition. The function we
;;  we use is (lambda (a) (excl:file-contents (gethash a filetable))).
;;  excl:file-contents returns a file's contents as a single string.
cl-user(61): (lookup-phrase ind "defun lookup-word"
	       (lambda (a)
		 (file-contents (gethash a filetable))))
(1)
cl-user(62): (get-files *)
("text-index/text-index.cl")
cl-user(63): 

;;  The definition of LOOKUP-WORD is in text-index.cl.

lookup-phrase works by applying lookup-text to the phrase so that the files that contain all the words of the phrase are found (but the words are just present, not necessarily forming the phrase). Then the argument function is applied so the strings to search are extracted and searched for the phrase.

In the file tutorial.cl, near the end, there is some elaborate code for indexing Lisp files. It starts:

;;;; a mini project creating  a search engine for all your lisp and text files...


(defstruct ti
  counter
  filetable
  index)  

(defparameter ti
    (make-ti :counter 0
	     :filetable (make-hash-table :size 50000)
	     :index (open-text-index testfname :if-does-not-exist :create)))

(defun ti-index-file (ti name)
  (index-file (ti-index ti)
	      name
	      (incf (ti-counter ti)))
  (setf (gethash (ti-counter ti) (ti-filetable ti)) name))
;; etc.

The function build-my-lisp-index indexes a directory of Lisp and related source files. If you have a directory of Lisp source files, you can test it out.

We have only given a short introduction to the text indexer. See text-index.txt for documentation of all the defined functions.

The text indexing code is free to use for any purpose (we ask only that you acknowledge Franz Inc. as the original source and require others to do so as well). No text indexer is perfect and ours can certainly be improved. If any user makes improvements or comes up with different approaches, and wishes these publicized, please tell us. Again click on the filename to download the text-indexing files: text-index.tar.gz.

Copyright © 2014 Franz Inc., All Rights Reserved | Privacy Statement
Delicious Google Buzz Twitter Google+