A discussion of the permuted index generation code

A permuted index takes an indexing expression and makes an entry for each word in the expression. For example, the index expression run-shell-command (a reference to the function run-shell-command) is thus indexed under R (for run), S (for shell), and C (for command) in the permuted index. The expression What if the system seems to hang is indexed under H (for hang), I (for if), S (for seems and again for system), T (for the and again for to), and W (for what).

Here are some issues when creating a permuted index:

  • You must decide what to consider an expression that generates permuted index entries. We gave two examples above: run-shell-command and What if the system seems to hang. These can from index entries run-shell-command: Function, excl package and interrupting for sure: see What if the system seems to hang, on Unix: Enough C-c's (on Unix) will always interrupt, on Windows: The Allegro icon on the system tray will always interrupt on Windows both in startup.htm. The portions of the main index entries in italics indicate linked text (between markers <a ...> and </a>). We decided to use linked text as expressions in the Allegro CL permuted index. (Using link text has the particular advantage that there is no dispute about where it should link. The interrupting for sure... entry contains links to four different locations, so where would the whole entry link if it was placed in the permuted index?)
  • You must decide whether to filter words like a, the, on, if, etc. This is a very difficult issue unless you decide to simply ignore it (we mostly ignore it, filtering out only single-letter words). It is difficult because words like the may look irrelevant but sometimes might not be (see the, a description of the the special operator), but determining meaningful and non-meaningful uses of the requires special markup or a pretty good semantic analyzer, neither of which are available. The issue can be ignored because the almost all downside is bigger files, which in the age of cheap disks are not that big a deal. (See the entries under htm in H: we use htm as the extension for our documentation files so most appear, arising from entries like the interrupting for sure: ... in startup.htm one given above. They are not really useful in the permuted index H file, but they do not do much harm either.)
  • You must decide what constitutes a word. This one is pretty easy: alphanumeric text surrounded by whitespace (a space, or a carriage return), a dash (-) or other punctuation mark, or the beginning or end of an expression.

The power of creating a permuted index is shown from the What if the system seems to hang example. As shown, the index entry is in index-i.htm, starting with the text interrupting for sure. In the permuted index, we get additional entries for hang, system (with hang close by), and system tray, among others, automatically. Further, knowing how the permuted index is generated encourages writing main index entries that will generate useful permuted index entries (by having the link text contain useful words).

Now let us look at the code for generating the index. The steps are as follows:

  1. Read in the index html files and convert to lhtml lists. lhtml represents HTML-marked text as a list of lists, with the first element of and sublist being a HTML tag (represented as a keyword -- :p for the paragraph tag p, for example), or a list of the tag and attribute value pairs, and the remaining elements the text marked by the tag. The Lisp-based HTML parser available in Allegro CL is described in phtml.htm.
  2. Find the index entries by searching the lhtml tree. This step, performed by the function collect-all-entries, depends on knowledge of the structure of the Allegro CL documentation index files which provide the data for the permuted index. The index portion of these files is in bulleted lists (starting with HTML tag UL). collect-all-entries finds such bulleted lists and collects its entries from the bulleted items (tagged by HTML tag LI).
  3. Collect the link text as expressions for the permuted index (and keep the link available).
  4. Organize the links into ordered alphabetic blocks.
  5. Split the indexed entries into indexed words.
  6. Create lhtml lists from these blocks.
  7. Write the permuted index files.

Here is the code. This code should run in Allegro CL 6.1 (but you may hit the Trial version heap limits). Note that the permuted index files generated, named my-permuted-index-*.htm, are somewhat simpler than the ones in the documentation update. They lack the navigation bars at the top and the bottom and other boilerplate information, and they lack the secondary index (links into the list of entries for fast scrolling).

Three parameters must be set. They have been given reasonable defaults, but are likely best set specifically by you. The parameters are:

  • *doc-index-wildcard*: the directory where the supplied index files are located, followed by `index-*.htm'.
  • *index-target-url*: the location where the newly generated permuted index files will be placed. We recommend that you do not write files to the Allegro CL doc/ directory even if you have write permission there.
  • *index-base-url*: a path to prepend to the current index links so the newly generated permuted index files will contain valid links. The initial value should be correct for the defaults of the other two parameters.

We have provided the code here. Click here to download a version of the code in a seperate file (cutting and pasting from a browser often results in poorly formated code).

;;  Code to generate permuted index files from standard
;;  Allegro CL index files.

(in-package :user)


(eval-when (compile load eval)
  (require :phtml)
  (require :aserve))


#|
;; The program works by reading the index pages from a regular
;; doc set and creating permuted index pages.  

;; Configure these pathnames as desired.

;; *doc-index-wildcard* is the path of the doc/ directory, followed by
;;       the filenames described by index-*.htm 
;;       (i.e. the standard index files)
;; *index-target-url* the location where the new permuted index files
;;       will be placed. The initial value is "/tmp/"
;; *index-base-url* the directory that will be added to a link
;;       from an index file into the documentation. If
;;       the documentation is the doc/ subdirectory, of 
;;       (on Windows machines, this is the usual location)
;;       "/Program Files/acl61/doc/" and *index-target-url*
;;       is "/tmp/", the value should be 
;;       "../Program Files/acl61/doc/".
;;       On UNIX machines, the location is typically /usr/local/acl61

|#

;; This is from where to read the index files to create the permuted index.
(defparameter *doc-index-wildcard* #+mswindows "doc/index-*.htm"
              #-mswindows "/usr/local/acl61/doc/index-*.htm")
;; This is where to write the created my-permuted-index*.htm files.
(defparameter *index-target-url* "/tmp/")
;; This is the base for URL references to the generated indexed doc set.
(defparameter *index-base-url* #+mswindows "../Program Files/acl61/doc/"
              #-mswindows "/usr/local/acl61/doc/")

;;  The following functions callect information from
;;  the doc/index-*.htm files.

;; The first element of an lhtml list is either a tag
;; or a list of a tag and attribute/value pairs. This
;; function extracts the actual tag.
(defun element-tag (element)
  (if (consp element) (car element) element))

(defun find-element (html predicate continuation)
  (when (consp html)
    (destructuring-bind (element &rest content) html
      (if (if (symbolp predicate)
              (eq (element-tag element) predicate)
            (funcall predicate element))
          (funcall continuation element content)
        (loop for sub in content
            do (find-element sub predicate continuation))))))

(defun collect-strings (content)
  (with-output-to-string (s)
    (labels ((collect-strings-1 (content)
               (typecase content
                 (string (write-string content s))
                 (cons (collect-strings-1 (car content))
                       (collect-strings-1 (cdr content))))))
      (collect-strings-1 content))))

(defun whitespacep (char)
  (member char '(#\space #\tab #\linefeed #\return)))

(defun normalize-whitespace (str)
  ;; Normalize all space to a single nbsp, dropping leading and trailing space.
  (with-output-to-string (s)
    (loop with startedp = nil
        with last-whitespace = nil
        for c across str
        do (if (whitespacep c)
               (when startedp (setq last-whitespace t))
             (progn (when last-whitespace
                      (write-char #\space s) ; #\no-break_space
                      (setq last-whitespace nil))
                    (setq startedp t)
                    (write-char c s))))))

(defun collect-all-entries ()
  #|
This function depends on the known structure of index-*.htm files:
the index entries are in a bulleted list, tagged with UL, with entries
tagged with LI and links tagged with A. We further know that the only
butlleted lists are lists entries. So this function gets the LHTML list
from each index-*.htm file, and strats walking through it, finding
A tags within LI tags within UL tags and collecing them
|#
  (let ((ret nil))
    (loop for letter-index-file in (directory *doc-index-wildcard*)
        as html = (with-open-file (f letter-index-file)
                    (net.html.parser:parse-html f))
        do (find-element
            (car html)
            :ul
            (lambda (e c)
              (find-element
               (cons e c)
               :li
               (lambda (e c)
                 (find-element
                  (cons e c)
                  :a
                  (lambda (e c)
                    (let ((url (getf (cdr e) :href)))
                      (push (cons url (normalize-whitespace (collect-strings c)))
                            ret)))))))))
    ;; Delete duplicates.  Certain entries for multiple names 
    ;; (e.g. string-upcase, nstring-upcase, etc.) will have 
    ;; been gathered from multiple pages index-letter pages.
    (loop for x on (sort ret (lambda (x y)
                               (cond ((string< (car x) (car y)) t)
                                     ((string> (car x) (car y)) nil)
                                     ((string< (cdr x) (cdr y)) t)
                                     (t nil))))
        unless (equal (car x) (cadr x)) collect (car x))))


;; Name is a single string with no leading or trailing 
;; whitespace and all internal
;; whitespace consisting of only a single space.  
;; Returns a list of indexes into the string
;; where permuted entries should start.

(defun generate-index-entries (name)
  ;; Need to delete noise entries.
  (loop with last-alpha = nil
      for n from 0
      for c across name
      if (not (alphanumericp c))
      do (setf last-alpha nil)
      else when (null last-alpha)
      collect n
      and do (setf last-alpha t)))

(defun gen-index-entries ()
  (loop for entries on
        (stable-sort
         (sort (loop for (url . name) in (collect-all-entries)
                   as target-url = (concatenate 'string *index-base-url* url)
                   as entry = (cons name target-url)
                   nconc (loop for split on (generate-index-entries name)
                             when 
                               (> (- (or (cadr split) (1+ (length name))) 
                                     (car split)) 2)
                             collect 
                               (list* 
                                (subseq name 
                                        (car split) 
                                        (and (cadr split) 
                                             (loop for i 
                                                 from (1- (cadr split)) 
                                                 downto (car split)
                                                 when 
                                                   (alpha-char-p (char name i))
                                                             return (1+ i))))
                                            (car split)
                                            (cadr split)
                                            entry)))
               #'string-lessp :key #'fourth)
         #'string-lessp :key #'car)
        ;; Delete any duplicate entries.
      unless (equal (car entries) (cadr entries))
      collect (car entries)))

;;  This function creates a lhtml list for a collection for a single
;;  my-permuted-index-*.htm page and returns it.

(defun emit-page-table (entries title)
  (let ((lhtml-list (list :html (list :head (list :title title))))
        (table-head (list (list :table "border" "0" "cellpadding" "0"
                                "cellspacing" "0")))
        (table-body nil)
        (body1-list (list 
                     (list :h1 title)
                     (append (list :p) (gen-links-list))))
        (body2-list (list (append (list :p) (gen-links-list))))
        (result nil))
    (loop for entry in entries
        as (item ind1 ind2 text . url) = entry
        do 
          (setq table-body 
            (append table-body
                    (list 
                     (list :tr 
                           (list '(:td "align" "right")
                                 (if (eql ind1 0)
                                     " "
                                   (gen-a-entry  
                                    url
                                    ;; If the last char of the left segment 
                                    ;; is a space, replace it with a nbsp.  
                                    ;; Otherwise html rendering will remove
                                    ;; it entirely.  We can't just replace 
                                    ;; all spaces with nbsp
                                    ;; because then the renderer 
                                    ;; can't fold very long lines.
                                                 (progn 
                                                   (if (eql #\space 
                                                            (char text 
                                                                  (1- ind1)))
                                                       (let 
                                                           ((str 
                                                             (subseq 
                                                              text 0 ind1)))
                                                         (setf 
                                                             (char str 
                                                                   (1- ind1)) 
                                                           #\no-break_space)
                                                         str)
                                                     (subseq text 0 ind1))))))
                                 (list '(:td "align" "left")
                                       (gen-a-entry  
                                        url
                                        (concatenate 'string
                                          (if ind2
                                              (when ind2
                                                (loop while (> ind2 ind1)
                                                    until 
                                                      (alpha-char-p 
                                                       (char text (1- ind2)))
                                                    do (decf ind2)))
                                            "")
                                          item
                                          (if ind2 (subseq text ind2) "")))))
                                 )))
          (setq result (append lhtml-list 
                             (list (append (list :body)
                                     body1-list 
                                    (list (append table-head table-body)) 
                                    body2-list)))))
    result))

;;  This function creates the main permuted index page (with links
;;  all other pages.  

(defun gen-permuted-index-main ()
  (let ((lhtml-list (list :html)) (body-list nil) 
        (links-list (gen-links-list)))
    (setq lhtml-list (append lhtml-list 
                             (list (list :head
                                         (list :title "Permuted Index")))))
    (setq body-list 
      (list 
       (list :h1 "Permuted Index")
       (append (list :p) links-list)))
    (setq lhtml-list (append lhtml-list
                             (list (append (list :body) body-list))))))

;;  This function creates a lhtml :a entry

(defun gen-a-entry (href entry)
  (list (list :a "href" href) entry))

;; This function creates the lhtml links (to other my-permuted-index-*
;; files).

(defun gen-links-list ()
  (let ((links-list nil))
    (loop for c across "abcdefghijklmnopqrstuvwxyz"
        do 
          (setq links-list 
            (append links-list
                    (list
                     (gen-a-entry
                      (format nil "my-permuted-index-~a.htm" c)
                      (list :b 
                            (concatenate 'string (string-upcase c) " ")))))))
    (setq links-list (append links-list 
                             (list 
                              (gen-a-entry 
                               "my-permuted-index-non-alphabetic.htm"
                               (list :b "Non-Alphabetic")))))))



(defun gen-permuted-index-files ()
  ;; writes permuted index files
  
  ;;  First the main file:
  (with-open-file (s (format nil "~amy-permuted-index.htm"
                             *index-target-url*)
                   :direction :output :if-exists :supersede)
    (net.html.generator:html-print (gen-permuted-index-main) s))
  
  ;;  Now the individual files:
  (let ((entries (gen-index-entries)))
    (flet ((emit (filenamex predicate)
             (with-open-file (stream (format nil "~amy-permuted-index-~a.htm"
                                             *index-target-url* filenamex)
                              :direction :output :if-exists :supersede)
               (let 
                   ((result (funcall 'emit-page-table
                                     (loop for entry in entries
                                         when (funcall predicate (car entry))
                                         collect entry)
                                     (format nil "Permuted Index - ~:(~a~)" 
                                             filenamex))))
                 (net.html.generator:html-print result stream)))))
      (loop for c across "abcdefghijklmnopqrstuvwxyz"
          do (emit c 
                   (lambda (name) (char-equal (char name 0) c))))
      (emit "non-alphabetic" 
            (lambda (name) (not (alpha-char-p (char name 0)))))
      )))
Copyright © 2014 Franz Inc., All Rights Reserved | Privacy Statement
Delicious Google Buzz Twitter Google+