Some techniques for reducing consing when processing very large text files

At Franz Inc., we are building various AllegroCache and AllegroGraph applications that need to read enormously large text based files. Typical examples are ntriple rdf files or comma separated files with hundreds of millions of entries. Consing is a big problem when processing these files naively, not because the processing itself is slow, but because excessive consing results in garbage collections that take tens of seconds. During these GC's, the program is unable to serve external clients. We needed some new functionality and some tricks to significantly reduce or completely get rid of the consing.

We have already provided cons-free variants of read-line (see the earlier Tech Corner article Space-efficient variants of read-line). The simple-stream-read-line function (one of the variants) allows you to use the same resource buffer over and over again.

We create a macro, do-lines, that calls simple-stream-read-line and is suited to our specific needs (this saves us having to remember the syntax of simple-stream-read-line). With the macro (defined just below), we can read lines of text and process them with a simple form like:

(do-lines (line end filename)
  (do-my-stuff line end))

Here line is the variable that is bound to each line as it is read and end is the variable that tells you where the information read into the buffer ends. Here is the definition of the macro:

(defmacro do-lines 
    ((line end file &key (len 1000) 
                         (external-format *default-external-format*)) 
     &body body)
  (let ((in (gensym))
        (done (gensym))
        (t-end (gensym))
        (buf (gensym)))
    `(with-open-file (,in ,file :external-format ,external-format)
       (let ((,buf (make-array ,len :element-type 'character))
             (,end nil))
         (declare (dynamic-extent ,buf))
           (multiple-value-bind (,line ,done ,t-end)
               (simple-stream-read-line ,in nil nil ,buf)
             (declare (ignore ,done))
             (unless ,line (return))
             (setf ,end (or ,t-end (length ,line)))

Let us compare using read-line with using do-lines. We first create some fake telecom-like data in a csv file. The file has data in four fields:

  1. a-number, the calling telephone number
  2. b-number, the receiving telephone number
  3. start-time
  4. duration

The following function creates a file with one comma-delimited data quad per line (it creates the file /tmp/; if that path is not suitable on you machine, specify another one here and below where the file is read):

(defun make-fake-data ()
  (with-open-file (out "/tmp/"
		   :direction :output :if-exists :supersede)
    (let ((c 0))
      (dotimes (i 1000000)
	(format out "~d,~d,~d,~d~%"
		(1+ (random 10000))
		(1+ (random 10000))
		(incf c (random 3))
		(1+ (random 10)))))))

Now we use read-line to read each line of the data file. We do not do anything yet: we will show that below. We are only interested in the space used for the reading.

(defun test1 ()
  (with-open-file (in "/tmp/")
    (let ((line nil)
	  (c 0))
      (while (setf line (read-line in nil nil))
	(incf c))

Here is a transcript:

cl-user(69): (compile 'test1)
cl-user(70): (time (test1))
; cpu time (non-gc) 480 msec user, 20 msec system
; cpu time (gc)     170 msec user, 0 msec system
; cpu time (total)  650 msec user, 20 msec system
; real time  673 msec
; space allocation:
;  141 cons cells, 79,887,072 other bytes, 0 static bytes

Now we use our do-lines macro (and simple-stream-read-line):

cl-user(71): (defun test2 ()
	       (let ((c 0))
		 (do-lines (line end "~/temp/")
		   (incf c))
cl-user(72): (compile *)
cl-user(73): (time (test2))
; cpu time (non-gc) 330 msec user, 30 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  330 msec user, 30 msec system
; real time  356 msec
; space allocation:
;  85 cons cells, 8,944 other bytes, 0 static bytes

The do-lines creates only 8 Kb of garbage whereas the naive version create almost 80 Mb of garbage.

Now let us do something with the contents of each line. Doing something will almost always use space as the result must be stored somewhere. Suppose, for example, we want to count the number of unique substrings in each column of the fake file that we created. We need a hash-table for each column, where we will store each unique string:

(defvar htlist nil)
(defvar a-ht nil)
(defvar b-ht nil)
(defvar t-ht nil)
(defvar d-ht nil)

(defun init-hash-tables ()
  (setf htlist
      (setf a-ht 
            (make-hash-table :size 10000 :test #'equal)) ; calling number
      (setf b-ht 
            (make-hash-table :size 10000 :test #'equal)) ; receiving number
      (setf t-ht (make-hash-table :size 10000 :test #'equal)) ; start time
      (setf d-ht (make-hash-table :size 10000 :test #'equal))))) ; duration

(You may note we have made the start time hashtable (t-ht) too small: 10000 initial size where 10 or 100 times bigger will likely be necessary. But we only know that because we created the data just above in this note and we know its characteristics. 10000 may be a reasonable number absent additional information. In any case, further optimization may be available by better estimating the hash-table size required, but we leave that optimization to the reader.)

Here is a function to parse a line of data:

(defun parse-telecom-line-1 (line end)
  (labels ((next-comma (string end start)
             (loop for i from start to end do
                   (when (or (= i end) (char-equal (schar string i) #\,))
                     (return i)))))
    (let ((start 0))
      (dolist (ht htlist)
        (let ((pos (next-comma line end start)))
          (incf (gethash (subseq line start pos) ht 0))
          ;(print (subseq line start pos))
          (setf start (1+ pos)))))))

(defun count-all-1 ()
  (do-lines (line end "/tmp/")
    (parse-telecom-line-1 line end)))

Now we compile the functions and run our test:

cl-user(74) (compile 'parse-telecom-line-1)
cl-user(75) (compile 'count-all-1)
cl-user(75) (dotimes (i 4)(gc))  ;; (clean up: most live data
                                 ;; goes to oldspace with 4 gc's
cl-user(76) (time (count-all-1))
; cpu time (non-gc) 9,350 msec user, 20 msec system
; cpu time (gc)     1,150 msec user, 0 msec system
; cpu time (total)  10,500 msec user, 20 msec system
; real time  10,526 msec
; space allocation:
;  214 cons cells, 257,219,712 other bytes, 0 static bytes

Nearly 260 MB of data is created to parse the file. What is creating it? (Here we can pitch our runtime-analyzer. It identifies the functions which allocate the most data, using the :space analysis option. However, we know the answer and will just tell you: the repeated calls to subseq. To reduce garbage, we must to get rid of some of the subseq's.

Here is a trick to do so:

(defvar +expected-maximum-string-length+ 80)
(defvar +string-types+ 1)

(defvar *string-source* 
  (loop with strings = (make-array (list +expected-maximum-string-length+ +string-types+))
     for length below +expected-maximum-string-length+ do
       (loop for type below +string-types+ do
            (setf (aref strings length type) (make-string length)))
     finally (return strings)))

(defun mysubseq (type string start end)
  "Return a substring, using one of the string values from
  *STRING-SOURCE* as the return value if the substring is less
  than +EXPECTED-MAXIMUM-STRING-LENGTH+. These strings are passed
  to the callback function; it must not hold onto after it
  returns; if it needs to it must make a copy. In most cases
  that's okay, since the string values are immediately written to
  a file."
  (declare (optimize speed))
  (let* ((length (- end start))
         (substring (if (< length +expected-maximum-string-length+)
                        (aref *string-source* length type)
                        (make-string length))))
    (do ((c start (1+ c))
         (i 0 (1+ i)))
        ((= c end))
      (declare (fixnum i c))
      (setf (schar substring i) (schar string c)))

Here we use it:

(defun parse-telecom-line-2 (line end)
  (labels ((next-comma (string end start)
	     (loop for i from start to end do
		   (when (or (= i end) (char-equal (schar string i) #\,))
		     (return i)))))
    (let ((start 0))
      (dolist (ht htlist)
	(let ((pos (next-comma line end start)))
	  (let ((str (mysubseq 0 line start pos)))
	    (if* (not (gethash str ht))
	       then (setf (gethash (subseq str 0 (length str)) ht) 1)
	       else (incf (gethash str ht)))
	     ;(print (subseq line start pos))
	     (setf start (1+ pos))))))))

(defun count-all-2 ()
  (do-lines (line end "~/temp/")
    (parse-telecom-line-2 line end)))

cl-user(77): (compile 'parse-telecom-line-2)
cl-user(78): (compile 'count-all-2)
cl-user(164): (dotimes (i 4)
22701264 bytes have been tenured, next gc will be global.
See the documentation for variable *global-gc-behavior* for more information.

cl-user(79): (time (count-all-2))
; cpu time (non-gc) 9,740 msec user, 30 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  9,740 msec user, 30 msec system
; real time  9,769 msec
; space allocation:
;  79 cons cells, 98,212,576 other bytes, 0 static bytes

Now only 98 Mb of data is created. (We cannot get rid of it all because we are storing the unique strings.)

CAVEAT: please remember that this is a little bit of a dumbed down example. With the fake dataset we would not have had to use the hashtables and we could just have created vectors with an entry for each user so then we could have done this completely cons free. However, in the real telecom call detail records there is too much data for such an alternative approach.

Copyright © 2019 Franz Inc., All Rights Reserved | Privacy Statement Twitter Google+