Space-efficient variants of read-line

The Common Lisp function read-line reads a single line from a stream and returns that line as a string of characters. It is a very convenient function when parsing user data which is not Lisp data (and thus not suitable from reading by functions like read).

The downside of read-line is it can waste space. A new string is created for each line read. This can result in significant excess consing when many lines are read.

Two new functions, read-line-into and simple-stream-read-line, provide space-efficient variants of read-line. They work by allowing the programmer to supply the target string (for read-line-into) or a buffer (for simple-stream-read-line) and thus avoid the creation of unnecessary new strings.

The arguments and return values of these new functions are described on there documentation pages. In this note, we just want to demonstrate their space efficiency. You can get the update, which is for version 8.0 only, by calling sys:update-allegro.

read-line-into is the most space-efficient function for reading lines, but it has a complicated interface since the programmer must keep track of where the next line should go in the target string. simple-stream-read-line has a simpler interface at the cost of using some more space. It is almost as easy to use as read-line.

The file excl-ops.xml is the XML source file for the documentation pages for operators in the excl package. (Almost all Allegro CL documentation is written in XML and converted to HTML before publication. This allows generic changes and things like automatic creation of links to be done in processing without source changes, thus allowing faster updated of the documentation.) This file is reasonably large: over 21,000 lines and 700,000 characters. We will do two things with this file:

  1. Read it into Lisp to create a string of all its characters without the newlines. We will do this be creating a target string (1,000,000 characters long) and reading one line of the file at a time, stuffing that line into the target string. When we use read-line, we create a new string for each line and copy that string into the target. When we use read-line-into, the characters are copied directly into the target and no new strings are created.
  2. Check each line to see if the substring "update-allegro" occurs in the line, returning the number of lines containing that substring. Again, when we use read-line, a new string is created for each line read. When we use simple-stream-read-line with a buffer string, that buffer string is used whenever possible, reducing or eliminating the need for new strings. (We try a buffer string 100 characters long, and find that 36 lines are longer and require new strings. A buffer string 1000 characters long has no overflows.)

As you will see, the savings are mostly in space. read-line is already efficient timewise. There are time savings -- less garbage means fewer gc's and string allocation is not instantaneous, but it is the space savings that are significant.

Here is our code. The function rf takes a filename and a target string, and then reads the file one line at a time, stuffing each resulting string into the target:

(defun rf (file string)
  (with-open-file (f file :direction :input)
    (do* ((x (read-line f nil f) (read-line f nil f))
          (len (length x) (if (stringp x) (length x)))
          (index 0))
         ((eq x f) string)
      (setf (subseq string index (+ index len)) x)
      (setq index (+ index len)))))

The function rf2 uses read-line-into instead of read-line. read-line-into returns three values but we only need the first, which tells how many characters were read and placed in the target string (we use that to know where to place the next line read).

(defun rf2 (file string)
  (with-open-file (f file :direction :input)
    (let ((ind 0) rv1)
      (loop 
	(setq rv1 (read-line-into string f nil f :start ind))
	(if (eq rv1 f) (return string))
	(setq ind (+ ind rv1))))))

Here is a transcript of the calls to rf and rf2. This was down on a Windows machine. We wrap the calls with a progn form that returns nil because rf and rf2 return the target strings, which are very large and which we do not want to print out. We verify the two results have the same contents by verify the contents at character 234,567 are the same.

cg-user(2): (setq aa (make-string 1000000 :initial-element #\a) bb nil)
nil
cg-user(3): (defvar *r* nil)
*r*
cg-user(4): (defvar *s* nil)
*s*
cg-user(5): (time (progn (setq *r* (rf "excl-ops.xml" aa)) nil))
; cpu time (non-gc) 80 msec user, 100 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  80 msec user, 100 msec system
; real time  271 msec
; space allocation:
;  67 cons cells, 1,896,432 other bytes, 0 static bytes
cg-user(6): (time (progn (setq *s* (rf2 "excl-ops.xml" aa)) nil))
; cpu time (non-gc) 71 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  71 msec user, 0 msec system
; real time  71 msec
; space allocation:
;  65 cons cells, 5,224 other bytes, 0 static bytes
cg-user(7): (subseq *r* 234567 234590)
"can be :stream    when "
cg-user(8): (subseq *s* 234567 234590)
"can be :stream    when "
cg-user(9):

A small gain in time combined with a huge reduction in space.

Example 2

In the second example, we read one line at a time and see if a test string appears in it.

The function r4 takes a file name and the test string as arguments. It uses read-line. The function r5 does the same thing but creates and uses a buffer string and read-line-into. r5 takes a third, optional, argument specifying the size of the buffer string (default is 100).

r4 returns the number of lines containing the test string and also the number of lines read. r5 returns those two value and, as a third value, the number of lines larger than the buffer string. (For those lines, a new string must be allocated.)

We again use the file excl-ops.xml and we use the test string "update-allegro". (sys:update-allegro is the name of the function that downloads Allegro CL updates. You should call it if you want to download the functionality described here. Remember, this update is for version 8.0 only.) Here are the results:

cg-user(10): (time (r4 "excl-ops.xml" "update-allegro"))
; cpu time (non-gc) 80 msec user, 10 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  80 msec user, 10 msec system
; real time  90 msec
; space allocation:
;  65 cons cells, 1,896,376 other bytes, 0 static bytes
13
21102
cg-user(11): (time (r5 "excl-ops.xml" "update-allegro" 100))
; cpu time (non-gc) 80 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  80 msec user, 0 msec system
; real time  80 msec
; space allocation:
;  102 cons cells, 14,752 other bytes, 0 static bytes
13
21102
36
cg-user(12): (time (r5 "excl-ops.xml" "update-allegro" 1000))
; cpu time (non-gc) 80 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  80 msec user, 0 msec system
; real time  80 msec
; space allocation:
;  66 cons cells, 7,248 other bytes, 0 static bytes
13
21102
0
cg-user(13):

The call to r4 used about 1,900,000 other bytes. The first call to r5 used 14,000 other bytes. Why so many? 36 lines overflowed the 100 character buffer, necessitating newly allocated strings. That is at least 3600 characters, which is at least 7,200+ bytes (each character is 2 bytes). A second call to r5 with a 1000 character buffer resulted in no overflows, and only 7,248 other bytes.

The times are comparable but the space saving is significant.

Copyright © 2014 Franz Inc., All Rights Reserved | Privacy Statement
Delicious Google Buzz Twitter Google+