Support for any newline format in text input

Exchanging text between computer systems is easy thanks to character encoding standards such as ASCII and Unicode. That ease is marred, though, by different computer systems using different line-break conventions.

Unix systems represent line breaks with the ASCII linefeed (10 or #xa) character. DOS/Windows-based systems use the ASCII carriage-return (13 or #xd) immediately followed by a linefeed (#xa). Older Macintosh systems used only the ASCII carriage-return (#xd). (Mac OS X uses the Unix-style linefeed. When we use :mac below, we refer to the older Mac style.)

These differing conventions cause problems as text using the older Macintosh convention may be interpreted by a Unix application as consisting of a single line, since there will likely be no linefeed characters. Similarly, a Unix text may appear as a single line to a Windows application. In addition, a Windows file will appear to have extra characters at the ends of each line when transported to either a Unix or Macintosh application.
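
To make the failure mode concrete, here is a minimal sketch in portable Common Lisp (nothing here is Allegro CL specific). It counts lines the way a linefeed-only reader would. The names count-lines-lf-only and sample-text, and the sample strings, are invented for this illustration.

```lisp
;; Hypothetical helper: count lines as a Unix-style reader would,
;; treating only ASCII linefeed (code 10) as a line break.
(defun count-lines-lf-only (text)
  (1+ (count (code-char 10) text)))

;; Build "one<eol>two<eol>three" using the given line-break string.
(defun sample-text (eol)
  (concatenate 'string "one" eol "two" eol "three"))

(defparameter *cr* (string (code-char 13)))
(defparameter *lf* (string (code-char 10)))

(count-lines-lf-only (sample-text *lf*))   ; Unix text: 3 lines
(count-lines-lf-only (sample-text *cr*))   ; older Mac text: 1 line

;; DOS/Windows text: 3 lines, but the first two lines each end
;; with a stray carriage-return character.
(count-lines-lf-only (sample-text (concatenate 'string *cr* *lf*)))
```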

This matters for Common Lisp because the ANSI Common Lisp specification states that the single Lisp character #\Newline indicates the end of a line of text. Thus, reading and writing text in Lisp requires understanding how line endings are represented on the native platform so that they can be properly converted to and from #\Newline.

The Windows-based Allegro CL uses DOS/Windows end-of-line conventions by default. On all other platforms, Allegro CL defaults to using the Unix end-of-line convention.

Allegro CL includes a function, eol-convention, that returns the end-of-line convention in effect on a stream. For a newly opened stream this is the platform default: :dos on Windows and :unix on all other platforms.


 ;; Windows Allegro CL
 (eol-convention <stream>) ==> :dos

 ;; Unix Allegro CL
 (eol-convention <stream>) ==> :unix

One can use (setf eol-convention) to dynamically change a stream's end-of-line convention, thus changing how external character codes are converted to and from Lisp's #\Newline. So, if you have a stream open to a file that uses the older Macintosh end-of-line convention, you can read the line breaks on that stream properly by changing the eol-convention as follows:

 (setf (eol-convention <stream>) :mac)

A recent patch to Allegro CL 8.0 adds a new end-of-line convention for reading characters from a stream, called anynl (for Any NewLine). With this patch, one can specify :anynl as the eol-convention of a stream. Input from the stream is then read correctly regardless of which end-of-line convention it uses.


;; This causes read-char/read-sequence/etc. to convert any
;; carriage-return, linefeed, or carriage-return/linefeed combination
;; to a Lisp #\Newline
(setf (eol-convention <stream>) :anynl)
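
The input-side behavior described in the comment above can be approximated in portable Common Lisp. This is only a sketch of the semantics, not how Allegro CL implements it; read-char-anynl and normalize-newlines are hypothetical names, and the code assumes a character stream that delivers carriage-return and linefeed characters unconverted.

```lisp
;; Illustrative only: a portable approximation of :anynl input handling.
;; read-char-anynl is a hypothetical name, not an Allegro CL function.
;; Reads one character, mapping CR, LF, or CR+LF to #\Newline.
;; Returns NIL at end of file.
(defun read-char-anynl (stream)
  (let ((cr (code-char 13))
        (lf (code-char 10))
        (ch (read-char stream nil nil)))
    (cond ((null ch) nil)
          ((char= ch lf) #\Newline)
          ((char= ch cr)
           ;; Swallow a linefeed that immediately follows the return,
           ;; so that CR+LF counts as a single line break.
           (let ((next (peek-char nil stream nil nil)))
             (when (and next (char= next lf))
               (read-char stream)))
           #\Newline)
          (t ch))))

;; Return TEXT with every line-break style replaced by #\Newline.
(defun normalize-newlines (text)
  (with-input-from-string (in text)
    (with-output-to-string (out)
      (loop for ch = (read-char-anynl in)
            while ch
            do (write-char ch out)))))
```

With this sketch, a string that mixes all three conventions comes back with exactly one #\Newline per line break.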

Note that specifying :anynl as above affects only how characters are read from the stream. Characters written to the stream still follow the previous eol-convention setting. More precisely, there are three anynl conventions: :anynl-unix, :anynl-dos, and :anynl-mac. If the previous eol-convention setting on a stream is :unix, then setting the eol-convention to :anynl actually sets it to :anynl-unix. Similarly, :dos becomes :anynl-dos and :mac becomes :anynl-mac.

The :anynl-unix convention converts any line-break convention to #\Newline on input, but if the stream is also an output stream, then #\Newline characters are converted to ASCII linefeed on output (following the Unix convention). Similarly, the :anynl-mac eol-convention writes line breaks using the older Macintosh style, and :anynl-dos writes them using the DOS/Windows style.


 (setf (eol-convention <unix-stream>) :anynl)

is equivalent to

 (setf (eol-convention <unix-stream>) :anynl-unix)
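
The relationship between the previous setting, the resulting anynl variant, and the characters written for each #\Newline can be summarized in a short portable sketch. The functions resolve-anynl and newline-output are hypothetical names used only for this illustration; they are not part of Allegro CL.

```lisp
;; Hypothetical helpers; the names are illustrative, not Allegro CL API.

;; Which variant a bare :anynl resolves to, given the stream's
;; previous eol-convention.
(defun resolve-anynl (previous)
  (ecase previous
    (:unix :anynl-unix)
    (:dos  :anynl-dos)
    (:mac  :anynl-mac)))

;; The external characters written for one #\Newline under each variant.
(defun newline-output (convention)
  (let ((cr (code-char 13))
        (lf (code-char 10)))
    (ecase convention
      (:anynl-unix (string lf))
      (:anynl-dos  (coerce (list cr lf) 'string))
      (:anynl-mac  (string cr)))))

;; A stream that was :dos before switching to :anynl still writes
;; carriage-return followed by linefeed.
(map 'list #'char-code (newline-output (resolve-anynl :dos)))  ; (13 10)
```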

Internally, the end-of-line conventions are implemented using composing external-formats. An external-format is an object that tells a stream how to translate external character codes to and from Lisp Unicode characters. Since an end-of-line convention is independent of whether an external text is represented using, say, UTF-8, JIS (Japanese), or ASCII, Allegro CL uses a base external-format for the main character code conversion and a composing external-format on top of it to handle the newline conversion. Thus, the eol-convention function simply changes a stream's composing external-format. You can see this as follows:

 (setq a (open "..." :external-format :utf-8))
 (setf (eol-convention a) :dos)

 ;; The actual display will be slightly different.
 (stream-external-format a)
   ==> #<external-format (:e-crlf :utf8-base)>

The (:e-crlf :utf8-base) indicates an external-format in which crlf processing (used by the :dos eol-convention) is composed with a UTF-8 base external-format.
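
As a rough conceptual model of this layering, the following portable sketch composes a newline-handling function with a base decoding function. It is an assumption-laden illustration, not Allegro CL's actual implementation: decode-ascii, crlf-to-newline, and compose-formats are hypothetical names, and plain ASCII stands in for the base format to keep the example short.

```lisp
;; Conceptual model only; none of these names are Allegro CL API.

;; Base layer: turn a vector of ASCII octets into a string.
(defun decode-ascii (octets)
  (map 'string #'code-char octets))

;; Composing layer: replace each CR+LF pair with #\Newline.
(defun crlf-to-newline (text)
  (with-output-to-string (out)
    (loop with cr = (code-char 13)
          with lf = (code-char 10)
          with i = 0
          while (< i (length text))
          do (let ((ch (char text i)))
               (cond ((and (char= ch cr)
                           (< (1+ i) (length text))
                           (char= (char text (1+ i)) lf))
                      (write-char #\Newline out)
                      (incf i 2))
                     (t (write-char ch out)
                        (incf i)))))))

;; Compose the two layers: the base runs first, then the eol layer.
(defun compose-formats (eol-layer base)
  (lambda (octets)
    (funcall eol-layer (funcall base octets))))

;; Octets for "hi", carriage-return, linefeed, "!".
(funcall (compose-formats #'crlf-to-newline #'decode-ascii)
         #(104 105 13 10 33))
```

Because the newline layer only ever sees characters, the same layer can sit on top of any base format, which is the point of keeping the two separate.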

Consequently, if you change a stream's external-format mid-stream, which you can do dynamically in Allegro CL (see External formats in iacl.htm), you may also need to reset the stream's eol-convention.

Allegro CL already uses the anynl eol-convention internally for parsing XML, since the XML specification requires that all of the line-break conventions be treated as equivalent. We hope that users will find this facility useful in applications that need to be platform independent.

Copyright © 2014 Franz Inc., All Rights Reserved