International Characters Tutorial

International characters are the characters used in written languages around the world. Allegro CL represents characters, diacritical marks, phonetic symbols, and punctuation for scripts including Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, Ethiopic, Cherokee, Canadian-Aboriginal Syllabics, Ogham, Runic, Tagalog, Hanunóo, Buhid, Tagbanwa, Khmer, Mongolian, Limbu, Tai Le, Han (the ideographs used in Japanese, Chinese, and Korean), Hiragana, Katakana, Bopomofo, and Yi. A large variety of symbols and dingbats used for punctuation, mathematics, science, and other specialized purposes is also supported.

International Allegro CL represents each Lisp character using a 16-bit character code. The encoding used is the Unicode Basic Multilingual Plane. In other words, each integer from 0 through 65535 (#xffff) is the Lisp character code for the corresponding Unicode code point. Here are some examples:

> (code-char 65)
#\A
> (code-char 66)
#\B
> (code-char 97)
#\a
> (code-char 98)
#\b
> (char-code #\;)
59
> (char-code #\semicolon)
59
> (char-code #\latin_capital_letter_a_with_circumflex)
194
> (code-char 195)
#\latin_capital_letter_a_with_tilde
> (char-code #\cyrillic_capital_letter_ka)
1050

As the examples above show, Lisp uses Unicode character names. A multiword Unicode name, e.g., "latin capital letter a with circumflex", becomes a Lisp character name by replacing each space with an underscore, giving #\latin_capital_letter_a_with_circumflex.
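
The naming rule can be sketched in a few lines of portable Common Lisp. The helper names unicode-name->lisp-name and unicode-name->char are hypothetical and exist only for this illustration; name-char is the standard function, and whether it recognizes a given Unicode name depends on the Lisp being used.

```lisp
;; Hypothetical helpers illustrating the space-to-underscore naming rule.
(defun unicode-name->lisp-name (unicode-name)
  "Return UNICODE-NAME with each space replaced by an underscore."
  (substitute #\_ #\Space unicode-name))

(defun unicode-name->char (unicode-name)
  "Return the character named by UNICODE-NAME, or NIL if the name is unknown."
  (name-char (unicode-name->lisp-name unicode-name)))

(unicode-name->lisp-name "latin capital letter a with circumflex")
;; => "latin_capital_letter_a_with_circumflex"

;; In a Lisp that knows this name, the resulting character has code 1050.
(unicode-name->char "cyrillic capital letter ka")
```

Because name-char returns NIL for an unknown name, the second helper can also serve as a quick check of whether a particular name is available.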

1. Input/Output

Characters used in different written languages, especially those outside ASCII, have been encoded for computer representation in many different ways. For example, JISX0208, JIS, Shift-JIS, and EUC are just some of the encodings used for Japanese alone. What's more, these Japanese encodings include characters used in Chinese and Korean, yet each of those languages has its own encodings.

The Unicode project unifies all characters into a single encoding, but it is still necessary to perform input and output between Lisp and the outside world using arbitrary external encodings. Allegro CL achieves this encoding translation by using external-formats. A Lisp external-format can be thought of as a translator between Lisp characters and external octets (an octet is an 8-bit byte). External-formats are associated with Lisp streams, and character translation happens automatically as characters are read or written. The default external-format is the one associated with the current locale (see the excl:*locale* variable).

> (setq a (open "myfile" :direction :io))

;; The default external-format in this Lisp session is Latin1.
;;
> (stream-external-format a)
#<external-format :latin1 [(crlf-base-ef :latin1)] @ #x100fe542>

;; Change the stream's external-format to :jis so that Japanese characters
;; can be read/written.  ASCII characters, as well as many others, are
;; represented in JIS, but characters not in this encoding will typically
;; be written as question-marks.
;;
> (setf (stream-external-format a) (find-external-format :jis))
#<external-format :jis [(crlf-base-ef :jis)] @ #x10b2b472>

;; All Unicode characters can be represented in UTF-8.  In other
;; words, any Lisp/Unicode character can be read/written on this stream.
;;
> (setf (stream-external-format a) (find-external-format :utf-8))
#<external-format :utf8 [(crlf-base-ef :utf8)] @ #x100fe4c2>

As shown in the example, the UTF-8 encoding, which represents all of Unicode, is available as an external-format.
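
The translation performed by an external-format can be observed by writing a string through a character stream and then reading the same file back as raw octets. The following is a minimal sketch, not a definitive recipe: the helpers write-string-as and read-octets and the demo file names are hypothetical, and it assumes that open accepts :latin1 and :utf-8 as external-format names, which may differ between Lisp implementations.

```lisp
;; Hypothetical helper: write STRING to PATH through EXTERNAL-FORMAT.
(defun write-string-as (string path external-format)
  "Write STRING to PATH, encoding its characters with EXTERNAL-FORMAT."
  (with-open-file (out path :direction :output
                            :if-exists :supersede
                            :external-format external-format)
    (write-string string out))
  path)

;; Hypothetical helper: read PATH back without any character translation.
(defun read-octets (path)
  "Return the contents of PATH as a vector of (unsigned-byte 8)."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      octets)))

;; The character with code 194 (U+00C2) is the single octet 194 in
;; Latin-1 and the two octets 195 130 in UTF-8.
(defparameter *a-circumflex* (string (code-char 194)))

(read-octets (write-string-as *a-circumflex* "ef-demo-latin1.txt" :latin1))
(read-octets (write-string-as *a-circumflex* "ef-demo-utf8.txt" :utf-8))
```

The same Lisp string produces different octets on disk depending only on the external-format attached to the stream, which is the behavior described above.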

2. External format example

The function file-contents takes an external-format argument, so the following is a simple way to translate files from one encoding to another:

;; Set up example by creating a Lisp string containing Japanese characters:
;; 
(defparameter *aiu-string* (coerce '(#\hiragana_letter_a
                                     #\hiragana_letter_i
                                     #\hiragana_letter_u)
                                   'string))

;; Create a file containing the JIS encoding of the string
;;
(setf (file-contents "jis-version.txt" :external-format :jis)
  *aiu-string*)

;; Create a file euc-version.txt containing EUC-encoded text converted
;; from the existing JIS-encoded file jis-version.txt.
;;
(setf (file-contents "euc-version.txt" :external-format :euc)
      (file-contents "jis-version.txt" :external-format :jis))

;; Compare contents
;;
(file-contents "jis-version.txt" :element-type '(unsigned-byte 8))
#(27 36 66 36 34 36 36 36 38 27 40 66)

(file-contents "euc-version.txt" :element-type '(unsigned-byte 8))
#(164 162 164 164 164 166)
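
The two octet vectors above encode the same three characters. In the JIS file, the two-octet codes are bracketed by escape sequences (27 36 66 switches to JIS X 0208 and 27 40 66 switches back to ASCII); in the EUC file there are no escape sequences, and each octet has its high bit set instead. A small sketch of that relationship follows. The function jis-payload->euc is hypothetical and handles only this simple case of a single escaped run; it is not a general converter.

```lisp
;; Hypothetical illustration: derive the EUC octets from the JIS octets
;; for one run of two-octet characters wrapped in 3-octet escape sequences.
(defun jis-payload->euc (jis-octets)
  "Drop the 3-octet escape sequence at each end of JIS-OCTETS and set
the high bit of every remaining octet."
  (map 'vector
       (lambda (octet) (logior octet #x80))
       (subseq jis-octets 3 (- (length jis-octets) 3))))

(jis-payload->euc #(27 36 66 36 34 36 36 36 38 27 40 66))
;; => #(164 162 164 164 164 166)
```

For real files, converting through external-formats as shown above remains the appropriate approach, since it handles mixed ASCII and Japanese text as well as characters outside this simple pattern.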

This is the end of the tutorial. The documentation for international character support in Allegro CL is in doc/iacl.htm.
