| Allegro CL version 8.2 Unrevised from 8.1 to 8.2. 8.1 version |
This document contains the following sections:
1.0 Regular Expression handling in Allegro CL
In the regexp2 module, the functions are named by symbols in the
excl
package, and are named compile-re, match-re,
split-re,
and replace-re. (There is
also an older regexp module which is maintained for backward
compatibility but no longer discussed in this document.)
The regexp2 module provides a fast, mostly-Perl-compatible regular expression matcher. It handles Unicode character set (UCS-2). Symbols in the regexp2 module are in the excl package.
A patch released in February, 2013, modifies split-re to be more like its PERL (and CL-PPCRE) counterparts. This means return values from split-re have changed in some cases. See split-re fior details and examples.
The patch also fixed some other bugs.
A subsequent patch, released in July, 2013, fixed a bug in replace-re having to do with BOS (Beginning of String) and EOS markers. The pre-patch behavior was:
cl-user(133): (replace-re "abc abc bc" "^abc\\s+" "_") "__bc"
Note both occurances of 'abc' were replaced by '_'. That is incorrect. The ^ in the regexp argument indicates that 'abc' should be replaced if it is at the beginning of the string (or at the beginning of a line within the string if the multiple-lines argument is true). After applying the patch, the behavior is corrected to:
cl-user(151): (replace-re "abc abc bc" "^abc\\s+" "_") "_abc bc" ;; Similarly for EOS: ;; Before patch: cl-user(101): (replace-re "abc def " "def$" "_" :end 7) "abc _ " ;; After Patch cl-user(162): (replace-re "abc def " "def$" "_" :end 7) "abc def "
match-re is also changed to add new keyword arguments start-unbounded and end-unbounded, which will find matches when BOS or EOS markers are used only if the match is at the actual start (end) of the string, not where the substring specified by start-unbounded starts or end-unbounded ends. So:
cl-user(152): (match-re "^abc" "abc def") t "abc" cl-user(153): (match-re "^abc" " abc def") nil cl-user(154): (match-re "^abc" " abc def" :start 1) t "abc" cl-user(155): (match-re "^abc" " abc def" :start-unbounded 1) nil cl-user(156):
Thus this patch corrects the results of replace-re in relevant cases but does not change the results of match-re, but match-re does now have additional capability.
You load the regexp2 module with the form
(require :regexp2)
Like Perl, there are four types of 'mode switches' that affect the meaning of regular expressions. The switch can be specified by the keyword arguments passed to regexp APIs, or by embedding 'flags' in the regular expression itself. The following table lists the valid mode switches.
Flag | Keyword argument | Descrption (when 'on') |
i |
:case-fold
|
Case-insensitive match (when true). Currently, case folding is only effective for ASCII characters. |
m |
:multiple-lines
|
Treats the input string as multiple lines. If turned on, "^" and "$" matches the start and end of any line instead of the beginning and end of the whole string. |
s |
:single-line
|
Treats the input string as a single line. If turned on, "." matches any character, even a newline. Normally "." matches any character except a newline. |
x |
:ignore-whitespace
|
Extend syntax. Whitespace in the regular expression is ignored, and comments can be inserted, to increase legibility of the regular expression. |
Within a regular expression, a mode switch can be turned on/off
locally by (?:...)
construct. For example,
(?i:foo)
makes foo
match case
insensitively. (?-i:foo)
makes
foo
match case sensitively. You can combine
multiple flags, like (?im-sx:foo)
. A construct
like (?i)
changes the mode switch until the end of
the current grouping.
The backslash character is treated by the Lisp reader as an escape character, telling the reader to treat the next character as a literal rather than some sort of a control charcater. Thus, suppose you want to make a string of
ab"cd
You want the double quote character to be part of the string, but the reader will misunderstand
"ab"cd"
so you specify it as
"ab\"cd"
When those 8 characters are read, a 5 character string will be stored, with no backslash and the double quote as the third character:
cl-user(4): (setq str "ab\"cd") "ab\"cd" cl-user(5): (length str) 5 cl-user(6): (char str 2) #\" ;; note " not \ cl-user(7):
Backslashes are used extensively in regular expressions. In order to specify a backslash in a string, you enter two backslashes (the first is read as an escape and the second as the character (a backslash) which you want to include. So
cl-user(7): "\|" "|" cl-user(8): "\\|" "\\|" cl-user(9): (length *) 2 cl-user(10): (split-re "\\|" "this|is|a|string") ("this" "is" "a" "string") cl-user(11): ;; Note this give an error: cl-user(11): (split-re "\|" "this|is|a|string") Error: regexp "|" matched zero-length string in split-re Restart actions (select using :continue): 0: Return to Top Level (an "abort" restart). 1: Abort entirely from this (lisp) process. [1] cl-user(12): ;; More examples with tabs (\t): [1] cl-user(12): :pop cl-user(13): (split-re "\\t" "this is a string with tabs") ("this" "is" "a" "string" "with" "tabs") cl-user(14): (split-re "\\x09" "this is a string with tabs") ("this" "is" "a" "string" "with" "tabs") cl-user(15):
Construct | Matches |
x | The character x |
\\ | A backslash character |
\0 | #\null |
\0n | (<= 0 n 7) A character x where (= (char-code x) #on) |
\0nn | (<= 0 n 7) A character x where (= (char-code x) #onn) |
\nn | (<= 0 n 7) A character x where (= (char-code x) #onn), if this cannot be a back-reference |
\mnn | (<= 0 m 3) (<= 0 n 7) A character x where (= (char-code x) #omnn) |
\xhh | A character x where (= (char-code x) #xhh) |
\xh | A character x where (= (char-code x) #xh) if at the end of regexp or followed by non-hexdigit char. |
\x{h...} | A character x where (= (char-code x) #xh...). |
\a | #\bell |
\e | #\esc |
\f | #\form |
\t | #\tab |
\n | #\linefeed |
\r | #\return |
\cx | control character corresponding to x (e.g. \cG for #\bell). |
Construct | Matches |
. | any character except newline (if 's' flag is on, matches any character including newline). |
\d | A digit character [0-9] |
\D | A non-digit character [^0-9] |
\s | A whitespace character [ \n\r\t\f] (that is, #\space, #\newline, #\return, #\tab, and #\formfeed). |
\S | A non-whitespace character [^ \n\r\t\f] |
\w | A word-constituent character. Currently, alphanumeric characters, #\_, and any character whose code is larger than 128 is considered as a word constituent character. The treatment of large character set may be improved in the later versions. |
\W | A non-word-constituent character |
Construct | Matches |
[abc] | a, b, or c |
[^abc] | any character except a, b, or c |
[a-zA-Z] | character range (a through z or A through Z) |
\n, \nn, \mnn, \x, \xh, \xhh, \x{h..} \\, \a, \e, \f, \t, \n, \r, \cx | Octal and hexadecimal character notation, and the above single character escape, are valid in character class as well. |
other | \d, \D, \s, \S, \w, and \W can also be used within character class notation, e.g. [abc\d] == [abc0-9]. |
Construct | Matches |
^ | Beginning of a string. (if 'm' flag is on, matches at the beginning of a string or just after a linefeed). |
$ | End of a string. (if 'm' flag is on, matches at the end of a string or just before a linefeed). |
\b | A word boundary. |
\B | A non word boundary. |
\A | Beginning of a string. |
\Z | End of a string, or before linefeed at the end. |
\z | End of a string. |
Construct | Matches |
X? | Zero or one occurrence of X |
X* | Zero or more occurrence of X |
X+ | One or more occurrence of X |
X{n} | n times of X |
X{n,} | n times or more repetition of X |
X{n,m} | between n and m times of repetition of X |
Construct | Matches |
X?? | Zero or one occurrence of X |
X*? | Zero or more occurrences of X |
X+? | One or more occurrences of X |
X{n}? | n times of X |
X{n,}? | n times or more repetition of X |
X{n,m}? | between n and m times of repetition of X |
Construct | Matches |
XY | X followed by Y |
X|Y | X or Y |
See Section 2.3 Capturing and back reference below for the operation of capturing and reference.
Construct | Matches |
(X) | Groups X, and captures submatch. |
(?<name>X) | Groups X, and captures submatch, assigning name as the name of the submatch. Name must consist of a sequence of word-constituent characters. |
(?:X) | Groups X, but does no captures. |
(?imsx-imsx:X) | Groups X, with turning on/off the given flags. |
(?imsx-imsx) | Successfully matches nothing, and changes the given flags within the current group. |
(?=X) | Zero-width positive look-ahead |
(?!X) | Zero-width negative look-ahead |
(?<=X) | Zero-width positive look-behind |
(?<!X) | Zero-width negative look-behind |
(?>X) | Matches X, and never backtrack. (also known as "standalone" group) |
(?(n)Y) | If submatch N has a value, match Y. |
(?(n)Y|Z) | If submatch X has a value, match Y; otherwise match Z. |
(?(?=X)Y|Z) (?(?!X)Y|Z) (?(?<=X)Y|Z) (?(?<!X)Y|Z) |
Try the look-ahead or look-behind assertion of X, and if succeeds, match Y; otherwise match Z. '|Z' part can be omitted. |
Construct | Matches |
\n | Matches the n-th submatch (n>=1). If n-th submatch doesn't have a value, it doesn't match anything (match just fails). If n is more than one digit, it becomes a back reference only if there are that many capturing groups in the regular expressions; otherwise, it is interpreted as an octal character notation if possible. |
\k<name> | Matches the submatch with name. If there are more than one submatches that has the name, this tries to match the one that has the biggest number first, and if it fails, the smaller numbered submatches are tried. |
Capturing submatches (X)
and
(?<name>X)
are numbered in the order of its
opening parenthesis from left to right. Named submatch is counted the
same as unnamed submatch, and can be back-referenced by both name and
number.
When the input string matches X, the portion of the input string is saved. It can be referenced within a regular expression, by the back reference construct, or can be returned to the user program as a submatch.
If the capturing construct matches more than once, it saves the last match.
Most functions that accept a regexp string such as match-re and compile-re also accept a regexp tree. A regexp tree is an s-expression with the syntax described below. The syntax was defined by CL-PPCRE and is intended to be compatible with it. This documentation was taken from the CL-PPCRE documentation, available as http://www.weitz.de/cl-ppcre/.
Programmers are usually most familiar with regexp string syntax, and it suffices for many normal regexp applications. However, string syntax does not scale very well -- complex regexps are hard to write and even harder to parse. The frequent need for backslach escapes is a further complication. In such cases, programmers may find tree syntax easiler to code in Lisp source code editors with their autmatic indentation and parentheses matching.
Further, in any application that generates regular expressions on the fly will undoubtedly find it easier to generate s-expr tress than trying to perform and extra level of encoding into string syntax, only to force the regexp system immediately to parse the string back.
Tree syntax is as follows:
:void
is equivalent to the empty string.
:everything
is equivalent to Perl's dot,
i.e it matches everything (except maybe a newline character depending
on the mode).
:word-boundary
and
:non-word-boundary
are equivalent to Perl's "\b"
and "\B".
:digit-class
,
:non-digit-class
,
:word-char-class
,
:non-word-char-class
,
:whitespace-char-class
, and
:non-whitespace-char-class are equivalent to
Perl's special character classes "\d", "\D", "\w", "\W", "\s", and
"\S" respectively.
:start-anchor
,
:end-anchor
,
:modeless-start-anchor
,
:modeless-end-anchor
, and
:modeless-end-anchor-no-newline
are equivalent to
Perl's "^", "$", "\A", "\Z", and "\z" respectively.
:case-insensitive-p
,
:case-sensitive-p
,
:multi-line-mode-p
,
:not-multi-line-mode-p
,
:single-line-mode-p
, and
:not-single-line-mode-p
are equivalent to Perl's
embedded modifiers "(?i)", "(?-i)", "(?m)", "(?-m)", "(?s)", and
"(?-s)". As usual, changes applied to modes are kept local to the
innermost enclosing grouping or clustering construct.
(:flags {<modifier>}*)
where <modifier>
is one of the modifier symbols from above is used to group modifier
symbols. The modifiers are applied from left to right. (This construct
is obviously redundant. It is only there because it's used by the
parser.)
(:sequence {<parse-tree>}*)
means a sequence
of parse trees, i.e. the parse trees must match one after
another. Example: (:sequence #\f #\o #\o)
is
equivalent to the parse tree "foo".
(:group {<parse-tree>}*)
is like
:sequence
but changes applied to modifier flags
(see above) are kept local to the parse trees enclosed by this
construct. Think of it as the S-expression variant of Perl's
"(?:<pattern>)" construct.
(:alternation {<parse-tree>}*)
means an
alternation of parse trees, i.e. one of the parse trees must
match. Example: (:alternation #\b #\a #\z)
is
equivalent to the Perl regex string "b|a|z".
(:branch <test> <parse-tree>)
is for
conditional regular expressions with the syntax
(?(condition)yes-pattern) (?(condition)yes-pattern|no-pattern)
<test> is either a number which stands for a register or a parse tree which is a look-ahead or look-behind assertion. If the parse-tree following the condition is an alteration, it must enclose exactly one or two parse trees where the second one (if present) will be treated as the "no-pattern" - in all other cases <parse-tree> will be treated as the "yes-pattern", executed if the condition matched. If the condition refers to a register, the condition succeeds if that register already succeeded in matching somehing.
(:positive-lookahead|:negative-lookahead|:positive-lookbehind|:negative-lookbehind
<parse-tree>)
-- the meaning should be pretty obvious.
(:greedy-repetition|:non-greedy-repetition
<min> <max> <parse-tree>)
where <min> is a
non-negative integer and <max> is either a non-negative integer
not smaller than <min> or nil
will
result in a regular expression which tries to match <parse-tree>
at least <min> times and at most <max> times (or as often
as possible if <max> is nil
). So, e.g.,
(:non-greedy-repetition 0 1 "ab")
is equivalent to
the Perl regex string "(?:ab)??".
(:standalone <parse-tree>)
is an
"independent" subexpression, i.e. (:standalone
"bar")
is equivalent to the Perl regex string "(?>bar)".
(:register <parse-tree>)
is a capturing
register group. As usual, registers are counted from left to right
beginning with 1.
(:back-reference <number>)
where
<number> is a positive integer is a back-reference to a register
group. (:char-class|:inverted-char-class
{<item>}*)
where <item> is either a character, a
character range, or a symbol for a special character class (see above)
will be translated into a (one character wide) character class. A
character range looks like (:range <char1>
<char2>)
where <char1> and <char2> are
characters such that (char<= <char1>
<char2>)
is true. Example:
(:inverted-char-class #\a (:range #\d #\g)
:digit-class)
is equivalent to the Perl regex string
"[^aD-G\d]".
There is a small region of ambiguity between string syntax and tree syntax. Although a single string is a valid parse tree, it will be interpreted as a Perl regexp strings when given to compile-re and friends. To circumvent this you can use the equivalent parse tree (:GROUP <string>). See also quote-re.
If you want to find out how parse trees are related to Perl regex strings you should play around with parse-re - a function which converts a Perl regexp strings to a parse tree.
There are seven functions (listed first) and three macros in the API:
\N{name} named char \lx lowercase x \ux uppercase x \Lx..\E lowercase x.. \Ux..\E uppercase x.. \Qx..\E quote non-alphanumeric chars in x.
Theoretically ACL's regexp library uses the same mechanism that Perl and CL-PPCRE are using: Nondeterministic finite automaton (NFA). When there are multiple possibilies of matching, it "remembers" the current state and tries each possibility one at a time. If a trial fails, it backs up the last saved state and tries the next possibility; that is called a "backtrack". It is possible to compose a very short regular expression that does a huge number of backtracks; if you have nested repetitions and alternations, the number of required backtracks grows exponentially.
The regexp optimizer tries to reduce the number of backtracks, but it is not always possible. Here are some tips to improve the performance of the matcher.
(?>X)
in
the inner loop. It effectively turns the nested repetition into
unnested repetition from the viewpoint of NFA. Lots of optimizations
can be done for unnested repetition, but not much can be done for
a nested one.
Note that an unnested greedy repetition followed by a character or a
character set that are exclusive to the beginning of the repetition
becomes a non-backtracking regexp; for example, the regular expression
(\w+)\s+
runs without backtracking.
Copyright (c) 1998-2016, Franz Inc. Oakland, CA., USA. All rights reserved.
This page was not revised from the 8.1 page.
Created 2010.1.21.
| Allegro CL version 8.2 Unrevised from 8.1 to 8.2. 8.1 version |