Allegro CL version 10.1
Moderately revised from 10.0.
10.0 version

URI and IRI support in Allegro CL

This document contains the following sections:

1.0 Introduction
1.1 RFC2396 no longer governs
2.0 The URI and IRI API definition
3.0 Parsing, escape decoding/encoding and the path
4.0 Interning URIs
5.0 Allegro CL implementation notes
6.0 Deviations from the RFC grammars and strict parsing
7.0 Examples

1.0 Introduction

URI stands for Uniform Resource Identifier. URIs are described in RFC3986, which replaces the obsolete RFC2396 and is available in several places, including the IETF web site (https://tools.ietf.org/html/rfc3986). The related URN syntax is described in RFC8141 (https://tools.ietf.org/html/rfc8141).

IRI stands for Internationalized Resource Identifier, which is (according to Wikipedia) "an internet protocol standard which extends the ASCII characters subset of the Uniform Resource Identifier (URI) protocol". It is defined in RFC 3987. While URIs are limited to a subset of the ASCII character set, IRIs may contain characters from the Universal Character Set (Unicode/ISO 10646), including Chinese or Japanese kanji, Korean, Cyrillic characters, and so forth.

URIs are a superset, in both functionality and syntax, of URLs (Uniform Resource Locators) and URNs (Uniform Resource Names).

In URL slang, the scheme is usually called the `protocol', but it is called scheme in RFC1738. A URL `host' corresponds to the URI `authority.' The URL slang `bookmark' or `anchor' is `fragment' in URI lingo.

The URI facility might not be in an Allegro CL image by default. Evaluate (require :uri) to ensure the facility is loaded (that form returns nil if the URI module is already loaded).

Broadly, the URI facility creates a Lisp object that represents a URI, and provides setters and accessors to fields in the URI object. The URI object can also be interned, much like symbols in CL are. This document describes the facility and the related operators.

Aside from the obvious slots which are called out in the RFC, URIs also have a property list. Along with interning, this is another similarity between URIs and CL symbols.
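
As a minimal sketch of how the property list might be used, the following attaches application data to a URI object. It assumes that uri-plist is a setf-able place like the other slot accessors; the :last-checked indicator and the *u* variable are hypothetical names chosen for illustration.

```lisp
(require :uri)

;; Parse a URI so there is an object to hang properties on.
(defparameter *u* (net.uri:parse-uri "http://franz.com/index.html"))

;; Assumption: uri-plist is setf-able, so getf can store into it.
;; :last-checked is a hypothetical indicator, not part of the module.
(setf (getf (net.uri:uri-plist *u*) :last-checked) (get-universal-time))

;; Read the property back, much as (get symbol indicator) does for symbols.
(getf (net.uri:uri-plist *u*) :last-checked)
```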

1.1 RFC2396 no longer governs

Allegro CL used to process URIs according to RFC2396. RFC3986 is now used instead. The change was made in an update released in September, 2018. The change affects one important area:

(net.uri:merge-uris (net.uri:parse-uri "?bar")
                    (net.uri:parse-uri "http://example.com/foo"))
RETURNS #<uri http://example.com/foo?bar>

RATHER THAN #<uri http://example.com/?bar>

Other than that, there are new fields and accessors (such as the URN accessor urn-q-component, as called for in RFC8141).

2.0 The URI and IRI API definition

The uri module, which can be loaded with (require :uri), contains both the URI and the IRI functionality. Symbols naming objects (functions, variables, etc.) in the uri module are exported from the net.uri package.

URIs are represented by CLOS objects. Their slots are:

scheme 
host 
port 
path 
query
fragment 
plist 
ipv6
zone-id

The host and port slots together correspond to the authority (see RFC3986). There is an accessor-like function, uri-authority, that can be used to extract the authority from a URI. See the RFC3986 specification cited at the beginning of Section 1.0 Introduction for details of all the slots except plist. The plist slot contains a standard Common Lisp property list.

IRIs are also represented by CLOS objects and have the same slots as URI objects.

All symbols are external in the net.uri package, unless otherwise noted. Brief descriptions are given in this document, with complete descriptions in the individual pages.

uri: the class of URI objects.
urn: the class of URN objects.
iri: the class of IRI objects. This is a subclass of uri.

uri-p
Arguments: object

Returns true if object is an instance of class uri. Because iri is a subclass of uri, this method also returns true for all IRI objects.

iri-p
Arguments: object

Returns true if object is an instance of class iri.

copy-uri
Arguments: uri &key place scheme host port path query fragment plist

Copies the specified URI object. See the description page for information on the keyword arguments.

uri-scheme
uri-host
uri-port
uri-path
uri-query
uri-fragment
uri-plist
uri-ipv6
uri-zone-id
Arguments: uri-object

These accessors return the values of the associated slots of uri-object.

uri-authority
Arguments: uri-object

Returns the authority of uri-object. The authority combines the host and port.

render-uri
Arguments: uri stream

Print to stream the printed representation of uri. This operator is the inverse of parse-uri. This operator should not be applied to the output of string-to-uri.

parse-uri
Arguments: string &key (class (quote uri))

Parse string into a URI object. This operator is the inverse of render-uri. This operator should not be applied to the output of uri-to-string.

uri-to-string
Arguments: uri

Convert uri to a string. This operator is the inverse of string-to-uri. This operator should not be applied to the output of render-uri.

string-to-uri
Arguments: string

Parse string into a URI object if possible. Signal an error if not possible. This operator is the inverse of uri-to-string. This operator should not be applied to the output of render-uri. string-to-uri differs from parse-uri in that it does not decode the query portion of the string, while parse-uri does.

iri-to-string
Arguments: iri

Convert iri to a string. This operator is the inverse of string-to-iri.

string-to-iri
Arguments: string

Parse string into an IRI object if possible. Signal an error if not possible. This operator is the inverse of iri-to-string. Like string-to-uri and unlike parse-uri, string-to-iri does not decode the query portion of the string.

merge-uris
Arguments: uri base-uri &optional place

Return an absolute URI, based on uri, which can be relative, and base-uri, which must be absolute.

enough-uri
Arguments: uri base

Converts uri into a relative URI using base as the base URI.

uri-parsed-path
Arguments: uri

Return the parsed representation of the path.

uri
Arguments: object

Defined methods: if the argument is a uri object, return it; otherwise create a uri object from it if possible and return that, or signal an error if not possible.

pathname-to-uri
Arguments: pathname

Converts a pathname to a file scheme URI.

uri-to-pathname
Arguments: uri

Converts a file scheme URI to a pathname.

urn-nid
urn-nss
urn-q-component
urn-r-component
urn-f-component
Arguments: urn-object

These accessors return the values of the associated slots of urn-object.

The variable *strict-parse* controls how strictly the parser observes syntax rules (many websites violate these rules and so will not parse when they are strictly observed). *strict-parse* has initial value t.
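
Several of the operators listed in this section can be exercised together. The following is a sketch only, using the operator names and argument orders given above; the variable names are illustrative, and no particular printed output is claimed here.

```lisp
(require :uri)

(let* ((base (net.uri:parse-uri "http://franz.com/foo/bar/"))
       (relative (net.uri:parse-uri "baz.htm"))
       ;; merge-uris resolves a (possibly relative) URI against an absolute base.
       (absolute (net.uri:merge-uris relative base)))
  ;; Slot accessors read individual components of the merged URI.
  (format t "scheme: ~s~%host: ~s~%path: ~s~%"
          (net.uri:uri-scheme absolute)
          (net.uri:uri-host absolute)
          (net.uri:uri-path absolute))
  ;; uri-authority combines the host and port.
  (format t "authority: ~s~%" (net.uri:uri-authority absolute))
  ;; render-uri writes the printed representation to a stream.
  (net.uri:render-uri absolute *standard-output*)
  (terpri)
  ;; enough-uri goes the other way and relativizes against the base.
  (format t "relative to base: ~a~%" (net.uri:enough-uri absolute base)))
```

Package prefixes are used instead of use-package so that the sketch does not depend on the current package.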

3.0 Parsing, escape decoding/encoding and the path

The method uri-path returns the path portion of the URI, in string form. The method uri-parsed-path returns the path portion of the URI, in list form. This list form is discussed below, after a discussion of decoding/encoding.

RFC2396 lays out a method for inserting reserved characters into URIs: you escape the character. An escaped character is defined like this:

escaped = "%" hex hex 

hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | "a" | "b" | "c" | "d" | "e" | "f"

In addition, the RFC defines excluded characters:

"<" | ">" | "#" | "%" | <"> | "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

The set of reserved characters is:

";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","

with the following exceptions:

within the authority component, the characters ";", ":", "@", "?", and "/" are reserved.
within a path segment, the characters "/", ";", "=", and "?" are reserved.
within a query component, the characters ";", "/", "?", ":", "@", "&", "=", "+", ",", and "$" are reserved.

From the RFC, there are two important rules about escaping and unescaping (encoding and decoding):

decoding should only happen when the URI is parsed into component parts;
encoding can only occur when a URI is made from component parts (i.e., rendered for printing).

The implication is that a URI must be in a parsed state before it can be decoded. That is, you can't convert %2f (the escaped form of "/") until the path has been parsed into its component parts. It is equally important that an application viewing the component parts sees their decoded values. For example, consider:

http://franz.com/calculator/3%2f2

This might be the implementation of a calculator, and how someone would execute 3/2. Clearly, the application that implements this would want to see path components of "calculator" and "3/2". "3%2f2" would not be useful to the calculator application.

For the reasons given above, a parsed version of the path is available and has the following form:

([:absolute | :relative] component1 [component2...])

where components are:

element | (element param1 [param2 ...])

and element is a path element, and the params are path element parameters. For example, the result of

(uri-parsed-path (parse-uri "foo;10/bar:x;y;z/baz.htm"))

is

(:relative ("foo" "10") ("bar:x" "y" "z") "baz.htm")
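
Applied to the calculator URI discussed earlier in this section, the parsed path can be taken apart as in the following sketch. Per the discussion above, the components would be expected to be the decoded strings "calculator" and "3/2"; the variable names here are illustrative only.

```lisp
(require :uri)

(let* ((u (net.uri:parse-uri "http://franz.com/calculator/3%2f2"))
       (parsed (net.uri:uri-parsed-path u)))
  ;; The first element is :absolute or :relative; the rest are components.
  (destructuring-bind (kind &rest components) parsed
    (format t "kind: ~s~%" kind)
    (dolist (c components)
      ;; A component is either a string or a list (element param1 ...).
      (format t "component: ~s~%" c))))
```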

There is a certain amount of canonicalization that occurs when parsing:

A path of (:absolute) or (:absolute "") is equivalent to a nil path. That is, http://a/ is parsed with a nil path and printed as http://a.
Escaped characters that are not reserved are not escaped upon printing. For example, "foob%61r" is parsed into "foobar" and appears as "foobar" when the URI is printed.
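
The second rule can be observed by rendering a parsed URI back to a string. This is a sketch only: according to the rule above, the escaped "%61" should come back as a plain "a", but the result is not asserted here.

```lisp
(require :uri)

;; %61 is the escaped form of "a", which is not a reserved character.
(let ((u (net.uri:parse-uri "foob%61r")))
  ;; Render to a string rather than relying on the REPL printer.
  (with-output-to-string (s)
    (net.uri:render-uri u s)))
```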

The variable *strict-parse* controls how strictly the parser observes syntax rules (many websites violate these rules and so will not parse when they are strictly observed).

4.0 Interning URIs

This section describes how to intern URIs. Interning is not mandatory. URIs can be used perfectly well without interning them.

Interned URIs in Allegro are like symbols. That is, a string representing a URI, when parsed and interned, will always yield an eq object. For example:

(eq (intern-uri "http://franz.com") 
    (intern-uri "http://franz.com"))

is always true. (Note that two strings with identical contents may or may not be eq in Common Lisp.)

The functions associated with interning are:

make-uri-space
Arguments: &key size

Make a new hash-table object to contain interned URIs.

uri-space
Arguments:

Return the object into which URIs are currently being interned.

uri=
Arguments: uri1 uri2

Returns true if uri1 and uri2 are equivalent.

intern-uri
Arguments: uri-name &optional uri-space

Intern the uri object specified in the uri-space specified. Methods exist for strings and uri objects.

unintern-uri
Arguments: uri &optional uri-space

Unintern the uri object specified or, if uri is t, all uri objects (in uri-space if specified).

do-all-uris
Arguments: (var &optional uri-space result) &body body

Bind var to each currently interned uri in turn (in uri-space if specified) and evaluate body.
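
The interning functions can be combined as in the following sketch, which keeps its URIs in a private space instead of the default one. The *my-space* variable is a hypothetical name used only for this example.

```lisp
(require :uri)

;; A private space keeps these URIs apart from the default uri-space.
(defparameter *my-space* (net.uri:make-uri-space))

(let ((a (net.uri:intern-uri "http://franz.com/foo" *my-space*))
      (b (net.uri:intern-uri "http://franz.com/foo" *my-space*)))
  ;; Interning the same string twice in one space yields the same object.
  (format t "eq: ~s~%" (eq a b))
  ;; Walk every URI currently interned in the space.
  (net.uri:do-all-uris (u *my-space*)
    (format t "interned: ~a~%" u))
  ;; Passing t uninterns everything in the given space.
  (net.uri:unintern-uri t *my-space*))
```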

5.0 Allegro CL implementation notes

The following are true:

(uri= (parse-uri "http://franz.com/")
      (parse-uri "http://franz.com"))
(eq (intern-uri "http://franz.com/")
    (intern-uri "http://franz.com"))

The following is true:

(eq (intern-uri "http://franz.com:80/foo/bar.htm")
    (intern-uri "http://franz.com/foo/bar.htm"))

(I.e., specifying the default port is the same as specifying no port at all. This is specified in RFC2396.)

The scheme and authority are case-insensitive. In Allegro CL, the scheme is a keyword that appears in the normal case for the Lisp in which you are executing.

#u"..." is shorthand for (parse-uri "...") but if a #u dispatch macro definition already exists, it will not be overridden.

Setting the scheme, host, port, path, query, or fragment slots of a URI object that has been interned will have very bad and unpredictable results.

The printable representation of URIs is cached, for efficiency. This caching is undone when the above slots are changed. That is, when you create a URI the printed representation is cached. When you change one of the above mentioned slots, the printed representation is cleared and recalculated when the URI is next printed. For example:

user(10): (setq u #u"http://foo.bar.com/foo/bar") 
#<uri http://foo.bar.com/foo/bar> 
user(11): (setf (net.uri:uri-host u) "foo.com") 
"foo.com" 
user(12): u 
#<uri http://foo.com/foo/bar> 
user(13):

This allows URI behavior to follow the principle of least surprise.

6.0 Deviations from the RFC grammars and strict parsing

There are deviations from the grammar in the RFCs. The special variable net.uri:*strict-parse* controls whether the parser is RFC compliant. When net.uri:*strict-parse* is nil, parsing differs in these ways:

URI queries allow "|" and "^" characters
URI fragments allow the "#" character

Both of these changes are necessary for parsing URIs available in the wild.
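
One way an application might use this is to try a strict parse first and fall back to a lenient one. The following is a sketch under that assumption; parse-leniently is a hypothetical helper, not part of the uri module, and the example URL is made up.

```lisp
(require :uri)

;; parse-leniently is a hypothetical helper, not part of net.uri.
(defun parse-leniently (string)
  "Try a parse under the current settings; on any error, retry with
*strict-parse* bound to nil."
  (handler-case (net.uri:parse-uri string)
    (error ()
      ;; Rebinding with let confines the relaxed rules to this one call.
      (let ((net.uri:*strict-parse* nil))
        (net.uri:parse-uri string)))))

;; A query containing "|", one of the characters the lenient mode accepts.
(parse-leniently "http://example.com/search?q=a|b")
```

Binding the variable dynamically, instead of setting it globally, leaves strict parsing in force for the rest of the program.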

7.0 Examples

uri(10): (use-package :net.uri)
t
uri(11): (parse-uri "foo")
#<uri foo>
uri(12): #u"foo"
#<uri foo>
uri(13): (setq base (intern-uri "http://franz.com/foo/bar/"))
#<uri http://franz.com/foo/bar/>
uri(14): (merge-uris (parse-uri "foo.htm") base)
#<uri http://franz.com/foo/bar/foo.htm>
uri(15): (merge-uris (parse-uri "?foo") base)
#<uri http://franz.com/foo/bar/?foo>
uri(16): (setq base (intern-uri "http://franz.com/foo/bar/baz.htm"))
#<uri http://franz.com/foo/bar/baz.htm>
uri(17): (merge-uris (parse-uri "foo.htm") base)
#<uri http://franz.com/foo/bar/foo.htm>
uri(18): (merge-uris #u"?foo" base)
#<uri http://franz.com/foo/bar/baz.htm?foo>
uri(19): (describe #u"http://franz.com")
#<uri http://franz.com> is an instance of #<standard-class net.uri:uri>:
 The following slots have :instance allocation:
  net.uri::scheme      :http
  net.uri::userinfo    nil
  net.uri::port        nil
  net.uri::path        nil
  net.uri::query       nil
  net.uri::fragment    nil
  net.uri::plist       nil
  net.uri::.host       "franz.com"
  net.uri::.ipv6       nil
  net.uri::.zone-id    nil
  net.uri::escaped     nil
  string               "http://franz.com"
  net.uri::parsed-path nil
  net.uri::hashcode    nil

uri(20): #u"foobar#baz%23xxx"
#<uri foobar#baz#xxx>