Allegro CL version 8.2. Unchanged from 8.1.
This is a preliminary document which will be updated over time. Updates will be available for downloading.
Introduction and a simple example
LXML parse output format
parse-xml non-validating parser properties
PXML compatibility with the SAX parser
Usage notes
parse-xml requires Modern Lisp's mixed case and international character support
parse-xml and packages
parse-xml, the XML Namespace specification, and packages
The parse-xml function has been tested
ACL does not support Unicode 4 byte scalar values
debugging aids
XML Conformance test results
parse-xml reference
*debug-xml*
*debug-dtd*
The pxml module is provided for compatibility with existing applications. There are known errors and omissions in this module. The errors and omissions are corrected in the new sax module. The pxml-sax module implements the pxml API on top of the sax module and supports most of the functionality in the pxml module.
We recommend that users migrate applications to the pxml-sax module (see sax.htm). In most cases, the only change needed is to replace (require :pxml) with (require :pxml-sax).
If an application depends on some feature of pxml that is not supported by pxml-sax, please contact customer support ([email protected]).
If an application depends on the incorrect or idiosyncratic behavior of pxml, we recommend that the application be corrected.
The parse-xml generic function processes XML input, returning a list of XML tags, attributes, and text.
The :pxml module is loaded with the form (require :pxml). Symbols naming functionality in the module are in the net.xml.parser package. Examples in this document assume (use-package :net.xml.parser) has been evaluated.
Here is a simple example:
(parse-xml "<item1><item2 att1='one'/>this is some text</item1>")
-->
((item1 ((item2 att1 "one")) "this is some text"))
The output format is known as LXML (Lisp XML) format.
LXML is a list representation of XML tags and content.
Each list member may be:
1. a string containing text content
2. a list representing an XML tag with associated attributes and/or content
3. a symbol representing an XML tag with no associated attributes or content
More on members of type 2: if the XML tag does not have associated attributes, then the first list member will be a symbol representing the XML tag, and the other elements will represent the content, which can be a string (text content), a symbol (XML tag with no attributes or content), or a list (nested XML tag with associated attributes and/or content). If there are associated attributes, then the first list member will be a list containing a symbol followed by two list members for each associated attribute; the first member is a symbol representing the attribute, and the next member is a string corresponding to the attribute value.
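The shapes described above can be taken apart with a few small helpers. The following is a minimal sketch in portable Common Lisp; the names lxml-tag, lxml-attributes, and lxml-children are hypothetical and are not part of the pxml module. It works on the single member returned by the simple example, written out as a quoted list so that it does not depend on parse-xml being loaded.

```lisp
;; Hypothetical helpers (not part of pxml) for taking apart one LXML member.

(defun lxml-tag (node)
  "Return the symbol naming NODE's XML tag, or NIL when NODE is a string."
  (cond ((symbolp node) node)                    ; tag with no attributes or content
        ((consp node)
         (let ((head (first node)))
           ;; HEAD is either (tag attr value ...) or a bare tag symbol.
           (if (consp head) (first head) head)))
        (t nil)))                                ; text content

(defun lxml-attributes (node)
  "Return NODE's attributes as a flat list: attribute symbol, value string, ..."
  (if (and (consp node) (consp (first node)))
      (rest (first node))
      '()))

(defun lxml-children (node)
  "Return the content members of NODE, or NIL when NODE is a symbol or string."
  (if (consp node) (rest node) '()))

;; The single member from the simple example's result.
(defparameter *lxml-demo-node*
  '(item1 ((item2 att1 "one")) "this is some text"))
```

With these definitions, (lxml-tag *lxml-demo-node*) returns the symbol item1, and applying lxml-attributes to the first child returns the list (att1 "one").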
parse-xml is a non-validating XML parser. It will detect non-well-formed XML input. When processing valid XML input, parse-xml will optionally produce the same output as a validating parser would, including the processing of an external DTD subset and external entity declarations.
By default, parse-xml outputs a DTD parse along with the parsed XML contents. The DTD parse may be optionally suppressed. The following example shows DTD parsed output components:
(defvar *xml-example-external-url*
  "<!ENTITY ext1 'this is some external entity %param1;'>")

(defun example-callback (var-name token &optional public)
  (declare (ignorable token public))
  (setf var-name (uri-path var-name))
  (if* (equal var-name "null")
     then nil
     else (let ((string (eval (intern var-name (find-package :user)))))
            (make-string-input-stream string))))

(defvar *xml-example-string*
  "<?xml version='1.0' encoding='utf-8'?>
<!-- the following XML input is well-formed but may or may not be valid -->
<?piexample this is an example processing instruction tag ?>
<!DOCTYPE example SYSTEM '*xml-example-external-url*' [
   <!ELEMENT item1 (item2* | (item3+ , item4))>
   <!ELEMENT item2 ANY>
   <!ELEMENT item3 (#PCDATA)>
   <!ELEMENT item4 (#PCDATA)>
   <!ATTLIST item1
        att1 CDATA #FIXED 'att1-default'
        att2 ID #REQUIRED
        att3 ( one | two | three ) 'one'
        att4 NOTATION ( four | five ) 'four' >
   <!ENTITY % param1 'text'>
   <!ENTITY nentity SYSTEM 'null' NDATA somedata>
   <!NOTATION notation SYSTEM 'notation-processor'>
]>
<item1 att2='1'><item3>&ext1;</item3></item1>")

(pprint (parse-xml *xml-example-string*
                   :external-callback 'example-callback))
-->
((:xml :version "1.0" :encoding "utf-8")
 (:comment " the following XML input is well-formed but may or may not be valid ")
 (:pi :piexample "this is an example processing instruction tag ")
 (:DOCTYPE :example
  (:[
   (:ELEMENT :item1 (:choice (:* :item2) (:seq (:+ :item3) :item4)))
   (:ELEMENT :item2 :ANY)
   (:ELEMENT :item3 :PCDATA)
   (:ELEMENT :item4 :PCDATA)
   (:ATTLIST item1
    (att1 :CDATA :FIXED "att1-default")
    (att2 :ID :REQUIRED)
    (att3 (:enumeration :one :two :three) "one")
    (att4 (:NOTATION :four :five) "four"))
   (:ENTITY :param1 :param "text")
   (:ENTITY :nentity :SYSTEM "null" :NDATA :somedata)
   (:NOTATION :notation :SYSTEM "notation-processor"))
  (:external (:ENTITY :ext1 "this is some external entity text")))
 ((item1 att1 "att1-default" att2 "1" att3 "one" att4 "four")
  (item3 "this is some external entity text")))
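When the DTD parse is not suppressed, a caller will often want to separate the prolog and DTD entries from the element content. The following is a minimal sketch in portable Common Lisp. It assumes, based only on the example output above, that prolog and DTD entries are lists headed by a keyword (such as :xml, :comment, :pi, or :DOCTYPE) and that element content is not. The function names are hypothetical and not part of pxml, and the sample data is an abbreviated, hand-written stand-in for real parse-xml output.

```lisp
;; Hypothetical helpers (not part of pxml). Assumption: prolog and DTD
;; entries in the parse result are lists whose first member is a keyword.

(defun lxml-prolog-entry-p (item)
  "True when ITEM looks like a prolog or DTD entry, e.g. (:comment ...)."
  (and (consp item) (keywordp (first item))))

(defun lxml-element-content (parse-result)
  "Return PARSE-RESULT without its prolog and DTD entries."
  (remove-if #'lxml-prolog-entry-p parse-result))

(defun lxml-doctype-entry (parse-result)
  "Return the DOCTYPE entry of PARSE-RESULT, or NIL when there is none.
The name is compared case-insensitively so the sketch behaves the same
in case-sensitive and case-insensitive Lisp images."
  (find-if (lambda (item)
             (and (lxml-prolog-entry-p item)
                  (string-equal (symbol-name (first item)) "DOCTYPE")))
           parse-result))

;; Abbreviated, hand-written sample in the shape of the output above.
(defparameter *lxml-demo-parse*
  '((:xml :version "1.0" :encoding "utf-8")
    (:comment " sample ")
    (:DOCTYPE :example)
    ((item1 att2 "1") (item3 "text"))))
```

Calling lxml-element-content on the sample leaves only the item1 element, which is roughly what the content-only argument of parse-xml is meant to provide directly.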
Allegro CL also offers a SAX parser (described in sax.htm). There is a PXML-SAX compatibility package (see the LXML section in sax.htm).
Usage notes:
1. parse-xml requires Modern Lisp's mixed case and international character support
2. parse-xml and packages
3. parse-xml, the XML Namespace specification, and packages
4. The parse-xml function has been tested
   ACL does not support Unicode 4 byte scalar values
5. debugging aids
(setf *xml-example-string4*
  "<bibliography xmlns:bib='http://www.bibliography.org/XML/bib.ns' xmlns='urn:royal-mail.gov.uk/XML/ns/postal.ns,1999'> <bib:book owner='Smith'> <bib:title>A Tale of Two Cities</bib:title> <bib:bibliography xmlns:bib='https://franz.com/XML/bib.ns' xmlns='urn:royal-mail2.gov.uk/XML/ns/postal.ns,1999'> <bib:library branch='Main'>UK Library</bib:library> <bib:date calendar='Julian'>1999</bib:date> </bib:bibliography> <bib:date calendar='Julian'>1999</bib:date> </bib:book> </bibliography>")

(setf *uri-to-package* nil)
(setf *uri-to-package*
  (acons (net.uri:parse-uri "http://www.bibliography.org/XML/bib.ns")
         (make-package "bib")
         *uri-to-package*))
(setf *uri-to-package*
  (acons (net.uri:parse-uri "urn:royal-mail.gov.uk/XML/ns/postal.ns,1999")
         (make-package "royal")
         *uri-to-package*))
(setf *uri-to-package*
  (acons (net.uri:parse-uri "https://franz.com/XML/bib.ns")
         (make-package "franz-ns")
         *uri-to-package*))

(pprint (multiple-value-list
         (parse-xml *xml-example-string4*
                    :uri-to-package *uri-to-package*)))
-->
((((bibliography |xmlns:bib| "http://www.bibliography.org/XML/bib.ns"
    xmlns "urn:royal-mail.gov.uk/XML/ns/postal.ns,1999")
   " "
   ((bib::book royal::owner "Smith")
    " "
    (bib::title "A Tale of Two Cities")
    " "
    ((bib::bibliography royal::|xmlns:bib| "https://franz.com/XML/bib.ns"
      royal::xmlns "urn:royal-mail2.gov.uk/XML/ns/postal.ns,1999")
     " "
     ((franz-ns::library net.xml.namespace.0::branch "Main") "UK Library")
     " "
     ((franz-ns::date net.xml.namespace.0::calendar "Julian") "1999")
     " ")
    " "
    ((bib::date royal::calendar "Julian") "1999")
    " ")
   " "))
 ((#<uri urn:royal-mail2.gov.ukXML/ns/postal.ns,1999> . #<The net.xml.namespace.0 package>)
  (#<uri https://franz.com/XML/bib.ns> . #<The franz-ns package>)
  (#<uri urn:royal-mail.gov.ukXML/ns/postal.ns,1999> . #<The royal package>)
  (#<uri http://www.bibliography.org/XML/bib.ns> . #<The bib package>)))

In the absence of XML Namespace attributes, element and attribute symbols are interned in the current package.
Note that this implies that attributes and elements referenced in DTD content will be interned in the current package.
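To check which packages a parse actually interned its symbols in, the LXML can be walked and the home package of every tag and attribute symbol collected. The following is a minimal sketch in portable Common Lisp; lxml-symbols and lxml-package-names are hypothetical names, not part of pxml, and the demo package LXML-DEMO-BIB merely stands in for a namespace package. The sketch relies on the fact that attribute values and text content are strings, so every symbol in the tree names a tag or an attribute.

```lisp
;; Hypothetical helpers (not part of pxml) for inspecting symbol packages.

(defun lxml-symbols (lxml)
  "Return every tag and attribute symbol in LXML, in document order."
  (cond ((null lxml) '())
        ((symbolp lxml) (list lxml))
        ((consp lxml) (append (lxml-symbols (car lxml))
                              (lxml-symbols (cdr lxml))))
        (t '())))                       ; strings: text or attribute values

(defun lxml-package-names (lxml)
  "Return the distinct home package names of the symbols in LXML."
  (let ((names '()))
    (dolist (sym (lxml-symbols lxml) (nreverse names))
      (let ((pkg (symbol-package sym)))
        (when pkg                       ; skip uninterned symbols
          (pushnew (package-name pkg) names :test #'string=))))))

;; Stand-in for a namespace package; created only if it does not exist yet.
(defparameter *lxml-demo-package*
  (or (find-package "LXML-DEMO-BIB")
      (make-package "LXML-DEMO-BIB" :use '())))

;; Hand-built LXML whose symbols all live in the stand-in package.
(defparameter *lxml-demo-tree*
  (let ((book  (intern "book"  *lxml-demo-package*))
        (owner (intern "owner" *lxml-demo-package*))
        (title (intern "title" *lxml-demo-package*)))
    (list (list (list book owner "Smith")
                (list title "A Tale of Two Cities")))))
```

Applied to real parse-xml output, a helper of this kind would make it easy to spot symbols that ended up in the current package or in a generated net.xml.namespace.N package rather than in the package that was intended.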
- ACL does not support 4 byte Unicode scalar values, so input containing such data will not be processed correctly. (Note, however, that parse-xml does correctly detect and process wide Unicode input.)
- An initial <?xml declaration in external entity files is skipped without a check being made to see if the <?xml declaration is itself incorrect.
*debug-xml* and *debug-dtd* are useful. When not bound to nil, these variables cause lexical analysis and intermediate parsing results to be output to *standard-output*.
Using the OASIS test suite (http://www.oasis-open.org), here are the current parse-xml results:
Arguments:
Returns a string identifying the PXML version loaded into the running Lisp. The version is typically changed when a patch with new features is provided. This function is useful to determine whether the loaded PXML module has the new features.
Arguments: input-source &key external-callback content-only general-entities parameter-entities uri-to-package
This generic function returns multiple values:
- LXML and parsed DTD output, as described above in this document.
- An association list containing the uri-to-package argument conses (if any) and conses associated with any XML Namespace packages created during the parse (see uri-to-package argument description, below).
The arguments and their effects are:
- The external-callback argument, if specified, is a function object or symbol that parse-xml will execute when encountering an external DTD subset or external entity DTD declaration. Here is an example which shows what arguments the function should expect, and the value it should return:
(defun file-callback (uri-object token &optional public)
  ;; the uri-object is an ACL URI object created from
  ;; the XML input. In this example, this function
  ;; assumes that all uri's will be file specifications.
  ;;
  ;; the token argument identifies what token is associated
  ;; with the external parse (for example :DOCTYPE for external
  ;; DTD subset)
  ;;
  ;; the public argument contains the associated PUBLIC string,
  ;; when present
  ;;
  (declare (ignorable token public))
  ;; an open stream is returned on success
  ;; a nil return value indicates that the external
  ;; parse should not occur
  ;; Note that parse-xml will close the open stream before
  ;; exiting
  (ignore-errors (open (uri-path uri-object))))
- The general-entities argument is an association list containing general entity symbol and replacement text pairs. The entity symbols should be in the keyword package. Note that this option may be useful in generating desirable parse results in situations where you do not wish to parse external entities or the external DTD subset.
- The parameter-entities argument is an association list containing parameter entity symbol and replacement text pairs. The entity symbols should be in the keyword package. Note that this option may be useful in generating desirable parse results in situations where you do not wish to parse external entities or the external DTD subset.
- The uri-to-package argument is an association list containing uri objects and package objects. Typically, the uri objects correspond to XML Namespace attribute values, and the package objects correspond to the desired package for interning symbols associated with the uri namespace. If the parser encounters a uri object not contained in this list, it will generate a new package. The first generated package will be named net.xml.namespace.0, the second will be named net.xml.namespace.1, and so on.
(parse-xml (p stream) &key external-callback content-only general-entities parameter-entities uri-to-package)
(parse-xml (str string) &key external-callback content-only general-entities parameter-entities uri-to-package)
An easy way to parse a file containing XML input:
(with-open-file (p "example.xml")
  (parse-xml p :content-only t))
*debug-xml*
When not bound to nil, generates XML lexical state and intermediary parse result debugging output.
*debug-dtd*
When not bound to nil, generates DTD lexical state and intermediary parse result debugging output.
Copyright (c) 2000, 2001 by Franz Inc. All rights reserved.
Documentation for Allegro CL.
Created 2001.9.12.