Transcript of XML/HTML parsers

Steve Jacobson -- (10:03:33 PDT)
Hello everyone, I'm Steve Jacobson, the Franz developer who worked on the XML and HTML parsers

Steve Jacobson -- (10:04:37 PDT)
I'll start it off by telling anyone who doesn't know already that the up-to-date versions and documentation
are available at opensource.franz.com

Steve Jacobson -- (10:05:02 PDT)
The source code for the parsers is included

Steve Jacobson -- (10:05:29 PDT)
Now, who wants to ask the first question?

bane@gst -- (10:07:17 PDT)
Thanks, I was going to ask for that.

Steve Jacobson -- (10:07:24 PDT)
Oh, first maybe I should mention current projects

Steve Jacobson -- (10:08:20 PDT)
There are two current projects another developer is working on:

1. optional SAX-like event-driven XML parser interface

2. XML generator/pretty printer

Steve Jacobson -- (10:09:17 PDT)
I believe Bob noted inconsistent DTD parse results.

Steve Jacobson -- (10:09:58 PDT)
The conformance suites really only test entity handling, since we don't check validity.

Steve Jacobson -- (10:10:37 PDT)
Also, to me there seems to be a conflict between the DTD and the namespace stuff

Steve Jacobson -- (10:11:08 PDT)
Since the namespace stuff doesn't come until XML is processed, how can the DTD address namespace symbols?

Steve Jacobson -- (10:11:38 PDT)
It's messy, I believe. We slogged through it to get the entities to work right.

Steve Jacobson -- (10:12:21 PDT)
Bob, are you actually parsing DTD, or was your question more of an architectural one?

Steve Jacobson -- (10:13:12 PDT)
Also, Bob asked if a parser generator was involved.

Steve Jacobson -- (10:13:35 PDT)
The answer is no. It was all done by hand.

bane@gst -- (10:13:39 PDT)
I want to parse DTDs. My application wants to
display XML templates, and the DTD is the way
to do that.

Steve Jacobson -- (10:14:51 PDT)
OK, I'll keep the questions you posted. Send any others to [email protected].
We will try to deal with issues you bring up.

bane@gst -- (10:14:54 PDT)
Egad. It sure looked like something yacc-ish. I
suppose when you write a formal parser, and follow
a spec carefully, it ends up looking like that.

Steve Jacobson -- (10:16:48 PDT)
Well, we started with the HTML parser, and initially not even parsing everything, so it sort of grew step by step.

I think using a parser generator with XML would be difficult because so much of it is "context" sensitive

Steve Jacobson -- (10:17:59 PDT)
I believe Frank had some questions:

1. SOAP (Simple Object Access Protocol)

[email protected] -- (10:18:26 PDT)
Yup....any plans to build on your XML parser?

Steve Jacobson -- (10:18:30 PDT)
We weren't familiar with SOAP, but we were happy that Frank asked the question.

Steve Jacobson -- (10:19:50 PDT)
We are in the early stages of a Lisp-to-Lisp RPC mechanism.

We will look carefully at SOAP to see if that's the way to go. If so, then components used to do it would be modular and useful to you.

Steve Jacobson -- (10:20:33 PDT)
The RPC project is growing out of our jlinker (Java Linking) add-on.

The protocol there is not SOAP.

Steve Jacobson -- (10:21:08 PDT)
Frank also asked about XSLT (stylesheets)

[email protected] -- (10:21:24 PDT)
If SOAP were available under Lisp, we would be doing Java->Lisp (RPC, remote link) and various MS products (Office add-ins) -> Lisp (again, remote link)....

Steve Jacobson -- (10:22:39 PDT)
I will make sure our jlinker developer sees this transcript. There are so many ways to do object sharing, it's difficult to keep up. We are doing CORBA, COM, jlinker,...

Steve Jacobson -- (10:23:31 PDT)
Our technology VP felt that style sheets were just for formatting, so people would use existing (external) tools.

Steve Jacobson -- (10:24:19 PDT)
The idea of the parser was to get XML/HTML into a lisp format that could then be manipulated in a typical Lisp way, and then perhaps output as XML/HTML

Steve Jacobson -- (10:25:16 PDT)
I think Frank also asked about XQL (querying nodes in the XML parse tree)

Steve Jacobson -- (10:25:57 PDT)
I think the event driven interface may be applicable, though not directly

Steve Jacobson -- (10:26:17 PDT)
We had an early DOM version, but it was too slow.

[email protected] -- (10:27:25 PDT)
When you parse XML/HTML, what does the resulting Lisp structure look like?

Steve Jacobson -- (10:27:50 PDT)
If you consider an inside/outside model - in Lisp, you can do whatever you want, making sure you transform to a "standard" on the way out (like over a stream)

shahid -- (10:28:25 PDT)
I thought XSL sheets could be used for transformation of xml to xml, doing real work,
in addition to just formatting.

Steve Jacobson -- (10:28:51 PDT)
With some of the protocols, it's hard to tell if they concern how to talk across barriers, or whether they are more of an internal implementation guide that is more appropriate for Java

Steve Jacobson -- (10:30:16 PDT)
I agree with you Shahid, but the VP's feeling was that if YOU wanted to transform XML, you would use Lisp's power. If you wanted to use somebody's stylesheet, you would process it externally, and then parse the result.

Steve Jacobson -- (10:30:57 PDT)
And, of course, XSL is XML, so you could parse the stylesheet if you wanted to.

Steve Jacobson -- (10:31:18 PDT)
But I can see the next question - we should provide some help.

Steve Jacobson -- (10:32:14 PDT)
And we already did something like that with the namespace support - that's just XML, also, but we tweaked the parser to use Lisp packages when recognizing namespace XML

danielfinster -- (10:33:36 PDT)
How dependent is the XML parser really on using the case-sensitive "modern" Lisp mode? Is it impossible to make it work with ANSI std?

Steve Jacobson -- (10:34:00 PDT)
We would appreciate scenarios where you tell us what you are trying to do and then we can figure out how to help. We don't want to just mirror every XML protocol with Lisp equivalents of Java stuff

Steve Jacobson -- (10:35:52 PDT)
Daniel: the original version wouldn't work in ANSI mode, but the version at the opensource URL does.
It completely passes every conformance test the modern mode does. It just looks bad:
you'll see: (|Sometag| "here is text") as opposed to (Sometag "here is text")

bane@gst -- (10:37:01 PDT)
You just have to use a case-sensitive readtable. Life is like that in XML. *sigh*...

Steve Jacobson -- (10:37:12 PDT)
I have a question - is anybody out there outputting XML already using lisp?

bane@gst -- (10:38:10 PDT)
I want to. For my purposes htmlgen will likely
be close enough to start with.

shahid -- (10:38:11 PDT)
What I would like to do is to use Lisp's language power to manipulate an XML DOM tree or the SAX events. While some of the people around me prefer to use Java to do so, my familiarity with Lisp will make me more productive. Therefore, what would be useful to me would be fast parsing from XML documents to a Lisp DOM object and fast
serialization from a manipulated Lisp DOM object to the XML document.

bane@gst -- (10:39:29 PDT)
But having something that is "guaranteed" in some fashion
to have XML-read/XML-print consistency like Lisp's read-print
consistency, would be nice.

[email protected] -- (10:39:53 PDT)
What shahid said.....! We're looking at Lisp as a strong alternative to Java for performance and additional capabilities.

shahid -- (10:40:06 PDT)
I have been using Lisp in the past but not for XML. I have been outputting XML from Java programs, and doing transforms using XSL. I haven't yet tried using XML with Lisp.


shahid -- (10:41:11 PDT)
Right, I found myself more productive in Lisp than in Java, and my programs were a lot shorter, so if you can make a strong case for performance with Lisp, Great!

Steve Jacobson -- (10:41:32 PDT)
Shahid & Frank: Would it be fair to summarize your needs as this:

You are familiar with the XML standards like DOM and XSL, but you prefer manipulating them with Lisp

shahid -- (10:43:01 PDT)
Yes, I would prefer to manipulate the DOM objects in Lisp because of my productivity, but for simple transformations I think XSL is a much better solution.

[email protected] -- (10:43:39 PDT)
DOM is, to me, what you were saying: a Java-ish (and C-ish) way of looking at XML. So, strictly following DOM isn't critical -- be Lisp! (But developers know DOM so compatibility is a plus?)

Steve Jacobson -- (10:44:38 PDT)
Shahid, I can't tell if you are saying that you DON'T need Lisp to underdig XSL

[email protected] -- (10:45:17 PDT)
But yes, we *could* do XML in Java, but our app -- a major financial database/analytical system -- would greatly benefit from Lisp (all sorts of plans there). Performance is better in Lisp because we can make good use of memoization and a dedicated "analytical compiler".

shahid -- (10:45:22 PDT)
I agree with bane.
We should have lisp accept an XML document and check for well-formedness and validity, get the DOM tree, and then write out the DOM tree so we get an "equivalent" XML document if not the identical initial document.

bane@gst -- (10:45:49 PDT)
You ought to be able to make XML parsing cook in Lisp.
I've been thinking about stuff like advice to the parser,
being able to tell it that "my application doesn't
care about the contents of this tag, so parse it but don't
cons the results." Don't know if this would be worth doing yet.

[email protected] -- (10:46:24 PDT)
I agree with that as well.... That basic parse->validate->retransmit process is a cornerstone of XML.

shahid -- (10:47:09 PDT)
Yes, what I am saying is that XSL offers an alternative to manipulating the DOM tree in a program, and that it is a declarative method for
specifying what I want done. So, for simple transforms I don't need Lisp.

Steve Jacobson -- (10:47:48 PDT)
Note that we don't validate. Wouldn't an external validator do the trick there?

We think that it is usually parse->manipulate->retransmit

bane@gst -- (10:48:15 PDT)
"Identical document" is too strong. I would be
happy with "document that parses identically"

(unroll all the entities and leave them unrolled, etc.)

[email protected] -- (10:48:54 PDT)
The reason I like the idea of an XSLT in Lisp is because then I can build an XSL doc on the fly (XSL is XML, so build it using Lisp "DOM" objects), then hand that off to the XSLT engine. The declarative nature of XSL is the key to its utility.

shahid -- (10:49:44 PDT)
Excellent, excellent, the ability to generate XSL docs on the fly will give tremendous power.

[email protected] -- (10:50:18 PDT)
External validation is too slow for me. I need to minimize latency, and I can't trust my clients to give me a valid doc. So, I'd like to have the XML parser tell me "here's a valid parse tree" OR "bad doc"

shahid -- (10:50:36 PDT)
I agree that "Identical document" is too strong,
however in the last XML one conf in Santa Clara,
there was some discussion on what an equivalent document meant.

Steve Jacobson -- (10:51:01 PDT)
what about DTD vs schema issues?

shahid -- (10:51:57 PDT)
I agree that external validation is too slow; however if the XML parser can advise me about where the document is not well-formed or not valid
that would be very helpful in debugging.

[email protected] -- (10:52:20 PDT)
Agreed, shahid!

Steve Jacobson -- (10:53:08 PDT)
Bob, are you trying to check validity also, or are you using the DTD so you can then generate valid XML?

shahid -- (10:53:12 PDT)
Unless I am mistaken, schemas are the way of the future, but not yet standardized. DTDs are agreed-upon standards, but very limited, and used a little bit differently from XML.

danielfinster -- (10:53:36 PDT)
speaking of which, do you have any performance numbers on the xml parser? how much cpu time is used and how much consing is done to parse a moderately-complex document?

bane@gst -- (10:53:41 PDT)
I'm going to want schema information, to get field type
information to generate declarations for my processing code.

Once schemas settle down enough to be useful, of course.

shahid -- (10:54:11 PDT)
So, what that means is that initially validation would need to be done on DTD's and then later on for xschemas.

Steve Jacobson -- (10:55:08 PDT)
Daniel, we don't have numbers, but it is engineered to be fast and not generate garbage. We invite you to experiment.

Steve Jacobson -- (10:55:31 PDT)
The HTML parser has options to speed up and ignore parts you are not interested in

Steve Jacobson -- (10:56:10 PDT)
The XML parser doesn't have that yet - we thought we would do the SAX stuff to make that happen

shahid -- (10:56:16 PDT)
Another thought, what if there was a direct call in from lisp to a validating parser just to indicate if the document were valid; all you need to do then would be to strengthen the link to the validating parser (using jlink maybe?).

[email protected] -- (10:56:37 PDT)
We would definitely like Schema support, even to the point of tracking the standard..... Schema support is a prereq for SOAP...

Steve Jacobson -- (10:56:49 PDT)
Shahid, that is where we were in our thinking...

Steve Jacobson -- (10:57:54 PDT)
It's been about an hour - is there anything else someone wants to bring up before we wind things down?

bane@gst -- (10:58:34 PDT)
But using an external verifying parser would mean parsing
the document twice, wouldn't it? How hard can it be to verify
LXML against LDTD (he says in near complete ignorance of what
"validation" really means for XML)?

danielfinster -- (10:59:32 PDT)
how compliant is the parser with the current standards on XML?

Steve Jacobson -- (10:59:33 PDT)
Frank, Shahid, and Bob,

Is it OK for me to ask someone in marketing (or maybe an engineer) to ask you followups on your needs?

bane@gst -- (11:00:24 PDT)
Please do. Thanks for the pointer to the HTML
parser feature - I'll look at it.

[email protected] -- (11:00:40 PDT)
Definitely...!

shahid -- (11:00:49 PDT)
Yes, it would mean parsing the document twice, but if you know that the document is valid, then you don't need to validate it. What I was thinking of was that it would be a compromise between having to write a validating parser in Lisp versus exploiting what already exists.

Steve Jacobson -- (11:01:19 PDT)
Daniel,

It is very very compliant - we pass more conformance suite tests than the Java parsers.
The only ones we can't pass are the ones that use 4-byte encoded UTF characters because the Lisp doesn't support those characters

shahid -- (11:01:29 PDT)
Definitely - and when it is robust and meets the needs, Lisp would definitely make me more productive.

Steve Jacobson -- (11:01:38 PDT)
Note that this is using ACL 6.0 - ACL 5.0.1 doesn't have Unicode support

danielfinster -- (11:02:27 PDT)
Ok. Couldn't a composing external format be written to support those characters though?

Steve Jacobson -- (11:03:03 PDT)
Daniel, as discussed here, we don't validate, so we don't detect invalid XML. We DO detect "Non well-formed" XML, and when the input IS valid we produce the same results as a validity-checking parser. That means we go all the way to get the right entity values.

Steve Jacobson -- (11:03:43 PDT)
Daniel, I'm not sure of the answer - I'll show your question to the appropriate developer.

Lisa -- (11:04:14 PDT)
Thanks for participating everyone. A copy of this transcript will be sent to you as soon as it's available (it will also be posted on our website).

Steve Jacobson -- (11:04:43 PDT)
OK, I would like to sign off with something I saw a customer say in the AllegroServe chat session:

"I have to go because my cat just threw up."

Copyright © 2023 Franz Inc., All Rights Reserved