Building a tag-soup parser #1
Tuesday, 8 March 2005
I am working on an HTML parser, again. Sounds simple, right? Well, in reality it isn’t. I really don’t want it to be ‘just another’ HTML parser class.
Take the following example:
<h3>Trackback 2.0 and Referrer 1.0 for Nucleus</h3> <p> It's been a long day, but I am happy to announce that the Trackback and Referrer plugins for Nucleus are finished... </p><!--
Did you notice the <!-- at the end of the fragment? It is the beginning of a comment. The result: the rest of the page is not shown, and your web application stops functioning. Just because you trusted your input.
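To make the problem concrete, here is a minimal sketch of a check for a comment that is opened but never closed. It is written in Python purely for illustration and is not part of the parser described in this post; the function name has_unterminated_comment is made up, and the check deliberately ignores edge cases such as comment-like text inside scripts or attribute values.

```python
def has_unterminated_comment(fragment: str) -> bool:
    """Return True if the fragment opens a comment that is never closed.

    Hypothetical helper for illustration: it only looks for the literal
    markers "<!--" and "-->" and knows nothing else about HTML.
    """
    pos = 0
    while True:
        start = fragment.find("<!--", pos)
        if start == -1:
            return False          # no further comment openers
        end = fragment.find("-->", start + 4)
        if end == -1:
            return True           # opener without a matching closer
        pos = end + 3             # continue scanning after this comment


fragment = "<h3>Trackback 2.0</h3> <p>It's been a long day...</p><!--"
print(has_unterminated_comment(fragment))
```

A check like this only detects the problem. Actually repairing the fragment is what the parser is for.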
The solution is to filter the input, but in order to do that you need an HTML parser. A parser that is stable and that doesn’t rely on well-formed HTML. If it looks like HTML, it should be able to parse it. Unlike most PHP-based HTML parsers, it should output an XML DOM structure: easy to manipulate and easy to turn back into (X)HTML. And finally, I want the parser to be able to parse tag-soup HTML code and output well-formed, validating XHTML.
Doesn’t sound so simple anymore, right?
The core is based on Jose Solorzano’s parser, but greatly extended. My first task is to harden the parser against malformed HTML, but I am also adding proper support for entities and other character encodings. I am also attempting to build a proper DOM from the output of the parser. This means I will have to teach it about HTML, which is giving me a considerable headache, because simply following the specifications isn’t an option. It will have to make something useful out of tag soup.
Consider the following snippet of HTML:
<span style='font-size: 2em; color: red;'> <p>The quick brown fox...</p> <ul></span></ul> <p>... jumps over the lazy dog</p>
You may have noticed that the snippet is not well-formed, because the <span> tag is not properly nested. But even if you ignore this and simply try to validate it against an HTML or XHTML DTD, it will also fail. There are two reasons: <p> is not allowed as a child of <span>, and <span> is not a valid child of <ul>. So if my parser were to create a DOM from the snippet above, what would it look like?
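The two rules above are the kind of knowledge the parser needs to have about HTML. As a rough sketch, in Python and for illustration only, such rules could be kept in a small lookup table. The table below is a tiny hand-written subset and not the real DTD content model; the names INLINE, BLOCK, ALLOWED_CHILDREN and is_allowed_child are all made up for this example.

```python
# A deliberately tiny, hand-written subset of the (X)HTML content model.
# Assumption: this is NOT derived from the DTD, it only covers the tags
# discussed in this post.
INLINE = frozenset({"span", "em", "strong", "a"})
BLOCK = frozenset({"p", "ul", "div"})

ALLOWED_CHILDREN = {
    "span": INLINE,             # inline elements may only hold inline elements
    "p": INLINE,                # paragraphs may only hold inline elements
    "ul": frozenset({"li"}),    # lists may only hold list items
    "li": INLINE | BLOCK,
    "div": INLINE | BLOCK,
}


def is_allowed_child(parent: str, child: str) -> bool:
    """Return True if `child` may appear directly inside `parent`."""
    allowed = ALLOWED_CHILDREN.get(parent)
    if allowed is None:
        return True             # unknown parent: be forgiving
    return child in allowed


print(is_allowed_child("span", "p"))    # the first reason above
print(is_allowed_child("ul", "span"))   # the second reason above
print(is_allowed_child("p", "span"))    # the swapped order is fine
```

Treating unknown parents as "anything goes" is a design choice in the spirit of a forgiving parser. A strict validator would reject them instead.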
The parser would have to be very forgiving. The first problem it will encounter is that <p> is not allowed as a child of <span>. Fortunately, this is easily solved by swapping the two tags. So the first part would look like this:
<p> <span style='font-size: 2em; color: red;'> The quick brown fox... </span> </p>
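The swap can be sketched on a small tree. The example below uses Python's xml.etree.ElementTree only as a stand-in DOM; it is not the parser's own code, and the function name swap_inline_around_block is made up. It assumes the inline element has exactly one block child and no text of its own around it, which is the simple case shown above.

```python
import xml.etree.ElementTree as ET


def swap_inline_around_block(inline: ET.Element) -> ET.Element:
    """Turn <inline><block>...</block></inline> into
    <block><inline>...</inline></block>.

    Assumption: `inline` has exactly one child (the block element) and
    no surrounding text. Attributes stay with their original tags.
    """
    block = inline[0]
    new_block = ET.Element(block.tag, dict(block.attrib))
    new_inline = ET.SubElement(new_block, inline.tag, dict(inline.attrib))
    new_inline.text = block.text        # the paragraph text moves inside
    new_inline.extend(list(block))      # ...together with any child elements
    return new_block


span = ET.fromstring(
    "<span style='font-size: 2em; color: red;'>"
    "<p>The quick brown fox...</p>"
    "</span>"
)
fixed = swap_inline_around_block(span)
print(ET.tostring(fixed, encoding="unicode"))
```

A real implementation would also have to deal with text and several children inside the <span>, most likely by splitting the <span> into one copy per block child.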
Next is the <ul> tag. There are two issues with it. First of all, there is the </span> closing tag inside the <ul> tag. Its presence breaks the well-formedness of the snippet. Secondly, even if it did not break the well-formedness, it would still be illegal, because only <li> tags are valid children of a <ul> tag. Other tags are simply not allowed inside a <ul>.
Now, if we consider the </span> tag to be illegal and simply nonexistent, the <ul> construct would still be illegal, because according to the XHTML 1.0 DTD the <ul> tag must contain one or more <li> tags. The <ul> is empty and therefore not allowed. The end result is that the whole line containing the </span> would disappear from the DOM. Or would it?
In this case the next paragraph would still be under the influence of the unclosed <span>. And just like with the first paragraph, we need to bring the <p> outside the <span> to create a validating snippet:
<p> <span style='font-size: 2em; color: red;'> ... jumps over the lazy dog </span> </p>
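Why the second paragraph ends up inside the <span> can be shown with a small sketch that tracks the open elements while ignoring any closing tag that does not match the innermost open element. It uses Python's standard html.parser module purely as a tokenizer and is not the parser this post is about; the class name OpenElementTracker and its attributes are made up for the example. It does not remove the empty <ul>; it only records what is open at each point.

```python
from html.parser import HTMLParser


class OpenElementTracker(HTMLParser):
    """Track open elements, treating mismatched end tags as nonexistent."""

    def __init__(self):
        super().__init__()
        self.open_elements = []      # stack of currently open tags
        self.ignored_end_tags = []   # end tags that were thrown away
        self.context_at_start = []   # (tag, open ancestors) per start tag

    def handle_starttag(self, tag, attrs):
        self.context_at_start.append((tag, tuple(self.open_elements)))
        self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if self.open_elements and self.open_elements[-1] == tag:
            self.open_elements.pop()
        else:
            # e.g. the </span> inside <ul>: pretend it was never there
            self.ignored_end_tags.append(tag)


snippet = (
    "<span style='font-size: 2em; color: red;'> "
    "<p>The quick brown fox...</p> "
    "<ul></span></ul> "
    "<p>... jumps over the lazy dog</p>"
)
tracker = OpenElementTracker()
tracker.feed(snippet)
tracker.close()

print(tracker.ignored_end_tags)       # ['span']
print(tracker.context_at_start[-1])   # ('p', ('span',))
print(tracker.open_elements)          # ['span']
```

Under this policy the <span> is never closed, so the last <p> starts with the <span> still open, which is why it needs the same swap as the first paragraph.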