Scrub
Scrub is a PHP class which is (or at least will be) able to parse malformed HTML documents and produce a tree with all the information needed to recreate the document at a later stage. Malformed HTML is corrected in several ways during parsing, and the resulting tree can be used to write a syntactically correct HTML document. This does not mean the output will be valid HTML, because validity is determined by more than just syntax.
Scrub 0.1.16
The second stage is a filter class which is able to make transformations to the tree created by the parser. It is still very much under development, but it already supports removing empty tags, reordering whitespace and moving <br> tags out of inline tags such as <strong> or <em>. A more recent addition is filtering nodes based on their type, for example dropping all comments from the tree. The class also allows you to filter out tags and attributes, set required attributes and default values for specific tags, and replace tags with one or more tags of a different type. This mechanism even allows a tag to be converted to text based on its attributes; for example, you could replace all images with their alt text. The newest additions are filtering of classes and inline CSS style definitions, and filtering Word HTML.
The third stage is a writer class which takes the tree created by the parser, or transformed by the filter class, and turns it back into an HTML document. This class is largely complete and supports writing XHTML-style or HTML-style tags, so depending on the settings it writes either <br /> or <br>. A new experimental addition to the writer class is the ability to ignore existing whitespace and completely reformat the document, using newly created line breaks and indentation to make it more readable.
Possible usages
Since this class is still in its infancy, it is currently not that useful. When it is finished I expect it to be very useful for a number of different purposes. One practical application would be forums and content management systems like Geeklog, Postnuke and PHPNuke. With this class you can control which HTML tags are allowed and which are stripped. This means you could allow the use of HTML without any security risks and without the risk of users submitting HTML code that messes up the layout of the website. Another specific application could be cleaning up HTML code created by the inline HTML editors built into IE and Mozilla, or serving as a Word HTML filter.
Another useful addition I am considering is an HTML cutter, which cuts off a document after a specific number of characters or words. Tags are not counted when deciding the cut point, the document is never cut in the middle of a tag, and any tags left open are automatically closed.
Lib-Scrub will probably never be a replacement for HTML Tidy. Although it certainly overlaps with some of HTML Tidy’s features, Lib-Scrub is intended to clean up user-submitted HTML in web applications: to make sure it is syntactically correct, does not pose any security risks and does not cause problems when the user-submitted HTML is integrated into the website.
Scrub Features
Parsing malformed HTML
The Lib-Scrub parser tries to recover malformed attribute values in tags. Improper use of quotes is corrected in several ways, including:
<a href='http://www.rakaz.nl>
<a href='http://www.rakaz.nl'>
<a href=http://www.rakaz.nl">
<a href="http://www.rakaz.nl">
<img src="image.gif alt=Image'>
<img src="image.gif" alt='Image'>
Improperly nested tags are also evaluated and resolved. Although the way Lib-Scrub does this may not be technically correct, its behavior mimics how existing browsers such as IE, Mozilla and Safari solve this problem.
<strong> bold <em> bold-italic </strong> italic </em>
<strong> bold <em> bold-italic </em></strong><em> italic </em>
Lib-Scrub also tries to detect unclosed tags and determines whether they should be closed or not.
<p> paragraph <em> emphasis </p>
<p> paragraph <em> emphasis </em></p>
<table><tr><td> cell
<table><tr><td> cell </td></tr></table>
When Lib-Scrub encounters a closing tag it tries to figure out to which opening tag it belongs. When it can’t find any opening tag to which it could belong, it will remove the closing tag from the document.
<p> paragraph </em> paragraph </p>
<p> paragraph paragraph </p>
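The nesting and closing rules above can be sketched with a stack of open tags. The following Python fragment is only an illustration and not Lib-Scrub's PHP code: the token tuples and the names fix_nesting and render are hypothetical. The sketch also reopens every tag it had to close early, whereas the examples above show that Lib-Scrub decides this per tag (the <em> inside the <p> is closed but not reopened).

```python
def fix_nesting(tokens):
    """Repair a flat list of ("open" | "close" | "text", value) tuples."""
    stack = []  # names of the currently open tags, innermost last
    out = []
    for kind, value in tokens:
        if kind == "open":
            stack.append(value)
            out.append((kind, value))
        elif kind == "close":
            if value not in stack:
                continue  # no matching opening tag: drop the closing tag
            reopened = []
            # Close everything that was opened after the tag being closed.
            while stack[-1] != value:
                inner = stack.pop()
                out.append(("close", inner))
                reopened.append(inner)
            stack.pop()
            out.append(("close", value))
            # Reopen the tags that were closed early, in their original order.
            for inner in reversed(reopened):
                stack.append(inner)
                out.append(("open", inner))
        else:
            out.append((kind, value))
    # Close whatever is still open at the end of the document.
    while stack:
        out.append(("close", stack.pop()))
    return out


def render(tokens):
    """Turn the token list back into markup (attributes are ignored here)."""
    parts = {"open": "<{}>", "close": "</{}>", "text": "{}"}
    return "".join(parts[kind].format(value) for kind, value in tokens)


tokens = [
    ("open", "strong"), ("text", " bold "),
    ("open", "em"), ("text", " bold-italic "),
    ("close", "strong"), ("text", " italic "), ("close", "em"),
]
print(render(fix_nesting(tokens)))
# <strong> bold <em> bold-italic </em></strong><em> italic </em>
```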
Lib-Scrub also tries to determine if a previous tag needs to be closed when another opening tag is encountered. Once again, Lib-Scrub tries to mimic how existing browsers handle these situations and solves the problem in the same way.
<p> paragraph <h1> heading </h1>
<p> paragraph </p><h1> heading </h1>
<p> paragraph <p> paragraph </p>
<p> paragraph </p><p> paragraph </p>
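One way to picture this rule is a lookup that, given the stack of open tags and the tag about to be opened, returns the tags that must be closed first. The Python sketch below is an assumption-laden illustration rather than Lib-Scrub's actual logic: the name implied_closes is hypothetical and the CLOSES_P set is only a subset of the tags that end an open paragraph in HTML.

```python
# Tags whose opening implicitly ends an open <p> (illustrative subset).
CLOSES_P = {"p", "h1", "h2", "h3", "h4", "h5", "h6",
            "div", "ul", "ol", "table", "blockquote", "pre", "hr"}


def implied_closes(open_stack, new_tag):
    """Return the tags to close, innermost first, before opening new_tag.

    open_stack lists the currently open tag names, outermost first.
    """
    if new_tag not in CLOSES_P or "p" not in open_stack:
        return []
    # Position of the innermost open <p>; it and everything inside it closes.
    idx = len(open_stack) - 1 - open_stack[::-1].index("p")
    return list(reversed(open_stack[idx:]))


print(implied_closes(["p"], "h1"))        # ['p']
print(implied_closes(["p", "em"], "p"))   # ['em', 'p']
print(implied_closes(["p"], "em"))        # []
```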
Attribute values and plain text are also properly HTML encoded using entities. Lib-Scrub cleans up existing entities and creates new entities where needed: it always decodes entities which can be expressed using regular characters and encodes characters which are not allowed in the current character set. In addition, CP-1251 characters and entities that express CP-1251 characters are converted to their Unicode equivalents.
© and ¡ => © and ¡
€ and ™ => € and ™
€ and ™ => € and ™
♠ and ♣ => ♠ and ♣
അ => അ
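The decode-then-re-encode idea can be sketched with Python's standard library. This is an illustration under stated assumptions, not Lib-Scrub's implementation: normalize_text is a hypothetical name, and Python's html.unescape maps numeric references in the 128 to 159 range using the HTML5 table (based on Windows-1252), which may differ from the code page handling described above.

```python
import html


def normalize_text(text, charset="iso-8859-1"):
    """Decode every entity, then re-encode only what the charset cannot hold."""
    decoded = html.unescape(text)                # "&copy;" becomes the © character
    escaped = html.escape(decoded, quote=False)  # keep &, < and > safe as entities
    # Characters outside the target charset become numeric references.
    return escaped.encode(charset, "xmlcharrefreplace").decode(charset)


print(normalize_text("&copy; &#9824; &#128;"))
print(normalize_text("a &lt; b &amp; c"))
```

With the default ISO-8859-1 target the copyright sign is written as a regular character, while the spade and the euro sign stay numeric references because that character set cannot express them.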
Filtering
The filter of Lib-Scrub is extremely paranoid. Basically you need to tell it what is allowed, because everything else is automatically disallowed. Lib-Scrub can be configured to allow specific tags, specific attributes for specific tags and even specific classes or CSS properties.
<form> ..... </form>
.....
<font color='red' size='+3'> large red text? </font>
<font color='red'> large red text? </font>
<p class='indent nonsense'> paragraph </p>
<p class='indent'> paragraph </p>
<p style='font-size: 200%; color: red;'> paragraph </p>
<p style='color: red;'> paragraph </p>
Filtering of classes and CSS properties can also be configured using a glob-style asterisk. For example, if you want to allow all font-related CSS properties you can simply configure Lib-Scrub to allow ‘font-*’. It is also possible to disallow a subset of an allowed group of CSS properties, for example allowing all font-related CSS properties except ‘font-size’.
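The glob-style allow and disallow lists could work roughly like the following Python sketch. The function name filter_css and its parameters are hypothetical and say nothing about how Lib-Scrub is actually configured. Declarations are split naively on semicolons, so values that contain a semicolon themselves are not handled.

```python
from fnmatch import fnmatchcase


def filter_css(style, allow, deny=()):
    """Keep only declarations whose property matches allow and not deny."""
    kept = []
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue  # skip empty or malformed fragments
        name, value = declaration.split(":", 1)
        name = name.strip().lower()
        allowed = any(fnmatchcase(name, pattern) for pattern in allow)
        denied = any(fnmatchcase(name, pattern) for pattern in deny)
        if allowed and not denied:
            kept.append(f"{name}: {value.strip()}")
    return "; ".join(kept)


style = "font-size: 200%; font-weight: bold; color: red"
print(filter_css(style, allow=["font-*"], deny=["font-size"]))
# font-weight: bold
```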
Each attribute value can also be filtered using a regular expression mask. If the mask rejects an attribute value, Lib-Scrub uses a pre-configured default value, if one is specified. If no default value is specified, it removes the attribute entirely. The same applies to the values of CSS properties. Lib-Scrub has regular expression masks predefined for most attributes of most tags, so you won’t have to define them manually; you just select which values you want to allow. It is still possible to specify your own regular expression if you want to allow even more specific values.
<p align='nonsense'> paragraph </p>
<p> paragraph </p>
<p align='right'> paragraph </p>
<p align='right'> paragraph </p>
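The mask-or-default rule can be illustrated in a few lines of Python. This is a sketch with assumed names (apply_mask, ALIGN_MASK); the predefined masks that ship with Lib-Scrub are not reproduced here.

```python
import re

# Hypothetical mask for the align attribute of a paragraph.
ALIGN_MASK = r"left|right|center|justify"


def apply_mask(attrs, name, mask, default=None):
    """Return a copy of attrs where attrs[name] satisfies the mask.

    A rejected value is replaced by the default, or removed when no
    default is given.
    """
    result = dict(attrs)
    value = result.get(name)
    if value is None:
        return result
    if re.fullmatch(mask, value) is None:
        if default is not None:
            result[name] = default
        else:
            del result[name]
    return result


print(apply_mask({"align": "nonsense"}, "align", ALIGN_MASK))          # {}
print(apply_mask({"align": "right"}, "align", ALIGN_MASK))             # {'align': 'right'}
print(apply_mask({"align": "nonsense"}, "align", ALIGN_MASK, "left"))  # {'align': 'left'}
```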
When instructed to do so, Lib-Scrub can also substitute specific tags with other tags, or even multiple tags. This feature is extremely useful for filtering out unwanted things by transforming them into something that is allowed.
<h1> Heading </h1>
<p><b> Heading </b></p>
<strong> Bold Text </strong>
<b> Bold Text </b>
<center> Centered Text </center>
<p align='center'> Centered Text </p>
As an extension of the previous example, tags can also be converted into regular text using the attributes as parameters.
<img src='image.jpg' alt='Photograph of Darl McBride'>
[Omitted image: Photograph of Darl McBride]
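Converting a tag to text with its attributes as parameters amounts to filling in a template. The Python sketch below assumes a brace-style placeholder syntax and a hypothetical function name tag_to_text; how Lib-Scrub expresses such templates is not documented here.

```python
def tag_to_text(attrs, template, fallback=""):
    """Fill a text template with the attributes of the tag being replaced."""

    class _WithFallback(dict):
        # Attributes missing from the tag are rendered as the fallback text.
        def __missing__(self, key):
            return fallback

    return template.format_map(_WithFallback(attrs))


attrs = {"src": "image.jpg", "alt": "Photograph of Darl McBride"}
print(tag_to_text(attrs, "[Omitted image: {alt}]"))
# [Omitted image: Photograph of Darl McBride]
```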
Lib-Scrub also allows you to set default values and specify required attributes for tags. If a required attribute is not specified by the user, Lib-Scrub inserts it using the default value.
<table>
<table cellspacing='0' cellpadding='3' border='1' summary=''>
<img src='image.jpg'>
<img src='image.jpg' alt=''>
It is also possible to overwrite existing attribute values with your own configurable value. Combined with the default value this option makes sure that the configured value is always used for one specific tag.
<a href='http://www.rakaz.nl'>
<a href='http://www.rakaz.nl' target='_blank'>
<a href='http://www.rakaz.nl' target='_self'>
<a href='http://www.rakaz.nl' target='_blank'>
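Defaults and forced values differ only in precedence: a default applies when the user gave no value, a forced value always wins. A minimal Python sketch of that ordering, with the hypothetical name apply_attribute_rules, could look like this:

```python
def apply_attribute_rules(attrs, defaults=None, forced=None):
    """Merge attributes so that: defaults < user-supplied values < forced values."""
    result = dict(defaults or {})
    result.update(attrs)          # user values override defaults
    result.update(forced or {})   # forced values override everything
    return result


table_defaults = {"cellspacing": "0", "cellpadding": "3", "border": "1", "summary": ""}
print(apply_attribute_rules({}, defaults=table_defaults))

link = {"href": "http://www.rakaz.nl", "target": "_self"}
print(apply_attribute_rules(link, forced={"target": "_blank"}))
# {'href': 'http://www.rakaz.nl', 'target': '_blank'}
```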
You can configure which protocols are allowed in src or href attributes in order to prevent unwanted scripts from being loaded.
<a href='http://www.rakaz.nl'> Regular Link </a>
<a href='http://www.rakaz.nl'> Regular Link </a>
<a href='javascript:alert();'> Javascript </a>
<a> Javascript </a>
<a href='javascript:javascript:alert();'> Trying to fool the filter </a>
<a> Trying to fool the filter </a>
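A protocol allow list can be sketched as a check on the scheme in front of the first colon. The Python fragment below is an illustration, not Lib-Scrub's code: is_allowed_url and ALLOWED_PROTOCOLS are hypothetical names, the value is assumed to be entity-decoded already, and whitespace and control characters are stripped first because browsers have been known to ignore them inside a scheme.

```python
import re

ALLOWED_PROTOCOLS = {"http", "https", "mailto"}  # example configuration


def is_allowed_url(value, allowed=ALLOWED_PROTOCOLS):
    """Accept relative URLs and URLs whose scheme is on the allow list."""
    # Drop whitespace and control characters that could hide a scheme.
    cleaned = re.sub(r"[\x00-\x20]+", "", value)
    match = re.match(r"([a-zA-Z][a-zA-Z0-9+.\-]*):", cleaned)
    if match is None:
        return True  # no scheme, so this is a relative URL
    return match.group(1).lower() in allowed


print(is_allowed_url("http://www.rakaz.nl"))             # True
print(is_allowed_url("javascript:alert();"))             # False
print(is_allowed_url("javascript:javascript:alert();"))  # False
```

When the check fails, the attribute would be removed, which leaves the bare <a> shown in the examples above.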
Unwanted nodes such as comments and literals can also be removed from the result.
Text <!-- This is just a regular comment --> and more text
Text and more text
Text <?php echo ', a script'; ?> and more text
Text and more text
The filter can even be configured to remove all kinds of scripting, such as:
<script language='javascript'>
<!--
alert();
//-->
</script>
<a href='javascript:alert();'>
<a>
<a href='http://www.rakaz.nl' onclick='alert();'>
<a href='http://www.rakaz.nl'>
<h1 align='&{variable};'>
<h1>
On top of all of this, Lib-Scrub is also able to clean up the horrible mess that Microsoft Word makes of HTML files. It removes a lot of annoying and unneeded declarations, classes, styles and layers.
<div class='Section3'>
<p> .... </p>
</div>
<p> .... </p>
<p class='MsoNormal'> Paragraph </p>
<p> Paragraph </p>
<td style='mso-padding: 32pt'> Cell </td>
<td> Cell </td>
Lib-Scrub is also able to remove the <html> and <body> tags from the source document, so its contents can be inserted into a different document. This feature is particularly useful for forums and content management systems.
<html>
<head>
<title>Untitled Document</title>
</head>
<body>
<p> Paragraph </p>
</body>
</html>
<p> Paragraph </p>
Reformatting
Lib-Scrub also has some functionality embedded in its filter class to improve the formatting of your HTML document. This includes the ability to remove empty tags and to move whitespace and line breaks around.
Text <strong><em></em></strong> Text
Text Text
Line 1 <strong><em><br></em></strong> Line 2
Line 1 <br> Line 2
<u>The quick brown fox<br>
</u>jumps over the lazy dog
<u>The quick brown fox</u><br>
jumps over the lazy dog
<h3>Heading 3 </h3>
<h3>Heading 3</h3>
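Removing empty tags is easiest to see on a tree. The Python sketch below covers only that first case and not the moving of <br> tags or whitespace; the tree shape (strings for text, (tag, children) tuples for elements) and the name remove_empty are assumptions made for the illustration.

```python
# Tags that are legitimately empty and must never be removed (illustrative).
VOID_TAGS = {"br", "hr", "img", "input"}


def remove_empty(nodes):
    """Drop elements that have no content left, working from the inside out."""
    cleaned = []
    for node in nodes:
        if isinstance(node, str):
            cleaned.append(node)  # text nodes are kept as they are
            continue
        tag, children = node
        children = remove_empty(children)  # clean the subtree first
        if children or tag in VOID_TAGS:
            cleaned.append((tag, children))
    return cleaned


tree = ["Text ", ("strong", [("em", [])]), " Text"]
print(remove_empty(tree))
# ['Text ', ' Text']
```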