Parsing feeds using PHP
Friday, 14 July 2006
Introducing FeedParser: a new library for parsing feeds in PHP.
I have been working on this library for a long time. My first attempt was over two years ago, and it was horribly buggy. Last year I tried again; the result was not too bad and supported many features of Atom 1.0, but it was inflexible. Unfortunately I got sidetracked, and the Nucleus ‘Planet’ plugin I was writing at the time was never released. Two days ago I decided to throw everything away and start all over.
The result will become a brand-new parsing library. It is progressing nicely and will probably be ready in a week or two. Currently it is able to parse Atom 0.3, 1.0 and most RSS flavours, including RSS 0.9.x, 1.0, 1.1 and 2.0. It supports some basic extensions and over the next couple of days I will keep adding support for other extensions such as MediaRSS.
Download the source or try this demo.
When the library encounters embedded XHTML it will ensure that all embedded content is preserved. Embedded SVG and MathML should not give any problems. Relative URIs in XHTML content, SVG and MathML are automatically resolved to absolute URIs.
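To make the URI resolution step concrete, here is a minimal sketch of how a relative reference can be resolved against a base URI, loosely following RFC 3986 section 5. It is not FeedParser's own code: the function names resolveUri and removeDotSegments and the example.org URLs are assumptions made for illustration, and edge cases such as user info in the base URI are ignored.

```php
<?php
// Remove "." and ".." segments from a path (simplified RFC 3986, section 5.2.4).
function removeDotSegments($path)
{
    $output = array();
    $segments = explode('/', $path);
    $last = count($segments) - 1;
    foreach ($segments as $i => $segment) {
        if ($segment === '.' || $segment === '..') {
            // ".." drops the previous segment, but never the leading empty one.
            if ($segment === '..' && (count($output) > 1
                || (count($output) === 1 && $output[0] !== ''))) {
                array_pop($output);
            }
            // A trailing dot segment leaves a trailing slash behind.
            if ($i === $last) {
                $output[] = '';
            }
            continue;
        }
        $output[] = $segment;
    }
    return implode('/', $output);
}

// Resolve $relative against $base (hypothetical helper, not the library's API).
function resolveUri($base, $relative)
{
    // A reference that already has a scheme is absolute.
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $relative)) {
        return $relative;
    }
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        return $relative; // base is unusable, leave the reference alone
    }
    // Network-path reference such as //host/path: only the scheme is inherited.
    if (strpos($relative, '//') === 0) {
        return $b['scheme'] . ':' . $relative;
    }
    // Split off the fragment and the query, in that order.
    $fragment = null;
    $query = null;
    if (($pos = strpos($relative, '#')) !== false) {
        $fragment = substr($relative, $pos + 1);
        $relative = substr($relative, 0, $pos);
    }
    if (($pos = strpos($relative, '?')) !== false) {
        $query = substr($relative, $pos + 1);
        $relative = substr($relative, 0, $pos);
    }
    $basePath = isset($b['path']) ? $b['path'] : '/';
    if ($relative === '') {
        // Only a query and/or fragment was given: keep the base path.
        $path = $basePath;
        if ($query === null && isset($b['query'])) {
            $query = $b['query'];
        }
    } elseif ($relative[0] === '/') {
        $path = removeDotSegments($relative);
    } else {
        // Merge with the directory part of the base path.
        $dir = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path = removeDotSegments($dir . $relative);
    }
    $uri = $b['scheme'] . '://' . $b['host']
        . (isset($b['port']) ? ':' . $b['port'] : '') . $path;
    if ($query !== null) {
        $uri .= '?' . $query;
    }
    if ($fragment !== null) {
        $uri .= '#' . $fragment;
    }
    return $uri;
}

$base = 'http://example.org/blog/2006/post.html';
echo resolveUri($base, '../feed.atom'), "\n";
```

In a feed the base would typically come from xml:base or from the URL of the feed itself, so the same image path can resolve differently depending on where the entry was published.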
There is still a lot to do. Currently HTML content is passed through Tidy, but this does not guarantee that it won’t cause any problems. Aside from this the feed is not filtered in any way. I still need to look at a way to clean up the HTML and XHTML to drop any unwanted tags. Existing solutions will not work properly, because the filter needs to be fully compatible with embedded SVG and MathML…
The whole idea behind this library is to take as much complexity as possible away from its user. You don’t need to know the differences between the competing formats and extensions. All you need to do is make one very simple call and process the data.
$parser = new FeedParserURL();
$result = $parser->Parse('http://rakaz.nl/index.atom');

echo $result['feed']['title']['value'];
echo $result['feed']['entries'][0]['link']['href'];
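Processing the data usually means looping over the entries. The sketch below does that on a hand-written array shaped like the result above, so it runs without the library. The keys feed, title, value, entries, link and href come from the example call; the assumption that each entry also carries a title with a value key, and the helper name listEntries, are mine.

```php
<?php
// Hypothetical sample data shaped like the result of $parser->Parse(...).
$result = array(
    'feed' => array(
        'title' => array('value' => 'Example feed'),
        'entries' => array(
            array(
                'title' => array('value' => 'First post'),
                'link'  => array('href' => 'http://example.org/1'),
            ),
            array(
                // No title here, to show the fallback below.
                'link'  => array('href' => 'http://example.org/2'),
            ),
        ),
    ),
);

// Build one "title <url>" line per entry, tolerating missing fields.
function listEntries(array $result)
{
    $lines = array();
    $entries = isset($result['feed']['entries']) ? $result['feed']['entries'] : array();
    foreach ($entries as $entry) {
        $title = isset($entry['title']['value']) ? $entry['title']['value'] : '(untitled)';
        $href  = isset($entry['link']['href']) ? $entry['link']['href'] : '';
        $lines[] = $title . ' <' . $href . '>';
    }
    return $lines;
}

echo implode("\n", listEntries($result)), "\n";
```

Because every format is mapped onto the same structure, a loop like this should not need to care whether the source was RSS or Atom.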
Instead of giving back the ‘raw’ elements as other libraries do, FeedParser converts everything to an Atom-based internal structure. The advantages of this approach are enormous, and it simplifies development considerably. Of course there is also a downside: the library needs to know about every single extension, otherwise it will simply ignore it, without giving the user of this library the ability to retrieve the information. Fortunately it is very easy to add a new extension. All you need to do is let the library know about the namespace and create a simple class that parses the elements and attributes of the extension.
$parser = new FeedParserURL();

// Register the extension's namespace URI under a short name.
$parser->addCustomNamespace('http://backend.userland.com/creativeCommonsRssModule', 'creativeCommons');
$result = $parser->Parse('http://rakaz.nl/index.atom');

class FeedParserExtensionCreativeCommons extends FeedParserHelper
{
    // Handles the extension's license element by adding a link with rel="license".
    function parseElementLicense(& $context, & $tag)
    {
        $link = array();
        $link['rel'] = 'license';

        if (isset($tag['value']))
            $link['href'] = $this->_parseUrl($tag['value']);

        $context['links'][] = $link;
    }
}
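The example suggests a naming convention: the short name registered for the namespace selects the class, and the element name selects the method. The sketch below shows one way such a lookup could work. It is a guess at the mechanism, not FeedParser's actual internals; the SketchExtension prefix, the function dispatchExtensionElement and the licence URL are hypothetical stand-ins so that the code runs without the library.

```php
<?php
// Hypothetical stand-in for an extension class (the real one extends FeedParserHelper).
class SketchExtensionCreativeCommons
{
    function parseElementLicense(&$context, &$tag)
    {
        if (isset($tag['value'])) {
            $context['links'][] = array('rel' => 'license', 'href' => $tag['value']);
        }
    }
}

// Hypothetical dispatcher: namespace URI -> short name -> class, element -> method.
// Returns false when the element is not handled, which mirrors "simply ignore it".
function dispatchExtensionElement(array $namespaces, $namespaceUri, $localName, &$context, &$tag)
{
    if (!isset($namespaces[$namespaceUri])) {
        return false; // unknown namespace
    }
    $class  = 'SketchExtension' . ucfirst($namespaces[$namespaceUri]);
    $method = 'parseElement' . ucfirst($localName);
    if (!class_exists($class) || !method_exists($class, $method)) {
        return false; // known namespace, but no handler for this element
    }
    $handler = new $class();
    $handler->$method($context, $tag);
    return true;
}

$namespaces = array(
    'http://backend.userland.com/creativeCommonsRssModule' => 'creativeCommons',
);
$context = array('links' => array());
$tag = array('value' => 'http://creativecommons.org/licenses/by/2.5/');

$handled = dispatchExtensionElement(
    $namespaces,
    'http://backend.userland.com/creativeCommonsRssModule',
    'license',
    $context,
    $tag
);
```

With a lookup of this kind, supporting a new extension only requires registering its namespace and writing one class; the core parser stays untouched.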
Sorry to comment on such an old article, but I couldn’t find any contact info for you. I’m using your Feedparser library to run an update notifier for a WordPress theme, and some of my users have been reporting warnings of this kind:
Warning: curl_setopt() [function.curl-setopt]: CURLOPT_FOLLOWLOCATION cannot be activated when in safe_mode or an open_basedir is set in /…/wp-content/themes/tarski/library/feedparser/lib-feedparser.php on line 781
A little research seems to indicate that it’s due to a PHP version change which doesn’t allow CURLOPT_FOLLOWLOCATION to be activated when open_basedir is enabled. If you could suggest a fix, I’d be very grateful.
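One possible workaround, sketched under the assumption that the warning comes from an unconditional curl_setopt() call: only enable CURLOPT_FOLLOWLOCATION when neither safe_mode nor open_basedir is active. The helper names followLocationAllowed and createFeedHandle are hypothetical and not part of the library, and I have not checked this against line 781 of lib-feedparser.php.

```php
<?php
// Returns true when CURLOPT_FOLLOWLOCATION can be enabled without the warning.
// Both arguments default to the current ini settings; they can be passed
// explicitly to make the check easy to test.
function followLocationAllowed($safeMode = null, $openBasedir = null)
{
    if ($safeMode === null) {
        $safeMode = ini_get('safe_mode'); // false on PHP versions without safe_mode
    }
    if ($openBasedir === null) {
        $openBasedir = ini_get('open_basedir');
    }
    $safeModeOn = filter_var($safeMode, FILTER_VALIDATE_BOOLEAN);
    return !$safeModeOn && (string) $openBasedir === '';
}

// Hypothetical helper that prepares a cURL handle for fetching a feed.
function createFeedHandle($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if (followLocationAllowed()) {
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    }
    return $ch;
}
```

The trade-off is that redirects are no longer followed on restricted hosts, so a feed that has moved would have to be handled by reading the Location header manually.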