Thursday, 5 May 2005
Building a universal feed reader is not easy. All the different flavours of RSS are not exactly compatible with each other. Flavours evolve and new extensions are created.
In the case of Atom there is even an older unofficial standard (version 0.3) and the ever-changing IETF draft which should lead to version 1.0. It’s not easy, but I think I managed this quite well for my Nucleus Planet plugin.
Over the last few weeks I’ve encountered some really strange feeds. Some of those feeds contain plain errors; others make very creative use of the standards. In the first case I’ve tried to make sure it won’t create problems for the plugin. In the second case I’ve tried to support it properly.
<script> tags, event model attributes such as
onmouseover and URLs that use the
Norman Walsh uses an Atom feed, but not your regular Atom 0.3 feed. Instead he uses a feed that is based on revision 5 of the IETF submission draft. This is the first and only feed I have encountered that uses this specification, probably because both the Atom WG and the specification itself discourage the use of this draft.
The Atom format is a work-in-progress, and this draft is both incomplete and likely to change rapidly. As a result, THE FORMAT DESCRIBED BY THIS DRAFT SHOULD NOT BE DEPLOYED, either in production systems or in any non-experimental fashion on the Internet.
Nevertheless, it is a valid feed and thanks to this feed I’ve updated the Planet plugin to accept this draft and any other IETF draft for that matter, including the latest Draft 8.
<?xml version="1.0" encoding="utf-8" standalone="no"?> <?xml-stylesheet type="text/xsl" href="/style/atom.xsl"?> <feed xmlns="http://purl.org/atom/ns#draft-ietf-atompub-format-05" version="draft-ietf-atompub-format-05 : do not deploy" xml:lang="EN-us"> <head> <title>norman.walsh.name</title> ...
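One way to accept any draft is to look only at the start of the namespace on the root element. The sketch below is a minimal illustration in Python, not the Planet plugin's actual code. It assumes that every pre-1.0 Atom feed uses a namespace beginning with `http://purl.org/atom/ns#`, as the feed above and Atom 0.3 do; the function name `atom_draft` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: pre-1.0 Atom namespaces share this prefix, optionally followed
# by a draft identifier such as "draft-ietf-atompub-format-05".
ATOM_NS_PREFIX = "http://purl.org/atom/ns#"


def atom_draft(xml_text):
    """Return the draft suffix of an Atom feed's namespace.

    Returns "" for a feed without a draft suffix (for example Atom 0.3)
    and None when the document is not a pre-1.0 Atom feed at all.
    """
    root = ET.fromstring(xml_text)
    # ElementTree spells qualified names as "{namespace}localname".
    if not root.tag.startswith("{"):
        return None
    namespace, local_name = root.tag[1:].split("}", 1)
    if local_name != "feed" or not namespace.startswith(ATOM_NS_PREFIX):
        return None
    return namespace[len(ATOM_NS_PREFIX):]
```

A reader built this way does not need a code change for each new draft, at the price of treating all drafts as if they had the same structure.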
Sam Ruby is an expert on this subject. He sits on the Atom WG, built www.feedvalidator.org together with Mark Pilgrim and probably has more knowledge about feeds and XML in his left pinkie than I have in my whole body. His website offers feeds in more formats than you can imagine.
One of the formats he uses is RSS 2.0. The Planet plugin already had support for this format, so I was surprised that it didn’t work properly on Sam’s site. The feed didn’t give any errors; only the content was missing. It used to work before, but suddenly stopped working.
The reason is simple. Sam Ruby changed the way the content is encoded in the feed. He now uses a method that is known from the Atom spec and the RSS 1.1 Payload module: inline XHTML in the XHTML namespace. Where those two formats offer a container element for this, the RSS 2.0 specification does not. This didn’t hold Sam back; he just placed the inline XHTML in the
<item> element instead.
Is this legal according to the RSS 2.0 spec? Well, it is certainly not forbidden. In fact the specification explicitly allows non-RSS elements as long as they are in their own namespace.
A RSS feed may contain elements not described on this page, only if those elements are defined in a namespace.
One thing Sam’s ingeniously created feed required was a change to my Planet plugin. It now works perfectly, but I imagine there are a lot of other feed readers that do not know how to deal with this method of providing inline XHTML content.
<item> <title>Glue Layer People</title> <link>http://www.intertwingly.net/blog/2005/05/04/Glue-Layer-People</link> <guid isPermaLink="false">http://www.intertwingly.net/blog/1971.html</guid> <body xmlns="http://www.w3.org/1999/xhtml"> <p><a href="http://koranteng.blogspot.com/2005/05/get-on-bus.html"> Koranteng Ofosu-Amaah</a>: <em>As an application designer my perspective has mostly been “inside out” and I’ve been forever amazed at the serendipitous magic that you glue layer people have been able to do with things I’ve built.</em></p> ...
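A feed reader can pick up this kind of content by looking for a `body` element in the XHTML namespace among the children of each item. The following Python sketch shows the idea; it is an illustration under that assumption, not the code used in the Planet plugin, and `inline_xhtml` is a hypothetical helper name.

```python
import copy
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def inline_xhtml(item):
    """Return the markup inside an item's XHTML <body> child, or None."""
    body = item.find("{%s}body" % XHTML_NS)
    if body is None:
        return None
    # Work on a copy so the caller's tree keeps its namespaced tags.
    body = copy.deepcopy(body)
    prefix = "{%s}" % XHTML_NS
    for element in body.iter():
        # Drop the namespace so the output is plain HTML-style markup.
        if element.tag.startswith(prefix):
            element.tag = element.tag[len(prefix):]
    parts = [body.text or ""]
    for child in body:
        # tostring() also emits the text that follows each child element.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts).strip()
```

When the function returns None, a reader would fall back to the usual `<description>` element.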
His RSS 1.1 feed proves that even experts can make little mistakes. There is nothing out of the ordinary about this feed, except that Sam forgot to define one namespace in the
<Channel> element. If that namespace definition were present, the Planet plugin would be able to read the feed without any problems.
<?xml version="1.0" encoding="utf-8" ?> <Channel rdf:about="http://www.intertwingly.net/blog/" xmlns:admin="http://webns.net/mvcb/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:p="http://purl.org/net/rss1.1/payload#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns="http://purl.org/net/rss1.1#"> ... <content:encoded><p> <a href="http://koranteng.blogspot.com/2005/05/get-on-bus.html"> Koranteng Ofosu-Amaah</a>: <em> ...
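In the snippet above the `content:` prefix is used, but it does not appear among the namespaces declared on `<Channel>`. A namespace-aware XML parser refuses such a document outright, which the following Python sketch demonstrates. The helper name `parse_error` is made up for this example, and the sample documents in the usage note are reduced stand-ins for the real feed.

```python
import xml.etree.ElementTree as ET


def parse_error(xml_text):
    """Return the parser's error message, or None if the text parses cleanly."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return str(error)
    return None
```

Called on `<Channel xmlns="http://purl.org/net/rss1.1#"><content:encoded>x</content:encoded></Channel>` it returns an error message. After adding `xmlns:content="http://purl.org/rss/1.0/modules/content/"` to the `<Channel>` element it returns None.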
The next example shows how little mistakes can make life difficult for programmers of feed readers. Adam uses an RSS 2.0 feed. This is no surprise, because he is listed as one of the members of the RSS Advisory Board that published the RSS 2.0 spec.
The mistake is very small, but incredibly important for feed readers. When we are talking about XML and standards it usually doesn’t matter whether the mistake is large or small. If a feed does not conform 100%, it will be problematic.
<comments> http://www.curry.com/comments?u=gurry&amp;amp;p=7494&amp;amp;link=http%3A%2F%2Fwww.curry.com%2F2005%2F05%2F03%23a7494 </comments>
Did you notice it? Don’t worry if you didn’t. The problem is the
&amp;amp; constructs in the URL. The actual URL looks like this:
http://www.curry.com/comments?u=gurry&p=7494&link=http%3A%2F%2Fwww.curry.com%2F2005%2F05%2F03%23a7494
If we place that URL in an XML document we need to escape the &’s, because the & is a reserved character in XML. The & is encoded as a named character entity:
&amp;. So the XML document would look like this:
<comments> http://www.curry.com/comments?u=gurry&amp;p=7494&amp;link=http%3A%2F%2Fwww.curry.com%2F2005%2F05%2F03%23a7494 </comments>
In the feed, however, the URL is encoded twice, which becomes a problem when a feed reader offers that link to its users. Any user who clicks on that link will probably get a
404 Not found error because that URL simply does not exist. The URL was encoded twice, and only decoded once.
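The effect of encoding twice and decoding once is easy to reproduce. The Python sketch below uses a shortened version of the comments URL and the standard `escape` and `unescape` functions from `xml.sax.saxutils`. The `repair_double_encoding` helper is a hypothetical heuristic added for this example, not something the Planet plugin is known to do.

```python
from xml.sax.saxutils import escape, unescape

# Shortened version of the comments URL discussed above.
actual_url = "http://www.curry.com/comments?u=gurry&p=7494"

encoded_once = escape(actual_url)     # what the feed should contain
encoded_twice = escape(encoded_once)  # what a double-encoded feed contains

# An XML parser decodes character data exactly once.
seen_by_reader = unescape(encoded_twice)


def repair_double_encoding(url):
    """Undo one extra level of escaping if the URL still contains "&amp;"."""
    # Heuristic: a literal "&amp;" is very unlikely to be intended in a URL.
    return unescape(url) if "&amp;" in url else url
```

Here `seen_by_reader` still contains `&amp;` and therefore points to a URL that does not exist, while `repair_double_encoding(seen_by_reader)` gives back `actual_url`. The repair is a guess about the publisher's intent, so a cautious reader might apply it only to elements that are known to hold URLs.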
Bart Decrem, of Firefox fame, has a comment spam problem. Usually this is not a problem that is associated with feeds, unless the weblog offers a comment feed. But that is not the case here. Bart added a
<dc:contributor> element with some FOAF elements for each commenter on each of the stories in the main feed of his website. Normally this would be a nice idea.
Considering the number of comment spammers that have visited his website, this creates a bit of a problem, because the feed now contains an enormous number of these elements. For just 15 stories it contains 1536 contributors. That is 9220 lines of code or an unneeded increase of 414 KB.
<dc:contributor> <foaf:person foaf:name="big boobs"> <foaf:homepage rdf:resource="http://bigboobsz.w.interia.pl" /> <foaf:email rdf:resource="firstname.lastname@example.org" /> </foaf:person> </dc:contributor>
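With 1536 contributors spread over 15 stories, each story carries about 102 of these elements on average. A feed reader or feed author could spot this kind of bloat by counting the contributors per item, as in the Python sketch below. The function name `contributors_per_item` is made up for this example, and the Dublin Core namespace is assumed to be the usual `http://purl.org/dc/elements/1.1/`.

```python
import xml.etree.ElementTree as ET

# Assumption: the dc: prefix is bound to the standard Dublin Core namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"


def contributors_per_item(xml_text):
    """Return the number of <dc:contributor> children for each <item>."""
    root = ET.fromstring(xml_text)
    counts = []
    for element in root.iter():
        # Match <item> with or without a namespace (RSS 2.0 vs RSS 1.0).
        if element.tag.rsplit("}", 1)[-1] == "item":
            counts.append(len(element.findall("{%s}contributor" % DC_NS)))
    return counts
```

A reader could then decide to ignore the contributors of any item whose count exceeds some threshold of its own choosing.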
What about my own feeds? Well, I am ashamed to confess that my comment feeds are broken. Luckily my main feed works properly. I’m considering supporting multiple feeds, such as RSS 1.0, RSS 1.1, RSS 2.0, Atom 0.3 and Atom Draft 8 (which will be upgraded with each new draft until version 1.0 is finalized).