rakaz

about standards, webdesign, usability and open source

Weird feeds

Building a universal feed reader is not easy. All the different flavours of RSS are not exactly compatible with each other. Flavours evolve and new extensions are created.

In the case of Atom there is even an older unofficial standard (version 0.3) and the ever-changing IETF draft which should lead to version 1.0. It’s not easy, but I think I managed this quite well for my Nucleus Planet plugin.

Over the last few weeks I’ve encountered some really strange feeds. Some of them contain plain errors; others make very creative use of the standards. In the first case I’ve tried to make sure it won’t create problems for the plugin. In the second case I’ve tried to support it properly.

QuartzComps

QuartzComps uses an RSS 2.0 feed generated by WordPress 1.5. The feed uses 3 RSS 1.0 namespace extensions, which is not that unusual. Unfortunately, neither is the value of the content:encoded tag. Instead of offering a cleaned-up version of the content that is inserted in the XHTML page, it uses the exact same content, including some page-specific Javascript.

<item>
<title>Welcome To QuartzComps</title>
<link>http://quartzcomps.com/2005/04/24/1/</link>
...
<content:encoded><![CDATA[
    <script type="text/javascript">
    window.document.getElementById('post-1').parentNode.className += ' adhesive_post';
    </script>

    <p><img src='/wp-content/images/qc-icon.jpg' class='alignleft' /></p>
    QuartzComps.com is a weblog and file archive devoted to the vast possibilities of
    creating media using Apple&#8217;s new Quartz Composer application which shipped
    as part of the developer tools of Mac OS 10.4 (Tiger).  For more information about
    ...

http://quartzcomps.com/feed/

A workaround would be to move the Javascript to a more appropriate place, such as the head of the XHTML document, or to achieve the same effect without any Javascript. This way the feed is not ‘infected’ with unnecessary scripts. A real solution would be for WordPress to clean the content of the post before offering it in a feed.

The planet plugin also has a workaround for this potential problem. It simply strips out any Javascript it finds in the contents of the feed. It looks for <script> tags, event model attributes such as onmouseover and URLs that use the javascript: protocol.
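The three checks described above can be sketched roughly as follows. This is not the plugin's own code (the plugin is written in PHP); it is a minimal Python illustration, and the names `ScriptStripper`, `strip_scripts` and `URL_ATTRS` are made up for the example. It assumes that dropping every attribute starting with "on" is an acceptable approximation of "event model attributes".

```python
from html import escape
from html.parser import HTMLParser

# Attributes that may carry a URL. Assumption: this short list is enough for a sketch.
URL_ATTRS = {"href", "src", "action"}


class ScriptStripper(HTMLParser):
    """Rebuilds markup, dropping <script> blocks, on* attributes and javascript: URLs."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = 0

    def _clean_attrs(self, attrs):
        kept = []
        for name, value in attrs:
            if name.startswith("on"):  # onmouseover, onclick, ... (coarse check)
                continue
            if name in URL_ATTRS and value and value.strip().lower().startswith("javascript:"):
                continue
            kept.append(name if value is None else f'{name}="{escape(value)}"')
        return "".join(" " + attr for attr in kept)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script += 1
        elif not self.in_script:
            self.out.append(f"<{tag}{self._clean_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if tag != "script" and not self.in_script:
            self.out.append(f"<{tag}{self._clean_attrs(attrs)} />")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = max(0, self.in_script - 1)
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:  # script bodies arrive here and are discarded
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append(f"&#{name};")


def strip_scripts(markup):
    """Return the markup without scripts, event attributes or javascript: URLs."""
    parser = ScriptStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Calling `strip_scripts()` on the value of a content:encoded element like the one above would keep the paragraph and the image but drop the script block. A real sanitizer would probably work from a whitelist of allowed tags and attributes instead.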

Norman.walsh.name

Norman Walsh uses an Atom feed, but not your regular Atom 0.3 feed. Instead he uses a feed that is based on revision 5 of the IETF submission draft. This is the first and only feed I have encountered that uses this specification. The reason is probably that the Atom WG and the specification itself discourage the use of this draft:

The Atom format is a work-in-progress, and this draft is both incomplete and likely to change rapidly. As a result, THE FORMAT DESCRIBED BY THIS DRAFT SHOULD NOT BE DEPLOYED, either in production systems or in any non-experimental fashion on the Internet.

Nevertheless, it is a valid feed and thanks to this feed I’ve updated the Planet plugin to accept this draft and any other IETF draft for that matter, including the latest Draft 8.

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="/style/atom.xsl"?>
<feed xmlns="http://purl.org/atom/ns#draft-ietf-atompub-format-05"
      version="draft-ietf-atompub-format-05 : do not deploy"
      xml:lang="EN-us">
   <head>
      <title>norman.walsh.name</title>
      ...

http://norman.walsh.name/atom/whatsnew.xml
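One way to accept "any IETF draft" is to look at the namespace of the root element instead of comparing against one fixed value. The sketch below is a Python illustration, not the plugin's PHP code. It assumes that every draft uses a namespace of the same shape as the one in the feed above, with only the revision number changing; the function name `atom_flavour` and its return values are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

ATOM_03_NS = "http://purl.org/atom/ns#"
# Assumption: other drafts follow the pattern seen in Norman Walsh's feed.
DRAFT_NS = re.compile(r"^http://purl\.org/atom/ns#draft-ietf-atompub-format-(\d+)$")


def atom_flavour(xml_text):
    """Return 'atom-0.3', 'atom-draft-NN', or None if the root is not an Atom feed."""
    root = ET.fromstring(xml_text)
    if not root.tag.startswith("{"):
        return None  # root element is not in any namespace, e.g. <rss>
    namespace, local_name = root.tag[1:].split("}", 1)
    if local_name != "feed":
        return None
    if namespace == ATOM_03_NS:
        return "atom-0.3"
    match = DRAFT_NS.match(namespace)
    return f"atom-draft-{match.group(1)}" if match else None
```

Matching on the namespace pattern means a new draft revision does not need a code change, as long as the element names the reader relies on stay the same between drafts.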

Sam Ruby

Sam Ruby is an expert on this subject. He sits on the Atom WG, built www.feedvalidator.org together with Mark Pilgrim, and probably has more knowledge about feeds and XML in his left pinkie than I have in my whole body. His website offers feeds in more formats than you can imagine.

One of the formats he uses is RSS 2.0. The Planet plugin already had support for this format, so I was surprised that it didn’t work properly on Sam’s site. The feed didn’t give any errors; only the content was missing. It used to work before, but suddenly stopped working.

The reason is simple. Sam Ruby changed the way the content was encoded in the feed. He now uses a method that is known from the Atom spec and the RSS 1.1 Payload module: inline XHTML in the XHTML namespace. Where those two formats offer a container element for this, the RSS 2.0 specification does not. This didn’t seem to hold Sam back; he just placed the inline XHTML in the <item> element instead.

Is this legal according to the RSS 2.0 spec? Well, it is certainly not forbidden. In fact the specification specifically allows non-RSS elements as long as they are in their own namespace.

A RSS feed may contain elements not described on this page, only if those elements are defined in a namespace.

One thing Sam’s ingeniously created feed required was a change to my Planet plugin. It now works perfectly, but I imagine there are a lot of other feed readers that do not know how to deal with this method of providing inline XHTML content.

<item>
  <title>Glue Layer People</title>
  <link>http://www.intertwingly.net/blog/2005/05/04/Glue-Layer-People</link>
  <guid isPermaLink="false">http://www.intertwingly.net/blog/1971.html</guid>
  <body xmlns="http://www.w3.org/1999/xhtml">
    <p><a href="http://koranteng.blogspot.com/2005/05/get-on-bus.html">
    Koranteng Ofosu-Amaah</a>: <em>As an application designer my perspective
    has mostly been &#8220;inside out&#8221; and I&#8217;ve been forever
    amazed at the serendipitous magic that you glue layer people have
    been able to do with things I&#8217;ve built.</em></p>
    ...

http://www.intertwingly.net/blog/index.rss2
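Reading this kind of item comes down to looking for a <body> child in the XHTML namespace before falling back to the usual elements. The following Python sketch shows the idea; it is not the plugin's code, the helper names `inner_xml` and `item_content` are invented, and the sample feed is a shortened placeholder.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def inner_xml(element):
    """Serialise the children of an element, including the text between them."""
    parts = [element.text or ""]
    for child in element:
        # tostring() also emits the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def item_content(item):
    """Prefer an inline XHTML <body>; otherwise fall back to <description>."""
    body = item.find(f"{{{XHTML_NS}}}body")
    if body is not None:
        return inner_xml(body).strip()
    return (item.findtext("description") or "").strip()


# Placeholder feed with one item of each kind.
feed = """<rss version="2.0"><channel>
<item>
  <title>Glue Layer People</title>
  <body xmlns="http://www.w3.org/1999/xhtml"><p>Hello <em>world</em></p></body>
</item>
<item>
  <title>Plain item</title>
  <description>Just a summary</description>
</item>
</channel></rss>"""

items = ET.fromstring(feed).findall("./channel/item")
contents = [item_content(item) for item in items]
```

ElementTree serialises the XHTML elements with a namespace prefix, so a reader that wants to show the content in a page would still have to map them back to plain HTML tags.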

His RSS 1.1 feed proves that even experts can make little mistakes. There is nothing out of the ordinary about this feed, except that Sam forgot to define one namespace in the <Channel> element: the content prefix used by <content:encoded> is never declared. If that namespace definition were present, the Planet plugin would be able to read the feed without any problems.

<?xml version="1.0" encoding="utf-8" ?>
<Channel
  rdf:about="http://www.intertwingly.net/blog/"
  xmlns:admin="http://webns.net/mvcb/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:p="http://purl.org/net/rss1.1/payload#"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
  xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"
  xmlns="http://purl.org/net/rss1.1#">
  ...
  <content:encoded>&lt;p&gt;
    &lt;a href="http://koranteng.blogspot.com/2005/05/get-on-bus.html"&gt;
    Koranteng Ofosu-Amaah&lt;/a&gt;: &lt;em&gt;
    ...

http://www.intertwingly.net/blog/index.rss11
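The effect of a missing namespace declaration is easy to reproduce with any namespace-aware XML parser. The Python sketch below uses a cut-down placeholder document, not Sam's actual feed, and a hypothetical helper `parses`. The namespace URI is the one normally used for the RSS 1.0 content module.

```python
import xml.etree.ElementTree as ET

CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# Placeholder document: the content prefix is used but never declared.
broken = """<Channel xmlns="http://purl.org/net/rss1.1#"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  rdf:about="http://example.org/">
  <content:encoded>&lt;p&gt;Hello&lt;/p&gt;</content:encoded>
</Channel>"""


def parses(xml_text):
    """Return True if the text is accepted by a namespace-aware parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Adding the one missing declaration is enough to make the document parse.
fixed = broken.replace("<Channel ", f'<Channel xmlns:content="{CONTENT_NS}" ', 1)
```

Here `parses(broken)` is false because the parser rejects the unbound prefix, while `parses(fixed)` is true.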

Adam Curry

The next example shows how little mistakes can make life difficult for programmers of feed readers. Adam uses an RSS 2.0 feed. This is no surprise, because he is listed as one of the members of the RSS Advisory Board that published the RSS 2.0 spec.

The mistake is very small, but incredibly important for feed readers. When we are talking about XML and standards it usually doesn’t matter whether the mistake is large or small. If a feed does not conform 100%, it will be problematic.

<comments> http://www.curry.com/comments?u=gurry&amp;amp;p=7494&amp;amp;link=http%3A%2F%2Fwww.curry.com%2F2005%2F05%2F03%23a7494
</comments>

http://www.curry.com/xml/rss.xml

Did you notice it? Don’t worry if you didn’t. The problem is the &amp;amp; constructs in the URL. The actual URL looks like this:


http://www.curry.com/comments?u=gurry&p=7494&link=http%3A%2F%2Fwww.curry.com%2F2005%2F05%2F03%23a7494

If we place that URL in an XML document we need to escape the &’s, because the & is a reserved character in XML. The & is encoded as a named character entity: &amp;. So the XML document would look like this:

<comments>
http://www.curry.com/comments?u=gurry&amp;p=7494&amp;link=http%3A%2F%2Fwww.curry.com%2F2005%2F05%2F03%23a7494
</comments>

In Adam’s feed the URL is encoded twice, which becomes a problem when a feed reader offers that link to its users. Anyone who clicks the link will probably get a 404 Not Found error, because that URL simply does not exist: it was encoded twice but decoded only once.
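The round trip can be reproduced in a few lines. This Python sketch uses a shortened version of the URL and the standard library's `escape` function to stand in for whatever the feed generator does; the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

url = "http://www.curry.com/comments?u=gurry&p=7494"

once = escape(url)    # correct: each & becomes &amp;
twice = escape(once)  # the mistake: each & ends up as &amp;amp;

# An XML parser decodes the element content exactly once.
seen_when_correct = ET.fromstring(f"<comments>{once}</comments>").text
seen_when_doubled = ET.fromstring(f"<comments>{twice}</comments>").text
```

With the correct encoding the reader gets the original URL back. With the doubled encoding it gets a URL that still contains &amp;, so the server receives parameters named amp;p and amp;link instead of p and link.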

Bart Decrem

Bart Decrem, of Firefox fame, has a comment spam problem. Usually this is not a problem that is associated with feeds, unless the weblog offers a comment feed. But that is not the case here. Bart added a <dc:contributor> element with some FOAF elements for every commenter on each of his stories in the main feed of his website. Normally that would be a nice idea.

Considering the number of comment spammers that have visited his website this creates a bit of a problem, because the feed now contains an enormous number of these elements. For just 15 stories it contains 1536 contributors. That is 9220 lines of code or an unneeded increase of 414 Kb.

<dc:contributor>
  <foaf:person foaf:name="big boobs">
    <foaf:homepage rdf:resource="http://bigboobsz.w.interia.pl" />
    <foaf:email rdf:resource="xoring@mail.ru" />
  </foaf:person>
</dc:contributor>

http://decrem.com/bart/index-mozilla.rdf

Rakaz

What about my own feeds? Well, I am ashamed to confess that my comment feeds are broken. Luckily my main feed works properly. I’m considering supporting multiple feeds, such as RSS 1.0, RSS 1.1, RSS 2.0, Atom 0.3 and Atom Draft 8 (which will be upgraded with each new draft until version 1.0 is finalized).

6 Responses to “Weird feeds”

  1. Norman Walsh wrote on May 5th, 2005 at 6:17 pm

    Yes, publishing my feeds in Atom 05 was a questionable decision at best. It must have seemed like a good idea at the time, but I’ll migrate to Atom 1.0 as fast as possible.

    I actually used 08 for a moment or two, but that turned out to be a real mistake :-)

  2. rakaz wrote on May 5th, 2005 at 10:07 pm

    Hi Norman,

    Using draft 8 was actually a very good thing. I had just built in support for draft 5 when you changed your feed to draft 8, which caused my Planet plugin to stop working again. So you are the one responsible for me going all the way and supporting up to draft 8. :)

  3. Sam Ruby wrote on May 8th, 2005 at 2:27 am

    Good catches!

    My intent on my RSS 1.1 feed was to use the payload element, but due to a coding error, this was not done. I’ve now corrected this.

    As to my RSS 2.0 feed, many feeds use an xhtml:body element, such as Dare Obasanjo’s. The primary difference between his and mine is that I don’t duplicate this information. I use description for summaries of my longer entries.

  4. Mark wrote on May 8th, 2005 at 3:51 pm

    > Building a universal feed reader is not easy.

    So why don’t you re-use one that’s already written?

    http://feedparser.org/

  5. rakaz wrote on May 8th, 2005 at 5:03 pm

    Why not re-use your universal feed parser? Well, mainly because yours is written in Python and I use PHP. I also looked at some other parsers written in PHP, most notably Magpie, but that one is based on a different principle than what I was looking for.

  6. anand wrote on May 12th, 2005 at 3:55 pm

    Here is a request.

    Can you add an API call where I send in a feed and get back a structure of posts in return?

    Will be useful for building feed specific applications.

    TIA!