rakaz

about standards, webdesign, usability and open source

Reducing the bandwidth used by feeds

For the last couple of days I have been working on implementing proper Atom support for a proprietary content management system. Atom support itself wasn’t a big problem, but I did run into a problem implementing some of the more common techniques to save bandwidth. The result is a tool to check if your own weblog uses one or more of these techniques. For now it is called the “Bandwidth-saving Header Validator for Feeds”; I am still looking for a shorter name.

The idea is simple: only send the complete Atom feed to new consumers and when the feed has changed. Otherwise simply tell the consumer that the feed has not changed. Given that consumers try to retrieve the feed more often than the feed is updated, this saves considerable bandwidth. There are two main methods to achieve this effect. Before we get into how these methods work I must point out that both the server that sends the feed and the consumer must support these techniques. If the server and the consumer do not support the same technique, the complete feed will be sent every time, just to be sure.

Last-Modified / If-Modified-Since

The first method uses the Last-Modified and If-Modified-Since HTTP headers. Each time the feed is fetched, the server also sends the date the feed was last modified. The consumer stores that date, and the next time it requests the feed it asks the server to return the feed only if it has been modified since that date. If it hasn’t been modified since, the server will only return a 304 Not Modified status code, which lets the consumer know that nothing has changed.

GET /index.atom HTTP/1.0
Host: rakaz.nl

HTTP/1.0 200 OK
Content-Type: application/atom+xml
Content-Length: 43023
Last-Modified: Tue, 05 Dec 2006 13:04:54 GMT

<feed> … </feed>

GET /index.atom HTTP/1.0
Host: rakaz.nl
If-Modified-Since: Tue, 05 Dec 2006 13:04:54 GMT

HTTP/1.0 304 Not Modified
Content-Type: application/atom+xml
Content-Length: 0
Last-Modified: Tue, 05 Dec 2006 13:04:54 GMT
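
On the server side, the decision between 200 and 304 comes down to a date comparison. The following is a minimal Python sketch of that comparison, not the code from the content management system; the function name `conditional_response` and its return shape are made up for illustration, and it assumes the modification date is available as a timezone-aware UTC `datetime`.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a feed last changed at `last_modified`.

    `last_modified` is assumed to be a timezone-aware UTC datetime.
    `if_modified_since` is the raw request header value, or None.
    """
    # HTTP dates have one-second resolution, so drop microseconds first.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparsable header: fall back to the full feed
        # Only compare when the parsed date carries a timezone.
        if since is not None and since.tzinfo is not None:
            if last_modified <= since:
                return 304, headers  # nothing changed, send no body

    return 200, headers  # new consumer or changed feed: send everything
```

Called with the date from the example above and no request header, this yields a 200 and the same Last-Modified value; called again with that value as If-Modified-Since, it yields a 304.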

ETag / If-None-Match

The second method is very similar to the first, but instead of looking at the modification date it looks at a token that is updated every time the feed changes. Often this is an MD5 digest of the feed itself or some other form of hashing. The server sends this token by using the ETag header and the consumer stores this token for future use. If the consumer wants to know if the feed has been updated, it will send an If-None-Match header with the stored token. The server will return the complete feed if the token sent by the consumer is different from its own token. If both tokens are the same it will once again return a 304 Not Modified status code, indicating the feed has not been modified.

GET /index.atom HTTP/1.0
Host: rakaz.nl

HTTP/1.0 200 OK
Content-Type: application/atom+xml
Content-Length: 43023
ETag: "2938ef27a739cd30e30fe02339402aabf"

<feed> … </feed>

GET /index.atom HTTP/1.0
Host: rakaz.nl
If-None-Match: "2938ef27a739cd30e30fe02339402aabf"

HTTP/1.0 304 Not Modified
Content-Type: application/atom+xml
Content-Length: 0
ETag: "2938ef27a739cd30e30fe02339402aabf"
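
The token comparison can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `make_etag` and `respond` are hypothetical, the token is an MD5 digest of the feed bytes (one of the options mentioned above), and the header parsing is deliberately simple.

```python
import hashlib


def make_etag(feed_body: bytes) -> str:
    """Derive a quoted entity tag from the feed bytes using MD5."""
    return '"' + hashlib.md5(feed_body).hexdigest() + '"'


def respond(feed_body: bytes, if_none_match=None):
    """Return (status, headers, body) for a GET with an optional If-None-Match."""
    etag = make_etag(feed_body)
    headers = {"ETag": etag}

    if if_none_match is not None:
        # The header may carry several comma-separated tags, or "*".
        candidates = []
        for tag in if_none_match.split(","):
            tag = tag.strip()
            if tag.startswith("W/"):
                tag = tag[2:]  # weak tags are compared by their quoted part
            candidates.append(tag)
        if "*" in candidates or etag in candidates:
            return 304, headers, b""  # tokens match: the feed is unchanged

    headers["Content-Type"] = "application/atom+xml"
    return 200, headers, feed_body
```

Hashing the whole feed on every request is the simplest option; a real system would more likely store the token and recompute it only when the feed changes.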

A-IM: feed / IM: feed

These two methods were pretty simple to implement and worked flawlessly. My problems started when I wanted to implement a third method. First I’ll try to explain how this method works. The third method builds on the second and is sometimes called Delta encoding or RFC3229+feed. When the feed has changed, the server does not send the complete feed; it sends only the new or changed items.

Just like the previous method, the server will send a token by using the ETag header, and the consumer will store this token for future use. The difference is that when the consumer wants to update the feed it will send an additional A-IM header together with the If-None-Match header. This will let the server know that the consumer supports Delta encoding.

The server must now determine, solely based on the token, which items were already sent to the consumer, and whether any of those items were changed in the meantime or any new items were created. How the server determines this depends on the implementation, but in general each token represents a certain state. All the server has to do is keep a log of all changes and store the token together with information about what changed between states.

If nothing has changed, the server will simply send a 304 Not Modified status code, just like before. If something has changed, it will send a 226 IM Used status code, letting the consumer know that something has changed, and return a feed with only the changed items.

GET /index.atom HTTP/1.0
Host: rakaz.nl

HTTP/1.0 200 OK
Content-Type: application/atom+xml
Content-Length: 43023
ETag: "19b32871240d41ad4234-49-50-51"

<feed> … [3 items] … </feed>

GET /index.atom HTTP/1.0
Host: rakaz.nl
A-IM: feed
If-None-Match: "19b32871240d41ad4234-49-50-51"

HTTP/1.0 226 IM Used
IM: feed
ETag: "29341230cd2823aa2bcd-49-50-51-52"
Content-Type: application/atom+xml
Content-Length: 438

<feed> … [1 item] … </feed>
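
One possible way to keep that log of states is to remember, for every token handed out, which version of each item it stood for. The Python sketch below is an assumption-laden illustration and not the implementation described in this post: the `FeedState` class, its in-memory dictionaries and the token format are all hypothetical, the A-IM header is parsed naively, and deleted items are not handled.

```python
import hashlib


class FeedState:
    """Remembers which item versions each issued token stood for."""

    def __init__(self):
        self.items = {}      # item id -> item content (hypothetical store)
        self.snapshots = {}  # token -> {item id: content digest}

    def _digests(self):
        return {i: hashlib.md5(c.encode()).hexdigest()
                for i, c in self.items.items()}

    def _token(self, digests):
        joined = ",".join(f"{i}:{d}" for i, d in sorted(digests.items()))
        return '"' + hashlib.md5(joined.encode()).hexdigest() + '"'

    def handle(self, if_none_match=None, a_im=None):
        """Return (status, headers, ids of the items to put in the body)."""
        digests = self._digests()
        token = self._token(digests)
        self.snapshots[token] = digests  # log the current state
        headers = {"ETag": token}

        if if_none_match == token:
            return 304, headers, []  # consumer is up to date

        known = self.snapshots.get(if_none_match)
        accepts_feed = a_im is not None and "feed" in [
            value.strip() for value in a_im.split(",")]
        if accepts_feed and known is not None:
            # Send only items that are new or differ from the logged state.
            changed = sorted(i for i, d in digests.items() if known.get(i) != d)
            headers["IM"] = "feed"
            return 226, headers, changed

        # Unknown token or no delta support: send the complete feed.
        return 200, headers, sorted(digests)
```

With items 49, 50 and 51 in the store, a first request gets all three; after item 52 is added, a request carrying the old token and `A-IM: feed` gets a 226 with only item 52, which mirrors the exchange above.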

After implementing this third method I noticed that instead of the 226 IM Used status code I got a 500 Internal Server Error status code. I returned the 226 IM Used status code in my script, but somehow it got changed to 500 along the way. It took me a while to find the source of the problem – Apache 1.3. I still use Apache 1.3 on my development machine and even on some of my servers. Apparently Apache uses a static list of status codes and when it encounters a status code it does not recognize it will replace it with 500 Internal Server Error. Great!

Given that the 226 IM Used status code is required by the RFC3229 specification it is impossible to support Delta encoding on Apache 1.3. Luckily this problem has been fixed in Apache 2, but not everybody has control over which version of Apache they use. Some shared hosting solutions still use Apache 1.3 because they consider Apache 2 to be less stable than 1.3. So I have to check the server version and will not use Delta encoding on Apache versions older than 2.
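
A version check like this could be sketched as follows in Python. It assumes the server identifies itself through the CGI `SERVER_SOFTWARE` variable in the usual `Apache/1.3.37 (Unix)` form; the function name `supports_226` is hypothetical, and treating anything unrecognised as unsupported is a cautious choice of this sketch, not something the post prescribes.

```python
import re


def supports_226(server_software: str) -> bool:
    """Guess whether the server will pass a 226 status through unchanged."""
    match = re.match(r"Apache/(\d+)", server_software)
    if match is None:
        # Not Apache, or the version is hidden: be cautious and say no.
        return False
    return int(match.group(1)) >= 2
```

In a CGI environment the input would typically be `os.environ.get("SERVER_SOFTWARE", "")`; when the function returns False the script can fall back to sending the complete feed with a normal 200.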

Testing the headers of your own feeds

The content management system I am working on is closed-source, so I am not able to share any of that work, but I am able to make the tool I created for testing public. With this tool you can check if your own feed supports any of the features mentioned above. It currently supports fully automated testing of the first two methods. It will also allow you to test Delta encoding, but for that you need to run the first step of the test, manually add a new item to your weblog, and then continue the test by clicking on the link given in the first step. Enjoy!

3 Responses to “Reducing the bandwidth used by feeds”

  1. Philip Withnall wrote on December 5th, 2006 at 11:49 am

    Very insightful. I must admit I’d never thought about this before. Cheers!

  2. Sam Ruby wrote on December 5th, 2006 at 12:26 pm

    Have you also looked into Content-Encoding: gzip, deflate?

  3. rakaz wrote on December 5th, 2006 at 3:53 pm

    Sam: The content management system already handles gzip encoding transparently, so I didn’t need to look into it and forgot to include it in the validator. But it is a good idea, so I’ve just added it.