XML Plugins 0.14
Tuesday, 7 December 2004
The latest combined release of RSS/Atom Reader, RSS/Atom Aggregator and XBEL Reader now has improved handling of different character sets, such as the various iso-8859 standards, Windows codepages, Shift JIS, EUC-JP, Big5 and GB2312. In addition, these plugins now also support utf-8, utf-16 and utf-32 (big endian and little endian, with or without a byte order mark).
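To illustrate what "with or without a byte order mark" involves, here is a minimal sketch of byte order mark detection. It is written in Python for brevity and is not the plugin's PHP code; the function names `sniff_bom` and `decode_feed` and the utf-8 fallback are assumptions made for this example only.

```python
import codecs

# Check the 4-byte marks first: the UTF-32 LE mark begins with the
# same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no mark is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_feed(data: bytes, fallback: str = "utf-8") -> str:
    """Strip a byte order mark if there is one and decode the rest.

    The fallback encoding is an assumption; a real reader would also
    look at the XML declaration or the HTTP headers.
    """
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)
```

When no mark is present the sketch simply falls back to a default, which is the part that takes the most care in practice.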
One thing I have learned in the last couple of days is that working with different character encodings is a pain. Working with different character encodings in PHP is an even greater pain. One of the main problems is that PHP’s internal character encoding handling is lousy. The PHP XML extension only supports three encodings: utf-8, iso-8859-1 and us-ascii. Other encodings can only be used after conversion, either with a PHP extension called mbstring, which isn’t always installed by default and doesn’t support many one-byte encodings, or with an even rarer external application called iconv.
For these plugins I did not want to rely on any of these ‘optional’ PHP extensions, so I’ve created a new class just for character set conversion. Using so-called matrices it can convert many different encodings, and for the more complicated encodings I’ve also created special algorithms. You can simply pass in a string encoded in any of a large number of character sets and get back the utf-8 equivalent. The resulting string is then parsed in utf-8 mode by the XML parser and finally converted, using the same class, to us-ascii, with both named and numeric entities representing the non-ASCII Unicode characters.
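The two conversion steps described above can be sketched roughly as follows. This is an illustration in Python, not the actual PHP class: the tiny tables, the function names and the choice of ISO-8859-15 as the example encoding are all assumptions, and a real table would cover every byte value of each supported encoding.

```python
# Hypothetical partial table: the ISO-8859-15 bytes used below that
# differ from ISO-8859-1. Bytes not listed map to the same code point.
LATIN9_OVERRIDES = {0xA4: 0x20AC, 0xBC: 0x0152, 0xBD: 0x0153}

# Hypothetical excerpt of a code point to named entity table.
NAMED = {0x20AC: "euro", 0x0152: "OElig", 0x0153: "oelig", 0xE9: "eacute"}


def table_to_utf8(data: bytes, table: dict) -> bytes:
    """Step 1: map each input byte to a code point, then emit utf-8."""
    return "".join(chr(table.get(b, b)) for b in data).encode("utf-8")


def utf8_to_ascii_entities(data: bytes) -> str:
    """Step 2: turn utf-8 into us-ascii, using a named entity where the
    table has one and a numeric entity otherwise."""
    out = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                   # plain ASCII passes through
        elif cp in NAMED:
            out.append("&%s;" % NAMED[cp])   # named entity
        else:
            out.append("&#%d;" % cp)         # numeric entity
    return "".join(out)
```

In this sketch the XML parsing that sits between the two steps is left out; only the conversions on either side of it are shown.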
The end result is a large headache and plugins which allow you to put practically any feed on your weblog, regardless of the character encoding of the feed and the character encoding of your own weblog.
For those who want to see this in action I’ve created two demo pages:
Character set demo, UTF demo
Hi,
The only feed I can get to work is the Slashdot one; for example, the following don’t seem to be parsed:
http://feed.newsxs.com/?s=1079
http://www.macbidouille.com…
Could you post the character set conversion class here? I’d love to see how you did it :P
Mathieu: The feeds you linked seem to work fine here.
Daniel: I am also going to release the character set conversion class separately. In the meantime you can simply download the plugin and look in the xmlsupport folder. Everything in there is part of the conversion class.
Thank-o!
rakaz, why add 1MB of encoding conversion PHP classes (this is insane!) when you have the iconv() and mb_convert_encoding() PHP functions? In fact, the iconv PHP extension is included in PHP 5 by default, so it is a bad habit to substitute PHP extensions with myriads of duplicate classes.
Rakaz,
Good work on this library — I’m currently doing some work with an old server that doesn’t have iconv or mb_convert_encoding installed, so this is incredibly useful to me.
One possible minor correction though: while digging in to entity_named.php, I think I may have found a typo. œ is set to å, but I think it should be œ instead. (I switched it and it seems to work correctly.)
Oops. I wasn’t sure how entities were handled in my last comment. That last sentence should be:
œ is set to å, but I think it should be œ instead.