XML Plugins 0.14
Tuesday, 7 December 2004
The latest combined release of RSS/Atom Reader, RSS/Atom Aggregator and XBEL Reader has improved handling of character sets, including the various iso-8859 standards, Windows codepages, Shift JIS, EUC-JP, Big5 and GB2312. In addition, these plugins now also support utf-8, utf-16 and utf-32 (big endian and little endian, with or without a byte order mark).
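To give an idea of what "with or without a byte order mark" involves, here is a minimal sketch of detecting the mark at the start of a feed. It is written in Python purely for illustration (the plugins themselves are PHP), and the names `sniff_bom` and `to_utf8` are hypothetical, not taken from the plugin code.

```python
import codecs

# The 4-byte utf-32 marks are tested before the 2-byte utf-16 marks, because
# the utf-32 little endian mark starts with the same two bytes as the
# utf-16 little endian mark.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, mark_length), or (None, 0) if no mark is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def to_utf8(data: bytes, fallback: str = "utf-8") -> bytes:
    """Strip the mark (if any) and re-encode the content as utf-8.

    `fallback` is an assumption: it stands in for whatever encoding the
    feed declares when no mark is found.
    """
    encoding, skip = sniff_bom(data)
    text = data[skip:].decode(encoding or fallback)
    return text.encode("utf-8")
```

For a feed without a mark, the encoding has to come from somewhere else, such as the XML declaration or the HTTP headers; the `fallback` parameter only hints at that step.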
One thing I have learned in the last couple of days is that working with different character encodings is a pain. Working with different character encodings in PHP is an even greater pain. One of the main problems is that PHP’s internal character encoding handling is lousy. The PHP XML extension only supports three encodings: utf-8, iso-8859-1 and us-ascii. Other encodings can only be used after conversion, using either mbstring, a PHP extension which isn’t always installed by default and doesn’t support many one-byte encodings, or iconv, which is even more rarely available.
For these plugins I did not want to rely on any of these ‘optional’ PHP extensions, so I’ve created a new class just for character set conversion. Using so-called matrices it can convert many different encodings, and for the more complicated encodings I’ve also created special algorithms. You can simply pass in a string encoded in any of a large number of character sets and get back the utf-8 equivalent. The resulting string is then parsed in utf-8 mode by the XML parser and finally converted, using the same class, to us-ascii, with both named and numeric entities representing the non-ASCII Unicode characters.
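The two conversion steps can be sketched roughly as follows. This is not the actual class: it is Python rather than PHP, it assumes the "matrices" are per-encoding lookup tables from byte values to Unicode code points, and the table `CP1252_SAMPLE` and both function names are hypothetical. The table is heavily truncated; a real one would cover every byte from 0x80 to 0xFF.

```python
from html.entities import codepoint2name

# Hypothetical, truncated lookup table for one single-byte encoding
# (a handful of windows-1252 bytes mapped to Unicode code points).
CP1252_SAMPLE = {
    0x80: 0x20AC,  # euro sign
    0x93: 0x201C,  # left double quotation mark
    0x94: 0x201D,  # right double quotation mark
    0xE9: 0x00E9,  # e with acute accent
}


def single_byte_to_utf8(data: bytes, table: dict) -> bytes:
    """Convert a single-byte encoded string to utf-8 via a lookup table."""
    out = []
    for byte in data:
        if byte < 0x80:
            # The lower half is plain us-ascii in these encodings.
            out.append(chr(byte))
        else:
            # Bytes missing from the table become U+FFFD (replacement char).
            out.append(chr(table.get(byte, 0xFFFD)))
    return "".join(out).encode("utf-8")


def utf8_to_ascii_entities(data: bytes) -> str:
    """Convert utf-8 to us-ascii, using named entities where one exists
    and numeric entities otherwise."""
    out = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")
        else:
            out.append("&#" + str(cp) + ";")
    return "".join(out)
```

With this sketch, the windows-1252 bytes `caf\xe9 \x80` become the utf-8 string "café €" after the first step and `caf&eacute; &euro;` after the second. Multi-byte encodings such as Shift JIS or Big5 need more than a 128-entry table, which is presumably where the special algorithms come in.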
The end result is a large headache and plugins which allow you to put practically any feed on your weblog, regardless of the character encoding of the feed and the character encoding of your own weblog.