0.12 0.13 – UTF-8 Support
Wednesday, 1 December 2004
As some users have noticed, my XML plugins for reading Atom and RSS were a bit lacking when handling UTF-8 encoded feeds. XML Plugins 0.13 should solve this completely, and it also provides charset conversion routines for most other encodings. Basically everything except two-byte Japanese is now supported.
After releasing version 0.12 I was left with an unsatisfied feeling. Sure, it was working properly and you could easily use feeds in various character set encodings, but it was slow. Very slow. The slowdown was caused by the charset conversion library I ‘borrowed’.
It took a couple of hours, but I’ve completely rewritten the conversion routines. It works basically like this: included with the plugin are 63 small PHP files which contain mapping matrices, one matrix for every supported charset. For every valid character in the source charset, the matrix contains a numeric Unicode entity. Once the source file has been parsed by the XML parser, the proper matrix is used to convert charset-specific characters to their numeric entity equivalents.
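The table lookup described above can be sketched roughly as follows. This is an illustration in Python, not the plugin's PHP code: the table `WINDOWS_1252_SAMPLE` and the function `to_entities` are hypothetical names, the table holds only a handful of windows-1252 entries, and passing plain ASCII through unchanged is an assumption made for brevity.

```python
# Hypothetical slice of a mapping table for windows-1252:
# source byte value -> Unicode code point. A real table would
# cover every valid character of the charset.
WINDOWS_1252_SAMPLE = {
    0x80: 0x20AC,  # euro sign
    0x93: 0x201C,  # left double quotation mark
    0x94: 0x201D,  # right double quotation mark
    0xE9: 0x00E9,  # e with acute accent
}


def to_entities(data: bytes, table: dict) -> str:
    """Replace each non-ASCII byte with a numeric entity taken from the table."""
    out = []
    for byte in data:
        if byte < 0x80:
            # Assumption: plain ASCII is passed through unchanged.
            out.append(chr(byte))
        elif byte in table:
            out.append(f"&#{table[byte]};")
        else:
            # Assumption: bytes missing from the table become "?".
            out.append("?")
    return "".join(out)


print(to_entities(b"caf\xe9 costs \x80 5", WINDOWS_1252_SAMPLE))
```

Because each charset is just a lookup table, adding another single-byte charset in this sketch means adding another dictionary, which is presumably why one small file per charset is a convenient layout.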
If the source is encoded in UTF-8 it works a bit differently. Using a UTF-8 decoding algorithm, each ‘multi-byte’ character is mapped to its entity equivalent. The end result is the same as for any of the other supported character sets.
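For readers curious what such a decoding step might look like, here is a minimal sketch in Python. It is not the plugin's actual algorithm: the function name `utf8_to_entities` is hypothetical, malformed input is replaced with "?" as an arbitrary choice, and for brevity the sketch does not reject overlong three- and four-byte forms or surrogate code points.

```python
def utf8_to_entities(data: bytes) -> str:
    """Turn every multi-byte UTF-8 sequence into a numeric entity."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # Single-byte (ASCII) characters are copied as-is.
            out.append(chr(lead))
            i += 1
            continue
        # The lead byte tells us the sequence length and holds the top bits.
        if 0xC2 <= lead <= 0xDF:
            length, cp = 2, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            length, cp = 3, lead & 0x0F
        elif 0xF0 <= lead <= 0xF4:
            length, cp = 4, lead & 0x07
        else:
            out.append("?")  # stray continuation byte or invalid lead byte
            i += 1
            continue
        tail = data[i + 1:i + length]
        # Every continuation byte must look like 10xxxxxx.
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            out.append("?")
            i += 1
            continue
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)  # append six payload bits per byte
        out.append(f"&#{cp};")
        i += length
    return "".join(out)


print(utf8_to_entities("café €".encode("utf-8")))
```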
The next step is easy: it cleans up any pre-existing named and numeric entities and, where appropriate, converts any leftover numeric entities back to their named equivalents. So in the end the result is properly encoded using entities, any existing entities are cleaned up, and you have something that can be included in any HTML or XHTML document, regardless of the character set of the output.
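The clean-up step could look something like the sketch below, again in Python rather than the plugin's PHP. The function name `normalize_entities` is hypothetical. It relies on the HTML 4 entity tables in Python's standard `html.entities` module, and leaving unknown named entities untouched is an assumption, since the post does not say how those are handled.

```python
import re
from html.entities import codepoint2name, name2codepoint

# Matches decimal (&#233;), hexadecimal (&#xE9;) and named (&eacute;) entities.
_ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def normalize_entities(text: str) -> str:
    """Rewrite every entity in one consistent form.

    Code points with an HTML 4 name become named entities;
    everything else becomes a decimal numeric entity.
    """
    def repl(match):
        body = match.group(1)
        if body.startswith(("#x", "#X")):
            cp = int(body[2:], 16)
        elif body.startswith("#"):
            cp = int(body[1:])
        elif body in name2codepoint:
            cp = name2codepoint[body]
        else:
            # Assumption: unknown names are left exactly as found.
            return match.group(0)
        name = codepoint2name.get(cp)
        return f"&{name};" if name else f"&#{cp};"

    return _ENTITY.sub(repl, text)


print(normalize_entities("caf&#xE9; &#8364; &#1046;"))
```

Since the output contains only ASCII characters plus entities, it can be dropped into a page served in any charset, which matches the goal described above.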
Supported character sets, besides UTF-8:
cp037, cp424, cp437, cp500, cp737, cp775, cp850, cp852, cp8855, cp856, cp857, cp860, cp861, cp862, cp863, cp864, cp865, cp866, cp869, cp874, cp875, cp1006, cp1026, gsm0338, iso-8859-1, iso-8859-2, iso-8859-3, iso-8859-4, iso-8859-5, iso-8859-6, iso-8859-7, iso-8859-8, iso-8859-9, iso-8859-10, iso-8859-11, iso-8859-13, iso-8859-14, iso-8859-15, iso-8859-16, koi8-r, koi8-u, mazovia, nextstep, stdenc, symbol, turkish, us-ascii, us-ascii-quotes, windows-1250, windows-1251, windows-1252, windows-1253, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, x-mac-ce, x-mac-cyrillic, x-mac-greek, x-mac-icelandic, x-mac-roman, zdingbat.
big5, gb12345, gb1988, gb2312, jis0201, jis0208, jis0212, ksc5601, shiftjis, tis-620.