rakaz

about standards, webdesign, usability and open source

XML Plugins 0.13 – UTF-8 Support

As some users have noticed, my XML plugins for reading Atom and RSS were a bit lacking when handling UTF-8 encoded feeds. XML Plugins 0.13 should solve this completely, and it also provides charset conversion routines for most other encodings. Basically everything except two-byte Japanese is now supported.

After releasing version 0.12 I was left with an unsatisfied feeling. Sure, it was working properly and you could easily use feeds in various character set encodings, but it was slow. Very slow. The slowdown was caused by the charset conversion library I ‘borrowed’.

It took a couple of hours, but I’ve completely rewritten the conversion routines. It works basically like this: included with the plugin are 63 small PHP files, each containing a mapping matrix for one supported charset. For every valid character in the source charset, the matrix contains a numeric Unicode entity. Once the source file has been parsed by the XML parser, the proper matrix is used to convert charset-specific characters to their numeric entity equivalents.
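To make the idea of a mapping matrix concrete, here is a minimal sketch in Python (the plugin itself is PHP, and this is not its actual code). The table is a hypothetical four-entry excerpt for windows-1252, and the name `to_numeric_entities` and the `?` placeholder for unmapped bytes are assumptions made for illustration.

```python
# Hypothetical excerpt of a mapping table for windows-1252:
# source byte value -> Unicode code point. A real table would
# cover every valid byte of the charset.
WINDOWS_1252 = {
    0x80: 0x20AC,  # euro sign
    0x93: 0x201C,  # left double quotation mark
    0x94: 0x201D,  # right double quotation mark
    0xE9: 0x00E9,  # e with acute accent
}


def to_numeric_entities(data: bytes, table: dict) -> str:
    """Replace every non-ASCII byte with its numeric entity."""
    out = []
    for byte in data:
        if byte < 0x80:
            # Plain ASCII is identical in these charsets; pass it through.
            out.append(chr(byte))
        elif byte in table:
            out.append("&#%d;" % table[byte])
        else:
            # Assumption: bytes missing from the table become a placeholder.
            out.append("?")
    return "".join(out)


print(to_numeric_entities(b"caf\xe9 costs \x80 5", WINDOWS_1252))
```

Because each charset is just a lookup table, supporting another single-byte charset means adding one more table rather than more code.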

If the source is encoded in UTF-8 it works a bit differently. Using a UTF-8 decoding algorithm, each ‘multi-byte’ character is mapped to its entity equivalent. The end result is the same as for any of the other supported character sets.
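The UTF-8 case can be sketched the same way. The Python function below is an illustration of a generic UTF-8 decoder that emits numeric entities, not the plugin's own routine. The name `utf8_to_entities` and the `?` placeholder for malformed input are assumptions, and for brevity it does not reject overlong encodings.

```python
def utf8_to_entities(data: bytes) -> str:
    """Decode UTF-8 by hand, turning multi-byte sequences into numeric entities."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # Single-byte (ASCII) character: copy as-is.
            out.append(chr(lead))
            i += 1
            continue
        # The lead byte announces the sequence length and carries the top bits.
        if lead >> 5 == 0b110:
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            length, cp = 4, lead & 0x07
        else:
            # Assumption: an invalid lead byte becomes a placeholder.
            out.append("?")
            i += 1
            continue
        tail = data[i + 1:i + length]
        # Every continuation byte must look like 10xxxxxx.
        if len(tail) != length - 1 or any(b >> 6 != 0b10 for b in tail):
            out.append("?")
            i += 1
            continue
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)  # append six payload bits per byte
        out.append("&#%d;" % cp)
        i += length
    return "".join(out)


print(utf8_to_entities("caf\u00e9 \u20ac".encode("utf-8")))
```

The output has the same shape as the table-driven conversion, which is why the later clean-up step does not need to know what the source encoding was.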

The next step is easy: it cleans up any pre-existing named and numeric entities and, where appropriate, converts any left-over numeric entities back to their named equivalents. So, in the end the result is properly encoded using entities, any existing entities are cleaned up, and you have something that can be included in any HTML or XHTML document, regardless of the character set of the output.
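A rough sketch of that clean-up step, again in Python rather than the plugin's PHP. It assumes "named equivalent" means the HTML 4 entity names, which Python exposes through `html.entities`; the function name `normalize_entities` is hypothetical, and leaving unknown named entities untouched is an assumption.

```python
import re
from html.entities import codepoint2name, name2codepoint

# Matches hexadecimal, decimal and named entities such as &#x20AC; &#233; &amp;
ENTITY = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def normalize_entities(text: str) -> str:
    """Rewrite every entity in one canonical form, preferring named entities."""
    def replace(match):
        body = match.group(1)
        if body[:2] in ("#x", "#X"):
            cp = int(body[2:], 16)
        elif body.startswith("#"):
            cp = int(body[1:])
        elif body in name2codepoint:
            cp = name2codepoint[body]
        else:
            # Assumption: unknown named entities are left untouched.
            return match.group(0)
        name = codepoint2name.get(cp)
        # Use the HTML 4 name if one exists, otherwise a decimal entity.
        return "&%s;" % name if name else "&#%d;" % cp

    return ENTITY.sub(replace, text)


print(normalize_entities("caf&#233; &#x20AC; &amp; &#1575;"))
```

Characters without an HTML 4 name, such as the Arabic letter in the example, simply stay numeric, so the output remains pure ASCII whatever the page's own charset is.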

Currently supported:

cp037, cp424, cp437, cp500, cp737, cp775, cp850, cp852, cp855, cp856, cp857, cp860, cp861, cp862, cp863, cp864, cp865, cp866, cp869, cp874, cp875, cp1006, cp1026, gsm0338, iso-8859-1, iso-8859-2, iso-8859-3, iso-8859-4, iso-8859-5, iso-8859-6, iso-8859-7, iso-8859-8, iso-8859-9, iso-8859-10, iso-8859-11, iso-8859-13, iso-8859-14, iso-8859-15, iso-8859-16, koi8-r, koi8-u, mazovia, nextstep, stdenc, symbol, turkish, us-ascii, us-ascii-quotes, windows-1250, windows-1251, windows-1252, windows-1253, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, x-mac-ce, x-mac-cyrillic, x-mac-greek, x-mac-icelandic, x-mac-roman, zdingbat.

Coming soon:

big5, gb12345, gb1988, gb2312, jis0201, jis0208, jis0212, ksc5601, shiftjis, tis-620.

10 Responses to “XML Plugins 0.13 – UTF-8 Support”

  1. Aaron wrote on December 2nd, 2004 at 10:44 pm

    Hey Rakaz, MSN came out with their own blog. Check out http://spaces.msn.com

  2. Mehran wrote on December 3rd, 2004 at 12:24 am

    This is great, it works fine for me. First I tried the .12 version but it didn’t work, then I found out you released .13, and when I installed it, it worked fine. I didn’t test everything, but the RSS/Atom Reader works fine so far.
    I will test it with different feeds and will tell you if I find any problem. Good work.

    The feed that you put as a sample on your website is in Arabic and, as you know, Arabic and Persian are written right to left. That means you must set the direction for that feed to something like:
    direction: rtl;
    and after that everything would be correct.

    Thank you very much for this great plug-in. It is really useful for me.

  3. rakaz wrote on December 3rd, 2004 at 12:30 am

    Mehran: I am aware of the right-to-left problem. However, I am not sure yet how to solve it. I’ll take a look.

  4. Mehran wrote on December 3rd, 2004 at 12:55 am

    Try this, this should fix the problem.

    Arabic Test Feed
    ……

    .boxPersian{
    direction: rtl;
    }

    If you need a Persian feed, you can use the one from the BBC. Here is the link:
    http://www.bbc.co.uk/persian/index.rdf

    Let me know if you need any help.

    Thanks.

  5. Mehran wrote on December 3rd, 2004 at 12:58 am

    I think your comment system parses the HTML code, so I will just repeat it here:

    h4- Arabic Test Feed -/h4
    div class="boxPersian"- …… -/div

  6. rakaz wrote on December 3rd, 2004 at 1:22 am

    Mehran: the actual HTML or CSS code isn’t the problem. Mostly it is deciding which character set is left-to-right and which is right-to-left, and how to handle this properly even when you have an aggregated feed with different character sets, some of which run one way while the others run the other way.

  7. Mehran wrote on December 3rd, 2004 at 1:39 am

    As far as I know there are 3 languages that use a right-to-left script: Persian (Farsi), Arabic and Hebrew.

    I think the best way is to add a new parameter to your skin var. Through that parameter users can choose which character set or encoding they need, and if the text needs to be rendered right to left, they can define that by creating a CLASS or ID for the root tag. Then, when you render feeds, each one appears in the proper direction and encoding.

    I will check tonight to see if I can find a better suggestion.

  8. rakaz wrote on December 5th, 2004 at 1:13 am

    I’ve made some progress on the character set conversion algorithm. The next version will definitely support the following additional character sets: Big5, GB2312, Shift-JIS, JIS and EUC-JP.

  9. rakaz wrote on December 6th, 2004 at 1:28 am

    More progress… I’ve rewritten some of the conversion routines to make them even faster. Also, I am now using UTF-8 as the internal format passed to the XML parser, which makes a little more sense. Now I just have to clean up the entity routines and add some additional character sets…

  10. kostia wrote on May 10th, 2005 at 11:55 pm

    I’ve been trying to get the RSSAtom plugin to work, but wherever I put it in my main index skin, the page stops parsing at that point. Everything before <%RSSAtom [etc.] displays fine, but after that, nothing. No error messages, no output, the page just stops drawing.

    Any ideas?