XML Plugins 0.12 0.13 – UTF-8 Support
Wednesday, 1 December 2004
As some users have noticed, my XML plugins for reading Atom and RSS were a bit lacking when handling UTF-8 encoded feeds. XML Plugins 0.13 should solve this completely, and it also provides charset conversion routines for most other encodings. Basically everything except two-byte Japanese is now supported.
After releasing version 0.12 I was left with an unsatisfied feeling. Sure, it was working properly and you could easily use feeds in various character set encodings, but it was slow. Very slow. The slowdown was caused by the charset conversion library I ‘borrowed’.
It took a couple of hours, but I’ve completely rewritten the conversion routines. It works basically like this: included with the plugin are 63 small PHP files which contain mapping matrices, one matrix for every supported charset. For every valid character in the source charset the matrix contains a numeric Unicode entity. Once the source file is parsed by the XML parser, the proper matrix is used to convert charset-specific characters to their numeric entity equivalents.
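The table lookup described above can be sketched roughly as follows. This is an illustration in Python, not the plugin's PHP code: the name `bytes_to_entities`, the three-entry `ISO_8859_2_SAMPLE` table, and the choice to pass bytes below 0x80 through unchanged are all assumptions made for the example. A real table would cover every valid byte of the charset, which matters for encodings such as cp037 that are not ASCII-compatible.

```python
# Hypothetical excerpt of a mapping table: source byte -> Unicode code point.
# Only three ISO-8859-2 bytes are listed here to keep the sketch short.
ISO_8859_2_SAMPLE = {
    0xA1: 0x0104,  # LATIN CAPITAL LETTER A WITH OGONEK
    0xB1: 0x0105,  # LATIN SMALL LETTER A WITH OGONEK
    0xE8: 0x010D,  # LATIN SMALL LETTER C WITH CARON
}


def bytes_to_entities(data: bytes, table: dict) -> str:
    """Replace every non-ASCII byte with a numeric entity taken from `table`."""
    out = []
    for byte in data:
        if byte < 0x80:
            # Assumption: the charset is ASCII-compatible in the low range.
            out.append(chr(byte))
        elif byte in table:
            out.append("&#%d;" % table[byte])
        else:
            # Assumption: unmapped bytes become U+FFFD (replacement character).
            out.append("&#65533;")
    return "".join(out)


if __name__ == "__main__":
    print(bytes_to_entities(b"\xe8aj", ISO_8859_2_SAMPLE))
```

Because each byte needs only one dictionary lookup, this kind of conversion stays cheap even for large feeds.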
If the source is encoded in UTF-8 it works a bit differently. Using a UTF-8 decoding algorithm, each ‘multi-byte’ character is mapped to its entity equivalent. The end result is the same as for any of the other supported character sets.
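For illustration, a UTF-8 decoding step of this kind could look like the Python sketch below. It is not the plugin's actual routine: the function name `utf8_to_entities` is made up, invalid bytes are replaced with U+FFFD by assumption, and the sketch skips some strictness checks (overlong three- and four-byte forms, surrogates) that a production decoder would add.

```python
def utf8_to_entities(data: bytes) -> str:
    """Turn each multi-byte UTF-8 sequence into a decimal numeric entity."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # Plain ASCII is copied through unchanged.
            out.append(chr(lead))
            i += 1
            continue
        # The lead byte tells us the sequence length and its payload bits.
        if 0xC2 <= lead <= 0xDF:
            length, cp = 2, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            length, cp = 3, lead & 0x0F
        elif 0xF0 <= lead <= 0xF4:
            length, cp = 4, lead & 0x07
        else:
            out.append("&#65533;")  # assumption: bad lead byte -> U+FFFD
            i += 1
            continue
        tail = data[i + 1:i + length]
        # Every continuation byte must look like 10xxxxxx.
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            out.append("&#65533;")
            i += 1
            continue
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        out.append("&#%d;" % cp)
        i += length
    return "".join(out)


if __name__ == "__main__":
    print(utf8_to_entities("caf\u00e9 \u20ac".encode("utf-8")))
```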
The next step is easy: the plugin cleans up any pre-existing named and numeric entities and, where appropriate, converts any leftover numeric entities back to their named equivalents. In the end the result is properly encoded using entities, any existing entities are cleaned up, and you have something that can be included in any HTML or XHTML document, regardless of the character set of the output.
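The numeric-to-named conversion can be approximated as shown below. This is a Python sketch rather than the plugin's code; it leans on Python's standard `html.entities.codepoint2name` table (the HTML 4 entity names) as a stand-in for whatever list the plugin uses, and the function name `prefer_named_entities` is invented for the example.

```python
import re
from html.entities import codepoint2name

# Matches decimal (&#233;) and hexadecimal (&#xE9;) numeric entities.
_NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def prefer_named_entities(text: str) -> str:
    """Rewrite numeric entities as named ones where an HTML 4 name exists."""
    def replace(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        name = codepoint2name.get(code_point)
        if name:
            return "&%s;" % name
        # No named equivalent: normalise to a decimal numeric entity.
        return "&#%d;" % code_point
    return _NUMERIC_ENTITY.sub(replace, text)


if __name__ == "__main__":
    print(prefer_named_entities("caf&#233; &#x20AC; &#269;"))
```

Characters without a named equivalent, such as č, simply stay numeric, so the output remains safe to embed whatever the page's own character set is.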
Currently supported:
cp037, cp424, cp437, cp500, cp737, cp775, cp850, cp852, cp855, cp856, cp857, cp860, cp861, cp862, cp863, cp864, cp865, cp866, cp869, cp874, cp875, cp1006, cp1026, gsm0338, iso-8859-1, iso-8859-2, iso-8859-3, iso-8859-4, iso-8859-5, iso-8859-6, iso-8859-7, iso-8859-8, iso-8859-9, iso-8859-10, iso-8859-11, iso-8859-13, iso-8859-14, iso-8859-15, iso-8859-16, koi8-r, koi8-u, mazovia, nextstep, stdenc, symbol, turkish, us-ascii, us-ascii-quotes, windows-1250, windows-1251, windows-1252, windows-1253, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, x-mac-ce, x-mac-cyrillic, x-mac-greek, x-mac-icelandic, x-mac-roman, zdingbat.
Coming soon:
big5, gb12345, gb1988, gb2312, jis0201, jis0208, jis0212, ksc5601, shiftjis, tis-620.
Hey Rakaz, MSN came out with their own blog. Check out http://spaces.msn.com
This is great, it works fine for me. First I tried version .12 but it didn’t work, then I found out you had released .13, and when I installed it, it worked fine. I didn’t test everything, but the RSS/Atom Reader works fine so far.
I will test it with different feeds and will tell you if I find any problems. Good work.
The feed that you put as a sample on your website is in Arabic and, as you know, Arabic and Persian are written right to left, which means you must set the direction for that feed to something like:
direction: rtl;
and after that everything will be correct.
Thank you very much for this great plug-in. It is really useful for me.
Mehran: I am aware of the right-to-left problem. However, I am not sure yet how to solve it. I’ll take a look.
Try this, this should fix the problem.
Arabic Test Feed
……
.boxPersian{
direction: rtl;
}
If you need a Persian feed, you can use the one from the BBC; here is the link:
http://www.bbc.co.uk/persian/index.rdf
Let me know if you need any help.
Thanks.
I think your comment system parses the HTML code, so I’ll just repeat it here:
h4- Arabic Test Feed -/h4
div class="boxPersian"- …… -/div
Mehran: the actual HTML or CSS code isn’t the problem. The hard part is deciding which character sets are left-to-right and which are right-to-left, and how to handle this properly even when you have an aggregated feed with different character sets, some of which run one way while the others run the other way.
As far as I know there are three languages that use a right-to-left script: Persian (Farsi), Arabic and Hebrew.
I think the best way is to add a new parameter to your skin var. Through that parameter users can choose which character set or encoding they need and, if the text needs to be rendered right to left, they can define that by creating a CLASS or ID for the root tag. Then when you render feeds, each one appears in the proper direction and encoding.
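One possible way around the charset question, sketched here in Python purely as an idea and not as something the plugin does, is to look at the decoded characters of each item instead of at the feed's character set. The sketch counts characters with a strong right-to-left bidirectional class (`R` or `AL` in the Unicode database) against strong left-to-right ones (`L`); the function name `guess_direction` and the simple majority rule are assumptions.

```python
import unicodedata


def guess_direction(text: str) -> str:
    """Return "rtl" if strongly right-to-left characters outnumber LTR ones."""
    rtl = 0
    ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew, Arabic, Persian letters
            rtl += 1
        elif bidi_class == "L":         # Latin, Greek, Cyrillic letters, etc.
            ltr += 1
        # Digits, spaces and punctuation are neutral or weak and are ignored.
    return "rtl" if rtl > ltr else "ltr"


if __name__ == "__main__":
    print(guess_direction("\u0633\u0644\u0627\u0645"))  # an Arabic-script word
    print(guess_direction("Arabic Test Feed"))
```

Because the decision is made per item, an aggregated feed could mix both directions, with each item getting its own `direction` value in the output.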
I will check tonight to see if I can find better suggestion.
I’ve made some progress on the character set conversion algorithm. The next version will definitely support the following additional character sets: Big5, GB2312, Shift-JIS, JIS and EUC-JP.
More progress… I’ve rewritten some of the conversion routines to make them even faster. I am also now using UTF-8 as the internal format passed to the XML parser, which makes a little more sense. Now I just have to clean up the entity routines and add some additional character sets…
I’ve been trying to get the RSSAtom plugin to work, but wherever I put it in my main index skin, the page stops parsing at that point. Everything before <%RSSAtom [etc.] displays fine, but after that, nothing. No error messages, no output, the page just stops drawing.
Any ideas?