Processing XML files in CJK encodings

gs

Python gurus,

I need to parse XML files containing CJK text, such as Chinese in GB2312
and Japanese in UTF-8. I was using xml.dom.minidom first. It works with
Japanese in UTF-8, but doesn't work with GB2312. An article says,

http://mail.python.org/pipermail/xml-sig/2003-December/010034.html

Then I tried xml.parsers.xmlproc. It works fine with GB2312, but now it
doesn't work with Japanese in UTF-8. Another article says,

http://mail.python.org/pipermail/xml-sig/2003-September/009802.html

Is there any way to parse both of them correctly?

Thanks,
-Gen
 
Uche Ogbuji

You say "doesn't work". Can you be more specific?

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
A hands-on introduction to ISO Schematron -
http://www-106.ibm.com/developerworks/edu/x-dw-xschematron-i.html
Schematron abstract patterns -
http://www.ibm.com/developerworks/xml/library/x-stron.html
Wrestling HTML (using Python) -
http://www.xml.com/pub/a/2004/09/08/pyxml.html
Enterprise data goes high fashion -
http://www.adtmag.com/article.asp?id=10061
Principles of XML design: Considering container elements -
http://www-106.ibm.com/developerworks/xml/library/x-contain.html
Hacking XML Hacks - http://www-106.ibm.com/developerworks/xml/library/x-think26.html
A survey of XML standards -
http://www-106.ibm.com/developerworks/xml/library/x-stand4/
 
Andrew Clover

Gen said:
I need to parse XML files in CJK encodings like GB2312 and Ja in UTF-8.

I assume you've already got CJKCodecs (or Python 2.4 where it's
built-in).

The main problem is that the expat parser (on which much Python XML
kit relies) doesn't understand the DBCS encodings. There are two ways
around this: either use an initial recoding step:

from xml.dom import minidom
# bytes holds the raw GB2312-encoded document
xml= unicode(bytes, 'gb2312').encode('utf-8')
doc= minidom.parseString(xml)

(If your input documents have an <?xml ... encoding="gb2312" ?>
declaration this will also have to be changed to encoding="utf-8" or
simply removed.)
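
For anyone reading this with a current Python 3, the recoding step plus the declaration rewrite could look roughly like the sketch below. The helper name parse_recoded and the regular expression are assumptions made for this illustration; they are not part of minidom or any other library, and the pattern only handles a declaration at the very start of the document.

```python
import re
from xml.dom import minidom

# Hypothetical pattern: finds the encoding pseudo-attribute inside a
# leading <?xml ... ?> declaration, keeping whichever quote style is used.
_DECL_ENCODING = re.compile(
    r'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)


def parse_recoded(raw, source_encoding):
    """Decode raw bytes, relabel the declaration as utf-8, then parse.

    raw             -- the document as bytes in its original encoding
    source_encoding -- codec name such as 'gb2312' or 'shift_jis'
    """
    text = raw.decode(source_encoding)
    # Rewrite encoding="gb2312" (or similar) so it matches the new bytes.
    text = _DECL_ENCODING.sub(r'\g<1>\g<2>utf-8\g<2>', text, count=1)
    return minidom.parseString(text.encode('utf-8'))


if __name__ == '__main__':
    sample = '<?xml version="1.0" encoding="gb2312"?><doc>中文</doc>'
    doc = parse_recoded(sample.encode('gb2312'), 'gb2312')
    print(doc.documentElement.firstChild.data)
```

Documents without any declaration pass through unchanged, since the substitution simply finds nothing to replace.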

OR, use a pure-Python XML parser, so it'll have access to CJKCodecs.
That means xmlproc+4DOM (validating) or pxdom (non-validating). This
is, in comparison to the recoding method, rather slow.

[Aside: have just released pxdom 1.2:

http://www.doxdesk.com/software/py/pxdom.html

I've processed a bunch of Shift-JIS material with this before without
problem.]

Gen said:
Then I tried xml.parsers.xmlproc. It works fine with GB2312, but now it
doesn't work with Ja in UTF-8.

Ohh. That's a bad one. Actually I'm surprised if it works with GB.

Here's a quick fix; I can't guarantee it's correct as I haven't really
played with xmlproc much but it fixes the error for me when parsing
strings. Oh, checking this out at the SourceForge tracker it looks
like the original reporter came up with the same idea, so it might be
okay. :)

Near the end of method parse_xml_decl (in PyXML 0.8.3 this is at line
723) in _xmlplus.parsers.xmlproc.xmlutils:

try:
    self.data = self.charset_converter(self.data)
    self.datasize= len(self.data) ### ADD THIS LINE
except UnicodeError, e:
    self._handle_decoding_error(self.data, e)
self.input_encoding = enc1
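
To see why refreshing the size might matter, here is a toy model in Python 3. It is not xmlproc code: ToyBuffer and its refresh_size flag are hypothetical names invented for this sketch, and it only assumes that a parser caches the buffer length before converting bytes to characters. Under that assumption, any multi-byte input leaves the cached size larger than the converted data.

```python
class ToyBuffer:
    """Hypothetical stand-in for a parser input buffer (not xmlproc)."""

    def __init__(self, data):
        self.data = data
        self.datasize = len(data)  # cached length, counted in bytes here

    def convert(self, encoding, refresh_size):
        # Decoding turns bytes into characters, so the length can shrink.
        self.data = self.data.decode(encoding)
        if refresh_size:
            # The equivalent of the added line in the patch above.
            self.datasize = len(self.data)


sample = '<a>日本語</a>'.encode('utf-8')

stale = ToyBuffer(sample)
stale.convert('utf-8', refresh_size=False)

fresh = ToyBuffer(sample)
fresh.convert('utf-8', refresh_size=True)

# The stale buffer still reports the byte count, which overshoots the
# number of characters actually available after conversion.
print('stale:', stale.datasize, 'vs', len(stale.data))
print('fresh:', fresh.datasize, 'vs', len(fresh.data))
```

A parser that trusts the stale value could try to read past the end of the converted data, which would fit the symptom, though I have not traced xmlproc itself to confirm that.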
 
