Processing XML files in CJK encodings

gs · Oct 21, 2004

Python gurus,

I need to parse XML files in CJK encodings, such as Chinese in GB2312 and
Japanese in UTF-8. I was using xml.dom.minidom first. It works with Japanese
in UTF-8, but doesn't work with GB2312. An article says,

http://mail.python.org/pipermail/xml-sig/2003-December/010034.html

Then I tried xml.parsers.xmlproc. It works fine with GB2312, but now it
doesn't work with Japanese in UTF-8. Another article says,

http://mail.python.org/pipermail/xml-sig/2003-September/009802.html

Is there any way to parse both of them correctly?

Thanks,
-Gen

Uche Ogbuji · Oct 23, 2004

Gen said:

I need to parse XML files in CJK encodings like GB2312 and Ja in UTF-8.
I was using xml.dom.minidom first. It works with Ja in UTF-8, but doesn't
work with GB2312. An article says,

http://mail.python.org/pipermail/xml-sig/2003-December/010034.html

Then I tried xml.parsers.xmlproc. It works fine with GB2312, but now it
doesn't work with Ja in UTF-8. Another article says,

http://mail.python.org/pipermail/xml-sig/2003-September/009802.html

Is there any way to parse both of them correctly?

You say "doesn't work". Can you be more specific?

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com

Andrew Clover · Oct 24, 2004

Gen said:
I need to parse XML files in CJK encodings like GB2312 and Ja in UTF-8.

I assume you've already got CJKCodecs (or Python 2.4, where it's
built in).
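
For anyone checking their setup, here is a minimal sketch of how to confirm that the codecs are installed. It uses only the standard codecs.lookup call; the helper name has_codec is made up for illustration.

```python
import codecs

def has_codec(name):
    """Return True if Python can find a codec registered under this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        # codecs.lookup raises LookupError for unknown encodings
        return False
    return True

for name in ("gb2312", "shift_jis", "utf-8"):
    print(name, has_codec(name))
```

If one of the CJK names is missing, the recoding approach below cannot work until the codec package is installed.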

The main problem is that the expat parser (on which much Python XML
kit relies) doesn't understand the DBCS encodings. There are two ways
around this: either use an initial recoding step:

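    from xml.dom import minidom
    # 'bytes' is the raw GB2312 document, read from the file as a byte string
    xml = unicode(bytes, 'gb2312').encode('utf-8')
    doc = minidom.parseString(xml)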

(If your input documents have an <?xml ... encoding="gb2312" ?>
declaration this will also have to be changed to encoding="utf-8" or
simply removed.)
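
A rough sketch of the same recoding step in current Python 3 syntax, with the declaration rewrite included. The function name parse_recoded and the regular expression are illustrative only, not part of minidom. The regex assumes the XML declaration, if present, sits at the very start of the document with no byte order mark in front of it.

```python
import re
from xml.dom import minidom

# Matches the encoding pseudo-attribute inside a leading <?xml ... ?> declaration.
_DECL_ENCODING = re.compile(r'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2')

def parse_recoded(raw, source_encoding):
    """Decode raw bytes, point the declaration at UTF-8, and parse with minidom."""
    text = raw.decode(source_encoding)
    # Rewrite encoding="gb2312" (or similar) to encoding="utf-8"; no-op if absent.
    text = _DECL_ENCODING.sub(r'\1\2utf-8\2', text, count=1)
    return minidom.parseString(text.encode('utf-8'))

raw = '<?xml version="1.0" encoding="gb2312"?><doc>\u4e2d\u6587</doc>'.encode('gb2312')
doc = parse_recoded(raw, 'gb2312')
print(doc.documentElement.firstChild.data)
```

The caller has to know the source encoding up front here. Sniffing it from the declaration would be a separate step and is not shown.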

OR, use a pure-Python XML parser, so it'll have access to CJKCodecs.
That means xmlproc+4DOM (validating) or pxdom (non-validating). This
is, in comparison to the recoding method, rather slow.

[Aside: have just released pxdom 1.2:

http://www.doxdesk.com/software/py/pxdom.html

I've processed a bunch of Shift-JIS material with this before without
problem.]

Gen said:
Then I tried xml.parsers.xmlproc. It works fine with GB2312, but now it
doesn't work with Ja in UTF-8.

Ohh. That's a bad one. Actually I'm surprised if it works with GB.

Here's a quick fix; I can't guarantee it's correct as I haven't really
played with xmlproc much but it fixes the error for me when parsing
strings. Oh, checking this out at the SourceForge tracker it looks
like the original reporter came up with the same idea, so it might be
okay.

Near the end of method parse_xml_decl (in PyXML 0.8.3 this is at line
723) in _xmlplus.parsers.xmlproc.xmlutils:

    try:
        self.data = self.charset_converter(self.data)
        self.datasize = len(self.data)  ### ADD THIS LINE
    except UnicodeError, e:
        self._handle_decoding_error(self.data, e)
    self.input_encoding = enc1
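
To show why a line like that could matter, here is a toy sketch in Python 3. It is not xmlproc's actual code: ToyBuffer and its refresh_size flag are hypothetical stand-ins. The assumption being illustrated is that the parser caches the buffer length before converting bytes to characters, so the cached size goes stale whenever the text contains multi-byte characters.

```python
class ToyBuffer:
    """Hypothetical stand-in for a parser buffer; not xmlproc's real class."""

    def __init__(self, raw):
        self.data = raw
        self.datasize = len(raw)  # length in bytes at this point

    def decode(self, encoding, refresh_size=True):
        # After decoding, data holds characters, so its length can shrink.
        self.data = self.data.decode(encoding)
        if refresh_size:
            self.datasize = len(self.data)  # the analogue of the added line

raw = "<a>\u65e5\u672c\u8a9e</a>".encode("utf-8")

stale = ToyBuffer(raw)
stale.decode("utf-8", refresh_size=False)

fixed = ToyBuffer(raw)
fixed.decode("utf-8")

print(stale.datasize, len(stale.data))
print(fixed.datasize, len(fixed.data))
```

Without the refresh, the cached size still counts bytes while the data now counts characters, so anything that trusts datasize would read past the end of the buffer.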
