Some questions about decode/encode

John Machin · Jan 27, 2008

SAX uses the expat parser. From the pyexpat module docs:

Expat doesn't support as many encodings as Python does, and its repertoire
of encodings can't be extended; it supports UTF-8, UTF-16, ISO-8859-1
(Latin1), and ASCII. If encoding is given it will override the implicit or
explicit encoding of the document.

--Mark

Thank you for pointing out where that list of encodings had been
cunningly concealed. However the relevance of dropping it in as an
apparent response to my answer to the OP's question about decoding
possibly butchered GBK strings is .... what?

In any case, it seems to support other 8-bit encodings e.g. iso-8859-2
and koi8-r ...

C:\junk>type gbksax.py
import xml.sax, xml.sax.saxutils
import cStringIO

unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
print 'unistr=%r' % unistr
gbkstr = unistr.encode('gbk')
print 'gbkstr=%r' % gbkstr
unistr2 = gbkstr.decode('gbk')
assert unistr2 == unistr

print "latin1 FF -> utf8 = %r" % '\xff'.decode('iso-8859-1').encode('utf8')
print "latin2 FF -> utf8 = %r" % '\xff'.decode('iso-8859-2').encode('utf8')
print "koi8r FF -> utf8 = %r" % '\xff'.decode('koi8-r').encode('utf8')

xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""

asciidoc = xml_template % ('ascii', 'The quick brown fox etc')
utf8doc = xml_template % ('utf-8', unistr.encode('utf8'))
latin1doc = xml_template % ('iso-8859-1', 'nil illegitimati carborundum' + '\xff')
latin2doc = xml_template % ('iso-8859-2', 'duo secundus' + '\xff')
koi8rdoc = xml_template % ('koi8-r', 'Moskva' + '\xff')
gbkdoc = xml_template % ('gbk', gbkstr)

for doc in (asciidoc, utf8doc, latin1doc, latin2doc, koi8rdoc, gbkdoc):
    f = cStringIO.StringIO()
    handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
    xml.sax.parseString(doc, handler)
    result = f.getvalue()
    f.close()
    print repr(result[result.find('<data>'):])

C:\junk>gbksax.py
unistr=u'\u4e00W\u4e01X\u4e02Y\u4e03Z'
gbkstr='\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
latin1 FF -> utf8 = '\xc3\xbf'
latin2 FF -> utf8 = '\xcb\x99'
koi8r FF -> utf8 = '\xd0\xaa'
'<data>The quick brown fox etc</data>'
'<data>\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z</data>'
'<data>nil illegitimati carborundum\xc3\xbf</data>'
'<data>duo secundus\xcb\x99</data>'
'<data>Moskva\xd0\xaa</data>'
Traceback (most recent call last):
  File "C:\junk\gbksax.py", line 27, in <module>
    xml.sax.parseString(doc, handler)
  File "C:\Python25\lib\xml\sax\__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "C:\Python25\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:30: unknown encoding

C:\junk>
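
One possible reading of the results above, offered as an assumption rather than something the thread states: encodings that expat does not know natively may still work when every byte value maps to exactly one character, which holds for iso-8859-2 and koi8-r but not for GBK. The rough sketch below is written for Python 3 (the scripts in this thread are Python 2), and the helper name decodes_one_char_per_byte is hypothetical. It is only a heuristic for comparing the encodings tried above, not a test of what expat accepts.

```python
def decodes_one_char_per_byte(encoding):
    """Rough heuristic: do the 256 byte values decode to 256 characters?"""
    all_bytes = bytes(range(256))
    # 'replace' turns undefined bytes into U+FFFD instead of raising,
    # so an 8-bit codec with holes still yields one character per byte.
    decoded = all_bytes.decode(encoding, 'replace')
    return len(decoded) == 256


if __name__ == '__main__':
    # The encodings tried in the script above.
    for name in ('iso-8859-1', 'iso-8859-2', 'koi8-r', 'gbk'):
        print(name, decodes_one_char_per_byte(name))
```

GBK combines pairs of bytes into single characters, so the decoded length comes out shorter than 256 and the check returns False for it.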

glacier · Jan 27, 2008

Thanks, John.
There's no doubt that you proved SAX doesn't support the GBK encoding.
But can you suggest how to make SAX parse a GBK string?

John Machin · Jan 28, 2008

Yes, the same suggestion as was given to you by others very early in
this thread, the same as I demonstrated in the middle of proving that
SAX doesn't support a GBK-encoded input file.

Suggestion: Recode your input from GBK to UTF-8. Ensure that the XML
declaration doesn't have an unsupported encoding. Your handler will
get data encoded as UTF-8. Recode that to GBK if needed.

Here's a cut-down version of the previous script, focused on
demonstrating that the recoding strategy works.

C:\junk>type gbksax2.py
import xml.sax, xml.sax.saxutils
import cStringIO
unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
gbkstr = unistr.encode('gbk')
print 'This is a GBK-encoded string: %r' % gbkstr
# Recode GBK -> UTF-8 before handing the data to SAX
utf8str = gbkstr.decode('gbk').encode('utf8')
print 'Now recoded as UTF-8 to be fed to a SAX parser: %r' % utf8str
xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
# The XML declaration names utf-8, matching the recoded bytes
utf8doc = xml_template % ('utf-8', utf8str)
f = cStringIO.StringIO()
handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
xml.sax.parseString(utf8doc, handler)
result = f.getvalue()
f.close()
# Pull out the text between <data> and </data>
start = result.find('<data>') + 6
end = result.find('</data>')
mydata = result[start:end]
print "SAX output (UTF-8): %r" % mydata
print "SAX output recoded to GBK: %r" % mydata.decode('utf8').encode('gbk')

C:\junk>gbksax2.py
This is a GBK-encoded string: '\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
Now recoded as UTF-8 to be fed to a SAX parser: '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
SAX output (UTF-8): '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
SAX output recoded to GBK: '\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
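
For readers on Python 3, the same recoding strategy might look roughly like the sketch below. It is not the script from this thread: the names TextCollector and recode_gbk_xml_to_utf8 are hypothetical, and the regular expression that rewrites the encoding in the XML declaration is an assumption that covers only simple declarations like the one used here.

```python
import re
import xml.sax

# Matches the encoding pseudo-attribute inside an XML declaration.
DECL_ENCODING = re.compile(r'''(<\?xml[^>]*?encoding\s*=\s*)(["'])[^"']*\2''')


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data it receives."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # In Python 3 the content arrives as str (already decoded).
        self.chunks.append(content)


def recode_gbk_xml_to_utf8(gbk_bytes):
    """Decode a GBK document, declare utf-8 instead, and return UTF-8 bytes."""
    text = gbk_bytes.decode('gbk')
    text = DECL_ENCODING.sub(r'\g<1>\g<2>utf-8\g<2>', text, count=1)
    return text.encode('utf-8')


if __name__ == '__main__':
    unistr = ''.join(chr(0x4E00 + i) + chr(ord('W') + i) for i in range(4))
    template = '<?xml version="1.0" encoding="gbk"?><data>%s</data>'
    gbk_doc = (template % unistr).encode('gbk')

    handler = TextCollector()
    xml.sax.parseString(recode_gbk_xml_to_utf8(gbk_doc), handler)
    text = ''.join(handler.chunks)
    print('SAX output recoded to GBK: %r' % text.encode('gbk'))
```

Rewriting the declaration matters as much as recoding the bytes: if the document still declares gbk, the parser is told to expect an encoding it cannot handle even though the bytes are now UTF-8.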

HTH,
John

glacier · Jan 28, 2008

Thanks a lot, John.

I'll try it.
