John Machin
SAX uses the expat parser. From the pyexpat module docs:
Expat doesn't support as many encodings as Python does, and its repertoire
of encodings can't be extended; it supports UTF-8, UTF-16, ISO-8859-1
(Latin1), and ASCII. If encoding is given it will override the implicit or
explicit encoding of the document.
--Mark
Thank you for pointing out where that list of encodings had been
cunningly concealed. However, the relevance of dropping it in as an
apparent response to my answer to the OP's question about decoding
possibly butchered GBK strings is ... what?
In any case, expat as driven by xml.sax seems to support other 8-bit
encodings as well, e.g. iso-8859-2 and koi8-r, although it rejects gbk:
C:\junk>type gbksax.py
import xml.sax, xml.sax.saxutils
import cStringIO

# Four CJK ideographs interleaved with ASCII letters, round-tripped via GBK
unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
print 'unistr=%r' % unistr
gbkstr = unistr.encode('gbk')
print 'gbkstr=%r' % gbkstr
unistr2 = gbkstr.decode('gbk')
assert unistr2 == unistr
# Byte 0xFF decodes to a different character in each 8-bit encoding
print "latin1 FF -> utf8 = %r" % '\xff'.decode('iso-8859-1').encode('utf8')
print "latin2 FF -> utf8 = %r" % '\xff'.decode('iso-8859-2').encode('utf8')
print "koi8r FF -> utf8 = %r" % '\xff'.decode('koi8-r').encode('utf8')
# One test document per declared encoding
xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
asciidoc = xml_template % ('ascii', 'The quick brown fox etc')
utf8doc = xml_template % ('utf-8', unistr.encode('utf8'))
latin1doc = xml_template % ('iso-8859-1', 'nil illegitimati carborundum' + '\xff')
latin2doc = xml_template % ('iso-8859-2', 'duo secundus' + '\xff')
koi8rdoc = xml_template % ('koi8-r', 'Moskva' + '\xff')
gbkdoc = xml_template % ('gbk', gbkstr)
# Parse each document and re-serialise it as UTF-8
for doc in (asciidoc, utf8doc, latin1doc, latin2doc, koi8rdoc, gbkdoc):
    f = cStringIO.StringIO()
    handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
    xml.sax.parseString(doc, handler)
    result = f.getvalue()
    f.close()
    print repr(result[result.find('<data>'):])
C:\junk>gbksax.py
unistr=u'\u4e00W\u4e01X\u4e02Y\u4e03Z'
gbkstr='\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
latin1 FF -> utf8 = '\xc3\xbf'
latin2 FF -> utf8 = '\xcb\x99'
koi8r FF -> utf8 = '\xd0\xaa'
'<data>The quick brown fox etc</data>'
'<data>\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z</data>'
'<data>nil illegitimati carborundum\xc3\xbf</data>'
'<data>duo secundus\xcb\x99</data>'
'<data>Moskva\xd0\xaa</data>'
Traceback (most recent call last):
  File "C:\junk\gbksax.py", line 27, in <module>
    xml.sax.parseString(doc, handler)
  File "C:\Python25\lib\xml\sax\__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "C:\Python25\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:30: unknown encoding
C:\junk>
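Since the gbk declaration is what gets rejected above, one possible workaround is to let Python do the decoding and hand the parser UTF-8 instead. The following is only a minimal sketch, not something taken from the run above: it is written for Python 3 (unlike the 2.5 session), the helper names transcode_to_utf8 and parse_to_utf8 are made up for illustration, and it assumes the caller already knows the real source encoding and that the XML declaration is ASCII-compatible.

```python
import io
import re
import xml.sax
from xml.sax.saxutils import XMLGenerator

# Matches the encoding pseudo-attribute of an XML declaration.
# Assumption: the declaration itself is plain ASCII.
_DECL_ENCODING = re.compile(
    rb'''(<\?xml[^>]*?encoding=)(["'])[A-Za-z0-9._-]+\2''')


def transcode_to_utf8(doc_bytes, source_encoding):
    """Hypothetical helper: return doc_bytes re-encoded as UTF-8,
    with the declared encoding rewritten to match."""
    text = doc_bytes.decode(source_encoding)   # Python's codec, not expat's
    utf8 = text.encode('utf-8')
    return _DECL_ENCODING.sub(rb'\g<1>\g<2>utf-8\g<2>', utf8, count=1)


def parse_to_utf8(doc_bytes, source_encoding):
    """Hypothetical helper: parse the transcoded document with xml.sax
    and return the re-serialised XML as text."""
    out = io.StringIO()
    handler = XMLGenerator(out, encoding='utf-8')
    xml.sax.parseString(transcode_to_utf8(doc_bytes, source_encoding), handler)
    return out.getvalue()


if __name__ == '__main__':
    # Same GBK bytes as gbkstr in the session above.
    gbkstr = b'\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
    gbkdoc = b'<?xml version="1.0" encoding="gbk"?><data>' + gbkstr + b'</data>'
    result = parse_to_utf8(gbkdoc, 'gbk')
    print(ascii(result[result.find('<data>'):]))
```

The regex rewrite of the declaration is a simplification; if the bytes really are butchered, the decode step will raise UnicodeDecodeError, which at least points at the actual problem rather than at the parser.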