Some questions about decode/encode

John Machin

SAX uses the expat parser. From the pyexpat module docs:

Expat doesn't support as many encodings as Python does, and its repertoire
of encodings can't be extended; it supports UTF-8, UTF-16, ISO-8859-1
(Latin1), and ASCII. If encoding is given it will override the implicit or
explicit encoding of the document.

--Mark

Thank you for pointing out where that list of encodings had been
cunningly concealed. However the relevance of dropping it in as an
apparent response to my answer to the OP's question about decoding
possibly butchered GBK strings is .... what?

In any case, it seems to support other 8-bit encodings e.g. iso-8859-2
and koi8-r ...

C:\junk>type gbksax.py
import xml.sax, xml.sax.saxutils
import cStringIO

unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
print 'unistr=%r' % unistr
gbkstr = unistr.encode('gbk')
print 'gbkstr=%r' % gbkstr
unistr2 = gbkstr.decode('gbk')
assert unistr2 == unistr

print "latin1 FF -> utf8 = %r" % '\xff'.decode('iso-8859-1').encode('utf8')
print "latin2 FF -> utf8 = %r" % '\xff'.decode('iso-8859-2').encode('utf8')
print "koi8r FF -> utf8 = %r" % '\xff'.decode('koi8-r').encode('utf8')

xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""

asciidoc = xml_template % ('ascii', 'The quick brown fox etc')
utf8doc = xml_template % ('utf-8', unistr.encode('utf8'))
latin1doc = xml_template % ('iso-8859-1', 'nil illegitimati carborundum' + '\xff')
latin2doc = xml_template % ('iso-8859-2', 'duo secundus' + '\xff')
koi8rdoc = xml_template % ('koi8-r', 'Moskva' + '\xff')
gbkdoc = xml_template % ('gbk', gbkstr)

for doc in (asciidoc, utf8doc, latin1doc, latin2doc, koi8rdoc, gbkdoc):
    f = cStringIO.StringIO()
    handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
    xml.sax.parseString(doc, handler)
    result = f.getvalue()
    f.close()
    print repr(result[result.find('<data>'):])

C:\junk>gbksax.py
unistr=u'\u4e00W\u4e01X\u4e02Y\u4e03Z'
gbkstr='\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
latin1 FF -> utf8 = '\xc3\xbf'
latin2 FF -> utf8 = '\xcb\x99'
koi8r FF -> utf8 = '\xd0\xaa'
'<data>The quick brown fox etc</data>'
'<data>\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z</data>'
'<data>nil illegitimati carborundum\xc3\xbf</data>'
'<data>duo secundus\xcb\x99</data>'
'<data>Moskva\xd0\xaa</data>'
Traceback (most recent call last):
  File "C:\junk\gbksax.py", line 27, in <module>
    xml.sax.parseString(doc, handler)
  File "C:\Python25\lib\xml\sax\__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "C:\Python25\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:30: unknown encoding

C:\junk>
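
For readers on Python 3, the experiment above can be condensed into a small probe. This is only a sketch under assumptions: `parser_accepts` is a hypothetical helper name, not part of `xml.sax`, and the set of exceptions it catches is a guess at how an unsupported encoding may surface there. Which encodings pass can depend on the Python and expat versions in use, so no expected output is shown.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def parser_accepts(encoding):
    """Hypothetical helper: True if the SAX parser handles a document
    that is declared as, and encoded in, `encoding`."""
    template = '<?xml version="1.0" encoding="%s"?><data>x</data>'
    try:
        doc = (template % encoding).encode(encoding)
        # A bare ContentHandler ignores every event; only success matters.
        xml.sax.parseString(doc, ContentHandler())
    except (xml.sax.SAXException, ValueError, LookupError):
        # Assumption: an unsupported encoding shows up as one of these.
        return False
    return True


for name in ('ascii', 'utf-8', 'iso-8859-1', 'iso-8859-2', 'koi8-r', 'gbk'):
    print(name, parser_accepts(name))
```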
 
glacier

Thanks, John.
There's no doubt you proved that SAX doesn't support the GBK encoding.
But can you suggest how to make SAX parse a GBK string?
 
John Machin

Yes, the same suggestion as was given to you by others very early in
this thread, the same as I demonstrated in the middle of proving that
SAX doesn't support a GBK-encoded input file.

Suggestion: Recode your input from GBK to UTF-8. Ensure that the XML
declaration doesn't have an unsupported encoding. Your handler will
get data encoded as UTF-8. Recode that to GBK if needed.

Here's a cut down version of the previous script, focussed on
demonstrating that the recoding strategy works.

C:\junk>type gbksax2.py
import xml.sax, xml.sax.saxutils
import cStringIO
unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
gbkstr = unistr.encode('gbk')
print 'This is a GBK-encoded string: %r' % gbkstr
utf8str = gbkstr.decode('gbk').encode('utf8')
print 'Now recoded as UTF-8 to be fed to a SAX parser: %r' % utf8str
xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
# Feed the parser the recoded string, declared as UTF-8.
utf8doc = xml_template % ('utf-8', utf8str)
f = cStringIO.StringIO()
handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
xml.sax.parseString(utf8doc, handler)
result = f.getvalue()
f.close()
start = result.find('<data>') + 6
end = result.find('</data>')
mydata = result[start:end]
print "SAX output (UTF-8): %r" % mydata
print "SAX output recoded to GBK: %r" % mydata.decode('utf8').encode('gbk')

C:\junk>gbksax2.py
This is a GBK-encoded string: '\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
Now recoded as UTF-8 to be fed to a SAX parser: '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
SAX output (UTF-8): '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
SAX output recoded to GBK: '\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
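
The same recoding strategy might look roughly like this on Python 3, where the input document arrives as GBK bytes with a `gbk` declaration. This is a sketch, not a definitive implementation: `TextCollector` and `parse_recoded` are hypothetical names, and the regular expression assumes the declaration quotes its encoding with double quotes.

```python
import re
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_recoded(raw_bytes, source_encoding):
    """Decode the document, redeclare it as UTF-8, then parse it."""
    text = raw_bytes.decode(source_encoding)
    # Assumption: the XML declaration uses double quotes around the encoding.
    text = re.sub(r'(<\?xml[^>]*encoding=")[^"]*(")',
                  r'\g<1>utf-8\g<2>', text, count=1)
    handler = TextCollector()
    xml.sax.parseString(text.encode('utf-8'), handler)
    return ''.join(handler.chunks)


# Same test characters as in the scripts above.
unistr = ''.join(chr(0x4E00 + i) + chr(ord('W') + i) for i in range(4))
gbkdoc = ('<?xml version="1.0" encoding="gbk"?><data>%s</data>'
          % unistr).encode('gbk')

parsed = parse_recoded(gbkdoc, 'gbk')
print(repr(parsed))                # text as delivered to the handler
print(repr(parsed.encode('gbk')))  # recoded to GBK if needed
```

On Python 3 the handler receives `str` objects rather than UTF-8 byte strings, so the final recoding step is a single `encode('gbk')`.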

HTH,
John
 
glacier

Thanks a lot, John :)
I'll try it.
 
