Steven Bethard
I'm having trouble using elementtree with an XML file that has some
gbk-encoded text. (I can't read Chinese, so I'm taking their word for
it that it's gbk-encoded.) I always have trouble with encodings, so I'm
sure I'm just screwing something simple up. Can anyone help me?

Here's the interactive session. Sorry it's a little verbose, but I
figured it would be better to include too much than not enough. I
basically expected et.ElementTree(file=...) to fail since no encoding
was specified, but I don't know what I'm doing wrong when I use
codecs.open(...).

Thanks in advance for the help!

Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 543, in __init__
    self.parse(file)
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 583, in parse
    parser.feed(data)
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 1242, in feed
    self._parser.Parse(data, 0)
ExpatError: not well-formed (invalid token): line 8, column 6

Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 543, in __init__
    self.parse(file)
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 583, in parse
    parser.feed(data)
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 1242, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 133-135: ordinal not in range(128)

'<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n
<DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>\n( (IP-HLN
(LCP-TMP (IP (NP-PN-SBJ (NR \xb7\xfc\xc3\xf7\xcf\xbc)) \n\t\t (VP
(VV \xbb\xf1\xb5\xc3) \n\t\t\t (NP-OBJ (NN \xc5\xae\xd7\xd3)
\n\t\t\t\t (NN \xcc\xf8\xcc\xa8) \n\t\t\t\t (NN \xcc\xf8\xcb\xae)
\n\t\t\t\t (NN \xb9\xda\xbe\xfc)))) \n\t\t (LC \xba\xf3)) \n
(PU \xa3\xac) \n (NP-SBJ (NP-PN (NR
\xcb\xd5\xc1\xaa\xb6\xd3)) \n (NP (NN
\xbd\xcc\xc1\xb7))) \n (VP (ADVP (AD \xc8\xc8\xc7\xe9)) \n
(PP-DIR (P \xcf\xf2) \n\t\t (NP (PN \xcb\xfd))) \n
(VP (VV \xd7\xa3\xba\xd8))) \n (PU \xa1\xa3)) )
\n</S>\n<S ID=2567>\n( (FRAG (NR \xd0\xc2\xbb\xaa\xc9\xe7) \n
(NN \xbc\xc7\xd5\xdf) \n (NR \xb3\xcc\xd6\xc1\xc9\xc6) \n
(VV \xc9\xe3) )) \n</S>\n</HEADLINE>\n<TEXT>\n</TEXT>\n</BODY>\n</DOC>\n'
STeVe
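
One thing worth checking in the dump above: line 8 of the file is
<S ID=2566>, and, counting columns from zero, column 6 of that line is
where the unquoted attribute value 2566 starts. That matches the position
reported in the ExpatError, so the first failure may be about
well-formedness rather than about the encoding. The sketch below is one
possible workaround, not a confirmed fix. It assumes Python 3 and the
standard-library xml.etree.ElementTree (not the elementtree package shown
in the traceback), decodes the bytes as GBK by hand, and quotes bare
attribute values with a regular expression before parsing. The names
parse_gbk_sgml and raw are made up for illustration, and raw is a
shortened stand-in for the real file.

```python
import re
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the file contents dumped above:
# GBK-encoded bytes with an unquoted attribute value (ID=2566).
raw = (
    b"<DOC>\n<DOCID>ART242</DOCID>\n<BODY>\n<S ID=2566>\n"
    b"(NR \xd0\xc2\xbb\xaa\xc9\xe7)\n</S>\n</BODY>\n</DOC>\n"
)

def parse_gbk_sgml(data):
    """Decode GBK bytes, quote bare attribute values, and parse the result."""
    # Decode first, so the parser only ever sees a str, never GBK bytes.
    text = data.decode("gbk")
    # Turn <S ID=2566> into <S ID="2566">. This simple pattern only handles
    # the first attribute of a tag, which is all the sample data needs.
    text = re.sub(r'(<\w+\s+\w+)=([^\s"\'>]+)', r'\1="\2"', text)
    return ET.fromstring(text)

doc = parse_gbk_sgml(raw)
s = doc.find("BODY/S")
print(s.get("ID"))  # 2566
print(s.text)       # the decoded text inside <S>
```

If the real file has tags with several unquoted attributes, the pattern
would need to be extended; the point here is only that decoding and
quoting happen before the parser sees the data.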