Python 3.0 automatic decoding of UTF16

Johannes Bauer

Hello group,

I'm having trouble reading a utf-16 encoded file with Python 3.0. This is
my (complete) code:

#!/usr/bin/python3.0

class AddressBook():
    def __init__(self, filename):
        f = open(filename, "r", encoding="utf16")
        while True:
            line = f.readline()
            if line == "": break
            print([line[x] for x in range(len(line))])
        f.close()

a = AddressBook("2008_11_05_Handy_Backup.txt")

This is the file (only 1 kB, if hosting doesn't work please tell me and
I'll see if I can put it someplace else):

http://www.file-upload.net/download-1297291/2008_11_05_Handy_Backup.txt.gz.html

What I get: The file reads fine for the first few lines. Then, in the last
line, I get lots of garbage (looking like uninitialized memory):

['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'e', 'x', 't', ' ', '=', ' ',
'"', 'A', 'D', 'A', 'C', ' ', 'V', 'e', 'r', 'k', 'e', 'h', 'r', 's',
'i', 'n', 'f', 'o', '"', '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀',
'\u3000', '\u3100', '吀', '礀', '瀀', '攀', '\u2000', '㴀', '\u2000',
'一', '甀', '洀', '戀', '攀', '爀', '䴀', '漀', '戀', '椀', '氀', '攀',
'\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀', '\u3000', '\u3100',
'吀', '攀', '砀', '琀', '\u2000', '㴀', '\u2000', '∀', '⬀', '㐀', '㤀',
'\u3100', '㜀', '㤀', '㈀', '㈀', '㐀', '㤀', '㤀', '∀', '\u0d00',
'\u0a00', '\u0d00', '\u0a00', '嬀', '倀', '栀', '漀', '渀', '攀', '倀',
'䈀', '䬀', '\u3000', '\u3000', '㐀', '崀', '\u0d00', '\u0a00']

Where the line

Entry00Text = "ADAC Verkehrsinfo"\r\n

is actually the only thing the line contains; Python makes the rest up.

The actual file is much longer and contains private numbers, so I
truncated them away. When I let python process the original file, it
dies with another error:

Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_11_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75:
illegal encoding

With the place where it dies being exactly the place where it outputs
the weird garbage in the shortened file. I guess it runs over some page
boundary here or something?

Kind regards,
Johannes
 
J Kenneth King

Johannes Bauer said:
Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_11_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75:
illegal encoding

It probably means what it says: that the input file contains characters
it cannot read using the specified encoding.

Are you generating the file from python using a file object with the
same encoding? If not, then you might want to look at your input data
and find a way to deal with the exception.
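For example (a generic sketch, not code from this thread, and the filename is made up): opening with errors="replace" turns undecodable bytes into U+FFFD instead of raising UnicodeDecodeError:

```python
# Generic sketch of tolerant reading (illustrative filename, not the
# OP's file): bytes the codec cannot decode come back as U+FFFD
# instead of raising UnicodeDecodeError.
def read_forgiving(filename):
    with open(filename, "r", encoding="utf-16", errors="replace") as f:
        return f.read()
```

Whether replacement characters are acceptable depends on what the data is for; for an address book you may prefer the loud failure.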
 
Johannes Bauer

J said:
It probably means what it says: that the input file contains characters
it cannot read using the specified encoding.

No, it doesn't. The file is just fine, just as the example.
Are you generating the file from python using a file object with the
same encoding? If not, then you might want to look at your input data
and find a way to deal with the exception.

I did. The file is fine. Could you try out the example?

Regards,
Johannes
 
Richard Brodie

It probably means what it says: that the input file contains characters
it cannot read using the specified encoding.

That was my first thought. However it appears that there is an off by one
error somewhere in the intersection of line ending/codec processing.
Half way through the codec starts byte-flipping characters.
 
Terry Reedy

Johannes said:
Hello group,

I'm having trouble reading a utf-16 encoded file with Python 3.0. This is
my (complete) code:

What OS? This is often critical when you have a problem interacting
with the OS.
#!/usr/bin/python3.0

class AddressBook():
    def __init__(self, filename):
        f = open(filename, "r", encoding="utf16")
        while True:
            line = f.readline()
            if line == "": break
            print([line[x] for x in range(len(line))])
        f.close()

a = AddressBook("2008_11_05_Handy_Backup.txt")

This is the file (only 1 kB, if hosting doesn't work please tell me and
I'll see if I can put it someplace else):

http://www.file-upload.net/download-1297291/2008_11_05_Handy_Backup.txt.gz.html

What I get: The file reads fine for the first few lines. Then, in the last
line, I get lots of garbage (looking like uninitialized memory):

['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'e', 'x', 't', ' ', '=', ' ',
'"', 'A', 'D', 'A', 'C', ' ', 'V', 'e', 'r', 'k', 'e', 'h', 'r', 's',
'i', 'n', 'f', 'o', '"', '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀',
'\u3000', '\u3100', '吀', '礀', '瀀', '攀', '\u2000', '㴀', '\u2000',
'一', '甀', '洀', '戀', '攀', '爀', '䴀', '漀', '戀', '椀', '氀', '攀',
'\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀', '\u3000', '\u3100',
'吀', '攀', '砀', '琀', '\u2000', '㴀', '\u2000', '∀', '⬀', '㐀', '㤀',
'\u3100', '㜀', '㤀', '㈀', '㈀', '㐀', '㤀', '㤀', '∀', '\u0d00',
'\u0a00', '\u0d00', '\u0a00', '嬀', '倀', '栀', '漀', '渀', '攀', '倀',
'䈀', '䬀', '\u3000', '\u3000', '㐀', '崀', '\u0d00', '\u0a00']

Where the line

Entry00Text = "ADAC Verkehrsinfo"\r\n

From \r\n I guess Windows. Correct?

I suspect that '?' after \n (\u0a00) indicates not 'question-mark'
but 'uninterpretable as a utf16 character'. The traceback below
confirms that. It should be an end-of-file marker and should not be
passed to Python. I strongly suspect that whatever wrote the file
screwed up the (OS-specific) end-of-file marker. I have seen this
occasionally on Dos/Windows with ascii byte files, with the same symptom
of reading random garbage past the end of the file. Or perhaps
end-of-file does not work right with utf16.
is actually the only thing the line contains, Python makes the rest up.

No it does not. It echoes what the OS gives it with system calls, which
is random garbage to the end of the disk block.

Try open with explicit 'rt' and 'rb' modes and see what happens. Text
mode should be default, but then \r should be deleted.
The actual file is much longer and contains private numbers, so I
truncated them away. When I let python process the original file, it
dies with another error:

Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_11_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75:
illegal encoding

With the place where it dies being exactly the place where it outputs
the weird garbage in the shortened file. I guess it runs over some page
boundary here or something?

Malformed EOF more likely.

Terry Jan Reedy
 
Johannes Bauer

Terry said:
what OS. This is often critical when you have a problem interacting
with the OS.

It's a 64-bit Linux, currently running:

Linux joeserver 2.6.20-skas3-v9-pre9 #4 SMP PREEMPT Wed Dec 3 18:34:49
CET 2008 x86_64 Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz GenuineIntel GNU/Linux

Kernel, however, 2.6.26.1 yields the same problem.
From \r\n I guess Windows. Correct?

Well, not really. The file was created with gammu, an open-source Linux
tool to extract a phonebook off cell phones. However, gammu seems to
generate those Windows CRLF line endings.
I suspect that '?' after \n (\u0a00) indicates not 'question-mark'
but 'uninterpretable as a utf16 character'. The traceback below
confirms that. It should be an end-of-file marker and should not be
passed to Python. I strongly suspect that whatever wrote the file
screwed up the (OS-specific) end-of-file marker. I have seen this
occasionally on Dos/Windows with ascii byte files, with the same symptom
of reading random garbage past the end of the file. Or perhaps
end-of-file does not work right with utf16.

So UTF-16 has an explicit EOF marker within the text? I cannot find one
in original file, only some kind of starting sequence I suppose
(0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
simple \r\n line ending.
No it does not. It echoes what the OS gives it with system calls, which
is random garbage to the end of the disk block.

Could it not be, as Richard suggested, that there's an off-by-one?
Try open with explicit 'rt' and 'rb' modes and see what happens. Text
mode should be default, but then \r should be deleted.

rt:

[...]
['[', 'P', 'h', 'o', 'n', 'e', 'P', 'B', 'K', '0', '0', '3', ']', '\n']
['L', 'o', 'c', 'a', 't', 'i', 'o', 'n', ' ', '=', ' ', '0', '0', '3', '\n']
['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'y', 'p', 'e', ' ', '=', ' ',
'N', 'a', 'm', 'e', '\n']
Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_11_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75:
illegal encoding

rb works, as it doesn't take an encoding parameter.
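One possible workaround (a sketch, assuming the only defect really is a stray odd byte at the end of the file): read in binary mode and decode by hand.

```python
# Workaround sketch (assumption: the only defect is one spurious
# trailing byte): read the raw bytes, force an even length, then
# decode the whole buffer in one go.
def read_utf16_lines(filename):
    raw = open(filename, "rb").read()
    if len(raw) % 2:        # valid UTF-16 always has an even byte count
        raw = raw[:-1]      # drop the spurious last byte
    return raw.decode("utf-16").splitlines()
```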
Malformed EOF more likely.

Could you please elaborate?

Kind regards,
Johannes
 
Joe Strout

So UTF-16 has an explicit EOF marker within the text?

No, it does not. I don't know what Terry's thinking of there, but
text files do not have any EOF marker. They start at the beginning
(sometimes including a byte-order mark), and go till the end of the
file, period.
I cannot find one in original file, only some kind of starting
sequence I suppose
(0xfeff).

That's your byte-order mark (BOM).
The last characters of the file are 0x00 0x0d 0x00 0x0a,
simple \r\n line ending.

Sounds like a perfectly normal file to me.
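The BOM handling is easy to check interactively; a small sketch using a throwaway string:

```python
# The 'utf-16' codec emits a BOM in native byte order when encoding,
# and consumes a leading BOM when decoding -- it is not part of the text.
data = "abc".encode("utf-16")
print(data[:2] in (b"\xff\xfe", b"\xfe\xff"))  # True -- first two bytes are a BOM
print(data.decode("utf-16"))                   # abc -- BOM stripped on decode
```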

It's hard to imagine, but it looks to me like you've found a bug.

Best,
- Joe
 
info

Hello group,

I'm having trouble reading a utf-16 encoded file with Python 3.0. This is
my (complete) code:

#!/usr/bin/python3.0

class AddressBook():
        def __init__(self, filename):
                f = open(filename, "r", encoding="utf16")
                while True:
                        line = f.readline()
                        if line == "": break
                        print([line[x] for x in range(len(line))])
                f.close()

a = AddressBook("2008_11_05_Handy_Backup.txt")

This is the file (only 1 kB, if hosting doesn't work please tell me and
I'll see if I can put it someplace else):

http://www.file-upload.net/download-1297291/2008_11_05_Handy_Backup.t...

What I get: The file reads fine for the first few lines. Then, in the last
line, I get lots of garbage (looking like uninitialized memory):

['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'e', 'x', 't', ' ', '=', ' ',
'"', 'A', 'D', 'A', 'C', ' ', 'V', 'e', 'r', 'k', 'e', 'h', 'r', 's',
'i', 'n', 'f', 'o', '"', '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀',
'\u3000', '\u3100', '吀', '礀', '瀀', '攀', '\u2000', '㴀', '\u2000',
'一', '甀', '洀', '戀', '攀', '爀', '䴀', '漀', '戀', '椀', '氀', '攀',
'\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀', '\u3000', '\u3100',
'吀', '攀', '砀', '琀', '\u2000', '㴀', '\u2000', '∀', '⬀', '㐀', '㤀',
'\u3100', '㜀', '㤀', '㈀', '㈀', '㐀', '㤀', '㤀', '∀', '\u0d00',
'\u0a00', '\u0d00', '\u0a00', '嬀', '倀', '栀', '漀', '渀', '攀', '倀',
'䈀', '䬀', '\u3000', '\u3000', '㐀', '崀', '\u0d00', '\u0a00']

Where the line

Entry00Text = "ADAC Verkehrsinfo"\r\n

is actually the only thing the line contains; Python makes the rest up.

The actual file is much longer and contains private numbers, so I
truncated them away. When I let python process the original file, it
dies with another error:

Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_11_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75:
illegal encoding

With the place where it dies being exactly the place where it outputs
the weird garbage in the shortened file. I guess it runs over some page
boundary here or something?

Kind regards,
Johannes

--
"My counter-suit against you will then charge deliberate mendacity,
slander of God, the Bible and me, and deliberate blasphemy."
         -- Prophet and visionary Hans Joss aka HJP in de.sci.physik
                         <[email protected]>

Two problems: endianness and a trailing zero byte.
This works for me:

class AddressBook():
    def __init__(self, filename):
        f = open(filename, "r", encoding="utf_16_be", newline="\r\n")
        while True:
            line = f.readline()
            if len(line) == 0:
                break
            print(line.replace("\r\n", ""))
        f.close()


a = AddressBook("2008_11_05_Handy_Backup2.txt")

Please note the filename: I modified your file by dropping the
trailing zero byte.
 
MRAB

Joe said:
No, it does not. I don't know what Terry's thinking of there, but text
files do not have any EOF marker. They start at the beginning
(sometimes including a byte-order mark), and go till the end of the
file, period.
Text files _do_ sometimes have an EOF marker, such as character 0x1A. It
can occur in text files in Windows.
 
John Machin

So UTF-16 has an explicit EOF marker within the text? I cannot find one
in original file, only some kind of starting sequence I suppose
(0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
simple \r\n line ending.

Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
long, an ODD number, which shouldn't happen with utf16. The file is
stuffed. Python 3.0 has a bug; it should give a meaningful error
message.

Python 2.6.0 silently ignores the problem [that's a BUG] when read by
a similar method:

| >>> import codecs
| >>> lines = codecs.open('x.txt', 'r', 'utf16').readlines()
| >>> lines[-1]
| u'[PhonePBK004]\r\n'

Python 2.x does however give a meaningful precise error message if you
try a decode on the file contents:

| >>> s = open('x.txt', 'rb').read()
| >>> len(s)
| 1559
| >>> s[-35:]
| '\x00\r\x00\n\x00[\x00P\x00h\x00o\x00n\x00e\x00P\x00B\x00K\x000\x000\x004\x00]\x00\r\x00\n\x00'
| >>> u = s.decode('utf16')
| Traceback (most recent call last):
|   File "<stdin>", line 1, in <module>
|   File "C:\python26\lib\encodings\utf_16.py", line 16, in decode
|     return codecs.utf_16_decode(input, errors, True)
| UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position
1558: truncated data

HTH,
John
 
Steven D'Aprano

No, it does not. I don't know what Terry's thinking of there, but text
files do not have any EOF marker. They start at the beginning
(sometimes including a byte-order mark), and go till the end of the
file, period.

Windows text files still interpret ctrl-Z as EOF, or at least Windows XP
does. Vista, who knows?
 
John Machin

Windows text files still interpret ctrl-Z as EOF, or at least Windows XP
does. Vista, who knows?

This applies only to files being read in an 8-bit text mode. It is
inherited from MS-DOS, which followed the CP/M convention, which was
necessary because CP/M's file system recorded only the physical file
length in 128-byte sectors, not the logical length. It is likely to
continue in perpetuity, just as standard railway gauge is (allegedly)
based on the axle-length of Roman chariots.

None of this is relevant to the OP's problem; his file appears to have
been truncated rather than having spurious data appended to it.
 
MRAB

John said:
This applies only to files being read in an 8-bit text mode. It is
inherited from MS-DOS, which followed the CP/M convention, which was
necessary because CP/M's file system recorded only the physical file
length in 128-byte sectors, not the logical length. It is likely to
continue in perpetuity, just as standard railway gauge is (allegedly)
based on the axle-length of Roman chariots.
The chariots in question were drawn by 2 horses, so the gauge is based
on the width of a horse. :)
 
Johannes Bauer

2 problems: endianness and trailing zer byte.
This works for me:

This is very strange - when using "utf16", endianness should be detected
automatically. When I simply truncate the trailing zero byte, I receive:

Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_11_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0:
truncated data

But I suppose something *is* indeed weird, because the file I uploaded,
which did not yield the "truncated data" error, is 1559 bytes, which
just cannot be.

Regards,
Johannes
 
Johannes Bauer

John said:
Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
long, an ODD number, which shouldn't happen with utf16. The file is
stuffed. Python 3.0 has a bug; it should give a meaningful error
message.

Yes, you are right. I fixed the file, yet another error pops up
(http://www.file-upload.net/download-1299688/2008_12_05_Handy_Backup.txt.html):

Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_12_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0:
truncated data

File size is 1630 bytes - so this clearly cannot be.

Regards,
Johannes
 
MRAB

Johannes said:
This is very strange - when using "utf16", endianness should be detected
automatically. When I simply truncate the trailing zero byte, I receive:

Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_11_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0:
truncated data

But I suppose something *is* indeed weird, because the file I uploaded,
which did not yield the "truncated data" error, is 1559 bytes, which
just cannot be.
It might be that the EOF marker (b'\x1A' or u'\u001A') was written or is
being read as a single byte instead of 2 bytes for UTF-16 text.
 
Mark Tolonen

Johannes Bauer said:
Yes, you are right. I fixed the file, yet another error pops up
(http://www.file-upload.net/download-1299688/2008_12_05_Handy_Backup.txt.html):

Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_12_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0:
truncated data

File size is 1630 bytes - so this clearly cannot be.

How about posting your code? The first file is incorrect. It contains an
extra 0x00 byte at the end of the file, but is otherwise correctly encoded
with a big-endian UTF16 BOM and data. The second file is a correct UTF16-BE
file as well.

This code (Python 2.6) decodes the first file, removing the trailing extra
byte:

raw = open('2008_11_05_Handy_Backup.txt', 'rb').read()
data = raw[:-1].decode('utf16')

and this code (Python 2.6) decodes the second:

raw = open('2008_12_05_Handy_Backup.txt', 'rb').read()
data = raw.decode('utf16')

Python 3.0 also has no problems with decoding, or with giving an accurate
error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\dev\python30\lib\io.py", line 1724, in read
    decoder.decode(self.buffer.read(), final=True))
  File "C:\dev\python30\lib\io.py", line 1295, in decode
    output = self.decoder.decode(input, final=final)
  File "C:\dev\python30\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "c:\dev\python30\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 1558:
truncated data

-Mark
 
John Machin

How about posting your code?

He did. Ugly stuff using readline() :) Should still work, though.
There are definite problems with readline() and readlines(),
including:

First file: silently ignores error *and* the last line returned is
garbage [consists of multiple actual lines, and the trailing
codepoints have been byte-swapped]

Second file: as he has just reported. I've reproduced it with
f = open('second_file.txt', encoding='utf16')
followed by each of:
(1) f.readlines()
(2) list(f)
(3) for line in f: print(repr(line))
With the last one, the error happens after printing the last actual
line in his file.
 
David Bolen

Johannes Bauer said:
This is very strange - when using "utf16", endianness should be detected
automatically. When I simply truncate the trailing zero byte, I receive:

Any chance that whatever you used to "simply truncate the trailing
zero byte" also removed the BOM at the start of the file? Without it,
utf16 wouldn't be able to detect endianness and would, I believe, fall
back to native order.

-- David
 
John Machin

Any chance that whatever you used to "simply truncate the trailing
zero byte" also removed the BOM at the start of the file?  Without it,
utf16 wouldn't be able to detect endianness and would, I believe, fall
back to native order.

When I read this, I thought "O no, surely not!". Seems that you are
correct:
[Python 2.5.2, Windows XP]
| >>> nobom = u'abcde'.encode('utf_16_be')
| >>> nobom
| '\x00a\x00b\x00c\x00d\x00e'
| >>> nobom.decode('utf16')
| u'\u6100\u6200\u6300\u6400\u6500'

This may well explain one of the Python 3.0 problems that the OP's 2
files exhibit: data appears to have been byte-swapped under some
conditions. Possibility: it is reading the file a chunk at a time and
applying the utf_16 encoding independently to each chunk -- only the
first chunk will have a BOM.
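That hypothesis is easy to probe: a stateful incremental decoder, which is what a correct chunked reader should use, sees the BOM once and can buffer half a code unit across chunk boundaries. A sketch:

```python
import codecs

# Decoding UTF-16 a chunk at a time: an incremental decoder keeps state
# (BOM already seen, half a code unit buffered) across chunks, so even
# deliberately misaligned chunk sizes decode correctly.
data = "hello world".encode("utf-16")           # BOM + native-order code units
dec = codecs.getincrementaldecoder("utf-16")()
out = ""
for i in range(0, len(data), 3):                # 3-byte chunks split code units
    out += dec.decode(data[i:i + 3])
out += dec.decode(b"", final=True)
print(out)                                      # hello world
```

By contrast, calling bytes.decode('utf16') independently on each chunk would only see a BOM in the first chunk, which is exactly the byte-swapping failure mode described above.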
 
