Python 3.0 automatic decoding of UTF16

Discussion in 'Python' started by Johannes Bauer, Dec 5, 2008.

  1. Hello group,

    I'm having trouble reading a utf-16 encoded file with Python3.0. This is
    my (complete) code:

    #!/usr/bin/python3.0

    class AddressBook():
        def __init__(self, filename):
            f = open(filename, "r", encoding="utf16")
            while True:
                line = f.readline()
                if line == "": break
                print([line[x] for x in range(len(line))])
            f.close()

    a = AddressBook("2008_11_05_Handy_Backup.txt")

    This is the file (only 1 kB, if hosting doesn't work please tell me and
    I'll see if I can put it someplace else):

    http://www.file-upload.net/download-1297291/2008_11_05_Handy_Backup.txt.gz.html

    What I get: The file reads fine for the first few lines. Then, in the last
    line, I get lots of garbage (looking like uninitialized memory):

    ['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'e', 'x', 't', ' ', '=', ' ',
    '"', 'A', 'D', 'A', 'C', ' ', 'V', 'e', 'r', 'k', 'e', 'h', 'r', 's',
    'i', 'n', 'f', 'o', '"', '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀',
    '\u3000', '\u3100', '吀', '礀', '瀀', '攀', '\u2000', '㴀', '\u2000',
    '一', '甀', '洀', '戀', '攀', '爀', '䴀', '漀', '戀', '椀', '氀', '攀',
    '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀', '\u3000', '\u3100',
    '吀', '攀', '砀', '琀', '\u2000', '㴀', '\u2000', '∀', '⬀', '㐀', '㤀',
    '\u3100', '㜀', '㤀', '㈀', '㈀', '㐀', '㤀', '㤀', '∀', '\u0d00',
    '\u0a00', '\u0d00', '\u0a00', '嬀', '倀', '栀', '漀', '渀', '攀', '倀',
    '䈀', '䬀', '\u3000', '\u3000', '㐀', '崀', '\u0d00', '\u0a00']

    Where the line

    Entry00Text = "ADAC Verkehrsinfo"\r\n

    is actually the only thing the line contains, Python makes the rest up.

    The actual file is much longer and contains private numbers, so I
    truncated them away. When I let python process the original file, it
    dies with another error:

    Traceback (most recent call last):
      File "./modify.py", line 12, in <module>
        a = AddressBook("2008_11_05_Handy_Backup.txt")
      File "./modify.py", line 7, in __init__
        line = f.readline()
      File "/usr/local/lib/python3.0/io.py", line 1807, in readline
        while self._read_chunk():
      File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
        self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
      File "/usr/local/lib/python3.0/io.py", line 1293, in decode
        output = self.decoder.decode(input, final=final)
      File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
      File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
        return self.decoder(input, self.errors, final)
    UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75: illegal encoding

    The place where it dies is exactly the place where it outputs the weird
    garbage in the shortened file. I guess it runs over some page boundary
    here or something?
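
    As a side note on the boundary guess: an incremental decoder is expected
    to buffer a code unit that is split across two chunks, so a chunk boundary
    alone should not produce garbage. The following is only a sketch on a
    current Python 3, not the Python 3.0 io.py code path from the traceback;
    the sample text and the 3-byte chunk size are assumptions chosen to force
    splits in the middle of 2-byte units.

```python
import codecs

# Sample text taken from the post; encode() with "utf-16" prepends a BOM.
raw = 'Entry00Text\r\n'.encode('utf-16')

# Feed the bytes in 3-byte chunks so that code units straddle chunk borders.
decoder = codecs.getincrementaldecoder('utf-16')()
parts = [decoder.decode(raw[i:i + 3]) for i in range(0, len(raw), 3)]
# final=True flushes the decoder; it would raise here if a byte were left over.
parts.append(decoder.decode(b'', final=True))

decoded = ''.join(parts)
print(repr(decoded))
```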

    Kind regards,
    Johannes

    --
    "My counterclaim against you will then be for deliberate mendacity,
    slander of God, the Bible and me, and deliberate blasphemy."
    -- Prophet and visionary Hans Joss aka HJP in de.sci.physik
    <48d8bf1d$0$7510$>
    Johannes Bauer, Dec 5, 2008
    #1

  2. Johannes Bauer <> writes:

    > Traceback (most recent call last):
    > File "./modify.py", line 12, in <module>
    > a = AddressBook("2008_11_05_Handy_Backup.txt")
    > File "./modify.py", line 7, in __init__
    > line = f.readline()
    > File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    > while self._read_chunk():
    > File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    > self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
    > File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    > output = self.decoder.decode(input, final=final)
    > File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    > (result, consumed) = self._buffer_decode(data, self.errors, final)
    > File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in
    > _buffer_decode
    > return self.decoder(input, self.errors, final)
    > UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75:
    > illegal encoding


    It probably means what it says: that the input file contains characters
    it cannot read using the specified encoding.

    Are you generating the file from python using a file object with the
    same encoding? If not, then you might want to look at your input data
    and find a way to deal with the exception.
    J Kenneth King, Dec 5, 2008
    #2

  3. J Kenneth King wrote:

    > It probably means what it says: that the input file contains characters
    > it cannot read using the specified encoding.


    No, it doesn't. The file is just fine, just as the example.

    > Are you generating the file from python using a file object with the
    > same encoding? If not, then you might want to look at your input data
    > and find a way to deal with the exception.


    I did. The file is fine. Could you try out the example?

    Regards,
    Johannes

    Johannes Bauer, Dec 5, 2008
    #3
  4. "J Kenneth King" <> wrote in message
    news:...

    > It probably means what it says: that the input file contains characters
    > it cannot read using the specified encoding.


    That was my first thought. However it appears that there is an off by one
    error somewhere in the intersection of line ending/codec processing.
    Half way through the codec starts byte-flipping characters.
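
    To illustrate what "byte-flipping" looks like, here is a small sketch
    (an illustration on a current Python 3, not a reproduction of the 3.0
    bug). It assumes big-endian data, as in the posted file, and shows that
    decoding with the wrong byte order, or starting one byte late, turns
    ASCII text into characters of the form \uXX00, much like the garbage in
    the original post.

```python
# Sample line from the original post, encoded big-endian without a BOM.
text = 'Entry00Text = "ADAC Verkehrsinfo"\r\n'
raw = text.encode('utf-16-be')

# Decoding with the wrong byte order swaps the two bytes of every code unit.
flipped = raw.decode('utf-16-le')

# Starting one byte late (and dropping the last byte to keep the length even)
# pairs each low byte with the following zero high byte: same visible effect.
shifted = raw[1:-1].decode('utf-16-be')

print(ascii(flipped[:5]))
print(ascii(flipped[-2:]))
```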
    Richard Brodie, Dec 5, 2008
    #4
  5. Terry Reedy

    Johannes Bauer wrote:
    > Hello group,
    >
    > I'm having trouble reading a utf-16 encoded file with Python3.0. This is
    > my (complete) code:


    What OS? This is often critical when you have a problem interacting
    with the OS.

    > #!/usr/bin/python3.0
    >
    > class AddressBook():
    >     def __init__(self, filename):
    >         f = open(filename, "r", encoding="utf16")
    >         while True:
    >             line = f.readline()
    >             if line == "": break
    >             print([line[x] for x in range(len(line))])
    >         f.close()
    >
    > a = AddressBook("2008_11_05_Handy_Backup.txt")
    >
    > This is the file (only 1 kB, if hosting doesn't work please tell me and
    > I'll see if I can put it someplace else):
    >
    > http://www.file-upload.net/download-1297291/2008_11_05_Handy_Backup.txt.gz.html
    >
    > What I get: The file reads fine for the first few lines. Then, in the last
    > line, I get lots of garbage (looking like uninitialized memory):
    >
    > ['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'e', 'x', 't', ' ', '=', ' ',
    > '"', 'A', 'D', 'A', 'C', ' ', 'V', 'e', 'r', 'k', 'e', 'h', 'r', 's',
    > 'i', 'n', 'f', 'o', '"', '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀',
    > '\u3000', '\u3100', '吀', '礀', '瀀', '攀', '\u2000', '㴀', '\u2000',
    > '一', '甀', '洀', '戀', '攀', '爀', '䴀', '漀', '戀', '椀', '氀', '攀',
    > '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀', '\u3000', '\u3100',
    > '吀', '攀', '砀', '琀', '\u2000', '㴀', '\u2000', '∀', '⬀', '㐀', '㤀',
    > '\u3100', '㜀', '㤀', '㈀', '㈀', '㐀', '㤀', '㤀', '∀', '\u0d00',
    > '\u0a00', '\u0d00', '\u0a00', '嬀', '倀', '栀', '漀', '渀', '攀', '倀',
    > '䈀', '䬀', '\u3000', '\u3000', '㐀', '崀', '\u0d00', '\u0a00']
    >
    > Where the line
    >
    > Entry00Text = "ADAC Verkehrsinfo"\r\n


    From \r\n I guess Windows. Correct?

    I suspect that the '?' after \n (\u0a00) indicates not 'question mark'
    but 'uninterpretable as a utf16 character'. The traceback below
    confirms that. It should be an end-of-file marker and should not be
    passed to Python. I strongly suspect that whatever wrote the file
    screwed up the (OS-specific) end-of-file marker. I have seen this
    occasionally on DOS/Windows with ASCII byte files, with the same symptom
    of reading random garbage past the end of the file. Or perhaps
    end-of-file does not work right with utf16.

    > is actually the only thing the line contains, Python makes the rest up.


    No, it does not. It echoes what the OS gives it with system calls, which
    is random garbage to the end of the disk block.

    Try open with explicit 'rt' and 'rb' modes and see what happens. Text
    mode should be default, but then \r should be deleted.
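
    For reference, a minimal sketch of how text mode treats \r\n on a current
    Python 3 (the in-memory sample data is an assumption, used so that no
    real file is needed): with the default newline handling, \r\n is
    translated to \n, while newline="" leaves the line endings untouched.

```python
import io

# Two CRLF-terminated lines, encoded as UTF-16 with a BOM.
raw = 'a\r\nb\r\n'.encode('utf-16')

# Default text-mode behaviour (newline=None): \r\n becomes \n.
translated = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-16').readlines()

# newline='' still splits on line endings but does not translate them.
preserved = io.TextIOWrapper(
    io.BytesIO(raw), encoding='utf-16', newline='').readlines()

print(translated)
print(preserved)
```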

    > The actual file is much longer and contains private numbers, so I
    > truncated them away. When I let python process the original file, it
    > dies with another error:
    >
    > Traceback (most recent call last):
    > File "./modify.py", line 12, in <module>
    > a = AddressBook("2008_11_05_Handy_Backup.txt")
    > File "./modify.py", line 7, in __init__
    > line = f.readline()
    > File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    > while self._read_chunk():
    > File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    > self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
    > File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    > output = self.decoder.decode(input, final=final)
    > File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    > (result, consumed) = self._buffer_decode(data, self.errors, final)
    > File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in
    > _buffer_decode
    > return self.decoder(input, self.errors, final)
    > UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75:
    > illegal encoding
    >
    > With the place where it dies being exactly the place where it outputs
    > the weird garbage in the shortened file. I guess it runs over some page
    > boundary here or something?


    Malformed EOF more likely.

    Terry Jan Reedy
    Terry Reedy, Dec 5, 2008
    #5
  6. Terry Reedy wrote:
    > Johannes Bauer wrote:
    >> Hello group,
    >>
    >> I'm having trouble reading a utf-16 encoded file with Python3.0. This is
    >> my (complete) code:

    >
    > What OS? This is often critical when you have a problem interacting
    > with the OS.


    It's a 64-bit Linux, currently running:

    Linux joeserver 2.6.20-skas3-v9-pre9 #4 SMP PREEMPT Wed Dec 3 18:34:49
    CET 2008 x86_64 Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz GenuineIntel GNU/Linux

    Kernel, however, 2.6.26.1 yields the same problem.

    >> Entry00Text = "ADAC Verkehrsinfo"\r\n

    >
    > From \r\n I guess Windows. Correct?


    Well, not really. The file was created with gammu, a Linux open-source
    tool to extract a phonebook off cell phones. However, gammu seems to
    generate those Windows CRLF line endings.

    > I suspect that the '?' after \n (\u0a00) indicates not 'question mark'
    > but 'uninterpretable as a utf16 character'. The traceback below
    > confirms that. It should be an end-of-file marker and should not be
    > passed to Python. I strongly suspect that whatever wrote the file
    > screwed up the (OS-specific) end-of-file marker. I have seen this
    > occasionally on DOS/Windows with ASCII byte files, with the same symptom
    > of reading random garbage past the end of the file. Or perhaps
    > end-of-file does not work right with utf16.


    So UTF-16 has an explicit EOF marker within the text? I cannot find one
    in the original file, only some kind of starting sequence, I suppose
    (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a, a
    simple \r\n line ending.

    >> is actually the only thing the line contains, Python makes the rest up.

    >
    > No, it does not. It echoes what the OS gives it with system calls, which
    > is random garbage to the end of the disk block.


    Could it not be, as Richard suggested, that there's an off-by-one?

    > Try open with explicit 'rt' and 'rb' modes and see what happens. Text
    > mode should be default, but then \r should be deleted.


    rt:

    [...]
    ['[', 'P', 'h', 'o', 'n', 'e', 'P', 'B', 'K', '0', '0', '3', ']', '\n']
    ['L', 'o', 'c', 'a', 't', 'i', 'o', 'n', ' ', '=', ' ', '0', '0', '3', '\n']
    ['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'y', 'p', 'e', ' ', '=', ' ',
    'N', 'a', 'm', 'e', '\n']
    Traceback (most recent call last):
      File "./modify.py", line 12, in <module>
        a = AddressBook("2008_11_05_Handy_Backup.txt")
      File "./modify.py", line 7, in __init__
        line = f.readline()
      File "/usr/local/lib/python3.0/io.py", line 1807, in readline
        while self._read_chunk():
      File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
        self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
      File "/usr/local/lib/python3.0/io.py", line 1293, in decode
        output = self.decoder.decode(input, final=final)
      File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
      File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
        return self.decoder(input, self.errors, final)
    UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75: illegal encoding

    rb works, as it doesn't take an encoding parameter.

    > Malformed EOF more likely.


    Could you please elaborate?

    Kind regards,
    Johannes

    Johannes Bauer, Dec 5, 2008
    #6
  7. Joe Strout

    On Dec 5, 2008, at 11:36 AM, Johannes Bauer wrote:

    >> I suspect that the '?' after \n (\u0a00) indicates not 'question mark'
    >> but 'uninterpretable as a utf16 character'. The traceback below
    >> confirms that. It should be an end-of-file marker and should not be
    >> passed to Python. I strongly suspect that whatever wrote the file
    >> screwed up the (OS-specific) end-of-file marker. I have seen this
    >> occasionally on DOS/Windows with ASCII byte files, with the same
    >> symptom of reading random garbage past the end of the file. Or perhaps
    >> end-of-file does not work right with utf16.

    >
    > So UTF-16 has an explicit EOF marker within the text?


    No, it does not. I don't know what Terry's thinking of there, but
    text files do not have any EOF marker. They start at the beginning
    (sometimes including a byte-order mark), and go till the end of the
    file, period.

    > I cannot find one in original file, only some kind of starting
    > sequence I suppose
    > (0xfeff).


    That's your byte-order mark (BOM).
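
    A small sketch of how the BOM interacts with the codec names on a current
    Python 3 (the sample payload is an assumption): the generic "utf-16"
    codec reads the BOM to pick the byte order and then drops it, whereas
    "utf-16-be" assumes the byte order and keeps the BOM as a U+FEFF
    character.

```python
import codecs

payload = 'Entry'.encode('utf-16-be')      # big-endian, no BOM
with_bom = codecs.BOM_UTF16_BE + payload   # BOM bytes are FE FF

# "utf-16" detects the byte order from the BOM and strips it.
auto = with_bom.decode('utf-16')

# "utf-16-be" does no detection, so the BOM survives as '\ufeff'.
fixed = with_bom.decode('utf-16-be')

print(repr(auto))
print(ascii(fixed))
```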

    > The last characters of the file are 0x00 0x0d 0x00 0x0a,
    > simple \r\n line ending.


    Sounds like a perfectly normal file to me.

    It's hard to imagine, but it looks to me like you've found a bug.

    Best,
    - Joe
    Joe Strout, Dec 5, 2008
    #7
  8. Guest

    On Dec 5, 3:25 pm, Johannes Bauer <> wrote:
    > Hello group,
    >
    > I'm having trouble reading a utf-16 encoded file with Python3.0. This is
    > my (complete) code:
    >
    > #!/usr/bin/python3.0
    >
    > class AddressBook():
    >         def __init__(self, filename):
    >                 f = open(filename, "r", encoding="utf16")
    >                 while True:
    >                         line = f.readline()
    >                         if line == "": break
    >                         print([line[x] for x in range(len(line))])
    >                 f.close()
    >
    > a = AddressBook("2008_11_05_Handy_Backup.txt")
    >
    > This is the file (only 1 kB, if hosting doesn't work please tell me and
    > I'll see if I can put it someplace else):
    >
    > http://www.file-upload.net/download-1297291/2008_11_05_Handy_Backup.t...
    >
    > What I get: The file reads fine for the first few lines. Then, in the last
    > line, I get lots of garbage (looking like uninitialized memory):
    >
    > ['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'e', 'x', 't', ' ', '=', ' ',
    > '"', 'A', 'D', 'A', 'C', ' ', 'V', 'e', 'r', 'k', 'e', 'h', 'r', 's',
    > 'i', 'n', 'f', 'o', '"', '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀',
    > '\u3000', '\u3100', '吀', '礀', '瀀', '攀', '\u2000', '㴀', '\u2000',
    > '一', '甀', '洀', '戀', '攀', '爀', '䴀', '漀', '戀', '椀', '氀', '攀',
    > '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀', '\u3000', '\u3100',
    > '吀', '攀', '砀', '琀', '\u2000', '㴀', '\u2000', '∀', '⬀', '㐀', '㤀',
    > '\u3100', '㜀', '㤀', '㈀', '㈀', '㐀', '㤀', '㤀', '∀', '\u0d00',
    > '\u0a00', '\u0d00', '\u0a00', '嬀', '倀', '栀', '漀', '渀', '攀', '倀',
    > '䈀', '䬀', '\u3000', '\u3000', '㐀', '崀', '\u0d00', '\u0a00']
    >
    > Where the line
    >
    > Entry00Text = "ADAC Verkehrsinfo"\r\n
    >
    > is actually the only thing the line contains, Python makes the rest up.
    >
    > The actual file is much longer and contains private numbers, so I
    > truncated them away. When I let python process the original file, it
    > dies with another error:
    >
    > Traceback (most recent call last):
    >   File "./modify.py", line 12, in <module>
    >     a = AddressBook("2008_11_05_Handy_Backup.txt")
    >   File "./modify.py", line 7, in __init__
    >     line = f.readline()
    >   File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    >     while self._read_chunk():
    >   File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    >     self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
    >   File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    >     output = self.decoder.decode(input, final=final)
    >   File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    >     (result, consumed) = self._buffer_decode(data, self.errors, final)
    >   File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in
    > _buffer_decode
    >     return self.decoder(input, self.errors, final)
    > UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75:
    > illegal encoding
    >
    > With the place where it dies being exactly the place where it outputs
    > the weird garbage in the shortened file. I guess it runs over some page
    > boundary here or something?
    >
    > Kind regards,
    > Johannes


    2 problems: endianness and trailing zero byte.
    This works for me:

    class AddressBook():
        def __init__(self, filename):
            f = open(filename, "r", encoding="utf_16_be", newline="\r\n")
            while True:
                line = f.readline()
                if len(line) == 0:
                    break
                print(line.replace("\r\n", ""))
            f.close()


    a = AddressBook("2008_11_05_Handy_Backup2.txt")

    Please note the filename: I modified your file by dropping the
    trailing zero byte.
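
    The same loop can be sketched more compactly on a current Python 3. This
    is an illustration under assumptions, not the code that was run above:
    read_lines is a hypothetical helper, the sample file is written to a
    temporary directory, and "utf-16" (BOM detection) is used instead of
    "utf_16_be", which presumes a well-formed file with an even byte count.

```python
import os
import tempfile


def read_lines(path):
    """Return the lines of a UTF-16 file without their line endings."""
    # newline='' keeps \r\n intact so it can be stripped explicitly.
    with open(path, 'r', encoding='utf-16', newline='') as f:
        return [line.rstrip('\r\n') for line in f]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    with open(path, 'wb') as out:
        # '\ufeff' encoded big-endian yields the BOM bytes FE FF.
        out.write('\ufeff[PhonePBK003]\r\nLocation = 003\r\n'.encode('utf-16-be'))
    lines = read_lines(path)

print(lines)
```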
    , Dec 5, 2008
    #8
  9. MRAB

    Joe Strout wrote:
    > On Dec 5, 2008, at 11:36 AM, Johannes Bauer wrote:
    >
    >>> I suspect that the '?' after \n (\u0a00) indicates not 'question mark'
    >>> but 'uninterpretable as a utf16 character'. The traceback below
    >>> confirms that. It should be an end-of-file marker and should not be
    >>> passed to Python. I strongly suspect that whatever wrote the file
    >>> screwed up the (OS-specific) end-of-file marker. I have seen this
    >>> occasionally on DOS/Windows with ASCII byte files, with the same symptom
    >>> of reading random garbage past the end of the file. Or perhaps
    >>> end-of-file does not work right with utf16.

    >>
    >> So UTF-16 has an explicit EOF marker within the text?

    >
    > No, it does not. I don't know what Terry's thinking of there, but text
    > files do not have any EOF marker. They start at the beginning
    > (sometimes including a byte-order mark), and go till the end of the
    > file, period.
    >

    Text files _do_ sometimes have an EOF marker, such as character 0x1A. It
    can occur in text files in Windows.

    >> I cannot find one in original file, only some kind of starting
    >> sequence I suppose
    >> (0xfeff).

    >
    > That's your byte-order mark (BOM).
    >
    >> The last characters of the file are 0x00 0x0d 0x00 0x0a,
    >> simple \r\n line ending.

    >
    > Sounds like a perfectly normal file to me.
    >
    > It's hard to imagine, but it looks to me like you've found a bug.
    >
    MRAB, Dec 5, 2008
    #9
  10. John Machin

    On Dec 6, 5:36 am, Johannes Bauer <> wrote:
    > So UTF-16 has an explicit EOF marker within the text? I cannot find one
    > in original file, only some kind of starting sequence I suppose
    > (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
    > simple \r\n line ending.


    Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
    long, an ODD number, which shouldn't happen with utf16. The file is
    stuffed. Python 3.0 has a bug; it should give a meaningful error
    message.

    Python 2.6.0 silently ignores the problem [that's a BUG] when read by
    a similar method:

    | >>> import codecs
    | >>> lines = codecs.open('x.txt', 'r', 'utf16').readlines()
    | >>> lines[-1]
    | u'[PhonePBK004]\r\n'

    Python 2.x does however give a meaningful precise error message if you
    try a decode on the file contents:

    | >>> s = open('x.txt', 'rb').read()
    | >>> len(s)
    | 1559
    | >>> s[-35:]
    | '\x00\r\x00\n\x00[\x00P\x00h\x00o\x00n\x00e\x00P\x00B\x00K\x000\x000\x004\x00]\x00\r\x00\n\x00'
    | >>> u = s.decode('utf16')
    | Traceback (most recent call last):
    |   File "<stdin>", line 1, in <module>
    |   File "C:\python26\lib\encodings\utf_16.py", line 16, in decode
    |     return codecs.utf_16_decode(input, errors, True)
    | UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 1558: truncated data
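
    The same check can be sketched for a current Python 3 (the short sample
    bytes stand in for the 1559-byte file and are an assumption): a one-shot
    decode of UTF-16 data with an odd byte count raises UnicodeDecodeError,
    and the exception attributes say where and why.

```python
# A well-formed big-endian sample with BOM, then the same data plus one
# stray trailing byte, mimicking the odd-length file.
good = '\ufeff]\r\n'.encode('utf-16-be')
bad = good + b'\x00'

reason = None
try:
    bad.decode('utf-16')
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte that could not be decoded.
    reason = exc.reason
    offset = exc.start

print(len(bad) % 2, reason)
```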

    HTH,
    John
    John Machin, Dec 5, 2008
    #10
  11. On Fri, 05 Dec 2008 12:00:59 -0700, Joe Strout wrote:

    >> So UTF-16 has an explicit EOF marker within the text?

    >
    > No, it does not. I don't know what Terry's thinking of there, but text
    > files do not have any EOF marker. They start at the beginning
    > (sometimes including a byte-order mark), and go till the end of the
    > file, period.


    Windows text files still interpret ctrl-Z as EOF, or at least Windows XP
    does. Vista, who knows?


    --
    Steven
    Steven D'Aprano, Dec 5, 2008
    #11
  12. John Machin

    On Dec 6, 10:35 am, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    > On Fri, 05 Dec 2008 12:00:59 -0700, Joe Strout wrote:
    > >> So UTF-16 has an explicit EOF marker within the text?

    >
    > > No, it does not.  I don't know what Terry's thinking of there, but text
    > > files do not have any EOF marker.  They start at the beginning
    > > (sometimes including a byte-order mark), and go till the end of the
    > > file, period.

    >
    > Windows text files still interpret ctrl-Z as EOF, or at least Windows XP
    > does. Vista, who knows?


    This applies only to files being read in an 8-bit text mode. It is
    inherited from MS-DOS, which followed the CP/M convention, which was
    necessary because CP/M's file system recorded only the physical file
    length in 128-byte sectors, not the logical length. It is likely to
    continue in perpetuity, just as standard railway gauge is (allegedly)
    based on the axle-length of Roman chariots.

    None of this is relevant to the OP's problem; his file appears to have
    been truncated rather than having spurious data appended to it.
    John Machin, Dec 6, 2008
    #12
  13. MRAB

    John Machin wrote:
    > On Dec 6, 10:35 am, Steven D'Aprano <st...@REMOVE-THIS-
    > cybersource.com.au> wrote:
    >> On Fri, 05 Dec 2008 12:00:59 -0700, Joe Strout wrote:
    >>>> So UTF-16 has an explicit EOF marker within the text?
    >>> No, it does not. I don't know what Terry's thinking of there, but text
    >>> files do not have any EOF marker. They start at the beginning
    >>> (sometimes including a byte-order mark), and go till the end of the
    >>> file, period.

    >> Windows text files still interpret ctrl-Z as EOF, or at least Windows XP
    >> does. Vista, who knows?

    >
    > This applies only to files being read in an 8-bit text mode. It is
    > inherited from MS-DOS, which followed the CP/M convention, which was
    > necessary because CP/M's file system recorded only the physical file
    > length in 128-byte sectors, not the logical length. It is likely to
    > continue in perpetuity, just as standard railway gauge is (allegedly)
    > based on the axle-length of Roman chariots.
    >

    The chariots in question were drawn by 2 horses, so the gauge is based
    on the width of a horse. :)

    > None of this is relevant to the OP's problem; his file appears to have
    > been truncated rather than having spurious data appended to it.
    MRAB, Dec 6, 2008
    #13
  14. wrote:

    > 2 problems: endianness and trailing zero byte.
    > This works for me:


    This is very strange - when using "utf16", endianness should be detected
    automatically. When I simply truncate the trailing zero byte, I receive:

    Traceback (most recent call last):
      File "./modify.py", line 12, in <module>
        a = AddressBook("2008_11_05_Handy_Backup.txt")
      File "./modify.py", line 7, in __init__
        line = f.readline()
      File "/usr/local/lib/python3.0/io.py", line 1807, in readline
        while self._read_chunk():
      File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
        self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
      File "/usr/local/lib/python3.0/io.py", line 1293, in decode
        output = self.decoder.decode(input, final=final)
      File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
      File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
        return self.decoder(input, self.errors, final)
    UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0: truncated data

    But I suppose something *is* indeed weird because the file I uploaded,
    which did not yield the "truncated data" error, is 1559 bytes, which
    just cannot be.

    Regards,
    Johannes

    Johannes Bauer, Dec 6, 2008
    #14
  15. John Machin wrote:
    > On Dec 6, 5:36 am, Johannes Bauer <> wrote:
    >> So UTF-16 has an explicit EOF marker within the text? I cannot find one
    >> in original file, only some kind of starting sequence I suppose
    >> (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
    >> simple \r\n line ending.

    >
    > Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
    > long, an ODD number, which shouldn't happen with utf16. The file is
    > stuffed. Python 3.0 has a bug; it should give a meaningful error
    > message.


    Yes, you are right. I fixed the file, yet another error pops up
    (http://www.file-upload.net/download-1299688/2008_12_05_Handy_Backup.txt.html):

    Traceback (most recent call last):
      File "./modify.py", line 12, in <module>
        a = AddressBook("2008_12_05_Handy_Backup.txt")
      File "./modify.py", line 7, in __init__
        line = f.readline()
      File "/usr/local/lib/python3.0/io.py", line 1807, in readline
        while self._read_chunk():
      File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
        self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
      File "/usr/local/lib/python3.0/io.py", line 1293, in decode
        output = self.decoder.decode(input, final=final)
      File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
      File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
        return self.decoder(input, self.errors, final)
    UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0: truncated data

    File size is 1630 bytes - so this clearly cannot be.

    Regards,
    Johannes

    Johannes Bauer, Dec 6, 2008
    #15
  16. MRAB

    Johannes Bauer wrote:
    > wrote:
    >
    >> 2 problems: endianness and trailing zero byte.
    >> This works for me:

    >
    > This is very strange - when using "utf16", endianness should be detected
    > automatically. When I simply truncate the trailing zero byte, I receive:
    >
    > Traceback (most recent call last):
    > File "./modify.py", line 12, in <module>
    > a = AddressBook("2008_11_05_Handy_Backup.txt")
    > File "./modify.py", line 7, in __init__
    > line = f.readline()
    > File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    > while self._read_chunk():
    > File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    > self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
    > File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    > output = self.decoder.decode(input, final=final)
    > File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    > (result, consumed) = self._buffer_decode(data, self.errors, final)
    > File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in
    > _buffer_decode
    > return self.decoder(input, self.errors, final)
    > UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0:
    > truncated data
    >
    > But I suppose something *is* indeed weird because the file I uploaded,
    > which did not yield the "truncated data" error, is 1559 bytes, which
    > just cannot be.
    >

    It might be that the EOF marker (b'\x1A' or u'\u001A') was written or is
    being read as a single byte instead of 2 bytes for UTF-16 text.
    MRAB, Dec 6, 2008
    #16
  17. Mark Tolonen

    "Johannes Bauer" <> wrote in message
    news:...
    >John Machin wrote:
    >> On Dec 6, 5:36 am, Johannes Bauer <> wrote:
    >>> So UTF-16 has an explicit EOF marker within the text? I cannot find one
    >>> in original file, only some kind of starting sequence I suppose
    >>> (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
    >>> simple \r\n line ending.

    >>
    >> Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
    >> long, an ODD number, which shouldn't happen with utf16. The file is
    >> stuffed. Python 3.0 has a bug; it should give a meaningful error
    >> message.

    >
    >Yes, you are right. I fixed the file, yet another error pops up
    >(http://www.file-upload.net/download-1299688/2008_12_05_Handy_Backup.txt.html):
    >
    >Traceback (most recent call last):
    > File "./modify.py", line 12, in <module>
    > a = AddressBook("2008_12_05_Handy_Backup.txt")
    > File "./modify.py", line 7, in __init__
    > line = f.readline()
    > File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    > while self._read_chunk():
    > File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    > self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
    > File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    > output = self.decoder.decode(input, final=final)
    > File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    > (result, consumed) = self._buffer_decode(data, self.errors, final)
    > File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in
    >_buffer_decode
    > return self.decoder(input, self.errors, final)
    >UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0:
    >truncated data
    >
    >File size is 1630 bytes - so this clearly cannot be.


    How about posting your code? The first file is incorrect. It contains an
    extra 0x00 byte at the end of the file, but is otherwise correctly encoded
    with a big-endian UTF16 BOM and data. The second file is a correct UTF16-BE
    file as well.

    This code (Python 2.6) decodes the first file, removing the trailing extra
    byte:

    raw = open('2008_11_05_Handy_Backup.txt', 'rb').read()
    data = raw[:-1].decode('utf16')

    and this code (Python 2.6) decodes the second:

    raw = open('2008_12_05_Handy_Backup.txt', 'rb').read()
    data = raw.decode('utf16')
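
A rough Python 3 counterpart of the trailing-byte repair might look like the sketch below. It is only an illustration: the helper name `decode_utf16_lenient` and the sample bytes are made up here and are not taken from the files in this thread.

```python
import codecs


def decode_utf16_lenient(raw: bytes) -> str:
    """Decode UTF-16 bytes, dropping one stray trailing byte if the length is odd."""
    if len(raw) % 2:
        # UTF-16 code units are 2 bytes each, so an odd length means a leftover byte.
        raw = raw[:-1]
    return raw.decode("utf-16")


# Synthetic data (an assumption): a big-endian BOM plus one line of text.
good = codecs.BOM_UTF16_BE + 'Entry00Text = "ADAC Verkehrsinfo"\r\n'.encode("utf-16-be")
broken = good + b"\x00"  # same data with an extra 0x00 appended

text = decode_utf16_lenient(broken)
print(repr(text))
```

Silently dropping the byte is a deliberate choice for this sketch; raising an error on odd-length input would be the stricter alternative.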

    Python 3.0 also has no problems decoding the second file, and it gives an
    accurate error message for the first:

    >>> data = open('2008_12_05_Handy_Backup.txt',encoding='utf16').read()
    >>> data = open('2008_11_05_Handy_Backup.txt',encoding='utf16').read()

    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "C:\dev\python30\lib\io.py", line 1724, in read
    decoder.decode(self.buffer.read(), final=True))
    File "C:\dev\python30\lib\io.py", line 1295, in decode
    output = self.decoder.decode(input, final=final)
    File "C:\dev\python30\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
    File "c:\dev\python30\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
    UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 1558:
    truncated data

    -Mark
    Mark Tolonen, Dec 6, 2008
    #17
  18. Johannes Bauer

    John Machin Guest

    On Dec 7, 6:20 am, "Mark Tolonen" <> wrote:
    > "Johannes Bauer" <> wrote in message
    >
    > news:...
    >
    >
    >
    > >John Machin schrieb:
    > >> On Dec 6, 5:36 am, Johannes Bauer <> wrote:
    > >>> So UTF-16 has an explicit EOF marker within the text? I cannot find one
    > >>> in original file, only some kind of starting sequence I suppose
    > >>> (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
    > >>> simple \r\n line ending.

    >
    > >> Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
    > >> long, an ODD number, which shouldn't happen with utf16.  The file is
    > >> stuffed. Python 3.0 has a bug; it should give a meaningful error
    > >> message.

    >
    > >Yes, you are right. I fixed the file, yet another error pops up
    > >(http://www.file-upload.net/download-1299688/2008_12_05_Handy_Backup.t....

    >
    > >Traceback (most recent call last):
    > >  File "./modify.py", line 12, in <module>
    > >    a = AddressBook("2008_12_05_Handy_Backup.txt")
    > >  File "./modify.py", line 7, in __init__
    > >    line = f.readline()
    > >  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    > >    while self._read_chunk():
    > >  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    > >    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
    > >  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    > >    output = self.decoder.decode(input, final=final)
    > >  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    > >    (result, consumed) = self._buffer_decode(data, self.errors, final)
    > >  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in
    > >_buffer_decode
    > >    return self.decoder(input, self.errors, final)
    > >UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0:
    > >truncated data

    >
    > >File size is 1630 bytes - so this clearly cannot be.

    >
    > How about posting your code?


    He did. Ugly stuff using readline() :) Should still work, though.
    There are definite problems with readline() and readlines(),
    including:

    First file: silently ignores error *and* the last line returned is
    garbage [consists of multiple actual lines, and the trailing
    codepoints have been byte-swapped]

    Second file: as he has just reported. I've reproduced it with
    f = open('second_file.txt', encoding='utf16')
    followed by each of:
    (1) f.readlines()
    (2) list(f)
    (3) for line in f:
            print(repr(line))
    With the last one, the error happens after printing the last actual
    line in his file.
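
For anyone who wants to re-run the comparison, the variants above (plus the OP's readline() loop) can be put side by side roughly as follows. This is a sketch against a synthetic file: the sample text and the function names are assumptions, and whether any variant fails will depend on the interpreter version and on the file used.

```python
import codecs
import os
import tempfile

# Synthetic stand-in for the OP's file: a BOM followed by big-endian UTF-16 text.
sample = "first line\r\nsecond line\r\nthird line\r\n"
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(codecs.BOM_UTF16_BE + sample.encode("utf-16-be"))


def read_with_readline(p):
    """The OP's approach: call readline() until it returns an empty string."""
    result = []
    with open(p, encoding="utf-16") as f:
        while True:
            line = f.readline()
            if line == "":
                break
            result.append(line)
    return result


def read_with_readlines(p):
    with open(p, encoding="utf-16") as f:
        return f.readlines()


def read_with_iteration(p):
    with open(p, encoding="utf-16") as f:
        return [line for line in f]


results = [read_with_readline(path), read_with_readlines(path), read_with_iteration(path)]
os.remove(path)

for lines in results:
    print([repr(line) for line in lines])
```

All three lists should agree with each other when the text layer decodes correctly; any difference between them points at the reading path rather than at the file.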
    John Machin, Dec 6, 2008
    #18
  19. Johannes Bauer

    David Bolen Guest

    Johannes Bauer <> writes:

    > This is very strange - when using "utf16", endianness should be detected
    > automatically. When I simply truncate the trailing zero byte, I receive:


    Any chance that whatever you used to "simply truncate the trailing
    zero byte" also removed the BOM at the start of the file? Without it,
    utf16 wouldn't be able to detect endianness and would, I believe, fall
    back to native order.

    -- David
    David Bolen, Dec 6, 2008
    #19
  20. Johannes Bauer

    John Machin Guest

    On Dec 7, 9:01 am, David Bolen <> wrote:
    > Johannes Bauer <> writes:
    > > This is very strange - when using "utf16", endianness should be detected
    > > automatically. When I simply truncate the trailing zero byte, I receive:

    >
    > Any chance that whatever you used to "simply truncate the trailing
    > zero byte" also removed the BOM at the start of the file?  Without it,
    > utf16 wouldn't be able to detect endianness and would, I believe, fall
    > back to native order.


    When I read this, I thought "O no, surely not!". Seems that you are
    correct:
    [Python 2.5.2, Windows XP]
    | >>> nobom = u'abcde'.encode('utf_16_be')
    | >>> nobom
    | '\x00a\x00b\x00c\x00d\x00e'
    | >>> nobom.decode('utf16')
    | u'\u6100\u6200\u6300\u6400\u6500'

    This may well explain one of the Python 3.0 problems that the OP's 2
    files exhibit: data appears to have been byte-swapped under some
    conditions. Possibility: it is reading the file a chunk at a time and
    applying the utf_16 encoding independently to each chunk -- only the
    first chunk will have a BOM.
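
The difference between the two behaviours can be sketched in Python 3 as below. This only illustrates the hypothesis and is not a trace of what io.py actually does; the sample text and the chunk boundary are assumptions chosen for the demonstration.

```python
import codecs
import sys

# Pick the byte order that is NOT native to this machine, so the effect of a
# missing BOM is visible regardless of platform.
if sys.byteorder == "little":
    bom, codec = codecs.BOM_UTF16_BE, "utf-16-be"
else:
    bom, codec = codecs.BOM_UTF16_LE, "utf-16-le"

text = "Entry00Text\r\n"
payload = bom + text.encode(codec)

# Split on an even byte boundary; only the first chunk carries the BOM.
chunks = [payload[:8], payload[8:]]

# Hypothesised faulty behaviour: each chunk is decoded on its own, so the
# second chunk has no BOM and falls back to native byte order.
per_chunk = "".join(chunk.decode("utf-16") for chunk in chunks)

# Stateful decoding: the incremental decoder remembers the byte order it
# detected from the BOM in the first chunk.
decoder = codecs.getincrementaldecoder("utf-16")()
incremental = "".join(
    decoder.decode(chunk, final=(i == len(chunks) - 1))
    for i, chunk in enumerate(chunks)
)

print(ascii(per_chunk))
print(ascii(incremental))
```

The per-chunk result starts correctly and then turns into byte-swapped code points, which resembles the garbage in the OP's output, while the incremental decoder returns the original text.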
    John Machin, Dec 6, 2008
    #20