split() can help to read UTF-16 encoded file without codecs support, why?

Discussion in 'Python' started by Zhongjian Lu, Mar 17, 2006.

  1. Zhongjian Lu

    Zhongjian Lu Guest

    Hi Guys,

    I was processing a UTF-16 encoded file with a BOM and was not aware of the
    codecs module at first. I wrote the following code:
    ===== Code 1============================
    for i in open("d:\python24\lzjtest.xml", 'r').readlines():
        i = i.decode("utf-16")
        print i
    =======================================
    Output was:
    Traceback (most recent call last):
    File "D:\Python24\testutf-16.py", line 4, in -toplevel-
    i = i.decode("utf-16")
    File "D:\Python24\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
    UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position
    84: truncated data

    I searched Google and found an article on a similar problem that said to use
    split(). I did not fully understand the article, but I rewrote the code as:
    ==== Code 2==============================
    for i in open("d:\python24\lzjtest.xml", 'r').read().split('\r\n'):
        i = i.decode("utf-16")
        print i
    =======================================
    Then it worked (it echoed the file).

    Later I learned about codecs and wrote the following code:

    ==== Code 3 =============================
    import codecs
    for i in codecs.open("d:\python24\lzjtesttvs2.xml", 'r', 'utf-16').readlines():
        print i
    =======================================
    It worked and echoed the file.

    I am wondering what the problem is with the first code and why the bug
    is fixed in the second.

    Thanks in advance.

    -Zhongjian
    Zhongjian Lu, Mar 17, 2006
    #1

  2. Fuzzyman

    Fuzzyman Guest

    Re: split() can help to read UTF-16 encoded file without codecs support, why?

    Zhongjian Lu wrote:
    > Hi Guys,
    >
    > I was processing a UTF-16 encoded file with a BOM and was not aware of the
    > codecs module at first. I wrote the following code:
    > ===== Code 1============================
    > for i in open("d:\python24\lzjtest.xml", 'r').readlines():
    >     i = i.decode("utf-16")
    >     print i
    > =======================================
    > Output was:
    > Traceback (most recent call last):
    > File "D:\Python24\testutf-16.py", line 4, in -toplevel-
    > i = i.decode("utf-16")
    > File "D:\Python24\lib\encodings\utf_16.py", line 16, in decode
    > return codecs.utf_16_decode(input, errors, True)
    > UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position
    > 84: truncated data
    >


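    UTF-16 stores each character in two bytes (four for characters outside
    the basic plane). In little-endian byte order this means that '\r\n' is
    stored as:

    '\r\x00\n\x00'

    readlines() on a file opened without codecs knows nothing about the
    encoding. It works on bytes and ends a line right after every '\n' byte,
    so each line comes back ending in:

    '\r\x00\n'

    and the following line starts with the left-over '\x00'.

    The first line now has an odd number of bytes, so its last byte (0x0a)
    is only half of a two-byte unit. That is the 'truncated data' in the
    error: the data has been split on bytes instead of characters.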
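
The splitting described above can be reproduced without a file. This is only a minimal sketch, assuming Python 3 (where bytes and str are separate types, unlike the Python 2.4 code in the thread) and little-endian UTF-16; the sample text and the variable names are made up for illustration.

```python
import codecs
import io

# Hypothetical sample standing in for the XML file: two lines with
# Windows line endings, encoded as little-endian UTF-16 with a BOM.
text = "ab\r\ncd\r\n"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# A byte stream splits after every b"\n" byte, with no idea that the
# "\n" character is really the two bytes b"\n\x00".
chunks = io.BytesIO(data).readlines()
# The chunk lengths are 9, 8 and 1 bytes; the last chunk is a lone b"\x00".

# The first chunk has an odd number of bytes, so decoding it fails.
try:
    chunks[0].decode("utf-16")
    first_chunk_error = None
except UnicodeDecodeError as exc:
    first_chunk_error = exc

# Decoding the undivided data works, and the BOM is consumed.
whole = data.decode("utf-16")
```

Here first_chunk_error holds the UnicodeDecodeError, while whole is the original text again.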


    > I searched Google and found an article on a similar problem that said to use
    > split(). I did not fully understand the article, but I rewrote the code as:
    > ==== Code 2==============================
    > for i in open("d:\python24\lzjtest.xml", 'r').read().split('\r\n'):
    >     i = i.decode("utf-16")
    >     print i
    > =======================================
    > Then it worked (it echoed the file).
    >


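    You will probably find that the byte pair '\r\n' never occurs in the
    byte-string, because every '\r' is followed by '\x00'. split('\r\n')
    therefore returns the whole file as a single piece, the loop runs
    once, and the decode succeeds because nothing was cut in the middle
    of a character.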
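
A small sketch of why Code 2 appears to work, again assuming Python 3 and little-endian UTF-16, with made-up sample text and variable names:

```python
import codecs

# Hypothetical sample standing in for the XML file from the thread.
text = "ab\r\ncd\r\n"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# In the encoded bytes "\r" and "\n" are separated by a zero byte,
# so the two-byte separator b"\r\n" is not found here.
parts = data.split(b"\r\n")

# The single piece is the complete file, so it decodes cleanly.
decoded = parts[0].decode("utf-16")

# To get real lines, split after decoding, on characters.
lines = decoded.splitlines()
```

With this sample, parts has exactly one element and lines holds the two text lines. Decoding first and splitting afterwards is what codecs.open does in Code 3.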

    HTH

    All the best,

    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml

    > Later I learned about codecs and wrote the following code:
    >
    > ==== Code 3 =============================
    > import codecs
    > for i in codecs.open("d:\python24\lzjtesttvs2.xml", 'r', 'utf-16').readlines():
    >     print i
    > =======================================
    > It worked and echoed the file.
    >
    > I am wondering what the problem is with the first code and why the bug
    > is fixed in the second.
    >
    > Thanks in advance.
    >
    > -Zhongjian
    Fuzzyman, Mar 17, 2006
    #2