Unicode string handling problem

Discussion in 'Python' started by Richard Schulman, Sep 6, 2006.

  1. The following program fragment works correctly with an ascii input
    file.

    But the file I actually want to process is Unicode (utf-16 encoding).
    The file must be Unicode rather than ASCII or Latin-1 because it
    contains mixed Chinese and English characters.

    When I run the program below I get an attribute_count of zero, which
    is incorrect for the input file, which should give a value of fifteen
    or sixteen. In other words, the count function isn't recognizing the
    ", characters in the line being read. Here's the program:

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # Skip the first line; make the second available for processing
    in_file.readline()
    in_line = readline()
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()

    Any suggestions?

    Richard Schulman
    (For email reply, delete the 'xx' characters)
    Richard Schulman, Sep 6, 2006
    #1

  2. Richard Schulman

    John Machin Guest

    Richard Schulman wrote:
    > The following program fragment works correctly with an ascii input
    > file.
    >
    > But the file I actually want to process is Unicode (utf-16 encoding).
    > The file must be Unicode rather than ASCII or Latin-1 because it
    > contains mixed Chinese and English characters.
    >
    > When I run the program below I get an attribute_count of zero, which
    > is incorrect for the input file, which should give a value of fifteen
    > or sixteen. In other words, the count function isn't recognizing the
    > ", characters in the line being read. Here's the program:
    >
> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
> try:
>     # Skip the first line; make the second available for processing
>     in_file.readline()
>     in_line = readline()


    You mean in_line = in_file.readline(), I hope. Do please copy/paste
    actual code, not what you think you ran.

>     attribute_count = in_line.count('",')
>     print attribute_count


    Insert
    print type(in_line)
    print repr(in_line)
    here [also make the appropriate changes to get the same info from the
    first line], run it again, copy/paste what you get, show us what you
    see.

    If you're coy about that, then you'll have to find out yourself if it
    has a BOM at the front, and if not whether it's little/big/endian.
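For example, peeking at the first two bytes would tell you (an untested
sketch, reusing the path from your post):

    # Read the first two bytes in binary mode and look for a UTF-16 BOM.
    raw = open("c:\\pythonapps\\in-graf1.my", "rb").read(2)
    if raw == '\xff\xfe':
        print "UTF-16, little-endian BOM"
    elif raw == '\xfe\xff':
        print "UTF-16, big-endian BOM"
    else:
        print "no UTF-16 BOM; first bytes are", repr(raw)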

> finally:
>     in_file.close()
    >
    > Any suggestions?
    >


    1. Read the Unicode HOWTO.
    2. Read the docs on the codecs module ...

    You'll need to use

    in_file = codecs.open(filepath, mode, encoding="utf16???????")

    It would also be a good idea to get into the habit of using unicode
    constants like u'",'
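Putting the two together, something like this (an untested sketch;
"utf_16" is a placeholder until you've checked what BOM the file has):

    import codecs

    in_file = codecs.open("c:\\pythonapps\\in-graf1.my", "r", encoding="utf_16")
    try:
        in_file.readline()              # skip the INSERT line
        in_line = in_file.readline()
        print in_line.count(u'",')      # note the unicode constant
    finally:
        in_file.close()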

    HTH,
    John
    John Machin, Sep 6, 2006
    #2

  3. Richard Schulman

    John Roth Guest

    Richard Schulman wrote:
    > The following program fragment works correctly with an ascii input
    > file.
    >
    > But the file I actually want to process is Unicode (utf-16 encoding).
    > The file must be Unicode rather than ASCII or Latin-1 because it
    > contains mixed Chinese and English characters.
    >
    > When I run the program below I get an attribute_count of zero, which
    > is incorrect for the input file, which should give a value of fifteen
    > or sixteen. In other words, the count function isn't recognizing the
    > ", characters in the line being read. Here's the program:
    >
> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
> try:
>     # Skip the first line; make the second available for processing
>     in_file.readline()
>     in_line = readline()
>     attribute_count = in_line.count('",')
>     print attribute_count
> finally:
>     in_file.close()
    >
    > Any suggestions?
    >
    > Richard Schulman
    > (For email reply, delete the 'xx' characters)


    You're not detecting the file encoding and then
    using it in the open statement. If you know this is
    utf-16le or utf-16be, you need to say so in the
    open. If you don't, then you should read it into
    a string, go through some autodetect logic, and
    then decode it with the <string>.decode(encoding)
    method.

    A clue: a properly formatted utf-16 or utf-32
    file MUST have a BOM as the first character.
    That's mandated in the unicode standard. If
    it doesn't have a BOM, then try ascii and
    utf-8 in that order. The first
    one that succeeds is correct. If neither succeeds,
    you're on your own in guessing the file encoding.
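As a rough illustration (untested; the guess_decode name is made up
for this sketch, and utf-32 is ignored for brevity):

    # Decode a raw byte string, guessing the encoding as described above.
    def guess_decode(raw):
        if raw[:2] in ('\xff\xfe', '\xfe\xff'):
            return raw.decode('utf_16')     # the utf_16 codec eats the BOM
        for encoding in ('ascii', 'utf-8'):
            try:
                return raw.decode(encoding)
            except UnicodeDecodeError:
                pass
        raise ValueError("cannot guess the file encoding")

    text = guess_decode(open("c:\\pythonapps\\in-graf1.my", "rb").read())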

    John Roth
    John Roth, Sep 6, 2006
    #3
  4. Thanks for your excellent debugging suggestions, John. See below for
    my follow-up:

    Richard Schulman:
    >> The following program fragment works correctly with an ascii input
    >> file.
    >>
    >> But the file I actually want to process is Unicode (utf-16 encoding).
    >> The file must be Unicode rather than ASCII or Latin-1 because it
    >> contains mixed Chinese and English characters.
    >>
    >> When I run the program below I get an attribute_count of zero, which
    >> is incorrect for the input file, which should give a value of fifteen
    >> or sixteen. In other words, the count function isn't recognizing the
    >> ", characters in the line being read. Here's the program:
    >>...


    John Machin:
    >Insert
    > print type(in_line)
    > print repr(in_line)
    >here [also make the appropriate changes to get the same info from the
    >first line], run it again, copy/paste what you get, show us what you
    >see.


    Here's the revised program, per your suggestion:

    =====================================================

    # This program processes a UTF-16 input file that is
    # to be loaded later into a mySQL table. The input file
    # is not yet ready for prime time. The purpose of this
    # program is to ready it.

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # The first line read is a SQL INSERT statement; no
    # processing will be required.
    in_line = in_file.readline()
    print type(in_line) #For debugging
    print repr(in_line) #For debugging

    # The second line read is the first data row.
    in_line = in_file.readline()
    print type(in_line) #For debugging
    print repr(in_line) #For debugging

    # For this and subsequent rows, we must count all
    # the < ", > character-pairs in a given line/row.
    # This will provide an n-1 measure of the attributes
    # for a SQL insert of this row. All rows must have
    # sixteen attributes, but some don't yet.
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()

    =====================================================

The output of this program, which I ran at the command line,
had to be copied by hand and abridged, but I think I
have included the relevant information:

    C:\pythonapps>python graf_correction.py
    <type 'str'>
    '\xff\xfeI\x00N\x00S... [the beginning of a SQL INSERT statement]
    ....\x00U\x00E\x00S\x00\n' [the VALUES keyword at the end of the row,
    followed by an end-of-line]
    <type 'str'>
'\x00\n' [oh-oh! For the second row, all we're seeing
is an end-of-line character. Is that from
the first row? Wasn't the "rU" mode
supposed to handle that?]
    0 [the counter value. It's hardly surprising
    it's only zero, given that most of the row
    never got loaded, just an eol mark]

    J.M.:
    >If you're coy about that, then you'll have to find out yourself if it
    >has a BOM at the front, and if not whether it's little/big/endian.


    The BOM is little-endian, I believe.

    R.S.:
    >> Any suggestions?


    J.M.
    >1. Read the Unicode HOWTO.
    >2. Read the docs on the codecs module ...
    >
    >You'll need to use
    >
    >in_file = codecs.open(filepath, mode, encoding="utf16???????")


    Right you are. Here is the output produced by so doing:

    <type 'unicode'>
u'\ufeffINSERT INTO [...] VALUES\n'
    <type 'unicode'>
    u'\n'
    0 [The counter value]

    >It would also be a good idea to get into the habit of using unicode
    >constants like u'",'


    Right.

    >HTH,
    >John


    Yes, it did. Many thanks! Now I've got to figure out the best way to
    handle that \n\n at the end of each row, which the program is
    interpreting as two rows. That represents two surprises: first, I
    thought that Microsoft files ended as \n\r ; second, I thought that
    Python mode "rU" was supposed to be the universal eol handler and
    would handle the \n\r as one mark.

    Richard Schulman
    Richard Schulman, Sep 6, 2006
    #4
  5. On 5 Sep 2006 19:50:27 -0700, "John Roth" <>
    wrote:

    >> [T]he file I actually want to process is Unicode (utf-16 encoding).
    >>...
    >> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
    >>...


    John Roth:
    >You're not detecting the file encoding and then
    >using it in the open statement. If you know this is
    >utf-16le or utf-16be, you need to say so in the
    >open. If you don't, then you should read it into
    >a string, go through some autodetect logic, and
    >then decode it with the <string>.decode(encoding)
    >method.
    >
    >A clue: a properly formatted utf-16 or utf-32
    >file MUST have a BOM as the first character.
    >That's mandated in the unicode standard. If
    >it doesn't have a BOM, then try ascii and
    >utf-8 in that order. The first
    >one that succeeds is correct. If neither succeeds,
    >you're on your own in guessing the file encoding.


    Thanks for this further information. I'm now using the codec with
    improved results, but am still puzzled as to how to handle the row
    termination of \n\n, which is being interpreted as two rows instead of
    one.
    Richard Schulman, Sep 6, 2006
    #5
  6. On Wed, 06 Sep 2006 03:55:18 GMT, Richard Schulman
    <> wrote:

    >...I'm now using the codec with
    >improved results, but am still puzzled as to how to handle the row
    >termination of \n\n, which is being interpreted as two rows instead of
    >one.


    Of course, I could do a double read on each row and ignore the second
    read, which merely fetches the final of the two u'\n' characters. But
    that's not very elegant, and I'm sure there's a better way to do it
    (hint, hint someone).
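Maybe something along these lines (untested) would do, though I'd
welcome better ideas:

    import codecs

    # Skip the phantom blank rows instead of pairing the reads.
    in_file = codecs.open("c:\\pythonapps\\in-graf1.my", "rU", encoding="utf_16")
    try:
        while True:
            in_line = in_file.readline()
            if not in_line:
                break                   # end of file
            if not in_line.strip():
                continue                # drop the spurious u'\n' rows
            print in_line.count(u'",')
    finally:
        in_file.close()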

    Richard Schulman (for email, drop the 'xx' in the reply-to)
    Richard Schulman, Sep 6, 2006
    #6
  7. Richard Schulman

    John Machin Guest

    Richard Schulman wrote:
    [big snip]
    >
    > The BOM is little-endian, I believe.

    Correct.

    > >in_file = codecs.open(filepath, mode, encoding="utf16???????")

    >
    > Right you are. Here is the output produced by so doing:


    You don't say which encoding you used, but I guess that you used
    utf_16_le.

    >
    > <type 'unicode'>
> u'\ufeffINSERT INTO [...] VALUES\n'


    Use utf_16 -- it will strip off the BOM for you.

    > <type 'unicode'>
    > u'\n'
    > 0 [The counter value]
    >

    [snip]
    > Yes, it did. Many thanks! Now I've got to figure out the best way to
    > handle that \n\n at the end of each row, which the program is
    > interpreting as two rows.


    Well we don't know yet exactly what you have there. We need a byte dump
    of the first few bytes of your file. Get into the interactive
    interpreter and do this:

    open('yourfile', 'rb').read(200)
    (the 'b' is for binary, in case you are on Windows)
    That will show us exactly what's there, without *any* EOL
    interpretation at all.


    > That represents two surprises: first, I
    > thought that Microsoft files ended as \n\r ;


    Nah. Wrong on two counts. In text mode, Microsoft *lines* end in \r\n
    (not \n\r); *files* may end in ctrl-Z aka chr(26) -- an inheritance
    from CP/M.

    Ummmm ... are you saying the file has \n\r at the end of each row?? How
    did you know that if you didn't know what if any BOM it had??? Who
    created the file????

    > second, I thought that
    > Python mode "rU" was supposed to be the universal eol handler and
    > would handle the \n\r as one mark.


    Nah again. It contemplates only \n, \r, and \r\n as end of line. See
    the docs. Thus \n\r becomes *two* newlines when read with "rU".

    Having "\n\r" at the end of each row does fit with your symptoms:

    | >>> bom = u"\ufeff"
    | >>> guff = '\n\r'.join(['abc', 'def', 'ghi'])
    | >>> guffu = unicode(guff)
    | >>> import codecs
    | >>> f = codecs.open('guff.utf16le', 'wb', encoding='utf_16_le')
    | >>> f.write(bom+guffu)
    | >>> f.close()

| >>> open('guff.utf16le', 'rb').read() #### see exactly what we've got
| '\xff\xfea\x00b\x00c\x00\n\x00\r\x00d\x00e\x00f\x00\n\x00\r\x00g\x00h\x00i\x00'

| >>> codecs.open('guff.utf16le', 'r', encoding='utf_16').read()
| u'abc\n\rdef\n\rghi' ######### Look, Mom, no BOM!

| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16').read()
| u'abc\n\ndef\n\nghi' #### U means \r -> \n

| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16_le').read()
| u'\ufeffabc\n\ndef\n\nghi' ######### reproduces your second experience

| >>> open('guff.utf16le', 'rU').readlines()
| ['\xff\xfea\x00b\x00c\x00\n', '\x00\n', '\x00d\x00e\x00f\x00\n', '\x00\n', '\x00g\x00h\x00i\x00']
| >>> f = open('guff.utf16le', 'rU')
| >>> f.readline()
| '\xff\xfea\x00b\x00c\x00\n'
| >>> f.readline()
| '\x00\n' ######### reproduces your first experience
| >>> f.readline()
| '\x00d\x00e\x00f\x00\n'
| >>>

    If that file is a one-off, you can obviously fix it by
    throwing away every second line. Otherwise, if it's an ongoing
    exercise, you need to talk sternly to the file's creator :)
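For the one-off case, the fix can be as small as this (untested
sketch, using the guff.utf16le demo file from above):

    import codecs

    f = codecs.open('guff.utf16le', 'rU', encoding='utf_16')
    real_lines = f.readlines()[::2]     # keep every second line
    f.close()
    # real_lines == [u'abc\n', u'def\n', u'ghi\n']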

    HTH,
    John
    John Machin, Sep 6, 2006
    #7
  8. Many thanks for your help, John, in giving me the tools to work
    successfully in Python with Unicode from here on out.

    It turns out that the Unicode input files I was working with (from MS
    Word and MS Notepad) were indeed creating eol sequences of \r\n, not
    \n\n as I had originally thought. The file reading statement that I
    was using, with unpredictable results, was

#in_file = codecs.open("c:\\pythonapps\\in-graf2.my","rU",encoding="utf-16LE")

    This was reading to the \n on first read (outputting the whole line,
    including the \n but, weirdly, not the preceding \r). Then, also
    weirdly, the next readline would read the same \n again, interpreting
    that as the entirety of a phantom second line. So each input file line
    ended up producing two output lines.

    Once the mode string "rU" was dropped, as in

in_file = codecs.open("c:\\pythonapps\\in-graf2.my",encoding="utf-16LE")

    all suddenly became well: no more doubled readlines, and one could see
    the \r\n termination of each line.
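For counting purposes I can now strip the terminator explicitly; an
untested sketch of what I have in mind:

    import codecs

    in_file = codecs.open("c:\\pythonapps\\in-graf2.my", encoding="utf-16LE")
    try:
        while True:
            in_line = in_file.readline()
            if not in_line:
                break
            # Each line still ends in u'\r\n', so strip before counting.
            print in_line.rstrip(u'\r\n').count(u'",')
    finally:
        in_file.close()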

    This behavior of "rU" was not at all what I had expected from the
    brief discussion of it in _Python Cookbook_. Which all goes to point
    out how difficult it is to cook challenging dishes with sketchy
    recipes alone. There is no substitute for the helpful advice of an
    experienced chef.

    -Richard Schulman
    (remove "xx" for email reply)

On 5 Sep 2006 22:29:59 -0700, "John Machin" <>
wrote:

>[John's reply #7 quoted in full -- snipped]
    Richard Schulman, Sep 7, 2006
    #8
  9. Richard Schulman

    John Machin Guest

    Richard Schulman wrote:
    > It turns out that the Unicode input files I was working with (from MS
    > Word and MS Notepad) were indeed creating eol sequences of \r\n, not
    > \n\n as I had originally thought. The file reading statement that I
    > was using, with unpredictable results, was
    >
> #in_file = codecs.open("c:\\pythonapps\\in-graf2.my","rU",encoding="utf-16LE")
    >
    > This was reading to the \n on first read (outputting the whole line,
    > including the \n but, weirdly, not the preceding \r). Then, also
    > weirdly, the next readline would read the same \n again, interpreting
    > that as the entirety of a phantom second line. So each input file line
    > ended up producing two output lines.
    >
    > Once the mode string "rU" was dropped, as in
    >
> in_file = codecs.open("c:\\pythonapps\\in-graf2.my",encoding="utf-16LE")
    >
    > all suddenly became well: no more doubled readlines, and one could see
    > the \r\n termination of each line.


You are on Windows. I would *not* describe as "well" lines read in (the
default) text mode ending in u"\r\n". I would expect it to convert the
line endings to u"\n". At best, this should be documented. Perhaps
someone with some knowledge of the intended treatment of line endings
by codecs.open() in text mode could comment? The two problems are
succinctly described below:

File created in Windows Notepad and saved with "Unicode" encoding.
Results in UTF-16LE encoding, line terminator is CR LF, has BOM (LE) at
front -- as shown below.

| Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
| Type "help", "copyright", "credits" or "license" for more information.
| >>> open('notepad_uc.txt', 'rb').read()
| '\xff\xfea\x00b\x00c\x00\r\x00\n\x00d\x00e\x00f\x00\r\x00\n\x00g\x00h\x00i\x00\r\x00\n\x00'
| >>> import codecs
| >>> codecs.open('notepad_uc.txt', 'r', encoding='utf_16_le').readlines()
| [u'\ufeffabc\r\n', u'def\r\n', u'ghi\r\n']
| >>> codecs.open('notepad_uc.txt', 'r', encoding='utf_16').readlines()
| [u'abc\r\n', u'def\r\n', u'ghi\r\n']
### presence of u'\r' was *not* expected
| >>> codecs.open('notepad_uc.txt', 'rU', encoding='utf_16_le').readlines()
| [u'\ufeffabc\n', u'\n', u'def\n', u'\n', u'ghi\n', u'\n']
| >>> codecs.open('notepad_uc.txt', 'rU', encoding='utf_16').readlines()
| [u'abc\n', u'\n', u'def\n', u'\n', u'ghi\n', u'\n']
### 'U' flag does change the behaviour, but *not* as expected.
    ### 'U' flag does change the behaviour, but *not* as expected.

    Cheers,
    John
    John Machin, Sep 7, 2006
    #9
