UTF-8 and stdin/stdout?

Discussion in 'Python' started by dave_140390@hotmail.com, May 28, 2008.

  1. Guest

    Hi,

    I have problems getting my Python code to work with UTF-8 encoding
    when reading from stdin / writing to stdout.

    Say I have a file, utf8_input, that contains a single character, é,
    coded as UTF-8:

    $ hexdump -C utf8_input
    00000000 c3 a9
    00000002

    If I read this file by opening it in this Python script:

    $ cat utf8_from_file.py
    import codecs
    file = codecs.open('utf8_input', encoding='utf-8')
    data = file.read()
    print "length of data =", len(data)

    everything goes well:

    $ python utf8_from_file.py
    length of data = 1

    The contents of utf8_input is one character coded as two bytes, so
    UTF-8 decoding is working here.

    Now, I would like to do the same with standard input. Of course, this:

    $ cat utf8_from_stdin.py
    import sys
    data = sys.stdin.read()
    print "length of data =", len(data)

    does not work:

    $ [/c/DiskCopy] python utf8_from_stdin.py < utf8_input
    length of data = 2

    Here, the contents of utf8_input is not interpreted as UTF-8, so
    Python believes there are two separate characters.

    The question, then:
    How could one get utf8_from_stdin.py to work properly with UTF-8?
    (And same question for stdout.)

    I googled around, and found rather complex stuff (see, for example,
    http://blog.ianbicking.org/illusive-setdefaultencoding.html), but even
    that didn't work: I still get "length of data = 2" even after
    successfully calling sys.setdefaultencoding('utf-8').

    -- dave
     
    , May 28, 2008
    #1

  2. writes:

    > Hi,
    >
    > I have problems getting my Python code to work with UTF-8 encoding
    > when reading from stdin / writing to stdout.
    >
    > Say I have a file, utf8_input, that contains a single character, é,
    > coded as UTF-8:
    >
    > $ hexdump -C utf8_input
    > 00000000 c3 a9
    > 00000002
    >
    > If I read this file by opening it in this Python script:
    >
    > $ cat utf8_from_file.py
    > import codecs
    > file = codecs.open('utf8_input', encoding='utf-8')
    > data = file.read()
    > print "length of data =", len(data)
    >
    > everything goes well:
    >
    > $ python utf8_from_file.py
    > length of data = 1
    >
    > The contents of utf8_input is one character coded as two bytes, so
    > UTF-8 decoding is working here.
    >
    > Now, I would like to do the same with standard input. Of course, this:
    >
    > $ cat utf8_from_stdin.py
    > import sys
    > data = sys.stdin.read()
    > print "length of data =", len(data)


    Shouldn't you do data = data.decode('utf8') ?
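
To make that suggestion concrete, here is a minimal sketch of decoding the raw bytes before counting them. It is written for Python 3, which is an assumption on my part: the thread uses Python 2, where sys.stdin.read() already returns a byte string. The helper name byte_and_char_counts is hypothetical, and an in-memory stream stands in for stdin.

```python
import io


def byte_and_char_counts(stream, encoding="utf-8"):
    """Read all bytes from a binary stream and return (byte count, code point count)."""
    raw = stream.read()            # bytes, e.g. b"\xc3\xa9"
    text = raw.decode(encoding)    # decoded text, e.g. "\u00e9"
    return len(raw), len(text)


# Stand-in for the two-byte utf8_input file from the question.
byte_count, char_count = byte_and_char_counts(io.BytesIO(b"\xc3\xa9"))
print("bytes =", byte_count, "characters =", char_count)
# prints: bytes = 2 characters = 1
```

In Python 3 the same function could be handed sys.stdin.buffer, which is the byte stream underneath sys.stdin.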

    > does not work:
    >
    > $ [/c/DiskCopy] python utf8_from_stdin.py < utf8_input
    > length of data = 2


    --
    Arnaud
     
    Arnaud Delobelle, May 28, 2008
    #2

  3. Chris Guest

    On May 28, 11:08 am, wrote:
    > Hi,
    >
    > I have problems getting my Python code to work with UTF-8 encoding
    > when reading from stdin / writing to stdout.
    >
    > Say I have a file, utf8_input, that contains a single character, é,
    > coded as UTF-8:
    >
    >         $ hexdump -C utf8_input
    >         00000000  c3 a9
    >         00000002
    >
    > If I read this file by opening it in this Python script:
    >
    >         $ cat utf8_from_file.py
    >         import codecs
    >         file = codecs.open('utf8_input', encoding='utf-8')
    >         data = file.read()
    >         print "length of data =", len(data)
    >
    > everything goes well:
    >
    >         $ python utf8_from_file.py
    >         length of data = 1
    >
    > The contents of utf8_input is one character coded as two bytes, so
    > UTF-8 decoding is working here.
    >
    > Now, I would like to do the same with standard input. Of course, this:
    >
    >         $ cat utf8_from_stdin.py
    >         import sys
    >         data = sys.stdin.read()
    >         print "length of data =", len(data)
    >
    > does not work:
    >
    >         $ [/c/DiskCopy] python utf8_from_stdin.py < utf8_input
    >         length of data = 2
    >
    > Here, the contents of utf8_input is not interpreted as UTF-8, so
    > Python believes there are two separate characters.
    >
    > The question, then:
    > How could one get utf8_from_stdin.py to work properly with UTF-8?
    > (And same question for stdout.)
    >
    > I googled around, and found rather complex stuff (see, for example,
    > http://blog.ianbicking.org/illusive-setdefaultencoding.html), but even
    > that didn't work: I still get "length of data = 2" even after
    > successfully calling sys.setdefaultencoding('utf-8').
    >
    > -- dave


    weird thing is 'c3 a9' is é on my side... and copy/pasting the é
    gives me 'e9' with the first script giving a result of zero and second
    script gives me 1
     
    Chris, May 28, 2008
    #3
  4. Guest

    > Shouldn't you do data = data.decode('utf8') ?

    Yes, that's it! Thanks.

    -- dave
     
    , May 28, 2008
    #4
  5. Chris wrote:
    > On May 28, 11:08 am, wrote:
    >> Say I have a file, utf8_input, that contains a single character, é,
    >> coded as UTF-8:
    >>
    >> $ hexdump -C utf8_input
    >> 00000000  c3 a9
    >> 00000002

    [...]
    > weird thing is 'c3 a9' is é on my side... and copy/pasting the é
    > gives me 'e9' with the first script giving a result of zero and second
    > script gives me 1


    Don't worry, it can be that those are equivalent. The point is that some
    characters exist more than once and some exist in a composite form (e with
    accent) and separately (e and combining accent).

    Looking at http://unicode.org/charts I see that the letter above should have
    codepoint 0xe9 (precomposed character) or 0x65 (e) and 0x301 (combining
    acute accent).
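
The two spellings of the letter can be compared with a short sketch using the standard unicodedata module. This is Python 3 (an assumption, since the thread is about Python 2), and the variable names are only illustrative.

```python
import unicodedata

precomposed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"     # "e" (U+0065) followed by COMBINING ACUTE ACCENT

# The strings differ code point by code point, although they render the same.
print(len(precomposed), len(decomposed))   # prints: 1 2
print(precomposed == decomposed)           # prints: False

# Normalizing to a common form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # prints: True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # prints: True
```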

    0xe9 = 1110 1001 (codepoint)
    0xc3 0xa9 = 1100 0011 1010 1001 (UTF-8)
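
The bit patterns above can be reproduced by hand. The following is a rough Python 3 sketch of the two-byte UTF-8 layout (110xxxxx 10xxxxxx) only; the function name encode_two_byte_utf8 is hypothetical, and real code would simply call str.encode("utf-8").

```python
def encode_two_byte_utf8(codepoint):
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= codepoint <= 0x7FF:
        raise ValueError("only the two-byte range U+0080..U+07FF is handled")
    lead = 0b11000000 | (codepoint >> 6)             # 110xxxxx: upper 5 bits
    continuation = 0b10000000 | (codepoint & 0x3F)   # 10xxxxxx: lower 6 bits
    return bytes([lead, continuation])


encoded = encode_two_byte_utf8(0xE9)
print(encoded.hex())                         # prints: c3a9
print(encoded == "\u00e9".encode("utf-8"))   # prints: True
```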

    Anyhow, further looking at this shows that your editor simply doesn't
    interpret the two bytes as UTF-8 but as Latin-1 or similar encoding, where
    they represent the capital A with tilde and the copyright sign.
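
That misreading is easy to reproduce. A small Python 3 sketch (again an assumption, the thread is Python 2) decodes the same two bytes once as UTF-8 and once as Latin-1, and also shows where the single byte 'e9' comes from when the character is saved as Latin-1.

```python
import unicodedata

raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")       # one character
as_latin1 = raw.decode("latin-1")   # two characters, one per byte

for label, text in (("utf-8", as_utf8), ("latin-1", as_latin1)):
    names = [unicodedata.name(ch) for ch in text]
    print(label, len(text), names)
# prints: utf-8 1 ['LATIN SMALL LETTER E WITH ACUTE']
# prints: latin-1 2 ['LATIN CAPITAL LETTER A WITH TILDE', 'COPYRIGHT SIGN']

# Encoding the character as Latin-1 gives the single byte e9.
print(as_utf8.encode("latin-1").hex())   # prints: e9
```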

    Uli

    --
    Sator Laser GmbH
    Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
     
    Ulrich Eckhardt, May 28, 2008
    #5
  6. > $ cat utf8_from_stdin.py
    > import sys
    > data = sys.stdin.read()
    > print "length of data =", len(data)


    sys.stdin is a byte stream in Python 2, not a character stream.
    To make it a character stream, do

    import codecs
    import sys
    sys.stdin = codecs.getreader("utf-8")(sys.stdin)
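
As a self-contained illustration of that wrapping, here is a sketch that uses in-memory byte streams as stand-ins for stdin and stdout, so it runs without any redirection. It is written for Python 3 (an assumption; the thread targets Python 2), and the names byte_input and byte_output are hypothetical. codecs.getreader and codecs.getwriter are the standard-library calls.

```python
import codecs
import io

# Stand-in for a byte-oriented stdin holding the two bytes from the question.
byte_input = io.BytesIO(b"\xc3\xa9")
reader = codecs.getreader("utf-8")(byte_input)   # decodes bytes on read
data = reader.read()
print("length of data =", len(data))             # prints: length of data = 1

# The same idea for output: wrap a byte stream so text is encoded on write.
byte_output = io.BytesIO()
writer = codecs.getwriter("utf-8")(byte_output)  # encodes text on write
writer.write(data)
print(byte_output.getvalue().hex())              # prints: c3a9
```

In Python 2 the stdout counterpart would presumably be sys.stdout = codecs.getwriter("utf-8")(sys.stdout), mirroring the stdin line. In Python 3, sys.stdin and sys.stdout are already text streams, so the wrapping is normally unnecessary there.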

    HTH,
    Martin
     
    Martin v. Löwis, May 28, 2008
    #6
