Unicode problem

Discussion in 'Python' started by Rehceb Rotkiv, Apr 7, 2007.

  1. Please have a look at this little script:

    #!/usr/bin/python
    import sys
    import codecs
    fileHandle = codecs.open(sys.argv[1], 'r', 'utf-8')
    fileString = fileHandle.read()
    print fileString

    If I call it from a Bash shell like this

    $ ./test.py testfile.utf8.txt

    it works just fine, but when I try to pipe the output to another process
    ("|") or into a file (">"), e.g. like this

    $ ./test.py testfile.utf8.txt | cat

    I get an error:

    Traceback (most recent call last):
    File "./test.py", line 6, in ?
    print fileString
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
    position 538: ordinal not in range(128)

    I have no idea what the problem is here. Can you help?

    Thanks,
    Rehceb
     
    Rehceb Rotkiv, Apr 7, 2007
    #1

  2. Rehceb Rotkiv wrote:

    > #!/usr/bin/python
    > import sys
    > import codecs
    > fileHandle = codecs.open(sys.argv[1], 'r', 'utf-8')
    > fileString = fileHandle.read()
    > print fileString
    >
    > if I call it from a Bash shell like this
    >
    > $ ./test.py testfile.utf8.txt
    >
    > it works just fine, but when I try to pipe the output to another process
    > ("|") or into a file (">"), e.g. like this
    >
    > $ ./test.py testfile.utf8.txt | cat
    >
    > I get an error:
    >
    > Traceback (most recent call last):
    > File "./test.py", line 6, in ?
    > print fileString
    > UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
    > position 538: ordinal not in range(128)
    >
    > I absolutely don't know what's the problem here, can you help?


    Because you use codecs.open, reading the file gives you a Unicode
    object. When you print a Unicode object, it is encoded using your
    terminal's encoding (utf8, I presume?).
    But when you redirect the output, it is no longer connected to your
    terminal, so no encoding can be assumed and the default encoding
    (ascii) is used instead.
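
The failure can be reproduced without any pipe. This is only a minimal sketch, assuming the implicit encoding step behaves like an explicit `.encode("ascii")` call; the sample string is made up, and the snippet is written so that it also runs on Python 3:

```python
# Hypothetical sample text containing u'\xe4', the character from the traceback.
text = u"M\xe4dchen"

try:
    text.encode("ascii")  # what effectively happens when no encoding is known
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False  # u'\xe4' has no ASCII representation

# Encoding explicitly as UTF-8 succeeds and yields two bytes for u'\xe4'.
utf8_bytes = text.encode("utf-8")
```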

    Try this line at the top:
    print "stdout:", sys.stdout.encoding, "default:", sys.getdefaultencoding()

    I get "stdout: ANSI_X3.4-1968 default: ascii" normally and
    "stdout: None default: ascii" when redirected.

    You have to encode the Unicode object explicitly:

    print fileString.encode("utf-8")

    (or use any other suitable encoding; I chose utf-8 only because you
    read the input file using that)
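
One way to avoid hard-coding the encoding is to prefer the stream's own encoding and fall back only when it is None. The helper name `encode_for_output` and its `fallback` parameter are hypothetical, not part of any library; treat this as a sketch rather than a recommended API:

```python
def encode_for_output(text, stream_encoding, fallback="utf-8"):
    """Encode text with the stream's encoding, or with fallback if it is None."""
    # sys.stdout.encoding is None when output is redirected (see above),
    # so `or` picks the fallback in exactly that case.
    return text.encode(stream_encoding or fallback)


redirected = encode_for_output(u"\xe4", None)       # no encoding known
latin1_term = encode_for_output(u"\xe4", "latin-1")  # stream reports Latin-1
```

In the original script this could be used as `print encode_for_output(fileString, sys.stdout.encoding)`. It will still raise UnicodeEncodeError if the stream reports an encoding, such as ASCII, that cannot represent the text.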

    --
    Gabriel Genellina
     
    Gabriel Genellina, Apr 7, 2007
    #2

  3. On Sat, 07 Apr 2007 12:46:49 -0700, Gabriel Genellina wrote:

    > You have to encode the Unicode object explicitly: print
    > fileString.encode("utf-8")
    > (or any other suitable one; I said utf-8 just because you read the input
    > file using that)


    Thanks! That's a nice little stumbling block for a newbie like me ;) Is
    there a way to make utf-8 the default encoding for every string, so that
    I do not have to encode each string explicitly?
     
    Rehceb Rotkiv, Apr 8, 2007
    #3
  4. > Thanks! That's a nice little stumbling block for a newbie like me ;) Is
    > there a way to make utf-8 the default encoding for every string, so that
    > I do not have to encode each string explicitly?


    You can make sys.stdout encode each string with UTF-8, with

    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

    Make sure then that *all* strings that you print
    are Unicode strings.
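
The effect of the wrapper can be checked on an in-memory byte stream. In this sketch `io.BytesIO` is only a stand-in for a redirected sys.stdout, which is an assumption made for illustration:

```python
import codecs
import io

raw = io.BytesIO()  # stand-in for a byte-oriented, redirected sys.stdout
writer = codecs.getwriter("utf-8")(raw)  # same wrapping as in the line above

writer.write(u"\xe4\n")  # Unicode goes in ...
encoded = raw.getvalue()  # ... and UTF-8 bytes come out
```

On Python 2, writing a byte string that contains non-ASCII bytes to such a writer triggers an implicit ASCII decode and fails with UnicodeDecodeError, which is why only Unicode strings should be passed to it.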

    HTH,
    Martin
     
    Martin v. Löwis, Apr 9, 2007
    #4
  5. Martin v. Löwis wrote:
    >> Thanks! That's a nice little stumbling block for a newbie like me ;) Is
    >> there a way to make utf-8 the default encoding for every string, so that
    >> I do not have to encode each string explicitly?

    >
    > You can make sys.stdout encode each string with UTF-8, with
    >
    > sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
    >
    > Make sure then that *all* strings that you print
    > are Unicode strings.


    BTW, any reason why an EncodedFile can't act like a Unicode writer/reader object
    if one of its encodings is explicitly set to None?

    IMO the docs don't make it clear that getwriter() is the correct API to use
    here. I've wanted to write "sys.stdout = codecs.EncodedFile(sys.stdout,
    'utf-8')" more than once.

    Georg
     
    Georg Brandl, Apr 9, 2007
    #5
  6. > BTW, any reason why an EncodedFile can't act like a Unicode
    > writer/reader object
    > if one of its encodings is explicitly set to None?


    AFAIU, that's not the intention of EncodedFile: instead, it is
    meant to do recoding, i.e. converting bytes in one encoding to bytes
    in another. I find it a pretty useless API, and would rather see it
    go away than be enhanced.
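
To illustrate what recoding means here, the sketch below wraps an in-memory file with codecs.EncodedFile. The variable names are hypothetical, and the choice of Latin-1 as the file encoding is only an assumption for the example:

```python
import codecs
import io

latin1_file = io.BytesIO()  # hypothetical file that must hold Latin-1 bytes

# The caller works with UTF-8 bytes; the wrapped file stores Latin-1 bytes.
recoder = codecs.EncodedFile(latin1_file,
                             data_encoding="utf-8",
                             file_encoding="latin-1")

recoder.write(u"\xe4".encode("utf-8"))  # bytes in, not a Unicode string
stored = latin1_file.getvalue()         # the same character, recoded
```

Both sides of EncodedFile deal in bytes, which is the difference from the getwriter() wrapper: that one accepts Unicode strings and produces bytes.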

    Regards,
    Martin
     
    Martin v. Löwis, Apr 9, 2007
    #6