Unicode blues in Python3

Discussion in 'Python' started by nn, Mar 23, 2010.

  1. nn

    nn Guest

    I know that unicode is the way to go in Python 3.1, but it is getting
    in my way right now in my Unix scripts. How do I write a chr(253) to a
    file?

    #nntst2.py
    import sys,codecs
    mychar=chr(253)
    print(sys.stdout.encoding)
    print(mychar)

    > ./nntst2.py

    ISO8859-1
    ý

    > ./nntst2.py >nnout2

    Traceback (most recent call last):
    File "./nntst2.py", line 5, in <module>
    print(mychar)
    UnicodeEncodeError: 'ascii' codec can't encode character '\xfd' in
    position 0: ordinal not in range(128)

    > cat nnout2

    ascii

    ...Oh great!

    OK, let's try this:
    #nntst3.py
    import sys,codecs
    mychar=chr(253)
    print(sys.stdout.encoding)
    print(mychar.encode('latin1'))

    > ./nntst3.py

    ISO8859-1
    b'\xfd'

    > ./nntst3.py >nnout3


    > cat nnout3

    ascii
    b'\xfd'

    ...Eh... not what I want really.

    #nntst4.py
    import sys,codecs
    mychar=chr(253)
    print(sys.stdout.encoding)
    sys.stdout=codecs.getwriter("latin1")(sys.stdout)
    print(mychar)

    > ./nntst4.py

    ISO8859-1
    Traceback (most recent call last):
    File "./nntst4.py", line 6, in <module>
    print(mychar)
    File "Python-3.1.2/Lib/codecs.py", line 356, in write
    self.stream.write(data)
    TypeError: must be str, not bytes

    ...OK, this is not working either.

    Is there any way to write a value 253 to standard output?
    nn, Mar 23, 2010
    #1
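Note on nntst4.py: the traceback suggests that the writer returned by codecs.getwriter("latin1") encodes to bytes and then hands those bytes to sys.stdout, which is a text stream and accepts only str. A possible alternative, sketched below, is to wrap the binary layer in a text wrapper with an explicit encoding. The helper name latin1_text_writer is made up for this illustration, and an in-memory io.BytesIO stands in for sys.stdout.buffer so the result can be inspected.

```python
import io


def latin1_text_writer(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as Latin-1.

    Hypothetical helper for illustration; not part of the original scripts.
    """
    # newline="\n" disables newline translation, so "\n" stays a single byte.
    return io.TextIOWrapper(binary_stream, encoding="latin-1", newline="\n")


# Demonstration against an in-memory stream standing in for sys.stdout.buffer.
raw = io.BytesIO()
out = latin1_text_writer(raw)
print(chr(253), file=out)  # encoded as Latin-1 regardless of the locale
out.flush()
captured = raw.getvalue()  # the byte 0xFD followed by a newline
```

In a real script the equivalent would presumably be sys.stdout = latin1_text_writer(sys.stdout.buffer), after which print(mychar) should no longer depend on whether output goes to a terminal or a file.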

  2. On Tuesday 23 March 2010 10:33:33 nn wrote:
    > I know that unicode is the way to go in Python 3.1, but it is getting
    > in my way right now in my Unix scripts. How do I write a chr(253) to a
    > file?
    >
    > #nntst2.py
    > import sys,codecs
    > mychar=chr(253)
    > print(sys.stdout.encoding)
    > print(mychar)


    The following code works for me:

    $ cat nnout5.py
    #!/usr/bin/python3.1

    import sys
    mychar = chr(253)
    sys.stdout.write(mychar)
    $ echo $(cat nnout)
    ý

    Can I ask why you're using print() in the first place, rather than writing
    directly to a file? Python 3.x, AFAIK, distinguishes between text and binary
    files and will let you specify the encoding you want for strings you write.

    Hope that helps,
    Rami

    Rami Chowdhury, Mar 23, 2010
    #2

  3. nn

    nn Guest

    Rami Chowdhury wrote:
    > The following code works for me:
    >
    > $ cat nnout5.py
    > #!/usr/bin/python3.1
    >
    > import sys
    > mychar = chr(253)
    > sys.stdout.write(mychar)
    > $ echo $(cat nnout)
    > ý
    #nntst5.py
    import sys
    mychar=chr(253)
    sys.stdout.write(mychar)

    > ./nntst5.py >nnout5

    Traceback (most recent call last):
    File "./nntst5.py", line 4, in <module>
    sys.stdout.write(mychar)
    UnicodeEncodeError: 'ascii' codec can't encode character '\xfd' in
    position 0: ordinal not in range(128)

    This is equivalent to using print().

    I use print so I can do tests and debug runs to the screen or pipe it
    to some other tool and then configure the production bash script to
    write the final output to a file of my choosing.
    nn, Mar 23, 2010
    #3
  4. Gary Herron

    Gary Herron Guest

    nn wrote:
    > I know that unicode is the way to go in Python 3.1, but it is getting
    > in my way right now in my Unix scripts. How do I write a chr(253) to a
    > file?
    >


    Python 3 makes a distinction between bytes and string (i.e., unicode)
    types, and you are still thinking in the Python 2 mode, which does *NOT*
    make such a distinction. What you appear to want is to write a
    particular byte to a file -- so use the bytes type and a file opened in
    binary mode:

    >>> b=bytes([253])
    >>> f = open("abc", 'wb')
    >>> f.write(b)

    1
    >>> f.close()


    On Unix (at least), the "od" program can verify that the contents are correct:
    > od abc -d

    0000000 253
    0000001


    Hope that helps.

    Gary Herron



    Gary Herron, Mar 23, 2010
    #4
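The binary-mode approach above can also be checked from Python itself instead of od. The following is a minimal sketch; the helper names write_byte_values and read_byte_values are invented here, and a temporary directory is used so that no file is left behind.

```python
import os
import tempfile


def write_byte_values(path, values):
    """Write a sequence of integers (0-255) to path as raw bytes."""
    with open(path, "wb") as f:  # "wb": binary mode, no encoding involved
        return f.write(bytes(values))


def read_byte_values(path):
    """Read the file back and return its bytes as a list of integers."""
    with open(path, "rb") as f:
        return list(f.read())


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "abc")
    written = write_byte_values(path, [253])  # number of bytes written
    values = read_byte_values(path)           # integers read back from disk
```

Because the file is opened in binary mode, the value 253 goes to disk as one byte and no codec is consulted at any point.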
  5. nn

    nn Guest

    Gary Herron wrote:
    > Python 3 makes a distinction between bytes and string (i.e., unicode)
    > types, and you are still thinking in the Python 2 mode, which does *NOT*
    > make such a distinction. What you appear to want is to write a
    > particular byte to a file -- so use the bytes type and a file opened in
    > binary mode:
    >
    > >>> b=bytes([253])
    > >>> f = open("abc", 'wb')
    > >>> f.write(b)

    > 1
    > >>> f.close()

    Actually what I want is to write a particular byte to standard output,
    and I want this to work regardless of where that output gets sent to.
    I am aware that I could do
    open('nnout','w',encoding='latin1').write(mychar) but I am porting a
    python2 program and don't want to rewrite everything that uses that
    script.
    nn, Mar 23, 2010
    #5
  6. nn, 23.03.2010 19:46:
    > Actually what I want is to write a particular byte to standard output,
    > and I want this to work regardless of where that output gets sent to.
    > I am aware that I could do
    > open('nnout','w',encoding='latin1').write(mychar) but I am porting a
    > python2 program and don't want to rewrite everything that uses that
    > script.


    Are you writing text or binary data to stdout?

    Stefan
    Stefan Behnel, Mar 23, 2010
    #6
  7. nn

    nn Guest

    Stefan Behnel wrote:
    > nn, 23.03.2010 19:46:
    > > Actually what I want is to write a particular byte to standard output,
    > > and I want this to work regardless of where that output gets sent to.
    > > I am aware that I could do
    > > open('nnout','w',encoding='latin1').write(mychar) but I am porting a
    > > python2 program and don't want to rewrite everything that uses that
    > > script.

    >
    > Are you writing text or binary data to stdout?
    >
    > Stefan


    latin1 charset text.
    nn, Mar 23, 2010
    #7
  8. nn wrote:
    >
    > Stefan Behnel wrote:
    >> nn, 23.03.2010 19:46:
    >>> Actually what I want is to write a particular byte to standard output,
    >>> and I want this to work regardless of where that output gets sent to.
    >>> I am aware that I could do
    >>> open('nnout','w',encoding='latin1').write(mychar) but I am porting a
    >>> python2 program and don't want to rewrite everything that uses that
    >>> script.

    >> Are you writing text or binary data to stdout?
    >>
    >> Stefan

    >
    > latin1 charset text.


    Are you sure about that? If you carefully reconsider, could you come to
    the conclusion that you are not writing text at all, but binary data?

    If it really were text that you are writing, why would you need
    U+00FD (LATIN SMALL LETTER Y WITH ACUTE)? To my knowledge, that
    character is rarely used in practice. That you are trying to
    write it strongly suggests that what you are writing is not
    actually text.

    Also, your formulation suggests the same:

    "Is there any way to write a value 253 to standard output?"

    If you were really writing text, you'd ask


    "Is there any way to write 'ý' to standard output?"

    Regards,
    Martin
    Martin v. Loewis, Mar 23, 2010
    #8
  9. On Tue, 23 Mar 2010 11:46:33 -0700, nn wrote:

    > Actually what I want is to write a particular byte to standard output,
    > and I want this to work regardless of where that output gets sent to.


    What do you mean "work"?

    Do you mean "display a particular glyph" or something else?

    In bash:

    $ echo -e "\0101" # octal 101 = decimal 65
    A
    $ echo -e "\0375" # decimal 253
    �

    but if I change the terminal encoding, I get this:

    $ echo -e "\0375"
    ý

    Or this:

    $ echo -e "\0375"
    ²

    depending on which encoding I use.

    I think your question is malformed. You need to work out what behaviour
    you actually want, before you can ask for help on how to get it.



    --
    Steven
    Steven D'Aprano, Mar 24, 2010
    #9
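The terminal experiment above has a direct Python counterpart: the same byte decodes to different characters depending on the codec. The post does not say which terminal encodings were used, so the three codecs below (Latin-1, CP437, UTF-8) are assumptions chosen because they reproduce the three glyphs shown.

```python
raw = bytes([0o375])  # octal 375 == decimal 253 == hex FD

# One byte, three interpretations. The codec choices are assumptions.
as_latin1 = raw.decode("latin-1")                # LATIN SMALL LETTER Y WITH ACUTE
as_cp437 = raw.decode("cp437")                   # SUPERSCRIPT TWO
as_utf8 = raw.decode("utf-8", errors="replace")  # invalid in UTF-8: U+FFFD
```

The byte itself never changes; only the mapping from byte to glyph does, which is why "writing 253" and "writing a particular character" are separate questions.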
  10. nn

    nn Guest

    Martin v. Loewis wrote:
    > nn wrote:
    > >
    > > Stefan Behnel wrote:
    > >> nn, 23.03.2010 19:46:
    > >>> Actually what I want is to write a particular byte to standard output,
    > >>> and I want this to work regardless of where that output gets sent to.
    > >>> I am aware that I could do
    > >>> open('nnout','w',encoding='latin1').write(mychar) but I am porting a
    > >>> python2 program and don't want to rewrite everything that uses that
    > >>> script.
    > >> Are you writing text or binary data to stdout?
    > >>
    > >> Stefan

    > >
    > > latin1 charset text.

    >
    > Are you sure about that? If you carefully reconsider, could you come to
    > the conclusion that you are not writing text at all, but binary data?
    >
    > If it really were text that you are writing, why would you need
    > U+00FD (LATIN SMALL LETTER Y WITH ACUTE)? To my knowledge, that
    > character is rarely used in practice. That you are trying to
    > write it strongly suggests that what you are writing is not
    > actually text.
    >
    > Also, your formulation suggests the same:
    >
    > "Is there any way to write a value 253 to standard output?"
    >
    > If you were really writing text, you'd ask
    >
    >
    > "Is there any way to write 'ý' to standard output?"
    >
    > Regards,
    > Martin


    To be more informative: I am writing both text and binary data
    together. That is, I am embedding text from another source into a stream
    that uses non-ASCII characters as "control" characters. In Python 2 I
    was processing it mostly as text containing a few "funny" characters.
    nn, Mar 24, 2010
    #10
  11. nn

    nn Guest

    Steven D'Aprano wrote:
    > On Tue, 23 Mar 2010 11:46:33 -0700, nn wrote:
    >
    > > Actually what I want is to write a particular byte to standard output,
    > > and I want this to work regardless of where that output gets sent to.

    >
    > What do you mean "work"?
    >
    > Do you mean "display a particular glyph" or something else?
    >
    > In bash:
    >
    > $ echo -e "\0101" # octal 101 = decimal 65
    > A
    > $ echo -e "\0375" # decimal 253
    > �
    >
    > but if I change the terminal encoding, I get this:
    >
    > $ echo -e "\0375"
    > ý
    >
    > Or this:
    >
    > $ echo -e "\0375"
    > ²
    >
    > depending on which encoding I use.
    >
    > I think your question is malformed. You need to work out what behaviour
    > you actually want, before you can ask for help on how to get it.
    >
    >
    >
    > --
    > Steven


    Yes, sorry, it is a bit ambiguous. I don't really care what the glyph is;
    the program reading my output reads 8-bit values, expects the binary
    value 0xFD as a control character, and lets everything else through as is.
    nn, Mar 24, 2010
    #11
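For a stream like the one described above, one way to keep text and control bytes apart is to encode the text explicitly and do all concatenation on bytes. This is only a sketch: the thread does not give the actual layout of the stream, so placing the control byte before and after the text is a made-up framing, and the function name frame is hypothetical.

```python
CONTROL = bytes([0xFD])  # the control byte named in the thread


def frame(text):
    """Return text encoded as Latin-1 with a control byte on each side.

    The framing layout is invented for illustration only.
    """
    # Encode once, explicitly; from here on everything is bytes.
    return CONTROL + text.encode("latin-1") + CONTROL


payload = frame("caf\u00e9")  # "café": the é becomes the single byte 0xE9
```

Encoding with Latin-1 maps each code point below 256 to the byte of the same value, which is probably why it felt convenient for this kind of data. A character outside that range would raise UnicodeEncodeError instead of silently corrupting the stream.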
  12. On Tue, 23 Mar 2010 10:33:33 -0700, nn wrote:

    > I know that unicode is the way to go in Python 3.1, but it is getting in
    > my way right now in my Unix scripts. How do I write a chr(253) to a
    > file?
    >
    > #nntst2.py
    > import sys,codecs
    > mychar=chr(253)
    > print(sys.stdout.encoding)
    > print(mychar)


    print() writes to the text (unicode) layer of sys.stdout.
    If you want to access the binary (bytes) layer, you must use
    sys.stdout.buffer. So:

    sys.stdout.buffer.write(chr(253).encode('latin1'))

    or:

    sys.stdout.buffer.write(bytes([253]))

    See http://docs.python.org/py3k/library/io.html#io.TextIOBase.buffer
    Antoine Pitrou, Mar 24, 2010
    #12
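One practical detail when combining print() with sys.stdout.buffer: the text layer keeps its own buffer, so it seems safest to flush it before writing raw bytes, or the output may appear out of order. A minimal sketch follows; write_raw is a hypothetical helper, and an in-memory ASCII text stream stands in for a redirected sys.stdout.

```python
import io


def write_raw(text_stream, data):
    """Write raw bytes through the binary layer beneath a text stream."""
    text_stream.flush()             # push pending text down first
    text_stream.buffer.write(data)  # bypass the text encoding entirely
    text_stream.buffer.flush()


# Stand-in for a redirected stdout whose text encoding is ASCII.
sink = io.BytesIO()
fake_stdout = io.TextIOWrapper(sink, encoding="ascii", newline="\n")

print("header", file=fake_stdout)
write_raw(fake_stdout, bytes([253]))  # would fail if sent through print()
print("footer", file=fake_stdout)
fake_stdout.flush()
result = sink.getvalue()
```

With the real stream the call would be write_raw(sys.stdout, bytes([253])), and the byte should reach the output unchanged whatever sys.stdout.encoding reports.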
  13. Steven D'Aprano wrote:
    > I think your question is malformed. You need to work out what behaviour
    > you actually want, before you can ask for help on how to get it.


    It may or may not be malformed, but I understand the question. So let
    me translate for you: how can he write arbitrary bytes (0x00 through
    0xff) to stdout without having them mangled by encodings? It's a very
    simple question, really. Looks like Antoine Pitrou has answered this
    question quite nicely as well.
    Michael Torrie, Mar 24, 2010
    #13
  14. nn

    nn Guest

    Antoine Pitrou wrote:
    > On Tue, 23 Mar 2010 10:33:33 -0700, nn wrote:
    >
    > > I know that unicode is the way to go in Python 3.1, but it is getting in
    > > my way right now in my Unix scripts. How do I write a chr(253) to a
    > > file?
    > >
    > > #nntst2.py
    > > import sys,codecs
    > > mychar=chr(253)
    > > print(sys.stdout.encoding)
    > > print(mychar)

    >
    > print() writes to the text (unicode) layer of sys.stdout.
    > If you want to access the binary (bytes) layer, you must use
    > sys.stdout.buffer. So:
    >
    > sys.stdout.buffer.write(chr(253).encode('latin1'))
    >
    > or:
    >
    > sys.stdout.buffer.write(bytes([253]))
    >
    > See http://docs.python.org/py3k/library/io.html#io.TextIOBase.buffer


    Just what I needed! Now I have full control of the output.

    Thanks Antoine. The new io stack is still a bit of a mystery to me.

    Thanks everybody else, and sorry for confusing the issue. Latin1 just
    happens to be very convenient to manipulate bytes and is what I
    thought of initially to handle my mix of textual and non-textual data.
    nn, Mar 24, 2010
    #14
  15. John Nagle

    John Nagle Guest

    nn wrote:

    > To be more informative I am both writing text and binary data
    > together. That is I am embedding text from another source into stream
    > that uses non-ascii characters as "control" characters. In Python2 I
    > was processing it mostly as text containing a few "funny" characters.


    OK. Then you need to be writing arrays of bytes, not strings.
    Encoding is your problem. This has nothing to do with Unicode.

    John Nagle
    John Nagle, Mar 24, 2010
    #15