Using Unicode scripts

Discussion in 'Python' started by yzzzzz, Jul 18, 2003.

  1. yzzzzz

    yzzzzz Guest

    Hi,

    I am writing my python programs using a Unicode text editor. The files are
    encoded in UTF-8. Python's default encoding seems to be Latin 1 (ISO-8859-1)
    or maybe Windows-1252 (CP1252) which aren't compatible with UTF-8.

    For example, if I type print "é", it prints Ã©. If I use a Unicode string,
    a=u"é", and encode it in UTF-8, I get 4 Latin 1 characters,
    which makes sense if the interpreter thinks I typed in u"Ã©".

    How can I solve this problem?

    Thank you

    PS. I have no problem using Unicode strings in Python, I know how to
    manipulate and convert them, I'm just looking for how to specify the default
    encoding for the scripts I write.
    yzzzzz, Jul 18, 2003
    #1
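The symptom in the post above can be reproduced with a minimal sketch. It is written for a current Python 3 interpreter (an assumption; the thread concerns Python 2.2/2.3), and the variable names are illustrative only. It decodes the UTF-8 bytes of "é" as Latin 1, the way an interpreter unaware of the file's encoding would, and then re-encodes the result.

```python
# Bytes a UTF-8 editor writes to disk for the single character U+00E9 (e acute).
source_bytes = "\u00e9".encode("utf-8")

# The same bytes read as Latin 1: each byte becomes its own character.
misread = source_bytes.decode("latin-1")

# Encoding the misread text as UTF-8 again produces four bytes instead of two.
double_encoded = misread.encode("utf-8")

print(source_bytes)          # the two raw UTF-8 bytes
print(ascii(misread))        # two Latin 1 characters, shown as escapes
print(len(double_encoded))   # number of bytes after the second encoding
```

The two misread characters are U+00C3 and U+00A9, which is why one accented letter turns into two, and then into four bytes when encoded a second time.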

  2. "yzzzzz" writes:

    > Hi,
    >
    > I am writing my python programs using a Unicode text editor. The files are
    > encoded in UTF-8. Python's default encoding seems to be Latin 1 (ISO-8859-1)
    > or maybe Windows-1252 (CP1252) which aren't compatible with UTF-8.
    >
    > For example, if I type print "é", it prints Ã©. If I use a Unicode string,
    > a=u"é", and encode it in UTF-8, I get 4 Latin 1 characters,
    > which makes sense if the interpreter thinks I typed in u"Ã©".
    >
    > How can I solve this problem?
    >
    > Thank you
    >
    > PS. I have no problem using Unicode strings in Python, I know how to
    > manipulate and convert them, I'm just looking for how to specify the default
    > encoding for the scripts I write.


    Use Python 2.3, and read PEP 263.

    Thomas
    Thomas Heller, Jul 18, 2003
    #2
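As a rough illustration of what PEP 263 provides, the sketch below compiles the same one-line program from two byte sequences, one saved as Latin 1 and one as UTF-8, each carrying a matching encoding declaration on its first line. It assumes a current Python 3 interpreter rather than 2.3, and the helper name run is hypothetical.

```python
# The same program as two editors would save it. Only the declared
# encoding and the bytes used for the accented letter differ.
latin1_source = b"# -*- coding: latin-1 -*-\ntext = u'\xe9'\n"
utf8_source = b"# -*- coding: utf-8 -*-\ntext = u'\xc3\xa9'\n"


def run(source):
    """Compile and execute source bytes, returning the value bound to 'text'."""
    namespace = {}
    # compile() decodes a bytes source using the declaration on line 1 or 2.
    exec(compile(source, "<example>", "exec"), namespace)
    return namespace["text"]


# Both files yield the same one-character Unicode string.
print(ascii(run(latin1_source)))
print(ascii(run(utf8_source)))
print(run(latin1_source) == run(utf8_source))
```

In a real script the declaration is just the first (or second) comment line of the file, and it has to match the encoding the editor actually uses when saving.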

  3. yzzzzz wrote:
    > Hi,


    Hi "yzzzzz",

    > I am writing my python programs using a Unicode text editor. The files are
    > encoded in UTF-8. Python's default encoding seems to be Latin 1 (ISO-8859-1)
    > or maybe Windows-1252 (CP1252) which aren't compatible with UTF-8.
    >
    > For example, if I type print "é", it prints Ã©. If I use a Unicode string,
    > a=u"é", and encode it in UTF-8, I get 4 Latin 1 characters,
    > which makes sense if the interpreter thinks I typed in u"Ã©".
    >
    > How can I solve this problem?


    You might want to read the thread I started yesterday on this
    list/newsgroup, called "Unicode problem".

    Is it feasible for you to upgrade to Python 2.3? If so, I'd recommend
    doing it now. 2.3 is pretty close to release and it supports source
    files in Unicode format. If your Unicode editor saves the text file
    with a BOM (it should), then under Python 2.3 your scripts will work
    as expected.

    > Thank you
    >
    > PS. I have no problem using Unicode strings in Python, I know how to
    > manipulate and convert them, I'm just looking for how to specify the default
    > encoding for the scripts I write.


    See http://www.python.org/peps/pep-0263.html for how this is
    implemented in Python 2.3.

    -- Gerhard
    Gerhard Häring, Jul 18, 2003
    #3
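The BOM route mentioned above can be checked with a small sketch. It uses tokenize.detect_encoding from the Python 3 standard library (an assumption about the interpreter; this function does not exist in Python 2.3), which applies the PEP 263 rules to the first lines of a source file. The helper name source_encoding is hypothetical.

```python
import codecs
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding Python would use to read the given source bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# The UTF-8 BOM is U+FEFF (ZERO WIDTH NO-BREAK SPACE) encoded as UTF-8.
bom = "\ufeff".encode("utf-8")
print(bom == codecs.BOM_UTF8)

# A file that starts with the BOM needs no declaration comment.
with_bom = bom + b"text = u'\xc3\xa9'\n"
print(source_encoding(with_bom))

# A file without a BOM can declare its encoding in a comment instead.
with_cookie = b"# -*- coding: latin-1 -*-\ntext = u'\xe9'\n"
print(source_encoding(with_cookie))
```

Either marker is enough on its own; the BOM only works for UTF-8, while the comment form can name any encoding the interpreter knows.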
  4. yzzzzz

    yzzzzz Guest

    OK, problem solved!
    I got the new Python, it all works. I just had to add the UTF-8 BOM myself
    (UltraEdit doesn't do it by default) but that wasn't too difficult to do
    (copy and paste a ZWNBSP).

    One last question: I'm using Windows, so the console's encoding is CP437. If
    I try to print a Unicode string, the string is converted to CP437 and
    printed, and that works fine. However, if I try to print a normal
    (non-Unicode) string from a UTF-8 encoded file with BOM, for example print
    "é", it sends out the two UTF-8 bytes Ã©, which appear as lines in the CP437
    charset. But if I print the exact same character in a Latin 1 encoded file,
    it comes out as the Latin 1 byte for "é", which shows up as a theta in CP437.
    This means that Python doesn't take into account the specified encoding
    (Latin 1 or UTF-8) and prints out the raw bytes as they appear in the source
    file, regardless of the encoding used. Is this normal? (This isn't really a
    problem for me, as I am only going to use Unicode strings now.)
    yzzzzz, Jul 18, 2003
    #4
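The console behaviour described in the last post can be modelled by decoding the raw bytes with the cp437 codec. This is only a sketch under the assumption that the console maps each byte straight to its CP437 glyph; it runs on Python 3, and as_cp437_console is a hypothetical helper name.

```python
import unicodedata


def as_cp437_console(raw):
    """Name the glyphs a CP437 console would display for the given bytes."""
    return [unicodedata.name(ch) for ch in raw.decode("cp437")]


e_acute = "\u00e9"

# Byte string taken verbatim from a UTF-8 source file: two bytes.
utf8_bytes = e_acute.encode("utf-8")
# Byte string taken verbatim from a Latin 1 source file: one byte.
latin1_bytes = e_acute.encode("latin-1")
# What printing a Unicode string sends once it is converted for the console.
cp437_bytes = e_acute.encode("cp437")

print(as_cp437_console(utf8_bytes))    # a box-drawing character and a reversed not sign
print(as_cp437_console(latin1_bytes))  # Greek capital theta
print(as_cp437_console(cp437_bytes))   # the intended accented letter
```

Only the third case goes through a conversion to the console's encoding, which matches the observation that Unicode strings print correctly while plain byte strings are passed through as they appear in the file.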