How to get Python to default to UTF8

Discussion in 'Python' started by weheh, Dec 22, 2007.

  1. weheh

    weheh Guest

    I'm developing a cgi-bin application that must be Unicode-aware, and I'm
    aiming for a UTF-8 implementation. I'm running Python 2.3 on a development
    machine (Windows XP) and a server (Windows XP server). Both environments are
    running Apache 2.2 with the same configuration file.

    The problem is this. On my development machine I get the following unicode
    error:

    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-6: invalid
    data
    args = ('utf8', 'adem\xe3\xa1s', 4, 7, 'invalid data')
    encoding = 'utf8'
    end = 7
    object = 'adem\xe3\xa1s'
    reason = 'invalid data'
    start = 4


    On my server, running exactly the same python code, I see the following
    unicode error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 4:
    ordinal not in range(128)
    args = ('ascii', 'adem\xe3\xa1s', 4, 5, 'ordinal not in range(128)')
    encoding = 'ascii'
    end = 5
    object = 'adem\xe3\xa1s'
    reason = 'ordinal not in range(128)'
    start = 4

    Note the difference in the encoding: on the development machine it's utf8,
    but on the server it's ascii.

    I was under the impression that Python assumes ASCII encoding by default.
    How did my development machine end up using utf8? And since my
    Python code is the same on both machines, what in my configuration
    could be causing a difference in default encoding? I checked site.py on
    both machines and both files default to ASCII, so I assume it's something
    else.
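
    The two tracebacks above are consistent with the same byte string being
    decoded by two different codecs. A minimal sketch in Python 3 syntax (an
    assumption; the thread concerns Python 2.3, where the value would be a
    plain str and the utf8 codec words its error differently) reproduces both
    failures. The helper name try_decode is hypothetical:

```python
# The byte string from the tracebacks, written as a Python 3 bytes literal.
RAW = b'adem\xe3\xa1s'


def try_decode(raw, encoding):
    """Return (True, text) on success or (False, UnicodeDecodeError) on failure."""
    try:
        return True, raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return False, exc


for codec in ('utf-8', 'ascii'):
    ok, detail = try_decode(RAW, codec)
    if ok:
        print(codec, 'decoded to', repr(detail))
    else:
        # detail.start is the index of the first byte the codec rejected.
        print(codec, 'failed at byte', detail.start, '-', detail.reason)
```

    Both codecs reject the data starting at index 4, so the bytes themselves
    are the same on both machines; only the codec that was applied differs.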

    Thanks in advance.
    weheh, Dec 22, 2007
    #1

  2. weheh wrote:

    > I'm developing a cgi-bin application that must be Unicode-aware, and I'm
    > aiming for a UTF-8 implementation. I'm running Python 2.3 on a development
    > machine (Windows XP) and a server (Windows XP server). Both environments are
    > running Apache 2.2 with the same configuration file.
    >
    > The problem is this. On my development machine I get the following unicode
    > error:
    >
    > UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-6: invalid
    > data
    > args = ('utf8', 'adem\xe3\xa1s', 4, 7, 'invalid data')
    > encoding = 'utf8'
    > end = 7
    > object = 'adem\xe3\xa1s'
    > reason = 'invalid data'
    > start = 4


    Could be that sys.stdin.encoding differs between the setups.

    *Where* do you get this exception? In the database layer? When the
    script is trying to read things from a file? When it's trying to output
    things? Somewhere else?

    </F>
    Fredrik Lundh, Dec 22, 2007
    #2

  3. weheh

    weheh Guest

    Hi Fredrik, thanks for responding. After reading up some more on this, I
    think my title should be changed to "How to get Python to default to ASCII".
    In point of fact, I want my two environments to agree so that I can debug
    things more easily. Right now it's a nightmare.

    As to your questions, in this example, I believe the exception was caused by
    trying to count the number of times a string appears in an array.
    One of the strings was a unicode object and the other was an 8-bit string
    that Python tried to convert using its default encoding.
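
    The situation described above, counting occurrences in a sequence that
    mixes decoded and still-encoded strings, can be sidestepped by normalizing
    every item to Unicode before comparing. A sketch in Python 3 syntax, where
    to_text and count_text are hypothetical helper names and UTF-8 is an
    assumed input encoding:

```python
def to_text(value, encoding='utf-8'):
    """Return value as a Unicode string, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def count_text(needle, haystack, encoding='utf-8'):
    """Count occurrences of needle in haystack after normalizing both to text."""
    needle = to_text(needle, encoding)
    return sum(1 for item in haystack if to_text(item, encoding) == needle)


# One item is already text, one is the same word as UTF-8 bytes.
words = ['adem\xe1s', b'adem\xc3\xa1s', 'otro']
print(count_text('adem\xe1s', words))
```

    In Python 2 a mixed comparison goes through the default encoding, which is
    where such a traceback comes from. In Python 3, bytes and str never compare
    equal, so a plain words.count(...) would silently undercount instead.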

    weheh, Dec 22, 2007
    #3
  4. weheh wrote:

    > Hi Fredrik, thanks for responding. After reading up some more on this, I
    > think my title should be changed to "How to get Python to default to ASCII".
    > In point of fact, I want my two environments to agree so that I can debug
    > things more easily. Right now it's a nightmare.
    >
    > As to your questions, in this example, I believe the exception was caused by
    > trying to do a count of the number of times a string appears in an array.
    > One of the strings was unicode and the other was encoded by Python by
    > default.


    To fix this, figure out where the encoded (8-bit) string came from,
    and make sure you decode it properly on the way in. Only use Unicode
    strings on the "inside".

    (Python does have two encoding defaults: there's a default encoding that
    *shouldn't* ever be changed from the "ascii" default, and there's also a
    stdin/stdout encoding that's correctly set if you run the code in an
    ordinary terminal window. If you get your data from anywhere else, you
    cannot trust either of these, so you should do your own decoding on the way
    in and your own encoding on the way out.)
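
    The "decode on the way in, encode on the way out" rule can be sketched as
    a pair of boundary helpers. This is an illustration in Python 3 syntax, not
    code from the thread; read_text, write_text, and the Latin-1 form field are
    assumptions made for the example:

```python
def read_text(raw, encoding):
    """Decode incoming bytes; the caller has to know or declare the encoding."""
    return raw.decode(encoding)


def write_text(text, encoding='utf-8'):
    """Encode outgoing text explicitly instead of relying on any default."""
    return text.encode(encoding)


# Hypothetical form field that arrived as Latin-1 bytes.
incoming = b'adem\xe1s'
word = read_text(incoming, 'latin-1')   # Unicode on the "inside"
shouted = word.upper()                  # operate on text, never on raw bytes
outgoing = write_text(shouted)          # UTF-8 bytes on the way out
print(repr(outgoing))
```

    Keeping the two conversions in named functions makes it easier to audit
    that every input and output path passes through exactly one of them.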

    </F>
    Fredrik Lundh, Dec 22, 2007
    #4
  5. weheh

    weheh Guest

    Hi Fredrik,

    Thanks again for your feedback. I am much obliged.

    Indeed, I am forced to be extremely rigorous about decoding on the way in
    and encoding on the way out everywhere in my program, just as you say. Your
    advice is excellent and concurs with other sources of unicode expertise.
    Following this approach is the only thing that has made it possible for me
    to get my program to work.

    However, the situation is still unacceptable to me because I often make
    mistakes and it is easy for me to miss places where encoding is necessary. I
    rely on testing to find my faults. In my development environment, I get no
    error message and everything seems to work perfectly. However, once
    ported to the server, I see a crash. But this is too late a stage to catch
    the error since the app is already live.

    I assume that the default encoding that you say shouldn't ever be
    changed is stored in the site.py file. I've checked this file and it's set
    to ascii on both machines (development and server). I haven't touched
    site.py. However, a week or so ago, following the advice of someone I read
    on the web, I did create a file in my cgi-bin directory called something
    like site-config.py, wherein encoding was set to utf8. I ran my program a
    few times, but then, reading elsewhere that the site-config.py approach was
    outmoded, I decided to remove this file. I'm wondering whether it made a
    permanent change somewhere in the bowels of Python while I wasn't looking?

    Can you elaborate on where to look to see what stdin/stdout encodings are
    set to? All inputs are coming at my app either via html forms or input
    files. All output goes either to the browser via html or to an output file.


    weheh, Dec 23, 2007
    #5
  6. weheh wrote:

    > However, the situation is still unacceptable to me because I often make
    > mistakes and it is easy for me to miss places where encoding is necessary. I
    > rely on testing to find my faults. On my development environment, I get no
    > error message and it seems that everything works perfectly. However, once
    > ported to the server, I see a crash. But this is too late a stage to catch
    > the error since the app is already live.


    If you want to check that there is indeed no place where you forgot
    to .encode properly, you can set the default encoding on your
    development machine to "undefined" (see site.py). This will give you an
    exception whenever the default encoding is invoked, even if the conversion
    would have succeeded under the usual default encoding (i.e. "ascii").

    Such a setting should not be applied to a production environment.
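
    The effect of an "undefined" default can be previewed by calling that codec
    directly. The sketch below uses Python 3 syntax (an assumption; in the
    Python 2 setup discussed here the codec would be selected as the default in
    site.py rather than named at each call), and converts_with is a
    hypothetical helper name. It shows that even pure-ASCII data is refused:

```python
def converts_with(codec, data=b'plain ascii'):
    """Return True if data decodes with codec, False if the codec refuses it."""
    try:
        data.decode(codec)
        return True
    except UnicodeError:
        return False


print('ascii    :', converts_with('ascii'))
print('undefined:', converts_with('undefined'))
```

    Because the codec fails unconditionally, any code path that depends on an
    implicit conversion shows up during testing rather than only when
    non-ASCII data arrives.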

    > Can you elaborate on where to look to see what stdin/stdout encodings are
    > set to?


    Just print out sys.stdin.encoding and sys.stdout.encoding. Or were you
    asking for the precise source in the interpreter that sets them?
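
    A small sketch of that check, in Python 3 syntax (an assumption, as the
    thread predates it). The function name stream_encodings is hypothetical;
    getattr is used because a replaced or detached stream may not carry an
    encoding attribute at all:

```python
import sys


def stream_encodings():
    """Collect the encodings Python reports for its default and std streams."""
    return {
        'default': sys.getdefaultencoding(),
        # A stream may report None or lack the attribute, hence getattr.
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }


for name, value in sorted(stream_encodings().items()):
    print(name, '=', value)
```

    Running the same snippet on both machines, under the web server rather
    than in a terminal, shows whether the two environments really agree.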

    > All inputs are coming at my app either via html forms or input
    > files. All output goes either to the browser via html or to an output file.


    Then sys.stdout.encoding will not be set to anything.

    Regards,
    Martin
    Martin v. Löwis, Dec 23, 2007
    #6
  7. John Nagle

    John Nagle Guest

    weheh wrote:
    > Hi Fredrik,
    >
    > Thanks again for your feedback. I am much obliged.
    >


    Bear in mind that in Python, ASCII currently means ASCII, values
    0..127. Type "str" will accept values > 127. However, the default
    conversion from "str" to "unicode" requires true ASCII values, in
    0..127. So if you take in data from some source which might have
    a byte value > 127, the default conversion to Unicode won't work.

    There are conversion functions for specifying the meaning of
    values 128..255 (the input might be in "latin1" encoding, for
    example), for ignoring unexpected characters, or for converting them
    to "?".
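
    Those options correspond to an explicit codec name plus an error handler.
    A sketch in Python 3 syntax (an assumption; in Python 2 the same arguments
    go to unicode() or str.decode), using a made-up Latin-1 input:

```python
raw = b'adem\xe1s'  # hypothetical input; 0xE1 is a-acute in Latin-1

as_latin1 = raw.decode('latin-1')          # give bytes 128..255 a meaning
dropped = raw.decode('ascii', 'ignore')    # silently drop non-ASCII bytes
marked = raw.decode('ascii', 'replace')    # substitute U+FFFD for each one
question = as_latin1.encode('ascii', 'replace')  # "?" appears when encoding

print(repr(as_latin1), repr(dropped), repr(marked), repr(question))
```

    Note that 'replace' yields the U+FFFD replacement character when decoding
    and a literal "?" only when encoding, and that 'ignore' loses data without
    any indication.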

    John Nagle
    John Nagle, Dec 24, 2007
    #7