Is setdefaultencoding bad?

Discussion in 'Python' started by moerchendiser2k3, Feb 23, 2011.

  1. Hi, I embedded Py2.6.1 in my app and I use UTF-8 encoded strings
    everywhere in the interface, so the boundary between my app and
    Python is UTF-8, and I can simply write:

    print u"\uC042"
    print u"\uC042".encode("utf_8")

    and get the corresponding character in the console. But currently
    sys.getdefaultencoding() still returns ascii. Should I change it
    and turn it to utf-8, or is this not recommended somehow? I often read
    it's highly unrecommended, but I can't find an explanation why.

    Thanks for any hints!!
    Bye, moerchendiser2k3
    moerchendiser2k3, Feb 23, 2011

  2. moerchendiser2k3

    Nobody Guest

    You shouldn't use it.

    If your code needs to run on any system other than your own, it can't rely
    upon the default encoding being set to anything in particular. So
    changing the default encoding is an easy way to end up writing code which
    doesn't work on any system except your own.
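    (A sketch of the portable alternative, in Python 3 syntax: decode
    explicitly at the boundary instead of touching the default encoding.
    The byte values below are the UTF-8 encoding of the U+C042 example
    from the original post.)

    ```python
    # Portable alternative to changing the default encoding: decode bytes
    # explicitly at the app/Python boundary, naming the codec every time.
    data = b"\xec\x81\x82"       # UTF-8 bytes for U+C042 (the thread's example)
    text = data.decode("utf-8")  # explicit: independent of the default encoding
    assert text == u"\uc042"
    assert text.encode("utf-8") == data  # round-trips on any system
    ```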

    And you can't change the default encoding after interpreter start-up
    (site.py deletes sys.setdefaultencoding() once it has run), because the
    value has to be constant throughout the lifetime of the process.

    IIRC, if you use a unicode string as a dictionary key, and the key can be
    converted using the default encoding, the hash is calculated on the
    encoded byte string (so that if you have equivalent unicode and byte
    strings, both hash to the same value). If you were to change the default
    encoding after any dictionaries have been created (internally, Python uses
    dictionaries quite extensively), subsequent lookups would use the wrong
    hash values.
    Nobody, Feb 23, 2011

  3. Ok, but handling UTF-8 strings at the interface is still fine,
    even though the defaultencoding is still ascii?
    moerchendiser2k3, Feb 23, 2011
  4. moerchendiser2k3

    Chris Rebert Guest

    Yes, that's fine. UTF-8 is an excellent encoding choice, and
    encoding/decoding should always be done explicitly in Python, so the
    "default encoding" ideally ought to never come into play (and indeed,
    Python 3 does away with bug-prone implicit encoding/decoding entirely
    FWICT). Having ASCII as the "default encoding" ensures that implicit
    encoding/decoding bugs are relatively apparent.
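    (A minimal illustration of that last point, in Python 3 syntax: with
    the strict ascii codec, an accidental conversion of non-ASCII data
    fails loudly at the boundary instead of silently producing mojibake.)

    ```python
    # With ascii as the codec, non-ASCII data raises immediately, making
    # the bug apparent; an explicit utf-8 codec handles it correctly.
    s = u"caf\u00e9"  # 'café'
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        print("caught at the boundary")  # the implicit-conversion bug surfaces here
    assert s.encode("utf-8") == b"caf\xc3\xa9"  # explicit codec: no surprise
    ```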

    Chris Rebert, Feb 23, 2011
  5. moerchendiser2k3

    Nobody Guest

    On Unix, you have to go out of your way to avoid the use of implicit
    encoding/decoding with the "filesystem" encoding. This is because Unix
    extensively uses byte strings with no associated encoding, but Python 3
    tries to use Unicode for everything.

    3.0 was essentially unusable as a Unix scripting language for this reason,
    as argv and environ were converted to Unicode, with no possibility of
    recovering from unconvertible sequences.

    3.1 added the surrogate-escape mechanism which allows recovery of the
    original byte sequences, albeit with some effort (i.e. you had to
    explicitly decode os.environ and sys.argv).

    3.2 adds os.environb (bytes version of os.environ), but it appears that
    sys.argv still has to be encoded manually. It also provides os.fsencode()
    and os.fsdecode() to simplify the conversion.
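    (A sketch of that round trip, assuming a POSIX system where
    os.fsencode()/os.fsdecode() use the surrogateescape error handler:)

    ```python
    import os

    # surrogateescape maps each undecodable byte to a lone surrogate
    # (U+DC80..U+DCFF), so the original bytes can always be recovered.
    raw = b"caf\xe9"                               # Latin-1 bytes, invalid UTF-8
    name = raw.decode("utf-8", "surrogateescape")  # \xe9 -> U+DCE9
    assert name.encode("utf-8", "surrogateescape") == raw

    # os.fsencode()/os.fsdecode() (added in 3.2) wrap the same mechanism
    # using the filesystem encoding:
    assert os.fsencode(os.fsdecode(raw)) == raw
    ```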

    Most functions accept bytes arguments; most either return bytes when
    passed bytes or (if the function takes no arguments) have a bytes
    equivalent. But variables tend to be Unicode strings with no bytes version
    (os.environb is the exception rather than the rule), and some functions
    have no bytes equivalent (e.g. os.ctermid(), os.uname(), os.ttyname();
    fortunately it's rather unlikely that the result from any of these
    functions will contain non-ASCII characters).
    Nobody, Feb 24, 2011
