Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in

Discussion in 'Python' started by John Machin, Jan 30, 2009.

  1. John Machin

    Benjamin Kaplan <benjamin.kaplan <at> case.edu> writes:

    > First of all, you're right that might be confusing. I was thinking of
    > auto-detect as in "check the platform and locale and guess what they usually
    > use". I wasn't thinking of it like the web browsers use it. I think it uses
    > locale.getpreferredencoding().

    You're probably right. I'd forgotten about locale.getpreferredencoding(). I'll
    raise a request on the bug tracker to get some more precise wording in the
    open() docs.
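
    For anyone following along, here is a minimal sketch of the point above,
    assuming Python 3: it prints what locale.getpreferredencoding() reports
    (which, as far as I know, is what text-mode open() falls back to when no
    encoding is given) and then reads a file with an explicit encoding instead.
    The sample text and the temporary file are made up for illustration.

```python
import locale
import os
import tempfile

# The locale-derived encoding that text-mode open() is believed to use
# when the caller does not pass encoding=.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

# Hypothetical sample data: one non-ASCII character, stored as UTF-8.
sample = "caf\u00e9"
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-8"))

# Passing encoding= explicitly takes the platform locale out of the picture.
with open(path, encoding="utf-8") as f:
    text = f.read()
os.remove(path)

print(repr(text))
```

    The output of the first print depends on the machine, which is exactly the
    problem: the same script can behave differently on Windows and Linux.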

    > On my machine, I get sys.getpreferredencoding() == 'utf-8' and
    > locale.getdefaultencoding()== 'cp1252'.

    sys <-> locale ... +1 long-range transposition typo of the year :)

    > If you check my response to Anjanesh's comment, I mentioned that he should
    > either find out which encoding it is in particular or he should open the file in
    > binary mode. I suggested utf-8 and latin1 because those are the most likely
    > candidates for his file since cp1252 was already excluded.

    The OP is on a Windows machine. His file looks like a source code file. He is
    unlikely to be creating latin1 files himself on a Windows box. Under the
    hypothesis that he is accidentally or otherwise reading somebody else's source
    files as data, it could be any encoding. In one package with which I'm familiar,
    the encoding is declared as cp1251 in every .py file; AFAICT the only file with
    non-ASCII characters is an example script containing his wife's name!

    The OP's 0x9d is a defined character in code pages 1250, 1251, 1256, and 1257 --
    admittedly all as implausible as the latin1 control character.
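
    The code-page claim is easy to check. A small sketch, assuming the codec
    tables shipped with CPython: it tries to decode the single byte 0x9d with
    each candidate and reports the resulting code point and Unicode category.
    The helper name try_decode is my own, not anything from the standard library.

```python
import unicodedata

BYTE = b"\x9d"
CANDIDATES = ["cp1250", "cp1251", "cp1252", "cp1256", "cp1257", "latin-1"]

def try_decode(raw, encoding):
    """Return the decoded text, or None if the codec has no mapping."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

results = {enc: try_decode(BYTE, enc) for enc in CANDIDATES}

for enc, ch in results.items():
    if ch is None:
        print(f"{enc:8} undefined")
    else:
        # Category "Cc" marks a control character.
        print(f"{enc:8} U+{ord(ch):04X} category {unicodedata.category(ch)}")
```

    Only cp1252 should come back as undefined; latin-1 accepts the byte but maps
    it to a control character, which is why it is an unlikely fit for text.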

    > Looking at a character map, 0x9d is a control character in latin1, so the page
    > is probably UTF-8 encoded. Thinking about it now, it could also be MacRoman but
    > that isn't as common as UTF-8.

    Late breaking news: I presume you can see two instances of U+00DD (LATIN CAPITAL
    LETTER Y WITH ACUTE) in the OP's report
    "query":"0 1Ȉ \u2021 0\u201a0 \u2021»Ã","

    Well, u'\xdd'.encode('utf8') is '\xc3\x9d' ... the Bayesian score for utf8 just
    went up a notch.

    The preceding character U+00BB (looks like >>) doesn't cause an exception
    because 0xBB unlike 0x9D is defined in cp1252.
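
    To make the byte-level reasoning concrete, a short sketch in Python 3
    syntax: it encodes U+00DD as UTF-8, then checks each resulting byte (and
    0xBB) against cp1252. The helper decodes_as_cp1252 is a made-up name for
    this illustration.

```python
encoded = "\u00dd".encode("utf-8")
print(encoded)

def decodes_as_cp1252(raw):
    """True if every byte in raw has a mapping in cp1252."""
    try:
        raw.decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False

lead_ok = decodes_as_cp1252(encoded[:1])   # 0xC3, the UTF-8 lead byte
trail_ok = decodes_as_cp1252(encoded[1:])  # 0x9D, the byte from the traceback
bb_text = b"\xbb".decode("cp1252")         # 0xBB is defined in cp1252

# errors="replace" shows what a lossy cp1252 read of the pair looks like.
lossy = encoded.decode("cp1252", errors="replace")
print(lead_ok, trail_ok, repr(bb_text), repr(lossy))
```

    The lead byte decodes quietly to U+00C3 and only the trailing 0x9d blows up,
    which would fit a UTF-8 file being read with the cp1252 default.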

    Curiously, looking at the \uxxxx escape sequences:
    \u2021 is "double dagger", \u201a is "single low-9 quotation mark" ... what
    appears to be the value part of an item in a hard-coded dictionary is about as
    comprehensible as the Voynich manuscript.
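
    The character names can be looked up with unicodedata rather than a chart.
    The sketch below also encodes each character to cp1252, purely as an
    observation: both happen to exist in that code page, which might hint that
    some bytes had already been through a cp1252 decode, but that is only a
    guess on my part.

```python
import unicodedata

escapes = ["\u2021", "\u201a"]

# Official Unicode names for the two escaped characters.
names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in escapes}

# The single bytes these characters correspond to in cp1252.
cp1252_bytes = {f"U+{ord(ch):04X}": ch.encode("cp1252") for ch in escapes}

for key in names:
    print(key, names[key], cp1252_bytes[key])
```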

    Trouble with cases like this is as soon as they become interesting, the OP often
    snatches somebody's one-liner that "works" (i.e. doesn't raise an exception),
    makes a quick break for the county line, and they're not seen again :)

    Cheers,
    John
     
    John Machin, Jan 30, 2009
