python encoding bug?

Discussion in 'Python' started by garabik-news-2005-05@kassiopeia.juls.savba.sk, Dec 30, 2005.

  1. Guest

    I was playing with python encodings and noticed this:

    garabik@lancre:~$ python2.4
    Python 2.4 (#2, Dec 3 2004, 17:59:05)
    [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> unicode('\x9d', 'iso8859_1')
    u'\x9d'
    >>>


    U+009D is NOT a valid Unicode character (it is not even a valid
    iso8859_1 character).

    The same happens if I use 'latin-1' instead of 'iso8859_1'.

    This caught me by surprise, since I was doing some heuristic guessing
    of string encodings, and 'iso8859_1' gave no errors even when the input
    encoding was different.
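
    The observation above holds for every possible byte value, not only 0x9D.
    A minimal sketch, assuming Python 3 syntax (the session above is Python
    2.4, where unicode(s, enc) plays the role of bytes.decode); the variable
    names are illustrative only:

```python
# Every byte value 0x00-0xFF, including the range 0x80-0x9F.
all_bytes = bytes(range(256))

# iso8859_1 maps byte n straight to code point n, so this never raises.
decoded = all_bytes.decode("iso8859_1")

# The mapping is one-to-one, so encoding again restores the input.
round_trips = decoded.encode("iso8859_1") == all_bytes
print(len(decoded), round_trips)
```

    Since no input can fail, a successful 'iso8859_1' decode by itself says
    nothing about the real encoding of the data.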

    Is this known behaviour, or have I discovered a terrible unknown bug in
    Python's encoding implementation that should be immediately reported and
    fixed? :)


    happy new year,

    --
    -----------------------------------------------------------
    | Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
    | __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
    -----------------------------------------------------------
    Antivirus alert: file .signature infected by signature virus.
    Hi! I'm a signature virus! Copy me into your signature file to help me spread!
    , Dec 30, 2005
    #1

  2. <> wrote in message
    news:dp4dqd$230e$...
    |
    | I was playing with python encodings and noticed this:
    |
    | garabik@lancre:~$ python2.4
    | Python 2.4 (#2, Dec 3 2004, 17:59:05)
    | [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
    | Type "help", "copyright", "credits" or "license" for more information.
    | >>> unicode('\x9d', 'iso8859_1')
    | u'\x9d'
    | >>>
    |
    | U+009D is NOT a valid unicode character (it is not even a iso8859_1
    | valid character)

    That statement is not entirely true. If you check the current
    UnicodeData.txt (on http://www.unicode.org/Public/UNIDATA/) you'll find:

    009D;<control>;Cc;0;BN;;;;;N;OPERATING SYSTEM COMMAND;;;;
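
    The same record can be checked from Python itself. A small sketch,
    assuming Python 3 and the standard unicodedata module; the values it
    prints depend on the Unicode database bundled with the interpreter and
    should correspond to the third and fifth fields of the line above:

```python
import unicodedata

ch = "\x9d"  # U+009D

category = unicodedata.category(ch)   # general category field
bidi = unicodedata.bidirectional(ch)  # bidirectional class field
print(category, bidi)
```

    A general category of Cc marks an assigned control character, which is
    different from an unassigned code point (category Cn).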

    Regards,

    Vincent Wehren

    Vincent Wehren, Dec 31, 2005
    #2

  3. wrote:

    >
    > I was playing with python encodings and noticed this:
    >
    > garabik@lancre:~$ python2.4
    > Python 2.4 (#2, Dec 3 2004, 17:59:05)
    > [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> unicode('\x9d', 'iso8859_1')
    > u'\x9d'
    > >>>

    >
    > U+009D is NOT a valid unicode character (it is not even a iso8859_1
    > valid character)


    It *IS* a valid Unicode and iso8859-1 character, so the behaviour of the
    Python decoder is correct. The range U+0080 - U+009F is used for various
    control characters. There's rarely a valid use for these characters in
    documents, so a document that contains bytes in this range is most likely
    windows-1252. Such input is still valid iso-8859-1, but for a heuristic
    guess it's probably safer to assume windows-1252.
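
    The heuristic described here could be sketched roughly as follows. This
    assumes Python 3; the function names has_c1_bytes and
    guess_single_byte_encoding are made up for illustration, and the fallback
    rule is an assumption rather than anything stated in this thread. Note
    that Python's cp1252 codec leaves a few bytes undefined (0x9D among them),
    so the sketch only reports cp1252 when the data actually decodes with it:

```python
def has_c1_bytes(data: bytes) -> bool:
    """Return True if any byte falls in the range 0x80-0x9F."""
    return any(0x80 <= b <= 0x9F for b in data)


def guess_single_byte_encoding(data: bytes) -> str:
    """Hypothetical guess between iso8859_1 and cp1252 (windows-1252)."""
    if not has_c1_bytes(data):
        # Both codecs agree outside 0x80-0x9F, so keep the simpler answer.
        return "iso8859_1"
    try:
        data.decode("cp1252")
    except UnicodeDecodeError:
        # Byte undefined in cp1252: still decodable as iso8859_1 controls.
        return "iso8859_1"
    return "cp1252"


print(guess_single_byte_encoding(b"caf\xe9"))
print(guess_single_byte_encoding(b"\x93quoted\x94"))
print(guess_single_byte_encoding(b"\x9d"))
```

    The second input uses 0x93 and 0x94, which are curly quotes in
    windows-1252 but control characters in iso8859-1.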

    If you want an exception to be thrown, you'll need to implement your own
    codec, something like 'iso8859_1_nocc' - mmm.. I could try this myself,
    because I do such a test in one of my projects, too ;)
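
    One possible shape for such a codec, as a sketch only: it assumes Python 3
    and the standard codecs.register / codecs.CodecInfo API, reuses the name
    'iso8859_1_nocc' suggested above, and ignores the errors argument (it
    always behaves as 'strict'). The helper names are hypothetical.

```python
import codecs

CODEC_NAME = "iso8859_1_nocc"


def _decode(data, errors="strict"):
    raw = bytes(data)  # the codec machinery may pass a memoryview
    for pos, byte in enumerate(raw):
        if 0x80 <= byte <= 0x9F:
            raise UnicodeDecodeError(
                CODEC_NAME, raw, pos, pos + 1, "control byte in 0x80-0x9F"
            )
    return raw.decode("iso8859_1"), len(raw)


def _encode(text, errors="strict"):
    for pos, ch in enumerate(text):
        if 0x80 <= ord(ch) <= 0x9F:
            raise UnicodeEncodeError(
                CODEC_NAME, text, pos, pos + 1, "control character U+0080-U+009F"
            )
    return text.encode("iso8859_1"), len(text)


def _search(name):
    # Called by the codec registry with the normalized encoding name.
    if name == CODEC_NAME:
        return codecs.CodecInfo(encode=_encode, decode=_decode, name=CODEC_NAME)
    return None


codecs.register(_search)

print(b"caf\xe9".decode(CODEC_NAME))
```

    With this registered, b'\x9d'.decode('iso8859_1_nocc') raises
    UnicodeDecodeError, which is what an encoding-guessing heuristic needs in
    order to reject the input.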

    > The same happens if I use 'latin-1' instead of 'iso8859_1'.
    >
    > This caught me by surprise, since I was doing some heuristics guessing
    > string encodings, and 'iso8859_1' gave no errors even if the input
    > encoding was different.
    >
    > Is this a known behaviour, or I discovered a terrible unknown bug in
    > python encoding implementation that should be immediately reported and
    > fixed? :)
    >
    >
    > happy new year,
    >


    --
    Benjamin Niemann
    Email: pink at odahoda dot de
    WWW: http://www.odahoda.de/
    Benjamin Niemann, Dec 31, 2005
    #3