Detect character encoding

Discussion in 'Python' started by Michal, Dec 4, 2005.

  1. Michal

    Michal Guest

    Hello,
    is there any way to detect a string's encoding in Python?

    I need to process several files. Each of them could be encoded in a
    different charset (iso-8859-2, cp1250, etc.). I want to detect it and
    encode it to utf-8 (with the string method encode).

    Thank you for any answer
    Regards
    Michal
    Michal, Dec 4, 2005
    #1

  2. Michal wrote:
    > Hello,
    > is there any way how to detect string encoding in Python?
    >
    > I need to proccess several files. Each of them could be encoded in
    > different charset (iso-8859-2, cp1250, etc). I want to detect it, and
    > encode it to utf-8 (with string function encode).
    >
    > Thank you for any answer
    > Regards
    > Michal

    The two ways to detect a string's encoding are:
    (1) know the encoding ahead of time
    (2) guess correctly

    This is the whole point of Unicode -- an encoding that works for _lots_
    of languages.

    --Scott David Daniels
    Scott David Daniels, Dec 4, 2005
    #2

  3. Michal wrote:
    > Hello,
    > is there any way how to detect string encoding in Python?
    >
    > I need to proccess several files. Each of them could be encoded in
    > different charset (iso-8859-2, cp1250, etc). I want to detect it, and
    > encode it to utf-8 (with string function encode).


    You can only guess, e.g. by looking for words that contain umlauts.
    Recode might be of help here; it has such heuristics built in, AFAIK.

    But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each
    file is "legal" in all encodings.


    Diez
    Diez B. Roggisch, Dec 4, 2005
    #3
  4. Mike Meyer

    Mike Meyer Guest

    "Diez B. Roggisch" <> writes:
    > Michal wrote:
    >> is there any way how to detect string encoding in Python?
    >> I need to proccess several files. Each of them could be encoded in
    >> different charset (iso-8859-2, cp1250, etc). I want to detect it,
    >> and encode it to utf-8 (with string function encode).

    > But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
    > file is "legal" in all encodings.


    Not quite. Some encodings don't use all the valid 8-bit characters, so
    if you encounter a character not in an encoding, you can eliminate it
    from the list of possible encodings. This doesn't really help much by
    itself, though.
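
    A minimal sketch of that elimination step, written for Python 3 (which
    postdates this thread). The helper name possible_encodings and the
    candidate list are made up for illustration; only bytes.decode is a real
    API here.

```python
def possible_encodings(data, candidates):
    """Return the candidates under which `data` decodes without error."""
    survivors = []
    for enc in candidates:
        try:
            data.decode(enc, "strict")
        except UnicodeDecodeError:
            continue  # some byte is undefined in this encoding: rule it out
        survivors.append(enc)
    return survivors


# Byte 0x81 is undefined in cp1250, so that candidate is eliminated.
print(possible_encodings(b"caf\x81", ["cp1250", "latin1", "iso-8859-2"]))
```

    As noted, this narrows the list but rarely leaves a single answer.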

    <mike
    --
    Mike Meyer http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
    Mike Meyer, Dec 4, 2005
    #4
  5. Nemesis

    Nemesis Guest

    While I was thinking up a nice intro, "Michal" wrote:

    > Hello,
    > is there any way how to detect string encoding in Python?
    > I need to proccess several files. Each of them could be encoded in
    > different charset (iso-8859-2, cp1250, etc). I want to detect it, and
    > encode it to utf-8 (with string function encode).
    > Thank you for any answer


    Hi,
    As you have already heard, you can't be sure, but you can guess.

    I use a method like this:

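    def guess_encoding(text):
        # Try each candidate in order; the first strict decode that succeeds wins.
        for enc in guess_list:
            try:
                text.decode(enc, "strict")
            except (UnicodeDecodeError, LookupError):
                continue
            return enc
        return None  # no candidate decoded cleanly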

    'guess_list' is an ordered charset name list like this:

    ['us-ascii','iso-8859-1','iso-8859-2',...,'windows-1250','windows-1252'...]

    of course you can remove charsets you are sure you'll never find.
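
    For the original goal (re-encoding to UTF-8), the same kind of loop
    might be used like this in Python 3. This is only a sketch: the name
    to_utf8 and the candidate order are assumptions, not anything from the
    post. One caveat worth knowing: iso-8859-1 accepts every byte sequence
    in Python, so it only makes sense as the last entry, because anything
    listed after it is never tried.

```python
def to_utf8(data, candidates):
    """Decode `data` with the first candidate that works; return (encoding, utf8_bytes)."""
    for enc in candidates:
        try:
            text = data.decode(enc, "strict")
        except UnicodeDecodeError:
            continue
        return enc, text.encode("utf-8")
    raise ValueError("none of the candidate encodings could decode the data")


# Strict encodings first, catch-all single-byte encodings last.
candidates = ["ascii", "utf-8", "cp1250", "iso-8859-1"]
enc, converted = to_utf8("kůň".encode("cp1250"), candidates)
print(enc, converted)
```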
    --
    This could really be the spark that makes the drop
    overflow.

    |\ | |HomePage : http://nem01.altervista.org
    | \|emesis |XPN (my nr): http://xpn.altervista.org
    Nemesis, Dec 4, 2005
    #5
  6. B Mahoney

    B Mahoney Guest

    B Mahoney, Dec 4, 2005
    #6
  7. Mike Meyer wrote:
    > "Diez B. Roggisch" <> writes:
    >> Michal wrote:
    >>> is there any way how to detect string encoding in Python?
    >>> I need to proccess several files. Each of them could be encoded in
    >>> different charset (iso-8859-2, cp1250, etc). I want to detect it,
    >>> and encode it to utf-8 (with string function encode).

    >> But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
    >> file is "legal" in all encodings.

    >
    > Not quite. Some encodings don't use all the valid 8-bit characters, so
    > if you encounter a character not in an encoding, you can eliminate it
    > from the list of possible encodings. This doesn't really help much by
    > itself, though.
    >
    > <mike


    I read or heard (can't remember the origin) that MS IE has a quite good
    implementation of guessing the language and character encoding of web
    pages when they are missing or falsely specified.
    From what I can remember, they used an algorithm to create some
    statistics of the specific page, compared them with statistics about
    all kinds of languages and encodings, and just picked the most likely.

    Please be aware that I don't know if the above has even the slightest
    amount of truth in it, however it didn't prevent me from posting anyway ;-)

    --
    mph
    Martin P. Hellwig, Dec 4, 2005
    #7
  8. Skip

    Skip Guest

    Martin> I read or heard (can't remember the origin) that MS IE has a
    Martin> quite good implementation of guessing the language en character
    Martin> encoding of web pages when there not or falsely specified.

    Gee, that's nice. Too bad the source isn't available... <0.5 wink>

    Skip
    Skip, Dec 4, 2005
    #8
  9. Mike Meyer wrote:
    > "Diez B. Roggisch" <> writes:
    >
    >>Michal wrote:
    >>
    >>>is there any way how to detect string encoding in Python?
    >>>I need to proccess several files. Each of them could be encoded in
    >>>different charset (iso-8859-2, cp1250, etc). I want to detect it,
    >>>and encode it to utf-8 (with string function encode).

    >>
    >>But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
    >>file is "legal" in all encodings.

    >
    >
    > Not quite. Some encodings don't use all the valid 8-bit characters, so
    > if you encounter a character not in an encoding, you can eliminate it
    > from the list of possible encodings. This doesn't really help much by
    > itself, though.



    ----- test.py
    for enc in ["cp1250", "latin1", "iso-8859-2"]:
        print enc
        try:
            str.decode("".join([chr(i) for i in xrange(256)]), enc)
        except UnicodeDecodeError, e:
            print e
    -----

    192:~ deets$ python2.4 /tmp/test.py
    cp1250
    'charmap' codec can't decode byte 0x81 in position 129: character maps
    to <undefined>
    latin1
    iso-8859-2

    So cp1250 doesn't have all code points defined, but the others do.
    Sure, this helps you eliminate one of the three choices the OP wanted
    to choose between, but how many texts do you have that contain byte 129?
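
    For readers on Python 3, where the script above no longer runs, a rough
    equivalent could look like the following. The function name
    undefined_bytes is made up for this sketch, and it assumes single-byte
    codecs like the three tested here.

```python
def undefined_bytes(encoding):
    """Return the byte values (0-255) that `encoding` cannot decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


for enc in ["cp1250", "latin1", "iso-8859-2"]:
    print(enc, [hex(b) for b in undefined_bytes(enc)])
```

    Listing every undefined byte, rather than stopping at the first error,
    shows how few values can actually rule an encoding out.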

    Regards,

    Diez
    Diez B. Roggisch, Dec 4, 2005
    #9
  10. [Diez B. Roggisch]
    >Michal wrote:


    >> is there any way how to detect string encoding in Python?


    >Recode might be of help here, it has such heuristics built in AFAIK.


    If we are speaking about the same Recode ☺, there are some built in
    tools that could help a human to discover a charset, but this requires
    work and time, and is far from fully automated as one might dream.
    While some charsets could be guessed almost correctly by automatic
    means, most are difficult to recognise. The whole problem is not easy.

    --
    François Pinard http://pinard.progiciels-bpi.ca
    François Pinard, Dec 5, 2005
    #10
  11. Martin P. Hellwig wrote:
    > From what I can remember is that they used an algorithm to create some
    > statistics of the specific page and compared that with statistic about
    > all kinds of languages and encodings and just mapped the most likely.


    More hearsay: I believe language-based heuristics are common. You first
    guess an encoding based on the bytes you see, then guess a language of
    the page. If you then get a lot of characters that should not appear
    in texts of the language (e.g. a lot of umlaut characters in a French
    page), you know your guess was wrong, and you try a different language
    for that encoding. If you run out of languages, you guess a different
    encoding.

    Mozilla can guess the encoding if you tell it what the language is,
    which sounds like a similar approach.
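
    A toy version of such a language-based check, assuming the language is
    already known to be Czech. Everything here (the letter set, the scoring
    rule, the function names) is an illustrative assumption, not how any
    browser actually does it.

```python
# Accented letters expected in Czech text (illustrative, not a full language model).
CZECH_LETTERS = set("áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ")


def score(data, encoding, expected):
    """Count expected non-ASCII characters minus unexpected ones after decoding."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return float("-inf")  # cannot be this encoding at all
    non_ascii = [ch for ch in text if ord(ch) > 127]
    hits = sum(1 for ch in non_ascii if ch in expected)
    return hits - (len(non_ascii) - hits)


def guess_by_language(data, candidates, expected):
    """Pick the candidate whose decoding looks most like the expected language."""
    return max(candidates, key=lambda enc: score(data, enc, expected))


sample = "žluťoučký kůň".encode("iso-8859-2")
print(guess_by_language(sample, ["cp1250", "iso-8859-2"], CZECH_LETTERS))
```

    The wrong candidate still decodes, but it turns some letters into
    characters that Czech text would not contain, which lowers its score.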

    Regards,
    Martin
    Martin v. Löwis, Dec 5, 2005
    #11
  12. Diez B. Roggisch wrote:
    > So cp1250 doesn't have all codepoints defined - but the others have.
    > Sure, this helps you to eliminate 1 of the three choices the OP wanted
    > to choose between - but how many texts you have that have a 129 in them?


    For the iso8859 ones, you should assume that the characters in
    range(128, 160) really aren't used. If you get one of these, and it is
    not utf-8, it is a Windows code page.

    UTF-8 can be recognized pretty reliably: even though it allows all bytes
    to appear, it is very constrained in what sequences of bytes it allows.
    E.g. you can't have a single byte >127 in UTF-8; you need at least two
    of them in a row, and they need to meet more constraints.
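
    Put together, the two rules could be sketched like this in Python 3.
    The function name classify and its return labels are invented for the
    example. Note that "iso-8859" here only means no byte in range(128, 160)
    was seen; a Windows code page is still possible in that case.

```python
def classify(data):
    """Roughly classify bytes as 'ascii', 'utf-8', 'windows' or 'iso-8859'."""
    if all(b < 128 for b in data):
        return "ascii"  # pure ASCII is identical in all of these encodings
    try:
        data.decode("utf-8", "strict")
        return "utf-8"  # every multi-byte sequence is well-formed
    except UnicodeDecodeError:
        pass
    # Bytes 128-159 are control codes in ISO 8859 and rarely appear in text,
    # so their presence suggests a Windows code page.
    if any(128 <= b < 160 for b in data):
        return "windows"
    return "iso-8859"


print(classify("kůň".encode("utf-8")))
print(classify("žluť".encode("cp1250")))
print(classify("žluť".encode("iso-8859-2")))
```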

    Regards,
    Martin
    Martin v. Löwis, Dec 5, 2005
    #12
  13. Kent Johnson

    Kent Johnson Guest

    Martin P. Hellwig wrote:
    > I read or heard (can't remember the origin) that MS IE has a quite good
    > implementation of guessing the language en character encoding of web
    > pages when there not or falsely specified.


    Yes, I think that's right. In my experience MS Word does a very good job
    of guessing the encoding of text files.

    Kent
    Kent Johnson, Dec 5, 2005
    #13
  14. The new guy

    The new guy Guest

    Michal wrote:

    > Hello,
    > is there any way how to detect string encoding in Python?
    >
    > I need to proccess several files. Each of them could be encoded in
    > different charset (iso-8859-2, cp1250, etc). I want to detect it, and
    > encode it to utf-8 (with string function encode).


    Well, about how to detect it in Python, I can't help. My first guess,
    though, would be to have a look at the source code of the "file" utility.
    This is an example of what it does:

    # ls
    de.i18n en.i18n
    # file *
    de.i18n: ISO-8859 text, with very long lines
    en.i18n: ISO-8859 English text, with very long lines

    cheers
    The new guy, Dec 6, 2005
    #14