how to detect the encoding used for a specific text data ?

iMath · Dec 20, 2012

iMath · Dec 20, 2012

On Thursday, December 20, 2012 at 7:57:19 PM UTC+8, iMath wrote:

how to detect the encoding used for a specific text data ?

On Windows XP

Jussi Piitulainen · Dec 20, 2012

iMath said:
how to detect the encoding used for a specific text data ?

The practical thing to do is to try an encoding and see whether you
find the expected frequent letters of the relevant languages in the
decoded text, or the most frequent words. This is likely to help you
decide between some of the most common encodings. Some decoding
attempts may even raise an exception, which should be a clue.
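
A minimal sketch of that trial-and-error approach, assuming Python 3. The helper names (try_encodings, score), the candidate list, and the Finnish sample are illustrative assumptions, not something from this thread:

```python
def try_encodings(data, candidates=("utf-8", "latin-1", "cp1252")):
    """Return {encoding: decoded text} for every candidate that decodes cleanly."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # A failed decode is itself a clue: this encoding is ruled out.
            continue
    return results


def score(text, expected_letters):
    """Fraction of characters that are letters we expect in the language."""
    if not text:
        return 0.0
    return sum(ch in expected_letters for ch in text) / len(text)


# Hypothetical sample: Finnish text that was saved as Latin-1.
sample = "hyvää päivää".encode("latin-1")
decoded = try_encodings(sample)

finnish_letters = set("abcdefghijklmnopqrstuvwxyzåäö ")
best = max(decoded, key=lambda name: score(decoded[name].lower(), finnish_letters))
print(sorted(decoded), best)
```

Here the UTF-8 attempt raises and is dropped, while Latin-1 and cp1252 both decode to the same text, so the letter score cannot separate them. That tie is the kind of ambiguity discussed below.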

Strictly speaking, it cannot be done with complete certainty. There
are lots of Finnish texts that are identical whether you think they
are in Latin-1 or Latin-9. A further text from the same source might
still reveal the difference, so the distinction matters.
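
To illustrate, here is a small sketch assuming Python 3, where Latin-9 is available under the codec name "iso8859-15". The sample strings are made up for the example:

```python
text = "Hyvää yötä"  # hypothetical Finnish sample

# The letters ä and ö sit at the same byte values in both encodings,
# so the encoded bytes are identical.
same_bytes = text.encode("latin-1") == text.encode("iso8859-15")

# Byte 0xA4 is one of the few positions where the two differ:
# the currency sign in Latin-1, the euro sign in Latin-9.
as_latin1 = b"5 \xa4".decode("latin-1")
as_latin9 = b"5 \xa4".decode("iso8859-15")

print(same_bytes, as_latin1, as_latin9)
```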

Short Finnish texts might also be identical whether you think they are
in Latin-1 or UTF-8, but the situation is different: a couple of
frequent letters turn into nonsense in the wrong encoding. It's easy
to tell at a glance.
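
A quick sketch of both cases, assuming Python 3; the sample words are only illustrative:

```python
# An ASCII-only text is byte-for-byte the same in Latin-1 and UTF-8.
plain = "talo on iso"
identical = plain.encode("latin-1") == plain.encode("utf-8")

# With ä in the text, UTF-8 bytes read as Latin-1 turn into visible nonsense.
utf8_bytes = "päivää".encode("utf-8")
misread = utf8_bytes.decode("latin-1")

print(identical, misread)
```

Each ä becomes the two-character sequence Ã¤, which is the at-a-glance giveaway.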

Sometimes texts declare their encoding. That should be a clue, but in
practice the declaration may be false. Sometimes there is a stray
character that violates the declared or assumed encoding, or a part of
the text is in one encoding and another part in another. Bad source.
You decide how important it is to deal with the mess. (This only
happens in the real world.)

Good luck.

Stefan H. Holek · Dec 20, 2012

how to detect the encoding used for a specific text data ?

http://pypi.python.org/pypi?:action=search&term=detect+encoding

iMath · Dec 20, 2012

which package to use ?

Jussi Piitulainen · Dec 20, 2012

iMath said:
which package to use ?

Read the text in as a "bytes object" (bytes); it has a .decode
method that you can experiment with. Strings (str) are Unicode and
have an .encode method. These methods allow you to specify a desired
encoding and what to do when there are errors.

help(bytes.decode)
help(str.encode)
help(open)
<http://docs.python.org/3.3/library/stdtypes.html>

In Python 2.7 and before, strings seem to do double duty and have both
the .encode and .decode methods, so Python version matters here.
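
A short Python 3 sketch of experimenting with these methods and their errors argument. The bytes are built in place as a stand-in for data read with open(path, "rb"); the sample text is an assumption for illustration:

```python
# Stand-in for: raw = open(path, "rb").read()
raw = "naïve café".encode("utf-8")

# The default errors="strict" raises when the bytes do not fit the encoding.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for each undecodable byte; "ignore" drops it.
replaced = raw.decode("ascii", errors="replace")
ignored = raw.decode("ascii", errors="ignore")

# str.encode takes the same kind of argument in the other direction.
encoded = "café".encode("ascii", errors="xmlcharrefreplace")

print(strict_failed, replaced, ignored, encoded)
```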

rurpy · Dec 20, 2012

how to detect the encoding used for a specific text data ?

The chardet package will probably do what you want:
http://pypi.python.org/pypi/chardet
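
A hedged sketch of how chardet is typically called, assuming the package is installed (it is third-party, not part of the standard library). The wrapper name guess_encoding and the sample text are hypothetical; chardet.detect returns a dict that includes "encoding" and "confidence" keys, and the result is a statistical guess, not a guarantee:

```python
def guess_encoding(data):
    """Return (encoding, confidence) as guessed by chardet, or None if unavailable."""
    try:
        import chardet  # third-party; install separately
    except ImportError:
        return None
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]


sample = "Hyvää päivää, mitä kuuluu?".encode("utf-8")
guess = guess_encoding(sample)
print(guess)
```

Short inputs tend to give low-confidence or wrong guesses, so it is worth feeding it as much of the text as is practical.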

Dave Angel · Dec 21, 2012

<snip>
On a related note: how to answer question with no context on mailing
list?

Depends on how you're reading/responding. I'll assume you're using an
email client like Thunderbird, and that you do NOT subscribe in digest form.

The most general way is to use Reply-All and remove any recipients you
don't want there, but make sure you keep the python-list recipient.

Alternatively, if you're using Thunderbird or another with similar
capability, use Reply-list, which is smart enough to only keep the list
entry.

Or, as I used to do, use plain Reply and then add (e-mail address removed) to the
list of recipients. That's error-prone.

I hope this answers your question.
