An attempt at guessing the encoding of a (non-unicode) string


Christos TZOTZIOY Georgiou

This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.

This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows:

1. Create a dictionary (key: encoding, value: set of valid bytes for the
encoding)

1a. the sets can be constructed by trial and error:

def valid_bytes(encoding):
    # set of single bytes that decode successfully in `encoding`
    result = set()
    for byte in xrange(256):
        char = chr(byte)
        try:
            char.decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(char)
    return result
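To make step 1 concrete, a minimal sketch of how the dictionary could then be built; the list of candidate encodings here is only an illustrative placeholder, not part of the proposal:

CANDIDATE_ENCODINGS = ['iso8859-1', 'iso8859-7', 'koi8-r', 'cp1252']  # placeholder list

# key: encoding, value: set of bytes that decode successfully in it (step 1)
valid_sets = dict(
    (encoding, valid_bytes(encoding)) for encoding in CANDIDATE_ENCODINGS
)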

2. for every 8-bit encoding, some "representative" text is given (the
longer, the better)

2a. The following function is a quick generator of all two-char
sequences from its string argument. It can be used both for the production
of the pre-calculated data and for the analysis of a given string in the
'wild_guess' function.

import itertools

def str_window(text):
    # generate every two-char slice of `text`
    return itertools.imap(
        text.__getslice__, xrange(0, len(text) - 1), xrange(2, len(text) + 1)
    )

So for every encoding and its 'representative' text, a bag of two-char
sequences and their frequencies is calculated (frequencies[encoding] =
dict mapping two-char sequence -> count).
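A rough sketch of that calculation, in the same Python 2 style as the snippets above; representative_texts is an assumed mapping from encoding name to its 'representative' byte string, however that text gets collected:

def two_char_frequencies(text):
    # bag of two-char sequences: {two-char slice: number of occurrences}
    counts = {}
    for pair in str_window(text):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

# representative_texts is assumed: {encoding name: 'representative' byte string}
frequencies = dict(
    (encoding, two_char_frequencies(text))
    for encoding, text in representative_texts.iteritems()
)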

2b. do a lengthy comparison of the bags in order to find the most common
two-char sequences that, as a set, can be considered unique for the
specific encoding.

2c. For every encoding, keep only a set of the (chosen in step 2b)
two-char sequences that were judged as 'representative'. Store these
calculated sets plus those from step 1a as python code in a helper
module to be imported from codecs.py for the wild_guess function
(reproduce the helper module every time some 'representative' text is
added or modified).
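One naive way the 2b/2c selection could be sketched: keep each encoding's N most frequent two-char sequences, then drop any sequence that also appears in another encoding's top N, so only the more 'distinctive' sequences survive. Both N and the pruning rule are my assumptions, not part of the proposal; the result is the per-encoding set that step 2c would write out to the helper module.

def representative_pairs(frequencies, top_n=200):
    # most frequent two-char sequences per encoding
    top = {}
    for encoding, counts in frequencies.iteritems():
        ordered = sorted(counts, key=counts.get, reverse=True)
        top[encoding] = set(ordered[:top_n])
    # keep only the sequences not shared with any other encoding's top list
    result = {}
    for encoding, pairs in top.iteritems():
        others = set()
        for other, other_pairs in top.iteritems():
            if other != encoding:
                others |= other_pairs
        result[encoding] = pairs - others
    return result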

3. write the wild_guess function

3a. the function 'wild_guess' would first construct a set from its
argument:

sample_set= set(argument)

and by set operations against the sets from step 1a, we can exclude
codecs where the sample set is not a subset of the encoding valid set.
I don't expect that this step would exclude many encodings, but I think
it should not be skipped.

3b. pass the argument through the str_window function, and construct a
set of all two-char sequences

3c. from all the sets from step 2c, find the one whose intersection with
the set from 3b is largest as a ratio len(intersection)/len(encoding_set),
and suggest the corresponding encoding.
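Putting 3a-3c together, a minimal sketch of what wild_guess might look like. For simplicity the precomputed data are passed in as arguments here (valid_sets from step 1a, pair_sets being the per-encoding sets of step 2c); in the proposal itself they would instead be imported from the generated helper module.

def wild_guess(argument, valid_sets, pair_sets):
    # 3a. exclude encodings for which the sample contains invalid bytes
    sample_set = set(argument)
    candidates = [
        encoding for encoding, valid in valid_sets.iteritems()
        if sample_set <= valid
    ]

    # 3b. set of all two-char sequences in the argument
    sample_pairs = set(str_window(argument))

    # 3c. pick the candidate whose 'representative' set is best covered
    best_encoding, best_ratio = None, -1.0
    for encoding in candidates:
        encoding_set = pair_sets[encoding]
        if not encoding_set:
            continue
        ratio = len(sample_pairs & encoding_set) / float(len(encoding_set))
        if ratio > best_ratio:
            best_encoding, best_ratio = encoding, ratio
    return best_encoding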

What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)

PS I know how generic 'representative' is, and how hard it is to qualify
some text as such, therefore the quotes. That is why I said 'the
longer, the better'.
 

Jon Willeke

Christos said:
This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.

This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows: ....
What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)

The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:

<http://nltk.sf.net/>

In particular, check out the probability tutorial.
 

Christos TZOTZIOY Georgiou

Christos TZOTZIOY Georgiou wrote:
This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows: ...
<snip>

[Jon]
The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:

<http://nltk.sf.net/>

In particular, check out the probability tutorial.

Thanks for the hint, and I am browsing the documentation now. However,
I'd like to create something that does not depend on external
Python libraries, so that anyone interested would just download a small
module that would, hopefully, do the job well.
 

David Eppstein

I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.
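A minimal sketch of that scoring scheme as described; the candidate list and the penalty value are placeholders of mine, and a decode failure is treated here as simply scoring the large negative penalty:

def simple_guess(data, candidates=('iso8859-1', 'iso8859-7', 'koi8-r')):
    PENALTY = 10 ** 6  # arbitrary large penalty for undecodable input
    best_encoding, best_score = None, None
    for encoding in candidates:
        try:
            decoded = data.decode(encoding)
        except UnicodeDecodeError:
            score = -PENALTY
        else:
            # count characters that decode to something alphabetic or whitespace
            score = sum(1 for c in decoded if c.isalpha() or c.isspace())
        if best_score is None or score > best_score:
            best_encoding, best_score = encoding, score
    return best_encoding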
 

John Roth

David Eppstein said:
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.

Shouldn't that be isalnum()? Or does your data not have
very many numbers?

John Roth
 

David Eppstein

"John Roth said:
Shouldn't that be isalphanum()? Or does your data not have
very many numbers?

It's only important if your text has many code positions which produce a
digit in one encoding and not in another, and which are hard to
disambiguate using isalpha() alone. I haven't encountered that
situation.
 

Christos TZOTZIOY Georgiou


As far as I understand, this function tests whether its argument is
valid Unicode text, so it has little to do with the issue I brought up:
take a Python string (8-bit bytes) and try to guess its encoding (e.g.
iso8859-1, iso8859-7, etc.).

There must be a similar function behind the "auto guess encoding"
feature of MS Internet Explorer; however:

1. even if it is exported and usable under Windows, it is not platform
independent

2. its guessing success rate (at least up to IE 5.5, which I happen to
use) is not very high

<snip>

Thanks for your reply, anyway.
 

Christos TZOTZIOY Georgiou

David Eppstein said:
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.

Somebody (by email only so far) has suggested that spambayes could be
used for the task... perhaps they're right; however, that is not as simple
and independent a solution as I would like to deliver.

I believe that your idea of a score is a good one; I feel that the
key should be two-char combinations, but I'll have to compare the
success rate of both one-char and two-char keys.

I'll try to search for "representative" texts on the web for as many
encodings as I can; any pointers, links from non-english speakers would
be welcome in the thread.
 

David Eppstein

Christos TZOTZIOY Georgiou said:
I believe that your idea of a score is a good one; I feel that the
key should be two-char combinations, but I'll have to compare the
success rate of both one-char and two-char keys.
<snip>

BTW, if you're going to implement the single-char version, at least for
encodings that translate one byte -> one unicode position (e.g., not
utf8), and your texts are large enough, it will be faster to precompute
a table of byte frequencies in the text and then compute the score by
summing the frequencies of alphabetic bytes.
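A sketch of that shortcut: the byte histogram is computed once per text, and the set of bytes that count as alphabetic (or whitespace) under each encoding would in practice be precomputed as well; the function names here are mine:

def byte_histogram(data):
    # frequency of every byte in the text; computed once, reused per encoding
    counts = {}
    for byte in data:
        counts[byte] = counts.get(byte, 0) + 1
    return counts

def alpha_space_bytes(encoding):
    # bytes that decode to an alphabetic or whitespace character in `encoding`
    # (in practice this would be precomputed per encoding, not rebuilt each call)
    result = set()
    for byte in xrange(256):
        char = chr(byte)
        try:
            decoded = char.decode(encoding)
        except UnicodeDecodeError:
            continue
        if decoded.isalpha() or decoded.isspace():
            result.add(char)
    return result

def frequency_score(histogram, encoding):
    # sum of the frequencies of the alphabetic/whitespace bytes
    return sum(histogram.get(byte, 0) for byte in alpha_space_bytes(encoding))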
 

Christos TZOTZIOY Georgiou

David Eppstein said:
BTW, if you're going to implement the single-char version, at least for
encodings that translate one byte -> one unicode position (e.g., not
utf8), and your texts are large enough, it will be faster to precompute
a table of byte frequencies in the text and then compute the score by
summing the frequencies of alphabetic bytes.

Thanks for the pointer, David. However, as often happens, I came
second (or, probably, n-th :). Seo Sanghyeon sent a URL that includes a
two-char proposal, and it provides an algorithm in section 4.7.1 that I
find appropriate for this matter:

http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
 
