An attempt at guessing the encoding of a (non-unicode) string


Christos TZOTZIOY Georgiou

This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.

This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows:

1. Create a dictionary (key: encoding, value: set of valid bytes for the
encoding)

1a. the sets can be constructed by trial and error:

def valid_bytes(encoding):
    # set of single bytes that decode successfully in `encoding`
    result = set()
    for byte in xrange(256):
        char = chr(byte)
        try:
            char.decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(char)
    return result
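To make step 1 concrete, a minimal sketch of how the dictionary could then be built; the list of candidate encodings here is only an illustrative placeholder, not part of the proposal:

CANDIDATE_ENCODINGS = ['iso8859-1', 'iso8859-7', 'koi8-r', 'cp1252']  # placeholder list

# key: encoding, value: set of bytes that decode successfully in it (step 1)
valid_sets = dict(
    (encoding, valid_bytes(encoding)) for encoding in CANDIDATE_ENCODINGS
)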

2. for every 8-bit encoding, some "representative" text is given (the
longer, the better)

2a. The following function is a quick generator of all two-char
sequences from its string argument. It can be used both for the production
of the pre-calculated data and for the analysis of a given string in the
'wild_guess' function.

import itertools

def str_window(text):
    # generate every two-char slice of `text`
    return itertools.imap(
        text.__getslice__, xrange(0, len(text) - 1), xrange(2, len(text) + 1)
    )

So for every encoding and its 'representative' text, a bag of two-char
sequences and their frequencies is calculated (frequencies[encoding] =
dict mapping two-char sequence -> count).
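A rough sketch of that calculation, in the same Python 2 style as the snippets above; representative_texts is an assumed mapping from encoding name to its 'representative' byte string, however that text gets collected:

def two_char_frequencies(text):
    # bag of two-char sequences: {two-char slice: number of occurrences}
    counts = {}
    for pair in str_window(text):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

# representative_texts is assumed: {encoding name: 'representative' byte string}
frequencies = dict(
    (encoding, two_char_frequencies(text))
    for encoding, text in representative_texts.iteritems()
)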

2b. do a lengthy comparison of the bags in order to find the most common
two-char sequences that, as a set, can be considered unique for the
specific encoding.

2c. For every encoding, keep only a set of the (chosen in step 2b)
two-char sequences that were judged as 'representative'. Store these
calculated sets plus those from step 1a as python code in a helper
module to be imported from codecs.py for the wild_guess function
(reproduce the helper module every time some 'representative' text is
added or modified).
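One naive way the 2b/2c selection could be sketched: keep each encoding's N most frequent two-char sequences, then drop any sequence that also appears in another encoding's top N, so only the more 'distinctive' sequences survive. Both N and the pruning rule are my assumptions, not part of the proposal; the result is the per-encoding set that step 2c would write out to the helper module.

def representative_pairs(frequencies, top_n=200):
    # most frequent two-char sequences per encoding
    top = {}
    for encoding, counts in frequencies.iteritems():
        ordered = sorted(counts, key=counts.get, reverse=True)
        top[encoding] = set(ordered[:top_n])
    # keep only the sequences not shared with any other encoding's top list
    result = {}
    for encoding, pairs in top.iteritems():
        others = set()
        for other, other_pairs in top.iteritems():
            if other != encoding:
                others |= other_pairs
        result[encoding] = pairs - others
    return result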

3. write the wild_guess function

3a. the function 'wild_guess' would first construct a set from its
argument:

sample_set= set(argument)

and by set operations against the sets from step 1a, we can exclude
codecs where the sample set is not a subset of the encoding valid set.
I don't expect that this step would exclude many encodings, but I think
it should not be skipped.

3b. pass the argument through the str_window function, and construct a
set of all two-char sequences

3c. from all the sets from step 2c, find the one whose intersection with
the set from 3b is largest as a ratio len(intersection)/len(encoding_set),
and suggest the corresponding encoding.
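Putting 3a-3c together, a minimal sketch of what wild_guess might look like. For simplicity the precomputed data are passed in as arguments here (valid_sets from step 1a, pair_sets being the per-encoding sets of step 2c); in the proposal itself they would instead be imported from the generated helper module.

def wild_guess(argument, valid_sets, pair_sets):
    # 3a. exclude encodings for which the sample contains invalid bytes
    sample_set = set(argument)
    candidates = [
        encoding for encoding, valid in valid_sets.iteritems()
        if sample_set <= valid
    ]

    # 3b. set of all two-char sequences in the argument
    sample_pairs = set(str_window(argument))

    # 3c. pick the candidate whose 'representative' set is best covered
    best_encoding, best_ratio = None, -1.0
    for encoding in candidates:
        encoding_set = pair_sets[encoding]
        if not encoding_set:
            continue
        ratio = len(sample_pairs & encoding_set) / float(len(encoding_set))
        if ratio > best_ratio:
            best_encoding, best_ratio = encoding, ratio
    return best_encoding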

What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)

PS I know how generic 'representative' is, and how hard it is to qualify
some text as such, therefore the quotes. That is why I said 'the
longer, the better'.
 

Jon Willeke

Christos said:
This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.

This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows: ....
What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)

The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:

<http://nltk.sf.net/>

In particular, check out the probability tutorial.
 

Christos TZOTZIOY Georgiou

Christos TZOTZIOY Georgiou wrote:
This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows: ...
<snip>

[Jon]
The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:

<http://nltk.sf.net/>

In particular, check out the probability tutorial.

Thanks for the hint, and I am browsing the documentation now. However,
I'd like to create something that does not depend on external
Python libraries, so that anyone interested would just download a small
module that would, hopefully, do the job well.
 

David Eppstein

I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.
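A minimal sketch of that scoring scheme as described; the candidate list and the penalty value are placeholders of mine, and a decode failure is treated here as simply scoring the large negative penalty:

def simple_guess(data, candidates=('iso8859-1', 'iso8859-7', 'koi8-r')):
    PENALTY = 10 ** 6  # arbitrary large penalty for undecodable input
    best_encoding, best_score = None, None
    for encoding in candidates:
        try:
            decoded = data.decode(encoding)
        except UnicodeDecodeError:
            score = -PENALTY
        else:
            # count characters that decode to something alphabetic or whitespace
            score = sum(1 for c in decoded if c.isalpha() or c.isspace())
        if best_score is None or score > best_score:
            best_encoding, best_score = encoding, score
    return best_encoding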
 

John Roth

David Eppstein said:
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.

Shouldn't that be isalnum()? Or does your data not have
very many numbers?

John Roth
 

David Eppstein

"John Roth said:
Shouldn't that be isalphanum()? Or does your data not have
very many numbers?

It's only important if your text has many code positions which produce a
digit in one encoding and not in another, and which are hard to
disambiguate using isalpha() alone. I haven't encountered that
situation.
 

Christos TZOTZIOY Georgiou


As far as I understand, this function tests whether its argument is
valid Unicode text, so it has little to do with the issue I brought up:
take a Python string (8-bit bytes) and try to guess its encoding (e.g.
iso8859-1, iso8859-7, etc.).

There must be a similar function behind the "auto guess encoding"
feature of MS Internet Explorer; however:

1. even if it is exported and usable under Windows, it is not platform
independent

2. its guessing success rate (at least up to IE 5.5, which I happen to
use) is not very high

<snip>

Thanks for your reply, anyway.
 

Christos TZOTZIOY Georgiou

David Eppstein said:
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.

Somebody (by email only so far) has suggested that spambayes could be
used for the task... perhaps they're right; however, that is not as simple
and independent a solution as I would like to deliver.

I believe that your idea of a score is a good one; I feel that the
key should be two-char combinations, but I'll have to compare the
success rate of both one-char and two-char keys.

I'll try to search for "representative" texts on the web for as many
encodings as I can; any pointers, links from non-english speakers would
be welcome in the thread.
 

David Eppstein

Christos TZOTZIOY Georgiou said:
I believe that your idea of a score is a good one; I feel that the
key should be two-char combinations, but I'll have to compare the
success rate of both one-char and two-char keys.
<snip>

BTW, if you're going to implement the single-char version, at least for
encodings that translate one byte -> one unicode position (e.g., not
utf8), and your texts are large enough, it will be faster to precompute
a table of byte frequencies in the text and then compute the score by
summing the frequencies of alphabetic bytes.
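A sketch of that shortcut: the byte histogram is computed once per text, and the set of bytes that count as alphabetic (or whitespace) under each encoding would in practice be precomputed as well; the function names here are mine:

def byte_histogram(data):
    # frequency of every byte in the text; computed once, reused per encoding
    counts = {}
    for byte in data:
        counts[byte] = counts.get(byte, 0) + 1
    return counts

def alpha_space_bytes(encoding):
    # bytes that decode to an alphabetic or whitespace character in `encoding`
    # (in practice this would be precomputed per encoding, not rebuilt each call)
    result = set()
    for byte in xrange(256):
        char = chr(byte)
        try:
            decoded = char.decode(encoding)
        except UnicodeDecodeError:
            continue
        if decoded.isalpha() or decoded.isspace():
            result.add(char)
    return result

def frequency_score(histogram, encoding):
    # sum of the frequencies of the alphabetic/whitespace bytes
    return sum(histogram.get(byte, 0) for byte in alpha_space_bytes(encoding))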
 

Christos TZOTZIOY Georgiou

David Eppstein said:
BTW, if you're going to implement the single-char version, at least for
encodings that translate one byte -> one unicode position (e.g., not
utf8), and your texts are large enough, it will be faster to precompute
a table of byte frequencies in the text and then compute the score by
summing the frequencies of alphabetic bytes.

Thanks for the pointer, David. However, as often happens, I came
second (or, probably, n-th :). Seo Sanghyeon sent a URL that includes a
two-char proposal, and it provides an algorithm in section 4.7.1 that I
find appropriate for this matter:

http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
 
