Christos TZOTZIOY Georgiou
This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.
This could be implemented as a function in codecs.py (let's call it
"wild_guess"), based on some pre-calculated data. This pre-calculated
data would be produced as follows:
1. Create a dictionary (key: encoding, value: set of valid bytes for the
encoding)
1a. the sets can be constructed by trial and error:
def valid_bytes(encoding):
    result = set()
    for byte in xrange(256):
        char = chr(byte)
        try:
            char.decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(char)
    return result
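For reference, a sketch of the same trial-and-error construction in
today's Python 3, where a one-byte bytes object is decoded instead of a
str (the name `valid_bytes3` is mine, just to keep it apart from the
Python 2 version above):

```python
def valid_bytes3(encoding):
    """Set of byte values (0-255) that decode successfully in `encoding`."""
    result = set()
    for byte in range(256):
        try:
            bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(byte)
    return result
```

For example, valid_bytes3('ascii') contains exactly the 128 values 0-127,
while latin-1 accepts all 256.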
2. for every 8-bit encoding, some "representative" text is given (the
longer, the better)
2a. The following function is a quick generator of all two-char
sequences from its string argument; it can be used both for the
production of the pre-calculated data and for the analysis of a given
string in the 'wild_guess' function.
import itertools

def str_window(text):
    return itertools.imap(
        text.__getslice__, xrange(0, len(text) - 1), xrange(2, len(text) + 1)
    )
So for every encoding and 'representative' text, a bag of two-char
sequences and their frequencies is calculated. {frequencies[encoding] =
dict(key: two-chars, value: count)}
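In modern Python the bag of step 2 is a one-liner with
collections.Counter; a sketch (the generator `bigrams` here stands in
for the Py2 `str_window` above):

```python
from collections import Counter

def bigrams(text):
    # All overlapping two-character windows of `text`.
    return (text[i:i + 2] for i in range(len(text) - 1))

def bigram_counts(text):
    # frequencies[encoding] would map each two-char sequence to its count.
    return Counter(bigrams(text))
```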
2b. do a lengthy comparison of the bags in order to find the most common
two-char sequences that, as a set, can be considered unique for the
specific encoding.
2c. For every encoding, keep only a set of the (chosen in step 2b)
two-char sequences that were judged as 'representative'. Store these
calculated sets plus those from step 1a as Python code in a helper
module to be imported from codecs.py for the wild_guess function
(regenerate the helper module every time some 'representative' text is
added or modified).
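One possible heuristic for step 2b (by no means the only one, and the
function name and `top` cutoff are my own invention): keep each
encoding's most frequent bigrams and discard any bigram that also shows
up in another encoding's top set.

```python
from collections import Counter

def representative_sets(freqs, top=50):
    """freqs: dict mapping encoding -> Counter of bigram frequencies.
    Returns dict mapping encoding -> set of bigrams that are in that
    encoding's `top` most common bigrams and in no other encoding's."""
    tops = {enc: {bg for bg, _ in c.most_common(top)}
            for enc, c in freqs.items()}
    result = {}
    for enc, bag in tops.items():
        # Union of every *other* encoding's top bigrams.
        others = set().union(*(b for e, b in tops.items() if e != enc))
        result[enc] = bag - others
    return result
```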
3. write the wild_guess function
3a. the function 'wild_guess' would first construct a set from its
argument:
    sample_set = set(argument)
and by set operations against the sets from step 1a, we can exclude
codecs where the sample set is not a subset of the encoding valid set.
I don't expect that this step would exclude many encodings, but I think
it should not be skipped.
3b. pass the argument through the str_window function, and construct a
set of all two-char sequences
3c. from all sets from step 2c, find the one whose intersection with the
set from 3b is largest as a ratio of len(intersection)/len(encoding_set),
and suggest the corresponding encoding.
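Putting 3a-3c together, wild_guess might look roughly like this in
Python 3 (VALID_BYTES and REPRESENTATIVE would be the precomputed tables
from steps 1a and 2c; parameter names here are only illustrative):

```python
def wild_guess(text, valid_bytes, representative):
    """Suggest an encoding for `text`.
    valid_bytes: encoding -> set of valid single characters (step 1a)
    representative: encoding -> set of characteristic bigrams (step 2c)"""
    sample_set = set(text)
    # 3a: keep only encodings whose valid set covers every char in the sample.
    candidates = [enc for enc in representative
                  if sample_set <= valid_bytes.get(enc, set())]
    # 3b: all two-char windows of the sample.
    sample_bigrams = {text[i:i + 2] for i in range(len(text) - 1)}
    # 3c: score by the ratio of matched representative bigrams.
    best, best_score = None, -1.0
    for enc in candidates:
        enc_set = representative[enc]
        if not enc_set:
            continue
        score = len(sample_bigrams & enc_set) / len(enc_set)
        if score > best_score:
            best, best_score = enc, score
    return best
```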
What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash.
PS I know how generic 'representative' is, and how hard it is to qualify
some text as such, therefore the quotes. That is why I said 'the
longer, the better'.