trying to strip out non ascii.. or rather convert non ascii

wxjmfauth · Oct 30, 2013

On Wednesday, 30 October 2013 at 03:17:21 UTC+1, Chris Angelico wrote:

His idea of bad handling is "oh how terrible, ASCII and BMP have
optimizations". He hates the idea that it could be better in some
areas instead of even timings all along. But the FSR actually has some
distinct benefits even in the areas he's citing - watch this:

0.3582399439035271

The first two examples are his examples done on my computer, so you
can see how all four figures compare. Note how testing for the
presence of a non-Latin1 character in an 8-bit string is very fast.
Same goes for testing for non-BMP character in a 16-bit string. The
difference gets even larger if the string is longer:

2.8308718007456264

Wow! The FSR speeds up searches immensely! It's obviously the best
thing since sliced bread!

ChrisA

---------

It is not easy to make comparisons with all these
methods and characters (lookup depending on the position
in the table, ...). The only thing that can be done and
observed is the tendency between the subsets the FSR
artificially creates.
One can use the best algorithms to adjust bytes, but it is
very hard to escape from the fact that if one manipulates
two strings with different internal representations, it
is necessary to find a way to have a "common internal
coding" prior to manipulation.
It seems to me that this FSR, with its "negative logic",
is always attempting to "optimize" for the worst
case instead of "optimizing" for the best case.
This kind of effect shows up clearly on the memory side.
Compare UTF-8, which has a memory optimization on
a per-code-point basis, with the FSR, which has an
optimization based on subsets (one of its purposes).
1020
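
For anyone who wants to see the subset-based sizing being argued about, here is a small sketch. It assumes CPython 3.3 or later, where PEP 393 (the FSR) stores a string with 1, 2 or 4 bytes per code point depending on the widest character it contains. The helper name bytes_per_char is made up for illustration, and the fixed object overhead differs between versions, so only the growth per added character is measured.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per code point by growing a string by one character.

    Hypothetical helper; it relies on CPython's PEP 393 layout (an assumption).
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

for ch in ("a", "é", "€", "\U0001F600"):
    fsr = bytes_per_char(ch)           # width is chosen for the whole string
    utf8 = len(ch.encode("utf-8"))     # width is chosen per code point
    print(f"U+{ord(ch):04X}  FSR: {fsr} byte(s)/char  UTF-8: {utf8} byte(s)/char")
```

The trade-off under discussion: with the FSR a single non-BMP character widens every character in that string, while UTF-8 only pays for the one character, at the price of losing constant-time indexing.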

jmf

wxjmfauth · Oct 30, 2013

On Tuesday, 29 October 2013 at 06:24:50 UTC+1, Steven D'Aprano wrote:

The right solution to that is to treat it no differently from other fuzzy
searches. A good search engine should be tolerant of spelling errors and
alternative spellings for any letter, not just those with diacritics.
Ideally, a good search engine would successfully match all three of
"naïve", "naive" and "niave", and it shouldn't rely on special handling
of diacritics.

------

This is nonsense. The purpose of a diacritical mark is to
make a letter a different letter. If a tool is supposed to
match an ô, there is absolutely no reason to match something
else.

jmf

Mark Lawrence · Oct 30, 2013

I think it is much better just to ignore this nonsense instead of asking for evidence you know you will never get.

A good point, but note he doesn't have the courage to reply to me but
always to others. I guess he spends a lot of time clucking, not because
he's run out of supplies, but because he's simply a chicken.

Ned Batchelder · Oct 30, 2013

On Tuesday, 29 October 2013 at 06:24:50 UTC+1, Steven D'Aprano wrote:
------

This is nonsense. The purpose of a diacritical mark is to
make a letter a different letter. If a tool is supposed to
match an ô, there is absolutely no reason to match something
else.

jmf

jmf, Tim Chase described his use case, and it seems reasonable to me.
I'm not sure why you would describe it as nonsense.

--Ned.

Mark Lawrence · Oct 30, 2013

On Wednesday, 30 October 2013 at 03:17:21 UTC+1, Chris Angelico wrote:

---------

It is not easy to make comparisons with all these
methods and characters (lookup depending on the position
in the table, ...). The only thing that can be done and
observed is the tendency between the subsets the FSR
artificially creates.
One can use the best algorithms to adjust bytes, but it is
very hard to escape from the fact that if one manipulates
two strings with different internal representations, it
is necessary to find a way to have a "common internal
coding" prior to manipulation.
It seems to me that this FSR, with its "negative logic",
is always attempting to "optimize" for the worst
case instead of "optimizing" for the best case.
This kind of effect shows up clearly on the memory side.
Compare UTF-8, which has a memory optimization on
a per-code-point basis, with the FSR, which has an
optimization based on subsets (one of its purposes).

1020

jmf

How do these figures compare to the ones quoted here
https://mail.python.org/pipermail/python-dev/2011-September/113714.html ?

wxjmfauth · Oct 30, 2013

On Wednesday, 30 October 2013 at 13:44:47 UTC+1, Ned Batchelder wrote:

jmf, Tim Chase described his use case, and it seems reasonable to me.
I'm not sure why you would describe it as nonsense.

--Ned.

--------

My comment had nothing to do with Python, it was a
general comment. A diacritical mark just makes a letter
a different letter; an "ï" and an "i" are "as
different" as an "a" from a "z". A diacritical mark
is more than a simple ornamentation.

From a Unicode perspective.
Unicode.org "knows" these chars are very important; that's
the reason why they exist in two forms, precomposed and
composed forms.

From a software perspective.
Luckily for the end users, all serious software
treats all these chars in an equal way. They
all belong to the BMP. An "Ą" is treated
like an "ê": same memory consumption, same performance,
==> very smooth software.

jmf

Mark Lawrence · Oct 30, 2013

On 30/10/2013 16:08, (e-mail address removed) wrote:

Would you please read, digest and action this
https://wiki.python.org/moin/GoogleGroupsPython

TIA.

Ned Batchelder · Oct 30, 2013

On Wednesday, 30 October 2013 at 13:44:47 UTC+1, Ned Batchelder wrote:
--------

My comment had nothing to do with Python, it was a
general comment. A diacritical mark just makes a letter
a different letter; an "ï" and an "i" are "as
different" as an "a" from a "z". A diacritical mark
is more than a simple ornamentation.

Yes, we understand that. Tim outlined a need that had to do with users'
informal typing. In his case, he needs to deal with that sloppiness.
You can't simply insist that users be more precise.

Unicode is a way to represent text, and text gets used in many different
ways. Each of us has to acknowledge that our text needs may be
different than someone else's. jmf, I'm guessing from your comments
over the last few months that you are doing detailed linguistic work
with corpora in many languages. That work leads to one style of Unicode
use. In your domain, it is "nonsense" to ignore diacriticals.

Other people do different kinds of work with Unicode, and that leads to
different needs. In Tim's system, it is important to ignore
diacriticals. You might not have a use personally for Tim's system.
That doesn't make it nonsense.

--Ned.

Michael Torrie · Oct 30, 2013

My comment had nothing to do with Python, it was a
general comment. A diacritical mark just makes a letter
a different letter; an "ï" and an "i" are "as
different" as an "a" from a "z". A diacritical mark
is more than a simple ornamentation.

That's nice, but you didn't actually read what Ned said (or the OP).
The OP doesn't care that "ï" and "i" are as different as "a" and "z".
For the purposes of his search he wants them treated as the same
letter. A fuzzy search treats them all the same. For example, a
search for "Godel, Escher, Bach" should find "Gödel, Escher, Bach" just
fine, even though "o" and "ö" are different characters. And lo and
behold, Google actually does this! Try it. It's nice for those of us
who want to find something and whose US keyboards don't have the right marks.

https://www.google.ca/search?q=godel+escher+bach

After all this nonsense, that's what the original poster is looking for
(I think... can't be sure since it's been so many days now). Seems to
me a python module does this quite nicely:

https://pypi.python.org/pypi/Unidecode
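
For the simple accented-Latin case, something similar can be sketched with the standard library alone. This is not Unidecode and does not use its API; it is a minimal approximation that decomposes each character and drops the combining marks. The function names strip_diacritics and fold, and the sample titles, are invented for the example.

```python
import unicodedata

def strip_diacritics(text):
    """Drop combining marks after compatibility decomposition (NFKD)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def fold(text):
    """Hypothetical search key: no diacritics, case-insensitive."""
    return strip_diacritics(text).casefold()

# Made-up sample data for the demonstration.
titles = ["Gödel, Escher, Bach", "Naïve Set Theory"]
query = "godel"
matches = [t for t in titles if fold(query) in fold(t)]
print(matches)
```

Letters that have no decomposition, such as "ø" or "ß", pass through unchanged here, which is one reason a dedicated transliteration module can still be the better choice.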

wxjmfauth · Oct 30, 2013

On Wednesday, 30 October 2013 at 18:54:05 UTC+1, Michael Torrie wrote:

That's nice, but you didn't actually read what Ned said (or the OP).
The OP doesn't care that "ï" and "i" are as different as "a" and "z".
For the purposes of his search he wants them treated as the same
letter. A fuzzy search treats them all the same. For example, a
search for "Godel, Escher, Bach" should find "Gödel, Escher, Bach" just
fine, even though "o" and "ö" are different characters. And lo and
behold, Google actually does this! Try it. It's nice for those of us
who want to find something and whose US keyboards don't have the right marks.

https://www.google.ca/search?q=godel+escher+bach

After all this nonsense, that's what the original poster is looking for
(I think... can't be sure since it's been so many days now). Seems to
me a python module does this quite nicely:

https://pypi.python.org/pypi/Unidecode

OK. You are right. I recognize my mistake. Independently
of the original poster's task, I had not understood it in that
way.

Let's say it depends on the context: for a general
search engine, it's good that diacritics are ignored.
For, let's say, a text processing system, it's good
to have only precise matches. That does not mean other
matching possibilities cannot exist.

jmf

Terry Reedy · Oct 30, 2013

From a Unicode perspective.
Unicode.org "knows" these chars are very important; that's
the reason why they exist in two forms, precomposed and
composed forms.

Only some chars have both forms. I believe the precomposed forms are
partly a historical accident of what precomposed forms were in the
various latin-1 sets.
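
A quick way to see both points, using only the standard unicodedata module: "ô" has a precomposed code point and an equivalent base-plus-combining-mark sequence, while a combination such as q with circumflex (picked here purely as an example) has no precomposed form at all.

```python
import unicodedata

precomposed = "\u00f4"                                   # ô as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)   # o + COMBINING CIRCUMFLEX ACCENT

print([f"U+{ord(c):04X}" for c in precomposed])
print([f"U+{ord(c):04X}" for c in decomposed])

# NFC puts the pieces back together when a precomposed code point exists.
recomposed = unicodedata.normalize("NFC", decomposed)

# No precomposed q-with-circumflex exists, so NFC leaves two code points.
no_precomposed = unicodedata.normalize("NFC", "q\u0302")
print(len(recomposed), len(no_precomposed))
```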

Roy Smith · Oct 30, 2013

Michael Torrie said:
That's nice, but you didn't actually read what Ned said (or the OP).
The OP doesn't care that "ï" and "i" are as different as "a" and "z".
For the purposes of his search he wants them treated as the same
letter. A fuzzy search treats them all the same.

That's one definition of fuzzy. But, there's nothing that says you
can't build a fuzzy matching algorithm which considers some mismatches
to be worse than others.

For example, it's reasonable to consider any vowel (or string of vowels,
for that matter) to be closer to another vowel than to a consonant. A
great example is the word, "bureaucrat". As far as I'm concerned, it's
spelled {b, vowels, r, vowels, c, r, a, t}. It usually takes me three
or four tries to get auto-correct to even recognize what I'm trying to
type and fix it for me.

Likewise for pairs like {c, s}, {j, g}, {v, w}, and so on.

In that spirit, I would think that a, á, and â would all be considered
more conservative replacements for each other than they would be for k,
x, or z.
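
One cheap way to act on the "b, vowels, r, vowels, ..." idea is to collapse every run of vowels into a placeholder and compare the results. This is only a rough sketch: the name vowel_skeleton, the "*" placeholder and the ASCII-only vowel set are all arbitrary choices made for the example.

```python
import re

def vowel_skeleton(word):
    """Replace each run of (ASCII) vowels with a single '*' placeholder."""
    return re.sub(r"[aeiou]+", "*", word.lower())

attempts = ["bureaucrat", "buraeucrat", "beaurocrat", "burocrat"]
for attempt in attempts:
    print(attempt, "->", vowel_skeleton(attempt))
```

All four spellings collapse to the same skeleton, so a lookup keyed on it would find the right word whichever vowels were typed.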

Steven D'Aprano · Oct 31, 2013

This is nonsense. The purpose of a diacritical mark is to make a
letter a different letter. If a tool is supposed to match an ô, there is
absolutely no reason to match something else.

I'm glad that you know so much better than Google, Bing, Yahoo, and other
search engines. When I search for "mispealled" Google gives me:

Showing results for misspelled
Search instead for mispealled

But I see now that this is nonsense and there is *absolutely no reason*
to match something other than the ecaxt wrods I typed.

Perhaps you should submit a bug report to Google:

"When I mistype a word, Google correctly gives me the search results I
wanted, instead of the wrong results I didn't want."

Mark Lawrence · Oct 31, 2013

I'm glad that you know so much better than Google, Bing, Yahoo, and other
search engines. When I search for "mispealled" Google gives me:

Showing results for misspelled
Search instead for mispealled

But I see now that this is nonsense and there is *absolutely no reason*
to match something other than the ecaxt wrods I typed.

Perhaps you should submit a bug report to Google:

"When I mistype a word, Google correctly gives me the search results I
wanted, instead of the wrong results I didn't want."

I'm sorry Steven but you're completely out of your depth here. Please
bow down to the superior intellect of jmf, where jm is for Joseph McCarthy.

wxjmfauth · Oct 31, 2013

On Thursday, 31 October 2013 at 08:10:18 UTC+1, Steven D'Aprano wrote:

I'm glad that you know so much better than Google, Bing, Yahoo, and other
search engines. When I search for "mispealled" Google gives me:

Showing results for misspelled
Search instead for mispealled

But I see now that this is nonsense and there is *absolutely no reason*
to match something other than the ecaxt wrods I typed.

Perhaps you should submit a bug report to Google:

"When I mistype a word, Google correctly gives me the search results I
wanted, instead of the wrong results I didn't want."

As far as I know, I recognized my mistake. I had more
text processing systems in mind, than search engines.

I can even tell you, I am really stupid. I wrote pure
Unicode software to sort French or German strings.

Pure unicode == independent from any locale.

jmf

Tim Chase · Oct 31, 2013

For example, it's reasonable to consider any vowel (or string of
vowels, for that matter) to be closer to another vowel than to a
consonant. A great example is the word, "bureaucrat". As far as
I'm concerned, it's spelled {b, vowels, r, vowels, c, r, a, t}. It
usually takes me three or four tries to get auto-correct to even
recognize what I'm trying to type and fix it for me.

[glad I'm not the only one who has trouble spelling "bureaucrat"]

Steven D'Aprano wisely mentioned elsewhere in the thread that "The
right solution to that is to treat it no differently from other fuzzy
searches. A good search engine should be tolerant of spelling errors
and alternative spellings for any letter, not just those with
diacritics."

Often the Levenshtein distance is used for calculating closeness, and
the off-the-shelf algorithm assigns a cost of one per difference
(addition, change, or removal). It doesn't sound like it would be
that hard[1] to assign varying costs based on what character was
added/changed/removed. A diacritic might have a cost of N while a
similar character (vowel->vowel or consonant->consonant, or
consonant-cluster shift) might have a cost of 2N, and a totally
arbitrary character shift might have a cost of 3N (or higher).
Unfortunately, the Levenshtein algorithm is already O(M*N) slow and
can't be reasonably precalculated without knowing both strings, so
this just ends up heaping additional lookups/comparisons atop
already-slow code.
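
A rough sketch of that weighted variant, to make the idea concrete. Everything specific here is an assumption: the ASCII vowel set, treating vowel-to-consonant changes as the "arbitrary" 3N case, charging 3N for insertions and deletions, and the function names. It is still the O(M*N) dynamic programme, with a cost lookup per cell.

```python
import unicodedata

VOWELS = set("aeiou")  # assumption: ASCII vowels only

def base(ch):
    """Base letter of a character: diacritics and case removed."""
    return unicodedata.normalize("NFD", ch)[0].lower()

def substitution_cost(a, b, n=1):
    if a == b:
        return 0
    if base(a) == base(b):
        return n        # differs only by a diacritic (or case)
    if (base(a) in VOWELS) == (base(b) in VOWELS):
        return 2 * n    # vowel -> vowel or consonant -> consonant
    return 3 * n        # anything else

def weighted_levenshtein(s, t, n=1):
    indel = 3 * n  # assumed cost of adding or removing a character
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * indel]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + indel,       # delete a
                           cur[j - 1] + indel,    # insert b
                           prev[j - 1] + substitution_cost(a, b, n)))
        prev = cur
    return prev[-1]

for other in ("naive", "niave", "knave"):
    print("naïve", other, weighted_levenshtein("naïve", other))
```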

-tkc

[1]
http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications


Steven D'Aprano · Nov 1, 2013

On Thursday, 31 October 2013 at 08:10:18 UTC+1, Steven D'Aprano wrote:

I'm glad that you know so much better than Google, Bing, Yahoo, and
other search engines. When I search for "mispealled" Google gives me:

[...]
As far as I know, I recognized my mistake. I had more text processing
systems in mind, than search engines.

Yes, you have, I acknowledge that now. I see now that at the time I made
my response to you, you had already replied recognising your error.
Unfortunately I had not seen that. So in that case, I withdraw my
comments and apologize.

I can even tell you, I am really stupid. I wrote pure Unicode software
to sort French or German strings.

Pure unicode == independent from any locale.

Unfortunately it is not that simple. The same code point can have
different meanings in different languages, and should be treated
differently when sorting. The natural Unicode sort order satisfies very
few European languages, including English. A few examples:

* Swedish ä is a distinct letter of the alphabet, appearing
after z: "a b c z ä" is sorted according to Swedish rules.
But in German ä is considered to be the letter 'a' plus an
umlaut, and is collated after 'a': "a ä b c z" is sorted
according to German rules.

* In German ö is considered to be a variant of o, equivalent
to 'oe', while in Finnish ö is a distinct letter which
cannot be expanded to 'oe', and which appears at the end
of the alphabet.

* Similarly, in modern English æ is a ligature of ae, while in
Danish and Norwegian it is a distinct letter of the alphabet
appearing after z: in English dictionaries, "Æsir" will be
found with other "A" words, often expanded to "Aesir", while
in Norwegian it will be found after "Z" words.

* Most European languages convert uppercase I to lowercase i,
but Turkish has distinct letters for dotted and dotless I.
According to Turkish rules, lowercase(I) is ı and uppercase(i)
is İ.

While it is true that the Unicode character set is independent of locale,
for natural processing of characters, it isn't enough to just use Unicode.
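
The ä example can be reproduced in a few lines. Plain sorted() compares code points, which for this particular list happens to look like the Swedish order; a key that compares base letters first gives the German-style order. The base_letters helper is invented for this sketch and is in no way a real collation implementation; a proper one would use locale-specific rules.

```python
import unicodedata

def base_letters(s):
    """Strip combining marks so that 'ä' compares as 'a' (illustrative helper)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

letters = ["z", "ä", "c", "a", "b"]

# Code point order: U+00E4 (ä) comes after U+007A (z).
by_code_point = sorted(letters)

# German-like: compare base letters first, use the original string as tie-breaker.
german_like = sorted(letters, key=lambda s: (base_letters(s), s))

print(by_code_point)
print(german_like)

# Case conversion on str is locale-independent too, so the Turkish rule is not applied.
print("I".lower(), "i".upper())
```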

wxjmfauth · Nov 1, 2013

I'm aware of all the points you gave. That's why
I wrote "French or German strings".

The hard task is not on the side of Unicode or sorting,
it is on the creation of key(s) used for sorting.

E.g., cote, côte, coté, côté. French editors do not all
sort these words in the same way (diacritics).
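
To illustrate why the key is the hard part, here is a sketch with two possible keys for those four words. Both compare the unaccented letters first; they differ only in whether accents are then compared from the start of the word or from the end. The function names are made up, and neither key is claimed to be what any particular dictionary uses.

```python
import unicodedata

def split_accents(word):
    """Return (base letters, tuple of combining marks per base letter)."""
    base, marks = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and base:
            marks[-1] += ch          # attach the mark to the preceding letter
        else:
            base.append(ch)
            marks.append("")
    return "".join(base), tuple(marks)

def forward_key(word):
    base, marks = split_accents(word)
    return base, marks               # accents compared left to right

def backward_key(word):
    base, marks = split_accents(word)
    return base, marks[::-1]         # accents compared from the end of the word

words = ["côté", "cote", "coté", "côte"]
print(sorted(words, key=forward_key))
print(sorted(words, key=backward_key))
```

The two keys disagree on where "côte" and "coté" go, which is exactly the kind of difference being described.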

jmf

PS A *real* case to test the FSR.

Mark Lawrence · Nov 1, 2013

On 01/11/2013 09:00, (e-mail address removed) wrote:

I'll ask again, would you please read, digest and action this
https://wiki.python.org/moin/GoogleGroupsPython
