Trying to strip out non-ASCII... or rather, convert non-ASCII

bruce

hi..

getting some files via curl, and want to convert them from what I'm
guessing is Unicode.

I'd like to convert a string like this::
<div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcántar,
Iliana</a></div>

to::
<div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcantar,
Iliana</a></div>

where I convert the
" á " to " a"

which appears to be a shift of 128, but I'm not sure how to accomplish this...
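
(As a quick check of that "shift of 128" guess: "á" is code point U+00E1
(225) and "a" is U+0061 (97), so the gap really is 128. Python 3
interactive session:)

>>> ord("á"), ord("a")
(225, 97)
>>> ord("á") - ord("a")
128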

I've tested the different decode/encode functions with utf-8/ascii, but
with no luck.

I've reviewed Stack Overflow, as well as a few other sites, but
haven't hit the aha moment.

pointers/comments would be welcome.

thanks
 

Steven D'Aprano

hi..

getting some files via curl, and want to convert them from what I'm
guessing is Unicode.

I'd like to convert a string like this:: <div class="profName"><a
href="ShowRatings.jsp?tid=1312168">Alcántar, Iliana</a></div>

to::
<div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcantar,
Iliana</a></div>

where I convert the
" á " to " a"

Why on earth would you want to throw away perfectly good information?
It's 2013, not 1953, and if you're still unable to cope with languages
other than English, you need to learn new skills.

(Actually, not even English, since ASCII doesn't even support all the
characters used in American English, let alone British English. ASCII was
broken from the day it was invented.)

Start by getting some understanding:

http://www.joelonsoftware.com/articles/Unicode.html


Then read this post from just over a week ago:

https://mail.python.org/pipermail/python-list/2013-October/657827.html
 

Dennis Lee Bieber

Why on earth would you want to throw away perfectly good information?
It's 2013, not 1953, and if you're still unable to cope with languages
other than English, you need to learn new skills.

(Actually, not even English, since ASCII doesn't even support all the
characters used in American English, let alone British English. ASCII was
broken from the day it was invented.)

Compared to Baudot, both ASCII and EBCDIC were probably considered
wondrous.
 

Roy Smith

Dennis Lee Bieber said:
Compared to Baudot, both ASCII and EBCDIC were probably considered
wondrous.

Wondrous, indeed. Why would anybody ever need more than one case of
the alphabet? It's almost as absurd as somebody wanting to put funny
little marks on top of their vowels.
 

Tim Chase

Why on earth would you want to throw away perfectly good
information?

The main reason I've needed to do it in the past is for normalization
of search queries. When a user wants to find something containing
"pingüino", I want to have those results come back even if they type
"pinguino" in the search box.

For the same reason searches are often normalized to ignore case.
The difference between "Polish" and "polish" is visually just
capitalization, but most folks don't think twice about

if term.upper() in datum.upper():
    it_matches()

I'd be just as happy if Python provided a "sloppy string compare"
that ignored case, diacritical marks, and the like.

unicode_haystack1 = u"pingüino"
unicode_haystack2 = u"¡Miré un pingüino!"
needle = u"pinguino"
if unicode_haystack1.sloppy_equals(needle):
    it_matches()
if unicode_haystack2.sloppy_contains(needle):
    it_contains()

As a matter of fact, I'd be even happier if Python did the heavy
lifting, since I wouldn't have to think about whether I want my code
to force upper-vs-lower for the comparison. :)
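
A sketch of how such helpers could work today, using NFKD decomposition
plus case folding (Python 3.3+ for str.casefold; the function names just
mirror the imaginary methods above):

import unicodedata

def _sloppy(s):
    # Decompose, drop the combining marks, then fold case.
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

def sloppy_equals(a, b):
    return _sloppy(a) == _sloppy(b)

def sloppy_contains(haystack, needle):
    return _sloppy(needle) in _sloppy(haystack)

sloppy_contains(u"¡Miré un pingüino!", u"PINGUINO")   # True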

-tkc
 

Roy Smith

Tim Chase said:
I'd be just as happy if Python provided a "sloppy string compare"
that ignored case, diacritical marks, and the like.

The problem with putting fuzzy matching in the core language is that
there is no general agreement on how it's supposed to work.

There are, however, third-party libraries which do fuzzy matching. One
popular one is jellyfish (https://pypi.python.org/pypi/jellyfish/0.1.2).
Don't expect you can just download and use it right out of the box,
however. You'll need to do a little thinking about which of the several
algorithms it includes makes sense for your application.
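
For instance, plain edit distance already treats accented and unaccented
spellings as near-misses (a sketch only; it assumes jellyfish's
levenshtein_distance and damerau_levenshtein_distance functions, and that
your installed version accepts these string types):

import jellyfish

# One substitution apart: the accent is the only difference.
jellyfish.levenshtein_distance(u"pinguino", u"pingüino")      # 1
# Plain Levenshtein charges 2 for a transposition...
jellyfish.levenshtein_distance(u"naive", u"niave")            # 2
# ...while Damerau-Levenshtein counts it as a single edit.
jellyfish.damerau_levenshtein_distance(u"naive", u"niave")    # 1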

So, for example, you probably expect U+004E (Latin Capital Letter N) to
match U+006E (Latin Small Letter N). But, what about these (all cribbed
from Wikipedia):

U+00D1 Ñ Latin Capital Letter N with tilde
U+00F1 ñ Latin Small Letter N with tilde
U+0143 Ń Latin Capital Letter N with acute
U+0144 ń Latin Small Letter N with acute
U+0145 Ņ Latin Capital Letter N with cedilla
U+0146 ņ Latin Small Letter N with cedilla
U+0147 Ň Latin Capital Letter N with caron
U+0148 ň Latin Small Letter N with caron
U+0149 ʼn Latin Small Letter N preceded by apostrophe
U+014A Ŋ Latin Capital Letter Eng
U+014B ŋ Latin Small Letter Eng
U+019D Ɲ Latin Capital Letter N with left hook
U+019E ƞ Latin Small Letter N with long right leg
U+01CA Ǌ Latin Capital Letter NJ
U+01CB ǋ Latin Capital Letter N with Small Letter J
U+01CC ǌ Latin Small Letter NJ
U+0235 ȵ Latin Small Letter N with curl

I can't even begin to guess if they should match for your application.
 

Steven D'Aprano

Wondrous, indeed. Why would anybody ever need more than one case of
the alphabet? It's almost as absurd as somebody wanting to put funny
little marks on top of their vowels.

Vwls? Wh wst tm wrtng dwn th vwls?
 

Tim Chase

The problem with putting fuzzy matching in the core language is
that there is no general agreement on how it's supposed to work.

There are, however, third-party libraries which do fuzzy matching.
One popular one is jellyfish
(https://pypi.python.org/pypi/jellyfish/0.1.2).

Bookmarking and archiving your email for future reference.

Don't expect you can just download and use it right out of the box,
however. You'll need to do a little thinking about which of the
several algorithms it includes makes sense for your application.

I'd be content with a baseline that decomposes the text and then strips
out the combining diacritical marks, something akin to MRAB's

from unicodedata import normalize
"".join(c for c in normalize("NFKD", s) if ord(c) < 0x80)

and tweaking it if that was insufficient.
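
Wrapped up for the original poster's case, that baseline might look like
this (a sketch, Python 3 assumed; strip_to_ascii is just a name I'm
inventing here):

from unicodedata import normalize

def strip_to_ascii(s):
    # NFKD splits characters like á into 'a' plus a combining accent,
    # then anything outside ASCII (the accents included) gets dropped.
    return "".join(c for c in normalize("NFKD", s) if ord(c) < 0x80)

strip_to_ascii(u"Alcántar, Iliana")   # 'Alcantar, Iliana'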

Thanks for the link to Jellyfish.

-tkc
 

Nobody

I'd be just as happy if Python provided a "sloppy string compare"
that ignored case, diacritical marks, and the like.

Simply ignoring diacritics won't get you very far.

Most languages which use diacritics have standard conversions, e.g.
ö -> oe, which are likely to be used by anyone familiar with the
language, e.g. when using software (or a keyboard) which can't handle
diacritics.

OTOH, others (particularly native English speakers) may simply discard the
diacritic. So to be of much use, a fuzzy match needs to handle either
possibility.
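
One way a matcher could cover both conventions is to index every candidate
form (a sketch; the transliteration table is a made-up, German-only
fragment, and a real one would be per-language):

import unicodedata

# Hypothetical mini-table; extend per language as needed.
TRANSLIT = {u"ä": u"ae", u"ö": u"oe", u"ü": u"ue", u"ß": u"ss"}

def candidate_forms(s):
    s = s.lower()
    # Form 1: the standard transliteration (ö -> oe).
    translit = u"".join(TRANSLIT.get(c, c) for c in s)
    # Form 2: simply drop the diacritic (ö -> o).
    stripped = u"".join(c for c in unicodedata.normalize("NFKD", s)
                        if not unicodedata.combining(c))
    return {translit, stripped}

candidate_forms(u"Schrödinger")   # {'schroedinger', 'schrodinger'}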
 

wxjmfauth

On Sunday 27 October 2013 at 04:21:46 UTC+1, Nobody wrote:
Simply ignoring diacritics won't get you very far.

Right. As an example, these four French words:
cote, côte, coté, côté.

Most languages which use diacritics have standard conversions, e.g.
ö -> oe, which are likely to be used by anyone familiar with the
language, e.g. when using software (or a keyboard) which can't handle
diacritics.

I'm quite comfortable with Unicode, especially the
Latin blocks.
Apart from this German case (I remember very old typewriters),
what other languages allow this kind of conversion?

Just as a reminder: there are 1272 characters considered
to be Latin characters (how to count them is not a simple
task), and if my knowledge is correct, they cover, or are
there to cover, exactly the 17 European languages based on
a Latin alphabet which cannot be covered with iso-8859-1.

And of course, logically, these are very, very badly handled
by the Flexible String Representation.

jmf
 

Mark Lawrence

Just as a reminder: there are 1272 characters considered
to be Latin characters (how to count them is not a simple
task), and if my knowledge is correct, they cover, or are
there to cover, exactly the 17 European languages based on
a Latin alphabet which cannot be covered with iso-8859-1.

And of course, logically, these are very, very badly handled
by the Flexible String Representation.

jmf

Please provide us with evidence to back up your statement.
 

Tim Chase

Right. As an example, these four French words:
cote, côte, coté, côté.

Distinct words with distinct meanings, sure.

But when a naïve (naive? ☺) person or one without the easy ability
to enter characters with diacritics searches for "cote", I want to
return possible matches containing any of your 4 examples. It's
slightly fuzzier if they search for "coté", in which case they may
mean "coté" or they might mean be unable to figure out how to
add a hat and want to type "côté". Though I'd rather get more
results, even if it has some that only match fuzzily.
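
Concretely, the decompose-and-strip approach folds all four spellings onto
the same search key (a small sketch, Python 3 assumed):

from unicodedata import normalize, combining

def fold(s):
    return u"".join(c for c in normalize("NFKD", s)
                    if not combining(c)).lower()

words = [u"cote", u"côte", u"coté", u"côté"]
[w for w in words if fold(w) == fold(u"cote")]   # all four come back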

Circumflexually-circumspectly-yers,

-tkc
 

Steven D'Aprano

And of course, logically, these are very, very badly handled by the
Flexible String Representation.

I'm reminded of Cato the Elder, the Roman senator who would end every
speech, no matter the topic, with "Ceterum censeo Carthaginem esse
delendam" ("Furthermore, I consider that Carthage must be destroyed").

But at least he had the good grace to present that as an opinion, instead
of repeating a falsehood as if it were a fact.
 

Steven D'Aprano

Distinct words with distinct meanings, sure.

But when a naïve (naive? ☺) person or one without the easy ability to
enter characters with diacritics searches for "cote", I want to return
possible matches containing any of your 4 examples. It's slightly
fuzzier if they search for "coté", in which case they may mean "coté" or
they might be unable to figure out how to add a hat and want to
type "côté". Though I'd rather get more results, even if it has some
that only match fuzzily.

The right solution to that is to treat it no differently from other fuzzy
searches. A good search engine should be tolerant of spelling errors and
alternative spellings for any letter, not just those with diacritics.
Ideally, a good search engine would successfully match all three of
"naïve", "naive" and "niave", and it shouldn't rely on special handling
of diacritics.
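
One cheap way to get part of that tolerance from the standard library is to
fold away case and accents first, then let difflib score ordinary
misspellings (a sketch, not a search engine; fold is the same kind of
helper as above):

import difflib
import unicodedata

def fold(s):
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed
                   if not unicodedata.combining(c)).lower()

vocabulary = [u"naïve", u"côte", u"coté"]
folded = [fold(w) for w in vocabulary]             # ['naive', 'cote', 'cote']
difflib.get_close_matches(fold(u"niave"), folded)  # ['naive']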
 

wxjmfauth

On Tuesday 29 October 2013 at 06:22:27 UTC+1, Steven D'Aprano wrote:
I'm reminded of Cato the Elder, the Roman senator who would end every
speech, no matter the topic, with "Ceterum censeo Carthaginem esse
delendam" ("Furthermore, I consider that Carthage must be destroyed").

But at least he had the good grace to present that as an opinion, instead
of repeating a falsehood as if it were a fact.

0.26411553466961735

If you understand the coding of characters, Unicode
and what this FSR does, it is child's play to produce
gazillions of examples like this.

(Notice the use of a Dutch character instead of a boring €.)

jmf
 

Tim Chase

0.26411553466961735

That reads to me as "If things were purely UCS4 internally, Python
would normally take 0.264... seconds to execute this test, but core
devs managed to optimize a particular (lower 127 ASCII characters
only) case so that it runs in less than half the time."

Is this not what you intended to demonstrate? 'cuz that sounds
like a pretty awesome optimization to me.
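
(For reference, the "flexible" in Flexible String Representation refers to
storage width: under PEP 393 a string is stored with 1, 2 or 4 bytes per
character depending on the widest character it contains. A rough sketch,
Python 3.3+, sizes approximate and version-dependent:)

import sys

sys.getsizeof("x" * 1000)           # ~1 byte per character plus a header
sys.getsizeof("\u20ac" * 1000)      # ~2 bytes per character (BMP)
sys.getsizeof("\U0001F600" * 1000)  # ~4 bytes per character (astral)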

-tkc
 

wxjmfauth

On Tuesday 29 October 2013 at 16:52:49 UTC+1, Tim Chase wrote:
That reads to me as "If things were purely UCS4 internally, Python
would normally take 0.264... seconds to execute this test, but core
devs managed to optimize a particular (lower 127 ASCII characters
only) case so that it runs in less than half the time."

Is this not what you intended to demonstrate? 'cuz that sounds
like a pretty awesome optimization to me.

-tkc

--------

That's very naive. In fact, what happens is just the opposite.
The "best case" with the FSR is worse than the "worst case"
without the FSR.

And this is without even counting the effect that this poor
Python is spending its time switching from one internal
representation to another, not to mention the fact
that this has to be tested every time.
The more Unicode manipulations one applies, the more time
it demands.

Two tasks that come to my mind: re and normalization.
It's very interesting to observe what happens when one
normalizes Latin text and polytonic Greek text, both with
plenty of diacritics.

----

Something different, based on my previous example.

What is a European user supposed to think when she/he
sees she/he can be "penalized" by such an amount,
simply for using non-ASCII characters in a product
which is supposed to be "Unicode compliant"?

jmf
 

Mark Lawrence

On Tuesday 29 October 2013 at 16:52:49 UTC+1, Tim Chase wrote:

--------

That's very naive. In fact, what happens is just the opposite.
The "best case" with the FSR is worse than the "worst case"
without the FSR.

And this is without even counting the effect that this poor
Python is spending its time switching from one internal
representation to another, not to mention the fact
that this has to be tested every time.
The more Unicode manipulations one applies, the more time
it demands.

Two tasks that come to my mind: re and normalization.
It's very interesting to observe what happens when one
normalizes Latin text and polytonic Greek text, both with
plenty of diacritics.

----

Something different, based on my previous example.

What is a European user supposed to think when she/he
sees she/he can be "penalized" by such an amount,
simply for using non-ASCII characters in a product
which is supposed to be "Unicode compliant"?

jmf

Please provide hard evidence to support your claims or stop posting this
ridiculous nonsense. Give us real world problems that can be reported
on the bug tracker, investigated and resolved.
 

Piet van Oostrum

Mark Lawrence said:
Please provide hard evidence to support your claims or stop posting this
ridiculous nonsense. Give us real world problems that can be reported
on the bug tracker, investigated and resolved.

I think it is much better just to ignore this nonsense instead of asking for evidence you know you will never get.
 

Chris Angelico

You've stated above that logically unicode is badly handled by the fsr. You
then provide a trivial timing example. WTF???

His idea of bad handling is "oh how terrible, ASCII and BMP have
optimizations". He hates the idea that it could be better in some
areas instead of having even timings all along. But the FSR actually has some
distinct benefits even in the areas he's citing - watch this:
0.3582399439035271

The first two examples are his examples done on my computer, so you
can see how all four figures compare. Note how testing for the
presence of a non-Latin1 character in an 8-bit string is very fast.
Same goes for testing for non-BMP character in a 16-bit string. The
difference gets even larger if the string is longer:
2.8308718007456264

Wow! The FSR speeds up searches immensely! It's obviously the best
thing since sliced bread!
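
(A sketch of the general shape of that comparison, not the exact snippet
from the post. Because the FSR records the widest character a string
contains, asking whether a wider character occurs in it can be answered
without scanning:)

from timeit import timeit

setup = "s = 'x' * 100000"
# Searching an all-ASCII (1-byte) string for a non-Latin-1 character:
# the answer "no" falls straight out of the string's kind.
timeit("'\\u0101' in s", setup=setup, number=100000)
# Searching for an ASCII character that isn't there forces a full scan.
timeit("'y' in s", setup=setup, number=100000)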

ChrisA
 
