ascii to latin1


Luis P. Mendes


Hi,

I'm developing a django based intranet web server that has a search page.

Data contained in the database is mixed. Some of the words are
accented, some are not but they should be. This is because the
collection of data began a long time ago when ascii was the only way to go.

The problem is that users have to search more than once for some words,
because the word they search for may or may not be accented. If we consider
that some expressions can have several letters that could be accented, the
number of searches becomes too large.

I've searched the net for some kind of solution but couldn't find one. I've
only found solutions for the opposite direction.

example:
if the word searched is 'televisão', I want that a search by either
'televisao', 'televisão' or even 'télévisao' (this last one doesn't
exist in Portuguese) is successful.

So, instead of only one search, several have to be run.

Is there anything already coded, or will I have to try to do it all by
myself?


Luis P. Mendes
 

Robert Kern

Luis said:
example:
if the word searched is 'televisão', I want that a search by either
'televisao', 'televisão' or even 'télévisao' (this last one doesn't
exist in Portuguese) is successful.

The ICU library has the capability to transliterate strings via certain
rulesets. One such ruleset would transliterate all of the above to 'televisao'.
That transliteration could act as a normalization step akin to stemming.

There are one or two Python bindings out there. Google for PyICU. I don't recall
if it exposes the transliteration API or not.
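
For what it's worth, here is a rough sketch of what that could look like if the
binding does expose it (untested here; module and method names are from memory
and may differ between PyICU versions):

# -*- coding: utf-8 -*-
import icu  # PyICU; some versions install the module as "PyICU" instead

# ICU transform: decompose, drop the combining marks, recompose
trans = icu.Transliterator.createInstance("NFD; [:Nonspacing Mark:] Remove; NFC")

for word in [u"televisão", u"televisao", u"télévisao"]:
    print trans.transliterate(word)   # all three should print "televisao"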

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Rene Pijlman

Luis P. Mendes:
I'm developing a django based intranet web server that has a search page.

Data contained in the database is mixed. Some of the words are
accented, some are not but they should be. This is because the
collection of data began a long time ago when ascii was the only way to go.

The problem is that users have to search more than once for some words,
because the word they search for may or may not be accented. If we consider
that some expressions can have several letters that could be accented, the
number of searches becomes too large.

I guess the best solution is to index all data in ASCII. That is, convert
each field to ASCII (mapping every accented character to its unaccented
equivalent) and index that.

Then, on a search, you also need to unaccent the search phrase, and match
it against the asciified index.
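
A toy sketch of that two-sided normalization, with a plain dict standing in for
the real database index (the unaccent() helper here is made up for
illustration; it anticipates the normalization shown later in the thread):

# -*- coding: utf-8 -*-
import unicodedata

def unaccent(s):
    # decompose accented characters and drop the combining marks
    decomposed = unicodedata.normalize("NFD", s)
    return ''.join(cp for cp in decomposed
                   if not unicodedata.category(cp).startswith('M'))

rows = [(1, u"TELEVISÃO"), (2, u"RÁDIO")]

# index every row under its unaccented, lowercased form
index = {}
for rowid, nome in rows:
    index.setdefault(unaccent(nome).lower(), []).append(rowid)

# unaccent the search phrase in the same way before looking it up
print index.get(unaccent(u"televisao").lower())   # -> [1]
print index.get(unaccent(u"Télévisao").lower())   # -> [1]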
 

Serge Orlov

Luis said:
I'm developing a django based intranet web server that has a search page.
[...]

example:
if the word searched is 'televisão', I want that a search by either
'televisao', 'televisão' or even 'télévisao' (this last one doesn't
exist in Portuguese) is successful.

So, instead of only one search, several have to be run.

Is there anything already coded, or will I have to try to do it all by
myself?

You need to convert from latin1 to ascii, not from ascii to latin1. The
function below does that. Then you need to build the database index not on
the latin1 text but on the ascii text. After that, convert the user input
to ascii and search.

import unicodedata

def search_key(s):
    de_str = unicodedata.normalize("NFD", s)
    return ''.join(cp for cp in de_str if not
                   unicodedata.category(cp).startswith('M'))

print search_key(u"televisão")
print search_key(u"télévisao")

===== Result:
televisao
televisao
 

Richie Hindle

[Serge]
def search_key(s):
    de_str = unicodedata.normalize("NFD", s)
    return ''.join(cp for cp in de_str if not
                   unicodedata.category(cp).startswith('M'))

Lovely bit of code - thanks for posting it!

You might want to use "NFKD" to normalize things like LATIN SMALL
LIGATURE FI and subscript/superscript characters as well as diacritics.
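
A quick illustration of the difference, on a made-up sample string:

# -*- coding: utf-8 -*-
import unicodedata

def search_key(s, form="NFD"):
    de_str = unicodedata.normalize(form, s)
    return ''.join(cp for cp in de_str if not
                   unicodedata.category(cp).startswith('M'))

word = u"\ufb01sh caf\xe9 n\xba 2\xb2"   # "fish café nº 2²" with the fi ligature
print repr(search_key(word, "NFD"))    # -> u'\ufb01sh cafe n\xba 2\xb2'
print repr(search_key(word, "NFKD"))   # -> u'fish cafe no 22'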
 

Luis P. Mendes


Richie Hindle escreveu:
[Serge]
def search_key(s):
    de_str = unicodedata.normalize("NFD", s)
    return ''.join(cp for cp in de_str if not
                   unicodedata.category(cp).startswith('M'))

Lovely bit of code - thanks for posting it!

You might want to use "NFKD" to normalize things like LATIN SMALL
LIGATURE FI and subscript/superscript characters as well as diacritics.

Thank you very much for your info. It's a very good approach.

When I used the "NFD" option, I came across many errors on these and
possibly other codes: \xba, \xc9, \xcd.

I tried to use "NFKD" instead, and the number of errors was only about
half a dozen, for a universe of 600000+ names, on code \xbf.

It looks like I have to do a search and substitute using regular
expressions for these cases. Or is there a better way to do it?


Luis P. Mendes
 

Richie Hindle

[Luis]
When I used the "NFD" option, I came across many errors on these and
possibly other codes: \xba, \xc9, \xcd.

What errors? This works fine for me, printing "Ecoute":

import unicodedata
def search_key(s):
    de_str = unicodedata.normalize("NFD", s)
    return ''.join([cp for cp in de_str if not
                    unicodedata.category(cp).startswith('M')])
print search_key(u"\xc9coute")

Are you using unicode code point \xc9, or is that a byte in some
encoding? Which encoding?
 

Serge Orlov

Richie said:
[Serge]
def search_key(s):
    de_str = unicodedata.normalize("NFD", s)
    return ''.join(cp for cp in de_str if not
                   unicodedata.category(cp).startswith('M'))

Lovely bit of code - thanks for posting it!

Well, it is not so good. Please read my next message to Luis.

You might want to use "NFKD" to normalize things like LATIN SMALL
LIGATURE FI and subscript/superscript characters as well as diacritics.

IMHO it is perfectly acceptable to declare that you don't interpret those
symbols. After all, they are called *compatibility* code points. I
tried the "one quarter" symbol: Google and MSN don't interpret it, and
Yahoo doesn't support it at all.

The NFKD form is also more tricky to use. It loses the semantics of
characters: for example, if you have the character "digit two" followed by
"superscript digit two", they look like 2 to the power of 2, but NFKD will
convert them into 22 (twenty-two), which is wrong. So if you want to use
NFKD for search you will have to preprocess your data, for example by
inserting a space between the twos.
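
A two-line check makes the point concrete:

import unicodedata
print repr(unicodedata.normalize("NFKD", u"2\xb2"))   # -> u'22', indistinguishable from plain "22"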
 

Serge Orlov

Luis said:

Richie Hindle escreveu:
[Serge]
def search_key(s):
    de_str = unicodedata.normalize("NFD", s)
    return ''.join(cp for cp in de_str if not
                   unicodedata.category(cp).startswith('M'))

Lovely bit of code - thanks for posting it!

You might want to use "NFKD" to normalize things like LATIN SMALL
LIGATURE FI and subscript/superscript characters as well as diacritics.

Thank you very much for your info. It's a very good approach.

When I used the "NFD" option, I came across many errors on these and
possibly other codes: \xba, \xc9, \xcd.

What errors? The normalize method is not supposed to give any errors. You
mean it doesn't work as expected? Well, I have to admit that using
normalize is a far from perfect way to implement search. The most
advanced algorithm is published by the Unicode guys:
<http://www.unicode.org/reports/tr10/> If you read it you'll understand
it's not so easy.

I tried to use "NFKD" instead, and the number of errors was only about
half a dozen, for a universe of 600000+ names, on code \xbf.
It looks like I have to do a search and substitute using regular
expressions for these cases. Or is there a better way to do it?

Perhaps you can use the unicode translate method to map the characters
that still give you problems to whatever you want.
 

Richie Hindle

[Serge]
I have to admit that using
normalize is a far from perfect way to implement search. The most
advanced algorithm is published by Unicode guys:
<http://www.unicode.org/reports/tr10/> If you read it you'll understand
it's not so easy.

I only have to look at the length of the document to understand it's not
so easy. :cool: I'll take your two-line normalization function any day.
IMHO it is perfectly acceptable to declare that you don't interpret those
symbols. After all, they are called *compatibility* code points. I
tried the "one quarter" symbol: Google and MSN don't interpret it, and
Yahoo doesn't support it at all. [...]
if you have the character "digit two" followed by "superscript
digit two", they look like 2 to the power of 2, but NFKD will convert them into
22 (twenty-two), which is wrong. So if you want to use NFKD for search
you will have to preprocess your data, for example by inserting a space
between the twos.

I'm not sure it's obvious that it's wrong. How might a user enter
"2<superscript digit 2>" into a search box? They might enter a genuine
"<superscript digit 2>" in which case you're fine, or they might enter
"2^2" in which case it depends how you deal with punctuation. They
probably won't enter "2 2".

It's certainly not wrong in the case of ligatures like LATIN SMALL
LIGATURE FI - it's quite likely that the user will search for "fish"
rather than finding and (somehow) typing the ligature.

Some superscripts are similar - I imagine there's a code point for the
"superscript st" in "1st" (though I can't find it offhand) and you'd
definitely want to convert that to "st".

NFKD normalization doesn't convert VULGAR FRACTION ONE QUARTER into
"1/4" - I wonder whether there's some way to do that?
After all they are called *compatibility* code points.

Yes, compatible with what the user types. :cool:
 

Luis P. Mendes

Serge Orlov escreveu:
What errors? The normalize method is not supposed to give any errors. You
mean it doesn't work as expected? Well, I have to admit that using
normalize is a far from perfect way to implement search. The most
advanced algorithm is published by the Unicode guys:
<http://www.unicode.org/reports/tr10/>

Perhaps you can use the unicode translate method to map the characters
that still give you problems to whatever you want.

Errors occur when I assign the result of ''.join(cp for cp in de_str if
not unicodedata.category(cp).startswith('M')) to a variable. The same
happens with de_str. When I print the strings everything is ok.

Here's a short example of data:
115448,DAÇÃO
117788,DA 1º DE MO Nº 2

I used the following script to convert the data:
# -*- coding: iso8859-15 -*-

class Latin1ToAscii:

    def abreFicheiro(self):
        import csv
        self.reader = csv.reader(open(self.input_file, "rb"))

    def converter(self):
        import unicodedata
        self.lista_csv = []
        for row in self.reader:
            s = unicode(row[1], "latin-1")
            de_str = unicodedata.normalize("NFD", s)
            nome = ''.join(cp for cp in de_str if not \
                unicodedata.category(cp).startswith('M'))

            linha_ascii = row[0] + "," + nome  # *
            print linha_ascii.encode("ascii")
            self.lista_csv.append(linha_ascii)

    def __init__(self):
        self.input_file = 'nome_latin1.csv'
        self.output_file = 'nome_ascii.csv'

if __name__ == "__main__":
    f = Latin1ToAscii()
    f.abreFicheiro()
    f.converter()


And I got the following result:
$ python latin1_to_ascii.py
115448,DACAO
Traceback (most recent call last):
File "latin1_to_ascii.py", line 44, in ?
f.converter()
File "latin1_to_ascii.py", line 22, in converter
print linha_ascii.encode("ascii")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in
position 11: ordinal not in range(128)


The script converted the ÇÃ in the first line, but not the º in the
second one. Also, at the line marked # *, I don't get a list like
[115448, DAÇÃO] but a [u'115448,DAÇÃO'] element, which doesn't suit my needs.

Would you mind telling me what I should change?


Luis P. Mendes
 

Peter Otten

Luis said:
The script converted the ÇÃ in the first line, but not the º in the
second one. Also, at the line marked # *, I don't get a list like
[115448, DAÇÃO] but a [u'115448,DAÇÃO'] element, which doesn't suit my needs.

Would you mind telling me what I should change?

Sometimes you are faster if you take the gloves off. Just write the
translation table with the desired substitute for every non-ascii character
in the latin1 charset by hand and be done with it.
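
One possible shape of such a table (the entries below are just a sample, not a
complete latin1 mapping; filling in the rest by hand is the point of the
suggestion):

# -*- coding: utf-8 -*-
# Hand-written substitutions, one entry per non-ascii character you care about.
latin1_to_ascii = {
    u"Á": u"A", u"À": u"A", u"Ã": u"A", u"Â": u"A",
    u"É": u"E", u"Ê": u"E", u"Í": u"I",
    u"Ó": u"O", u"Õ": u"O", u"Ô": u"O", u"Ú": u"U",
    u"Ç": u"C", u"º": u"o", u"ª": u"a",
    # ... and so on for the lowercase letters and anything else in your data
}

def asciify(s):
    return u''.join(latin1_to_ascii.get(ch, ch) for ch in s)

print asciify(u"DAÇÃO")              # -> DACAO
print asciify(u"DA 1º DE MO Nº 2")   # -> DA 1o DE MO No 2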

Cyril Kyree
 

richie

[Luis]
The script converted the ÇÃ from the first line, but not the º from
the second one.

That's because º, 0xba, MASCULINE ORDINAL INDICATOR is classed as a
letter and not a diacritic:

http://www.fileformat.info/info/unicode/char/00ba/index.htm

You can't encode it in ascii because it's not an ascii character, and
the script doesn't remove it because it only removes diacritics.
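
A quick way to see that classification from Python:

import unicodedata
print unicodedata.category(u"\xba")    # 'Lo' - Letter, other
print unicodedata.category(u"\u0303")  # 'Mn' - the combining tilde that NFD splits off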

I don't know what the best thing to do with it would be - could you use
latin-1 as your base encoding and leave it in there? I don't speak any
language that uses it, but I'd guess that anyone searching for e.g. 5º
(forgive me if I have the gender wrong :cool:) would actually type 5º -
are there any Italian/Spanish/Portuguese speakers here who can confirm
or deny that?

In the general case, you have to decide what happens to characters that
aren't diacritics and don't live in your base encoding - what happens
when a Chinese user searches for a Chinese character? Probably you
should just encode(base_encoding, 'ignore').
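
A sketch of that catch-all (assuming ascii as the base encoding): strip the
combining marks first, then drop whatever is still outside the base encoding.

import unicodedata

def search_key(s, base_encoding="ascii"):
    de_str = unicodedata.normalize("NFD", s)
    stripped = ''.join(cp for cp in de_str if not
                       unicodedata.category(cp).startswith('M'))
    return stripped.encode(base_encoding, "ignore")

print search_key(u"DA 1\xba DE MO N\xba 2")   # -> DA 1 DE MO N 2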
 

Serge Orlov

Luis said:
Errors occur when I assign the result of ''.join(cp for cp in de_str if
not unicodedata.category(cp).startswith('M')) to a variable. The same
happens with de_str. When I print the strings everything is ok.

Here's a short example of data:
115448,DAÇÃO
117788,DA 1º DE MO Nº 2

[script and traceback snipped]

The script converted the ÇÃ in the first line, but not the º in the
second one. Also, at the line marked # *, I don't get a list like
[115448, DAÇÃO] but a [u'115448,DAÇÃO'] element, which doesn't suit my needs.

Would you mind telling me what I should change?

Calling this process "latin1 to ascii" was a misnomer, sorry that I
used this phrase. It should be called "latin1 to search key", there is
no requirement that the key must be ascii, so change the corresponding
lines in your code:

linha_key = row[0] + "," + nome
print linha_key
self.lista_csv.append(linha_key.encode("latin-1")

With regards to º, Richie already gave you food for thought. If you
want "1 DE MO" to match "1º DE MO", remove that symbol from the key
(linha_key = linha_key.translate({ord(u"º"): None})); note that the unicode
translate method wants code-point ordinals as keys. If you don't want such
fuzzy matching, keep it.
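
For reference, that call applied to the sample line from earlier (the ord() is
needed because unicode's translate method takes code-point ordinals as keys):

print u"DA 1\xba DE MO N\xba 2".translate({ord(u"\xba"): None})
# -> DA 1 DE MO N 2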
 

Luis P. Mendes


With regards to º, Richie already gave you food for thought. If you
want "1 DE MO" to match "1º DE MO", remove that symbol from the key
(linha_key = linha_key.translate({ord(u"º"): None})). If you don't want such
fuzzy matching, keep it.

Thank you all for your help.

That was what I did. That symbol 'º' is not needed for the field.

It's working fine, now.


Luis P. Mendes
 
