Unicode


Anatoli Hristov

Hello guys,

I'm using Linux CentOS and Python 2.4 with MySQL 5.xx, and I get an error
with Unicode. I tried many things that I found on the net, but none of
them work.

If I don't use UTF-8, it inserts the data into the DB, but some French
characters are not correctly decoded. Could you please help me?

Thanks

def PrepareSpecs(product_id, icecat_prod_id, icecat_image_url, name):
    """Gets the specifications of a product from Icecat.biz and inserts
    them into the DB
    """
    specs = {3: GetSpecsNL(icecat_prod_id),
             2: GetSpecsFR(icecat_prod_id).decode('utf-8'),
             1: GetSpecsEN(icecat_prod_id)}
    SpecsToSQL(product_id, specs, name)
    CategorySQL(product_id)
    StoreSQL(product_id)
    GetIMG(icecat_image_url, icecat_prod_id)
    return

def GetSpecsFR(icecat_prod_id):
    opener = urllib.FancyURLopener({})
    ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
                      % icecat_prod_id)
    specsfr = ffr.read()
    #specsfr = specsfr.decode('utf-8')
    specsfr = RemoveHTML(specsfr)
    ##specsfr = "%r" % specsfr
    ## if specsfr:
    ##     try:
    ##         specsfr = str(specsfr)
    ##     except UnicodeEncodeError:
    ##         specsfr = str(specsfr.encode('utf-16'))
    return specsfr

def RemoveHTML(specs):
    specs = specs.replace("<html>","")
    specs = specs.replace("<HTML>","")
    specs = specs.replace("</html>","")
    specs = specs.replace("</HTML>","")
    specs = specs.replace("<head>","")
    specs = specs.replace("<HEAD>","")
    specs = specs.replace("</head>","")
    specs = specs.replace("</HEAD>","")
    specs = specs.replace("<body>","")
    specs = specs.replace("</body>","")
    specs = specs.replace("<BODY>","")
    specs = specs.replace("</BODY>","")
    specs = specs.replace("<TITLE>","")
    specs = specs.replace("</TITLE>","")
    specs = specs.replace("<title>","")
    specs = specs.replace("</title>","")
    specs = specs.replace("<p>","")
    specs = specs.replace("</p>","")
    return specs

def SpecsToSQL(product_id, specs, name):
    for lang, spec in specs.iteritems():
        InsertSpecsDB(product_id, spec, lang, name)
    return

def InsertSpecsDB(product_id, spec, lang, name):
    db = MySQLdb.connect("localhost","getit","opencart")
    cursor = db.cursor()
    sql = ("INSERT INTO product_description (product_id, language_id, "
           "name, description) VALUES (%s,%s,%s,%s)")
    params = (product_id, lang, name, spec)
    cursor.execute(sql, params)
    id = cursor.lastrowid
    print "Updated ID %s description %s" % (int(id), lang)
    return
 

Steven D'Aprano

If I don't use UTF-8, it inserts the data into the DB, but some French
characters are not correctly decoded. Could you please help me?

What happens when you do use UTF-8?

What do you mean, "use UTF-8"?


To learn about Unicode, start here:

http://www.joelonsoftware.com/articles/Unicode.html

If that helps you solve the problem, excellent. If not, please come back
with your questions, but first read this:

http://www.sscce.org/

As given, we cannot answer your question easily, or at all, because we
cannot run your code. It gives indentation errors, you don't tell us what
modules you're using, and you haven't reduced the example down to the
critical parts that demonstrate the failure.
 

Anatoli Hristov

What happens when you do use UTF-8?

This is the result when I encode the string:
" étroits, en utilisant un portable extrêmement puissantâ€â€le plus
petit et le plus léger des HP EliteBook pleine puissanceâ€â€avec un
écran de diagonale 31,75 cm (12,5 pouces), idéal pour le
professionnel ultra-mobile.
"
No accents

What do you mean, "use UTF-8"?

Trying to encode the string

To learn about Unicode, start here:

http://www.joelonsoftware.com/articles/Unicode.html

If that helps you solve the problem, excellent. If not, please come back
with your questions, but first read this:

http://www.sscce.org/

I will try to understand the logic :)

As given, we cannot answer your question easily, or at all, because we
cannot run your code. It gives indentation errors, you don't tell us what
modules you're using, and you haven't reduced the example down to the
critical parts that demonstrate the failure.

I didn't want to include all my code, as it is 15K, and I also know
my code is crappy and you will start blaming me and saying that my code
is crap - and I know it!

Thanks
 

Benjamin Kaplan

This is the result when I encode the string:
" étroits, en utilisant un portable extrêmement puissantâ€â€le plus
petit et le plus léger des HP EliteBook pleine puissanceâ€â€avec un
écran de diagonale 31,75 cm (12,5 pouces), idéal pour le
professionnel ultra-mobile.
"
No accents

Trying to encode the string

What's your terminal's encoding? That looks like you have a CP-1252
terminal trying to output UTF-8 text.
 

Anatoli Hristov

What's your terminal's encoding? That looks like you have a CP-1252
terminal trying to output UTF-8 text.

Thanks for your answer. I ran locale in my terminal and it gives
this output:
LANG=en_US
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE="en_US"
LC_MONETARY="en_US"
LC_MESSAGES="en_US"
LC_PAPER="en_US"
LC_NAME="en_US"
LC_ADDRESS="en_US"
LC_TELEPHONE="en_US"
LC_MEASUREMENT="en_US"
LC_IDENTIFICATION="en_US"
LC_ALL=
 

Vlastimil Brom

2012/12/17 Anatoli Hristov said:
What happens when you do use UTF-8?

This is the result when I encode the string:
" étroits, en utilisant un portable extrêmement puissant—le plus
petit et le plus léger des HP EliteBook pleine puissance—avec un
écran de diagonale 31,75 cm (12,5 pouces), idéal pour le
professionnel ultra-mobile.
"
No accents

Hi,
if you only see encoding problems when printing results to your
terminal, its settings or Unicode capability might be the cause.
However, if you also get badly encoded items in the database, you are
likely using an inappropriate encoding in some step.

You seem to be doing something like the following (explicitly or
partly implicitly, based on your system defaults):

i.e. encoding a text using utf-8 and handling it like windows-1252
afterwards (or taking an already encoded text and decoding it with the
inappropriate ANSI encoding).
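
The mismatch described above can be reproduced in a few lines. This is only a sketch, not the example from the original message: it is written in Python 3 syntax (the thread uses Python 2), and the sample string is a shortened, assumed version of the text quoted earlier.

```python
# Shortened sample text (an assumption for illustration); \u2014 is the long dash.
text = "étroits, extrêmement puissant\u2014le plus léger"

# Step 1: the text is encoded as UTF-8 bytes (this is what the web page delivers).
utf8_bytes = text.encode("utf-8")

# Step 2 (the mistake): the same bytes are interpreted as windows-1252.
mojibake = utf8_bytes.decode("cp1252")

# Step 2 (correct): the bytes are decoded with the codec they were encoded with.
round_trip = utf8_bytes.decode("utf-8")

print(ascii(mojibake))    # each accented letter has become two characters
print(ascii(round_trip))  # identical to the original text
```

As long as no byte was dropped along the way, the damage is reversible: `mojibake.encode("cp1252").decode("utf-8")` gives back the original text.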

hth,
vbr
 

Anatoli Hristov

if you only see encoding problems when printing results to your
terminal, its settings or Unicode capability might be the cause.
However, if you also get badly encoded items in the database, you are
likely using an inappropriate encoding in some step.

I get bad encoding in my DB as well.

You seem to be doing something like the following (explicitly or
partly implicitly, based on your system defaults):

i.e. encoding a text using utf-8 and handling it like windows-1252
afterwards (or taking an already encoded text and decoding it with the
inappropriate ANSI encoding).

Thank you Vlastimil,

I tried to print it as you showed me, but I receive an error:

Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0192'
in position 1: ordinal not in range(256)
 

Vlastimil Brom

2012/12/17 Anatoli Hristov said:
I get bad encoding in my DB as well.

Thank you Vlastimil,

I tried to print it as you showed me, but I receive an error:

Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0192'
in position 1: ordinal not in range(256)

Hi,
this seems to be an encoding error of your terminal on printing.
You may need to describe (or better, post the relevant parts of the
source) where the text is coming from (external text file, database
entry, hardcoded in the Python source ...), how it is stored, retrieved
and possibly manipulated before you insert it into the database.

You may try to print a repr(...) of the string to be inserted into the
database to see whether it is already mangled in some previous
part of the processing.
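
A rough sketch of that repr check, in Python 3 syntax, where `ascii()` plays the role that `repr()` plays for unicode strings in Python 2. The helper `looks_double_encoded` is a hypothetical name and only a heuristic, not part of any library.

```python
def looks_double_encoded(s):
    """Heuristic: True if s looks like UTF-8 bytes that were decoded as latin-1."""
    try:
        repaired = s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either s holds characters outside latin-1, or its bytes are not
        # valid UTF-8; in both cases it is not the classic double encoding.
        return False
    return repaired != s

good = "g\u00e9n\u00e9ration"                      # correctly decoded text
mangled = good.encode("utf-8").decode("latin-1")   # simulated earlier mistake

# The escaped form makes the difference visible on any terminal:
print(ascii(good))     # 'g\xe9n\xe9ration'
print(ascii(mangled))  # 'g\xc3\xa9n\xc3\xa9ration'
```

A single escape per accented letter (`\xe9`) suggests the text is intact; a pair such as `\xc3\xa9` suggests it was already mangled before reaching the database layer.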

hth,

vbr
 

Anatoli Hristov

this seems to be an encoding error of your terminal on printing.
You may need to describe (or better, post the relevant parts of the
source) where the text is coming from (external text file, database
entry, hardcoded in the Python source ...), how it is stored, retrieved
and possibly manipulated before you insert it into the database.

Here is how I get the data using the urllib opener:

def GetSpecsFR(icecat_prod_id):
    opener = urllib.FancyURLopener({})
    ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
                      % icecat_prod_id)
    specsfr = ffr.read()
    #specsfr = specsfr.decode('utf-8')
    specsfr = RemoveHTML(specsfr)
    ##specsfr = "%r" % specsfr
    ## if specsfr:
    ##     try:
    ##         specsfr = str(specsfr)
    ##     except UnicodeEncodeError:
    ##         specsfr = str(specsfr.encode('utf-16'))
    return specsfr
 

Vlastimil Brom

2012/12/17 Anatoli Hristov said:
Here is how I get the data using the urllib opener:

def GetSpecsFR(icecat_prod_id):
    opener = urllib.FancyURLopener({})
    ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
                      % icecat_prod_id)
    specsfr = ffr.read()
    #specsfr = specsfr.decode('utf-8')
    specsfr = RemoveHTML(specsfr)
    ##specsfr = "%r" % specsfr
    ## if specsfr:
    ##     try:
    ##         specsfr = str(specsfr)
    ##     except UnicodeEncodeError:
    ##         specsfr = str(specsfr.encode('utf-16'))
    return specsfr

Hi,
I don't know what the product ID would look like for this page, but
assuming the catalog pages are utf-8 encoded, like the error page
I get, it should work OK; cf.:


<!-- This Icecat template is used as head of all pages in Product finder -->


<HTML>
<HEAD>

[... - shortened]

<div align="center">"Désolé, pour ce produit, nous n'avons pas trouvé
d'autres informations produit.<br>Si vous n'êtes pas redirigés
automatiquement, veuillez cliquer" <a href="#" style="font-size:80%"
onclick="history.back()">ici</a>
</div>
<!--
<td bgcolor="" width="230" align="center"><img
src="/imgs/logo.gif" width="180" height="58"></td>
-->



Printing on a Unicode-capable shell works OK (wx PyShell in my case);
inserting into the database should be straightforward too (although I
don't have experience with the specific db you are using).

Are you getting other Unicode errors in other parts of the process,
or do the above steps work differently on your computer?

hth,
vbr
 

Anatoli Hristov

Hi,
I don't know what the product ID would look like for this page, but
assuming the catalog pages are utf-8 encoded, like the error page
I get, it should work OK; cf.:

You are right, I got it to work on Windows too, but not on Linux. I
changed the codec on Linux, but it still doesn't work.

Here is what I get from Linux:
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122'
in position 17167: ordinal not in range(256)
 

Dave Angel

You are right, I got it to work on Windows too, but not on Linux. I
changed the codec on Linux, but it still doesn't work.

Here is what I get from Linux:

Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122'
in position 17167: ordinal not in range(256)

I can tell you what's happening, but maybe not how to fix it.

src.decode() is creating a unicode string. The error is not happening
there. But when print is used with a unicode string, it has to encode
the data. And for whatever reason, yours is using latin-1, and you have
a character in there which is not in the latin-1 encoding.

My python 2.7 uses utf-8 everywhere (on Linux Ubuntu 11.04).
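
The encoding step that print performs can be imitated without touching the real terminal. The following is a sketch in Python 3 syntax; `write_as` is a hypothetical helper that sends text through an in-memory stream configured with a given encoding, standing in for a terminal that expects that encoding.

```python
import io

text = "Intel\u00ae Core\u2122"   # registered sign and trademark sign

def write_as(encoding, s):
    """Write s through a text stream that encodes the way stdout would."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(s)
    stream.flush()
    return raw.getvalue()

# A UTF-8 stream can represent every character in the string.
utf8_out = write_as("utf-8", text)

# A latin-1 stream cannot represent U+2122, so the write itself fails.
try:
    write_as("latin-1", text)
    latin1_failed = False
except UnicodeEncodeError as exc:
    latin1_failed = True
    print("latin-1 stream failed:", exc)
```

The decode step succeeds in both cases; only the final encode for output depends on what the stream is configured to use.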
 

Anatoli Hristov

src.decode() is creating a unicode string. The error is not happening
there. But when print is used with a unicode string, it has to encode
the data. And for whatever reason, yours is using latin-1, and you have
a character in there which is not in the latin-1 encoding.

I fixed the print: I changed the settings of the terminal and also of
the sshconfig, so now I'm able to print without problems. But when I
tried to run the script I've made, it again gives me the same error:
"Unexpected error: exceptions.UnicodeEncodeError"
Maybe I will try to update to 2.7
 

Vlastimil Brom

2012/12/17 Anatoli Hristov said:
I fixed the print: I changed the settings of the terminal and also of
the sshconfig, so now I'm able to print without problems. But when I
tried to run the script I've made, it again gives me the same error:
"Unexpected error: exceptions.UnicodeEncodeError"
Maybe I will try to update to 2.7

Well, we don't see the context or traceback of that error, but it
looks like a MySQL error on inserting data.
Could it be that your database is not Unicode-enabled, e.g. not utf-8
but, say, latin-1?
I don't have experience with this database, but I guess there
must be some configuration options for this.
Would setting the encoding in db.connect(...) maybe work?
cf.:
http://stackoverflow.com/questions/8365660/python-mysql-unicode-and-encoding
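
A sketch of what setting the encoding on the connection might look like. MySQLdb's `connect()` accepts `charset` and `use_unicode` keyword arguments; everything else here (the helper names, the credentials in the usage note) is hypothetical. The driver is imported lazily so the sketch loads even where MySQLdb is not installed.

```python
def connection_kwargs(host, user, passwd, db):
    """Build keyword arguments for MySQLdb.connect() that request UTF-8."""
    return {
        "host": host,
        "user": user,
        "passwd": passwd,
        "db": db,
        "charset": "utf8",     # character set used on the connection
        "use_unicode": True,   # text columns are returned as unicode objects
    }

def connect_utf8(host, user, passwd, db):
    """Open a connection with the settings above (requires MySQLdb)."""
    import MySQLdb  # imported here so the rest of the sketch has no dependency
    return MySQLdb.connect(**connection_kwargs(host, user, passwd, db))
```

It would be called as, for example, `connect_utf8("localhost", "some_user", "some_password", "some_db")`. Whether this alone is enough presumably also depends on the character set of the table and its columns, which the thread does not show.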

Hopefully, others might give more reliable suggestions..

hth,
vbr
 

Anatoli Hristov

I fixed the print: I changed the settings of the terminal and also of
the sshconfig, so now I'm able to print without problems. But when I
tried to run the script I've made, it again gives me the same error:
"Unexpected error: exceptions.UnicodeEncodeError"
Maybe I will try to update to 2.7

Upgraded to Python 2.7 and it still gives Unexpected error:
exceptions.UnicodeEncodeError. Damn encoders, I don't know what to
do...
 

Dave Angel

That's not the whole error message. What encoding does it report in the
error?

Maybe I will try to update to 2.7

Upgraded to Python 2.7 and it still gives Unexpected error:
exceptions.UnicodeEncodeError. Damn encoders, I don't know what to
do...

I doubted that 2.7 would make any difference.

1. What does your "terminal" expect? (For all I know you're using
TeraTermPro as a terminal, which doesn't support utf-8.)
Have you looked at the terminal encoding to see what your copy of
Terminal is expecting? On my Ubuntu Linux, I open the terminal with
Ctrl-Alt-t, then in the menu bar, I select
Terminal->SetCharacterEncoding->utf-8

2. What does your environment tell Linux to support? At a bash prompt, try
echo $LANG (there are two other environment variables I've seen
references to, so this aspect is nuts)

Mine says
en_US.UTF-8

3. What does Python think it was told?
import sys
print sys.stdout.encoding

Mine says
UTF-8
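
Checks 2 and 3 can be gathered into one small report. This is a sketch in Python 3 syntax; `encoding_report` is a hypothetical helper, and the choice of `LC_ALL` and `LC_CTYPE` as the "two other" variables is an assumption.

```python
import locale
import os
import sys

def encoding_report(environ=os.environ, stream=sys.stdout):
    """Collect the settings that influence how printed text gets encoded."""
    return {
        "LANG": environ.get("LANG"),
        "LC_ALL": environ.get("LC_ALL"),      # assumed to be one of the others
        "LC_CTYPE": environ.get("LC_CTYPE"),  # assumed to be one of the others
        "stdout": getattr(stream, "encoding", None),
        "preferred": locale.getpreferredencoding(False),
    }

if __name__ == "__main__":
    for key, value in sorted(encoding_report().items()):
        print("%-10s %s" % (key, value))
```

None of these values can reveal what the terminal emulator itself is set to display, so check 1 still has to be done by hand in the terminal's own settings.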


I can force a similar error as follows:


import urllib
opener = urllib.FancyURLopener({})
ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
                  % (14688538))
src = ffr.read()

out = src.decode("utf-8").encode("latin-1")

Traceback (most recent call last):
File "anatoli3.py", line 9, in <module>
src.decode("utf-8").encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122' in
position 17167: ordinal not in range(256)


And from that it's quite clear that for that particular data, I cannot
use a latin-1 encoder.

So I did a bit of hunting, and I found that the offending character is the
one after the word "Core" in the following quote:

processeurs Intel® Core™ de 3ème génération


The symbol is a trademark symbol and is not part of latin-1. If you're
really stuck with a latin-1 terminal, then you could do something like:

print src.decode("utf-8").encode("latin-1", "ignore")

That says to decode it using utf-8 (because the html declared a utf-8
encoding), and encode it back to latin-1 (because your terminal is stuck
there), then print.


Just realize that once you start using 'ignore' you're going to also
ignore discrepancies that are real. For example, maybe your terminal is
actually something other than either latin-1 or utf-8.


For others that just want to play with a minimal subset:


test = u'processeurs Intel\xae Core\u2122 de 3\xe8me g\xe9n\xe9ration av'
print test
print test.encode("latin-1", "ignore")
print test.encode("latin-1")

produces :

processeurs Intel® Core™ de 3ème génération av
processeurs Intel� Core de 3�me g�n�ration av
Traceback (most recent call last):
File "anatoli3.py", line 22, in <module>
print test.encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122' in
position 23: ordinal not in range(256)
 

Hans Mulder

print src.decode("utf-8").encode("latin-1", "ignore")

That says to decode it using utf-8 (because the html declared a utf-8
encoding), and encode it back to latin-1 (because your terminal is stuck
there), then print.


Just realize that once you start using 'ignore' you're going to also
ignore discrepancies that are real. For example, maybe your terminal is
actually something other than either latin-1 or utf-8.

If you need to see such discrepancies, you can do

print src.decode("utf-8").encode("latin-1", "xmlcharrefreplace")


That would produce something like:

processeurs Intel® Core&#8482; de 3ème génération av

that is, the problem characters are displayed in &#...; notation.
That is ugly, but sometimes it's the only way to see what character
you really have.

Notice that the number you get is in decimal, whereas the \u....
notation uses hex:
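
A small sketch of the decimal/hex relationship, in Python 3 syntax (so the encoded result is a bytes object rather than a Python 2 str). The sample string is the minimal test string used earlier in the thread.

```python
test = "processeurs Intel\u00ae Core\u2122 de 3\u00e8me g\u00e9n\u00e9ration av"

# Characters that latin-1 cannot represent become &#NNNN; references.
encoded = test.encode("latin-1", "xmlcharrefreplace")
print(encoded)

# The number in the reference is the code point in decimal ...
code_point = ord("\u2122")
print(code_point)        # 8482

# ... while the \u escape spells the same code point in hexadecimal.
print(hex(code_point))   # 0x2122
```

Only the trademark sign is replaced; the registered sign and the accented letters exist in latin-1 and are encoded as single bytes.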


Hope this helps,

-- HansM
 

Terry Reedy

Upgraded to Python 2.7 and it still gives Unexpected error:
exceptions.UnicodeEncodeError. Damn encoders, I don't know what to
do...

If you are working with Unicode, and you can upgrade to 3.3, you will
probably be happier if you do. This does not solve all problems, but the
Python side is definitely better. (I.e., there are Unicode bugs in 2.7
whose fix *is* to upgrade to 3.3.)

That said, retrieving

http://prf.icecat.biz/index.cgi?pro...t;smi=product;shopname=openICEcat-url;lang=fr

with Firefox on Win 7 returns a page containing

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

so I presume the http encoding is also utf-8
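
Rather than presuming, the declared charset can be read from a Content-Type value and used for decoding. This is a sketch in Python 3 syntax using the standard library's `email.message.Message`; the helper name and the fallback value are assumptions.

```python
from email.message import Message

def charset_from_content_type(value, default="utf-8"):
    """Extract the charset parameter from a Content-Type value.

    `default` is a hypothetical fallback for pages that declare no charset.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset() or default

# The value below is the one from the <meta> tag quoted above.
declared = charset_from_content_type("text/html; charset=utf-8")

# Stand-in for bytes read from the server (not real page content).
raw = "D\u00e9sol\u00e9".encode("utf-8")
decoded = raw.decode(declared)
print(ascii(decoded))
```

Decoding once, right after reading the bytes, and working with text from then on keeps the rest of the program independent of the page's encoding.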

Also: printing to the screen in IDLE may work better than with the
standard interactive console (especially the awful Windows version). I
have the font set to Lucida Sans Unicode (this may be Windows-specific),
which seems to work for all BMP (Basic Multilingual Plane) chars.
 

Anatoli Hristov

I doubted that 2.7 would make any difference.

Yeah, this complicated my life even more: all my import functions were
gone, and it took me 2h to fix them all :)
And it did not solve my issue :)

1. What does your "terminal" expect? (For all I know you're using
TeraTermPro as a terminal, which doesn't support utf-8.)
Have you looked at the terminal encoding to see what your copy of
Terminal is expecting? On my Ubuntu Linux, I open the terminal with
Ctrl-Alt-t, then in the menu bar, I select
Terminal->SetCharacterEncoding->utf-8

I'm using PuTTY for Windows, and I changed PuTTY to UTF-8, and this
is what solved the problem - ihuuuuuu :p
There is no logic, but it solved the issue!

2. What does your environment tell Linux to support? At a bash prompt, try
echo $LANG (there are two other environment variables I've seen
references to, so this aspect is nuts)

Mine says
en_US.UTF-8

Mine too: US.UTF-8, but PuTTY was set to latin1.

3. What does Python think it was told?
import sys
print sys.stdout.encoding

Mine says
UTF-8

Mine too :p

Thank you Dave, you always come up with a solution :)
 

Anatoli Hristov

Just realize that once you start using 'ignore' you're going to also
If you need to see such discrepancies, you can do

print src.decode("utf-8").encode("latin-1", "xmlcharrefreplace")


That would produce something like:

processeurs Intel® Core&#8482; de 3ème génération av

that is, the problem characters are displayed in &#...; notation.
That is ugly, but sometimes it's the only way to see what character
you really have.

Notice that the number you get is in decimal, where the \u....
notation uses hex:

Thanks guys, my issue is now solved. The problem came from my PuTTY
client: it was set to latin1 by default, and after changing it to utf-8
everything works...
 