Unicode

Anatoli Hristov · Dec 16, 2012

Hello guys,

I'm using CentOS Linux and Python 2.4 with MySQL 5.xx. I get an error
with Unicode. I tried many things that I found on the net, but none of
them work.

If I don't use UTF-8, it inserts the data into the DB, but some French
characters are not correctly decoded. Could you please help me?

Thanks

def PrepareSpecs(product_id, icecat_prod_id, icecat_image_url, name):
    """Gets the specifications of a product from Icecat.biz and insert
    them into the DB
    """
    specs = {3:GetSpecsNL(icecat_prod_id),2:GetSpecsFR(icecat_prod_id).decode('utf-8'),1:GetSpecsEN(icecat_prod_id)}
    SpecsToSQL(product_id,specs,name)
    CategorySQL(product_id)
    StoreSQL(product_id)
    GetIMG(icecat_image_url,icecat_prod_id)
    return

def GetSpecsFR(icecat_prod_id):
    opener = urllib.FancyURLopener({})
    ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
                      % icecat_prod_id)
    specsfr = ffr.read()
    #specsfr = specsfr.decode('utf-8')
    specsfr = RemoveHTML(specsfr)
    ##specsfr = "%r" % specsfr
    ## if specsfr:
    ##     try:
    ##         specsfr = str(specsfr)
    ##     except UnicodeEncodeError:
    ##         specsfr = str(specsfr.encode('utf-16'))
    return specsfr

def RemoveHTML(specs):
    specs = specs.replace("<html>","")
    specs = specs.replace("<HTML>","")
    specs = specs.replace("</html>","")
    specs = specs.replace("</HTML>","")
    specs = specs.replace("<head>","")
    specs = specs.replace("<HEAD>","")
    specs = specs.replace("</head>","")
    specs = specs.replace("</HEAD>","")
    specs = specs.replace("<body>","")
    specs = specs.replace("</body>","")
    specs = specs.replace("<BODY>","")
    specs = specs.replace("</BODY>","")
    specs = specs.replace("<TITLE>","")
    specs = specs.replace("</TITLE>","")
    specs = specs.replace("<title>","")
    specs = specs.replace("</title>","")
    specs = specs.replace("<p>","")
    specs = specs.replace("</p>","")
    return specs

def SpecsToSQL(product_id, specs, name):
    for lang, spec in specs.iteritems():
        InsertSpecsDB(product_id, spec, lang, name)
    return

def InsertSpecsDB(product_id, spec, lang, name):
    # Parameter order matches the call in SpecsToSQL: (product_id, spec, lang, name)
    db = MySQLdb.connect("localhost","getit","opencart")
    cursor = db.cursor()
    sql = "INSERT INTO product_description (product_id, language_id, name, description) VALUES (%s,%s,%s,%s)"
    params = (product_id, lang, name, spec)
    cursor.execute(sql, params)
    id = cursor.lastrowid
    print "Updated ID %s description %s" % (int(id), lang)
    return

Steven D'Aprano · Dec 17, 2012

If I don't use UTF-8, it inserts the data into the DB, but some French
characters are not correctly decoded. Could you please help me?

What happens when you do use UTF-8?

What do you mean, "use UTF-8"?

To learn about Unicode, start here:

http://www.joelonsoftware.com/articles/Unicode.html

If that helps you solve the problem, excellent. If not, please come back
with your questions, but first read this:

http://www.sscce.org/

As given, we cannot answer your question easily, or at all, because we
cannot run your code. It gives indentation errors, you don't tell us what
modules you're using, and you haven't reduced the example down to the
critical parts that demonstrate the failure.

Anatoli Hristov · Dec 17, 2012

What happens when you do use UTF-8?
This is the result when I encode the string:
" ÃƒÂ©troits, en utilisant un portable extrÃƒÂªmement puissantÃ¢â‚¬â€le plus
petit et le plus lÃƒÂ©ger des HP EliteBook pleine puissanceÃ¢â‚¬â€avec un
ÃƒÂ©cran de diagonale 31,75 cm (12,5 pouces), idÃƒÂ©al pourle
professionnel ultra-mobile.
"
No accents

What do you mean, "use UTF-8"?

Trying to encode the string

To learn about Unicode, start here:

http://www.joelonsoftware.com/articles/Unicode.html

If that helps you solve the problem, excellent. If not, please come back
with your questions, but first read this:

I will try to understand the logic

http://www.sscce.org/

As given, we cannot answer your question easily, or at all, because we
cannot run your code. It gives indentation errors, you don't tell us what
modules you're using, and you haven't reduced the example down to the
critical parts that demonstrate the failure.

I didn't want to include all my code, as it is 15K. I also know
my code is crappy, and you will start blaming me and saying that my code
is crap, and I know it!

Thanks

Benjamin Kaplan · Dec 17, 2012

This is the result when I encode the string:
" ÃƒÂ©troits, en utilisant un portable extrÃƒÂªmement puissantÃ¢â‚¬â€le plus
petit et le plus lÃƒÂ©ger des HP EliteBook pleine puissanceÃ¢â‚¬â€avec un
ÃƒÂ©cran de diagonale 31,75 cm (12,5 pouces), idÃƒÂ©al pour le
professionnel ultra-mobile.
"
No accents

Trying to encode the string

What's your terminal's encoding? That looks like you have a CP-1252
terminal trying to output UTF-8 text.

Anatoli Hristov · Dec 17, 2012

What's your terminal's encoding? That looks like you have a CP-1252
terminal trying to output UTF-8 text.

Thanks for your answer. I ran locale in my terminal and it gives
this output:
LANG=en_US
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE="en_US"
LC_MONETARY="en_US"
LC_MESSAGES="en_US"
LC_PAPER="en_US"
LC_NAME="en_US"
LC_ADDRESS="en_US"
LC_TELEPHONE="en_US"
LC_MEASUREMENT="en_US"
LC_IDENTIFICATION="en_US"
LC_ALL=

Vlastimil Brom · Dec 17, 2012

2012/12/17 Anatoli Hristov said:
What happens when you do use UTF-8?

This is the result when I encode the string:
" Ã©troits, en utilisant un portable extrÃªmement puissantâ€”le plus
petit et le plus lÃ©ger des HP EliteBook pleine puissanceâ€”avecun
Ã©cran de diagonale 31,75 cm (12,5 pouces), idÃ©al pour le
professionnel ultra-mobile.
"
No accents
Hi,
if you only see encoding problems when printing results to your
terminal, its settings or unicode capability might be the cause.
However, if you also get badly encoded items in the database, you are
likely using an inappropriate encoding in some step.

You seem to be doing something like the following (explicitly or
partly implicitly, based on your system defaults): encoding a text
using utf-8 and handling it like windows-1252 afterwards (or taking an
already encoded text and decoding it with the inappropriate ANSI
encoding).
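
The mix-up described above can be reproduced in a few lines. This is a minimal sketch, written for Python 3 (the same encode/decode calls exist in Python 2); the variable names are illustrative and not from the original script, and it assumes the wrong codec is windows-1252. Making the mistake twice gives the doubly garbled pattern seen in the output posted earlier in the thread.

```python
# Illustrative only: reproduce and undo UTF-8 text mis-decoded as windows-1252.
original = u"\xe9troits"  # "étroits"

# Encode as UTF-8, then wrongly decode the bytes as windows-1252.
once = original.encode("utf-8").decode("windows-1252")
# Repeating the same mistake garbles the text a second time.
twice = once.encode("utf-8").decode("windows-1252")

# ascii() shows the code points without relying on the terminal encoding.
print(ascii(once))   # '\xc3\xa9troits'
print(ascii(twice))  # '\xc3\u0192\xc2\xa9troits'

# As long as no bytes were dropped, the damage can be undone step by step.
repaired = twice.encode("windows-1252").decode("utf-8")
repaired = repaired.encode("windows-1252").decode("utf-8")
print(repaired == original)  # True
```

The repair at the end only works when every garbled character still maps back to a byte in windows-1252; it is a diagnostic aid, not a substitute for fixing the step that used the wrong codec.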

hth,
vbr

Anatoli Hristov · Dec 17, 2012

if you only see encoding problems when printing results to your
terminal, its settings or unicode capability might be the cause.
However, if you also get badly encoded items in the database, you are
likely using an inappropriate encoding in some step.

I get bad encoding in my DB.

You seem to be doing something like the following (explicitly or
partly implicitly, based on your system defaults): encoding a text
using utf-8 and handling it like windows-1252 afterwards (or taking an
already encoded text and decoding it with the inappropriate ANSI
encoding).

Thank you Vlastimil,

I tried to print it as you showed me, but I receive an error:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0192' in position 1: ordinal not in range(256)

Vlastimil Brom · Dec 17, 2012

2012/12/17 Anatoli Hristov said:
I get bad encoding in my DB.

Thank you Vlastimil,

I tried to print it as you showed me, but I receive an error:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0192' in position 1: ordinal not in range(256)

Hi,
this seems to be an encoding error of your terminal on printing.
You may need to describe (or better, post the respective parts of the
source) where the text is coming from (external text file, database
entry, hardcoded in the python source ...), how it is stored, retrieved
and possibly manipulated before you insert it into the database.

You may try to print a repr(...) of the string to be inserted into the
database to see whether it is already mangled in some previous
part of the processing.
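
A sketch of that check, assuming Python 3, where ascii() plays the role that repr() plays for unicode strings in Python 2. The sample values are made up for illustration. A correctly decoded string shows one escape per accented letter, while a string mangled earlier in the processing shows two.

```python
# Illustrative values, not taken from the real database.
clean = u"l\xe9ger"                                # "léger", decoded correctly
mangled = clean.encode("utf-8").decode("latin-1")  # UTF-8 bytes read as latin-1

# ascii() escapes every non-ASCII character, so the terminal encoding
# cannot hide or distort what the string really contains.
print(ascii(clean))    # 'l\xe9ger'
print(ascii(mangled))  # 'l\xc3\xa9ger'
print(len(clean), len(mangled))  # 5 6
```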

hth,

vbr

Anatoli Hristov · Dec 17, 2012

this seems to be an encoding error of your terminal on printing.
You may need to describe (or better, post the respective parts of the
source) where the text is coming from (external text file, database
entry, hardcoded in the python source ...), how it is stored, retrieved
and possibly manipulated before you insert it into the database.

Here is how I get the data using the urllib opener:

def GetSpecsFR(icecat_prod_id):
    opener = urllib.FancyURLopener({})
    ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
                      % icecat_prod_id)
    specsfr = ffr.read()
    #specsfr = specsfr.decode('utf-8')
    specsfr = RemoveHTML(specsfr)
    ##specsfr = "%r" % specsfr
    ## if specsfr:
    ##     try:
    ##         specsfr = str(specsfr)
    ##     except UnicodeEncodeError:
    ##         specsfr = str(specsfr.encode('utf-16'))
    return specsfr

Vlastimil Brom · Dec 17, 2012

2012/12/17 Anatoli Hristov said:
Here is how I get the data using the urllib opener:

def GetSpecsFR(icecat_prod_id):
    opener = urllib.FancyURLopener({})
    ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
                      % icecat_prod_id)
    specsfr = ffr.read()
    #specsfr = specsfr.decode('utf-8')
    specsfr = RemoveHTML(specsfr)
    ##specsfr = "%r" % specsfr
    ## if specsfr:
    ##     try:
    ##         specsfr = str(specsfr)
    ##     except UnicodeEncodeError:
    ##         specsfr = str(specsfr.encode('utf-16'))
    return specsfr

Hi,
I don't know what the product ID would look like for this page, but
assuming the catalog pages are utf-8 encoded like the
error page I get, it should work ok; cf.:



<HTML>
<HEAD>

[... - shortened]

<div align="center">"Désolé, pour ce produit, nous n'avons pas trouvé
d'autres informations produit.<br>Si vous n'êtes pas redirigés
automatiquement, veuillez cliquer" <a href="#" style="font-size:80%"
onclick="history.back()">ici</a>
</div>


Printing on a unicode-capable shell works ok (wx PyShell in my case);
inserting into the database should be straightforward too (although I
don't have experience with the specific db you are using).

Are you getting other unicode errors in other parts of the process,
or do the above steps work differently on your computer?

hth,
vbr

Anatoli Hristov · Dec 17, 2012

Hi,

I don't know what the product ID would look like for this page, but
assuming the catalog pages are utf-8 encoded like the
error page I get, it should work ok; cf.:

You are right, I got it to work on Windows too, but not on Linux. I
changed the codec on Linux, but I still don't get it.

Here is what I get from Linux:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122' in position 17167: ordinal not in range(256)

Dave Angel · Dec 17, 2012

You are right, I got it to work on Windows too, but not on Linux. I
changed the codec on Linux, but I still don't get it.

Here is what I get from Linux:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122' in position 17167: ordinal not in range(256)

I can tell you what's happening, but maybe not how to fix it.

src.decode() is creating a unicode string. The error is not happening
there. But when print is used with a unicode string, it has to encode
the data. And for whatever reason, yours is using latin-1, and you have
a character in there which is not in the latin-1 encoding.

My python 2.7 uses utf-8 everywhere (on Linux Ubuntu 11.04).
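
To see which characters a given output encoding can represent, the encode step that print performs can be done by hand. A minimal sketch, assuming Python 3 syntax; can_encode is a hypothetical helper, not part of the original script, and the sample string is made up around the u'\u2122' character from the traceback.

```python
import sys

def can_encode(text, encoding):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

sample = u"Intel\xae Core\u2122"  # registered sign and trademark sign

print(can_encode(sample, "latin-1"))  # False: U+2122 is not in latin-1
print(can_encode(sample, "utf-8"))    # True
# The codec print would use for this process (may be None when unset).
print(sys.stdout.encoding)
```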

Anatoli Hristov · Dec 17, 2012

src.decode() is creating a unicode string. The error is not happening
there. But when print is used with a unicode string, it has to encode
the data. And for whatever reason, yours is using latin-1, and you have
a character in there which is not in the latin-1 encoding.

I fixed the print: I changed the setting of the terminal and also of
the sshconfig, so now I'm able to print without problems. But when I
run the script I've made, it gives me the same error again:
"Unexpected error: exceptions.UnicodeEncodeError"
Maybe I will try to update to 2.7

Vlastimil Brom · Dec 17, 2012

2012/12/17 Anatoli Hristov said:
I fixed the print: I changed the setting of the terminal and also of
the sshconfig, so now I'm able to print without problems. But when I
run the script I've made, it gives me the same error again:
"Unexpected error: exceptions.UnicodeEncodeError"
Maybe I will try to update to 2.7

Well, we don't see the context or traceback of that error, but it
looks like a mysql error on inserting data.
Could it be that your database is not unicode enabled, e.g. utf-8,
but, say, latin-1?
I don't have experience with this database, but I guess there
must be some configuration options for this.
Would setting the encoding in db.connect(...) maybe work?
cf.:
http://stackoverflow.com/questions/8365660/python-mysql-unicode-and-encoding
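
A sketch of what setting the encoding at connection time could look like. It assumes a MySQLdb version that accepts the charset and use_unicode keyword arguments, and that the table columns themselves are stored as utf8. The credentials and database name are placeholders, the helper names are hypothetical, and none of this has been tested against the setup in this thread.

```python
def utf8_connect_kwargs(host, user, passwd, db):
    """Build keyword arguments for MySQLdb.connect with a UTF-8 connection."""
    return {
        "host": host,
        "user": user,
        "passwd": passwd,
        "db": db,
        "charset": "utf8",     # character set used on the client connection
        "use_unicode": True,   # return text columns as unicode objects
    }

def open_connection(**kwargs):
    # Imported here so the sketch can be read and checked without the driver.
    import MySQLdb
    return MySQLdb.connect(**kwargs)

# Placeholder values; replace with real settings before calling open_connection.
kwargs = utf8_connect_kwargs("localhost", "example_user", "example_pw", "example_db")
print(sorted(kwargs))
```

With such a connection, the value passed to cursor.execute would be a decoded unicode string rather than raw bytes, so the driver does the encoding.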

Hopefully, others might give more reliable suggestions..

hth,
vbr

Anatoli Hristov · Dec 17, 2012

I fixed the print: I changed the setting of the terminal and also of
the sshconfig, so now I'm able to print without problems. But when I
run the script I've made, it gives me the same error again:
"Unexpected error: exceptions.UnicodeEncodeError"
Maybe I will try to update to 2.7

Upgraded to Python 2.7 and it still gives Unexpected error:
exceptions.UnicodeEncodeError. Damn encoders, I don't know what to
do...

Dave Angel · Dec 17, 2012

That's not the whole error message. What encoding does it report in the
error?

Maybe I will try to update to 2.7

Upgraded to Python 2.7 and it still gives Unexpected error:
exceptions.UnicodeEncodeError. Damn encoders, I don't know what to
do...

I doubted that 2.7 would make any difference.

1. What does your "terminal" expect? (For all I know you're using
TeraTermPro as a terminal, which doesn't support utf-8.)
Have you looked at the terminal encoding to see what your copy of
Terminal is expecting? On my Ubuntu Linux, I open the terminal with
Ctrl-Alt-t, then in the menu bar, I select
Terminal->SetCharacterEncoding->utf-8

2. What does your environment tell Linux to support? At a bash prompt, try
echo $LANG (there are two other environment variables I've seen
reference to, so this aspect is nuts)

Mine says
en_US.UTF-8

3. What does Python think it was told?
import sys
print sys.stdout.encoding

Mine says
UTF-8
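
The checks in points 2 and 3 can be gathered in one small script. This is a sketch in Python 3 syntax; encoding_report is a hypothetical helper. It assumes the two other environment variables are LC_ALL and LC_CTYPE, which the post does not name. It only reports what the system and Python believe; it cannot see what the terminal program itself is set to display, which is point 1.

```python
import locale
import os
import sys

def encoding_report():
    """Collect the settings that influence how printed text is encoded."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        "stdout": getattr(sys.stdout, "encoding", None),
        "preferred": locale.getpreferredencoding(),
    }

report = encoding_report()
for key in sorted(report):
    print("%-10s %s" % (key, report[key]))
```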

I can force a similar error as follows:

import urllib
opener = urllib.FancyURLopener({})
ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
                  % (14688538))
src = ffr.read()

out = src.decode("utf-8").encode("latin-1")

Traceback (most recent call last):
  File "anatoli3.py", line 9, in <module>
    src.decode("utf-8").encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122' in position 17167: ordinal not in range(256)

And from that it's quite clear that for that particular data, I cannot
use a latin-1 encoder.

So I did a bit of hunting, and I find the offending character is the one
after the word "Core" in the following quote:

processeurs Intel® Core™ de 3ème génération

The symbol is a trademark symbol and is not part of latin-1. If you're
really stuck with a latin-1 terminal, then you could do something like:

print src.decode("utf-8").encode("latin-1", "ignore")

That says to decode it using utf-8 (because the html declared a utf-8
encoding), and encode it back to latin-1 (because your terminal is stuck
there), then print.

Just realize that once you start using 'ignore' you're going to also
ignore discrepancies that are real. For example, maybe your terminal is
actually something other than either latin-1 or utf-8.

For others that just want to play with a minimal subset:

test = u'processeurs Intel\xae Core\u2122 de 3\xe8me g\xe9n\xe9ration av'
print test
print test.encode("latin-1", "ignore")
print test.encode("latin-1")

produces :

processeurs IntelÂ® Coreâ„¢ de 3Ã¨me gÃ©nÃ©ration av
processeurs Intelï¿½ Core de 3ï¿½me gï¿½nï¿½ration av
Traceback (most recent call last):
File "anatoli3.py", line 22, in <module>
print test.encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122' in
position 23: ordinal not in range(256)

Hans Mulder · Dec 17, 2012

print src.decode("utf-8").encode("latin-1", "ignore")

That says to decode it using utf-8 (because the html declared a utf-8
encoding), and encode it back to latin-1 (because your terminal is stuck
there), then print.

Just realize that once you start using 'ignore' you're going to also
ignore discrepancies that are real. For example, maybe your terminal is
actually something other than either latin-1 or utf-8.

If you need to see such discrepancies, you can do

print src.decode("utf-8").encode("latin-1", "xmlcharrefreplace")

That would produce something like:

processeurs Intel® Core&#8482; de 3ème génération av

that is, the problem characters are displayed in &#...; notation.
That is ugly, but sometimes it's the only way to see what character
you really have.

Notice that the number you get is in decimal, whereas the \u....
notation uses hex: 8482 in decimal is 2122 in hex, so &#8482; and
\u2122 are the same character.
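
A short sketch of that relationship, in Python 3 syntax with an illustrative sample string: the character reference produced by xmlcharrefreplace carries the code point in decimal, and converting it to hex gives the digits used in the \u escape.

```python
sample = u"Intel\xae Core\u2122"

# Characters missing from latin-1 become &#NNNN; references (decimal).
encoded = sample.encode("latin-1", "xmlcharrefreplace")
print(encoded)          # b'Intel\xae Core&#8482;'

code_point = ord(u"\u2122")
print(code_point)       # 8482
print(hex(code_point))  # 0x2122
```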

Hope this helps,

-- HansM

Terry Reedy · Dec 17, 2012

Upgraded to Python 2.7 and it still gives Unexpected error:
exceptions.UnicodeEncodeError. Damn encoders, I don't know what to
do...

If you are working with unicode, and you can upgrade to 3.3, you will
probably be happier if you do. This does not solve all problems, but the
python side is definitely better. (I.e., there are unicode bugs in 2.7
whose fix *is* to upgrade to 3.3.)

That said, retrieving

http://prf.icecat.biz/index.cgi?pro...t;smi=product;shopname=openICEcat-url;lang=fr

with Firefox on Win 7 returns a page containing

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

so I presume the http encoding is also utf-8
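
A sketch of using that declaration to choose the codec instead of hard-coding it. It assumes Python 3 and that the charset appears in a meta tag near the top of the page; sniff_charset and the sample bytes are hypothetical. A real page may also declare its encoding in the HTTP Content-Type header, which this sketch does not read.

```python
import re

# Matches charset=utf-8 or charset="utf-8" in the raw (undecoded) bytes.
CHARSET_RE = re.compile(br'charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def sniff_charset(raw_html, default="utf-8"):
    """Return the charset named in the first 2048 bytes, or the default."""
    match = CHARSET_RE.search(raw_html[:2048])
    if match:
        return match.group(1).decode("ascii")
    return default

# Made-up page fragment standing in for the downloaded bytes.
raw = (b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
       + u"<p>l\xe9ger</p>".encode("utf-8"))

text = raw.decode(sniff_charset(raw))
print(sniff_charset(raw))
print(text.endswith(u"<p>l\xe9ger</p>"))
```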

Also: printing to the screen in IDLE may work better than with the
standard interactive console (especially the awful Windows version). I
have the font set to Lucida Sans Unicode (this may be Windows specific),
which seems to work for all BMP (Basic Multilingual Plane) chars.

Anatoli Hristov · Dec 17, 2012

I doubted that 2.7 would make any difference.

Yeah, this complicated my life even more: all my import functions were
gone, it took me 2h to fix them all, and it did not solve my issue.

1. What does your "terminal" expect? (For all I know you're using
TeraTermPro as a terminal, which doesn't support utf-8.)
Have you looked at the terminal encoding to see what your copy of
Terminal is expecting? On my Ubuntu Linux, I open the terminal with
Ctrl-Alt-t, then in the menu bar, I select
Terminal->SetCharacterEncoding->utf-8

I'm using PuTTY for Windows. I changed PuTTY to UTF-8 and this
is what solved the problem - ihuuuuuu

There is no logic, but it solved the issue!

2. What does your environment tell Linux to support? At a bash prompt, try
echo $LANG (there are two other environment variables I've seen
reference to, so this aspect is nuts)
Mine says
en_US.UTF-8

Mine too, US.UTF-8, but PuTTY was in latin1

3. What does Python think it was told?
import sys
print sys.stdout.encoding

Mine says
UTF-8

Mine too

Thank you Dave, you always come up with a solution.

Anatoli Hristov · Dec 17, 2012

Just realize that once you start using 'ignore' you're going to also

If you need to see such discrepancies, you can do

print src.decode("utf-8").encode("latin-1", "xmlcharrefreplace")

That would produce something like:

processeurs Intel® Core&#8482; de 3ème génération av

that is, the problem characters are displayed in &#...; notation.
That is ugly, but sometimes it's the only way to see what character
you really have.

Notice that the number you get is in decimal, whereas the \u....
notation uses hex: 8482 in decimal is 2122 in hex, so &#8482; and
\u2122 are the same character.

Thanks guys, my issue is now solved. The problem came from my PuTTY
client: it was on latin1 by default, and after changing it to utf-8 it
now works...
