Unicode chr(150) en dash

marexposed

Hello guys & girls

I'm pasting an "en dash" (http://www.fileformat.info/info/unicode/char/2013/index.htm) character into a tkinter widget, expecting it to be properly stored into a MySQL database.

I'm getting this error:
*****************************************************************************
Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Python24\lib\lib-tk\Tkinter.py", line 1345, in __call__
    return self.func(*args)
  File "chupadato.py", line 25, in guardar
    cursor.execute(a)
  File "C:\Python24\Lib\site-packages\MySQLdb\cursors.py", line 149, in execute
    query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 52: ordinal not in range(256)
*****************************************************************************

Variable 'a' in 'cursor.execute(a)' contains a proper SQL statement, which includes the 'en dash' character just pasted into the Text widget.

When I type 'print chr(150)' into a python command line window I get a LATIN SMALL LETTER U WITH CIRCUMFLEX (http://www.fileformat.info/info/unicode/char/00fb/index.htm), but when I do so in an IDLE window I get a hyphen (chr(45)).

The funny thing I don't quite understand is that when I paste that 'en dash' character into a python command window (I'm using MSWindows), the character is conveniently converted to chr(45), which is a hyphen (I wouldn't mind doing that in code, I mean 'adapting' by visual similarity).
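
On the "adapting by visual similarity" idea, here is a minimal sketch, assuming Python 3 (the thread itself is on Python 2.4, where the same approach would apply to unicode objects). The table LOOKALIKES and the function to_latin1_lookalike are hypothetical names made up for illustration, and the table covers only a few common punctuation marks. The conversion is lossy: the en dash really is replaced, not stored.

```python
# Hypothetical lookalike table: Unicode punctuation -> Latin-1-safe stand-ins.
LOOKALIKES = {
    0x2013: "-",   # EN DASH
    0x2014: "-",   # EM DASH
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK
}


def to_latin1_lookalike(text):
    """Swap known lookalikes, then encode to Latin-1.

    Any remaining character that Latin-1 cannot represent becomes '?'.
    """
    return text.translate(LOOKALIKES).encode("latin-1", errors="replace")


print(to_latin1_lookalike("pages 10\u201320"))
```

If the database can store Unicode, keeping the original character is usually preferable to substituting it.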

I tried searching for "en dash" or even "dash" in the encodings folder of the python Lib, but I couldn't find anything.

I'm using Windows Vista English, Python 2.4, latest MySQLdb. Default encoding changed (while testing) to "iso-8859-1".

Thanks for any help.
 
Martin v. Löwis

marexposed said:
File "C:\Python24\Lib\site-packages\MySQLdb\cursors.py", line 149, in execute
query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 52: ordinal not in range(256)

Here it complains that it deals with the character U+2013, which
is "EN DASH"; it complains that the encoding called "latin-1" does
not support that character.
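
The failure can be reproduced without MySQLdb or Tkinter. This is a small sketch, assuming Python 3; the SQL statement is a hypothetical stand-in for the contents of variable 'a', since the real statement is not shown in the thread.

```python
# Hypothetical statement containing U+2013 EN DASH between "10" and "20".
query = "INSERT INTO notes (body) VALUES ('pages 10\u201320')"

try:
    query.encode("latin-1")      # what the traceback shows cursors.py doing
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)                   # the 'latin-1' codec cannot encode '\u2013'

# UTF-8 can represent every Unicode character, so this succeeds.
as_utf8 = query.encode("utf-8")
print(as_utf8)
```

The same string encodes cleanly once the target encoding can represent the character, which is why the connection charset matters.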

That is a fact: Latin-1 does not support EN DASH.

marexposed said:
When I type 'print chr(150)' into a python command line window I get a LATIN SMALL LETTER U WITH CIRCUMFLEX (http://www.fileformat.info/info/unicode/char/00fb/index.htm),

That's because your console uses the code page 437:

py> import unicodedata
py> chr(150).decode("cp437")
u'\xfb'
py> unicodedata.name(_)
'LATIN SMALL LETTER U WITH CIRCUMFLEX'

Code page 437, on your system, is the "OEM code page".

marexposed said:
but when I do so in an IDLE window I get a hyphen (chr(45)).

That's because IDLE uses the "ANSI code page" of your system,
which is windows code page 1252.

py> chr(150).decode("windows-1252")
u'\u2013'
py> unicodedata.name(_)
'EN DASH'
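
The two sessions above are Python 2. As a rough Python 3 equivalent (an illustration, not part of the original reply), the same byte can be decoded under all three encodings discussed in this thread to show that the byte value 150 means something different in each:

```python
import unicodedata

raw = bytes([150])  # the single byte 0x96

for codec in ("cp437", "cp1252", "latin-1"):
    ch = raw.decode(codec)
    # Control characters have no Unicode name, hence the fallback string.
    print(codec, "U+%04X" % ord(ch), unicodedata.name(ch, "<control>"))
```

Under latin-1 the byte decodes to U+0096, a control character, which is why neither a dash nor a circumflexed u shows up there.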

You actually *don't* get the character U+002D, HYPHEN-MINUS,
displayed - just a character that has, in your font, a glyph
which looks similar to the glyph for HYPHEN-MINUS.
However, HYPHEN-MINUS and EN DASH are different characters, and
IDLE displays the latter, not the former.

marexposed said:
I tried searching for "en dash" or even "dash" in the encodings folder of the python Lib, but I couldn't find anything.

You didn't ask a specific question, so I assume you are primarily
after an explanation.

HTH,
Martin
 
John Nagle

Python and MySQL will do end to end Unicode quite well. But that's
not what you're doing. How did "latin-1" get involved?

If you want to use MySQL in Unicode, there are several things to be done.
First, the connection has to be opened in Unicode:

import MySQLdb

# username, password and database are your own connection settings
db = MySQLdb.connect(host="localhost",
                     use_unicode=True, charset="utf8",
                     user=username, passwd=password, db=database)

Yes, you have to specify both "use_unicode=True", which tells the client
to talk Unicode, and set "charset" to "utf8", which tells the server
to talk Unicode encoded as UTF-8.

Then the tables need to be in Unicode. In SQL,

ALTER DATABASE dbname DEFAULT CHARACTER SET utf8;

before creating the tables. You can also change the types of
existing tables and even individual fields to utf8, if necessary.
(This takes time for big tables; the table is copied. But it works.)

It's possible to get MySQL to store character sets other than
ASCII or Unicode; you can store data in "latin1" if you want. This
might make sense if, for example, all your data is in French or German,
which maps well to "latin1". Unless that's your situation, go with
either all-ASCII or all-Unicode. It's less confusing.

John Nagle
 
marexposed

Thank you Martin and John, for your excellent explanations.

I think I understand the basic principles of Unicode; what confuses me is the use different applications make of it.

For example, I got that EN DASH out of a web page which states <?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I went for that encoding. But if the browser can properly decode that character using that encoding, how come other applications can't?

I might need to go for python's htmllib to avoid this, not sure. But if I don't, if I only want to copy and paste some web page text into a tkinter Text widget, what should I do to successfully make every single character go all the way from the widget and out of tkinter into a python string variable? How did my browser know it should render an EN DASH instead of a circumflexed lowercase u?

This is the webpage in case you are interested, 4th line of first paragraph, there is the EN DASH: http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-04-15.html

Thanks a lot.
 
s0suk3

Just write in English. Like this, see? No encodings mess.
 
Richard Brodie

marexposed said:
I think I understand the basic principles of Unicode; what confuses me is the use different applications make of it.

For example, I got that EN DASH out of a web page which states <?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I went for that encoding. But if the browser can properly decode that character using that encoding, how come other applications can't?

Browsers tend to guess what the author intended a lot. In particular, they fudge the difference between ISO8859-1 and Windows-1252. http://en.wikipedia.org/wiki/Windows-1252
 
Martin v. Löwis

marexposed said:
For example, I got that EN DASH out of a web page which states <?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I went for that encoding. But if the browser can properly decode that character using that encoding, how come other applications can't?

Please do trust us that ISO-8859-1 does *NOT* support EN DASH.

There are two possible explanations for the behavior you observed:
a) even though the file was declared ISO-8859-1, the data in it
actually didn't use that encoding. The browser somehow found out,
and chose a different encoding from the declared one.
b) the web page contained the character reference &#8211; (or &#x2013;),
or the entity reference &ndash;. XML can represent arbitrary
Unicode characters this way, even in a file that is encoded with ASCII.
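
To illustrate explanation b), here is a short sketch assuming Python 3 and its standard html module (the thread predates it; on Python 2.4 the poster would have had htmllib instead). It shows that a pure-ASCII document can still carry an EN DASH through a character or entity reference. The sample text is made up.

```python
import html

# Decimal reference, hexadecimal reference, and named entity for EN DASH.
for ref in ("&#8211;", "&#x2013;", "&ndash;"):
    print(ref, "->", "U+%04X" % ord(html.unescape(ref)))

# The file itself contains only ASCII bytes ...
ascii_only = "1914&ndash;1918".encode("ascii")
# ... yet the decoded text contains U+2013 once references are resolved.
decoded = html.unescape(ascii_only.decode("ascii"))
print(repr(decoded))

# html.unescape follows the HTML5 rules, under which a numeric reference in
# the 128-159 range, such as &#150;, is read as its Windows-1252 character.
legacy = html.unescape("&#150;")
print("U+%04X" % ord(legacy))
```

That last case is the browser-style leniency discussed elsewhere in this thread; a strict XML parser is not obliged to behave that way.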

marexposed said:
I might need to go for python's htmllib to avoid this, not sure. But if I don't, if I only want to copy and paste some web page text into a tkinter Text widget, what should I do to successfully make every single character go all the way from the widget and out of tkinter into a python string variable? How did my browser know it should render an EN DASH instead of a circumflexed lowercase u?

Read the source of the web page to be certain.

marexposed said:
This is the webpage in case you are interested, 4th line of first paragraph, there is the EN DASH:
http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-04-15.html

Ok, this says – in several places, as well as “ and ”

HTH,
Martin
 
hdante

marexposed said:
For example, I got that EN DASH out of a web page which states <?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I went for that encoding. But if the browser can [...]

There's a trick here. Blame lax web standards and companies that
don't like standards.

There's no EN DASH in ISO-8859-1. The first 256 characters in Unicode
are the same as ISO-8859-1, but EN DASH is U+2013, far outside that range.

The character code in question (which is present in the page), 150,
doesn't exist in ISO-8859-1. See

http://en.wikipedia.org/wiki/ISO/IEC_8859-1 (the entry for 150 is
blank)

The character 150 exists in Windows-1252, however, which is a non-
standard clone of ISO-8859-1.

http://en.wikipedia.org/wiki/Windows-1252

Who is wrong ?
- The guy who wrote the web site
- The browser that does the trick.
- Everybody for using a non-standard encoding
- Everybody for using an outdated 8-bit encoding.

Don't use old 8-bit encodings. Use UTF-8.
 
marexposed

hdante said:
Don't use old 8-bit encodings. Use UTF-8.

Yes, I'll try. But it's a problem when I only want to read; I'm not trying to write or create the content.
I suppose Microsoft's commercial success is to blame. They won't adhere to standards if that doesn't make sense for their business.

I'll change the approach, trying to filter the contents with htmllib and mapping those troublesome characters on my own.
Anyway, this has been a very instructive dive into unicode for me, and I've got things cleared up now.

Thanks to everyone for the great help.
 
J. Clifford Dyer

There are a number of code points (150 being one of them) that are used
in cp1252, which are reserved for control characters in ISO-8859-1.
Those characters will pretty much never be used in ISO-8859-1 documents.
If you're expecting documents of both types coming in, test for the
presence of those characters, and assume cp1252 for those documents.

Something like:

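# Bytes 0x80-0x9F: C1 control characters in ISO-8859-1, mostly printable
# in cp1252. (Python 2: chr() returns a one-byte str.)
control_chars = [chr(i) for i in range(0x80, 0xA0)]

for c in control_chars:
    if c in encoded_text:
        unicode_text = encoded_text.decode('cp1252')
        break
else:
    unicode_text = encoded_text.decode('latin-1')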

Note that the else matches the for, not the if.

You can figure out the characters to match on by looking at the
wikipedia pages for the encodings.

Cheers,
Cliff
 
J. Clifford Dyer

One warning: This works if you know all your documents are in one of
those two encodings, but you could break other encodings, like UTF-8,
this way. Fortunately UTF-8 is a pretty fragile encoding, so it's easy
to break. You can usually test if a document is decent UTF-8 just by
wrapping the decode in a try/except block:

try:
    unicode_text = encoded_text.decode('utf-8')
except UnicodeDecodeError:  # decode() raises UnicodeDecodeError, not UnicodeEncodeError
    pass  # do the stuff above

None of these are perfect methods, but then again, if text encoding
detection were a perfect science, python could just handle it on its
own.

If in doubt, prompt the user for confirmation.
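
Putting the two heuristics from this post together, here is a rough Python 3 sketch. The function name decode_guess is hypothetical, and the order of the checks (UTF-8 first, then cp1252 if C1 bytes are present, then Latin-1) is one reasonable reading of the advice above, not an established recipe.

```python
def decode_guess(data):
    """Guess between UTF-8, cp1252 and Latin-1 for a bytes object.

    Heuristic only: it assumes the input is in one of these three encodings.
    """
    # 1. Valid UTF-8 is unlikely to occur by accident in 8-bit text.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # 2. Bytes 0x80-0x9F are control codes in Latin-1 but mostly printable
    #    characters in cp1252, so their presence suggests cp1252.
    if any(0x80 <= b <= 0x9F for b in data):
        try:
            return data.decode("cp1252")
        except UnicodeDecodeError:
            pass  # a few bytes, such as 0x81, are unassigned in cp1252

    # 3. Latin-1 maps every byte to a character, so this never fails.
    return data.decode("latin-1")


print(repr(decode_guess(b"pages 10\x9620")))
```

Because the Latin-1 fallback accepts any input, the function always returns something; whether the result is right still depends on the assumption about which encodings can occur.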

Maybe others can share better "best practices."

Cheers,
Cliff
 
John Machin

hdante said:
The character code in question (which is present in the page), 150,
doesn't exist in ISO-8859-1.

Are you sure? Consider (re-)reading all of the Wikipedia article.

150 aka \x96 doesn't exist in ISO 8859-1. ISO-8859-1 (two hyphens) is a
superset of ISO 8859-1 (one hyphen) and adds the not-very-useful-AFAICT
control codes \x80 to \x9F.

You must have been looking at the table of the "lite" ISO 8859-1 (one
hyphen). Reading further you will see \x96 described as SPA or "Start of
Guarded Area". Then there is the ISO-8859-1 (two hyphens) table,
including \x96.
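
A quick way to see this from Python, sketched here for Python 3: the built-in "latin-1" codec behaves like the version with the control codes, mapping every one of the 256 byte values to the Unicode code point with the same number.

```python
import unicodedata

every_byte = bytes(range(256))
text = every_byte.decode("latin-1")   # never raises: all 256 bytes are mapped

# Each byte value n decodes to the code point U+00nn.
same_numbers = [ord(c) for c in text] == list(range(256))
print(same_numbers)

# \x96 is therefore a valid character in this codec, just not a printable
# one: its Unicode general category is "Cc" (control).
print(unicodedata.category(text[0x96]))
```

So decoding byte 150 as latin-1 does not fail; it silently produces a control character instead of a dash.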

HTH,
John
 
hdante

Sorry, that's right, I should have been referring to the second
table.
 
Martin v. Löwis

John Machin said:
150 aka \x96 doesn't exist in ISO 8859-1. ISO-8859-1 (two hyphens) is a
superset of ISO 8859-1 (one hyphen) and adds the not-very-useful-AFAICT
control codes \x80 to \x9F.

To disambiguate the two, when I want to refer to the one with the
control characters, I use the name "IANA ISO-8859-1" or "the IANA
version of Latin-1", or some such, to reflect the fact that it's
not the ISO standard, but the (unfortunately differing) IANA
registration thereof.

Regards,
Martin
 
