Unicode problem

pabloski · Jul 7, 2007

Hi all, I have a small problem with Unicode handling in Python.

I have this code:

import codecs

s = u'A unicode string with this damn apostrophe \u2019'

outf = codecs.open('filename.txt', 'w', 'iso-8859-15')
outf.write(s)

What I get is a UnicodeEncodeError telling me that character \u2019
maps to <undefined>.

But the character \u2019 is the apostrophe, and in the Unicode table it has
\u0027 as an equivalent, so the codec should convert \u2019 to \x27 (as
defined in iso-8859-15)...

The problem is that my software deals with Italian strings that have a lot
of apostrophes and other similar symbols mapped between U+2000 and U+206F.

How can I resolve this issue? Should I preprocess the Unicode strings, or
is there a way to instruct Python to do the conversion?

Marc 'BlackJack' Rintsch · Jul 7, 2007

No, it shouldn't, because U+2019 is a "right single quotation mark" and not
an apostrophe.

Ciao,
Marc 'BlackJack' Rintsch
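
The distinction can be checked from Python itself by asking for the character names and then attempting the encoding. This is only a minimal sketch, assuming Python 3 (where every str is Unicode); the variable names are illustrative.

```python
import unicodedata

quote = '\u2019'
apostrophe = '\u0027'

# Look up the official Unicode names of both characters.
quote_name = unicodedata.name(quote)
apostrophe_name = unicodedata.name(apostrophe)

# ISO-8859-15 has no slot for U+2019, so a strict encode fails.
try:
    quote.encode('iso-8859-15')
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(quote_name, '|', apostrophe_name, '|', encodable)
```

The two names differ, and the strict encode raises, which matches the error in the original post.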

pabloski · Jul 7, 2007

I agree, but the problem is more subtle. I have converted a text from
iso-8859-1 to utf-8 and the codecs have translated \x27 (the iso
apostrophe) to \xe2\x80\x99 in utf-8 (or U+2019 in Unicode code point
notation).

So if they convert an apostrophe to a "right single quotation mark", why not
translate the "right single quotation mark" back to an apostrophe?

As far as I can see, it works in one direction but not in the other.

Guest · Jul 7, 2007

What software did you use to make that so? The Python codec certainly
never would do such a thing.

Are you sure it was latin-1 and \x27, and not windows-1252 and \x92?

Regards,
Martin
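
The difference Martin is asking about can be checked directly. A minimal sketch, assuming Python 3 and the standard latin-1 and cp1252 codecs (the variable names are illustrative):

```python
# Latin-1 maps byte 0x27 to U+0027 (APOSTROPHE) and nothing else.
from_latin1 = b'\x27'.decode('latin-1')

# Windows-1252 places the right single quotation mark at byte 0x92.
from_cp1252 = b'\x92'.decode('cp1252')

# Encoding U+2019 as UTF-8 gives the three bytes seen in the converted text.
utf8_bytes = from_cp1252.encode('utf-8')

print(repr(from_latin1), repr(from_cp1252), utf8_bytes)
```

So a U+2019 in the output points to a 0x92 byte read as Windows-1252, not to a 0x27 byte read as Latin-1.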

Alex Martelli · Jul 7, 2007

...
Ah, I answered you on the Italian NG before seeing you had also posted
the same request here. What I proposed there was (untested):

import codecs

# Characters the target codec lacks, mapped to an encodable replacement.
_rimedi = { u'\u2019': u"'" }

def rimedia(exc):
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        erore = exc.object[exc.start:exc.end]
        if len(erore) == 1 and erore in _rimedi:
            # A handler must return (replacement, position to resume at).
            return _rimedi[erore], exc.end
    raise exc

codecs.register_error('rimedia', rimedia)

outf = codecs.open('filename.txt', 'w', 'iso-8859-15', errors='rimedia')

Alex

Erik Max Francis · Jul 7, 2007

U+2019 is RIGHT SINGLE QUOTATION MARK. The APOSTROPHE (U+0027) is a
cross-reference as a similar code point, but they're not the same thing.

Your problem is that ISO-8859-15 doesn't have the RIGHT SINGLE QUOTATION
MARK, so you'll have to do the translation yourself if you want to turn
it into a true APOSTROPHE.
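
One way to do that translation by hand is a str.translate table applied before encoding. This is a sketch under assumptions: it targets Python 3, and the table contents and the helper name to_latin9 are hypothetical choices, not something from the thread.

```python
# Hypothetical fallback table: extend it with whatever punctuation the data contains.
PUNCTUATION_FALLBACKS = {
    0x2018: "'",    # LEFT SINGLE QUOTATION MARK
    0x2019: "'",    # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',    # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',    # RIGHT DOUBLE QUOTATION MARK
    0x2013: '-',    # EN DASH
    0x2026: '...',  # HORIZONTAL ELLIPSIS
}

def to_latin9(text):
    """Replace typographic punctuation, then encode as ISO-8859-15."""
    return text.translate(PUNCTUATION_FALLBACKS).encode('iso-8859-15')

encoded = to_latin9('L\u2019universit\u00e0 \u2026')
print(encoded)
```

Characters outside the table that ISO-8859-15 cannot represent still raise UnicodeEncodeError, which keeps unexpected input visible instead of silently altering it.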

pabloski · Jul 8, 2007

You're right, Martin... The source of the text is HTML pages, and evidently
the webmasters have poor knowledge of encodings: the meta tag declares the
encoding as ISO-8859-1, but the real encoding is Windows-1252, and yes, it
uses \x92 as the apostrophe. So the problem isn't Python.
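
Given that diagnosis, one possible workaround is to decode pages labelled ISO-8859-1 as Windows-1252 instead. The following is a rough sketch, assuming Python 3; the helper name decode_page and the set of labels are hypothetical, and real pages may need more careful detection.

```python
# Labels that, for this sketch, are assumed to really mean Windows-1252.
LATIN1_LABELS = {'iso-8859-1', 'latin-1', 'latin1'}

def decode_page(raw, declared):
    """Decode raw HTML bytes, treating a Latin-1 label as Windows-1252."""
    codec = 'cp1252' if declared.lower() in LATIN1_LABELS else declared
    # cp1252 leaves a few byte values undefined; 'replace' avoids a crash on them.
    return raw.decode(codec, errors='replace')

text = decode_page(b"l\x92acqua", 'ISO-8859-1')
print(repr(text))
```

With the text decoded this way, the 0x92 byte becomes U+2019, and the remaining question is only how to write that character to an ISO-8859-15 file.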
