Unicode problem

pabloski

Hi all, I have a small problem with Unicode handling in Python.

I have this code:

import codecs

s = u'A unicode string with this damn apostrophe \u2019'

outf = codecs.open('filename.txt', 'w', 'iso-8859-15')
outf.write(s)

What I get is a UnicodeEncodeError telling me that the character \u2019
maps to undefined.

But the character \u2019 is the apostrophe, and in the Unicode table it has
\u0027 as an equivalent, so the codec should convert \u2019 to \x27 (as
defined in iso-8859-15)...

The problem is that my software deals with Italian strings that contain a lot
of apostrophes and other similar symbols mapped between U+2000 and U+206F.

How can I resolve this issue? Should I preprocess the Unicode strings, or
is there a way to instruct Python to do the conversion?
 
Marc 'BlackJack' Rintsch

No, it shouldn't, because \u2019 is a "right single quotation mark" and not
an apostrophe.

Ciao,
Marc 'BlackJack' Rintsch
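
The distinction can be checked against the Unicode character database. This is a minimal sketch, assuming Python 3 (where every str literal is Unicode); the variable names are only illustrative:

```python
import unicodedata

quote = '\u2019'       # the character from the original post
apostrophe = '\u0027'  # the ASCII apostrophe

print(unicodedata.name(quote))       # RIGHT SINGLE QUOTATION MARK
print(unicodedata.name(apostrophe))  # APOSTROPHE

# U+2019 has no decomposition mapping, so Unicode normalization
# will not turn it into an apostrophe either.
has_decomposition = unicodedata.decomposition(quote) != ''
normalized = unicodedata.normalize('NFKD', quote)
print(has_decomposition, normalized == quote)  # False True
```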
 
pabloski

I agree, but the problem is more subtle. I converted a text from
iso-8859-1 to utf-8, and the codecs translated \x27 (the ISO
apostrophe) to \xe2\x80\x99 in utf-8 (or u'\u2019' in Unicode code point
notation).

So if they convert an apostrophe to a "right single quotation mark", why not
translate the "right single quotation mark" back to an "apostrophe"?

As far as I can see, it works in one direction but not in the other.
 
Guest

What software did you use to make that so? The Python codec certainly
never would do such a thing.

Are you sure it was latin-1 and \x27, and not windows-1252 and \x92?

Regards,
Martin
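
The two cases Martin contrasts can be told apart by decoding single bytes. This is a sketch assuming Python 3; the variable names are illustrative and not part of the thread:

```python
# Two different bytes that can both render as an apostrophe-like glyph.
ascii_apostrophe = b"\x27"
cp1252_quote = b"\x92"

# Latin-1 maps every byte to the code point with the same number,
# so 0x27 stays U+0027 and 0x92 becomes the control character U+0092.
from_latin1_27 = ascii_apostrophe.decode("iso-8859-1")
from_latin1_92 = cp1252_quote.decode("iso-8859-1")

# Windows-1252 assigns 0x92 to U+2019 RIGHT SINGLE QUOTATION MARK.
from_cp1252_92 = cp1252_quote.decode("cp1252")

print(ascii(from_latin1_27))           # "'"
print(ascii(from_latin1_92))           # '\x92'
print(ascii(from_cp1252_92))           # '\u2019'
print(from_cp1252_92.encode("utf-8"))  # b'\xe2\x80\x99'
```

Only the Windows-1252 reading of \x92 produces the UTF-8 bytes \xe2\x80\x99; a Latin-1 \x27 stays a plain apostrophe.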
 
Alex Martelli

...
Ah, I answered you on the Italian NG before seeing you had also posted
the same request here. What I proposed there was (untested):

import codecs

# Characters the target codec cannot encode, mapped to stand-ins.
_rimedi = {u'\u2019': u"'"}

def rimedia(exc):
    # An error handler must return (replacement, position to resume at).
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        errore = exc.object[exc.start:exc.end]
        if len(errore) == 1 and errore in _rimedi:
            return _rimedi[errore], exc.end
    raise exc

codecs.register_error('rimedia', rimedia)

outf = codecs.open('filename.txt', 'w', 'iso-8859-15', errors='rimedia')


Alex
 
Erik Max Francis

U+2019 is RIGHT SINGLE QUOTATION MARK. The APOSTROPHE (U+0027) is
cross-referenced as a similar code point, but they're not the same thing.

Your problem is that ISO-8859-15 doesn't have the RIGHT SINGLE QUOTATION
MARK, so you'll have to do the translation yourself if you want to turn
it into a true APOSTROPHE.
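
One way to do that translation by hand is a lookup table applied before encoding. This is a minimal sketch, assuming Python 3; `PUNCTUATION_FALLBACKS` and `to_latin9` are hypothetical names, and the table covers only a few characters from the U+2000 to U+206F block:

```python
# Hypothetical mapping; extend it with whatever characters the data contains.
PUNCTUATION_FALLBACKS = {
    0x2018: "'",    # LEFT SINGLE QUOTATION MARK
    0x2019: "'",    # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',    # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',    # RIGHT DOUBLE QUOTATION MARK
    0x2013: "-",    # EN DASH
    0x2014: "-",    # EM DASH
    0x2026: "...",  # HORIZONTAL ELLIPSIS
}

def to_latin9(text):
    """Replace typographic punctuation, then encode as ISO-8859-15."""
    return text.translate(PUNCTUATION_FALLBACKS).encode("iso-8859-15")

s = "A unicode string with this damn apostrophe \u2019"
encoded = to_latin9(s)
print(encoded)  # b"A unicode string with this damn apostrophe '"
```

Characters missing from the table still raise UnicodeEncodeError, so unexpected input is reported rather than silently altered.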
 
pabloski

You're right. The source texts are HTML pages, and evidently the webmasters
have a poor knowledge of encodings: the meta tag declares the encoding as
ISO-8859-1, but the real encoding is Windows-1252, which does use \x92 as
the apostrophe. So the problem isn't Python.
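
For pages mislabeled this way, one workaround is to treat a declared Latin-1 as Windows-1252 when decoding. This is a sketch assuming Python 3; `decode_page` and `MISLABELED_AS_LATIN1` are hypothetical names, and the assumption that every Latin-1 label really means Windows-1252 only holds for data like that described above:

```python
# Labels assumed (for these pages) to really mean Windows-1252.
MISLABELED_AS_LATIN1 = {"iso-8859-1", "iso8859-1", "latin-1", "latin1"}

def decode_page(raw, declared):
    """Decode raw bytes, treating a declared Latin-1 as Windows-1252."""
    label = declared.strip().lower()
    if label in MISLABELED_AS_LATIN1:
        try:
            return raw.decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 leaves a few bytes (such as 0x81) undefined;
            # fall back to the declared encoding for those pages.
            return raw.decode("iso-8859-1")
    return raw.decode(label)

page = b"l\x92apostrofo"
text = decode_page(page, "ISO-8859-1")
print(ascii(text))  # 'l\u2019apostrofo'
```

The decoded text then contains U+2019, which still has to be mapped to an apostrophe before it can be written as ISO-8859-15.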
 
