Unicode problem

Discussion in 'Python' started by pabloski@giochinternet.com, Jul 7, 2007.

  1. Guest

    Hi to all, I have a little problem with unicode handling under Python.

    I have this code:

    import codecs

    s = u'A unicode string with this damn apostrophe \u2019'

    outf = codecs.open('filename.txt', 'w', 'iso-8859-15')
    outf.write(s)

    What I get is a UnicodeEncodeError telling me that character U+2019
    maps to undefined.

    But the character U+2019 is the apostrophe, and in the Unicode table it has
    U+0027 as an equivalent, so the codec should convert U+2019 to \x27 (as
    defined in iso-8859-15)...

    The problem is that my software deals with Italian strings that have a lot
    of apostrophes and other similar symbols mapped between 2000 and 206F.

    How can I resolve this issue? Should I preprocess the unicode strings, or
    is there a way to instruct Python to do the conversion?
    Jul 7, 2007
    #1
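
One thing worth checking in string literals like the one above: in Python, `\x` takes exactly two hex digits, while `\u` takes four. The sketch below (Python 3, variable names chosen only for illustration) shows the difference and reproduces the encoding failure described in the post.

```python
# '\x2019' is parsed as '\x20' (a space) followed by the digits '1' and '9'.
two_digit_escape = '\x2019'
# '\u2019' is the single code point U+2019.
four_digit_escape = '\u2019'

print(len(two_digit_escape), repr(two_digit_escape))
print(len(four_digit_escape))

# ISO-8859-15 has no slot for U+2019, so a strict encode raises.
try:
    four_digit_escape.encode('iso-8859-15')
    encoded_ok = True
except UnicodeEncodeError as err:
    encoded_ok = False
    print(type(err).__name__, err.reason)
```

Only the four-digit form produces the character the post is asking about.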

  2. On Sat, 07 Jul 2007 16:06:03 +0000, wrote:

    > Hi to all, I have a little problem with unicode handling under Python.
    >
    > I have this code
    >
    > s = u'A unicode string with this damn apostrophe \u2019'
    >
    > outf = codecs.open('filename.txt', 'w', 'iso-8859-15')
    > outf.write(s)
    >
    > what I obtain is a UnicodeEncodeError that says me that character \x2019
    > maps to undefined.
    >
    > But the character \x2019 is the apostrophe and in the unicode table it has
    > \x0027 as an equivalent, so the codecs should convert \x2019 to \x27 ( as
    > defined in iso-8859-15 )....


    No it shouldn't because \x2019 is a "right single quotation mark" and not
    an apostrophe.

    Ciao,
    Marc 'BlackJack' Rintsch
    Marc 'BlackJack' Rintsch, Jul 7, 2007
    #2

  3. Guest

    > No it shouldn't because \x2019 is a "right single quotation mark" and not
    > an apostrophe.
    >
    > Ciao,
    > Marc 'BlackJack' Rintsch



    I agree, but the problem is more subtle. I have converted a text from
    iso-8859-1 to utf-8, and the codecs have translated \x27 (the iso
    apostrophe) to \xe2\x80\x99 in utf-8 (or U+2019 in Unicode code point
    notation).

    So if it converts an apostrophe to a "right single quotation mark", why not
    translate the "right single quotation mark" back to an "apostrophe"?

    As far as I can see, it works in one direction but not in the other.
    Jul 7, 2007
    #3
  4. > I agree, but the problem is much subtle. I have coverted a text from
    > iso-8859-1 to utf-8 and the codecs have translated \x27 ( the iso
    > apostrophe ) to \xe28099 in utf-8 ( or u'2019' in unicode code point
    > notation )


    What software did you use to make that so? The Python codec certainly
    never would do such a thing.

    Are you sure it was latin-1 and \x27, and not windows-1252 and \x92?

    Regards,
    Martin
    Martin v. Löwis, Jul 7, 2007
    #4
  5. <> wrote:
    ...
    Ah, I answered you on the Italian NG before seeing you had also posted
    the same request here. What I proposed there was (untested):

    import codecs

    # Characters the target encoding lacks, mapped to acceptable stand-ins.
    _rimedi = {u'\u2019': u"'"}

    def rimedia(exc):
        if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
            errore = exc.object[exc.start:exc.end]
            if len(errore) == 1 and errore in _rimedi:
                # A handler must return (replacement, position to resume at).
                return _rimedi[errore], exc.end
        raise exc

    codecs.register_error('rimedia', rimedia)

    outf = codecs.open('filename.txt', 'w', 'iso-8859-15', errors='rimedia')


    Alex
    Alex Martelli, Jul 7, 2007
    #5
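
A self-contained sketch of the same error-handler idea, runnable on Python 3 and checked in memory rather than against a file. The handler name `quote_fallback` and the mapping are illustrative assumptions, not something from the thread. It also accepts a run of several unencodable characters, since the codec may report consecutive failures as a single slice.

```python
import codecs

# Hypothetical replacement table: typographic quotes -> ASCII look-alikes.
_FALLBACKS = {
    '\u2018': "'",
    '\u2019': "'",
    '\u201c': '"',
    '\u201d': '"',
}

def quote_fallback(exc):
    """Replace known typographic quotes; re-raise for anything else."""
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        if all(ch in _FALLBACKS for ch in bad):
            # An error handler returns (replacement, position to resume at).
            return ''.join(_FALLBACKS[ch] for ch in bad), exc.end
    raise exc

codecs.register_error('quote_fallback', quote_fallback)

text = 'l\u2019acqua \u201cfresca\u201d'
encoded = text.encode('iso-8859-15', errors='quote_fallback')
print(encoded)
```

Characters missing from the table still raise UnicodeEncodeError, so unexpected input is not silently altered.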
  6. wrote:

    > Hi to all, I have a little problem with unicode handling under Python.
    >
    > I have this code
    >
    > s = u'A unicode string with this damn apostrophe \u2019'
    >
    > outf = codecs.open('filename.txt', 'w', 'iso-8859-15')
    > outf.write(s)
    >
    > what I obtain is a UnicodeEncodeError that says me that character \x2019
    > maps to undefined.
    >
    > But the character \x2019 is the apostrophe and in the unicode table it has
    > \x0027 as an equivalent, so the codecs should convert \x2019 to \x27 ( as
    > defined in iso-8859-15 )....


    U+2019 is RIGHT SINGLE QUOTATION MARK. The APOSTROPHE (U+0027) is a
    cross-reference as a similar code point, but they're not the same thing.

    Your problem is that ISO-8859-15 doesn't have the RIGHT SINGLE QUOTATION
    MARK, so you'll have to do the translation yourself if you want to turn
    it into a true APOSTROPHE.

    --
    Erik Max Francis && && http://www.alcyone.com/max/
    San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
    She glanced at her watch ... It was 9:23.
    -- James Clavell
    Erik Max Francis, Jul 7, 2007
    #6
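
Doing the translation yourself could look roughly like the following Python 3 sketch. The table `QUOTE_TABLE` and the helper `to_latin9` are hypothetical names introduced here, and the choice of replacements is an assumption about what the application considers acceptable. The first lines confirm the character names, and that Unicode normalization does not fold U+2019 into U+0027.

```python
import unicodedata

print(unicodedata.name('\u2019'))
print(unicodedata.name('\u0027'))

# Compatibility normalization leaves U+2019 untouched.
normalized = unicodedata.normalize('NFKC', '\u2019')

# Hypothetical mapping from code points to ASCII stand-ins.
QUOTE_TABLE = {
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK
}

def to_latin9(text):
    """Substitute the mapped characters, then encode as ISO-8859-15."""
    return text.translate(QUOTE_TABLE).encode('iso-8859-15')

print(to_latin9('l\u2019euro \u20ac'))
```

Preprocessing with `str.translate` keeps the substitution explicit and independent of the codec machinery; anything outside the table still fails loudly at encode time.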
  7. Guest

    >
    > What software did you use to make that so? The Python codec certainly
    > never would do such a thing.
    >
    > Are you sure it was latin-1 and \x27, and not windows-1252 and \x92?
    >
    > Regards,
    > Martin


    You're right... the source of the text is HTML pages, and evidently the
    webmasters have poor knowledge of encodings: the meta tag declared the
    encoding as ISO-8859-1, but the real encoding is Windows-1252, and yes, it
    uses \x92 as the apostrophe. So the problem isn't Python.
    Jul 8, 2007
    #7
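
The mislabelled-encoding situation described in this last post can be reproduced with a short Python 3 sketch. The sample bytes are made up for illustration; the point is only how the single byte \x92 decodes under each encoding.

```python
# Bytes as they might appear in a page that is really Windows-1252.
raw = b'l\x92acqua'

# Trusting the declared ISO-8859-1 yields the C1 control character U+0092.
as_latin1 = raw.decode('iso-8859-1')
# Decoding as Windows-1252 yields U+2019, the right single quotation mark.
as_cp1252 = raw.decode('windows-1252')

print(repr(as_latin1))
print(repr(as_cp1252))

# Re-encoded as UTF-8, U+2019 becomes the three bytes E2 80 99.
as_utf8 = as_cp1252.encode('utf-8')
print(as_utf8)
```

So the U+2019 in the converted text comes from byte \x92 under Windows-1252, not from an ASCII apostrophe (\x27), which decodes to U+0027 in both encodings.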
