compare unicode to non-unicode strings

Discussion in 'Python' started by Asterix, Aug 31, 2008.

  1. Asterix

    Asterix Guest

    how could I test that those 2 strings are the same:

    'séd' (repr is 's\\xc3\\xa9d')

    u'séd' (repr is u's\\xe9d')
    Asterix, Aug 31, 2008
    #1

  2. John Machin

    John Machin Guest

    On Aug 31, 11:04 pm, Asterix wrote:
    > how could I test that those 2 strings are the same:
    >
    > 'séd' (repr is 's\\xc3\\xa9d')


    No, the repr is 's\xc3\xa9d'.

    >
    > u'séd' (repr is u's\\xe9d')


    No, the repr is u's\xe9d'.

    To answer your question:
    John Machin, Aug 31, 2008
    #2

  3. John Machin

    John Machin Guest

    On Aug 31, 11:04 pm, Asterix wrote:
    > how could I test that those 2 strings are the same:
    >
    > 'séd' (repr is 's\\xc3\\xa9d')
    >
    > u'séd' (repr is u's\\xe9d')


    [note: your reprs are wrong; change the \\ to \]

    You need to decode the non-unicode string and compare the result with
    the unicode string. You need to know the encoding used for the non-
    unicode string. In the example that you gave, it's about 99.99% likely
    that it's UTF-8.

    >>> 's\xc3\xa9d'.decode('utf8')
    u's\xe9d'
    >>> u's\xe9d'.encode('utf8')
    's\xc3\xa9d'
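
    The session above is Python 2. As a rough sketch of the same check on
    Python 3, where the two types are bytes and str (and the two never
    compare equal directly), something like the following should work. It
    assumes the bytes really are UTF-8, as guessed above; the variable names
    are only illustrative.

```python
# Python 3 sketch: bytes vs. str (Python 2's str vs. unicode).
raw = b's\xc3\xa9d'   # UTF-8 encoded bytes for "séd"
text = 's\xe9d'       # text string; U+00E9 is the e-acute

# Decode the bytes (assumed to be UTF-8) and compare text with text...
decoded_matches = raw.decode('utf-8') == text
# ...or encode the text and compare bytes with bytes.
encoded_matches = text.encode('utf-8') == raw

print(decoded_matches, encoded_matches)
```

    Either direction works; decoding is usually the better habit, since the
    rest of the program can then deal with text only.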


    HTH,
    John
    John Machin, Aug 31, 2008
    #3
  4. Asterix wrote:

    > how could I test that those 2 strings are the same:
    >
    > 'séd' (repr is 's\\xc3\\xa9d')
    >
    > u'séd' (repr is u's\\xe9d')


    Determine what encoding the former string is using (looks like UTF-8),
    and convert it to Unicode before doing the comparison.

    >>> b = 's\xc3\xa9d'
    >>> u = u's\xe9d'
    >>> b
    's\xc3\xa9d'
    >>> u
    u's\xe9d'
    >>> unicode(b, "utf-8")
    u's\xe9d'
    >>> unicode(b, "utf-8") == u
    True
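
    If the encoding guess might be wrong, the conversion can fail. A minimal
    sketch of a comparison helper, written for Python 3 (bytes and str
    instead of str and unicode); the name same_text and its default encoding
    are hypothetical, not something from the standard library:

```python
def same_text(raw, text, encoding='utf-8'):
    """Return True if the bytes `raw`, decoded with `encoding`, equal `text`.

    Hypothetical helper: the caller still has to know (or guess) the encoding.
    """
    try:
        return raw.decode(encoding) == text
    except UnicodeDecodeError:
        # The bytes are not valid in this encoding, so treat it as no match.
        return False


print(same_text(b's\xc3\xa9d', 's\xe9d'))          # UTF-8 bytes, UTF-8 guess
print(same_text(b's\xe9d', 's\xe9d'))              # Latin-1 bytes, UTF-8 guess
print(same_text(b's\xe9d', 's\xe9d', 'latin-1'))   # Latin-1 bytes, right guess
```

    The second call reports no match because the lone byte 0xE9 is not valid
    UTF-8; with the right encoding named, the same bytes compare equal.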

    </F>
    Fredrik Lundh, Aug 31, 2008
    #4
  5. Par Toutatis!
    If you had put the question to Ordralphabétix, or to one of the French
    newsgroups devoted to Python, instead of redoing "La grande Traversée",
    the answer might have come faster.

    @-salutations
    --
    Michel Claveau
    Méta-MCI (MVP), Aug 31, 2008
    #5
  6. Asterix wrote:
    > how could I test that those 2 strings are the same:
    >
    > 'séd' (repr is 's\\xc3\\xa9d')
    >
    > u'séd' (repr is u's\\xe9d')


    You may also want to look at unicodedata.normalize(). For example, é can
    be represented multiple ways:

    >>> import unicodedata
    >>> unicodedata.normalize('NFC', u'é')
    u'\xe9'
    >>> unicodedata.normalize('NFD', u'é')
    u'e\u0301'
    >>> u'\xe9' == u'e\u0301'
    False

    The first form is "composed", just being U+00E9 (LATIN SMALL LETTER E
    WITH ACUTE). The second form is "decomposed", being made up of U+0065
    (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT).

    Even though they represent the same thing to a human, they don't compare
    as equal. But if you normalize them to the same form, they will.
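
    A short sketch of that normalize-then-compare step, in Python 3 syntax.
    The helper name nfc_equal is made up for illustration, and NFC is just
    one reasonable choice; what matters is that both sides use the same form.

```python
import unicodedata


def nfc_equal(a, b):
    """Compare two text strings after normalizing both to NFC."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


composed = '\xe9'         # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = 'e\u0301'    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)            # False: different code points
print(nfc_equal(composed, decomposed))   # True: same text once normalized
```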

    For more information, look at the unicodedata module's documentation:
    <http://docs.python.org/lib/module-unicodedata.html>
    --
    Matt Nordhoff, Aug 31, 2008
    #6