utf-8 encoding issue

Discussion in 'Python' started by Marc Petitmermet, Sep 19, 2003.

  1. The line below looks up the name "öttinger" (with the German umlaut) of
    an author using the mysql console:

    mysql> select author from records where author like '%Öttinger%';

    This successfully finds all entries in the records database where
    "öttinger" is the author or the co-author.

    In a web form, the user enters "öttinger" and wants to search with this
    search string. My idea is now to convert the search string (which also
    could be e.g. some cyrillic text) into unicode and then to utf-8:

    unicode(search_string).encode('utf-8')

    This gives me the utf-8 encoded version of the string but not yet in the
    correct representation. How can I get the correct one (is this the hex
    version? I don't know the correct terminology.)?

    In short: how do I convert, for example, a string containing a "ö" into
    a string containing a "%Ö"?

    Regards,
    Marc
     
    Marc Petitmermet, Sep 19, 2003
    #1

  2. Marc Petitmermet wrote:

    > In a web form, the user enters "öttinger" and wants to search with this
    > search string. My idea is now to convert the search string (which also
    > could be e.g. some cyrillic text) into unicode and then to utf-8:
    >
    > unicode(search_string).encode('utf-8')
    >
    > This gives me the utf-8 encoded version of the string but not yet in the
    > correct representation. How can I get the correct one (is this the hex
    > version? I don't know the correct terminology.)?
    >
    > In short: how do I convert, for example, a string containing a "ö" into
    > a string containing a "%Ö"?


    that's not UTF-8, that's HTML/XML-style charrefs.

    if mysql translates the charrefs to unicode characters, you can simply
    use:

    s = u.encode("ascii", "xmlcharrefreplace")

    where "u" is a unicode string.
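
    As a minimal sketch of what that call produces, assuming present-day
    Python 3 (where str is already Unicode, so there is no separate
    unicode() step) and a hypothetical search_string taken from the form:

```python
# Python 3 sketch; search_string is a hypothetical form input.
search_string = "öttinger"

# Every non-ASCII character becomes a decimal character reference.
# encode() returns bytes in Python 3, so decode back to str for display.
encoded = search_string.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(encoded)  # &#246;ttinger
```

    Note that the references come out in decimal form, not hexadecimal.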

    if you've stored charrefs as is in the database, you're in for some
    serious trouble. assuming that all charrefs are hexadecimal charrefs,
    you can use something like:

    import re
    # rewrite "&#214" as "&#xd6"; the trailing ";" is not matched and stays put
    def fixup(m): return "&#" + hex(int(m.group(1)))[1:]
    s = re.sub(r"&#(\d+)", fixup, u.encode("ascii", "xmlcharrefreplace"))

    to map all non-ASCII characters to charrefs, and then translate all
    charrefs to hexadecimal charrefs.
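
    The same two-step idea can be sketched for Python 3 as follows. The
    helper name to_hex_charrefs and the uppercase hex digits are
    assumptions made for illustration; whether the database holds "&#xD6;"
    or "&#xd6;" has to be checked against the stored data.

```python
import re

def to_hex_charrefs(text):
    """Replace non-ASCII characters with hexadecimal character references."""
    # Step 1: non-ASCII characters -> decimal charrefs, e.g. "&#214;"
    decimal = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # Step 2: decimal charrefs -> hexadecimal charrefs, e.g. "&#xD6;"
    return re.sub(r"&#(\d+);", lambda m: "&#x%X;" % int(m.group(1)), decimal)

# Build a LIKE pattern from a hypothetical search string.
pattern = "%" + to_hex_charrefs("Öttinger") + "%"
print(pattern)  # %&#xD6;ttinger%
```

    The pattern should still be passed to the database as a bound
    parameter, not pasted into the SQL text.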

    decoding the charrefs *before* you add the strings to the database
    is a better idea, though.
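
    One way to do that decoding in present-day Python 3 is the standard
    library's html.unescape, which was not available when this thread was
    written. A sketch, with hypothetical input values:

```python
import html

# Hypothetical values as they might arrive with charrefs in them.
raw_values = ["&#xD6;ttinger", "&#246;ttinger"]

# Turn the character references back into real Unicode characters
# before the strings are written to the database.
decoded = [html.unescape(value) for value in raw_values]
print(decoded)  # ['Öttinger', 'öttinger']

# If the database connection expects UTF-8 bytes, encode at that point.
utf8_bytes = decoded[0].encode("utf-8")
```

    With real characters stored, a search for "öttinger" needs no charref
    conversion at all.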

    </F>
     
    Fredrik Lundh, Sep 19, 2003
    #2
