utf-8 encoding issue

Marc Petitmermet · Sep 19, 2003

The line below looks up the name "öttinger" (with the German umlaut) of
an author using the mysql console:

mysql> select author from records where author like '%&#xD6;ttinger%';

This successfully finds all entries in the records database where
"öttinger" is the author or the co-author.

In a web form, the user enters "öttinger" and wants to search with this
search string. My idea is now to convert the search string (which also
could be e.g. some cyrillic text) into unicode and then to utf-8:

unicode(search_string).encode('utf-8')

This gives me the utf-8 encoded version of the string but not yet in the
correct representation. How can I get the correct one (is this the hex
version? I don't know the correct terminology.)?

In short: how do I e.g. convert a string containing a "ö" into a string
containing a "%&#xD6;"?

Regards,
Marc

Fredrik Lundh · Sep 19, 2003

Marc said:
In a web form, the user enters "öttinger" and wants to search with this
search string. My idea is now to convert the search string (which also
could be e.g. some cyrillic text) into unicode and then to utf-8:

unicode(search_string).encode('utf-8')

This gives me the utf-8 encoded version of the string but not yet in the
correct representation. How can I get the correct one (is this the hex
version? I don't know the correct terminology.)?

In short: how do I e.g. convert a string containing a "ö" into a string
containing a "%&#xD6;"?

that's not UTF-8, that's HTML/XML-style charrefs.

if mysql translates the charrefs to unicode characters, you can simply
use:

s = u.encode("ascii", "xmlcharrefreplace")

where "u" is a unicode string.
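As a rough illustration of what that call produces, here is a minimal sketch in present-day Python 3. That is an assumption: the thread itself uses Python 2, where "u" would be a unicode object and the result a plain str. The sample name is the one from the question.

```python
# Python 3 sketch: str is already Unicode, and encode() returns bytes.
u = "\u00f6ttinger"  # "öttinger"

# Characters outside ASCII are replaced by decimal character references.
s = u.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(s)  # &#246;ttinger
```

Note that the codec emits decimal charrefs (246 is the code point of "ö"), not hexadecimal ones.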

if you've stored charrefs as is in the database, you're in for some
serious trouble. assuming that all charrefs are hexadecimal charrefs,
you can use something like:

import re
def fixup(m): return "&#" + hex(int(m.group(1)))[1:]  # "&#246" -> "&#xf6"
s = re.sub(r"&#(\d+)", fixup, u.encode("ascii", "xmlcharrefreplace"))

to map all non-ASCII characters to charrefs, and then translate all
charrefs to hexadecimal charrefs.
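The same two-step mapping might look like this in Python 3. This is only a sketch: the helper name to_hex_charrefs is hypothetical, and the choice of uppercase hex digits is an assumption. Whether case matters for the match depends on how the database compares strings.

```python
import re

def to_hex_charrefs(text):
    """Replace non-ASCII characters with hexadecimal charrefs (hypothetical helper)."""
    # Step 1: non-ASCII characters -> decimal charrefs such as "&#214;".
    ascii_text = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # Step 2: decimal charrefs -> hexadecimal charrefs such as "&#xD6;".
    return re.sub(r"&#(\d+);", lambda m: "&#x%X;" % int(m.group(1)), ascii_text)

print(to_hex_charrefs("\u00d6ttinger"))  # &#xD6;ttinger
```

The result could then be placed between the "%" wildcards of the LIKE pattern, ideally passed as a query parameter rather than pasted into the SQL string.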

decoding the charrefs *before* you add the strings to the database
is a better idea, though.
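One possible way to do that decoding step, sketched in Python 3 (an assumption, since the thread predates it) with the standard library's html.unescape. The variable names and the sample value are illustrative only.

```python
import html

# A value as it might currently sit in the database (illustrative).
stored = "&#xD6;ttinger"

# html.unescape turns numeric (and named) character references into characters.
decoded = html.unescape(stored)
print(decoded)  # Öttinger

# The decoded text can then be encoded as UTF-8 before it is stored.
utf8_bytes = decoded.encode("utf-8")
```

Be aware that html.unescape also decodes named entities such as "&amp;", which may or may not be wanted for a given column.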

</F>
