MySQLdb not playing nice with unicode

R

Roy Smith

5.1.63-log MySQL Community Server (GPL)
Python 2.7.3
MySQL-python==1.2.3

I've got a table:

CREATE TABLE `songza_temp` (
`s` varchar(1000) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

When I run this code:

--------------------------------------------
db = MySQLdb.connect(host=db_host,
db=db_db,
user=db_user,
passwd=db_passwd,
charset='utf8',
sql_mode='TRADITIONAL')
cursor = db.cursor()

baloon = u'\U0001f388'
print "baloon=%s, repr=%r, length=%d" % (baloon, baloon, len(baloon))
cursor.execute("insert into songza_temp (s) values (%s)", [baloon])
--------------------------------------------

I get:

baloon=<ÿàþ, repr=u'\U0001f388', length=1
Traceback (most recent call last):
File "./try.py", line 21, in <module>
cursor.execute("insert into songza_temp (s) values (%s)", [baloon])
File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174,
in execute
self.errorhandler(self, exc, value)
File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line
36, in defaulterrorhandler
raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (1366, "Incorrect string value:
'\\xF0\\x9F\\x8E\\x88' for column 's' at row 1")

My unicode-fu is a bit weak. Are we looking at a Python problem, a
MySQLdb problem, or a problem with the underlying MySQL server? We've
certainly inserted utf-8 data before without any problems. It's
possible this is the first time we've tried to handle a character
outside the BMP.
 
R

Roy Smith

Roy Smith said:
My unicode-fu is a bit weak. Are we looking at a Python problem, a
MySQLdb problem, or a problem with the underlying MySQL server? We've
certainly inserted utf-8 data before without any problems. It's
possible this is the first time we've tried to handle a character
outside the BMP.

Sigh. As is so often the case, I found the answer shortly after posting
this.

http://stackoverflow.com/questions/1890693/

It turns out MySQL (at least the version we're running) can't handle
characters outside the BMP!

OK, that leads to the next question. Is there anyway I can (in Python
2.7) detect when a string is not entirely in the BMP? If I could find
all the non-BMP characters, I could replace them with U+FFFD
(REPLACEMENT CHARACTER) and life would be good (enough).

Apparently, newer versions of MySQL have utf8mb4 which can handle this.
On possibility is upgrading to a new MySQL, but if we could just catch
and replace the non-BMP characters during ingestion, that would be a lot
simpler.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,931
Messages
2,570,085
Members
46,536
Latest member
keelop

Latest Threads

Top