Is there a way to get utf-8 out of a Unicode string?

T

thebjorn

I've got a database (ms sqlserver) that's (way) out of my control,
where someone has stored utf-8 encoded Unicode data in regular varchar
fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/

Then I read the data out using adodbapi (which returns all strings as
Unicode) and I get u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'. I couldn't
find any way to get back to the original short of:

def unfk(s):
return eval(repr(s)[1:]).decode('utf-8')

i.e. chopping off the u in the repr of a unicode string, and relying on
eval to interpret the \xHH sequences.

Is there a less hack'ish way to do this?

-- bjorn
 
F

Fredrik Lundh

thebjorn said:
I've got a database (ms sqlserver) that's (way) out of my control,
where someone has stored utf-8 encoded Unicode data in regular varchar
fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/

Then I read the data out using adodbapi (which returns all strings as
Unicode) and I get u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'. I couldn't
find any way to get back to the original short of:

def unfk(s):
return eval(repr(s)[1:]).decode('utf-8')

i.e. chopping off the u in the repr of a unicode string, and relying on
eval to interpret the \xHH sequences.

Is there a less hack'ish way to do this?

first, check if you can get your database adapter to understand that the
database contains UTF-8 and not ISO-8859-1. if that's not possible, you
can roundtrip via ISO-8859-1 yourself:
Blåbærsyltetøy

</F>
 
G

Gerrit Holl

Hei,

def unfk(s):
return eval(repr(s)[1:]).decode('utf-8')

i.e. chopping off the u in the repr of a unicode string, and relying on
eval to interpret the \xHH sequences.

Is there a less hack'ish way to do this?

Slightly lack hackish:

return ''.join(chr(ord(c)) for c in s)

Gerrit.
 
T

thebjorn

Fredrik said:
...
first, check if you can get your database adapter to understand that the
database contains UTF-8 and not ISO-8859-1.

It would be the way to go, however it looks like they've managed to get
Latin-1 data in exactly two columns in the entire database (this is a
commercial product of course, so there's no way for us to fix things).
And just to make things more interesting, I think I'm running into an
ADO bug where capital letters (outside the U+0000 to U+007F range) are
returning strange values:

I haven't tested this through C[#/++] to verify that it's an ADO issue,
but it seems unlikely that MS would view this as anything but incorrect
usage no matter where the issue is...

Anyway, sorry for venting :)
if that's not possible, you can roundtrip via ISO-8859-1 yourself:

Blåbærsyltetøy

That's very nice!

-- bjorn
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,902
Latest member
Elena68X5

Latest Threads

Top