Unicode conversion problem (codec can't decode)

  • Thread starter Eric S. Johansson

Eric S. Johansson

I'm having a problem (Python 2.4) converting strings with random 8-bit
characters into an escape form which is 7-bit clean for storage in a database.
Here's an example:

body = meta['mini_body'].encode('unicode-escape')

When given an 8-bit string (in meta['mini_body']), the code fragment above
yields the error below:

'ascii' codec can't decode byte 0xe1 in position 13: ordinal not in range(128)

The string that generates that error is:

<br>Reduce Whát You Owe by 50%. Get out of debt today!<br>Reduuce Interest &
|V|onthlyy Paymeñts Easy, we will show you how..<br>Freee Quote in 10
Min.<br>http://www.freefromdebtin.net.cn

I've read a lot of stuff about Unicode and Python and I'm pretty comfortable
with how you can convert between different encoding types. What I don't
understand is how to go from a byte string with 8-bit characters to an encoded
string where 8-bit characters are turned into two-character hexadecimal sequences.

I really don't care about the character set used. I'm looking for a matched set
of operations that converts the string to a seven-bit form and back to its
original form. Since I need the ability to match a substring of the original
text while the string is in its encoded state, something like unicode-escape
encoding would work well for me. Unfortunately, I am missing some knowledge
about encoding and decoding. I wish I knew what cjson was doing because it does
the right things for my project. It takes strings or Unicode, stores everything
as Unicode and then returns everything as Unicode. Quite frankly, I'd love to
have my entire system run using Unicode strings but again, I'm missing some
knowledge on how to force all of my modules to be Unicode by default.

Any enlightenment would be most appreciated.

---eric
 

Jason Scheirer

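ASCII is technically only the seven-bit characters, so the codec is
just being very 'correct'. In Python 2, calling encode() on a byte
string first decodes it with the default ASCII codec, which is where
the "can't decode" error comes from. One trick you may want to try is
a decode() before your encode(), using some 8-bit encoding such as
latin-1: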

body = meta['mini_body'].decode('latin-1').encode('unicode-escape')
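
For anyone reading this on Python 3, where the ASCII decode is no longer done behind your back, the failure and the fix can be reproduced explicitly. This is only a sketch: `raw` is a made-up sample standing in for meta['mini_body'], and it assumes the bytes are latin-1.

```python
# Hypothetical sample standing in for meta['mini_body'] ("Whát" as latin-1 bytes).
raw = b"<br>Reduce Wh\xe1t You Owe by 50%."

# Python 2 performed this ASCII decode implicitly before encode();
# here it is spelled out so the failure is visible.
try:
    raw.decode("ascii")
    ascii_failed = False
    position = None
except UnicodeDecodeError as exc:
    ascii_failed = True
    position = exc.start  # index of the first byte outside the 7-bit range

# latin-1 maps every byte value 0-255 to a code point, so this decode cannot fail.
escaped = raw.decode("latin-1").encode("unicode_escape")
```

The failing position matches the one in the error message from the original post, and `escaped` contains only 7-bit bytes.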

The problem here is that you never really know EXACTLY which
single-byte character set you're dealing with, so there's no guarantee
you're translating the CORRECT characters back and forth -- for
instance, ISO-8859-15 put the fairly recent Euro symbol where
ISO-8859-1 has the old generic 'currency' character. There are
libraries such as Mark Pilgrim's port of the Mozilla character
detection code ( http://chardet.feedparser.org/ ), but in my
experience it doesn't differentiate between the latin sets well; it's
better at detecting CJK character encodings. If you're merely using
some unicode file/database as a dumb store that you plan to eventually
push back into a sequence of bytes, you may be fine doing
string.decode('some-random-encoding').encode('utf-8') when pushing in
and string.decode('utf-8').encode('some-random-encoding') when getting
it back out, as long as you use the same encoding in both directions.
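
To make the Euro/currency point concrete, here is a small Python 3 sketch. The helper names `to_store` and `from_store` are invented for illustration; they just wrap the decode/encode pair described above.

```python
# Byte 0xA4 is the generic currency sign in ISO-8859-1
# but the Euro sign in ISO-8859-15.
raw = b"price: 10 \xa4"
as_latin1 = raw.decode("iso-8859-1")
as_latin9 = raw.decode("iso-8859-15")


def to_store(data, source_encoding):
    """Hypothetical helper: re-encode incoming bytes as UTF-8 for storage."""
    return data.decode(source_encoding).encode("utf-8")


def from_store(stored, source_encoding):
    """Hypothetical helper: undo to_store() using the same encoding."""
    return stored.decode("utf-8").encode(source_encoding)


stored = to_store(raw, "iso-8859-1")
restored = from_store(stored, "iso-8859-1")
```

The bytes come back unchanged whichever of the two encodings is used, provided the same one is used on the way in and on the way out. Only the interpretation of 0xA4 differs. Note that the UTF-8 stored form is not 7-bit clean.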

Another thing to consider is a lot of XML libraries will create
unicode string objects that ARE NOT REALLY UNICODE -- something that
bit me when using cElementTree is that if an XML file is in latin-*
without a declaration of it being in that charset, it will still
create unicode string instances, but with illegal characters in them.
This causes Python to lead you down the garden path until you try to
encode the string again. At that point, it will try to validate it and
throw that exception. I usually use an idiom like this:

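def join_chars(x):
    # Decorator: join the pieces yielded by generator x into one string.
    def __dummy(*args, **kws):
        return ''.join(x(*args, **kws))
    return __dummy

@join_chars
def unidecode(unicode_string):
    for character in unicode_string:
        try:
            yield character.decode('utf-8').encode('utf-8')
        except UnicodeError:
            # ''.join() rejects the int from ord(), so emit an escape string
            # (the \uXXXX format here is a guess at the intent).
            yield '\\u%04x' % ord(character)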

And pass all my potentially invalid 'unicode' strings through it,
giving an explicit try when encoding each character. It's slow at
runtime, but it's really the only quick-to-write, reproducible way
I've found around the problem.
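
A different technique from the per-character try/except above, shown here only as a Python 3 sketch, is to let the codec deal with unencodable characters through its `errors` argument. The sample text is made up.

```python
# Made-up sample containing characters outside the ASCII range.
text = "Payme\u00f1ts \u20ac"

# 'backslashreplace' swaps each unencodable character for an escape sequence;
# 'replace' swaps it for a question mark, which loses information.
escaped = text.encode("ascii", "backslashreplace")
replaced = text.encode("ascii", "replace")
```

The escaped form keeps enough information to tell which character was there, while the replaced form does not, so the first is the safer choice if the data has to be inspected later.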
 

M.-A. Lemburg

If you don't want to process the 7-bit form in any way, there
are a couple of encodings which you could use:

Try this:

body = meta['mini_body'].decode('latin-1').encode('unicode-escape')
mini_body = body.decode('unicode-escape').encode('latin-1')

or this:

body = meta['mini_body'].decode('latin-1').encode('utf-7')
mini_body = body.decode('utf-7').encode('latin-1')

If all you need is the 7-bit form, you're probably better off
with a base64 encoding:

body = meta['mini_body'].encode('base64')
mini_body = body.decode('base64')
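
The three matched pairs above can be checked side by side. The sketch below is a Python 3 rendering, not the Python 2 code from this post: the 'base64' string codec is swapped for the base64 module, `raw` is a made-up sample, and `is_7bit` is a helper invented for the check.

```python
import base64

# Made-up sample: latin-1 bytes with two 8-bit characters.
raw = b"Reduce Wh\xe1t You Owe by 50%. Payme\xf1ts Easy"
text = raw.decode("latin-1")

# Three 7-bit forms of the same data.
escaped = text.encode("unicode_escape")
utf7 = text.encode("utf-7")
b64 = base64.b64encode(raw)


def is_7bit(data):
    """Hypothetical helper: True if every byte is below 128."""
    return all(byte < 128 for byte in data)


# Each form converts back to the original bytes.
from_escaped = escaped.decode("unicode_escape").encode("latin-1")
from_utf7 = utf7.decode("utf-7").encode("latin-1")
from_b64 = base64.b64decode(b64)

# A substring can be matched in the escaped form by escaping the needle the same way.
needle = b"Wh\xe1t".decode("latin-1").encode("unicode_escape")
found = needle in escaped
```

The escaped form stays searchable because each character is encoded independently of its neighbours. Base64 works on groups of three bytes, so a substring of the original generally does not appear as a substring of the encoded text.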

Looks like spam :)

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Apr 04 2008)

 
