Unicode perplex

John Roth

I've got an interesting little problem that I can't find an
answer to after hunting through the doc (2.3.3). I've
got a string that contains something that kind of
resembles an HTML document. On looking through
it, I find a <meta http-equiv="content-type"
content="text/html; charset=UTF-8"> tag.

The problem is that I've got a normal string where
the byte stream is actually UTF-8. How do I turn
it into a Unicode string? Remember that the trick
is that it's still going to have the *same* stream of
bytes (at least if the Unicode string is implemented
in UTF-8.) I don't need to convert it with a codec,
I need to change the class under the data.

I don't want to have to write a C language
extension, and I also don't want to have to write
it out to a file and read it back in. The product
involved (FIT) is distributed under the GPL[1], so
packages that don't have the same license (or
that aren't maintained across all systems which
support Python) aren't eligible.

It's also not possible to ask the service caller to
properly specify the string when they pass it to me.

Any ideas?

John Roth

[1] That wasn't my choice, so political comments
aren't relevant. Bitch at Ward Cunningham if you
want to bitch.
 
Irmen de Jong

John said:
Remember that the trick
is that it's still going to have the *same* stream of
bytes (at least if the Unicode string is implemented
in UTF-8.)

Which it isn't.

AFAIK Python's storage format for Unicode strings is
some form of 2-byte representation, it certainly isn't
UTF-8.

So if you want to turn your string into a Python Unicode
object, you really have to push it through the UTF-8 codec...
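
A minimal sketch of why relabelling the bytes cannot work. It is written for a current Python, where byte strings are spelled b"..."; on 2.3 the plain string type plays that role and the b prefix is dropped. The sample text is made up for illustration.

```python
# Illustrative sample: "café" encoded as UTF-8 (the é takes two bytes).
raw = b"caf\xc3\xa9"

# Decoding runs the bytes through the UTF-8 codec and builds a new object.
text = raw.decode("utf-8")

byte_count = len(raw)    # counts bytes
char_count = len(text)   # counts Unicode characters
print(byte_count, char_count)
```

The two lengths differ, so the Unicode object is not the same stream of bytes under a different class.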

--Irmen
 
Ivan Voras

John said:
The problem is that I've got a normal string where
the byte stream is actually UTF-8. How do I turn
it into a Unicode string? Remember that the trick

does

str2 = str.decode('utf-8')

work?
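
A hedged sketch of the same call wrapped in a helper, since the caller's input here cannot be trusted to be well-formed. The name to_unicode and the choice to fall back to replacement characters are assumptions for illustration, not anything from FIT; re-raising the error may suit other applications better.

```python
def to_unicode(data, encoding="utf-8"):
    """Hypothetical helper: decode a byte string into a Unicode string."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # Assumed policy: substitute U+FFFD for bytes that are not valid
        # in the given encoding instead of failing outright.
        return data.decode(encoding, "replace")


good = to_unicode(b"abc")
bad = to_unicode(b"\xff")   # 0xFF never occurs in valid UTF-8
```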
 
John Roth

Irmen de Jong said:
Which it isn't.

AFAIK Python's storage format for Unicode strings is
some form of 2-byte representation, it certainly isn't
UTF-8.

So if you want to turn your string into a Python Unicode
object, you really have to push it through the UTF-8 codec...

I see. I'm really very much a novice at unicode and all
the codec stuff. If I understand you, I need to get the
utf-8 codec and use the decode function to turn it into
a unicode string, and then use the encode function to
turn it back to a standard 8-bit string so I can write
it out (or send it down the pipe or socket...)
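
The round trip described above might look like the following sketch. The input bytes and the upper-casing step are made-up stand-ins for whatever processing is really needed; the b"..." spelling is for a current Python (on 2.3 a plain string is used instead).

```python
# Made-up UTF-8 input, as it might arrive from the caller.
incoming = b"<p>na\xc3\xafve</p>"

# Bytes -> Unicode: decode once, at the edge.
text = incoming.decode("utf-8")

# Work on characters, not bytes (stand-in for the real processing).
text = text.upper()

# Unicode -> bytes: encode again just before writing or sending.
outgoing = text.encode("utf-8")
```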

Thanks. Now that you point it out, it does look kind
of obvious - the second time.

John Roth
 
John Roth

Ivan Voras said:
does

str2 = str.decode('utf-8')

work?

[dirty word]. Thanks. I knew I'd seen it before
somewhere; it just didn't occur to me to look in
the obvious place. It sure ought to.

Thanks.

John Roth
 
Fredrik Lundh

John said:
The problem is that I've got a normal string where
the byte stream is actually UTF-8. How do I turn
it into a Unicode string? Remember that the trick
is that it's still going to have the *same* stream of
bytes (at least if the Unicode string is implemented
in UTF-8.) I don't need to convert it with a codec,
I need to change the class under the data.

you're making more assumptions about things you don't know anything
about than is really good for you. had you read any article on Python's
Unicode system, you'd have learned that UTF-8 is an encoding, while Python's
Unicode string type contains sequences of Unicode characters.

or in other words, if you have something that isn't a Python Unicode
string, and you want a Python Unicode string, you need to convert it.
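
a small sketch of the encoding-versus-characters distinction, using the euro sign as an arbitrary example and current-Python spelling for byte strings: one character, different bytes depending on the encoding chosen.

```python
# One Unicode character: the euro sign, U+20AC.
text = b"\xe2\x82\xac".decode("utf-8")

# The same character serialised with two different encodings.
as_utf8 = text.encode("utf-8")        # three bytes
as_utf16 = text.encode("utf-16-be")   # two bytes, big-endian, no BOM
```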

more reading:

http://www.effbot.org/zone/unicode-objects.htm
http://www.reportlab.com/i18n/python_unicode_tutorial.html
(slightly outdated; ignore installation/setup parts)
http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf

</F>
 
