a unicode question?

zdwang

Hello,
I have a unicode string that I want to convert to an ANSI string, but
doing so raises an exception.
Could you help me?

## I want to change s1 to s2.

s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

s2 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
 
John Machin

What do you mean by "ansi string"?

Here is a superficially not-unreasonable answer to your more specific
question:

# >>> s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
# >>> s2 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
# >>> s3 = s1.encode('latin1')
# >>> s2 == s3
# True

But what are you really trying to achieve? Where does your Unicode data
come from? What ranges of characters do you expect it to contain? You
need to crunch it into an 8-bit representation because ... what?
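
For readers on Python 3, where text literals are Unicode by default and byte strings are written with a b prefix, the latin1 answer above can be sketched as follows. It works because Latin-1 maps code points U+0000 through U+00FF one-to-one onto the byte values 0x00 through 0xFF. That the original exception came from an ASCII encode is an assumption; the thread never shows the traceback.

```python
# Python 3 sketch: str is Unicode text, bytes is the 8-bit representation.
s1 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
s2 = b'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

# Latin-1 turns each code point below 256 into the byte with the same value.
s3 = s1.encode('latin-1')
same = (s3 == s2)

# Assumption: the exception in the first post came from an ASCII encode,
# which rejects every code point above 127.
try:
    s1.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```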
 
zdwang

Mr. John Machin

This question comes from the following code. I use PyXML to build a DOM
tree.

from xml.dom.ext.reader import HtmlLib
doc = HtmlLib.FromHtmlUrl('http://stock.business.sohu.com/q/nbcg.php?code=600028')
title_elem = doc.documentElement.getElementsByTagName("TITLE")[0]
title_string = title_elem.firstChild.data
print title_string

# the title_string is unicode, but it is not "latin1" code, so I want to change it.
 
Serge Orlov

Errr, but the title of the page is written in Chinese and it is not
supposed to be crammed into latin1 encoding. What are you trying to do
with the string after you squeezed Chinese into latin1?
 
John Machin

Errrrrrrr, it gets worse: not only is the title written in Chinese, it
is encoded as gb2312 -- here is the repr() of the first few chunks:

"<html>\n<head>\n <title>\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) : \xc4\xda\xb2\xbf\xc8\xcb\xd4\xb1\xb3\xd6\xb9\xc9 - \xcb\xd1\xba\xfc\xb9\xc9\xc6\xb1</title>\n<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n"

and here is what you get after that_guff.decode('gb2312')

u"<html>\n<head>\n <title>\u4e2d\u56fd\u77f3\u5316(600028) : \u5185\u90e8\u4eba\u5458\u6301\u80a1 - \u641c\u72d0\u80a1\u7968</title>\n<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n"

The first 2 characters of the title are recognisable, both visually in
the browser title and in the Unicode, as "zhong guo", i.e. China.
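
The decode step can be checked on just the title bytes. This is a Python 3 sketch; the variable names are illustrative, and the bytes are copied from the repr() shown earlier in this post.

```python
# Title bytes as served by the page (charset=gb2312), copied from the repr().
raw_title = b'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028)'

# Decoding with the declared charset yields real Chinese text.
title = raw_title.decode('gb2312')

# The first two characters are U+4E2D U+56FD, i.e. "zhong guo".
first_two = title[:2]
```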

BUT the OP's first message is interpreting that gb2312-encoded stuff as
Unicode:
s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

*SOMEBODY* is seriously deluded, and it ain't me, and it ain't Serge :)
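
If a string like s1 has already been produced, with each gb2312 byte promoted to a code point of the same value, one way to recover the text is to go back to bytes through Latin-1 and then decode with the real charset. A minimal Python 3 sketch, assuming the page really is gb2312 as its meta tag claims:

```python
# s1 as in the first post: gb2312 bytes misread as code points U+00D6, U+00D0, ...
s1 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

# Step 1: Latin-1 recovers the original byte values unchanged.
raw = s1.encode('latin-1')

# Step 2: decode those bytes with the charset the page declares.
fixed = raw.decode('gb2312')
```

This only repairs the symptom after the fact. Decoding the page bytes as gb2312 before they are treated as text would most likely avoid the problem altogether.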

.... and yes Peter, info travels faster also from China than it does
from Armenia :-())
 
Peter Otten

John said:
... and yes Peter, info travels faster also from China than it does
from Armenia :-())

Q: Can info travel faster from Armenia than from China?
Radio Yerevan: In principle, yes. Just make sure that it doesn't go the
other way round the globe or meet some friends on the way...
 
