a unicode question?

zdwang · Apr 9, 2006

Hello,
I have a unicode string that I want to convert to an ansi string, but
doing so raises an exception.
Could you help me?

## I want to change s1 to s2.

s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

s2 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

John Machin · Apr 9, 2006

What do you mean by "ansi string"?

Here is a superficially not-unreasonable answer to your more specific
question:

# >>> s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
# >>> s2 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
# >>> s3 = s1.encode('latin1')
# >>> s2 == s3
# True
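For anyone reading this on Python 3, here is a rough sketch of the same check. It assumes (the original post does not say) that the exception came from an ASCII encode of the non-ASCII characters. The name ascii_failed is made up for illustration. In Python 3, s1 is a str and s2 a bytes object.

```python
# Python 3 sketch; the thread itself uses Python 2.
s1 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '   # text (code points U+00D6, U+00D0, ...)
s2 = b'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '  # bytes

# Assumption: the reported exception resembles this one. ASCII cannot
# represent code points above U+007F, so the encode raises.
try:
    s1.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# latin-1 maps code points U+0000..U+00FF one-to-one onto byte values
# 0x00..0xFF, so this encode succeeds and reproduces s2.
s3 = s1.encode('latin-1')
print(ascii_failed, s3 == s2)
```

The latin-1 round trip only works because every character in s1 happens to sit below U+0100; it says nothing about whether the text really is Latin-1.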

But what are you really trying to achieve? Where does your Unicode data
come from? What ranges of characters do you expect it to contain? You
need to crunch it into an 8-bit representation because ... what?

zdwang · Apr 9, 2006

Mr. John Machin, Thank you very much!

zdwang · Apr 9, 2006

Mr. John Machin

This question comes from the following code. I use PyXML to build a DOM
tree.

from xml.dom.ext.reader import HtmlLib
doc = HtmlLib.FromHtmlUrl('http://stock.business.sohu.com/q/nbcg.php?code=600028')
title_elem = doc.documentElement.getElementsByTagName("TITLE")[0]
title_string = title_elem.firstChild.data
print title_string

# title_string is unicode, but it is not "latin1" text, so I want to change it.

Serge Orlov · Apr 10, 2006

Errr, but the title of the page is written in Chinese and it is not
supposed to be crammed into latin1 encoding. What are you trying to do
with the string after you squeezed Chinese into latin1?

John Machin · Apr 10, 2006

Errrrrrrr, it gets worse: not only is the title written in Chinese, it
is encoded as gb2312 -- here is the repr() of the first few chunks:

"<html>\n<head>\n <title>\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) : \xc4\xda\xb2\xbf\xc8\xcb\xd4\xb1\xb3\xd6\xb9\xc9 - \xcb\xd1\xba\xfc\xb9\xc9\xc6\xb1</title>\n<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n"

and here is what you get after that_guff.decode('gb2312'):

u"<html>\n<head>\n <title>\u4e2d\u56fd\u77f3\u5316(600028) : \u5185\u90e8\u4eba\u5458\u6301\u80a1 - \u641c\u72d0\u80a1\u7968</title>\n<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n"

The first 2 characters of the title are recognisable both visually in
the browser title and in the Unicode as "zhong guo", i.e. China.
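The decode-first approach could be sketched roughly as follows in Python 3. This is not what PyXML does internally; the helper names declared_charset and page_title are hypothetical, the regular expressions are a simplification that would not survive arbitrary HTML, and the raw bytes are just the chunk shown above rather than a live download.

```python
import re

# The first chunk of the page, as raw bytes (copied from the repr above).
raw = (b"<html>\n<head>\n <title>\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) : "
       b"\xc4\xda\xb2\xbf\xc8\xcb\xd4\xb1\xb3\xd6\xb9\xc9 - "
       b"\xcb\xd1\xba\xfc\xb9\xc9\xc6\xb1</title>\n"
       b"<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n")

def declared_charset(page_bytes, default='latin-1'):
    """Hypothetical helper: pull the charset name out of the meta tag."""
    match = re.search(rb"charset=([A-Za-z0-9_-]+)", page_bytes)
    return match.group(1).decode('ascii') if match else default

def page_title(page_bytes):
    """Hypothetical helper: decode with the declared charset, then find <title>."""
    text = page_bytes.decode(declared_charset(page_bytes))
    match = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else None

title = page_title(raw)
print(ascii(title))
```

The point of the ordering is that the bytes are turned into text exactly once, using the encoding the page itself declares, before anything tries to read the title.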

BUT the OP's first message is interpreting that gb2312-encoded stuff as
Unicode:
s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
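If the damage has already been done and all you hold is a string like s1, one possible repair, sketched here in Python 3, is to undo the mistaken step and then decode properly. It assumes each code point in s1 really is a gb2312 byte value that was misread as a character; the name repaired is made up for illustration.

```python
# Python 3 sketch; assumes s1 holds gb2312 bytes misread as code points.
s1 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

# Step 1: latin-1 turns each code point below U+0100 back into the
# byte of the same value, recovering the original gb2312 bytes.
original_bytes = s1.encode('latin-1')

# Step 2: decode those bytes with the encoding the page declares.
repaired = original_bytes.decode('gb2312')

print(ascii(repaired))
```

This is a workaround, not a fix: the better place to act is wherever the bytes were first turned into text with the wrong encoding.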

*SOMEBODY* is seriously deluded, and it ain't me, and it ain't Serge

.... and yes Peter, info travels faster also from China than it does
from Armenia :-())

Peter Otten · Apr 11, 2006

John said:
... and yes Peter, info travels faster also from China than it does
from Armenia :-())

Q: Can info travel faster from Armenia than from China?
Radio Yerevan: In principle, yes. Just make sure that it doesn't go the
other way round the globe or meets some friends on the way...
