A question about unicode() function

JTree · Dec 31, 2006

Hi all,
I ran into a problem when calling the unicode() function on a webpage
I fetched, and I don't know why it happened.
My code and error message are:

Code:
#!/usr/bin/python
#Filename: test.py
#Modified: 2006-12-31

import cPickle as p
import urllib
import htmllib
import re
import sys

def funUrlFetch(url):
    lambda url:urllib.urlopen(url).read()

objUrl = raw_input('Enter the Url:')
content = funUrlFetch(objUrl)
content = unicode(content,"gbk")
print content
content.close()

error message:

C:\WINDOWS\system32\cmd.exe /c python test.py
Enter the Url:http://www.msn.com
Traceback (most recent call last):
File "test.py", line 16, in ?
content = unicode(content,"gbk")
TypeError: coercing to Unicode: need string or buffer, NoneType found
shell returned 1
Hit any key to close this window...

Any suggestions would be appreciated!

Thanks!

Felipe Almeida Lessa · Dec 31, 2006

def funUrlFetch(url):
    lambda url:urllib.urlopen(url).read()

This function only creates a lambda function (which is never called,
assigned, or returned), nothing more, nothing less. So it returns None
(sort of "void") no matter what its argument is. You probably meant
something like

def funUrlFetch(url):
    return urllib.urlopen(url).read()

or

funUrlFetch = lambda url:urllib.urlopen(url).read()

objUrl = raw_input('Enter the Url:')
content = funUrlFetch(objUrl)

content gets assigned None. Try putting "print content" before the unicode line.

content = unicode(content,"gbk")

This, equivalent to unicode(None, "gbk"), leads to

TypeError: coercing to Unicode: need string or buffer, NoneType found

None is neither a string nor a buffer, so unicode() complains.
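
To make the difference concrete, here is a minimal sketch that needs no network access. The function names and the upper() call standing in for urllib.urlopen(url).read() are assumptions for illustration, not part of the original script.

```python
# Stand-in for the download step: upper() replaces urllib.urlopen(url).read()
# so the example runs offline (an assumption for illustration only).

def fetch_without_return(text):
    # The lambda is created and immediately discarded; nothing is returned.
    lambda text: text.upper()

def fetch_with_return(text):
    # An explicit return hands the result back to the caller.
    return text.upper()

# Binding the lambda to a name also works.
fetch_as_lambda = lambda text: text.upper()

print(fetch_without_return("abc"))  # the implicit return value
print(fetch_with_return("abc"))
print(fetch_as_lambda("abc"))
```

Only the first version falls off the end of the function body, which is why its caller gets None.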

See ya,

JTree · Jan 1, 2007

Hi,

I changed my code to:

#!/usr/bin/python
#Filename: test.py
#Modified: 2007-01-01

import cPickle as p
import urllib
import htmllib
import re
import sys

funUrlFetch = lambda url:urllib.urlopen(url).read()

objUrl = raw_input('Enter the Url:')
content = funUrlFetch(objUrl)
content = content.encode('gb2312','ignore')
print content
content.close()

I used "ignore" to deal with the data loss, but it still caused an
error:

C:\WINDOWS\system32\cmd.exe /c python tianya.py
Enter the Url:http://www.tianya.cn
Traceback (most recent call last):
File "tianya.py", line 17, in ?
content = content.encode('gb2312','ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position
88: ordinal not in range(128)
shell returned 1
Hit any key to close this window...

My Python version is 2.4. Does it have some problems with Asian
encoding support?

Thanks!

Tim Roberts · Jan 1, 2007

JTree said:
Hi all,
I ran into a problem when calling the unicode() function on a webpage
I fetched, and I don't know why it happened.
My code and error message are:

Code:
#!/usr/bin/python
#Filename: test.py
#Modified: 2006-12-31

import cPickle as p
import urllib
import htmllib
import re
import sys

def funUrlFetch(url):
    lambda url:urllib.urlopen(url).read()

objUrl = raw_input('Enter the Url:')
content = funUrlFetch(objUrl)
content = unicode(content,"gbk")
print content
content.close()

Once you fix the lambda, as Felipe described, there's another issue here.
You are telling the unicode function that the string you're passing it is
an 8-bit string encoded as gbk. How do you know that? In your specific
example, www.msn.com, I can guarantee it will produce the wrong results:
www.msn.com is encoded in UTF-8.
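
A small sketch of why the declared encoding matters. The sample text is an assumption chosen for illustration; the point is only that decoding UTF-8 bytes as gbk can succeed without any error and still give you different text.

```python
# Illustrative sample text: "cafe" with an accented final letter.
original = u"caf\u00e9"
raw = original.encode("utf-8")   # the bytes a UTF-8 page would send

as_utf8 = raw.decode("utf-8")    # correct codec: text comes back intact
as_gbk = raw.decode("gbk")       # wrong codec: no exception, different text

print(repr(as_utf8))
print(repr(as_gbk))
print(as_utf8 == original)
print(as_gbk == original)
```

In practice the encoding is usually declared in the HTTP Content-Type header or in a meta tag inside the page, so that is the place to look before picking a codec.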

John Machin · Jan 1, 2007

JTree said:
Hi,

I changed my code to:

#!/usr/bin/python
#Filename: test.py
#Modified: 2007-01-01

import cPickle as p
import urllib
import htmllib
import re
import sys

funUrlFetch = lambda url:urllib.urlopen(url).read()

objUrl = raw_input('Enter the Url:')
content = funUrlFetch(objUrl)
content = content.encode('gb2312','ignore')

Why did you change what you had before? "content" is a str, encoded in
gb2312 (according to the internal evidence). You are now pretending
that it is unicode, and trying to encode it as gb2312. However because
it is *not* unicode, Python tries to convert it to unicode first. What
you have coded above is equivalent to:
content = content.decode('ascii').encode('gb2312', 'ignore')

and of course the *decode* fails, as the error message says:
Unicode*Decode*Error: 'ascii' codec can't decode byte 0xbb in position
88: ordinal not in range(128)

It never got anywhere near the encode().

So:
If you want a str encoded in gb2312, leave it alone.
If you want it in unicode, do this:
ucontent = unicode(content, 'gb2312')
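
The two steps can be sketched explicitly. The sample bytes are an assumption standing in for the downloaded page, and the code spells out the ASCII decode that Python 2 performs implicitly, so it behaves the same way on Python 3. Here content.decode('gb2312') does the same job as unicode(content, 'gb2312').

```python
# Bytes standing in for a gb2312-encoded page (illustrative sample, not real page data).
content = b"\xd6\xd0\xce\xc4"

# What the implicit conversion amounts to: an ASCII decode, which fails on byte 0xd6.
try:
    content.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    print("decode failed: %s" % exc)

# Decoding with the encoding the bytes are actually in gives text.
ucontent = content.decode("gb2312")
print(repr(ucontent))

# Encoding that text back to gb2312 reproduces the original bytes.
print(ucontent.encode("gb2312") == content)
```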

print content

Try print repr(content)
It's much better for diagnostic purposes.
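
For example, with sample bytes assumed to stand in for the page content, repr() shows each non-ASCII byte as an escape instead of sending it raw to the console:

```python
content = b"\xd6\xd0\xce\xc4"   # illustrative gb2312 bytes, not real page data
shown = repr(content)
print(shown)                     # escapes such as \xd6 appear as plain ASCII text
```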

content.close()

This will be your next problem; "content" refers to a str object or a
unicode object -- they don't have a close() method !!
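
A minimal sketch of the distinction, using io.BytesIO as an assumed stand-in for the file-like object that urlopen returns, so it runs offline:

```python
import io

# Stand-in for urllib.urlopen(url): a file-like object holding some bytes.
response = io.BytesIO(b"<html>example</html>")
try:
    content = response.read()   # read() returns a plain byte string
finally:
    response.close()            # close the response object, not the string

print(response.closed)
print(hasattr(content, "close"))
```

If you want to close anything, keep a reference to the object you called read() on and close that.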

I used "ignore" to deal with the data loss, but it still caused an
error:

What data loss???

C:\WINDOWS\system32\cmd.exe /c python tianya.py
Enter the Url:http://www.tianya.cn
Traceback (most recent call last):
File "tianya.py", line 17, in ?
content = content.encode('gb2312','ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position
88: ordinal not in range(128)
shell returned 1
Hit any key to close this window...

My Python version is 2.4. Does it have some problems with Asian
encoding support?

"asian" is irrelevant. You would have got the same problem with just
about any non-ascii encoding, including cp1252 and similar encodings
commonly used in English-speaking countries and in western Europe. The
only encoding support problem with 2.4 is that it can't read your mind.

By the way, you should upgrade to 2.5. It can't read your mind either,
but it has more functionality, etc.

HTH,
John

JTree · Jan 1, 2007

Thanks everyone!

Sorry for my ambiguous question.
I changed the code and now it works fine.

Paul Watson · Jan 2, 2007

JTree said:
Thanks everyone!

Sorry for my ambiguous question.
I changed the code and now it works fine.

So... How about posting the brief working code?

JTree · Jan 3, 2007

Hi,
I just removed the unicode() call from my code.
As John Machin said, I had a wrong understanding of unicode and ascii.
