A question about unicode() function

JTree

Hi all,
I ran into a problem when using the unicode() function on a fetched
webpage, and I don't know why it happened.
My code and the error message are:


Code:
#!/usr/bin/python
#Filename: test.py
#Modified: 2006-12-31

import cPickle as p
import urllib
import htmllib
import re
import sys

def funUrlFetch(url):
    lambda url:urllib.urlopen(url).read()

objUrl = raw_input('Enter the Url:')
content = funUrlFetch(objUrl)
content = unicode(content,"gbk")
print content
content.close()


error message:

C:\WINDOWS\system32\cmd.exe /c python test.py
Enter the Url:http://www.msn.com
Traceback (most recent call last):
File "test.py", line 16, in ?
content = unicode(content,"gbk")
TypeError: coercing to Unicode: need string or buffer, NoneType found
shell returned 1
Hit any key to close this window...

Any suggestions would be appreciated!

Thanks!
 
Felipe Almeida Lessa

def funUrlFetch(url):
    lambda url:urllib.urlopen(url).read()

This function only creates a lambda (which is never called or assigned
anywhere) and then discards it. Because the def has no return statement,
it returns None (sort of "void") whatever its argument is. You probably
meant something like

def funUrlFetch(url):
    return urllib.urlopen(url).read()

or

funUrlFetch = lambda url:urllib.urlopen(url).read()

objUrl = raw_input('Enter the Url:')
content = funUrlFetch(objUrl)

content gets assigned None. Try putting "print content" before the unicode line.
content = unicode(content,"gbk")

This, equivalent to unicode(None, "gbk"), leads to
TypeError: coercing to Unicode: need string or buffer, NoneType found

None is neither a string nor a buffer, so unicode() complains.
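
To make the difference concrete, here is a minimal sketch of the three definitions side by side. fetch_stub is a hypothetical stand-in for urllib.urlopen(url).read(), used only so the example runs without network access; the other names are made up for illustration too.

```python
def fetch_stub(url):
    # Hypothetical stand-in for urllib.urlopen(url).read(): returns
    # canned bytes instead of touching the network.
    return b"<html>page at " + url.encode("ascii") + b"</html>"


def broken_fetch(url):
    # Builds a lambda object and throws it away. With no return
    # statement, the function falls off the end and returns None.
    lambda url: fetch_stub(url)


def fixed_fetch(url):
    # Same body, but the result is actually returned.
    return fetch_stub(url)


# Equivalent one-liner: bind the lambda itself to a name.
lambda_fetch = lambda url: fetch_stub(url)

print(broken_fetch("http://example.com") is None)
print(fixed_fetch("http://example.com") == lambda_fetch("http://example.com"))
```

Passing the result of broken_fetch to unicode() is what triggers the "NoneType found" error.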

See ya,
 
J

JTree

Hi,

I changed my code to:

#!/usr/bin/python
#Filename: test.py
#Modified: 2007-01-01

import cPickle as p
import urllib
import htmllib
import re
import sys

funUrlFetch = lambda url:urllib.urlopen(url).read()

objUrl = raw_input('Enter the Url:')
content = funUrlFetch(objUrl)
content = content.encode('gb2312','ignore')
print content
content.close()

I used "ignore" to deal with the data loss, but it still caused an
error:

C:\WINDOWS\system32\cmd.exe /c python tianya.py
Enter the Url:http://www.tianya.cn
Traceback (most recent call last):
File "tianya.py", line 17, in ?
content = content.encode('gb2312','ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position
88: ordinal not in range(128)
shell returned 1
Hit any key to close this window...

My Python version is 2.4. Does it have problems with Asian
encoding support?

Thanks!
 
Tim Roberts

JTree said:
Hi all,
I ran into a problem when using the unicode() function on a fetched
webpage, and I don't know why it happened.
My code and the error message are:


Code:
#!/usr/bin/python
#Filename: test.py
#Modified: 2006-12-31

import cPickle as p
import urllib
import htmllib
import re
import sys

def funUrlFetch(url):
    lambda url:urllib.urlopen(url).read()

objUrl = raw_input('Enter the Url:')
content = funUrlFetch(objUrl)
content = unicode(content,"gbk")
print content
content.close()

Once you fix the lambda, as Felipe described, there's another issue here.
You are telling the unicode function that the string you're passing it is
an 8-bit string encoded as gbk. How do you know that? In your specific
example, www.msn.com, I can guarantee it will produce the wrong results:
www.msn.com is encoded in UTF-8.
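
A small sketch of why the declared encoding has to match the bytes, plus one possible way to find out what a page claims to be. The sample text and the helper charset_from_content_type are hypothetical, and the "utf-8" fallback is an assumption, not something the server promises. The header string would normally come from the HTTP response; here it is hard-coded.

```python
text = u"\u4e2d\u6587"            # two Chinese characters

as_utf8 = text.encode("utf-8")    # what a UTF-8 page would send
as_gbk = text.encode("gbk")       # what a GBK page would send

# Decoding with the matching codec recovers the text.
right = as_utf8.decode("utf-8")

# Decoding UTF-8 bytes as GBK does not: the result is garbage
# ("replace" keeps the example from raising on invalid sequences).
wrong = as_utf8.decode("gbk", "replace")


def charset_from_content_type(header, default="utf-8"):
    # Hypothetical helper: pull the charset parameter out of a
    # Content-Type value such as "text/html; charset=GBK".
    for part in header.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "charset" and value:
            return value.strip().strip('"').lower()
    return default  # assumption: fall back when nothing is declared


print(right == text, wrong == text)
print(charset_from_content_type("text/html; charset=GBK"))
```

A declared charset can still be wrong or missing, so treat it as a hint rather than a guarantee.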
 
John Machin

JTree said:
Hi,

I changed my code to:

#!/usr/bin/python
#Filename: test.py
#Modified: 2007-01-01

import cPickle as p
import urllib
import htmllib
import re
import sys

funUrlFetch = lambda url:urllib.urlopen(url).read()

objUrl = raw_input('Enter the Url:')
content = funUrlFetch(objUrl)
content = content.encode('gb2312','ignore')

Why did you change what you had before? "content" is a str, encoded in
gb2312 (according to the internal evidence). You are now pretending
that it is unicode, and trying to encode it as gb2312. However because
it is *not* unicode, Python tries to convert it to unicode first. What
you have coded above is equivalent to:
content = content.decode('ascii').encode('gb2312', 'ignore')

and of course the *decode* fails, as the error message says:
Unicode*Decode*Error: 'ascii' codec can't decode byte 0xbb in position
88: ordinal not in range(128)

It never got anywhere near the encode().
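
The hidden decode step can be reproduced by spelling it out. This is a sketch with made-up sample bytes (only the 0xbb byte is taken from the traceback); writing the decode explicitly makes it behave the same on newer Python versions, where byte strings have no encode() at all.

```python
# Byte string containing one non-ASCII byte (0xbb) at offset 3.
content = b"abc\xbbdef"

failure = None
try:
    # The explicit form of what content.encode('gb2312', 'ignore')
    # does on Python 2: decode as ASCII first, then encode.
    content.decode("ascii").encode("gb2312", "ignore")
except UnicodeDecodeError as err:
    failure = err
    print("decode failed at offset %d: %s" % (err.start, err.reason))

# The 'ignore' argument only applies to the encode step, so it never
# gets a chance to help. Passing it to decode() would silence the
# error, but only by silently dropping the offending byte.
lossy = content.decode("ascii", "ignore")
print(repr(lossy))
```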

So:
If you want a str encoded in gb2312, leave it alone.
If you want it in unicode, do this:
ucontent = unicode(content, 'gb2312')
print content

Try print repr(content)
It's much better for diagnostic purposes.
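
As a rough illustration of why repr() helps: it shows whether you are holding encoded bytes or decoded text, which a plain print hides. The sample text is made up, and content.decode('gb2312') is used here as the spelling of unicode(content, 'gb2312') that also works on later Python versions.

```python
text = u"\u4e2d\u6587"               # sample text, two characters

content = text.encode("gb2312")      # what the server sends: bytes
ucontent = content.decode("gb2312")  # decoded text

# repr() makes the difference visible: escaped bytes versus a
# text object.
print(repr(content))
print(repr(ucontent))

# The lengths differ too: gb2312 uses two bytes per Chinese character.
print(len(content), len(ucontent))
```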

content.close()

This will be your next problem; "content" refers to a str object or a
unicode object -- they don't have a close() method !!
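
In other words, close() belongs to the file-like object you read from, not to the string that read() returns. A minimal sketch, using io.BytesIO as a hypothetical stand-in for the object urllib.urlopen(url) gives you:

```python
import io

# Stand-in for the response object returned by urllib.urlopen(url).
response = io.BytesIO(b"<html>...</html>")

try:
    content = response.read()   # content is a plain byte string
finally:
    response.close()            # close the response, not the string

print(response.closed)
print(hasattr(content, "close"))
```

With the original one-line lambda the response object is never bound to a name, so there is nothing to call close() on; keeping a reference as above is one way around that.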
I used "ignore" to deal with the data loss, but it still caused an
error:

What data loss???
C:\WINDOWS\system32\cmd.exe /c python tianya.py
Enter the Url:http://www.tianya.cn
Traceback (most recent call last):
File "tianya.py", line 17, in ?
content = content.encode('gb2312','ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position
88: ordinal not in range(128)
shell returned 1
Hit any key to close this window...

My Python version is 2.4. Does it have problems with Asian
encoding support?

"asian" is irrelevant. You would have got the same problem with just
about any non-ascii encoding, including cp1252 and similar encodings
commonly used in English-speaking countries and in western Europe. The
only encoding support problem with 2.4 is that it can't read your mind.


By the way, you should upgrade to 2.5, it can't read your mind either,
but it has more functionality etc :)

HTH,
John
 
JTree

Thanks everyone!

Sorry for my ambiguous question.
I changed the code and now it works fine.
 
Paul Watson

JTree said:
Thanks everyone!

Sorry for my ambiguous question.
I changed the code and now it works fine.

So... How about posting the brief working code?
 
JTree

Hi,
I just removed the unicode() call from my code.
As John Machin said, I had a wrong understanding of unicode and ascii.
 
