Can't get the real contents from a page on the internet because of the "no-cache" tag


Kent Johnson

dongdong said:
a web browser can get the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?ú'ò?aò3??...</body></html>

The page is in Chinese (I think); when you print the data, it is displayed
in your console encoding, which apparently isn't Chinese. What did you
expect to see?

Kent
 

dongdong

Yeah, you're right, the page uses Chinese. (I'm Chinese too. ^_^)

Using urllib2.urlopen('............').read(), I can't get the contents
between '<body>' and '</body>'. I think the reason isn't the Chinese
encoding but the 'no-cache' setting.

I want to get the contents in between.

Can you find out why I can't read the contents? Thanks.
 

I V

dongdong said:
a web browser can get the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">

This line here instructs the browser to go to
http://tech.163.com/04/1110/12/14QUR2BR0009159H.html . If you try
loading that with urllib2, do you get the right content?

If the people behind that web page knew how to use the web, they
wouldn't use the META HTTP-EQUIV hack, and instead would have
instructed their web server to return a 3xx redirect response (e.g. 301
or 302), which would have allowed urllib2 to follow the redirect and get
the right content automatically. If you have any influence with them,
you could try to persuade them to set up their web server properly.
 

Tim Roberts

dongdong said:
a web browser can get the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?ú'ò?aò3??...</body></html>

I think the reason is the no-cache setting. Can anyone help me?

No, that's not the reason. The reason is that this includes a redirect.

As an HTML consumer, you are supposed to parse that content and notice the
<meta http-equiv> tag, which says "here is something that should have been
one of the HTTP headers".

In this case, it wants you to act as though you saw:
Refresh: 0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html
Pragma: no-cache

In this case, the "Refresh" header means that you are supposed to go fetch
the contents of that new page immediately. Try using urllib2.urlopen on THAT
address, and you should get your content.

This is one way to handle a web site reorganization and still allow older
URLs to work.
 

Diez B. Roggisch

dongdong said:
a web browser can get the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?ú'ò?aò3??...</body></html>

I think the reason is the no-cache setting. Can anyone help me?

No, the reason is the <META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">

that redirects you to the real site. Extract that URL from the page and
request it. Or maybe you can use webunit, which acts more like a "real"
HTTP client and interprets such content.
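Extracting that URL can be done with a quick regular expression (a minimal sketch tailored to the exact markup quoted above; case-insensitive because the tag uses mixed case, and `\s+` allows the tag to span lines):

```python
import re

def refresh_url(html):
    """Return the target URL of a META HTTP-EQUIV=REFRESH tag, or None."""
    match = re.search(
        r'HTTP-EQUIV\s*=\s*"?REFRESH"?\s+CONTENT\s*=\s*"\s*\d+\s*;\s*URL=([^"]+)"',
        html, re.IGNORECASE)
    return match.group(1) if match else None

page = ('<html><head><META HTTP-EQUIV=REFRESH\n'
        'CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">'
        '</head><body>...</body></html>')
refresh_url(page)
# -> 'http://tech.163.com/04/1110/12/14QUR2BR0009159H.html'
```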

diez
 

dongdong

Oh~~~! My thanks to Tim Roberts and everyone above!
I see now, it's the different URL that causes it!
The contents can only be fetched from the second (real) URL.
My mistake was not noticing that a different URL takes effect.
 

John J. Lee

dongdong said:
Oh~~~! My thanks to Tim Roberts and everyone above!
I see now, it's the different URL that causes it!
The contents can only be fetched from the second (real) URL.
My mistake was not noticing that a different URL takes effect.

If you use ClientCookie.urlopen() in place of urllib2.urlopen(), it
will handle Refreshes and HTTP-EQUIV for you transparently.

Actually, you have to explicitly ask for that functionality:

import ClientCookie
opener = ClientCookie.build_opener(ClientCookie.HTTPEquivProcessor,
                                   ClientCookie.HTTPRefreshProcessor,
                                   )
ClientCookie.install_opener(opener)

print ClientCookie.urlopen(url).read()


If you want to do even less of this stuff "by hand", class Browser
from module mechanize is a subclass of the class of "opener" above,
but behaves much more like a web browser in various ways. Still
alpha, but very near now to stable release.


FWIW, you can also use ClientCookie.HTTPRefreshProcessor,
ClientCookie.HTTPEquivProcessor etc. with Python 2.4's urllib2, as
long as you follow the instructions under the heading "Notes about
ClientCookie, urllib2 and cookielib" in the ClientCookie README file
(specifically, if you want to use ClientCookie.HTTPRefreshProcessor with
Python 2.4's urllib2, you must also use
ClientCookie.HTTPRedirectHandler).


John
 
