Can't get the real contents from a page on the internet because of the "no-cache" tag


Kent Johnson

dongdong said:
a web browser can get the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?ú'ò?aò3??...</body></html>

The page is in Chinese (I think); when you print the data, it is displayed
in your console encoding, which apparently isn't Chinese. What did you
expect to see?

Kent
 

dongdong

Yeah, you're right, the page uses Chinese. (I'm Chinese too. ^_^)

Using urllib2.urlopen('............').read(), I can't get the contents
between '<body>' and '</body>'. I think the reason isn't the Chinese
encoding but the 'no-cache' setting.

I want to get the contents in between.

Can you find out why I can't read the contents? Thanks.
 

I V

dongdong said:
a web browser can get the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">

This line here instructs the browser to go to
http://tech.163.com/04/1110/12/14QUR2BR0009159H.html . If you try
loading that with urllib2, do you get the right content?

If the people behind that web page knew how to use the web, they
wouldn't use the META HTTP-EQUIV hack, and instead would have
instructed their web server to return a 3xx redirect response (e.g. 301
or 302), which would have allowed urllib2 to follow the redirect and get
the right content automatically. If you have any influence with them,
you could try to persuade them to set up their web server properly.
 

Tim Roberts

dongdong said:
a web browser can get the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?ú'ò?aò3??...</body></html>

I think the reason is the no-cache setting. Can anyone help me?

No, that's not the reason. The reason is that this includes a redirect.

As an HTML consumer, you are supposed to parse that content and notice the
<meta http-equiv> tag, which says "here is something that should have been
one of the HTTP headers".

In this case, it wants you to act as though you saw:
Refresh: 0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html
Pragma: no-cache

In this case, the "Refresh" header means that you are supposed to go fetch
the contents of that new page immediately. Try using urllib2.urlopen on THAT
address, and you should get your content.

This is one way to handle a web site reorganization and still allow older
URLs to work.
 

Diez B. Roggisch

dongdong said:
a web browser can get the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?ú'ò?aò3??...</body></html>

I think the reason is the no-cache setting. Can anyone help me?

No, the reason is the <META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">

that redirects you to the real site. Extract that URL from the page and
request it. Or maybe you can use webunit, which acts more like a "real"
HTTP client and interprets such content.
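Extracting that URL can be done with a quick regular expression (a minimal sketch tailored to the exact markup quoted above; case-insensitive because the tag uses mixed case, and `\s+` allows the tag to span lines):

```python
import re

def refresh_url(html):
    """Return the target URL of a META HTTP-EQUIV=REFRESH tag, or None."""
    match = re.search(
        r'HTTP-EQUIV\s*=\s*"?REFRESH"?\s+CONTENT\s*=\s*"\s*\d+\s*;\s*URL=([^"]+)"',
        html, re.IGNORECASE)
    return match.group(1) if match else None

page = ('<html><head><META HTTP-EQUIV=REFRESH\n'
        'CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">'
        '</head><body>...</body></html>')
refresh_url(page)
# -> 'http://tech.163.com/04/1110/12/14QUR2BR0009159H.html'
```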

diez
 

dongdong

Oh~~~! My thanks to Tim Roberts and everyone above!
I see now, it's the different URL that causes it!
The contents can only be fetched from the second (real) URL.
My mistake was not noticing that a different URL takes effect.
 

John J. Lee

dongdong said:
Oh~~~! My thanks to Tim Roberts and everyone above!
I see now, it's the different URL that causes it!
The contents can only be fetched from the second (real) URL.
My mistake was not noticing that a different URL takes effect.

If you use ClientCookie.urlopen() in place of urllib2.urlopen(), it
will handle Refreshes and HTTP-EQUIV for you transparently.

Actually, you have to explicitly ask for that functionality:

import ClientCookie
opener = ClientCookie.build_opener(ClientCookie.HTTPEquivProcessor,
                                   ClientCookie.HTTPRefreshProcessor,
                                   )
ClientCookie.install_opener(opener)

print ClientCookie.urlopen(url).read()


If you want to do even less of this stuff "by hand", class Browser
from module mechanize is a subclass of the class of "opener" above,
but behaves much more like a web browser in various ways. Still
alpha, but very near now to stable release.


FWIW, you can also use ClientCookie.HTTPRefreshProcessor,
ClientCookie.HTTPEquivProcessor etc. with Python 2.4's urllib2, as
long as you follow the instructions under the heading "Notes about
ClientCookie, urllib2 and cookielib" in the ClientCookie README file
(specifically, if you want to use ClientCookie.HTTPRefreshProcessor with
Python 2.4's urllib2, you must also use
ClientCookie.HTTPRedirectHandler).


John
 
