fetching webpage

yookyung

I am trying to crawl web pages in the CiteSeer domain (a collection of research
papers, mostly in computer science).

I have used the following code snippet.

#####
import urllib

sock = urllib.urlopen("http://citeseer.ist.psu.edu")
webcontent = sock.read().split('\n')
sock.close()
print webcontent
########

Then I get the following output, which looks like a server error page.


['<!--#set var="TITLE" value="Server error!"', '--><!--#include
virtual="include/top.html" -->', '', ' <!--#if
expr="$REDIRECT_ERROR_NOTES" -->', '', ' The server encountered an
internal error and was ', ' unable to complete your request.', '', '
<!--#include virtual="include/spacer.html" -->', '', ' Error message:', '
<br /><!--#echo encoding="none" var="REDIRECT_ERROR_NOTES" -->', '', '
<!--#else -->', '', ' The server encountered an internal error and was ',
' unable to complete your request. Either the server is', ' overloaded
or there was an error in a CGI script.', '', ' <!--#endif -->', '',
'<!--#include virtual="include/bottom.html" -->', '']

However, the URL is valid and works fine when I open it in my web
browser.
Also, if I use a different URL (http://www.google.com instead of
http://citeseer.ist.psu.edu),
the script works.

What is wrong?
Could it be that the CiteSeer web server checks the HTTP request, sees
something that it doesn't like, and rejects the request?
What should I do?

Thank you.

Best regards,
Yookyung
 
charlespina

I went to the URL you posted, and it looks like that error is the
content you should be receiving. Try clearing your browser cache and
reloading; your browser could be showing you a cached page.

Charles
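
One way to narrow this down is to look at the HTTP status code instead of only the body, and to retry with an explicit User-Agent header. The sketch below is written for Python 3 (`urllib.request`), not the Python 2 `urllib` used in the question. The names `build_request`, `fetch`, and `DEFAULT_USER_AGENT` are made up for illustration, and whether the CiteSeer server actually inspects that header is an assumption, not something confirmed in this thread.

```python
import urllib.error
import urllib.request

# Hypothetical User-Agent string; any descriptive value could be used here.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-crawler/0.1)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Build a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return (status_code, lines) for url, including for HTTP error responses."""
    request = build_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
            status = response.status
    except urllib.error.HTTPError as err:
        # urllib.request raises on 4xx/5xx; the error object still carries
        # the status code and the body of the error page.
        body = err.read()
        status = err.code
    return status, body.decode("utf-8", errors="replace").splitlines()
```

Calling `fetch("http://citeseer.ist.psu.edu")` and printing the returned status would show whether the server answers with a 5xx code. A 5xx status with the same error page would suggest the server itself is failing for this request, while a 200 would suggest it responds differently depending on the request headers. In Python 2, as far as I know, `urllib.urlopen` returns the body of an error response rather than raising, which would explain why the error page is printed as ordinary content in the question.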
 
