python tags on websites timeout problem

jeff · Jul 20, 2003

Hiya

I'm trying to pull tags off a website using Python. I've got a few things
running that have the potential to work; I just can't get them to
because of certain errors.

Basically, I don't want to download the images and all the other stuff, just
the HTML, and then work from there. I think it's timing out because it's
trying to download the images as well, which I don't want to do as this
would decrease the speed of what I'm trying to achieve. The URL used is
only an example.

I've included my source and the errors.

cheers

greg

this is my source

--------------------------------------------------------------------------------

#!/usr/bin/env python
import re
import urllib

file = urllib.urlretrieve("http://images.google.com/images?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=rabbit"
, "temp1.tmp")

# open a file
file = open("temp1.tmp","r")
text = file.readlines()
file.close()

# searching the file content line by line:
keyword = re.compile(r"</a>")

for line in text:
    result = keyword.search (line)
    if result:
        print result.group(1), ":", line,
--------------------------------------------------------------------------------
and these are the errors I'm getting

C:\Python22>python tagyourit.py
Traceback (most recent call last):
  File "tagyourit.py", line 5, in ?
    file = urllib.urlretrieve("http://images.google.com/image
8&oe=UTF-8&q=rabbit" , "temp1.tmp")
  File "C:\PYTHON22\lib\urllib.py", line 80, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, dat
  File "C:\PYTHON22\lib\urllib.py", line 210, in retrieve
    fp = self.open(url, data)
  File "C:\PYTHON22\lib\urllib.py", line 178, in open
    return getattr(self, name)(url)
  File "C:\PYTHON22\lib\urllib.py", line 292, in open_http
    h.endheaders()
  File "C:\PYTHON22\lib\httplib.py", line 695, in endheaders
    self._send_output()
  File "C:\PYTHON22\lib\httplib.py", line 581, in _send_outpu
    self.send(msg)
  File "C:\PYTHON22\lib\httplib.py", line 548, in send
    self.connect()
  File "C:\PYTHON22\lib\httplib.py", line 532, in connect
    raise socket.error, msg
--------------------------------------------------------------------------------

Lee Harr · Jul 20, 2003

> Hiya
>
> I'm trying to pull tags off a website using Python. I've got a few things
> running that have the potential to work; I just can't get them to
> because of certain errors.
>
> Basically, I don't want to download the images and all the other stuff, just
> the HTML, and then work from there. I think it's timing out because it's
> trying to download the images as well, which I don't want to do as this
> would decrease the speed of what I'm trying to achieve. The URL used is
> only an example.

A web page is made up of many separate components. When you
"download a webpage" you generally are fetching the HTML code,
and you will not get any images unless you specifically
download those by their own URLs.
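
To make that concrete, here is a minimal sketch in modern Python 3 (the thread itself is on Python 2.2, so this is a translation, not the poster's code). It uses the standard library's html.parser to list the tags in a piece of HTML. The class name TagCollector and the sample markup are invented for illustration. Note that the image shows up only as a URL inside the HTML; nothing fetches it unless you ask.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect link targets and image URLs from HTML text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


# Made-up sample page: the image is only referenced by URL, not embedded.
SAMPLE_HTML = '<p><a href="/rabbits">Rabbits</a> <img src="/img/rabbit.jpg"></p>'

collector = TagCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
print(collector.images)
```

Downloading an image would be a second, explicit request to the URL found in collector.images.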

> this is my source
>
> --------------------------------------------------------------------------------
>
> #!/usr/bin/env python
> import re
> import urllib
>
> file = urllib.urlretrieve("http://images.google.com/images?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=rabbit"
> , "temp1.tmp")

Two things:

Don't use the name "file" as the name of your variable, as that
is now a built-in name, the standard way to access a file (used
instead of open), and your variable hides it.

Why save the file and then read it back in?

I might do something like...

text = urllib.urlopen('http://www.example.org')
for line in text.readlines():
    print line
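
For readers on a current interpreter, a rough Python 3 equivalent of the snippet above might look like the sketch below. It is an assumption-laden translation, not what was posted: fetch_html is a made-up helper name, the 10 second default is arbitrary, and the timeout argument belongs to Python 3's urllib.request.urlopen. It reads the page straight into memory, with no temporary file, and gives up after the timeout instead of hanging.

```python
import socket
import urllib.error
import urllib.request


def fetch_html(url, timeout=10.0):
    """Return the body at url as text, or None if the fetch fails.

    Hypothetical helper: only the document at url is read; images and
    other resources it refers to are never requested.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Use the charset the server declares, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (urllib.error.URLError, socket.timeout) as exc:
        # Connection refused, DNS failure, timeout and similar land here.
        print("fetch failed:", exc)
        return None
```

A call such as fetch_html("http://www.example.org", timeout=5) would then return the page text, or None after printing the reason for the failure.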

> # searching the file content line by line:
> keyword = re.compile(r"</a>")
>
> for line in text:
>     result = keyword.search (line)
>     if result:
>         print result.group(1), ":", line,

There are no parentheses in your regex, so I do not
think you will ever have a group(1). Asking for it raises:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: no such group

whereas group(0), the whole match, gives:

'</a>'
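
The difference can be sketched as follows, in Python 3 syntax. The sample line and the second pattern are invented for illustration; the second pattern is just one simple way to capture the text between the tags and is not meant as a robust HTML parser.

```python
import re

# Made-up sample line of HTML.
line = '<a href="/rabbits">Rabbits</a>'

# No parentheses: only group(0), the whole match, exists.
closing_tag = re.compile(r"</a>")
match = closing_tag.search(line)
whole = match.group(0)

try:
    match.group(1)
    missing_group_raised = False
except IndexError:
    # Same "no such group" error as in the session above.
    missing_group_raised = True

# One pair of parentheses: group(1) is the text captured between the tags.
anchor_text = re.compile(r"<a[^>]*>(.*?)</a>")
captured = anchor_text.search(line).group(1)

print(whole, missing_group_raised, captured)
```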

> C:\Python22>python tagyourit.py
> Traceback (most recent call last):
>   File "tagyourit.py", line 5, in ?
>     file = urllib.urlretrieve("http://images.google.com/image
> 8&oe=UTF-8&q=rabbit" , "temp1.tmp")

Is this newline (between image and 8) really there? Maybe
there is a problem with the URL...

>   File "C:\PYTHON22\lib\urllib.py", line 80, in urlretrieve
>     return _urlopener.retrieve(url, filename, reporthook, dat
>   File "C:\PYTHON22\lib\urllib.py", line 210, in retrieve
>     fp = self.open(url, data)
>   File "C:\PYTHON22\lib\urllib.py", line 178, in open
>     return getattr(self, name)(url)
>   File "C:\PYTHON22\lib\urllib.py", line 292, in open_http
>     h.endheaders()
>   File "C:\PYTHON22\lib\httplib.py", line 695, in endheaders
>     self._send_output()
>   File "C:\PYTHON22\lib\httplib.py", line 581, in _send_outpu
>     self.send(msg)
>   File "C:\PYTHON22\lib\httplib.py", line 548, in send
>     self.connect()
>   File "C:\PYTHON22\lib\httplib.py", line 532, in connect
>     raise socket.error, msg
> --------------------------------------------------------------------------------

I think maybe you are just not getting any response at
all from your attempt to fetch. Can you get any other URL?
Maybe Google is watching user-agent strings to try to keep
spiders out of their pages?
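
If the user-agent guess is worth testing against a site that permits automated requests, a request with an explicit User-Agent header can be built roughly like this. This is a Python 3 sketch, not code from the thread: build_request and the agent string are invented names, and by default Python 3's urllib identifies itself with a "Python-urllib/x.y" agent string.

```python
import urllib.request


def build_request(url, user_agent):
    """Return a Request for url carrying the given User-Agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Building the request does not contact the server; nothing is sent yet.
request = build_request("http://www.example.org", "tag-puller/0.1 (hypothetical)")
print(request.get_header("User-agent"))
```

Passing the request object to urllib.request.urlopen in place of a plain URL string would send it with that header.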

John J. Lee · Jul 21, 2003

> I'm trying to pull tags off a website using Python. I've got a few things
> running that have the potential to work; I just can't get them to
> because of certain errors.
>
> file = urllib.urlretrieve("http://images.google.com/images?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=rabbit"
> , "temp1.tmp")

Google's terms of service, IIRC, don't allow automated queries. I'm
not entirely sure what that means, but people seem to interpret it as
meaning "Don't web-scrape", so don't do that. Use the Google API
instead (you can get a free key). It is true some bits of Google
aren't accessible through the API, though. Dunno about the image
search facility.

http://www.google.com/groups?as_q=SOAP python google

http://sourceforge.net/projects/pywebsvcs

John

jeff · Jul 21, 2003

Hiya,

Thanks to everyone who replied, very informative.

Yep, I looked into using the Google API, although I wasn't sure what it
did at first, so I'll check that again.

Thanks for the SourceForge links.

Thanks for everything else.

cheers

greg
