urllib2 - iteration over non-sequence


rplobue

I'm trying to get urllib2 to work on my server, which runs Python
2.2.1. When I run the following code:


import urllib2
for line in urllib2.urlopen('www.google.com'):
    print line


I always get the error:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: iteration over non-sequence


Anyone have any answers?
 

Larry Bates

rplobue said:
I'm trying to get urllib2 to work on my server, which runs Python
2.2.1. When I run the following code:


import urllib2
for line in urllib2.urlopen('www.google.com'):
    print line


I always get the error:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: iteration over non-sequence


Anyone have any answers?

I ran your code:
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Python25\lib\urllib2.py", line 121, in urlopen
    return _opener.open(url, data)
  File "C:\Python25\lib\urllib2.py", line 366, in open
    protocol = req.get_type()
  File "C:\Python25\lib\urllib2.py", line 241, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: www.google.com

Note the traceback.

You need to call it with the type (e.g. http://) in front of the url:

>>> urllib2.urlopen('http://www.google.com')
<addinfourl at 27659320 whose fp = <socket._fileobject object at 0x01A51F48>>

Python's interactive mode is very useful for tracking down this type
of problem.
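
A quick interactive check makes the difference clear (a rough sketch; the
exact repr and page contents will vary):

>>> import urllib2
>>> urllib2.urlopen('www.google.com')             # no scheme: raises ValueError, unknown url type
>>> f = urllib2.urlopen('http://www.google.com')  # scheme included: returns a file-like object
>>> data = f.read()                               # read the whole response body as a string
>>> f.close()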

-Larry
 

rplobue

Thanks for the reply Larry, but I am still having trouble. If I
understand you correctly, you are just suggesting that I add an http://
in front of the address? However, when I run this:

import urllib2
for line in urllib2.urlopen('http://www.google.com'):
    print line

I am still getting the message:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: iteration over non-sequence
 

Gary Herron

rplobue said:
Thanks for the reply Larry, but I am still having trouble. If I
understand you correctly, you are just suggesting that I add an http://
in front of the address? However, when I run this:

import urllib2
for line in urllib2.urlopen('http://www.google.com'):
    print line

I am still getting the message:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: iteration over non-sequence

Newer versions of Python provide an iterator that *reads* the contents
of a file object and supplies the lines to you one by one in a loop.
However, you explicitly said which version of Python you are using, and
there the file-like object returned by urlopen does not support that
kind of iteration.

So... You must explicitly read the contents of the file-like object
yourself, and loop through the lines yourself. However, fear not --
it's easy. The socket._fileobject object provides a method "readlines"
that reads the *entire* contents of the object and returns a list of
lines. And you can iterate through that list of lines. Like this:

import urllib2
url = urllib2.urlopen('http://www.google.com')
for line in url.readlines():
    print line
url.close()


Gary Herron
 

Erik Max Francis

Gary said:
So... You must explicitly read the contents of the file-like object
yourself, and loop through the lines yourself. However, fear not --
it's easy. The socket._fileobject object provides a method "readlines"
that reads the *entire* contents of the object and returns a list of
lines. And you can iterate through that list of lines. Like this:

import urllib2
url = urllib2.urlopen('http://www.google.com')
for line in url.readlines():
    print line
url.close()

This is really wasteful, as there's no point in reading in the whole
file before iterating over it. To get the same effect as file iteration
in later versions, use the .xreadlines method::

for line in aFile.xreadlines():
    ...
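
A minimal sketch of that approach on an ordinary file object (the
filename is just illustrative; note that, as comes up later in the
thread, the object urlopen returns may not provide this method):

aFile = open('page.html')           # a regular file object
for line in aFile.xreadlines():     # yields lines lazily instead of building a full list
    print line
aFile.close()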
 

Paul Rubin

Erik Max Francis said:
This is really wasteful, as there's no point in reading in the whole
file before iterating over it. To get the same effect as file
iteration in later versions, use the .xreadlines method::

for line in aFile.xreadlines():
    ...

Ehhh, a heck of a lot of web pages don't have any newlines, so you end
up getting the whole file anyway, with that method. Something like

for line in iter(lambda: aFile.read(4096), ''): ...

may be best.
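
Spelled out, a rough sketch of that chunked approach (variable names are
just illustrative) might look like:

import urllib2

f = urllib2.urlopen('http://www.google.com')
total = 0
# iter() keeps calling the lambda until it returns '' (end of stream),
# so each pass of the loop handles at most 4096 bytes instead of the whole page
for chunk in iter(lambda: f.read(4096), ''):
    total += len(chunk)
f.close()
print total, 'bytes read'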
 

Gary Herron

Paul said:
Ehhh, a heck of a lot of web pages don't have any newlines, so you end
up getting the whole file anyway, with that method. Something like

for line in iter(lambda: aFile.read(4096), ''): ...

may be best.
Certainly there are cases where xreadlines or read(bytecount) are
reasonable, but only if the total page size is *very* large. But for
most web pages, you guys are just nit-picking (or showing off) to
suggest that the full read implemented by readlines is wasteful.
Moreover, the original problem was with sockets -- which don't have
xreadlines. That seems to be a method on regular file objects.

For simplicity, I'd still suggest my original use of readlines. If
and when you find you are downloading web pages with sizes that are
putting a serious strain on your memory footprint, then one of the other
suggestions might be indicated.

Gary Herron
 

Paul Rubin

Gary Herron said:
For simplicity, I'd still suggest my original use of readlines. If
and when you find you are downloading web pages with sizes that are
putting a serious strain on your memory footprint, then one of the other
suggestions might be indicated.

If you know in advance that the page you're retrieving will be
reasonable in size, then using readlines is fine. If you don't know
in advance what you're retrieving (e.g. you're working on a crawler)
you have to assume that you'll hit some very large pages with
difficult construction.
 

Erik Max Francis

Gary said:
Certainly there are cases where xreadlines or read(bytecount) are
reasonable, but only if the total page size is *very* large. But for
most web pages, you guys are just nit-picking (or showing off) to
suggest that the full read implemented by readlines is wasteful.
Moreover, the original problem was with sockets -- which don't have
xreadlines. That seems to be a method on regular file objects.

For simplicity, I'd still suggest my original use of readlines. If
and when you find you are downloading web pages with sizes that are
putting a serious strain on your memory footprint, then one of the other
suggestions might be indicated.

It isn't nitpicking to point out that you're making something that will
consume vastly more memory than it could possibly need. And
insisting that pages aren't _always_ huge is just a silly cop-out; of
course pages get very large.

There is absolutely no reason to read the entire file into memory (which
is what you're doing) before processing it. This is a good example of
the principle that there should be one obvious way to do it -- and it isn't
to read the whole thing in first for no reason whatsoever other than to
avoid an `x`.
 

Erik Max Francis

Paul said:
If you know in advance that the page you're retrieving will be
reasonable in size, then using readlines is fine. If you don't know
in advance what you're retrieving (e.g. you're working on a crawler)
you have to assume that you'll hit some very large pages with
difficult construction.

And that's before you even mention the point that, depending on the
application, it could easily open you up to a DOS attack.
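
One simple way to guard against that (a sketch, with an arbitrary cap) is
to stop reading once a limit is hit:

import urllib2

MAX_BYTES = 1024 * 1024               # arbitrary 1 MB cap, adjust to taste
f = urllib2.urlopen('http://www.google.com')
pieces = []
read_so_far = 0
for chunk in iter(lambda: f.read(4096), ''):
    pieces.append(chunk)
    read_so_far += len(chunk)
    if read_so_far >= MAX_BYTES:      # bail out instead of buffering an unbounded page
        break
f.close()
page = ''.join(pieces)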

There's premature optimization, and then there's premature completely
obvious and pointless waste. This falls in the latter category.

Besides, someone was asking for/needing an older equivalent to iterating
over a file. That's obviously .xreadlines, not .readlines.
 

Gabriel Genellina

Erik Max Francis said:
There is absolutely no reason to read the entire file into memory (which
is what you're doing) before processing it. This is a good example of
the principle that there should be one obvious way to do it -- and it isn't
to read the whole thing in first for no reason whatsoever other than to
avoid an `x`.

The problem is - and you appear not to have noticed it - that the object
returned by urlopen does NOT have an xreadlines() method; and even if it
did, a lot of pages don't contain any '\n', so using xreadlines would read
the whole page into memory anyway.

Python 2.2 (the version the OP is using) did include an xreadlines
module (now defunct), but in this case it is painfully slooooooooooooow -
perhaps it tries to read the source one character at a time.

So the best way would be to use (as Paul Rubin already said):

for line in iter(lambda: f.read(4096), ''): print line
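
Putting the whole thread together, a sketch of a Python 2.2-friendly
version of the original loop (chunked reads, scheme included, names are
just illustrative) might be:

import urllib2

f = urllib2.urlopen('http://www.google.com')   # note the http:// scheme
buffered = ''
# read 4096 bytes at a time; iter() stops when read() returns '' at end of stream
for chunk in iter(lambda: f.read(4096), ''):
    buffered += chunk
    lines = buffered.split('\n')
    buffered = lines.pop()      # keep any partial trailing line for the next chunk
    for line in lines:
        print line
f.close()
if buffered:
    print buffered              # whatever was left after the final newline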
 
