Too big of a list? and other problems

Brian

First off, I am sorry for cluttering this group with my inept
questions, but I am stuck again despite a few hours of hair pulling.

I have a function (below) that takes a list of html pages that have
images on them (not porn but boats). This function then (supposedly)
goes through and extracts the links to those images and puts them into
a list, appending with each iteration of the for loop. The list of
html pages is 82 items long and each page has multiple image links.
When the function gets to item 77 or so, the list gets all funky.
Sometimes it goes empty, and others it is a much more abbreviated list
than I expect - it should have roughly 750 image links.

When I looked at it while running, it appears as if my regex is
actually appending a tuple (I think) of the results it finds to the
list. My best guess is that the list is getting too big and croaks.
Since one of the objects of the function is also to be able to count
the items in the list, I am getting some strange errors there as well.

Here is the code:

def countPics(linkList):
    foundPics = []
    count = 0
    for link in linkList:
        picPage = urllib.urlopen("http://continuouswave.com/whaler/cetacea/" + link)
        count = count + 1
        print 'got page', count
        html = picPage.read()
        picPage.close()
        pics = re.compile(r"images/.*\.jpeg")
        foundPics.append(pics.findall(html))
        #print len(foundPics)
    print "found", len(foundPics), "pictures"
    print foundPics

Again, I sincerely appreciate the answers, time and patience this group
is giving me.

Thank you for any help you can provide in showing me where I am going
wrong.
Brian
 
Terry Reedy

Brian said:
First off, I am sorry for cluttering this group with my inept
questions, but I am stuck again despite a few hours of hair pulling.

I have a function (below) that takes a list of html pages that have
images on them (not porn but boats). This function then (supposedly)
goes through and extracts the links to those images and puts them into
a list, appending with each iteration of the for loop. The list of
html pages is 82 items long and each page has multiple image links.
When the function gets to item 77 or so, the list gets all funky.
Sometimes it goes empty, and others it is a much more abbreviated list
than I expect - it should have roughly 750 image links.

This does not make much sense without actual examples. But ...
When I looked at it while running, it appears as if my regex is
actually appending a tuple (I think) of the results it finds to the
list.

because that is what you said to do. So foundPics will have one item
appended per page. I am sure you want to .extend the list with the
sequence returned by findall, not .append it.
My best guess is that the list is getting too big and croaks.

Almost certainly not, unless you are out of memory, in which case you
should get an appropriate exception. If the suggested edit is not enough,
print the result of findall, and even foundPics, on every iteration to
investigate further. Add an input statement so you can step through the
iterations (perhaps after the 50th or so).

Terry Jan Reedy
 
Ben Finney

Brian said:
First off, I am sorry for cluttering this group with my inept
questions

Questions aren't a problem; we all come here to learn at some point.

I will ask you, though, to learn effective quoting when you respond to
someone's post (i.e. quote relevant material that gives some context
to your response, with your response following).
I have a function (below) that takes a list of html pages that have
images on them (not porn but boats).

Not boat porn? :)
When I looked at it while running, it appears as if my regex is
actually appending a tuple (I think) of the results it finds to the
list.

Yes, that's what you asked it to do.
pics = re.compile(r"images/.*\.jpeg")
foundPics.append(pics.findall(html))

The 'findall' method of a regex object acts like the module's
'findall' function; it returns a list of the matches.

<URL:http://docs.python.org/lib/node115.html#l2h-879>

The 'append' method of a list object appends a single value to the
list. You're appending a list value to your list, extending it by one
element (the whole list of matches).

Perhaps you want the 'extend' method, which will append each item in
the specified sequence, extending the existing list by the values from
that sequence.

<URL:http://docs.python.org/lib/typesseq-mutable.html>
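A quick sketch of the difference, using made-up stand-in lists rather than your real findall results:

```python
# hypothetical per-page match lists, standing in for pics.findall(html)
matches_page1 = ["images/a.jpeg", "images/b.jpeg"]
matches_page2 = ["images/c.jpeg"]

appended = []
appended.append(matches_page1)  # adds the whole list as ONE item
appended.append(matches_page2)

extended = []
extended.extend(matches_page1)  # adds each match as its own item
extended.extend(matches_page2)

print(len(appended))  # 2 -- counts pages, not pictures
print(len(extended))  # 3 -- counts pictures
```

With append you get a list of lists, so len() tells you how many pages you processed; with extend you get one flat list of image links.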
 
Paul McGuire

Brian said:
First off, I am sorry for cluttering this group with my inept
questions, but I am stuck again despite a few hours of hair pulling.

Don't apologize for getting stuck, especially after you have made an honest
effort at solving your own problems.
I have a function (below) that takes a list of html pages that have
images on them (not porn but boats). This function then (supposedly)
goes through and extracts the links to those images and puts them into
a list, appending with each iteration of the for loop. The list of
html pages is 82 items long and each page has multiple image links.
When the function gets to item 77 or so, the list gets all funky.
Sometimes it goes empty, and others it is a much more abbreviated list
than I expect - it should have roughly 750 image links.

When I looked at it while running, it appears as if my regex is
actually appending a tuple (I think) of the results it finds to the
list. My best guess is that the list is getting too big and croaks.

750 elements is really pretty modest in the universe of Python lists. This
should not be an issue.
Since one of the objects of the function is also to be able to count
the items in the list, I am getting some strange errors there as well.

Here is the code:

def countPics(linkList):
    foundPics = []
    count = 0
    for link in linkList:
        picPage = urllib.urlopen("http://continuouswave.com/whaler/cetacea/" + link)
        count = count + 1
        print 'got page', count
        html = picPage.read()
        picPage.close()
        pics = re.compile(r"images/.*\.jpeg")
        foundPics.append(pics.findall(html))
        #print len(foundPics)
    print "found", len(foundPics), "pictures"
    print foundPics

Again, I sincerely appreciate the answers, time and patience this group
is giving me.

Thank you for any help you can provide in showing me where I am going
wrong.
Brian

I'm not overly familiar with the workings of re.findall so I ran these
statements on the Python command line:

>>> r = re.compile("A.B")
>>> print r.findall("SLDKJFOIWUEAJBLJEQUSAUBSLJF:SDFA_B")
['AJB', 'AUB', 'A_B']
>>> print list(r.findall("SLDKJFOIWUEAJBLJEQUSAUBSLJF:SDFA_B"))
['AJB', 'AUB', 'A_B']
>>> print r.findall("SLDKJFOIWUEAJBLJEQUSAUBSLJF:SDF")
['AJB', 'AUB']
>>> print r.findall("SLDKJFOIWUEAJBLJEQUSSLJF:SDF")
['AJB']
>>> print type(r.findall("SLDKJFOIWUEAJBLJEQUSSLJF:SDF"))
<type 'list'>

Everything looks just like one would expect.

A minor nit: you *don't* have to compile your pics regexp in the body
of the loop. Move the

pics = re.compile(r"images/.*\.jpeg")

statement to before the start of the for loop - you can safely reuse it on
each successive web page without having to recompile. (This is the purpose
of compiling re's in the first place - otherwise, you could just call
re.findall(r"images/.*\.jpeg", html) each time. Compiling the regexp saves
some processing in the body of the loop.) But this should not account for
the odd behavior you describe.
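To illustrate the hoisted compile (with canned one-image HTML snippets standing in for the downloaded pages):

```python
import re

# hypothetical stand-ins for picPage.read() results
pages = ['<img src="images/a.jpeg">', '<img src="images/b.jpeg">']

pics = re.compile(r"images/.*\.jpeg")  # compiled once, before the loop
foundPics = []
for html in pages:
    foundPics.extend(pics.findall(html))  # reuse the same compiled regex
print(foundPics)  # ['images/a.jpeg', 'images/b.jpeg']
```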

How is countPics being called? Are you accidentally calling it multiple
times? This would explain why the list of found pics goes back to zero
(since you reset it at the start of the function).

-- Paul
 
Brian

Thank you for your insight. It appears that using .extend rather than
.append solved the problem.

Brian
 
Tim Chase

pics = re.compile(r"images/.*\.jpeg")

While I'm not sure if this is the issue, you might be having some
trouble with the greediness of the "*" repeater here. HTML like

<img src="images/1.jpeg"><img src="hello.jpeg">

will yield a result of

"images/1.jpeg"><img src="hello.jpeg"

rather than the expected

"images/1.jpeg"

You can make it "stingy" (rather than greedy) by appending a
question-mark:

r"images/.*?\.jpeg"

I also don't know if they all are coming back as "jpeg", or if
some come back as "jpg", in which case you might want to use

r"images/.*?\.jpe?g"

This still might bork up on things like

<img src="images/a.gif"><img src="2.jpeg">
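You can see the greedy/stingy difference directly (the HTML here is made up for the demo):

```python
import re

html = '<img src="images/1.jpeg"><img src="hello.jpeg">'
greedy = re.findall(r"images/.*\.jpeg", html)   # .* runs as far as it can
stingy = re.findall(r"images/.*?\.jpeg", html)  # .*? stops at the first .jpeg
print(greedy)  # ['images/1.jpeg"><img src="hello.jpeg']
print(stingy)  # ['images/1.jpeg']
```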

My first thought would be to install the BeautifulSoup parser,
and then use it to snag all the <img> tags in your document.
Then you know you're just getting the tag, and in turn, just
getting their associated "src" attribute. I do something like
that in my comic-snatcher (scrapes comics from various sites so I
can read them all in one place in one sitting). You're welcome
to remash this code excerpt (there's no guarantee it's great code):

req = urllib2.Request(url)
req.add_header("Referer", referer)
page = urllib2.urlopen(req)
bs = BeautifulSoup.BeautifulSoup()
map(bs.feed, page.readlines())
bs.done()
r = re.compile(targetRegex)
imageURLs = [img["src"] for img in bs.fetch("img")]
targetImageURL = [url for url in imageURLs if r.match(url)]

It does blithely assume every image has a "src" attribute as it
should, but if not, you can put in an "if" clause in the
assignment of imageURLs to only take those that have src attributes.

As others have mentioned as well, once you successfully get back
the list of images, you'll likely want to *extend()* your master
list of image URLs with your list of currently-found-URLs, rather
than *append()*, or otherwise you'll end up with a list of lists
which may not be what you want.

Just a few ideas you might want to try.

-tkc
 
Brian

Tim said:
While I'm not sure if this is the issue, you might be having some
trouble with the greediness of the "*" repeater here. HTML like

<img src="images/1.jpeg"><img src="hello.jpeg">

will yield a result of

"images/1.jpeg"><img src="hello.jpeg"

rather than the expected

"images/1.jpeg"

You can make it "stingy" (rather than greedy) by appending a
question-mark:

r"images/.*?\.jpeg"

I also don't know if they all are coming back as "jpeg", or if
some come back as "jpg", in which case you might want to use

r"images/.*?\.jpe?g"

Thanks Tim! That modification to the regex helped a lot, and believe
it or not, my pic count went up!

Thank you,
Brian
 
John Machin

First off, I am sorry for cluttering this group with my inept
questions, but I am stuck again despite a few hours of hair pulling.

I have a function (below) that takes a list of html pages that have
images on them (not porn but boats). This function then (supposedly)
goes through and extracts the links to those images and puts them into
a list, appending with each iteration of the for loop. The list of
html pages is 82 items long and each page has multiple image links.
When the function gets to item 77 or so, the list gets all funky.
Sometimes it goes empty,

The list (not a tuple!!) found by findall is empty or smaller than
expected when the webmaster has used .jpg instead of .jpeg. Pages 27,
77, and 79-82 at the moment all have .jpg, as you would have found out
had you inspected the actual data you are operating on instead of
guessing. The print statement is your friend; use it. Your browser's
"view source" functionality (Ctrl-U in Firefox) is also handy.

However if you mean that your foundPics list becomes empty, then either
you haven't posted the code that you actually used, or the pixies from
the bottom of the garden have been rearranging it for you :)

and others it is a much more abbreviated list
than I expect - it should have roughly 750 image links.

When I looked at it while running, it appears as if my regex is
actually appending a tuple (I think) of the results it finds to the
list.

No, read the manual. findall returns a list. *You* are appending that
list to your list.
My best guess is that the list is getting too big and croaks.

Very unlikely. In any case you would have seen evidence, like an
exception and a traceback ... or maybe just your swap disk going into
overdrive :)
Since one of the objects of the function is also to be able to count
the items in the list, I am getting some strange errors there as well.

And what were the strange errors that you perceived?
Here is the code:
[snip]
Here is mine:

import re, urllib

def countPics():
    foundPics = []
    links_count = 0
    pics_count = 0
    pics = re.compile(r"images/.*\.jpeg")
    # for better results, change jpeg to jpe?g
    for link in ["cetaceaPage%02d.html" % x for x in range(1, 83)]:
        picPage = urllib.urlopen("http://continuouswave.com/whaler/cetacea/" + link)
        links_count += 1
        html = picPage.read()
        picPage.close()
        findall_result = pics.findall(html)
        pics_count += len(findall_result)
        print links_count, pics_count, link, findall_result
        foundPics.append(findall_result)
    print("done")

countPics()

You may wish to change that append to extend, but then you will lose
track of which pictures are on which page, if that matters to you.
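For example, keeping the append version, a nested list still lets you count both pages and pictures (made-up data here):

```python
# hypothetical foundPics after appending per-page findall results
foundPics = [["images/a.jpeg"], ["images/b.jpeg", "images/c.jpeg"]]

total_pics = sum(len(page_pics) for page_pics in foundPics)
print(len(foundPics))  # 2 -- pages processed
print(total_pics)      # 3 -- pictures found
```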

HTH,
John
 
