why does this call to re.findall() loop forever?

james.kirin40

Hi everyone,

I am using Python's re module to extract some data from HTML. The
following code never returns, and I was wondering if someone could
explain why. Is this a problem with my regexp? (I tried really hard
to find one.)

The string contains three records (list items in an HTML page). Notice
that NONE of them matches the regexp: these records do not contain the
"title" attribute which the regexp expects inside '<span
class="date">'.

The weird thing is that removing any of the three records makes
findall() immediately return an empty list, while if I pass all three
records to findall() it never returns. Why does this happen?

This is using Python 2.6.

Thanks so much for any help

-james

s="""<li class="post" key="4994199a0b80136cb3174e9e875c545e">
<h4 class="desc"><a href="http://www.sluggy.com/"
rel="nofollow">Sluggy Freelance</a>
</h4>
<div class="commands"> &nbsp;<a save href="/post?url=http%3A%2F
%2Fwww.sluggy.com%2F&amp;title=Sluggy
%20Freelance&amp;copyuser=crowebert&amp;copytags=imported%2BRSS
%2BComics%2Bhumor%2Bdaily%2Bwebcomics&amp;jump=no&amp;partner=del"
class="copy" rel="nofollow">save this</a></div> <div class="meta">to
<a class="tag" href="/crowebert/imported">imported</a> <a class="tag"
href="/crowebert/RSS">RSS</a> <a class="tag" href="/crowebert/
Comics">Comics</a> <a class="tag" href="/crowebert/humor">humor</a> <a
class="tag" href="/crowebert/daily">daily</a> <a class="tag" href="/
crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
ac655d3fe17873b31abeb29a1043e439" style="padding: 0 0.2em; background-
color: rgb(100%, 66%, 66%);">saved by 983 other people</a> <span
class="date">1945-07-18</span> </div>
</li>

<li class="post" key="65d66f4197fc7eba5c214fe85ed77725">
<h4 class="desc"><a href="http://www.snackbar-games.com/
gbacovers.php" rel="nofollow">Snackbar-Games.com :: GBA DS Cover
Project</a>
</h4>
<div class="commands"> &nbsp;<a save href="/post?url=http%3A%2F
%2Fwww.snackbar-games.com%2Fgbacovers.php&amp;title=Snackbar-Games.com
%20%3A%3A%20GBA%20DS%20Cover
%20Project&amp;copyuser=crowebert&amp;copytags=imported%2BBookmarkMenu
%2BGameStuff%2Bart%2BGBA%2Bgames
%2Bnintendo&amp;jump=no&amp;partner=del" class="copy"
rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a> <a class="tag" href="/
crowebert/BookmarkMenu">BookmarkMenu</a> <a class="tag" href="/
crowebert/GameStuff">GameStuff</a> <a class="tag" href="/crowebert/
art">art</a> <a class="tag" href="/crowebert/GBA">GBA</a> <a
class="tag" href="/crowebert/games">games</a> <a class="tag" href="/
crowebert/nintendo">nintendo</a> ... <a class="pop" href="/url/
a65a4a0ebe813ec6e9c881331e3f9583" style="padding: 0 0.2em; background-
color: rgb(100%, 84%, 84%);">saved by 26 other people</a> <span
class="date">1948-12-31</span> </div>
</li>

<li class="post" key="690ace1f465ae419dee8145ad3871024">
<h4 class="desc"><a href="http://www.megatokyo.com/"
rel="nofollow">MegaTokyo</a>
</h4>
<div class="commands"> &nbsp;<a save href="/post?url=http%3A%2F
%2Fwww.megatokyo.com
%2F&amp;title=MegaTokyo&amp;copyuser=crowebert&amp;copytags=imported
%2BBookmarkBar%2BWeekendComics%2Bcomics%2Bmanga%2Bhumor
%2Bwebcomics&amp;jump=no&amp;partner=del" class="copy"
rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a> <a class="tag" href="/
crowebert/BookmarkBar">BookmarkBar</a> <a class="tag" href="/crowebert/
WeekendComics">WeekendComics</a> <a class="tag" href="/crowebert/
comics">comics</a> <a class="tag" href="/crowebert/manga">manga</a> <a
class="tag" href="/crowebert/humor">humor</a> <a class="tag" href="/
crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
94843244f0c6d80f1c6806ed5c0abec7" style="padding: 0 0.2em; background-
color: rgb(100%, 60%, 60%);">saved by 2784 other people</a> <span
class="date">1946-01-28</span> </div>
</li>"""

regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"><a href=
\"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
\">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?> )
+))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
li>", re.DOTALL)

re.findall(regexp, s)
 
james.kirin40

My apologies: Google Groups mangled the formatting. The regexp should
read:

regexp = re.compile(
    r'<li class="post".*?<h4 class="desc"><a href="(.*?)" rel="nofollow">(.*?)</a>'
    r'.*?</div>\s*(?:<p class="notes">(.*?)</p>)?.*?<div class="meta">'
    r'(?:to ((?:<a class="tag".*?> )+))*.*?<span class="date" title="(.*?)">'
    r'.*?</span>\s*</div>.*?</li>',
    re.DOTALL)
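
One plausible explanation for the hang, offered as a sketch rather than a diagnosis of this exact pattern, is catastrophic backtracking. The regexp nests a repeated group inside another repeated group, (?:to ((?:<a class="tag".*?> )+))*, and surrounds it with several lazy .*? runs under re.DOTALL. When the overall match must fail (here, because no record has a title attribute), a backtracking engine may try a very large number of ways to divide the text among those quantifiers before giving up, and each extra record multiplies the work. The toy pattern (a+)+$ below stands in for the real one; the names count_splits and time_failed_match are made up for this illustration, and the code assumes Python 3.

```python
import re
import time

# A deliberately tiny pattern with the same shape of problem:
# a quantified group that is itself quantified.
NESTED = re.compile(r"(a+)+$")


def count_splits(n):
    """Count the ways to cut a run of n characters into non-empty groups.

    This is the number of ways (a+)+ can divide 'a' * n, and it is
    what a backtracking engine may have to explore before failing.
    """
    if n == 0:
        return 1
    return sum(count_splits(n - k) for k in range(1, n + 1))


def time_failed_match(n):
    """Time a match attempt that is guaranteed to fail."""
    text = "a" * n + "!"          # the trailing '!' prevents '$' from matching
    start = time.perf_counter()
    result = NESTED.match(text)
    elapsed = time.perf_counter() - start
    assert result is None         # the pattern can never match this text
    return elapsed


if __name__ == "__main__":
    for n in range(10, 21, 2):
        print(n, count_splits(n), "%.4f s" % time_failed_match(n))
```

The number of splits doubles with every extra character, which would be consistent with the observation that two records fail quickly while three appear never to return.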
 
Terry Reedy

james.kirin40 wrote:
> Hi everyone,
>
> I am using Python's re module to extract some data from html. The
> following code never returns, and I was wondering if someone can
> explain to me why. Is this a problem with my regexp (I tried really
> hard to find it?)?

[snip html/xml string]

> regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"><a href=
> \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
> \">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?> )
> +))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
> li>", re.DOTALL)
>
> re.findall(regexp, s)

Python has several modules for parsing and working with XML. Do you
not know of them, or is there some reason they won't work?
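
As a rough illustration of the parser-based approach suggested above, here is a minimal sketch using the standard library's html.parser. It assumes Python 3 (in Python 2.6 the module is named HTMLParser), and the names PostParser, extract_posts and SAMPLE are invented for this example. SAMPLE is a shortened record modeled on the markup in the original post, not the full string.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect url, title, tags and date from each <li class="post">."""

    def __init__(self):
        super().__init__()
        self.posts = []       # finished records
        self.current = None   # record being built, or None outside a post
        self.in_desc = False  # True while inside <h4 class="desc">
        self.capture = None   # field that the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css = attrs.get("class")
        if tag == "li" and css == "post":
            self.current = {"url": None, "title": "", "tags": [], "date": ""}
        elif self.current is None:
            return
        elif tag == "h4" and css == "desc":
            self.in_desc = True
        elif tag == "a" and self.in_desc:
            self.current["url"] = attrs.get("href")
            self.capture = "title"
        elif tag == "a" and css == "tag":
            self.capture = "tag"
        elif tag == "span" and css == "date":
            self.capture = "date"

    def handle_data(self, data):
        if self.current is None or self.capture is None:
            return
        if self.capture == "tag":
            self.current["tags"].append(data.strip())
        else:
            self.current[self.capture] += data

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self.capture = None
        elif tag == "h4":
            self.in_desc = False
        elif tag == "li" and self.current is not None:
            self.posts.append(self.current)
            self.current = None


def extract_posts(html):
    """Parse html and return a list of dicts, one per post."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return parser.posts


SAMPLE = """<li class="post" key="k1">
<h4 class="desc"><a href="http://www.sluggy.com/" rel="nofollow">Sluggy Freelance</a>
</h4>
<div class="commands"> <a save href="/post?url=x" class="copy">save this</a></div>
<div class="meta">to <a class="tag" href="/crowebert/RSS">RSS</a>
<a class="tag" href="/crowebert/humor">humor</a>
<span class="date">1945-07-18</span> </div>
</li>"""

if __name__ == "__main__":
    for post in extract_posts(SAMPLE):
        print(post)
```

Because the parser visits each tag once, its running time grows with the size of the input, and a record with a missing attribute simply leaves a field empty instead of forcing a failed match.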
 
