Why does this call to re.findall() loop forever?

james.kirin40 · Nov 9, 2008

Hi everyone,

I am using Python's re module to extract some data from HTML. The
following code never returns, and I was wondering if someone could
explain why. Is this a problem with my regexp? (I have tried really
hard to find one.)

The string contains three records (list items in an HTML page). Notice
that NONE of them matches the regexp: these records do not contain the
"title" attribute which the regexp expects inside '<span
class="date">'.

The weird thing is that removing any of the three records makes
findall() immediately return an empty list, while if I pass all three
records to findall() it never returns. Why does this happen?

This is using Python 2.6.

Thanks so much for any help.

-james

import re

s = """<li class="post" key="4994199a0b80136cb3174e9e875c545e">
<h4 class="desc"><a href="http://www.sluggy.com/"
rel="nofollow">Sluggy Freelance</a>
</h4>
<div class="commands">  <a save href="/post?url=http%3A%2F
%2Fwww.sluggy.com%2F&title=Sluggy
%20Freelance&copyuser=crowebert&copytags=imported%2BRSS
%2BComics%2Bhumor%2Bdaily%2Bwebcomics&jump=no&partner=del"
class="copy" rel="nofollow">save this</a></div> <div class="meta">to
<a class="tag" href="/crowebert/imported">imported</a> <a class="tag"
href="/crowebert/RSS">RSS</a> <a class="tag" href="/crowebert/
Comics">Comics</a> <a class="tag" href="/crowebert/humor">humor</a> <a
class="tag" href="/crowebert/daily">daily</a> <a class="tag" href="/
crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
ac655d3fe17873b31abeb29a1043e439" style="padding: 0 0.2em; background-
color: rgb(100%, 66%, 66%);">saved by 983 other people</a> <span
class="date">1945-07-18</span> </div>
</li>

<li class="post" key="65d66f4197fc7eba5c214fe85ed77725">
<h4 class="desc"><a href="http://www.snackbar-games.com/
gbacovers.php" rel="nofollow">Snackbar-Games.com :: GBA DS Cover
Project</a>
</h4>
<div class="commands">  <a save href="/post?url=http%3A%2F
%2Fwww.snackbar-games.com%2Fgbacovers.php&title=Snackbar-Games.com
%20%3A%3A%20GBA%20DS%20Cover
%20Project&copyuser=crowebert&copytags=imported%2BBookmarkMenu
%2BGameStuff%2Bart%2BGBA%2Bgames
%2Bnintendo&jump=no&partner=del" class="copy"
rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a> <a class="tag" href="/
crowebert/BookmarkMenu">BookmarkMenu</a> <a class="tag" href="/
crowebert/GameStuff">GameStuff</a> <a class="tag" href="/crowebert/
art">art</a> <a class="tag" href="/crowebert/GBA">GBA</a> <a
class="tag" href="/crowebert/games">games</a> <a class="tag" href="/
crowebert/nintendo">nintendo</a> ... <a class="pop" href="/url/
a65a4a0ebe813ec6e9c881331e3f9583" style="padding: 0 0.2em; background-
color: rgb(100%, 84%, 84%);">saved by 26 other people</a> <span
class="date">1948-12-31</span> </div>
</li>

<li class="post" key="690ace1f465ae419dee8145ad3871024">
<h4 class="desc"><a href="http://www.megatokyo.com/"
rel="nofollow">MegaTokyo</a>
</h4>
<div class="commands">  <a save href="/post?url=http%3A%2F
%2Fwww.megatokyo.com
%2F&title=MegaTokyo&copyuser=crowebert&copytags=imported
%2BBookmarkBar%2BWeekendComics%2Bcomics%2Bmanga%2Bhumor
%2Bwebcomics&jump=no&partner=del" class="copy"
rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a> <a class="tag" href="/
crowebert/BookmarkBar">BookmarkBar</a> <a class="tag" href="/crowebert/
WeekendComics">WeekendComics</a> <a class="tag" href="/crowebert/
comics">comics</a> <a class="tag" href="/crowebert/manga">manga</a> <a
class="tag" href="/crowebert/humor">humor</a> <a class="tag" href="/
crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
94843244f0c6d80f1c6806ed5c0abec7" style="padding: 0 0.2em; background-
color: rgb(100%, 60%, 60%);">saved by 2784 other people</a> <span
class="date">1946-01-28</span> </div>
</li>"""

regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"><a href=
\"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
\">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?> )
+))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
li>", re.DOTALL)

re.findall(regexp, s)

james.kirin40 · Nov 9, 2008

My apologies: Google Groups messed up the formatting. The regexp
should read:

regexp = re.compile(
    r'<li class="post".*?<h4 class="desc"><a href="(.*?)" rel="nofollow">(.*?)</a>'
    r'.*?</div>\s*(?:<p class="notes">(.*?)</p>)?'
    r'.*?<div class="meta">(?:to ((?:<a class="tag".*?> )+))*'
    r'.*?<span class="date" title="(.*?)">.*?</span>\s*</div>.*?</li>',
    re.DOTALL)
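
One plausible explanation, not confirmed anywhere in this thread, is catastrophic backtracking. The pattern chains several `.*?` runs under re.DOTALL and nests a `+` group inside a `*` group, so when no match exists the engine may have to try a very large number of ways to split the input before giving up. That would fit the observation that two records fail instantly while three never return. The sketch below does not use the pattern from the question; it uses the classic toy pattern `(a+)+b`, chosen here only for illustration, to show how quickly a failing match can slow down as the input grows. The helper name `time_failed_match` is made up for this example.

```python
import re
import time

# Toy pattern with one quantifier nested inside another (illustration only).
nested = re.compile(r"(a+)+b")


def time_failed_match(n):
    """Return the seconds spent failing to match n 'a' characters."""
    text = "a" * n                      # no 'b', so the match cannot succeed
    start = time.perf_counter()
    result = nested.match(text)
    elapsed = time.perf_counter() - start
    assert result is None               # the engine gave up after backtracking
    return elapsed


# The time to fail typically grows steeply with each extra character.
for n in (10, 14, 18):
    print(n, round(time_failed_match(n), 4))
```

If the timings climb sharply as n increases, the same effect on a much longer input would look like a call that never returns.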

Terry Reedy · Nov 9, 2008

james.kirin40 wrote:
> Hi everyone,
>
> I am using Python's re module to extract some data from html. The
> following code never returns, and I was wondering if someone can
> explain to me why. Is this a problem with my regexp (I tried really
> hard to find it?)?

[snip html/xml string]

> regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"><a href=
> \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
> \">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?> )
> +))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
> li>", re.DOTALL)
>
> re.findall(regexp, s)

Python has several modules for parsing and working with XML. Do you
not know of them, or is there some reason they won't work?
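
A minimal sketch of the parser-based approach suggested here, assuming Python 3's standard `html.parser` module (in Python 2.6 the same class lives in the `HTMLParser` module). The `PostParser` class, the `extract_posts` helper, and the shortened `sample` record are hypothetical names made up for illustration; the markup is cut down from the first record in the question. A lenient HTML parser is used rather than an XML one because the posted markup is not well-formed XML (note the bare `save` attribute and the unescaped `&` characters).

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect href, title and date for each <li class="post"> record."""

    def __init__(self):
        super().__init__()
        self.posts = []      # finished records
        self._post = None    # record being built, or None outside a post
        self._field = None   # text field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "post":
            self._post = {"href": None, "title": "", "date": ""}
        elif self._post is None:
            return
        elif (tag == "a" and attrs.get("rel") == "nofollow"
              and self._post["href"] is None):
            # The first rel="nofollow" link in a record is the bookmark itself.
            self._post["href"] = attrs.get("href")
            self._field = "title"
        elif tag == "span" and attrs.get("class") == "date":
            self._field = "date"

    def handle_data(self, data):
        if self._post is not None and self._field:
            self._post[self._field] += data

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None
        elif tag == "li" and self._post is not None:
            self.posts.append(self._post)
            self._post = None


def extract_posts(html_text):
    """Parse html_text and return a list of post dictionaries."""
    parser = PostParser()
    parser.feed(html_text)
    parser.close()
    return parser.posts


# Shortened, hypothetical version of the first record from the question.
sample = """<li class="post" key="k1">
<h4 class="desc"><a href="http://www.sluggy.com/" rel="nofollow">Sluggy Freelance</a>
</h4>
<div class="commands"><a save href="/post?url=x" class="copy" rel="nofollow">save this</a></div>
<div class="meta">to <a class="tag" href="/crowebert/RSS">RSS</a>
<span class="date">1945-07-18</span> </div>
</li>"""

posts = extract_posts(sample)
print(posts)
```

Because the parser walks the input once from start to end, its running time does not depend on whether a record happens to contain every field, so a missing attribute simply yields an empty value instead of a long search.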
