why does this call to re.findall() loop forever?

Discussion in 'Python' started by james.kirin40@gmail.com, Nov 9, 2008.

  1. Guest

    Hi everyone,

    I am using Python's re module to extract some data from HTML. The
    following code never returns, and I was wondering if someone can
    explain to me why. Is this a problem with my regexp? (I tried really
    hard to find one.)

    The string contains three records (list items in an HTML page). Notice
    that NONE of them matches the regexp: these records do not contain the
    "title" element which the regexp expects inside '<span
    class="date">'.

    The weird thing is that removing any of the three records makes
    findall() immediately return an empty list, while if I pass all three
    records to findall() it never returns. Why does this happen?

    This is using Python 2.6.

    Thanks so much for any help

    -james

    s="""<li class="post" key="4994199a0b80136cb3174e9e875c545e">
    <h4 class="desc"><a href="http://www.sluggy.com/"
    rel="nofollow">Sluggy Freelance</a>
    </h4>
    <div class="commands"> &nbsp;<a save href="/post?url=http%3A%2F
    %2Fwww.sluggy.com%2F&amp;title=Sluggy
    %20Freelance&amp;copyuser=crowebert&amp;copytags=imported%2BRSS
    %2BComics%2Bhumor%2Bdaily%2Bwebcomics&amp;jump=no&amp;partner=del"
    class="copy" rel="nofollow">save this</a></div> <div class="meta">to
    <a class="tag" href="/crowebert/imported">imported</a> <a class="tag"
    href="/crowebert/RSS">RSS</a> <a class="tag" href="/crowebert/
    Comics">Comics</a> <a class="tag" href="/crowebert/humor">humor</a> <a
    class="tag" href="/crowebert/daily">daily</a> <a class="tag" href="/
    crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
    ac655d3fe17873b31abeb29a1043e439" style="padding: 0 0.2em; background-
    color: rgb(100%, 66%, 66%);">saved by 983 other people</a> <span
    class="date">1945-07-18</span> </div>
    </li>

    <li class="post" key="65d66f4197fc7eba5c214fe85ed77725">
    <h4 class="desc"><a href="http://www.snackbar-games.com/
    gbacovers.php" rel="nofollow">Snackbar-Games.com :: GBA DS Cover
    Project</a>
    </h4>
    <div class="commands"> &nbsp;<a save href="/post?url=http%3A%2F
    %2Fwww.snackbar-games.com%2Fgbacovers.php&amp;title=Snackbar-Games.com
    %20%3A%3A%20GBA%20DS%20Cover
    %20Project&amp;copyuser=crowebert&amp;copytags=imported%2BBookmarkMenu
    %2BGameStuff%2Bart%2BGBA%2Bgames
    %2Bnintendo&amp;jump=no&amp;partner=del" class="copy"
    rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
    href="/crowebert/imported">imported</a> <a class="tag" href="/
    crowebert/BookmarkMenu">BookmarkMenu</a> <a class="tag" href="/
    crowebert/GameStuff">GameStuff</a> <a class="tag" href="/crowebert/
    art">art</a> <a class="tag" href="/crowebert/GBA">GBA</a> <a
    class="tag" href="/crowebert/games">games</a> <a class="tag" href="/
    crowebert/nintendo">nintendo</a> ... <a class="pop" href="/url/
    a65a4a0ebe813ec6e9c881331e3f9583" style="padding: 0 0.2em; background-
    color: rgb(100%, 84%, 84%);">saved by 26 other people</a> <span
    class="date">1948-12-31</span> </div>
    </li>

    <li class="post" key="690ace1f465ae419dee8145ad3871024">
    <h4 class="desc"><a href="http://www.megatokyo.com/"
    rel="nofollow">MegaTokyo</a>
    </h4>
    <div class="commands"> &nbsp;<a save href="/post?url=http%3A%2F
    %2Fwww.megatokyo.com
    %2F&amp;title=MegaTokyo&amp;copyuser=crowebert&amp;copytags=imported
    %2BBookmarkBar%2BWeekendComics%2Bcomics%2Bmanga%2Bhumor
    %2Bwebcomics&amp;jump=no&amp;partner=del" class="copy"
    rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
    href="/crowebert/imported">imported</a> <a class="tag" href="/
    crowebert/BookmarkBar">BookmarkBar</a> <a class="tag" href="/crowebert/
    WeekendComics">WeekendComics</a> <a class="tag" href="/crowebert/
    comics">comics</a> <a class="tag" href="/crowebert/manga">manga</a> <a
    class="tag" href="/crowebert/humor">humor</a> <a class="tag" href="/
    crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
    94843244f0c6d80f1c6806ed5c0abec7" style="padding: 0 0.2em; background-
    color: rgb(100%, 60%, 60%);">saved by 2784 other people</a> <span
    class="date">1946-01-28</span> </div>
    </li>"""

    regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"><a href=
    \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
    \">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?> )
    +))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
    li>", re.DOTALL)

    re.findall(regexp, s)
     
    , Nov 9, 2008
    #1

  2. Guest

    My apologies: Google Groups messed up the formatting. The regexp
    should read:

    regexp = re.compile(
        r'<li class="post".*?<h4 class="desc"><a href="(.*?)" rel="nofollow">(.*?)</a>'
        r'.*?</div>\s*(?:<p class="notes">(.*?)</p>)?.*?<div class="meta">'
        r'(?:to ((?:<a class="tag".*?> )+))*.*?'
        r'<span class="date" title="(.*?)">.*?</span>\s*</div>.*?</li>',
        re.DOTALL)
     
    , Nov 9, 2008
    #2
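
    The thread does not confirm a cause, but one likely explanation is
    heavy backtracking rather than a true infinite loop. Python's re
    module uses a backtracking matcher, and the regexp above contains a
    repeated group nested inside another repetition, surrounded by
    several lazy .*? pieces that re.DOTALL allows to span whole records.
    When no match exists, the engine may try an enormous number of ways
    to split the text before giving up. The toy pattern below is an
    assumption chosen for illustration, not the original regexp, and
    time_failed_match is a hypothetical helper name:

    ```python
    import re
    import time

    # Toy pattern: a quantified group inside another quantifier. This is
    # an illustrative stand-in, not the regexp from the post.
    NESTED = re.compile(r"(a+)+b")


    def time_failed_match(n):
        """Return the seconds spent failing to match 'a' * n (no 'b' present)."""
        text = "a" * n
        start = time.perf_counter()
        result = NESTED.match(text)
        elapsed = time.perf_counter() - start
        assert result is None  # the pattern can never match this input
        return elapsed


    if __name__ == "__main__":
        # Keep n small: the time to fail tends to grow very quickly with n.
        for n in (10, 14, 18):
            print(n, "chars:", round(time_failed_match(n), 4), "s")
    ```

    If the same effect is at work in the post, it would fit the
    observation that two records fail quickly while three appear never
    to return: each extra record multiplies the number of combinations
    the engine has to rule out.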

  3. Terry Reedy Guest

    wrote:
    > Hi everyone,
    >
    > I am using Python's re module to extract some data from html. The
    > following code never returns, and I was wondering if someone can
    > explain to me why. Is this a problem with my regexp? (I tried really
    > hard to find one.)

    [snip] html/xml string
    > regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"><a href=
    > \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
    > \">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?> )
    > +))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
    > li>", re.DOTALL)
    >
    > re.findall(regexp, s)


    Python has several modules for parsing and working with XML. Do you
    not know of them, or is there some reason they won't work?
     
    Terry Reedy, Nov 9, 2008
    #3
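
    A minimal sketch of what the parser-based suggestion might look
    like, assuming Python 3 and the standard library's html.parser
    module (the thread itself uses Python 2.6, where the equivalent
    module is named HTMLParser). The class name PostParser, the helper
    parse_posts, and the dictionary keys are hypothetical names chosen
    for this example; the tag and class names come from the HTML in the
    first post.

    ```python
    from html.parser import HTMLParser


    class PostParser(HTMLParser):
        """Collect one dict per <li class="post"> element."""

        def __init__(self):
            super().__init__()
            self.posts = []
            self._post = None      # record being built, or None outside a post
            self._field = None     # text field currently being filled, if any
            self._in_desc = False  # True while inside <h4 class="desc">

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            cls = attrs.get("class")
            if tag == "li" and cls == "post":
                self._post = {"url": None, "title": "", "tags": [],
                              "date": "", "date_title": None}
            elif self._post is None:
                return
            elif tag == "h4" and cls == "desc":
                self._in_desc = True
            elif tag == "a" and self._in_desc:
                self._post["url"] = attrs.get("href")
                self._field = "title"
            elif tag == "a" and cls == "tag":
                self._post["tags"].append("")
                self._field = "tag"
            elif tag == "span" and cls == "date":
                # The title attribute is optional; it stays None if absent.
                self._post["date_title"] = attrs.get("title")
                self._field = "date"

        def handle_data(self, data):
            if self._post is None or self._field is None:
                return
            if self._field == "tag":
                self._post["tags"][-1] += data
            else:
                self._post[self._field] += data

        def handle_endtag(self, tag):
            if self._post is None:
                return
            if tag in ("a", "span"):
                self._field = None
            elif tag == "h4":
                self._in_desc = False
            elif tag == "li":
                self.posts.append(self._post)
                self._post = None


    def parse_posts(html):
        """Return a list of dicts, one per post found in the HTML text."""
        parser = PostParser()
        parser.feed(html)
        parser.close()
        return parser.posts
    ```

    Calling parse_posts(s) on the string from the first post should
    yield three records. A record whose date span has no title
    attribute simply gets date_title set to None, so a missing
    attribute costs nothing extra instead of triggering a long failed
    search.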
