Regexp kicking my ass

Discussion in 'Perl Misc' started by Tuc, Jan 27, 2005.

  1. Tuc

    Tuc Guest

    Hi,

    I'm trying to get a regexp to make a match, and it's not working,
    and it's kicking my ass. The text I'm matching against is:

    $text='<div id="sr_SearchResultsPageNavTop"> <div
    id="sr_SaveSearchImage"><img
    src="http://images.match.com/match//search/sr_NavIconPlaceHolder.gif"
    width="15
    " height="12" alt="" border="0"></div> <div
    id="sr_ViewPhotoGalleryText"><a
    href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0
    &RN=2102522&lid=7&PN=1&DO=2" class="cssGlobalLinks_PageNav"
    id="lnkSaveThisSearch">viewas photo gallery</a></div> <div
    id="sr_Pagination"><span
    class="cssGlobalSysText_LightGray">page&nbsp;</span><a
    href="some.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=1&DO=0"
    class="cssSr_PaginationCurrentPage" id="lnkPage">1</a><a
    href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=2&DO=0"class="cssSr_PageNav"
    id="lnkPage">2</a><a
    href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=3&DO=0"
    class="cssSr_PageNav" id="lnkPage">';

    What I'm looking for is the URL between the href and
    cssSr_PaginationCurrentPage. When I try it, the match ends up starting at
    the first href and running all the way to
    cssSr_PaginationCurrentPage. I've tried \b, I've tried {}, I've tried
    ()'s... and I just can't get it to return the one URL:
    some.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=1&DO=0

    How do I tell it to start at cssSr_PaginationCurrentPage
    and work backwards to the first href=" it meets?


    Thanks, Tuc
    Tuc, Jan 27, 2005
    #1

  2. Tuc wrote:
    > I'm trying to get a regexp to make a match, and it's not working,
    > and it's kicking my ass. The text I'm matching against is:
    >
    > $text='<div id="sr_SearchResultsPageNavTop"> <div
    > id="sr_SaveSearchImage"><img
    > src="http://images.match.com/match//search/sr_NavIconPlaceHolder.gif"
    > width="15
    > " height="12" alt="" border="0"></div> <div
    > id="sr_ViewPhotoGalleryText"><a
    > href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0
    > &RN=2102522&lid=7&PN=1&DO=2" class="cssGlobalLinks_PageNav"
    > id="lnkSaveThisSearch">viewas photo gallery</a></div> <div
    > id="sr_Pagination"><span
    > class="cssGlobalSysText_LightGray">page&nbsp;</span><a
    > href="some.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=1&DO=0"
    > class="cssSr_PaginationCurrentPage" id="lnkPage">1</a><a
    > href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=2&DO=0"class="cssSr_PageNav"
    > id="lnkPage">2</a><a
    > href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=3&DO=0"
    > class="cssSr_PageNav" id="lnkPage">';
    >
    > What I'm looking for is the URL between the href and
    > cssSr_PaginationCurrentPage.


    This may or may not work:

    if ( $text =~ /<a\s+href\s*=\s*
                   (?:(?:(["'])(\S+)\1)|(\S+))
                   [^>]*class\s*=\s*(?:["'])?cssSr_PaginationCurrentPage/x ) {
        print $+;
    }
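
    A note on the print $+ line: the URL lands in $2 when the href value is quoted and in $3 when it is not, and $+ holds whichever was the highest-numbered group that actually matched. A small sketch of that behaviour follows; the two sample tags and the names @seen, $last and $explicit are made up for illustration.

```perl
use strict;
use warnings;

# Same href alternation as above, reduced to the part that captures.
my $re = qr/href\s*=\s*(?:(["'])(\S+)\1|(\S+))/;

my @seen;
for my $tag ( '<a href="quoted.aspx" class="x">', '<a href=bare.aspx class=x>' ) {
    next unless $tag =~ $re;
    my $last     = $+;                      # highest-numbered group that matched
    my $explicit = defined $2 ? $2 : $3;    # the same choice, spelled out
    push @seen, [ $last, $explicit ];
    print "last = $last, explicit = $explicit\n";
}
```

    Either spelling should give the same string here; $+ is simply the shorter one.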

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Jan 27, 2005
    #2

  3. On Wed, 26 Jan 2005, Tuc wrote:

    > Hi,
    >
    > I'm trying to get a regexp to make a match, and it's not working,
    > and it's kicking my ass. The text I'm matching against is:
    >

    <snip>
    >
    > What I'm looking for is the URL between the href and
    > cssSr_PaginationCurrentPage. When I try it, the match ends up starting at
    > the first href and running all the way to
    > cssSr_PaginationCurrentPage. I've tried \b, I've tried {}, I've tried
    > ()'s... and I just can't get it to return the one URL:
    > some.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=1&DO=0
    >
    > How do I tell it to start at cssSr_PaginationCurrentPage
    > and work backwards to the first href=" it meets?
    >


    perhaps you need to 'divide and conquer': match broadly first,
    then trim the capture down to the url.

    this works for me.

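    use strict;
    use warnings;

    # first pass: capture from the first href=" up to the target class
    if ( $text =~ /href="(.*?)class="cssSr_PaginationCurrentPage/s )
    {
        my $url = $1;
        # second pass: greedy .* drops everything up to the last href="
        $url =~ s/^.*href="//s;
        # strip the closing quote plus any trailing whitespace or newline
        $url =~ s/"\s*$//s;
        print STDOUT "url == ``". $url . "''\n";
    }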
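
    The first pass captures too much because the regex engine takes the leftmost possible match: .*? only limits how far the match runs to the right, it does not move the starting point off the first href=". A minimal single-pass sketch uses a negated character class instead, assuming the URL is double-quoted and the class attribute follows it directly. The $text here is a shortened, made-up stand-in for the original.

```perl
use strict;
use warnings;

# Shortened stand-in for the original $text (these URLs are invented).
my $text = '<a href="come.aspx?PN=1&DO=2" class="cssGlobalLinks_PageNav">gallery</a>'
         . '<a href="some.aspx?PN=1&DO=0" class="cssSr_PaginationCurrentPage">1</a>'
         . '<a href="come.aspx?PN=2&DO=0" class="cssSr_PageNav">2</a>';

# Lazy .*? still starts at the FIRST href=" because matching is leftmost-first.
my ($lazy) = $text =~ /href="(.*?)"\s*class="cssSr_PaginationCurrentPage/s;

# [^"]* cannot cross a double quote, so the capture stays inside one attribute.
my ($url) = $text =~ /href="([^"]*)"\s*class="cssSr_PaginationCurrentPage/;

print "lazy: $lazy\n";
print "url:  $url\n";
```

    The lazy capture begins with come.aspx and drags the whole first tag along, while the character-class version can only succeed at the href that sits right before the wanted class.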

    >
    >
    > Thanks, Tuc
    >
    >


    --
    terry l. ridder ><>
    terry l. ridder, Jan 27, 2005
    #3
  4. On Thu, 27 Jan 2005, Gunnar Hjalmarsson wrote:

    >
    > This may or may not work:
    >
    > if ( $text =~ /<a\s+href\s*=\s*
    > (?:(?:(["'])(\S+)\1)|(\S+))
    > [^>]*class\s*=\s*(?:["'])?cssSr_PaginationCurrentPage/x ) {
    > print $+;
    > }
    >
    >


    that works rather well.
    beats my 'divide and conquer' approach.

    --
    terry l. ridder ><>
    terry l. ridder, Jan 27, 2005
    #4
  5. terry l. ridder wrote:
    > On Thu, 27 Jan 2005, Gunnar Hjalmarsson wrote:
    >> This may or may not work:
    >>
    >> if ( $text =~ /<a\s+href\s*=\s*
    >> (?:(?:(["'])(\S+)\1)|(\S+))
    >> [^>]*class\s*=\s*(?:["'])?cssSr_PaginationCurrentPage/x ) {
    >> print $+;
    >> }

    >
    > that works rather well.


    A shorter (and clearer) variant would be:

    if ( $text =~ /href\s*=\s*
                   (?:
                       (?:
                           (["'])(\S+)\1    # quoted URL
                       )
                       |
                       (\S+)                # non-quoted URL
                   )
                   [^>]+cssSr_PaginationCurrentPage/x ) {
        print $+;
    }

    Yeah, it works, provided that

    1) the class attribute actually does come after the href attribute, and

    2) no 'weird' attribute such as

    someattr="x > z"

    has been put in between.

    Which I suppose illustrates Bob's point that it *is* difficult to parse
    HTML with regular expressions...
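
    For what it's worth, the first assumption (class after href) can be loosened without a full parser by pulling out each <a ...> tag first and then looking for the two attributes independently. This is only a sketch: href_for_class is a made-up helper name, the sample $text is invented, it assumes the class attribute holds a single class name, and it still breaks on a '>' inside an attribute value. A real parser module such as HTML::Parser from CPAN would be the sturdier route.

```perl
use strict;
use warnings;

# Hypothetical helper: return the href of the first <a> tag carrying $class.
sub href_for_class {
    my ( $html, $class ) = @_;
    while ( $html =~ /<a\b([^>]*)>/gi ) {
        my $attrs = $1;    # everything between "<a" and ">"
        next unless $attrs =~ /\bclass\s*=\s*["']?\Q$class\E\b/;
        return $1 if $attrs =~ /\bhref\s*=\s*"([^"]*)"/;         # double-quoted
        return $1 if $attrs =~ /\bhref\s*=\s*'([^']*)'/;         # single-quoted
        return $1 if $attrs =~ /\bhref\s*=\s*([^\s"'>]+)/;       # unquoted
    }
    return;    # no matching tag found
}

# Invented sample with class placed BEFORE href.
my $text = '<a class="cssSr_PageNav" href="come.aspx?PN=2">2</a>'
         . '<a class="cssSr_PaginationCurrentPage" href="some.aspx?PN=1">1</a>';

my $found = href_for_class( $text, 'cssSr_PaginationCurrentPage' );
print defined $found ? "$found\n" : "no match\n";
```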

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Jan 27, 2005
    #5