Regexp kicking my ass

Tuc

Hi,

I'm trying to get a regexp to make a match, and it's not working,
and it's kicking my ass. The text I'm going against is:

$text='<div id="sr_SearchResultsPageNavTop"> <div
id="sr_SaveSearchImage"><img
src="http://images.match.com/match//search/sr_NavIconPlaceHolder.gif"
width="15
" height="12" alt="" border="0"></div> <div
id="sr_ViewPhotoGalleryText"><a
href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0
&RN=2102522&lid=7&PN=1&DO=2" class="cssGlobalLinks_PageNav"
id="lnkSaveThisSearch">viewas photo gallery</a></div> <div
id="sr_Pagination"><span
class="cssGlobalSysText_LightGray">page&nbsp;</span><a
href="some.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=1&DO=0"
class="cssSr_PaginationCurrentPage" id="lnkPage">1</a><a
href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=2&DO=0"class="cssSr_PageNav"
id="lnkPage">2</a><a
href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=3&DO=0"
class="cssSr_PageNav" id="lnkPage">';

What I'm looking for is the URL between the href and
cssSr_PaginationCurrentPage. When I do it, the match ends up starting at
the first href and going all the way to the
cssSr_PaginationCurrentPage. I've tried \b, I've tried {}, I've tried
()'s... and I just can't get it to match only the one URL:
some.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=1&DO=0

How do I tell it to start at the cssSr_PaginationCurrentPage
and work backwards to the first instance of href="?


Thanks, Tuc
 
Gunnar Hjalmarsson

Tuc said:
I'm trying to get a regexp to make a match, and it's not working,
and it's kicking my ass. The text I'm going against is:

What I'm looking for is the URL between the href and
cssSr_PaginationCurrentPage.

This may or may not work:

if ( $text =~ /<a\s+href\s*=\s*
        (?:(?:(["'])(\S+)\1)|(\S+))
        [^>]*class\s*=\s*(?:["'])?cssSr_PaginationCurrentPage/x ) {
    print $+;
}
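
A note on the print $+ line: $+ holds the text of the highest-numbered capture group that actually matched, so it yields the URL whether the quoted branch (group 2) or the unquoted branch (group 3) succeeded. A minimal sketch of just that part, reusing the href subpattern from the regexp above; the helper name first_href and the two sample tags are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the href value found in $html,
# whether or not the value is quoted.
sub first_href {
    my ($html) = @_;
    if ( $html =~ /href\s*=\s*(?:(?:(["'])(\S+)\1)|(\S+))/ ) {
        # $+ is the highest-numbered capture group that matched:
        # group 2 for a quoted value, group 3 for a bare one.
        return $+;
    }
    return;
}

print first_href('<a href="page.html" class="x">'), "\n";
print first_href('<a href=page.html class=x>'),     "\n";
```

Both calls should print the same bare URL, which is why the original code does not need to test which alternative matched.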
 
terry l. ridder

Tuc said:
Hi,

I'm trying to get a regexp to make a match, and it's not working,
and it's kicking my ass. The text I'm going against is:
What I'm looking for is the URL between the href and
cssSr_PaginationCurrentPage. When I do it, the match ends up starting at
the first href and going all the way to the
cssSr_PaginationCurrentPage. I've tried \b, I've tried {}, I've tried
()'s... and I just can't get it to match only the one URL:
some.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=1&DO=0

How do I tell it to start at the cssSr_PaginationCurrentPage
and work backwards to the first instance of href="?

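Perhaps you need to 'divide and conquer'.

This works for me:

use strict;
use warnings;

# $text is assumed to be declared with "my" and filled as in the original post.
if ( $text =~ /href="(.*?)class="cssSr_PaginationCurrentPage/s )
{
    # $1 runs from the first href in the text up to the class attribute.
    my $url = $1;
    # Greedy .* discards everything up to the last href=" before the class.
    $url =~ s/^.*href="//s;
    # Drop the closing quote and any whitespace after it.
    $url =~ s/"\s*$//;
    print STDOUT "url == ``" . $url . "''\n";
}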
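
For anyone wondering why the original attempts started at the first href: a non-greedy quantifier makes the match as short as possible from a given starting point, but the engine still settles for the leftmost starting point that can succeed. A small sketch of the difference, using a made-up two-link sample rather than the full text from the original post (the class names "other" and "current" are placeholders):

```perl
use strict;
use warnings;

# Made-up sample: two links, only the second has the class we want.
my $sample = '<a href="one.html" class="other">1</a>'
           . '<a href="two.html" class="current">2</a>';

# Non-greedy, but the match still begins at the leftmost href
# from which the rest of the pattern can succeed.
my ($too_much) = $sample =~ /href="(.*?)" class="current"/;

# [^"]* cannot cross a quote, so the attempt at the first href fails
# and the engine moves on to the second one.
my ($just_url) = $sample =~ /href="([^"]*)" class="current"/;

print "too much: $too_much\n";
print "just url: $just_url\n";
```

Restricting what the capture may contain, rather than trying to make the regexp "work backwards", is the idea behind the [^>]* and \S+ pieces in the other answers.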
 
terry l. ridder

Gunnar Hjalmarsson said:
This may or may not work:

if ( $text =~ /<a\s+href\s*=\s*
        (?:(?:(["'])(\S+)\1)|(\S+))
        [^>]*class\s*=\s*(?:["'])?cssSr_PaginationCurrentPage/x ) {
    print $+;
}

That works rather well.
It beats my 'divide and conquer' approach.
 
Gunnar Hjalmarsson

terry said:
This may or may not work:

if ( $text =~ /<a\s+href\s*=\s*
        (?:(?:(["'])(\S+)\1)|(\S+))
        [^>]*class\s*=\s*(?:["'])?cssSr_PaginationCurrentPage/x ) {
    print $+;
}

that works rather well.

A shorter (and clearer) variant would be:

if ( $text =~ /href\s*=\s*
        (?:
            (?:
                (["'])(\S+)\1    # quoted URL
            )
            |
            (\S+)                # non-quoted URL
        )
        [^>]+cssSr_PaginationCurrentPage/x ) {
    print $+;
}

Yeah, it works, provided that

1) the class attribute actually does come after the href attribute, and

2) no 'weird' attribute such as

someattr="x > z"

has been put in between.

Which I suppose illustrates Bob's point that it *is* difficult to parse
HTML with regular expressions...
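
The two provisos above can be relaxed somewhat by splitting each tag into its attributes first and only then checking the class. What follows is still a regexp-based sketch under stated assumptions, not a real HTML parser: the helper name href_for_class, the sample string and its class names are made up, and it assumes reasonably well-formed tags with balanced quotes. A dedicated parsing module from CPAN, such as HTML::Parser, would be the more robust route.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the href of the first <a> tag whose class
# attribute equals $class, regardless of attribute order.
sub href_for_class {
    my ( $html, $class ) = @_;

    # Tag body: quoted strings are consumed whole, so a '>' inside a
    # quoted attribute value does not end the tag early.
    while ( $html =~ /<a\b((?:"[^"]*"|'[^']*'|[^>"'])*)>/gi ) {
        my $body = $1;
        my %attr;

        # Collect name=value pairs; values may be double-quoted,
        # single-quoted, or bare.
        while ( $body =~ /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"']+))/g ) {
            my $value = defined $2 ? $2 : defined $3 ? $3 : $4;
            $attr{ lc $1 } = $value;
        }

        return $attr{href}
            if defined $attr{class}
            && defined $attr{href}
            && $attr{class} eq $class;
    }
    return;
}

# Made-up sample: class before href, plus a 'weird' attribute.
my $sample = '<a class="other" href="one.html">1</a>'
           . '<a someattr="x > z" class="current" href="two.html">2</a>';
print href_for_class( $sample, 'current' ), "\n";
```

Against the text from the original post, the call would be href_for_class( $text, 'cssSr_PaginationCurrentPage' ).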
 
