Parsing HTML with Regular Expressions

Captain Dondo

I am trying to pull out an href from a bit of JavaScript. I am running
PHP, but the RE should be the same....

What I have is this:

<a href="JavaScript:void(0)"
onClick="JavaScript:window.open(\'http://www.seiner.com/blog/Travels/...og/Travels/images/2.jpg&width=730&height=755\',
\'FamilyPic\', \'scrollbars=yes,height=755,width=730,location=no\');
return false"><img
src="http://www.seiner.com/blog/Travels/images/thumb-2.jpg"/></a>

What I want to do is pull out the URL in the window.open call but only
if it doesn't contain either a next=[whatever] or a prev=[whatever] tag.

In other words, the above href doesn't contain either one, so my RE
returns 'http://www.seiner.com/blog/Travels/images/1.jpg'.

But if the above URL were to be as follows (see the next and prev at the
end of the URL):

<a href="JavaScript:void(0)"
onClick="JavaScript:window.open(\'http://www.seiner.com/blog/Travels/...g&width=730&height=755&prev=4.jpg&next=2.jpg\',
\'FamilyPic\', \'scrollbars=yes,height=755,width=730,location=no\');
return false"><img
src="http://www.seiner.com/blog/Travels/images/thumb-2.jpg"/></a>

I want the RE to not match....

The RE I am using is

$re = '<[aA] .*image=([a-zA-Z0-9.:/-]*).*/>';

and the actual match is done via:

preg_match_all ( $re, $text , $matches, PREG_OFFSET_CAPTURE);

TIA...
 
JDS

Captain said:
I want the RE to not match....

The RE I am using is

$re = '<[aA] .*image=([a-zA-Z0-9.:/-]*).*/>';

and the actual match is done via:

preg_match_all ( $re, $text , $matches, PREG_OFFSET_CAPTURE);

TIA...

Are you getting any errors? What are they?
 
Captain Dondo

JDS said:
I want the RE to not match....

The RE I am using is

$re = '<[aA] .*image=([a-zA-Z0-9.:/-]*).*/>';

and the actual match is done via:

preg_match_all ( $re, $text , $matches, PREG_OFFSET_CAPTURE);

TIA...


Are you getting any errors? What are they?

No, but my RE always pulls out the URL. I can't figure out how to make
it conditional:

Only match if the URL doesn't contain prev or next, and be case-insensitive.

....
 
Gunnar Hjalmarsson

Captain said:
I am trying to pull out an href from a bit of JavaScript. I am running
PHP, but the RE should be the same....

Are you sure of that?

Captain said:
What I want to do is pull out the URL in the window.open call but only
if it doesn't contain either a next=[whatever] or a prev=[whatever] tag.

The RE I am using is

$re = '<[aA] .*image=([a-zA-Z0-9.:/-]*).*/>';

This may or may not do what you want:

$re = '<[aA][^>]+image=([a-zA-Z0-9.:/-]+)(?![^>]+(?:next|prev)=)';
 
Captain Dondo

Gunnar said:
Are you sure of that?

Hopefully close enough to get me started....

Gunnar said:
This may or may not do what you want:

$re = '<[aA][^>]+image=([a-zA-Z0-9.:/-]+)(?![^>]+(?:next|prev)=)';

Close enough.... It's picking up the whole <a ... /a> but I think it's
enough to get me started....

The [^>] rewinds the pattern match to the beginning of the line, I take it?
 
Jim Gibson

Captain said:
[regex snipped]
The [^>] rewinds the pattern match to the beginning of the line, I take it?

No. The [^>] matches any single character that is not a '>'.
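
To make that concrete, here is a minimal sketch in Python rather than PHP (negated character classes behave the same way in both). The sample tag is made up for illustration and is not taken from the thread.

```python
import re

# Hypothetical tag, used only to illustrate the character class.
tag = '<a href="x.html" title="demo">link text</a>'

# [^>] is a negated character class: exactly one character, anything except '>'.
first = re.match(r'[^>]', tag).group(0)

# With + it repeats, so the run stops just before the first '>'.
run = re.match(r'[^>]+', tag).group(0)

# A lone '>' cannot be matched by [^>] at all.
no_match = re.match(r'[^>]', '>')

print(repr(first))
print(repr(run))
print(no_match)
```

Nothing here moves the match position backwards; the class simply refuses to step over a '>', which keeps the match inside a single tag.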
 
JDS

Captain said:
No, but my RE always pulls out the URL. I can't figure out how to make it
conditional:

Only match if the URL doesn't contain prev or next, and be case-insensitive.

I don't know if this will solve your problem, but have you tried this:

if ( ! preg_match(...expression...) ){
}

Where the "...expression..." part is the match you posted earlier.

The important part is the "!" -- it negates the if() test.

later...
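
One possible reading of the negated-test idea, sketched in Python rather than PHP: capture the image= value first, then keep it only when a second test for next= or prev= does not match. The function name extract_image, the example.com URLs, and view.php are all hypothetical placeholders, since the URLs in the thread are elided.

```python
import re

def extract_image(snippet):
    """Return the image= value, or None when next= or prev= is present."""
    found = re.search(r"image=([a-zA-Z0-9.:/-]+)", snippet, re.IGNORECASE)
    if found is None:
        return None
    # The negated test: proceed only if the second pattern does NOT match.
    if not re.search(r"(?:next|prev)=", snippet, re.IGNORECASE):
        return found.group(1)
    return None

# Hypothetical onClick fragments; example.com and view.php are placeholders.
plain = "window.open('http://example.com/view.php?image=http://example.com/images/2.jpg&width=730')"
with_nav = "window.open('http://example.com/view.php?image=http://example.com/images/2.jpg&prev=4.jpg&next=2.jpg')"

print(extract_image(plain))
print(extract_image(with_nav))
```

Splitting the work into two simple tests trades one dense pattern for two readable ones, at the cost of scanning the text twice.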
 
Gunnar Hjalmarsson

Captain said:
Gunnar said:
Are you sure of that?

Hopefully close enough to get me started....
$re = '<[aA][^>]+image=([a-zA-Z0-9.:/-]+)(?![^>]+(?:next|prev)=)';

Close enough.... It's picking up the whole <a ... /a>

In Perl it does its job on the example snippets you posted. That's what
I meant when I questioned whether regular expressions are the same
irrespective of the programming language. You'd better not count on it.

Captain said:
but I think it's enough to get me started....

The (?!...) construct is called a "zero-width negative look-ahead
assertion" in Perl.

Captain said:
The [^>] rewinds the pattern match to the beginning of the line, I take it?

No. Jim explained what it means (and *that* is not Perl specific AFAIK).
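
For anyone who wants to watch the look-ahead work, here is a rough sketch in Python (not PHP or Perl) that runs the pattern from this thread against two snippets. Because the URLs posted above are elided, the markup below uses placeholder example.com addresses and a hypothetical view.php script; only the overall shape follows the original. The IGNORECASE flag is an addition to cover the case-insensitivity request.

```python
import re

# The pattern from the thread, unchanged; IGNORECASE is added for case-insensitivity.
PATTERN = re.compile(
    r"<[aA][^>]+image=([a-zA-Z0-9.:/-]+)(?![^>]+(?:next|prev)=)",
    re.IGNORECASE,
)

# Hypothetical markup: example.com and view.php are placeholders.
PLAIN = """<a href="JavaScript:void(0)"
onClick="JavaScript:window.open('http://example.com/view.php?image=http://example.com/images/2.jpg&width=730&height=755',
'FamilyPic', 'scrollbars=yes,height=755,width=730,location=no');
return false"><img
src="http://example.com/images/thumb-2.jpg"/></a>"""

# Same markup with prev= and next= appended to the URL.
WITH_NAV = PLAIN.replace("&height=755'", "&height=755&prev=4.jpg&next=2.jpg'")

plain_match = PATTERN.search(PLAIN)
nav_match = PATTERN.search(WITH_NAV)

# group(0) is everything the pattern consumed; group(1) is only the captured URL.
print(plain_match.group(1))
print(nav_match)
```

The whole match (group 0) runs from the opening <a through the end of the captured URL, so the value of interest is in group 1. In PHP the corresponding entries would be $matches[0] and $matches[1], and the preg functions also expect the pattern to be wrapped in delimiters.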
 
