Regular expression, getting href which is followed by img tag with specific src

fatted

From an HTML file, I'd like to extract the href value of an <a> tag which
contains an <img> tag whose src value I'm searching on.

Basically (but there's more!):
<a href="IwantThis.html"><img src="importantimage.gif"></a>

(Un)Interesting part:
I first match a line from the html file containing importantimage.gif,
I then try to find my href value on this line.
But this line contains multiple <a> tags, (which have href values and
might also have an <img> tag with associated src value). Also all of
the <a> tags and <img> tags have more than one attribute.
So the line actually looks something like this:
<a class="red" href="uninteresting.html" target="_new">Not so exciting
text</a><a href="equallyboring.html" class = "blue">yawn</a><a
class="green" href="IwantThis.html"><img border="0"
src="importantimage.gif" alt="MeMe"></a>

My code:

use warnings;
use strict;

open(FILE,"<","4body.html");
while(<FILE>)
{
    my $line = $_;
    if($line =~ /importantimage\.gif/i)
    {
        if($line =~ /<a.+?href="(.+?)".+?src="importantimage\.gif".+?><\/a>/)
        {
            print $1."\n";
        }
    }
}

which results in:

uninteresting.html

I think I understand why it gets this value, but I can't get the value
I want :)
 
codyhess

Your parentheses are set to capture the first bit of ".+" in the scalar.
If you want the third link you should make your expression more
specific. Instead of

if($line =~ /<a.+?href="(.+?)".+?src="importantimage\.gif".+?><\/a>/)

try

if($line =~ /<a.+?href=".+?".+?href=".+".+href="(.+?)".+src="importantimage\.gif".+?><\/a>/)



Why are you using .+? instead of .+ ?

uh....?
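
On the `.+?` question: a non-greedy quantifier only limits how far each piece stretches; it does not change where the match starts. The engine still tries the leftmost `<a` first, and `.+?` is free to run across tag boundaries until it reaches the src. A minimal sketch, assuming the sample data really is on one line and that the <img> follows its opening <a> tag directly; the `[^>]*` pattern is an illustration, not something posted in the thread:

```perl
use strict;
use warnings;

# The sample line from the question, joined back onto one line.
my $line = '<a class="red" href="uninteresting.html" target="_new">Not so exciting text</a>'
         . '<a href="equallyboring.html" class = "blue">yawn</a>'
         . '<a class="green" href="IwantThis.html"><img border="0" src="importantimage.gif" alt="MeMe"></a>';

# Original pattern: the match starts at the first <a, and .+? may cross
# any number of tags on its way to the src attribute.
my ($loose) = $line =~ /<a.+?href="(.+?)".+?src="importantimage\.gif".+?><\/a>/;

# Tighter pattern: [^>]* cannot cross a '>', so the href and the src must
# sit in an opening <a> tag and an <img> tag that are directly adjacent.
my ($tight) = $line =~ m{
    <a\b [^>]* \bhref="([^"]*)" [^>]* >       # one complete opening a tag
    \s*
    <img\b [^>]* \bsrc="importantimage\.gif"  # immediately followed by the img
}x;

print "loose: $loose\n";
print "tight: $tight\n";
```

This still only handles the markup shown in the question; it is not a general HTML matcher.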
 
Fatted

Tad McClellan said:
You should use a module that understands HTML for processing HTML data.

Unfortunately I don't think that will help me with my problem, I want to
extract the value of a href, for an <a> tag, preceding an <img> tag which
has an attribute src with a specific value. I'm not sure what module does
this. (I'm going to look again though!)
"lines" do not matter in HTML.

Thanks for the reminder :) However if I were to use perl to parse a plain
text file (which just happened to contain html), "lines" :) do matter. I
first wanted to find the line (thereby ignoring all the rest of the html)
which contained the <img src="importantimage.gif" (there just happens to be
lots of tags on this line), and then try to find the preceding value of the
<a> tags href.

"the line" is singular, you didn't post 1 line, you posted 4 lines.

I posted 1 line (at least that was the attempt), unfortunately Google Groups
did a bit of a hatchet job on it, and it got spread over 4 lines. That's why
I referred to one line :)
If that _was_ really all on a single line, then it would still be
equivalent HTML, since most whitespace does not matter in HTML data.

<br>
and
<br >
and
<br

Are all the same HTML data.

Revision is always good :)
You should always, yes *always*, check the return value from open():

I know, I know but I was working just on the regular expression in a tester
script, so it'd be obvious if there was a file problem, (my real script does
check for return value. Honest :). Good habits are good habits though.
open(FILE, '<', '4body.html') or die "could not open '4body.html' $!";




If you want it in $line instead of $_ then you can put it
in $line straightaway:

while ( my $line = <FILE> )

Good point.
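
Putting the two suggestions together (check open(), read straight into $line) might look like the sketch below. The temporary file is a hypothetical stand-in for 4body.html so that the example runs on its own, and the lexical filehandle $fh is a stylistic choice that is not in the original code:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for 4body.html so the sketch is self-contained.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} qq{<p>nothing here</p>\n};
print {$tmp} qq{<a href="IwantThis.html"><img src="importantimage.gif"></a>\n};
close $tmp or die "could not close '$path': $!";

# Three-argument open with a checked return value.
open(my $fh, '<', $path) or die "could not open '$path': $!";

my @matching_lines;
while ( my $line = <$fh> ) {    # read straight into $line, no copy from $_
    chomp $line;
    push @matching_lines, $line if $line =~ /importantimage\.gif/i;
}
close $fh;

print "$_\n" for @matching_lines;
```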
This will NOT do what you asked, because it does not handle
arbitrary HTML, it handles only the one case that you have shown.

You're right it won't do what I asked, I think the google wrap, put you off.
It can be easily broken by legal HTML.

I'll try to keep my HTML as bad as my perl code :)
It would work correctly if I had used a module that understands
HTML data...

See my first comment, but I'd be delighted to be proved wrong. In the mean
time, I'd still appreciate some tips on the regular expression...
------------------------------------
#!/usr/bin/perl
use strict;
use warnings;

my $html = '
<a class="red" href="uninteresting.html" target="_new">Not so exciting
text</a><a href="equallyboring.html" class = "blue">yawn</a><a
class="green" href="IwantThis.html"><img border="0"
src="importantimage.gif" alt="MeMe"></a>';


while ( $html =~ m#(<a\s.*?</a>)#sg ) {
    my $anchor = $1;
    next unless $anchor =~ /src="importantimage\.gif"/;

    print "$1\n" if $anchor =~ /href="([^"]*)/;
}
 
Tad McClellan

Fatted said:
Unfortunately I don't think that will help me with my problem,


Yes it will. That is why I suggested it.

I want to
extract the value of a href, for an <a> tag, preceding an <img> tag which
has an attribute src with a specific value. I'm not sure what module does
this. (I'm going to look again though!)


I understood what you wanted to do quite clearly, that's why the
code that I already posted does just what you describe above!

Did you run the program?

Thanks for the reminder :)


But you are going to forget it again before you get to the
end of your followup...

I
first wanted to find the line


If you think of "lines" when processing HTML you aren't thinking
correctly, and it will hurt you at some point.

So don't do that. :)
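
To illustrate why line-oriented matching is fragile here, this sketch feeds the wrapped sample through a line-by-line loop and then through a single read of the whole document. The in-memory filehandle and the variable names are assumptions made for the demonstration:

```perl
use strict;
use warnings;

# The sample markup, wrapped over four lines as it appeared in the post.
my $wrapped = <<'HTML';
<a class="red" href="uninteresting.html" target="_new">Not so exciting
text</a><a href="equallyboring.html" class = "blue">yawn</a><a
class="green" href="IwantThis.html"><img border="0"
src="importantimage.gif" alt="MeMe"></a>
HTML

# Line by line: the href and the src end up on different lines,
# so no single line contains both.
open(my $by_line, '<', \$wrapped) or die "could not open in-memory file: $!";
my $line_hits = 0;
while ( my $line = <$by_line> ) {
    $line_hits++ if $line =~ /href="[^"]*".*src="importantimage\.gif"/;
}
close $by_line;

# Slurp: with the input record separator undefined, one read returns everything.
open(my $whole, '<', \$wrapped) or die "could not open in-memory file: $!";
my $document = do { local $/; <$whole> };
close $whole;

# /s lets . match newlines, so the pattern can span the former line breaks.
my $slurp_hits = () = $document =~ /href="[^"]*".*?src="importantimage\.gif"/sg;

print "line-by-line hits: $line_hits\n";
print "slurped hits:      $slurp_hits\n";
```

The slurped match only shows that the text is reachable once line boundaries are ignored; it says nothing about which href is captured.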

which contained the <img src="importantimage.gif" (there just happens to be
lots of tags on this line), and then try to find the preceding value of the
<a> tags href.


That is what my code does.

I posted 1 line (at least that was the attempt), unfortunately Google groups
did a bit of a hatchet job on it, and it got spread over 4 lines. Thats why
I referred to one line :)


Yes I expected that that is what happened.

Have you seen the Posting Guidelines that are posted here frequently?

If you had said it "in Perl" then you could have conveyed your
actual data without "helpful" tools (attempting to) break it for you.


$html = '<a class="red" href="uninteresting.html" target="_new">'
. 'Not so exciting text</a><a href="equallyboring.html" '
. 'class = "blue"> ...';

You're right it won't do what I asked,


You're wrong, it *will* do what you asked.

Did you run the program?

It prints

IwantThis.html

isn't that what you wanted to be able to find?

But it will not work for real-world HTML, only for the specific
example of HTML that you posted. This legal HTML would break
it for instance:

<a class="green" href="Ido*NOT*wantThis.html">
<!-- src="importantimage.gif" -->
</a>

Whereas a Real HTML parser would not report that false positive.
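
The false positive can be reproduced by wrapping the regex loop in a sub and running it on the commented-out markup. The sub name hrefs_by_regex and the comment-stripping step are hypothetical additions for this sketch; stripping comments only hides this one case and does not turn the regex into a parser:

```perl
use strict;
use warnings;

# Return every href found by the anchor-by-anchor regex approach.
sub hrefs_by_regex {
    my ($html) = @_;
    my @found;
    while ( $html =~ m#(<a\s.*?</a>)#sg ) {
        my $anchor = $1;
        next unless $anchor =~ /src="importantimage\.gif"/;
        push @found, $1 if $anchor =~ /href="([^"]*)/;
    }
    return @found;
}

my $commented_out = <<'HTML';
<a class="green" href="Ido*NOT*wantThis.html">
<!-- src="importantimage.gif" -->
</a>
HTML

# Workaround for this single case: remove simple comments before matching.
( my $without_comments = $commented_out ) =~ s/<!--.*?-->//sg;

my @false_positive = hrefs_by_regex($commented_out);
my @after_strip    = hrefs_by_regex($without_comments);

print "regex on raw HTML:       @false_positive\n";
print "regex, comments removed: @after_strip\n";
```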

I think the google wrap, put you off.


No it didn't.

First, my code does exactly what you asked for with the data you gave.
(and if you modify the data to be all on one line, it will _still_
do the Right Thing.)

Did you run the program?

Secondly, the word-wrapping did *not* break anything, because the
HTML is equivalent whether wrapped or all on a single line.

Your code should be able to handle HTML, and line breaks don't matter
in HTML, so your code should be able to handle the data either way.

See my first comment, but I'd be delighted to be proved wrong.
^^^^^^^^^^^^

I'll do that a little farther down.

In the mean
time, I'd still appreciate some tips on the regular expression...


Trying to accomplish what you want with regular expressions is the
path to madness. You can work on it for many days and it will
still be easily broken by legal HTML data.

I know, I've been doing this sort of thing for 13 years.

Regexes are not sufficiently powerful for the job you need done.

You need a Real Parser.


[snip working code]

You can do it in less than 10 lines of code with HTML::Tree

http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/


---------------------------------------------------------
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = '
<a class="red" href="uninteresting.html" target="_new">Not so exciting
text</a><a href="equallyboring.html" class = "blue">yawn</a><a
class="green" href="IwantThis.html"><img border="0"
src="importantimage.gif" alt="MeMe"></a>
';

# $html =~ s/\n/ /g; # make it all on one line

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
$tree->eof;    # signal end of input so the tree is complete

# find elements containing: src="importantimage.gif"
foreach my $img ( $tree->look_down('src', 'importantimage.gif') ) {
    next unless $img->tag eq 'img';        # ensure the "src" attr was on
                                           # an <img> element

    next unless $img->parent->tag eq 'a';  # ensure parent is an <a> element
    my $href = $img->parent->attr('href'); # grab its "href" attr value

    print "$href\n";
}

$tree->delete;
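
A possible variation, sketched under the assumption that the <img> is not always a direct child of the <a>: look_down can filter on the tag name as well as the attribute, and look_up finds the nearest enclosing <a> however deeply the image is nested. The <span> wrapper in the sample markup is hypothetical:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup: the <img> is wrapped in a <span>, so its direct
# parent is not the <a> element.
my $html = '<a href="boring.html">yawn</a>'
         . '<a class="green" href="IwantThis.html"><span>'
         . '<img border="0" src="importantimage.gif" alt="MeMe"></span></a>';

my $tree = HTML::TreeBuilder->new_from_content($html);

my @hrefs;
# '_tag' restricts the search to <img> elements carrying the wanted src.
foreach my $img ( $tree->look_down( '_tag' => 'img', 'src' => 'importantimage.gif' ) ) {
    # look_up walks towards the root and returns the nearest enclosing <a>, if any.
    my $anchor = $img->look_up( '_tag' => 'a' ) or next;
    my $href   = $anchor->attr('href');
    push @hrefs, $href if defined $href;
}

$tree->delete;

print "$_\n" for @hrefs;
```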
 
fatted

Yes it will. That is why I suggested it.

Perhaps I meant that I couldn't see *how* it would help with my
problem :)
I understood what you wanted to do quite clearly, that's why the
code that I already posted does just what you describe above!

Did you run the program?

I did, but some idiot copy pasted incorrectly :) When I catch that
guy...
But you are going to forget it again before you get to the
end of your followup...

Just put the gun down son... No I really do understand how HTML works.
I talked about a line, because, I am absolutely sure that the <a><img
If you think of "lines" when processing HTML you aren't thinking
correctly, and it will hurt you at some point.

So don't do that. :)

No more please :)
You can do it in less than 10 lines of code with HTML::Tree

http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/

Thanks.

I also figured out what was wrong (keep the list short :) with the
regular expression in my original post. I had:

if($line =~ /<a.+?href="(.+?)".+?src="importantimage\.gif".+?><\/a>/)

But if I'd tried:

if($line =~ /<a.+href="(.+?)".+?src="importantimage\.gif".+><\/a>/)

I would have managed. Although I'll have to think about that a bit
more.
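
One way to see why the second version works, as a sketch on the one-line sample: the greedy `.+` after `<a` first runs to the end of the line and then backtracks, so the capture lands on the last href that still has the src somewhere after it. The extra trailing link in $longer is a hypothetical addition to show that a later href does not change the result:

```perl
use strict;
use warnings;

my $line = '<a class="red" href="uninteresting.html" target="_new">Not so exciting text</a>'
         . '<a href="equallyboring.html" class = "blue">yawn</a>'
         . '<a class="green" href="IwantThis.html"><img border="0" src="importantimage.gif" alt="MeMe"></a>';

# Non-greedy: the first href after the leftmost <a is tried first and succeeds.
my ($first_href) = $line =~ /<a.+?href="(.+?)".+?src="importantimage\.gif".+?><\/a>/;

# Greedy: .+ grabs everything, then gives characters back until href="
# matches at a point from which the rest of the pattern can still succeed.
my ($last_href) = $line =~ /<a.+href="(.+?)".+?src="importantimage\.gif".+><\/a>/;

# An href after the image has no src following it, so it is rejected
# during backtracking and the wanted href is still captured.
my $longer = $line . '<a href="afterwards.html">later</a>';
my ($still_last) = $longer =~ /<a.+href="(.+?)".+?src="importantimage\.gif".+><\/a>/;

print "non-greedy:       $first_href\n";
print "greedy:           $last_href\n";
print "greedy, extra <a>: $still_last\n";
```

This relies on everything being on one line and on the double-quoted attribute style shown in the sample, so the earlier caveats about real-world HTML still apply.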
 
