Regular expression, getting href which is followed by img tag with specific src

Discussion in 'Perl Misc' started by fatted, Aug 20, 2003.

  1. fatted

    fatted Guest

    From an HTML file, I'd like to extract the href value of an <a> tag which
    contains an <img> tag whose src value I'm searching on.

    Basically (but there's more!):
    <a href="IwantThis.html"><img src="importantimage.gif"></a>

    (Un)Interesting part:
    I first match a line from the html file containing importantimage.gif,
    I then try to find my href value on this line.
    But this line contains multiple <a> tags, (which have href values and
    might also have an <img> tag with associated src value). Also all of
    the <a> tags and <img> tags have more than one attribute.
    So the line actually looks something like this:
    <a class="red" href="uninteresting.html" target="_new">Not so exciting
    text</a><a href="equallyboring.html" class = "blue">yawn</a><a
    class="green" href="IwantThis.html"><img border="0"
    src="importantimage.gif" alt="MeMe"></a>

    My code:

    use warnings;
    use strict;

    open(FILE,"<","4body.html");
    while(<FILE>)
    {
        my $line = $_;
        if($line =~ /importantimage\.gif/i)
        {
            if($line =~ /<a.+?href="(.+?)".+?src="importantimage\.gif".+?><\/a>/)
            {
                print $1."\n";
            }
        }
    }

    which results in:

    uninteresting.html

    I think I understand why it gets this value, but I can't get the value
    I want :)
     
    fatted, Aug 20, 2003
    #1
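
The result described above can be reproduced in isolation. The sketch below is only an illustration: the sub names `href_original` and `href_bounded` are made up, and the second pattern is a hypothetical alternative rather than anything posted in this thread. It assumes the sample really is one line. A regex match starts at the leftmost position where it can succeed, so `<a.+?` anchors on the first `<a` and the lazy quantifiers simply stretch until `src="importantimage.gif"` turns up.

```perl
use strict;
use warnings;

# The sample line from the post, joined back into a single line.
my $line = '<a class="red" href="uninteresting.html" target="_new">Not so exciting text</a>'
         . '<a href="equallyboring.html" class = "blue">yawn</a>'
         . '<a class="green" href="IwantThis.html"><img border="0" src="importantimage.gif" alt="MeMe"></a>';

# The pattern from the post: it starts at the first "<a" and the lazy
# quantifiers expand as far as they must to reach the src attribute.
sub href_original {
    my ($text) = @_;
    return $text =~ /<a.+?href="(.+?)".+?src="importantimage\.gif".+?><\/a>/ ? $1 : undef;
}

# Hypothetical alternative: [^>]* cannot cross a tag boundary and [^"]* cannot
# cross a quote, so the href and the src must sit in adjacent <a> and <img> tags.
sub href_bounded {
    my ($text) = @_;
    return $text =~ /<a\b[^>]*\bhref="([^"]*)"[^>]*>\s*<img\b[^>]*\bsrc="importantimage\.gif"/
        ? $1 : undef;
}

print href_original($line), "\n";   # the first href on the line
print href_bounded($line), "\n";    # the href of the anchor wrapping the image
```

The bounded pattern is still a regex over HTML and inherits the usual fragility (single-quoted attributes, comments, line breaks inside tags), so treat it as a way to see the matching behaviour rather than a robust extractor.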

  2. fatted

    codyhess Guest

    Your parentheses are set to capture the first bit of ".+" in the scalar.
    If you want the third link you should make your expression more
    specific. Instead of

    if($line =~
    /<a.+?href="(.+?)".+?src="importantimage\.gif".+?><\/a>/) try

    if($line =~
    /<a.+?href=".+?".+?href=".+".+href="(.+).+src="importantimage\.gif".+?><\/a>/)



    Why are you using .+? instead of .+

    uh....?


    --
    print "Aspiring to be just another perl hacker,"


    Posted via http://dbforums.com
     
    codyhess, Aug 20, 2003
    #2
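
On the question of `.+?` versus `.+`: a short sketch of the difference, using a made-up two-link string and a hypothetical helper named `capture`. Greedy `.+` takes as much as it can and then backs off to the last quote on the line, lazy `.+?` stops at the first closing quote that lets the match succeed, and a negated class such as `[^"]+` cannot run past a quote at all.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first capture group of $re applied to $text.
sub capture {
    my ($re, $text) = @_;
    my ($got) = $text =~ $re;
    return $got;
}

my $tags = '<a href="one.html">1</a><a href="two.html">2</a>';

# Greedy: runs to the end of the string, then backs off to the last quote.
print "greedy:  ", capture(qr/href="(.+)"/, $tags), "\n";

# Lazy: stops at the first closing quote.
print "lazy:    ", capture(qr/href="(.+?)"/, $tags), "\n";

# Negated character class: can never include a quote in the capture.
print "bounded: ", capture(qr/href="([^"]+)"/, $tags), "\n";
```

Neither quantifier changes where the match begins, which is why swapping one for the other inside the capture does not by itself select a later link.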

  3. fatted

    Fatted Guest

    "Tad McClellan" <> wrote in message
    news:...
    > fatted <> wrote:


    > You should use a module that understands HTML for processing HTML data.


    Unfortunately I don't think that will help me with my problem, I want to
    extract the value of a href, for an <a> tag, preceding an <img> tag which
    has an attribute src with a specific value. I'm not sure what module does
    this. (I'm going to look again though!)

    > > Basically (but there's more!):
    > ><a href="IwantThis.html"><img src="importantimage.gif"></a>
    > >
    > > (Un)Interesting part:
    > > I first match a line

    >
    >
    > "lines" do not matter in HTML.


    Thanks for the reminder :) However if I were to use perl to parse a plain
    text file (which just happened to contain html), "lines" :) do matter. I
    first wanted to find the line (thereby ignoring all the rest of the html)
    which contained the <img src="importantimage.gif" (there just happens to be
    lots of tags on this line), and then try to find the preceding value of the
    <a> tags href. I was trying to break the problem down (in my own little way
    :)

    > > So the line

    > ^^^^^^^^
    >
    > "the line" is singular, you didn't post 1 line, you posted 4 lines.


    I posted 1 line (at least that was the attempt), unfortunately Google Groups
    did a bit of a hatchet job on it, and it got spread over 4 lines. That's why
    I referred to one line :)

    > > actually looks something like this:
    > ><a class="red" href="uninteresting.html" target="_new">Not so exciting
    > > text</a><a href="equallyboring.html" class = "blue">yawn</a><a
    > > class="green" href="IwantThis.html"><img border="0"
    > > src="importantimage.gif" alt="MeMe"></a>

    >
    >
    > If that _was_ really all on a single line, then it would still be
    > equivalent HTML, since most whitespace does not matter in HTML data.
    >
    > <br>
    > and
    > <br >
    > and
    > <br
    > >

    >
    > Are all the same HTML data.


    Revision is always good :)

    > > open(FILE,"<","4body.html");

    >
    >
    > You should always, yes *always*, check the return value from open():


    I know, I know but I was working just on the regular expression in a tester
    script, so it'd be obvious if there was a file problem, (my real script does
    check for return value. Honest :). Good habits are good habits though.

    > open(FILE, '<', '4body.html') or die "could not open '4body.html' $!";
    >
    >
    > > while(<FILE>)
    > > {
    > > my $line = $_;

    >
    >
    > If you want it in $line instead of $_ then you can put it
    > in $line straightaway:
    >
    > while ( my $line = <FILE> )


    Good point.

    > This will NOT do what you asked, because it does not handle
    > arbitrary HTML, it handles only the one case that you have shown.


    You're right it won't do what I asked, I think the google wrap, put you off.

    > It can be easily broken by legal HTML.


    I'll try to keep my HTML as bad as my perl code :)

    > It would work correctly if I had used a module that understands
    > HTML data...


    See my first comment, but I'd be delighted to be proved wrong. In the mean
    time, I'd still appreciate some tips on the regular expression...

    > ------------------------------------
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    >
    > my $html = '
    > <a class="red" href="uninteresting.html" target="_new">Not so exciting
    > text</a><a href="equallyboring.html" class = "blue">yawn</a><a
    > class="green" href="IwantThis.html"><img border="0"
    > src="importantimage.gif" alt="MeMe"></a>';
    >
    >
    > while ( $html =~ m#(<a\s.*?</a>)#sg ) {
    > my $anchor = $1;
    > next unless $anchor =~ /src="importantimage\.gif"/;
    >
    > print "$1\n" if $anchor =~ /href="([^"]*)/;
    > }
     
    Fatted, Aug 20, 2003
    #3
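
The two file-handling tips quoted in this post (check the return value of open(), and read straight into `$line`) can be combined as in the sketch below. It is an illustration under stated assumptions: the sub name `lines_with_image` is made up, a lexical filehandle is used in place of the bareword `FILE`, and a temporary file stands in for `4body.html` so the code runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small stand-in for 4body.html so the sketch is self-contained.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} qq{<p>nothing here</p>\n};
print {$out} qq{<a href="IwantThis.html"><img src="importantimage.gif"></a>\n};
close $out or die "could not close '$path': $!";

# Return every line of the file that mentions the image (case-insensitive).
sub lines_with_image {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "could not open '$file': $!";
    my @hits;
    while ( my $line = <$fh> ) {    # read straight into $line, no copy from $_
        push @hits, $line if $line =~ /importantimage\.gif/i;
    }
    close $fh;
    return @hits;
}

print lines_with_image($path);
```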
  4. fatted

    Tad McClellan Guest

    Fatted <> wrote:
    > "Tad McClellan" <> wrote in message
    > news:...
    >> fatted <> wrote:

    >
    >> You should use a module that understands HTML for processing HTML data.

    >
    > Unfortunately I don't think that will help me with my problem,



    Yes it will. That is why I suggested it.


    > I want to
    > extract the value of a href, for an <a> tag, preceding an <img> tag which
    > has an attribute src with a specific value. I'm not sure what module does
    > this. (I'm going to look again though!)



    I understood what you wanted to do quite clearly, that's why the
    code that I already posted does just what you describe above!

    Did you run the program?


    >> "lines" do not matter in HTML.

    >
    > Thanks for the reminder :)



    But you are going to forget it again before you get to the
    end of your followup...


    > I
    > first wanted to find the line



    If you think of "lines" when processing HTML you aren't thinking
    correctly, and it will hurt you at some point.

    So don't do that. :)


    > which contained the <img src="importantimage.gif" (there just happens to be
    > lots of tags on this line), and then try to find the preceding value of the
    ><a> tags href.



    That is what my code does.


    > I posted 1 line (at least that was the attempt), unfortunately Google Groups
    > did a bit of a hatchet job on it, and it got spread over 4 lines. That's why
    > I referred to one line :)



    Yes I expected that that is what happened.

    Have you seen the Posting Guidelines that are posted here frequently?

    If you had said it "in Perl" then you could have conveyed your
    actual data without "helpful" tools (attempting to) break it for you.


    $html = '<a class="red" href="uninteresting.html" target="_new">'
    . 'Not so exciting text</a><a href="equallyboring.html" '
    . 'class = "blue"> ...';


    >> If that _was_ really all on a single line, then it would still be
    >> equivalent HTML, since most whitespace does not matter in HTML data.



    >> This will NOT do what you asked, because it does not handle
    >> arbitrary HTML, it handles only the one case that you have shown.

    >
    > You're right it won't do what I asked,



    You're wrong, it *will* do what you asked.

    Did you run the program?

    It prints

    IwantThis.html

    isn't that what you wanted to be able to find?

    But it will not work for real-world HTML, only for the specific
    example of HTML that you posted. This legal HTML would break
    it for instance:

    <a class="green" href="Ido*NOT*wantThis.html">
    <!-- src="importantimage.gif" -->
    </a>

    Whereas a Real HTML parser would not report that false positive.


    > I think the google wrap, put you off.



    No it didn't.

    First, my code does exactly what you asked for with the data you gave.
    (and if you modify the data to be all on one line, it will _still_
    do the Right Thing.)

    Did you run the program?

    Secondly, the word-wrapping did *not* break anything, because the
    HTML is equivalent whether wrapped or all on a single line.

    Your code should be able to handle HTML, and line breaks don't matter
    in HTML, so your code should be able to handle the data either way.


    >> It would work correctly if I had used a module that understands
    >> HTML data...

    >
    > See my first comment, but I'd be delighted to be proved wrong.

    ^^^^^^^^^^^^

    I'll do that a little farther down.


    > In the mean
    > time, I'd still appreciate some tips on the regular expression...



    Trying to accomplish what you want with regular expressions is the
    path to madness. You can work on it for many days and it will
    still be easily broken by legal HTML data.

    I know, I've been doing this sort of thing for 13 years.

    Regexes are not sufficiently powerful for the job you need done.

    You need a Real Parser.


    [snip working code]

    You can do it in less than 10 lines of code with HTML::Tree

    http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/


    ---------------------------------------------------------
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $html = '
    <a class="red" href="uninteresting.html" target="_new">Not so exciting
    text</a><a href="equallyboring.html" class = "blue">yawn</a><a
    class="green" href="IwantThis.html"><img border="0"
    src="importantimage.gif" alt="MeMe"></a>
    ';

    # $html =~ s/\n/ /g; # make it all on one line

    my $tree = HTML::TreeBuilder->new();
    $tree->parse($html);

    # find elements containing: src="importantimage.gif"
    foreach my $img ( $tree->look_down('src', 'importantimage.gif') ) {
        next unless $img->tag eq 'img';        # ensure the "src" attr was on
                                               # an <img> element

        next unless $img->parent->tag eq 'a';  # ensure parent is an <a> element
        my $href = $img->parent->attr('href'); # grab its "href" attr value

        print "$href\n";
    }

    $tree->delete;
    ---------------------------------------------------------


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Aug 21, 2003
    #4
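
The false positive described in this post can be checked with core Perl alone. The sketch below wraps the regex loop quoted earlier in the thread in a sub (the name `hrefs_by_regex` is made up) and feeds it the commented-out snippet. No HTML parser is involved, so this only demonstrates the failure mode, not the HTML::TreeBuilder fix.

```perl
use strict;
use warnings;

# The regex approach from earlier in the thread, wrapped in a sub.
sub hrefs_by_regex {
    my ($html) = @_;
    my @found;
    while ( $html =~ m#(<a\s.*?</a>)#sg ) {          # each <a ...>...</a> chunk
        my $anchor = $1;
        next unless $anchor =~ /src="importantimage\.gif"/;
        push @found, $1 if $anchor =~ /href="([^"]*)/;
    }
    return @found;
}

# Legal HTML in which the src text only appears inside a comment.
my $commented = join "\n",
    '<a class="green" href="Ido*NOT*wantThis.html">',
    '<!-- src="importantimage.gif" -->',
    '</a>';

# The regex cannot tell a comment from an <img> element, so it reports the href.
print "$_\n" for hrefs_by_regex($commented);
```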
  5. fatted

    fatted Guest

    (Tad McClellan) wrote in message news:<>...
    > Fatted <> wrote:
    > > "Tad McClellan" <> wrote in message
    > > news:...
    > >> fatted <> wrote:

    >
    > >> You should use a module that understands HTML for processing HTML data.

    > >
    > > Unfortunately I don't think that will help me with my problem,

    >
    >
    > Yes it will. That is why I suggested it.


    Perhaps I meant that I couldn't see *how* it would help with my
    problem :)

    >
    > > I want to
    > > extract the value of a href, for an <a> tag, preceding an <img> tag which
    > > has an attribute src with a specific value. I'm not sure what module does
    > > this. (I'm going to look again though!)

    >
    >
    > I understood what you wanted to do quite clearly, that's why the
    > code that I already posted does just what you describe above!
    >
    > Did you run the program?


    I did, but some idiot copy pasted incorrectly :) When I catch that
    guy...

    >
    > >> "lines" do not matter in HTML.

    > >
    > > Thanks for the reminder :)

    >
    >
    > But you are going to forget it again before you get to the
    > end of your followup...


    Just put the gun down son... No I really do understand how HTML works.
    I talked about a line, because, I am absolutely sure that the <a><img
    /></a> tags which I'm interested in are always on one text line from
    the html file.

    > > I
    > > first wanted to find the line

    >
    >
    > If you think of "lines" when processing HTML you aren't thinking
    > correctly, and it will hurt you at some point.
    >
    > So don't do that. :)


    No more please :)

    >
    >
    > > which contained the <img src="importantimage.gif" (there just happens to be
    > > lots of tags on this line), and then try to find the preceding value of the
    > ><a> tags href.

    >


    <snip>


    > You can do it in less than 10 lines of code with HTML::Tree
    >
    > http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/
    > ---------------------------------------------------------
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    > use HTML::TreeBuilder;
    >
    > my $html = '
    > <a class="red" href="uninteresting.html" target="_new">Not so exciting
    > text</a><a href="equallyboring.html" class = "blue">yawn</a><a
    > class="green" href="IwantThis.html"><img border="0"
    > src="importantimage.gif" alt="MeMe"></a>
    > ';
    >
    > # $html =~ s/\n/ /g; # make it all on one line
    >
    > my $tree = HTML::TreeBuilder->new();
    > $tree->parse($html);
    >
    > # find elements containing: src="importantimage.gif"
    > foreach my $img ( $tree->look_down('src', 'importantimage.gif') ) {
    > next unless $img->tag eq 'img'; # ensure the "src" attr was on
    > # an <img> element
    >
    > next unless $img->parent->tag eq 'a'; # ensure parent is an <a> element
    > my $href = $img->parent->attr('href'); # grab its "href" attr value
    >
    > print "$href\n";
    > }
    >
    > $tree->delete;
    > ---------------------------------------------------------


    Thanks.

    I also figured out what was wrong (Keep the list short :) with the
    regular expression in my original post. I had:

    if($line =~ /<a.+?href="(.+?)".+?src="importantimage\.gif".+?><\/a>/)

    But if I'd tried:

    if($line =~ /<a.+href="(.+?)".+?src="importantimage\.gif".+><\/a>/)

    I would have managed. Although I'll have to think about that a bit
    more.
     
    fatted, Aug 21, 2003
    #5
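
A sketch of why the revised pattern in this post behaves differently, assuming the data is a single line. The sub name `href_greedy` and the second test string are made up for illustration. The greedy `.+` after `<a` runs to the end of the line and backs off to the last `href="` that is still followed by the image, which on the sample line is the wanted one. The same pattern also matches when the image is not inside that anchor at all.

```perl
use strict;
use warnings;

# The revised pattern from the post above, wrapped in a sub.
sub href_greedy {
    my ($text) = @_;
    return $text =~ /<a.+href="(.+?)".+?src="importantimage\.gif".+><\/a>/ ? $1 : undef;
}

# The sample line from the thread, joined back into a single line.
my $line = '<a class="red" href="uninteresting.html" target="_new">Not so exciting text</a>'
         . '<a href="equallyboring.html" class = "blue">yawn</a>'
         . '<a class="green" href="IwantThis.html"><img border="0" src="importantimage.gif" alt="MeMe"></a>';

# Made-up counterexample: the image follows a closed anchor instead of sitting inside it.
my $stray = '<a href="text.html">text</a><img src="importantimage.gif" alt="x">'
          . '<a href="z.html"><img src="z.gif"></a>';

print href_greedy($line), "\n";    # last href before the image on the sample line
print href_greedy($stray), "\n";   # still reports an href, although no anchor wraps the image
```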
