How does it work? (about a while loop with a regex as its condition)

Discussion in 'Perl Misc' started by havel.zhang, Oct 6, 2008.

  1. havel.zhang

    havel.zhang Guest

    Dear Perl gurus,
    I don't understand how this function works. Can you please give me

    The program is very simple:
    open (O,"<z.html");
    @l = <O>;

    foreach (@l) {
        if ($_ =~ /<a\b([^>]+)(.*?)<\/a>/ig){
            $html = $_;
            while($html =~ m{a\b([^>]+)(.*?)</a>}ig){
                my $Guts = $1;
                my $Link = $2;
                print "$Guts\n$Link\n";
            }
        }
    }
    ++++++++z.html content+++++++++++++++++++++
    The content of z.html is:
    <A HREF="">link1</A><A HREF="text.txt">text.txt</
    A><A HREF=
    +++++++and output is:++++++++++++++++++++++++++++

    I want to use this program to pick out hrefs and labels like
    This program works well, but I can't understand the while loop with
    "$html =~ m{a\b([^>]+)(.*?)</a>}ig"
    It works fine, and it's amazing :) Every time, it picks out the pattern "<a
    href=...></a>" and gets the right result. But HOW does it work? I thought it
    would always pick out the first matched pattern.

    Can any Perl guru give me an answer?

    Thank you :)

    havel.zhang, Oct 6, 2008

  2. The documentation can. See 'perldoc perlop', section 'Quote and
    quote-like operators', the two paragraphs beginning with
    "The "/g" modifier specifies global pattern matching--that is, ..."

    However, it is not surprising that you didn't find it. The whole perlop
    man page is about 2000 lines long. That is way too long and complex. It
    is almost impossible to find anything there or to point people to a
    specific part of it. Is someone already working on breaking it down into
    more manageable chunks?
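As a small illustration of the /g behaviour that the perlop passage describes, here is a minimal sketch. The sample string is taken from later in this thread; the pattern and the variable names are only assumptions chosen for the demonstration, not the original poster's code. In scalar context (such as a while condition), each evaluation of the match resumes where the previous successful match ended, which is why the loop walks through every anchor instead of finding the first one over and over.

```perl
use strict;
use warnings;

# Sample input taken from the thread (two anchors on one line).
my $str = '<A HREF="">link1</A><A HREF="text.txt">text.txt</A>';

my @labels;
# In scalar context, each evaluation of a /g match resumes at pos($str),
# i.e. just past the end of the previous successful match.
while ($str =~ m{<a\b[^>]*>(.*?)</a>}ig) {
    push @labels, $1;
    printf "label '%s', next search starts at offset %d\n", $1, pos($str);
}

# Once the match fails, the loop ends and pos($str) is reset to undef.
print defined(pos($str)) ? "pos is still set\n" : "pos was reset\n";
```

Without the /g modifier the same while loop would never terminate on a matching string, because every evaluation would start again from offset 0.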

    Jürgen Exner, Oct 6, 2008

  3. havel.zhang

    sln Guest

    ^^ might need a while here
    does the same thing as above, could even add the '<'
    the modifier 'g' will continue the match until the end of string.

    The problem is the first 'if' regex will only match the first occurrence.
    It does the same as the inner match, except only once. Why do you need the outer 'if'?
    use strict;
    use warnings;

    my $str = '<A HREF="">link1</A><A HREF="text.txt">text.txt</A><A HREF="fes.iso">fes.iso</A>';

    print "Output from 'if \$str':\n---------------\n";
    if ($str =~ /(<a\b([^>]+)(.*?)<\/a>)/ig) {
        print "found: '$1'\n\n";
        my $html = $1;
        while ($html =~ m{a\b([^>]+)(.*?)</a>}ig) {
            my $Guts = $1;
            my $Link = $2;
            print "$Guts\n$Link\n";
        }
    }

    pos ($str) = 0;   # restart the next /g search at the beginning of $str

    print "\n\nOutput from 'while \$str':\n---------------\n";
    while ($str =~ /(<a\b([^>]+)(.*?)<\/a>)/ig) {
        print "found: '$1'\n\n";
        my $html = $1;
        while ($html =~ m{a\b([^>]+)(.*?)</a>}ig) {
            my $Guts = $1;
            my $Link = $2;
            print "$Guts\n$Link\n";
        }
    }

    pos ($str) = 0;

    print "\n\nOutput from just 'while \$html':\n---------------\n";
    while ($str =~ m{<a\s*([^>]+)(.*?)</a\s*>}ig) {
        my $Guts = $1;
        my $Link = $2;
        print "$Guts\n$Link\n";
    }


    Output from 'if $str':
    found: '<A HREF="">link1</A>'


    Output from 'while $str':
    found: '<A HREF="">link1</A>'

    found: '<A HREF="text.txt">text.txt</A>'

    found: '<A HREF="fes.iso">fes.iso</A>'


    Output from just 'while $html':

    In general it doesn't work fine. You can run into problems if the phrase you're
    looking for spans lines. Another problem is that your regex does not account for
    legal whitespace.

    The better regex would be: "while ( m{<a\s*([^>]+)(.*?)</a\s*>}ig ) {}"

    It's always good to have delimiters surrounding what you are trying to match.
    In your case that is '<a ...></a>', the 'a' tags being the delimiters.

    This will grab inner non-'a' tags; nested 'a' tags, however, will not work.
    Because of nesting, HTML/XML can't be parsed this way, by seeking the end delimiter.
    But in your case it should be OK.
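To make the "spans lines" problem concrete, here is a minimal sketch. The input string and the pattern are assumptions made up for this illustration. Without the /s modifier, "." does not match a newline, so an anchor whose label is broken across two lines is silently skipped; the same kind of miss happens when a file is read and matched one line at a time.

```perl
use strict;
use warnings;

# Hypothetical input: the label of the anchor contains a line break.
my $html = qq{<A HREF="a.txt">first\nlabel</A>};

# In list context, a /g match returns every captured group at once.
my @without_s = $html =~ m{<a\b[^>]*>(.*?)</a\s*>}ig;    # "." stops at "\n"
my @with_s    = $html =~ m{<a\b[^>]*>(.*?)</a\s*>}igs;   # /s lets "." match "\n"

printf "without /s: %d match(es)\n", scalar @without_s;
printf "with /s:    %d match(es)\n", scalar @with_s;
```

Reading the whole file into one string and adding /s covers this case, but it still does nothing for nested or malformed tags.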

    In general, should you need to do specific parsing, you should get a parser that
    captures groups of phrases, which you can then parse reliably.

    use strict;
    use warnings;

    use RXParse; # VERSION 2

    my $p = new RXParse();
    $p->setMode( 'html' => 1, 'resume_onerror'=> 1 );
    my %oldh = $p->setHandlers('start' => \&starth, 'end' => \&endh);

    # start-tag handler: calls CaptureOn for 'a' elements
    sub starth {
        my ($obj, $el, $term, @attr) = @_;
        my $buffer = lc($el);
        $obj->CaptureOn( $buffer ) if ($buffer eq 'a');
    }

    # end-tag handler: calls CaptureOff for 'a' elements
    sub endh {
        my ($obj, $el, $term) = @_;
        my $buffer = lc($el);
        $obj->CaptureOff( $buffer, 1 ) if ($buffer eq 'a');
    }

    open my $fh, '<', 'c:\temp\z.html' or die "can't open z.html: $!";
    close $fh;

    # get and parse capture buffer 'a'
    # ....

    # display 'a'


    BUFFER: a
    index seqence
    ----- --------
    [0] 1 <A HREF="">link1</A>
    [1] 2 <A HREF="text.txt">text.txt</A>
    [2] 3 <A HREF="fes.iso">fes.iso</A>
    sln, Oct 6, 2008
  4. havel.zhang

    Dr.Ruud Guest

    Jürgen Exner wrote:

    You could generate something like


    =over 2

    =item L</Operator Precedence and Associativity>

    =item L</Terms and List Operators (Leftward)>

    =item L</The Arrow Operator>

    =item etc. etc.

    =back

    before the "=head1 DESCRIPTION" line,

    and use

    perldoc -oHtml perlop | lynx -stdin

    to have a viewer that is easier to navigate.

    Something like "info" would also be nicer than the default man view.

    Or use
    Dr.Ruud, Oct 6, 2008
  5. havel.zhang

    havel.zhang Guest

    Thank you, jue:
    After I posted my question to the newsgroup, I found the answer in a Perl
    book. That book explains how a regex with the /g modifier behaves
    as the condition in a while loop, as you pointed out above. It's so easy and
    Thank you again :)

    havel.zhang, Oct 7, 2008
  6. havel.zhang

    sln Guest

    I tried but you didn't listen.
    The function does not work well for what you are doing.
    Not at all, never will.

    No you don't, this is not how to do it. It fails easily.
    The answer was given in detail, and at great time expense.
    Next time there will be no answer.

    sln, Oct 7, 2008
  7. havel.zhang

    Tim Greer Guest

    It steps through the @l array, and for each element within it, it checks
    $_ (which is by default the value of the for/foreach/while, so you
    don't actually need to declare it).

    It then checks whether $_ contains an opening HTML tag that starts with
    "a", which is most likely an anchor (hot link), with a word
    boundary \b to ensure it's not some other tag that starts with "a",
    such as <applet> (just an example). It takes anything that's not the
    closing ">" of the tag and captures it into $1. Then it captures anything
    else between that last match and the ending anchor tag (</a> -- seen as
    <\/a>) into $2. It does this check globally and
    without regard to letter case. Of course, that regex doesn't make sense, and
    neither does the check, to be honest, but no matter.
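The role of \b described above can be checked with a short sketch. The list of tags is made up for the illustration, and the two patterns are deliberately reduced to the part under discussion.

```perl
use strict;
use warnings;

# Hypothetical opening tags, some of which merely start with the letter "a".
my @tags = ('<a href="x.html">', '<applet code="y">', '<A HREF="z.html">', '<abbr title="t">');

# With \b, the "a" must be a whole word, so only real anchor tags pass.
my @with_boundary    = grep { /<a\b/i } @tags;
# Without \b, every tag whose name begins with "a" passes.
my @without_boundary = grep { /<a/i } @tags;

print "with \\b:    @with_boundary\n";
print "without \\b: @without_boundary\n";
```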

    After the above check, which I assume is there to see whether the line
    contains a matching anchor tag, it assigns $_ to the $html variable and
    runs a while loop that, case insensitively and globally, checks for the
    exact same thing it just did above. It assigns the first and second
    captures ($1 and $2, respectively) to $Guts and $Link and
    prints them out. The above code really isn't very good and doesn't make
    sense: it repeats things that can be done in one check, it captures
    values it's never going to use, etc. It should instead use just the
    one regex, and even that one is not correct. It should be


    Notice the addition of ">" between ([^>]+) and (.*?). Otherwise $2 will
    always start with ">" (is that what you want?). It would also match
    invalid values when checking the anchor tag, which doesn't seem like
    it would do any good. If it works, great, but there is some wasted
    processing and there are bugs, so you should expect the unexpected if you run it
    against many HTML files.
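The effect of that extra ">" can be seen in a minimal sketch. The second pattern below is only a reconstruction from the description above (the original poster's pattern with ">" added between the two groups), and the sample string is borrowed from earlier in the thread.

```perl
use strict;
use warnings;

my $str = '<A HREF="">link1</A><A HREF="text.txt">text.txt</A>';

# Original pattern: [^>]+ stops just before ">", so ">" ends up in the second group.
my @loose = $str =~ m{<a\b([^>]+)(.*?)</a>}ig;
# With ">" matched explicitly, the second group holds only the link text.
my @tight = $str =~ m{<a\b([^>]+)>(.*?)</a>}ig;

print "original pattern:  [$loose[1]] [$loose[3]]\n";
print "with explicit '>': [$tight[1]] [$tight[3]]\n";
```

In list context each match contributes its two captures in order, so the odd-numbered elements of each array are the link texts.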
    Tim Greer, Oct 7, 2008
