How does it work? (about a while loop with a regex as its condition)

Discussion in 'Perl Misc' started by havel.zhang, Oct 6, 2008.

  1. havel.zhang

    havel.zhang Guest

    Dear Perl gurus,
    I don't understand how this program works. Can you please give me a
    further explanation?

    the program is very simple:
    +++++++++++++program++++++++++++++++++++++
    open (O,"<z.html");
    @l = <O>;
    close(O);

    foreach(@l){
        if ($_ =~ /<a\b([^>]+)(.*?)<\/a>/ig){
            $html=$_;
            while($html =~ m{a\b([^>]+)(.*?)</a>}ig){
                my $Guts = $1;
                my $Link = $2;
                print "$Guts\n$Link\n";
            }
        }
    };
    ++++++++z.html content+++++++++++++++++++++
    the z.html 's content is:
    <A HREF="http://10.123.111.11">link1</A><A HREF="text.txt">text.txt</A><A HREF="fes.iso">fes.iso</A>
    +++++++and output is:++++++++++++++++++++++++++++
    HREF="http://10.123.111.11"
    >link1

    HREF="text.txt"
    >text.txt

    HREF="fes.iso"
    >fes.iso

    ++++++++end+++++++++++++++++++++++++++++++++

    I want to use this program to pick out the hrefs and labels, like
    "link1", "text.txt", "fes.iso".
    This program works well, but I can't understand the while loop with
    the regex:
    "$html =~ m{a\b([^>]+)(.*?)</a>}ig"
    ^^^^^^^^^^^^^^^^^^^^^^^
    It works fine, and it's so amazing :) Every time, it picks out the
    pattern "<a href=...></a>" and gets the right result. But HOW does it
    work? I thought it would always pick out the first matched pattern.

    Can any Perl guru give me an answer?

    Thank you :)

    Havel
    havel.zhang, Oct 6, 2008
    #1

  2. "havel.zhang" <> wrote:
    [...]
    >This program works well, but i can't understand the while loop with
    >regex:
    > "$html =~ m{a\b([^>]+)(.*?)</a>}ig"
    > ^^^^^^^^^^^^^^^^^^^^^^^
    >It works fine, and it's so amazing :) Every time, it picks out the
    >pattern "<a href=...></a>" and gets the right result. But HOW does it
    >work? I thought it would always pick out the first matched pattern.
    >
    >Can any Perl guru give me an answer?


    The documentation can. See 'perldoc perlop', section 'Quote and
    quote-like operators', the two paragraphs beginning with
    "The "/g" modifier specifies global pattern matching--that is, ..."

    However, it is not surprising that you didn't find it. The whole perlop
    man page is about 2000 lines long. That is way too long and complex. It
    is almost impossible to find anything there or to point people to a
    specific part of it. Is someone already working on breaking it down into
    more manageable chunks?

    jue
    Jürgen Exner, Oct 6, 2008
    #2

  3. havel.zhang

    Guest

    On Mon, 6 Oct 2008 02:41:30 -0700 (PDT), "havel.zhang" <> wrote:

    >dear perl-gurus,
    >i don't understand how this function works. can you please give me
    >further
    >explanation:
    >
    >the program is very simple:
    >+++++++++++++program++++++++++++++++++++++
    >open (O,"<z.html");
    >@l = <O>;
    >close(O);
    >
    >foreach(@l){
    > if ($_ =~ /<a\b([^>]+)(.*?)<\/a>/ig){

    ^^ might need a while here

    > $html=$_;
    > while($html =~ m{a\b([^>]+)(.*?)</a>}ig){

    This does the same thing as the match above; you could even add the '<':
    m{<a\b([^>]+)(.*?)</a>}ig
    The 'if ($_ =~ /...' test is not needed.

    > my $Guts = $1;
    > my $Link = $2;
    > print "$Guts\n$Link\n";
    > }
    > }
    >};
    >++++++++z.html content+++++++++++++++++++++
    >the z.html 's content is:
    > <A HREF="http://10.123.111.11">link1</A><A HREF="text.txt">text.txt</
    >A><A HREF=
    >"fes.iso">fes.iso</A>
    >+++++++and output is:++++++++++++++++++++++++++++
    > HREF="http://10.123.111.11"
    >>link1

    > HREF="text.txt"
    >>text.txt

    > HREF="fes.iso"
    >>fes.iso

    >++++++++end+++++++++++++++++++++++++++++++++
    >
    >I want to use this program to pick out the hrefs and labels, like
    >"link1", "text.txt", "fes.iso".
    >This program works well, but i can't understand the while loop with
    >regex:
    > "$html =~ m{a\b([^>]+)(.*?)</a>}ig"
    > ^^^^^^^^^^^^^^^^^^^^^^^

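    In scalar context, the 'g' modifier makes each evaluation of the match resume
    where the previous match ended, so the while loop steps through every match
    until the end of the string.

    The problem is that the first 'if' regex will only match the first occurrence.
    It does the same as the inner match, except only once. Why do you need the
    outer 'if' then?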
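To make the scalar-context /g behaviour concrete, here is a minimal sketch. The sample string and the @seen array are made up for illustration, and the pattern includes the leading '<' and the '>' suggested elsewhere in this thread. It records where each match ends using pos(), and shows that the position is cleared once the loop's final match attempt fails.

```perl
use strict;
use warnings;

# Made-up sample input with two links on one line.
my $str = '<A HREF="a.txt">one</A><A HREF="b.txt">two</A>';

# Scalar-context //g: each evaluation resumes at pos($str), the offset
# where the previous successful match ended.
my @seen;
while ($str =~ m{<a\b([^>]+)>(.*?)</a>}ig) {
    push @seen, "$2 ends at offset " . pos($str);
}
print "$_\n" for @seen;

# After the failed final attempt, pos($str) is reset to undef, so a
# later //g match on the same string starts from the beginning again.
print defined pos($str) ? "pos still set\n" : "pos reset\n";
```

This is also why an 'if' with /g only ever sees the first match: it evaluates the match once and never comes back for the next position.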

    >it's works fine, and so amazing:) everytime, it's pick out patten "<a
    >href=...></a>" and get right result. But HOW does it work? I think it
    >will always pick out the first matched patten.
    >
    >Can any perl guru give me answer?
    >
    >Thank you :)
    >
    >Havel
    >

    use strict;
    use warnings;

    my $str = '<A HREF="http://10.123.111.11">link1</A><A HREF="text.txt">text.txt</A><A HREF="fes.iso">fes.iso</A>';

    print "Output from 'if \$str':\n---------------\n";
    if ($str =~ /(<a\b([^>]+)(.*?)<\/a>)/ig)
    {
        print "found: '$1'\n\n";
        my $html = $1;
        while ($html =~ m{a\b([^>]+)(.*?)</a>}ig)
        {
            my $Guts = $1;
            my $Link = $2;
            print "$Guts\n$Link\n";
        }
    }

    pos ($str) = 0;

    print "\n\nOutput from 'while \$str':\n---------------\n";
    while ($str =~ /(<a\b([^>]+)(.*?)<\/a>)/ig)
    {
        print "found: '$1'\n\n";
        my $html = $1;
        while ($html =~ m{a\b([^>]+)(.*?)</a>}ig)
        {
            my $Guts = $1;
            my $Link = $2;
            print "$Guts\n$Link\n";
        }
    }

    pos ($str) = 0;

    print "\n\nOutput from just 'while \$html':\n---------------\n";
    while ($str =~ m{<a\s*([^>]+)(.*?)</a\s*>}ig)
    {
        my $Guts = $1;
        my $Link = $2;
        print "$Guts\n$Link\n";
    }

    __END__


    Output from 'if $str':
    ---------------
    found: '<A HREF="http://10.123.111.11">link1</A>'

    HREF="http://10.123.111.11"
    >link1



    Output from 'while $str':
    ---------------
    found: '<A HREF="http://10.123.111.11">link1</A>'

    HREF="http://10.123.111.11"
    >link1

    found: '<A HREF="text.txt">text.txt</A>'

    HREF="text.txt"
    >text.txt

    found: '<A HREF="fes.iso">fes.iso</A>'

    HREF="fes.iso"
    >fes.iso



    Output from just 'while $html':
    ---------------
    HREF="http://10.123.111.11"
    >link1

    HREF="text.txt"
    >text.txt

    HREF="fes.iso"
    >fes.iso



    In general it doesn't work fine. You can run into problems if the phrase you're
    looking for spans lines. Also problematic is that your regex does not account
    for legal whitespace.

    The better regex would be: "while ( m{<a\s*([^>]+)(.*?)</a\s*>}ig ) {}"

    It's always good to have delimiters surrounding what you are trying to match.
    In your case, for '<a ...></a>', the 'a' tags are the delimiters.

    This will grab inner non-'a' tags; nested 'a' tags, however, will not work.
    Because of nesting, HTML/XML can't be parsed this way, by seeking the end
    delimiter. But in your case it should be OK.
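As a rough sketch of the "spans lines" and whitespace points, the following matches against the whole document held in one string instead of line by line. The sample HTML, the @links array, and the use of \b with [^>]* are assumptions for illustration, not the code from the post above. With a real file you would first read it into a single scalar (for example by localizing $/).

```perl
use strict;
use warnings;

# Made-up sample: the second link wraps inside the opening tag, and its
# closing tag has whitespace before the ">".
my $html = <<'END_HTML';
<A HREF="one.txt">one</A>
<A
   HREF="two.txt">two
</A >
END_HTML

my @links;
# /s lets "." match newlines; [^>]* and \s* already cross line breaks.
while ($html =~ m{<a\b([^>]*)>(.*?)</a\s*>}isg) {
    my ($attrs, $label) = ($1, $2);
    s/^\s+|\s+$//g for $attrs, $label;   # trim surrounding whitespace
    push @links, [$attrs, $label];
}
print "$_->[0] => $_->[1]\n" for @links;
```

This still fails on nested or malformed markup, for the reasons given above; it only removes the dependence on where the line breaks happen to fall.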

    In general, should you need to do specific parsing, you should get a parser that
    captures groups of phrases, from which you can parse with reliability.


    ==================================================
    use strict;
    use warnings;

    use RXParse; # VERSION 2

    my $p = new RXParse();
    $p->setMode( 'html' => 1, 'resume_onerror'=> 1 );
    my %oldh = $p->setHandlers('start' => \&starth, 'end' => \&endh);

    sub starth
    {
        my ($obj, $el, $term, @attr) = @_;
        my $buffer = lc($el);
        $obj->CaptureOn( $buffer ) if ($buffer eq 'a');
    }
    sub endh
    {
        my ($obj, $el, $term) = @_;
        my $buffer = lc($el);
        $obj->CaptureOff( $buffer, 1 ) if ($buffer eq 'a');
    }

    open my $fh, '<', 'c:\temp\z.html' or die "can't open z.html: $!";
    $p->parse($fh);
    close $fh;

    # get and parse capture buffer 'a'
    # ....

    # display 'a'
    $p->DumpCaptureBuffs();


    __END__


    BUFFER: a
    =====================================
    index seqence
    ----- --------
    [0] 1 <A HREF="http://10.123.111.11">link1</A>
    [1] 2 <A HREF="text.txt">text.txt</A>
    [2] 3 <A HREF="fes.iso">fes.iso</A>
    , Oct 6, 2008
    #3
  4. havel.zhang

    Dr.Ruud Guest

    Jürgen Exner wrote:

    > The whole
    > perlop man page is about 2000 lines long. That is way too long and
    > complex. It is almost impossible to find anything there or to point
    > people to specific part of it. Is someone already working on breaking
    > it down into more manageable chunks?



    You could generate something like

    -------------------------
    =head2 TABLE OF CONTENTS

    =over 2

    =item L</Operator Precedence and Associativity>

    =item L</Terms and List Operators (Leftward)>

    =item L</The Arrow Operator>

    =item etc. etc.

    =back

    -------------------------

    before the "=head1 DESCRIPTION" line,

    and use

    perldoc -oHtml perlop | lynx -stdin

    to have a viewer that is easier to navigate.

    Something like "info" would also be nicer than the default man view.

    Or use http://perldoc.perl.org/perlop.html
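A table of contents like the one above could be generated with a few lines of Perl. This is only a sketch: build_toc is a hypothetical helper name, the sample POD is made up, and titles containing characters that are special inside L<> are not escaped. It uses a /g match in list context, which returns all captures at once rather than stepping through them in a loop.

```perl
use strict;
use warnings;

# Build a POD "TABLE OF CONTENTS" block from the =head2 lines of a POD
# document. build_toc is a hypothetical helper, not part of perldoc.
sub build_toc {
    my ($pod) = @_;
    # List-context //g with /m: collect every =head2 title in one go.
    my @titles = $pod =~ /^=head2[ \t]+(.+?)[ \t]*$/mg;
    return join "\n\n",
        '=head2 TABLE OF CONTENTS',
        '=over 2',
        (map { "=item L</$_>" } @titles),
        '=back',
        '';
}

# Made-up sample input, assembled with join so that no source line
# starts with a POD directive.
my $pod = join "\n",
    '=head1 DESCRIPTION',
    '',
    '=head2 Operator Precedence and Associativity',
    '',
    '=head2 The Arrow Operator',
    '';

print build_toc($pod);
```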

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Oct 6, 2008
    #4
  5. havel.zhang

    havel.zhang Guest

    On Oct 6, 9:34 pm, Jürgen Exner <> wrote:
    > "havel.zhang" <> wrote:
    >
    > [...]
    >
    > >This program works well, but i can't understand the  while loop with
    > >regex:
    > >                "$html =~ m{a\b([^>]+)(.*?)</a>}ig"
    > >                 ^^^^^^^^^^^^^^^^^^^^^^^
    > >It works fine, and it's so amazing :) Every time, it picks out the
    > >pattern "<a href=...></a>" and gets the right result. But HOW does it
    > >work? I thought it would always pick out the first matched pattern.

    >
    > >Can any perl guru give me answer?

    >
    > The documentation can. See 'perldoc perlop', section 'Quote and
    > quote-like operators', the two paragraphs beginning with
    > "The "/g" modifier specifies global pattern matching--that is, ..."
    >
    > However, it is not surprising that you didn't find it. The whole perlop
    > man page is about 2000 lines long. That is way too long and complex. It
    > is almost impossible to find anything there or to point people to a
    > specific part of it. Is someone already working on breaking it down into
    > more manageable chunks?
    >
    > jue


    Thank you, jue:
    After I posted my question to the newsgroup, I found the answer in a Perl
    book. The book explains how a regex with the /g modifier behaves when it
    is used as the condition of a while loop, as you pointed out above. It's
    so easy and amazing :)
    Thank you again :)

    Havel
    havel.zhang, Oct 7, 2008
    #5
  6. havel.zhang

    Guest

    On Mon, 6 Oct 2008 02:41:30 -0700 (PDT), "havel.zhang" <> wrote:

    >dear perl-gurus,
    >i don't understand how this function works. can you please give me
    >further
    >explanation:
    >

    I tried but you didn't listen.
    The function does not work well for what you are doing.
    Not at all, never will.

    <snip>

    >I want to use this program to pick out the hrefs and labels, like
    >"link1", "text.txt", "fes.iso".


    No you don't, this is not how to do it. It fails easily.

    >Can any perl guru give me answer?


    The answer was given in detail, and at great time expense.
    Next time there will be no answer.

    sln
    , Oct 7, 2008
    #6
  7. havel.zhang

    Tim Greer Guest

    havel.zhang wrote:

    > foreach(@l){
    > if ($_ =~ /<a\b([^>]+)(.*?)<\/a>/ig){
    > $html=$_;
    > while($html =~ m{a\b([^>]+)(.*?)</a>}ig){
    > my $Guts = $1;
    > my $Link = $2;
    > print "$Guts\n$Link\n";
    > }
    > }


    It steps through the @l array and, for each element, checks $_ (the
    default loop variable of for/foreach, so you don't actually need to
    name it in the match).

    It then checks whether $_ contains an opening HTML tag that starts with
    "a", most likely an anchor (hot link), with a word boundary \b to
    ensure it's not some other tag that starts with "a", such as <applet>
    (just an example). It takes anything that's not the end of the opening
    tag (>) and captures it into $1. Then it captures anything else between
    that last match and the closing anchor tag (</a>, written as <\/a>)
    into $2. It does this check globally and without regard to letter
    case. Of course, that regex doesn't make sense, and neither does the
    check, to be honest, but no matter.

    After the above check, which I assume is there to see whether the line
    has a matching anchor tag at all, it assigns the $html variable the
    value of $_. It then runs a while loop that, case-insensitively and
    globally, checks for the exact same thing it just did above, assigns
    the first and second captures ($1 and $2, respectively) to the $Guts
    and $Link variables, and prints them out. The above code really isn't
    very good and doesn't make sense: it repeats things that can be done in
    one check, it captures values it's never going to use, etc. It should
    instead just use the one match, and even that one is not correct. It
    should be

    m{a\b([^>]+)>(.*?)</a>}ig

    Notice the addition of ">" between ([^>]+) and (.*?). Otherwise $2 will
    always start with ">" (is that what you want?). It also would match any
    invalid values when checking the anchor tag, which doesn't seem like
    it would do any good. If it works, great, but there is some wasted
    processing and there are bugs, so you should expect the unexpected if
    you run it against many HTML files.
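The effect of the added ">" is easy to check on one of the links from the original post. This is a small sketch; the variable names are made up, and the second pattern also adds the leading '<' suggested earlier in the thread.

```perl
use strict;
use warnings;

my $str = '<A HREF="text.txt">text.txt</A>';

# Original pattern: nothing consumes the ">" that closes the opening
# tag, so the lazy (.*?) picks it up as the first character of $2.
my ($guts_old, $link_old) = $str =~ m{a\b([^>]+)(.*?)</a>}i;

# With a literal ">" between the groups, $2 holds only the link text.
my ($guts_new, $link_new) = $str =~ m{<a\b([^>]+)>(.*?)</a>}i;

print "without '>': [$link_old]\n";
print "with '>':    [$link_new]\n";
```

That leading ">" is exactly what shows up in the ">link1" lines of the original output.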
    --
    Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
    Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
    and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
    Industry's most experienced staff! -- Web Hosting With Muscle!
    Tim Greer, Oct 7, 2008
    #7