How does it work? (about a while loop with a regex as its condition)

havel.zhang · Oct 6, 2008

Dear Perl gurus,
I don't understand how this function works. Can you please give me a
further explanation?

The program is very simple:
+++++++++++++program++++++++++++++++++++++
open (O,"<z.html");
@l = <O>;
close(O);

foreach(@l){
if ($_ =~ /<a\b([^>]+)(.*?)<\/a>/ig){
$html=$_;
while($html =~ m{a\b([^>]+)(.*?)</a>}ig){
my $Guts = $1;
my $Link = $2;
print "$Guts\n$Link\n";
}
}
};
++++++++z.html content+++++++++++++++++++++
The content of z.html is:
<A HREF="http://10.123.111.11">link1</A><A HREF="text.txt">text.txt</A><A HREF="fes.iso">fes.iso</A>
+++++++and output is:++++++++++++++++++++++++++++
 HREF="http://10.123.111.11"
>link1
 HREF="text.txt"
>text.txt
 HREF="fes.iso"
>fes.iso

++++++++end+++++++++++++++++++++++++++++++++

I want to use this program to pick out the hrefs and labels, like
"link1", "text.txt", "fes.iso".
This program works well, but I can't understand the while loop with the
regex:
"$html =~ m{a\b([^>]+)(.*?)</a>}ig"
^^^^^^^^^^^^^^^^^^^^^^^
It works fine, and it's amazing.

Every time through the loop it picks out the next "<a href=...></a>"
pattern and gets the right result. But HOW does it work? I thought it
would always pick out the first matching pattern.

Can any Perl guru give me an answer?

Thank you

Havel

Jürgen Exner · Oct 6, 2008

havel.zhang said:
This program works well, but I can't understand the while loop with the
regex:
"$html =~ m{a\b([^>]+)(.*?)</a>}ig"
^^^^^^^^^^^^^^^^^^^^^^^
It works fine, and it's amazing. Every time through the loop it picks
out the next "<a href=...></a>" pattern and gets the right result. But
HOW does it work? I thought it would always pick out the first matching
pattern.

Can any Perl guru give me an answer?

The documentation can. See 'perldoc perlop', section 'Quote and
quote-like operators', the two paragraphs beginning with
"The "/g" modifier specifies global pattern matching--that is, ..."

However, it is not surprising that you didn't find it. The whole perlop
man page is about 2000 lines long. That is way too long and complex. It
is almost impossible to find anything there or to point people to a
specific part of it. Is someone already working on breaking it down into
more manageable chunks?
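
To make the /g behaviour described in those perlop paragraphs concrete, here is a minimal sketch. The sample string is the three links from the original post, and the variable names are only for illustration. In scalar context (such as a while condition), each evaluation of a /g match finds one match and resumes where the previous one ended; pos() reports that position, and the loop ends when no further match is found.

```perl
use strict;
use warnings;

# Sample data: the three links from the original post, on one line.
my $html = '<A HREF="http://10.123.111.11">link1</A>'
         . '<A HREF="text.txt">text.txt</A>'
         . '<A HREF="fes.iso">fes.iso</A>';

my @labels;      # link text found so far
my @positions;   # pos($html) after each successful match

# In scalar context, m//g finds ONE match per evaluation and remembers
# where it stopped; the next evaluation resumes from that position.
while ($html =~ m{<a\b[^>]*>(.*?)</a>}ig) {
    push @labels, $1;
    push @positions, pos($html);
    print "matched '$1', next search starts at offset ", pos($html), "\n";
}

# After the final failed match, pos() is reset to undef, so a new /g
# match on $html would start from the beginning again.
print "pos after loop: ", (defined(pos($html)) ? pos($html) : 'undef'), "\n";
```

Without the /g modifier the condition would match the first link every time, and the loop would never terminate.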

jue

sln · Oct 6, 2008

havel.zhang said:
foreach(@l){
if ($_ =~ /<a\b([^>]+)(.*?)<\/a>/ig){

^^ You might need a while here.

havel.zhang said:
$html=$_;
while($html =~ m{a\b([^>]+)(.*?)</a>}ig){

This does the same thing as the match above; you could even add the '<'.

havel.zhang said:
This program works well, but i can't understand the while loop with
regex:
"$html =~ m{a\b([^>]+)(.*?)</a>}ig"
^^^^^^^^^^^^^^^^^^^^^^^

With the 'g' modifier, each evaluation of the match continues from where the
previous one ended, until the end of the string is reached.

The problem is that the first 'if' regex will only match the first occurrence.
It does the same as the inner match, except only once. Why do you need the outer 'if',
then?

use strict;
use warnings;

my $str = '<A HREF="http://10.123.111.11">link1</A><A HREF="text.txt">text.txt</A><A HREF="fes.iso">fes.iso</A>';

# 'if' evaluates the /g match only once, so only the first link is found
print "Output from 'if \$str':\n---------------\n";
if ($str =~ /(<a\b([^>]+)(.*?)<\/a>)/ig)
{
    print "found: '$1'\n\n";
    my $html = $1;
    while ($html =~ m{a\b([^>]+)(.*?)</a>}ig)
    {
        my $Guts = $1;
        my $Link = $2;
        print "$Guts\n$Link\n";
    }
}

pos ($str) = 0;   # restart the next /g match from the beginning

# 'while' re-evaluates the /g match, so each link is found in turn
print "\n\nOutput from 'while \$str':\n---------------\n";
while ($str =~ /(<a\b([^>]+)(.*?)<\/a>)/ig)
{
    print "found: '$1'\n\n";
    my $html = $1;
    while ($html =~ m{a\b([^>]+)(.*?)</a>}ig)
    {
        my $Guts = $1;
        my $Link = $2;
        print "$Guts\n$Link\n";
    }
}

pos ($str) = 0;

# a single 'while' is enough; the outer match is not needed
print "\n\nOutput from just 'while \$html':\n---------------\n";
while ($str =~ m{<a\s*([^>]+)(.*?)</a\s*>}ig)
{
    my $Guts = $1;
    my $Link = $2;
    print "$Guts\n$Link\n";
}

__END__

Output from 'if $str':
---------------
found: '<A HREF="http://10.123.111.11">link1</A>'

 HREF="http://10.123.111.11"
>link1


Output from 'while $str':
---------------
found: '<A HREF="http://10.123.111.11">link1</A>'

 HREF="http://10.123.111.11"
>link1
found: '<A HREF="text.txt">text.txt</A>'

 HREF="text.txt"
>text.txt
found: '<A HREF="fes.iso">fes.iso</A>'

 HREF="fes.iso"
>fes.iso


Output from just 'while $html':
---------------
HREF="http://10.123.111.11"
>link1
HREF="text.txt"
>text.txt
HREF="fes.iso"
>fes.iso

In general it doesn't work fine. You can run into problems if the phrase you're
looking for spans lines. Another problem is that your regex does not account for
legal whitespace.

A better regex would be: "while ( m{<a\s*([^>]+)(.*?)</a\s*>}ig ) {}"
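
To illustrate the line-spanning problem, here is a small sketch. The input text is made up for the example, and adding the /s modifier is my assumption about one way to cope with it, not part of the regex above. It counts links once by matching each line separately and once by matching the whole text as a single string.

```perl
use strict;
use warnings;

# Hypothetical input: the second link is broken across several lines.
my $text = qq{<A HREF="a.txt">first</A>\n<A\nHREF="b.txt">second\nlink</A>\n};

# Line-by-line matching, as in the original program: each line is
# searched on its own, so a link that spans lines is never seen whole.
my $per_line = 0;
for my $line (split /^/, $text) {
    $per_line++ while $line =~ m{<a\s*([^>]+)(.*?)</a\s*>}ig;
}

# Matching against the whole text at once. The /s modifier lets '.'
# match newlines, so the link text may contain line breaks.
my $whole = 0;
$whole++ while $text =~ m{<a\s*([^>]+)(.*?)</a\s*>}igs;

print "line by line: $per_line link(s)\n";
print "whole text:   $whole link(s)\n";
```

With a real file, the whole-text variant needs the contents in one scalar, for example via "my $text = do { local $/; <$fh> };" instead of reading into an array of lines.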

It's always good to have delimiters surrounding what you are trying to match.
In your case, '<a ...></a>', the 'a' tags are the delimiters.

This will grab inner non-'a' tags; nested 'a' tags, however, will not work.
Because of nesting, HTML/XML in general can't be parsed this way, by seeking the end delimiter.
But in your case it should be OK.

In general, should you need to do specific parsing, you should get a parser that
captures groups of phrases, which you can then parse reliably.
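
For readers without the module used below, here is a rough sketch of the same idea with a publicly available parser. It assumes HTML::TokeParser (part of the HTML-Parser distribution on CPAN) is installed; it is a swapped-in alternative, not the module this post uses, and extract_links is a hypothetical helper name.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # CPAN module (HTML-Parser distribution); assumed installed

# Return a list of [href, label] pairs, one for every <a> element in $html.
sub extract_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "can't create parser";
    my @links;
    # get_tag('a') skips ahead to the next <a ...> start tag, if any.
    while (my $tag = $parser->get_tag('a')) {
        my $href  = $tag->[1]{href};                  # attribute names are lowercased
        my $label = $parser->get_trimmed_text('/a');  # text up to the closing </a>
        push @links, [ $href, $label ];
    }
    return @links;
}

# Sample data: the three links from the original post.
my $html = '<A HREF="http://10.123.111.11">link1</A>'
         . '<A HREF="text.txt">text.txt</A>'
         . '<A HREF="fes.iso">fes.iso</A>';

for my $link (extract_links($html)) {
    my ($href, $label) = @$link;
    $href = '(no href)' unless defined $href;
    print "$href\t$label\n";
}
```

Because the parser tokenizes the markup itself, whitespace inside tags, attribute order, and links that span lines need no special handling in the calling code.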

==================================================
use strict;
use warnings;

use RXParse; # VERSION 2

my $p = new RXParse();
$p->setMode( 'html' => 1, 'resume_onerror'=> 1 );
my %oldh = $p->setHandlers('start' => \&starth, 'end' => \&endh);

sub starth
{
    my ($obj, $el, $term, @attr) = @_;
    my $buffer = lc($el);
    $obj->CaptureOn( $buffer ) if ($buffer eq 'a');
}
sub endh
{
    my ($obj, $el, $term) = @_;
    my $buffer = lc($el);
    $obj->CaptureOff( $buffer, 1 ) if ($buffer eq 'a');
}

open my $fh, 'c:\temp\z.html' or die "can't open z.html...";
$p->parse($fh);
close $fh;

# get and parse capture buffer 'a'
# ....

# display 'a'
$p->DumpCaptureBuffs();

__END__

BUFFER: a
=====================================
index  sequence
-----  --------
[0]    1         <A HREF="http://10.123.111.11">link1</A>
[1]    2         <A HREF="text.txt">text.txt</A>
[2]    3         <A HREF="fes.iso">fes.iso</A>

Dr.Ruud · Oct 6, 2008

Jürgen Exner said:
The whole
perlop man page is about 2000 lines long. That is way too long and
complex. It is almost impossible to find anything there or to point
people to a specific part of it. Is someone already working on breaking
it down into more manageable chunks?

You could generate something like

-------------------------
=head2 TABLE OF CONTENTS

=over 2

=item L</Operator Precedence and Associativity>

=item L</Terms and List Operators (Leftward)>

=item L</The Arrow Operator>

=item etc. etc.

=back

-------------------------

before the "=head1 DESCRIPTION" line,

and use

perldoc -oHtml perlop | lynx -stdin

to have a viewer that is easier to navigate.

Something like "info" would also be nicer than the default man view.

Or use http://perldoc.perl.org/perlop.html

havel.zhang · Oct 6, 2008

Jürgen Exner said:
The documentation can. See 'perldoc perlop', section 'Quote and
quote-like operators', the two paragraphs beginning with
"The "/g" modifier specifies global pattern matching--that is, ..."

Thank you, jue.
After I posted my question to the newsgroup, I found the answer in a Perl
book. The book explains what a regex with the /g modifier does when it is
used as the condition of a while loop, just as you pointed out above. It's
so easy and amazing.

Thank you again

Havel

sln · Oct 7, 2008

havel.zhang said:
dear perl-gurus,
i don't understand how this function works. can you please give me
further explanation:

I tried, but you didn't listen.
The function does not work well for what you are doing.
Not at all, and it never will.

havel.zhang said:
I want to using this program pick out hrefs and lables like
"link1","text.txt","fes.iso".

No, you don't; this is not how to do it. It fails easily.

havel.zhang said:
Can any perl guru give me answer?

The answer was given in detail, and at great expense of time.
Next time there will be no answer.

sln

Tim Greer · Oct 7, 2008

havel.zhang said:
foreach(@l){
if ($_ =~ /<a\b([^>]+)(.*?)<\/a>/ig){
$html=$_;
while($html =~ m{a\b([^>]+)(.*?)</a>}ig){
my $Guts = $1;
my $Link = $2;
print "$Guts\n$Link\n";
}
}

It steps through the @l array and, for each element in it, checks
$_ (which is the default loop variable of for/foreach, so you
don't actually need to name it).

It then checks whether $_ contains an opening HTML tag that starts with
"a", which is most likely an anchor (hot link), with a word
boundary \b to ensure it's not some other tag that starts with "a",
such as <applet> (just an example). It takes a run of characters that
are not the tag-closing ">" and captures it into $1. Then it captures
anything else between that point and the ending anchor tag (</a>, written
as <\/a>) into $2. It does this check globally and
case-insensitively. Of course, that regex doesn't make sense, and
neither does the check, to be honest, but no matter.

After the above check, which I assume is there to see whether the line
contains a matching anchor tag at all, it assigns the value of $_ to the
$html variable and runs a while loop that, case-insensitively and
globally, checks for exactly the same thing it just did
above, assigns the first and second captures ($1 and $2, respectively)
to the $Guts and $Link variables, and
prints them out. The above code really isn't very good and doesn't make
sense: it repeats things that can be done in one check, it captures
values it's never going to use, etc. It should instead just use the
one match, and even that one is not correct. It should be

m{a\b([^>]+)>(.*?)</a>}ig

Notice the addition of ">" between ([^>]+) and (.*?). Otherwise $2 will
always start with ">" (is that what you want?). It would also match any
invalid values when checking the anchor tag, which doesn't seem like
it would do any good. If it works, great, but there is some wasted
processing and there are bugs, so you should expect the unexpected if you
run it against many HTML files.
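
A quick way to see the difference is to run both patterns against one of the links from the original post. This is just a sketch; the variable names are made up for the comparison.

```perl
use strict;
use warnings;

my $html = '<A HREF="text.txt">text.txt</A>';

# Original pattern: nothing consumes the '>' that closes the opening tag,
# so it ends up at the front of the second capture.
my ($guts_old, $link_old) = $html =~ m{a\b([^>]+)(.*?)</a>}i;

# Pattern with the explicit '>' added between the two groups.
my ($guts_new, $link_new) = $html =~ m{a\b([^>]+)>(.*?)</a>}i;

print "without '>': [$link_old]\n";
print "with '>':    [$link_new]\n";
```

The first capture is the same in both cases; only the second one changes.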
