How does it work? (a while loop with a regex as its condition)


havel.zhang

Dear Perl gurus,
I don't understand how this code works. Can you please give me a
further explanation?

The program is very simple:
+++++++++++++program++++++++++++++++++++++
open (O,"<z.html");
@l = <O>;
close(O);

foreach (@l) {
    if ($_ =~ /<a\b([^>]+)(.*?)<\/a>/ig) {
        $html = $_;
        while ($html =~ m{a\b([^>]+)(.*?)</a>}ig) {
            my $Guts = $1;
            my $Link = $2;
            print "$Guts\n$Link\n";
        }
    }
};
++++++++z.html content+++++++++++++++++++++
The content of z.html is:
<A HREF="http://10.123.111.11">link1</A><A HREF="text.txt">text.txt</A><A HREF="fes.iso">fes.iso</A>
+++++++and output is:++++++++++++++++++++++++++++
 HREF="http://10.123.111.11"
>link1
 HREF="text.txt"
>text.txt
 HREF="fes.iso"
>fes.iso
++++++++end+++++++++++++++++++++++++++++++++

I want to use this program to pick out the hrefs and labels such as
"link1", "text.txt" and "fes.iso".
This program works well, but I can't understand the while loop with the
regex:
"$html =~ m{a\b([^>]+)(.*?)</a>}ig"
^^^^^^^^^^^^^^^^^^^^^^^
It works fine, which is amazing :) Every time, it picks out the pattern
"<a href=...></a>" and gets the right result. But HOW does it work? I
thought it would always pick out the first match.

Can any Perl guru give me an answer?

Thank you :)

Havel
 

Jürgen Exner

havel.zhang said:
This program works well, but I can't understand the while loop with the
regex:
"$html =~ m{a\b([^>]+)(.*?)</a>}ig"
^^^^^^^^^^^^^^^^^^^^^^^
It works fine, which is amazing :) Every time, it picks out the pattern
"<a href=...></a>" and gets the right result. But HOW does it work? I
thought it would always pick out the first match.

Can any Perl guru give me an answer?

The documentation can. See 'perldoc perlop', section 'Quote and
quote-like operators', the two paragraphs beginning with
"The "/g" modifier specifies global pattern matching--that is, ..."

However, it is not surprising that you didn't find it. The whole perlop
man page is about 2000 lines long. That is way too long and complex. It
is almost impossible to find anything there or to point people to a
specific part of it. Is someone already working on breaking it down into
more manageable chunks?
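
For reference, the behaviour those perlop paragraphs describe can be sketched roughly as follows. The sample string and variable names are made up for illustration and are not from the original program. In scalar context, each successful m//g match resumes where the previous one ended, pos() reports that offset, and a failed match resets it, which is what ends the while loop.

```perl
use strict;
use warnings;

my $text = 'a=1, b=2, c=3';   # hypothetical sample input
my @seen;

# In scalar context, m//g remembers where the last match ended (see pos()),
# so each pass through the loop finds the NEXT match, not the first one again.
while ($text =~ /(\w)=(\d)/g) {
    push @seen, "$1 => $2 (search resumes at offset " . pos($text) . ")";
}

print "$_\n" for @seen;

# After the final, failed match the position is reset to undef.
print defined pos($text) ? "pos still set\n" : "pos reset\n";
```

Without the /g modifier the same loop would never terminate, because every test would start again at the beginning of the string and find the first match.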

jue
 

sln

havel.zhang said:
foreach(@l){
if ($_ =~ /<a\b([^>]+)(.*?)<\/a>/ig){

You might need a 'while' here instead of the 'if'.

havel.zhang said:
$html=$_;
while($html =~ m{a\b([^>]+)(.*?)</a>}ig){

This does the same thing as the match above; you could even add the '<'.

havel.zhang said:
This program works well, but I can't understand the while loop with the
regex:
"$html =~ m{a\b([^>]+)(.*?)</a>}ig"

With the 'g' modifier, each pass through the while loop resumes matching
where the previous match ended, until the end of the string is reached.

The problem is that the first 'if' regex will only match the first occurrence.
It does the same as the inner match, except only once. Why do you need the
outer 'if' then?

The following script shows the difference:

use strict;
use warnings;

my $str = '<A HREF="http://10.123.111.11">link1</A><A HREF="text.txt">text.txt</A><A HREF="fes.iso">fes.iso</A>';

print "Output from 'if \$str':\n---------------\n";
# 'if' tests the /g match only once, so only the first anchor is found
if ($str =~ /(<a\b([^>]+)(.*?)<\/a>)/ig)
{
    print "found: '$1'\n\n";
    my $html = $1;
    while ($html =~ m{a\b([^>]+)(.*?)</a>}ig)
    {
        my $Guts = $1;
        my $Link = $2;
        print "$Guts\n$Link\n";
    }
}

# reset the match position before scanning $str again
pos ($str) = 0;

print "\n\nOutput from 'while \$str':\n---------------\n";
# 'while' repeats the /g match, so every anchor is found in turn
while ($str =~ /(<a\b([^>]+)(.*?)<\/a>)/ig)
{
    print "found: '$1'\n\n";
    my $html = $1;
    while ($html =~ m{a\b([^>]+)(.*?)</a>}ig)
    {
        my $Guts = $1;
        my $Link = $2;
        print "$Guts\n$Link\n";
    }
}

pos ($str) = 0;

print "\n\nOutput from just 'while \$html':\n---------------\n";
# a single loop is enough; the outer test is redundant
while ($str =~ m{<a\s*([^>]+)(.*?)</a\s*>}ig)
{
    my $Guts = $1;
    my $Link = $2;
    print "$Guts\n$Link\n";
}

__END__


Output from 'if $str':
---------------
found: '<A HREF="http://10.123.111.11">link1</A>'

 HREF="http://10.123.111.11"
>link1


Output from 'while $str':
---------------
found: '<A HREF="http://10.123.111.11">link1</A>'

 HREF="http://10.123.111.11"
>link1
found: '<A HREF="text.txt">text.txt</A>'

 HREF="text.txt"
>text.txt
found: '<A HREF="fes.iso">fes.iso</A>'

 HREF="fes.iso"
>fes.iso


Output from just 'while $html':
---------------
HREF="http://10.123.111.11"
>link1
HREF="text.txt"
>text.txt
HREF="fes.iso"
>fes.iso


In general it doesn't work fine. You can run into problems if the phrase you're
looking for spans lines. Another problem is that your regex does not account for
legal whitespace.

The better regex would be: "while ( m{<a\s*([^>]+)(.*?)</a\s*>}ig ) {}"
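
To illustrate the line-spanning problem, here is a rough sketch. It assumes the whole document is held in one string rather than read line by line, and the sample markup is made up. The /s modifier, the \b after 'a' and the literal '>' between the two captures are additions for this sketch and are not part of the regex above.

```perl
use strict;
use warnings;

# Hypothetical input: the second anchor is split across several lines.
my $html = qq{<A HREF="a.txt">one</A>\n<A\n HREF="b.txt">two\nwords</A >\n};

my @links;
# /s lets '.' match newlines; \s* tolerates whitespace after 'a' and before '>'.
while ($html =~ m{<a\b\s*([^>]+)>(.*?)</a\s*>}isg) {
    push @links, [$1, $2];
}

print "$_->[0] | $_->[1]\n" for @links;
```

Reading the same input line by line would miss the second anchor, because no single line contains both its opening and its closing tag.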

It's always good to have delimiters surrounding what you are trying to match.
In your case, '<a ...></a>', the 'a' tag itself serves as the delimiters.

This will grab inner non-'a' tags; nested 'a' tags, however, will not work.
Because of nesting, HTML/XML in general can't be parsed this way, by seeking the
end delimiter. But in your case it should be OK.

In general, should you need to do specific parsing, you should get a parser that
captures groups of phrases, which you can then parse reliably.


==================================================
use strict;
use warnings;

use RXParse; # VERSION 2

my $p = new RXParse();
$p->setMode( 'html' => 1, 'resume_onerror'=> 1 );
my %oldh = $p->setHandlers('start' => \&starth, 'end' => \&endh);

sub starth
{
    my ($obj, $el, $term, @attr) = @_;
    my $buffer = lc($el);
    $obj->CaptureOn( $buffer ) if ($buffer eq 'a');
}
sub endh
{
    my ($obj, $el, $term) = @_;
    my $buffer = lc($el);
    $obj->CaptureOff( $buffer, 1 ) if ($buffer eq 'a');
}

open my $fh, 'c:\temp\z.html' or die "can't open z.html...";
$p->parse($fh);
close $fh;

# get and parse capture buffer 'a'
# ....

# display 'a'
$p->DumpCaptureBuffs();


__END__


BUFFER: a
=====================================
index seqence
----- --------
[0] 1 <A HREF="http://10.123.111.11">link1</A>
[1] 2 <A HREF="text.txt">text.txt</A>
[2] 3 <A HREF="fes.iso">fes.iso</A>
 

Dr.Ruud

Jürgen Exner wrote:
The whole perlop man page is about 2000 lines long. That is way too long
and complex. It is almost impossible to find anything there or to point
people to a specific part of it. Is someone already working on breaking
it down into more manageable chunks?


You could generate something like

-------------------------
=head2 TABLE OF CONTENTS

=over 2

=item L</Operator Precedence and Associativity>

=item L</Terms and List Operators (Leftward)>

=item L</The Arrow Operator>

=item etc. etc.

=back

-------------------------

before the "=head1 DESCRIPTION" line,

and use

perldoc -oHtml perlop | lynx -stdin

to have a viewer that is easier to navigate.

Something like "info" would also be nicer than the default man view.

Or use http://perldoc.perl.org/perlop.html
 

havel.zhang

[...]

Thank you, jue.
After I posted my question to the newsgroup, I found the answer in a Perl
book. It describes exactly what you point out above: using a regex with
the /g modifier as the condition of a while loop. It's so easy and
amazing :)
Thank you again :)

Havel
 

sln

havel.zhang said:
Dear Perl gurus,
I don't understand how this code works. Can you please give me a
further explanation?

I tried, but you didn't listen.
The code does not work well for what you are doing.
Not at all, and it never will.

havel.zhang said:
I want to use this program to pick out the hrefs and labels such as
"link1", "text.txt" and "fes.iso".

No, you don't; this is not how to do it. It fails easily.

havel.zhang said:
Can any Perl guru give me an answer?

The answer was given in detail, and at great expense of time.
Next time there will be no answer.

sln
 

Tim Greer

havel.zhang said:
foreach (@l) {
    if ($_ =~ /<a\b([^>]+)(.*?)<\/a>/ig) {
        $html = $_;
        while ($html =~ m{a\b([^>]+)(.*?)</a>}ig) {
            my $Guts = $1;
            my $Link = $2;
            print "$Guts\n$Link\n";
        }
    }
}

It steps through the @l array, and for each element it checks $_ (which
is the default loop variable of for/foreach, so you don't actually need
to name it in the match).

It then checks whether $_ contains an opening HTML tag that starts with
"a", most likely an anchor (hyperlink). The word boundary \b ensures it
is not some other tag that starts with "a", such as <applet> (just an
example). It takes everything that is not the tag's closing ">" and
captures it into $1. Then it captures everything between that point and
the closing anchor tag (</a>, written as <\/a>) into $2. It does this
check globally and case-insensitively. Of course, that regex doesn't
make sense, and neither does the check, to be honest, but no matter.

After the above check, which I assume is there to see whether the line
contains an anchor tag at all, it assigns $_ to the $html variable. It
then runs a while loop that, case-insensitively and globally, checks for
exactly the same thing it just did above, assigns the first and second
captures ($1 and $2) to $Guts and $Link, and prints them. The above code
really isn't very good and doesn't make sense: it repeats things that
can be done in one check, it captures values it's never going to use,
and so on. It should instead use just the one regex, and even that one
is not correct. It should be

m{a\b([^>]+)>(.*?)</a>}ig

Notice the addition of ">" between ([^>]+) and (.*?). Otherwise $2 will
always start with ">" (is that what you want?). It would also match
invalid values when checking the anchor tag, which doesn't seem like
it would do any good. If it works, great, but there is some wasted
processing and there are bugs, so you should expect the unexpected if
you run it against many HTML files.
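
As a sketch of the single-loop version with that ">" added, run against the sample markup from earlier in the thread: the leading "<" (suggested in an earlier reply) and the trimming of the space before HREF are extras for this illustration, and the lowercase variable names are hypothetical.

```perl
use strict;
use warnings;

my $html = '<A HREF="http://10.123.111.11">link1</A>'
         . '<A HREF="text.txt">text.txt</A>'
         . '<A HREF="fes.iso">fes.iso</A>';

my @pairs;
# The literal '>' after ([^>]+) consumes the end of the opening tag,
# so $2 holds only the link text.
while ($html =~ m{<a\b([^>]+)>(.*?)</a>}ig) {
    my ($guts, $link) = ($1, $2);
    $guts =~ s/^\s+//;    # drop the space between 'A' and 'HREF'
    push @pairs, [$guts, $link];
}

print "$_->[0]\n$_->[1]\n" for @pairs;
```

For this input the labels come out as link1, text.txt and fes.iso, without the leading ">" that the original regex leaves in $2.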
 
