capturing multiple patterns per line

ccc31807

This is a newbie question, I admit, but I don't know the answer.

Suppose I am parsing a file line by line, and I want to push to an
array all substrings on that line that match a pattern. For example,
consider the listing below. @urls SHOULD contain this: @urls = (http://
google.com, http://yahoo.com, http://amazon.com, http://ebay.com)
Instead, it contains only the last value. Using the g modifier doesn't
help.

I know why @urls contains only the last value, but I don't know how to
get all the values.

Thanks, CC.

-------listing---------------
use strict;
use warnings;

my @urls;
while (<DATA>)
{
if (/<a.*href="([^"]+)/) { push @urls, $1; }
}

print @urls;
exit(0);

__DATA__
<html>\n
<body>\n
<h1>My Favorite Sites</h1>\n
<p>\n
My favorite sites are <a href="http://google.com">Google</a>, <a
href="http://yahoo.com">Yahoo</a>, <a href="http://amazon.com">Amazon</
a>, and <a href="http://ebay.com">Ebay</a>.\n
</p>\n
</body>\n
</html>\n
 
Jürgen Exner

ccc31807 said:
This is a newbie question, I admit, but I don't know the answer.

Suppose I am parsing a file line by line, and I want to push to an
array all substrings on that line that match a pattern. For example,
consider the listing below. @urls SHOULD contain this: @urls = (http://
google.com, http://yahoo.com, http://amazon.com, http://ebay.com)
Instead, it contains only the last value. Using the g modifier doesn't
help.

I know why @urls contains only the last value, but I don't know how to
get all the values.

Cannot repro your problem. The code you posted adds all three URLs to
the array and prints them in one contiguous line.

C:\tmp>t.pl
http://google.comhttp://amazon.comhttp://ebay.com

jue
 
ccc31807

Cannot repro your problem. The code you posted adds all three URLs to
the array and prints them in one contiguous line.

C:\tmp>t.pl
http://google.comhttp://amazon.comhttp://ebay.com

This is a mystery. I've run the script on both a Windows and Linux
machine with the same results. Besides, your output should also
include Yahoo, which it doesn't.

I was able to do what I wanted with the following hack. I'm not real
happy about it, but it works. Still, I'd rather know how to do it with
a RE.

CC.

---------hack---------------
while (<DATA>)
{
    my @line = split /<a/;
    foreach my $url (@line)
    {
        if (/<a.*href="([^"]+)/) { push @urls, $1; }
    }
}
 
Willem

ccc31807 wrote:
) This is a newbie question, I admit, but I don't know the answer.
)
) Suppose I am parsing a file line by line, and I want to push to an
) array all substrings on that line that match a pattern. For example,
) consider the listing below. @urls SHOULD contain this: @urls = (http://
) google.com, http://yahoo.com, http://amazon.com, http://ebay.com)
) Instead, it contains only the last value. Using the g modifier doesn't
) help.
)
) I know why @urls contains only the last value, but I don't know how to
) get all the values.

I think you don't actually know why it only contains the last value,
because there are two separate issues with your code.

) Thanks, CC.
)
) -------listing---------------
) use strict;
) use warnings;
)
) my @urls;
) while (<DATA>)
) {
) if (/<a.*href="([^"]+)/) { push @urls, $1; }
) }

First of all, the .* in there will match everything, so in this case it
will match everything from the first <a to the last href="..."

Second, with the /g modifier, the results will not all be put in $1

And third, obviously, this is a lot easier in perl if you realise that it
can do a lot of set processing:

while (<DATA>)
{
    push @urls, /<a.*?href="(.*?)"/gi;
}

Or even:

@urls = map { /<a.*?href="(.*?)"/gi } <DATA>;

Although that is a lot more memory hungry.
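
To make the greedy versus non-greedy point concrete, here is a minimal sketch. The sample string and the variable names ($line, @greedy, @lazy) are made up for illustration and are not from the original listing; it only compares the two patterns in list context on a single line.

```perl
use strict;
use warnings;

# Hypothetical one-line sample, similar to the long line in the original DATA.
my $line = 'See <a href="http://google.com">Google</a>, '
         . '<a href="http://yahoo.com">Yahoo</a> and '
         . '<a href="http://ebay.com">Ebay</a>.';

# Greedy: .* runs to the end of the line and backtracks only as far as
# the last href=", so even with /g there is a single match.
my @greedy = $line =~ /<a.*href="([^"]+)/g;

# Non-greedy: .*? stops at the first href=" after each <a, and a /g match
# in list context returns the captured group of every match.
my @lazy = $line =~ /<a.*?href="(.*?)"/gi;

print "greedy: @greedy\n";   # greedy: http://ebay.com
print "lazy:   @lazy\n";     # lazy:   http://google.com http://yahoo.com http://ebay.com
```

With the greedy version the /g modifier has nothing left to find after the first match, which is why adding it alone appeared not to help.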


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
ccc31807

  while (<DATA>)
  {
    push @urls, /<a.*?href="(.*?)"/gi;
  }

Yes, yes, yes, you are entirely right. I thought that the non-greedy
modifier might do the trick, but (1) I didn't realize that the greedy
version would skip all the way to the last one to the detriment of my
search, and (2) I didn't carefully think through exactly where I
should use the non-greedy modifier.

Thanks, CC.
 
John W. Krahn

ccc31807 said:
Cannot repro your problem. The code you posted adds all three URLs to
the array and prints them in one contiguous line.

C:\tmp>t.pl
http://google.comhttp://amazon.comhttp://ebay.com

This is a mystery. I've run the script on both a Windows and Linux
machine with the same results. Besides, your output should also
include Yahoo, which it doesn't.

I was able to do what I wanted with the following hack. I'm not real
happy about it, but it works. Still, I'd rather know how to do it with
a RE.

---------hack---------------
while (<DATA>)
{
my @line = split /<a/;
foreach my $url (@line)
{
if (/<a.*href="([^"]+)/) { push @urls, $1; }

That is short for:

if ($_ =~ /<a.*href="([^"]+)/)

So you are not using the results from split() at all and the foreach
loop is superfluous. But if you changed that to:

if ($url =~ /<a.*href="([^"]+)/)

Then it wouldn't work because "split /<a/" removes the string '<a' from
all input and the regular expression requires a match with '<a'.
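
A small sketch of both points, assuming a made-up sample line (the names $line, @pieces and @with_tag are illustrative only). It shows that none of the pieces returned by split still contain '<a', and that matching each piece against a pattern without the '<a' prefix is one way the loop could be made to work:

```perl
use strict;
use warnings;

# Hypothetical sample line.
my $line = 'x <a href="g">G</a> <a href="y">Y</a> x';

# split /<a/ removes the separator, so no piece contains '<a' any more.
my @pieces   = split /<a/, $line;
my @with_tag = grep { /<a/ } @pieces;    # stays empty

# Matching each piece (not $_) against a pattern without '<a' does work.
my @urls;
for my $piece (@pieces) {
    push @urls, $1 if $piece =~ /href="([^"]+)/;
}

print scalar(@with_tag), " pieces contain '<a'\n";   # 0 pieces contain '<a'
print join(',', @urls), "\n";                        # g,y
```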



John
 
Jürgen Exner

ccc31807 said:
This is a mystery. I've run the script on both a Windows and Linux
machine with the same results. Besides, your output should also
include Yahoo, which it doesn't.

After reading the other responses I realize that I was looking at the
wrong problem. You wrote "Instead, it contains only the last value. "
Running your code I saw three distinct values. Three is more than "only
the last", so obviously your claim was wrong.
You never mentioned that you were talking about the RE not
extracting/capturing all the elements from a _SINGLE(!!!)_ line.

Thank you very much for throwing red herring around.

jue
 
RedGrittyBrick

This is a mystery. I've run the script on both a Windows and Linux
machine with the same results. Besides, your output should also
include Yahoo, which it doesn't.

That's because your DATA lines have been reformatted and split onto
several lines!

I was able to do what I wanted with the following hack. I'm not real
happy about it, but it works. Still, I'd rather know how to do it with
a RE.

Not every job should be done with an RE.

---------hack---------------
while (<DATA>)
{
my @line = split /<a/;
foreach my $url (@line)
{
if (/<a.*href="([^"]+)/) { push @urls, $1; }
}
}

-------------8<-------------
#!/usr/bin/perl
use strict;
use warnings;
my @urls;
while (<DATA>)
{
push @urls, /<a href="([^"]+)/g;
}
print join(',',@urls), "\n";
__DATA__
xxx
x <a href="g">G</a><a href="y">Y</a> x
x <a href="a">A</a><a href="e">E</a> x
xxx
-------------8<-------------
g,y,a,e
 
ccc31807

That's because your DATA lines have been reformatted and split onto
several lines!

Yeah, I saw that before I posted, which is why I use '\n' to mark the
ends of the 'real' lines.

Not every job should be done with an RE

No, but in accord with TIMTOWTDI, I wanted to see how it could be done
with an RE.

while (<DATA>)
{
    push @urls, /<a href="([^"]+)/g;
}

print join(',',@urls), "\n";

I'm having fun playing with the suggestions offered, and am actually
learning in the process. ;-)

Thanks, CC.
 
sln

This is a newbie question, I admit, but I don't know the answer.

Suppose I am parsing a file line by line, and I want to push to an
array all substrings on that line that match a pattern. For example,
consider the listing below. @urls SHOULD contain this: @urls = (http://
google.com, http://yahoo.com, http://amazon.com, http://ebay.com)
Instead, it contains only the last value. Using the g modifier doesn't
help.

I know why @urls contains only the last value, but I don't know how to
get all the values.

Thanks, CC.

-------listing---------------
use strict;
use warnings;

my @urls;
while (<DATA>)
{
if (/<a.*href="([^"]+)/) { push @urls, $1; }
}

print @urls;
exit(0);

__DATA__
<html>\n
<body>\n
<h1>My Favorite Sites</h1>\n
<p>\n
My favorite sites are <a href="http://google.com">Google</a>, <a
href="http://yahoo.com">Yahoo</a>, <a href="http://amazon.com">Amazon</
a>, and <a href="http://ebay.com">Ebay</a>.\n
</p>\n
</body>\n
</html>\n

If you want to parse with a little more conformity,
something like this (albeit deficient) might work better
when you come across possible gotchas.

-sln

use strict;
use warnings;

my @urls;
{
    local $/;    # slurp mode: <DATA> returns the whole input as one string
    @urls = <DATA> =~
        /<a\s+[^>]*?(?<=\s)href\s*=\s*["'](.+?)["'][^>]*?\s*\/?>/sg;

    # Or, if you want to be more precise and don't mind the quotes:
    #/<a\s+[^>]*?(?<=\s)href\s*=\s*(".+?"|'.+?')[^>]*?\s*\/?>/sg
}

print $_,"\n" for @urls;
exit(0);

__DATA__
<html>\n
<body>\n
<h1>My Favorite Sites</h1>\n
<p>\n
My favorite sites are <a asdfhref=http://google.com" href='http://gg.com' >Google</a>, <a
href="http://yahoo.com">Yahoo</a>, <a href="http://amazon.com">Amazon</
a>, and <a href="http://ebay.com">Ebay</a>.\n
</p>\n
</body>\n
</html>\n
 
Peter J. Holzer

On Fri, 5 Feb 2010 09:12:54 -0800 (PST) in comp.lang.perl.misc, ccc31807
Yes, yes, yes, you are entirely right. I thought that the non-greedy
modifier might do the trick, but

Instead of .*? I think [^>]*? would be more accurate.

Nope. ">" is allowed in a double-quoted parameter value.

hp
 
sln

On Fri, 5 Feb 2010 09:12:54 -0800 (PST) in comp.lang.perl.misc, ccc31807
  while (<DATA>)
  {
    push @urls, /<a.*?href="(.*?)"/gi;
  }

Yes, yes, yes, you are entirely right. I thought that the non-greedy
modifier might do the trick, but

Instead of .*? I think [^>]*? would be more accurate.

Nope. ">" is allowed in a double-quoted parameter value.

hp

In single quotes as well.
Yes, > is allowed in a double/single quote attval.
It's also allowed in content surrounded by quotes.

So, CC's regex will match: <a/>href=" > "
Clearly, a guard must be in place to thwart this.
[^>]*? is a good candidate but where do you put it?

CC's regex will also match: <aBBB Zhref="some stuff"
So, it's not really a good regex for this.

However, you can use [^>]*? to flesh out the tag-att/val form.
There are 5 or 6 sub-pattern forms in an expression.
At least 1 complete form for tag-att/val's is needed.

A complete sub-pattern (form), that will parse any tag-att/val
markup is this:
<(?:($Name)(\s+(?:(?:".*?")|(?:'.*?')|(?:[^>]*?))+)\s*(\/?))>

Where, tag and ".*?" and '.*?' and [^>]*? consume all valid text between <>.
Easier said than done. After this a further parsing is necessary on the
capture groups to separate data and detect errors.

The form above can be combined with the secondary parsing when there
is specific information available, like CC's <a href= .. data.
Still, a complete form is needed.

As a side note, xml is stricter than html when it comes to quoting
values in att/val pairs. Html is not so strict and allows for unquoted
vals and standalone unquoted attributes as well.
The form above accommodates both; strictures can be enforced later,
and the bottom line is that the *form* integrity is maintained in the stream
and does not overflow into invalid territory.

So, CC's regex could be made into a combined, modified form,
though still inadequate because it is a standalone form where
other forms are missing that could negate the results.

Yes you were right about the ">", but without [^>]*? in a couple
of places, it won't work:

/<a\s+[^>]*?(?<=\s)href\s*=\s*["'](.+?)["'][^>]*?\s*\/?>/
or
/<a\s+[^>]*?(?<=\s)href\s*=\s*(".+?"|'.+?')[^>]*?\s*\/?>/ # quotes captured

-sln
 
jl_post

Suppose I am parsing a file line by line, and I want to push to an
array all substrings on that line that match a pattern. For example,
consider the listing below. @urls SHOULD contain this: @urls = (http://
google.com,http://yahoo.com,http://amazon.com,http://ebay.com)
Instead, it contains only the last value. Using the g modifier doesn't
help.

(My apologies if someone has already answered to your
satisfaction.)

Try using the /g modifier, changing "if" to "while", and changing
"<a.*href" to just "href" (since "a" and "href" are not guaranteed to
occur together on the same line). So your script would look like:

-------listing---------------
use strict;
use warnings;
my @urls;
while (<DATA>)
{
while (/href="([^"]+)/g) { push @urls, $1; }
}

print join "\n", @urls;
__DATA__
<html>\n
<body>\n
<h1>My Favorite Sites</h1>\n
<p>\n
My favorite sites are <a href="http://google.com">Google</a>, <a
href="http://yahoo.com">Yahoo</a>, <a href="http://
amazon.com">Amazon</
a>, and <a href="http://ebay.com">Ebay</a>.\n
</p>\n
</body>\n
</html>\n
-------end of listing---------------

(I added a call to join() in the print() statement to make the
output a little easier to read.)

Running this modified program, I get as output:

http://google.com
http://yahoo.com
http://amazon.com
http://ebay.com

This is what you want, right?

(And consider using the /i modifier, as HTML tags are not required
to be lower-case.)
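
For reference, a minimal sketch of why changing "if" to "while" matters once /g is in place. The sample line and the array names (@once, @all) are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample line with mixed-case markup.
my $line = '<A HREF="http://a.example">A</A> <a href="http://b.example">B</a>';

# "if" tests the pattern once per line, so at most one URL is captured.
my @once;
if ($line =~ /href="([^"]+)/i) { push @once, $1; }

# "while" with /g in scalar context resumes after the previous match
# on each iteration, until the pattern fails.
my @all;
while ($line =~ /href="([^"]+)/gi) { push @all, $1; }

print scalar(@once), " vs ", scalar(@all), "\n";   # 1 vs 2
```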

Hope this helps,

-- Jean-Luc
 
sln

(My apologies if someone has already answered to your
satisfaction.)

Try using the /g modifier, changing "if" to "while", and changing
"<a.*href" to just "href" (since "a" and "href" are not guaranteed to
occur together on the same line). So your script would look like:

Other non-guarantees:
- "href" and "=" could have a span of lines between them
- attribute value may be single quoted
- attribute value may not be quoted at all
- a quoted value may span several lines

The list is too long to write.

-sln
 
Peter J. Holzer

On Fri, 5 Feb 2010 09:12:54 -0800 (PST) in comp.lang.perl.misc, ccc31807
  while (<DATA>)
  {
    push @urls, /<a.*?href="(.*?)"/gi;
  }

Yes, yes, yes, you are entirely right. I thought that the non-greedy
modifier might do the trick, but

Instead of .*? I think [^>]*? would be more accurate.

Nope. ">" is allowed in a double-quoted parameter value.

In single quotes as well.
Yes, > is allowed in a double/single quote attval.
It's also allowed in content surrounded by quotes.

I'm not sure what you mean by "content surrounded by quotes". It is
allowed in #PCDATA, quotes have nothing to do with it.

So, CC's regex will match: <a/>href=" > "
Clearly, a guard must be in place to thwart this.

I'm not worried much about matching some invalid HTML. In some (but not
all!) situations you just know that the input is valid.
I'm much more worried that it doesn't match some valid HTML like

* Single quotes:
<a href='http://example.com'>
* Multiple lines:
<a
href="http://example.com">
* extra whitespace:
<a href = "http://example.com">

All of these occur in valid, real-life HTML code.

But again, if you know the input (e.g., all the HTML files were produced
by the same program or written by the same person) that may not be an
issue.
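
A rough sketch of the three variants listed above, checked against the simple pattern from earlier in the thread. The hash %sample and the $looser pattern are assumptions made up for this illustration; $looser is not a robust parser, only a demonstration of how much has to be relaxed to cover these cases:

```perl
use strict;
use warnings;

# The three valid variants listed above, as hypothetical test input.
my %sample = (
    single_quotes => "<a href='http://example.com'>",
    multi_line    => "<a\nhref=\"http://example.com\">",
    extra_space   => '<a href = "http://example.com">',
);

# The line-oriented pattern from earlier in the thread.
my $simple = qr/<a.*?href="(.*?)"/i;

# A somewhat looser pattern (illustration only, still not a real parser):
# whitespace around '=', either quote style, /s so '.' also matches newlines.
my $looser = qr/<a\b.*?\bhref\s*=\s*(["'])(.*?)\1/is;

my %result;
for my $name (sort keys %sample) {
    my $simple_ok = $sample{$name} =~ $simple ? 1 : 0;
    my $looser_ok = $sample{$name} =~ $looser ? 1 : 0;
    $result{$name} = [$simple_ok, $looser_ok];
    print "$name: simple=$simple_ok looser=$looser_ok\n";
}
```

The simple pattern misses all three samples, while the looser one matches them. It still ignores unquoted values and many other legal forms.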
[^>]*? is a good candidate but where do you put it?

Don't worry about this. If you want a robust HTML parser, use one of the
modules already available. Writing the 756th HTML parser may be fun but
it isn't productive.

hp
 
Peter J. Holzer

On Thu, 11 Feb 2010 13:21:41 +0100 in comp.lang.perl.misc, "Peter J. Holzer" said:
On Fri, 5 Feb 2010 09:12:54 -0800 (PST) in comp.lang.perl.misc, ccc31807
  while (<DATA>)
  {
    push @urls, /<a.*?href="(.*?)"/gi;
  }

Yes, yes, yes, you are entirely right. I thought that the non-greedy
modifier might do the trick, but

Instead of .*? I think [^>]*? would be more accurate.

Nope. ">" is allowed in a double-quoted parameter value.

OK, thanks for that. But then, "href=" is not in all "<a" tags, i.e.
the ones that specify "name=" instead. So the "href=" matched above
might not even be part of a tag.

Oh, you meant the first .*?. I thought you meant the second one.
Yes, the first one is a drastic oversimplification. But changing it to
[^>]*? makes it only marginally better.

hp
 
