RegEx - matching previous match

j ellings · Feb 27, 2008

Hello.

I have an html file converted from PDF that includes the following
sample lines:

(html has been converted)

Z & A Newsstand 
Retail Food: Mobile Food Vendor 
2 N 10th St 
Philadelphia, PA 19107 
Inspection Date 
4/11/07 
No Critical Violations 
4/11/07 
No Critical Violations 
11/28/06 
No Critical Violations 
4/24/06 
No Critical Violations 
Newstand 
Retail Food: Mobile Food Vendor 
32 N 10th St 
Philadelphia, PA 19107 
Inspection Date 
7/2/07 
No Critical Violations 
Pudgies Deli 
Retail Food: Restaurant, Eat-in 
46 N 10th St 
Philadelphia, PA 19107 
Inspection Date 
1/11/07 
No Critical Violations 
9/25/06 
No Critical Violations 
8/7/06 
No Critical Violations 

I am trying to capture the information between the opening
tags, as these are the only unique delimiters between entries.

My regex is as follows:

while ($html =~ m{(.*?)}gs) {
#do something
}

Unfortunately, the regex matches the first instance (Z & A
Newsstand), ignores the second (Newstand), and then matches the
third (Pudgies Deli).

I can see that the match is working according to what I wrote; I am
trying to fine tune it so that I can grab every match. Is there a way
to include the previous delimiter in the next match such that
it will not skip a potential match?

Any suggestions or advice would be most appreciated.

John


Gunnar Hjalmarsson · Feb 28, 2008

j ellings said:
(html has been converted)

Yes, but why on earth did you post the data in that format?

I am trying to capture the information between the opening
tags, as these are the only unique delimiters between entries.

My regex is as follows:

while ($html =~ m{(.*?)}gs) {
#do something
}

Unfortunately, the regex matches the first instance (Z & A
Newsstand), ignores the second (Newstand), and then matches the
third (Pudgies Deli).

I can see that the match is working according to what I wrote; I am
trying to fine tune it so that I can grab every match. Is there a way
to include the previous delimiter in the next match such that
it will not skip a potential match?

A zero-width positive look-ahead assertion may be what you are after;
see "perldoc perlre".

while ($html =~ m{(.*?)(?=)}gs) {
---------------------------------^^^------^
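
The skipping can be reproduced in a minimal sketch. The tags were stripped from this copy of the thread, so the `<i><b>` delimiter below is an assumption inferred from the later remarks about italic and bold tags, and the fourth entry (`Example Cafe`) is hypothetical data added so the pattern of skips is visible. The `|\z` alternative is also an addition, so that the last entry, which has no delimiter after it, is still captured.

```perl
use strict;
use warnings;

# Hypothetical sample: assumes every entry starts with an opening <i><b> pair.
my $html = join '',
    "<i><b>Z & A Newsstand</b></i>\n2 N 10th St\n",
    "<i><b>Newstand</b></i>\n32 N 10th St\n",
    "<i><b>Pudgies Deli</b></i>\n46 N 10th St\n",
    "<i><b>Example Cafe</b></i>\n50 N 10th St\n";

# Consuming version: the trailing <i><b> is part of the match, so the next
# search starts after it and the entry it opened is never captured.
my @consumed = $html =~ m{<i><b>(.*?)<i><b>}gs;

# Look-ahead version: (?=...) only checks for the next delimiter without
# consuming it; \z lets the final entry match at the end of the input.
my @records = $html =~ m{<i><b>(.*?)(?=<i><b>|\z)}gs;

printf "consuming: %d, look-ahead: %d\n", scalar @consumed, scalar @records;
```

With this sample the consuming pattern should capture only the first and third entries, while the look-ahead pattern captures all four.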

Another approach that doesn't slurp the whole file into a scalar variable:

local $/ = '';
while ( my $html = <> ) {
#do something
}
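
As a rough illustration of the record-separator approach: the value assigned to `$/` was lost from this copy of the post (an empty string would put Perl in paragraph mode), so the `<i><b>` string below is an assumption, as is the sample data. The sketch reads from an in-memory filehandle instead of `<>` so it runs without an input file.

```perl
use strict;
use warnings;

# Hypothetical sample; the <i><b> delimiter is an assumption.
my $data = "<i><b>Z & A Newsstand</b></i>\n2 N 10th St\n"
         . "<i><b>Newstand</b></i>\n32 N 10th St\n";

# Read from an in-memory filehandle so the sketch needs no file on disk.
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @records;
{
    local $/ = '<i><b>';            # each read now ends at the next delimiter
    while ( my $chunk = <$fh> ) {
        chomp $chunk;               # chomp removes a trailing $/ string
        next unless length $chunk;  # nothing precedes the first delimiter
        push @records, $chunk;
    }
}
close $fh;

print scalar(@records), " records\n";
```

Each element of `@records` then holds one entry, starting with the business name, and only one entry is held in memory at a time while reading.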

Tad J McClellan · Feb 28, 2008

j ellings said:
Hello.

I have an html file converted from PDF that includes the following
sample lines:

(html has been converted)

Why has HTML been converted?

This is a plain-text medium...

Z & A Newsstand

^^ ^^
^^ ^^

My regex is as follows:

while ($html =~ m{(.*?)}gs) {

End tags have slash characters in them that your pattern will not match.

Your data closes the bold before the italic, but your regex looks
for the italic close before the bold close.

I can see that the match is working according to what I wrote;

You have a strange definition of "working" then...

trying to fine tune it so that I can grab every match. Is there a way
to include the previous delimiter in the next match such that
it will not skip a potential match?

Any suggestions or advice would be most appreciated.

while ($html =~ m{(.*?)}gs) {

j ellings · Feb 28, 2008

Gunnar Hjalmarsson said:
A zero-width positive look-ahead assertion may be what you are after;
see "perldoc perlre".

while ($html =~ m{(.*?)(?=)}gs) {
---------------------------------^^^------^

Another approach that doesn't slurp the whole file into a scalar variable:

local $/ = '';
while ( my $html = <> ) {
#do something
}

Thanks Gunnar, this worked perfectly; apologies for the formatting.

j ellings · Feb 28, 2008

Tad J McClellan said:
while ($html =~ m{(.*?)}gs) {

Tad

Thanks for the suggestion. Your regex will match the first instance
of opening and closing of the tags; what I needed it to do was
to match the opening of the two tags. My original regex did capture
between two opening instances, but only after skipping one.
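
To make the difference concrete, here is a small sketch of the opening-to-closing match. The tag names and nesting order (`<i><b>` ... `</b></i>`) are assumptions based on the remarks above about bold closing before italic, and the sample data is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample; tag names and nesting order are assumptions.
my $html = "<i><b>Z & A Newsstand</b></i>\n2 N 10th St\n"
         . "<i><b>Newstand</b></i>\n32 N 10th St\n";

# Opening pair to closing pair: captures only the text inside the tags,
# not the address and inspection lines that follow each name.
my @names = $html =~ m{<i><b>(.*?)</b></i>}gs;

print "$_\n" for @names;
```

This yields just the business names, which is why it did not fit a task that needs the whole entry up to the next opening pair.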
