Searching all instances of a pattern across multi-lines

laredotornado · Dec 13, 2009

Hi,

I'm using Perl 5.8.8 on Mac 10.5.6. I found this script online for
matching a pattern across multiple lines. The problem is, it only
prints out one instance of the expression, and I would like it to
print out all instances. What can I change so that it will print out
all instances?

#!/usr/bin/perl
use strict;
use warnings;

open(my $file, "<", "myfile.txt")
or die "Can't open file: $!";
my $text = do { local $/; <$file> };

if ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
print $1;
}

Thanks, - Dave

C.DeRykus · Dec 13, 2009

Hi,

I'm using Perl 5.8.8 on Mac 10.5.6. I found this script online for
matching a pattern across multiple lines. The problem is, it only
prints out one instance of the expression, and I would like it to
print out all instances. What can I change so that it will print out
all instances?

#!/usr/bin/perl
use strict;
use warnings;

open(my $file, "<", "myfile.txt")
or die "Can't open file: $!";
my $text = do { local $/; <$file> };

if ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
print $1;

}

print $1 while $text =~ /(<\s*script[^<]*>.*?<\/script>)/gs;

sln · Dec 13, 2009

Hi,

I'm using Perl 5.8.8 on Mac 10.5.6. I found this script online for
matching a pattern across multiple lines. The problem is, it only
prints out one instance of the expression, and I would like it to
print out all instances. What can I change so that it will print out
all instances?

#!/usr/bin/perl
use strict;
use warnings;

open(my $file, "<", "myfile.txt")
or die "Can't open file: $!";
my $text = do { local $/; <$file> };

if ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
print $1;
}

Thanks, - Dave

'while()' should work as others have said.

The above regex should take into account these forms:
<tag>
<tag/>
<tag attr> content </tag>
<tag attr/>

Try this. It takes into account all the above forms
plus handles attributes fairly well, without the need for
[^<]*, where the actual character '<' can exist in the value
part. Handling attrib/vals correctly and taking acccount of all
valid forms are important, it all goes toward partitioning the
data.

Also, this is a complex parse. It includes multiple atomic
markup units, which is debatably <tag> style and content.
Content being the current state that is not markup.
Ideally, the unit is parsed to find the start element 'script',
recording is turned on, then off at the end element 'script'.

As it is now, the regex you are using won't correctly parse the
$text string below.

Good luck!
-sln

------------
use strict;
use warnings;

my $text = <<HTML;
<script />
<notme>
<script attr = "asdf" attr = 'wafsd'/>
<script a = "asdf" b= 'wafsd'>
use strict;
use warnings;
print "hello world, I'm a <tag>\\n";
</script>
<script>
// comment me out c++ style
/* now c style
*/
</script>
HTML

my $name = 'script';

my $rx = qr /
(
< $name (?: \s+ (?: ".*?" | '.*?' | [^>]*? )+ )? \s* \/ >
|
< $name (?: \s+ (?: ".*?" | '.*?' | [^>]*? )+ )* \s* > .*? <\/$name\s*>
)
/xs;

while ( $text =~ /$rx/g) {
print '-'x20,"\n",$1,"\n";
}

__END__
Output:

--------------------
<script />
--------------------
<script attr = "asdf" attr = 'wafsd'/>
--------------------
<script a = "asdf" b= 'wafsd'>
use strict;
use warnings;
print "hello world, I'm a <tag>\n";
</script>
--------------------
<script>
// comment me out c++ style
/* now c style
*/
</script>

sln · Dec 16, 2009

Hi,

I'm using Perl 5.8.8 on Mac 10.5.6. I found this script online for
matching a pattern across multiple lines. The problem is, it only
prints out one instance of the expression, and I would like it to
print out all instances. What can I change so that it will print out
all instances?

#!/usr/bin/perl
use strict;
use warnings;

open(my $file, "<", "myfile.txt")
or die "Can't open file: $!";
my $text = do { local $/; <$file> };

if ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
print $1;
}

Thanks, - Dave

Click to expand...

'while()' should work as others have said.

The above regex should take into account these forms:
<tag>
<tag/>
<tag attr> content </tag>
<tag attr/>

Try this. It takes into account all the above forms
plus handles attributes fairly well, without the need for
[^<]*, where the actual character '<' can exist in the value
part. Handling attrib/vals correctly and taking acccount of all
valid forms are important, it all goes toward partitioning the
data.

Also, this is a complex parse. It includes multiple atomic
markup units, which is debatably <tag> style and content.
Content being the current state that is not markup.
Ideally, the unit is parsed to find the start element 'script',
recording is turned on, then off at the end element 'script'.

As it is now, the regex you are using won't correctly parse the
$text string below.

Late addition:

But alas, no simple regex is going to handle nesting correctly
unless there is recursion. Below handles recursive tags, but
requires Perl 5.10 or better.

usage: html_rx.pl [<tag name> [file name]] - default, if no params

Cmd line examples:

*> html_rx.pl form junk.html - Finds 'form' blocks in html file

*> html_rx.pl script junk.html - Finds 'script' blocks in html file

*> html_rx.pl "(?i)script|object" junk.html
- Finds either 'script' or 'object' blocks, case insensitive
(good little markups will be properly nested, ie. those that have terminators)

-sln
-------------

*> perl html_rx.pl "(?i)script|object"

File name: __DATA__
Tag name: (?i)script|object

-------------------- ** type 1
<script attr = "asdf" attr = 'wafsd' />
-------------------- ** type 2
<script>
use strict;
use warnings;
print "hello world, I'm a <tag>\n";
<script a = "it's" b= 'terminated'/>
<object></object>
<script>
// comment me out c++ style
/* now c style
*/
</script>
</script>
-------------------- ** type 1
<script />

====================
Summary
File name: __DATA__
Tag name: (?i)script|object
type1 <tag, tag-attr /> = 2
type2 <tag, tag-attr>..</tag> = 1

*>
-------------

## html_rx.pl
## -sln

use strict;
use warnings;

require 5.010_000;

# usage: html_rx.pl [<tag name> [file name]]
# ---------------------------------------------
my ($tag,$fname) = @ARGV;
my $text;

$tag = 'script' unless defined $tag;
if (defined $fname) {
open my $fh, '<', $fname or die "Can't open file '$fname' : $!";
$text = join '',<$fh>;
close $fh;
} else {
$fname = '__DATA__';
$text = join '',<DATA>;
}

my ($terminated, $open, $close) =
(
qr {< (?:$tag) (?:\s+[^>]*)? />}x,
qr {< (?:$tag) (?:\s+[^>]*? \s*[^/]> | \s*>) }x,
qr {</ (?:$tag) \s*> }x
);

my $rx = qr {
(
(?: $terminated ) # <tag [attr] />
)
| # OR ...
(
(?: $open ) # <tag [attr] >
(?:
(?: (?!$open|$close) . )++ # possessive
|
(?2) # recurse group 2
)*
(?: $close ) # </tag>
)
}xs;

print "\n",<<INFO;
File name: $fname
Tag name: $tag\n
INFO

my ($cnt1,$cnt2) = (0,0);

while ( $text =~ /$rx/g) {
print '-'x20;
if (defined $1) {
print " ** type 1\n",$1,"\n" ;
$cnt1++;
} else {
print " ** type 2\n",$2,"\n" ;
$cnt2++;
}
}

print "\n",'='x20,"\nSummary\n",<<SUMMARY;
File name: $fname
Tag name: $tag
type1 <tag, tag-attr /> = $cnt1
type2 <tag, tag-attr>..</tag> = $cnt2\n
SUMMARY

__DATA__

<script attr = "asdf" attr = 'wafsd' />
</script>
<script>
<script>
<script>
use strict;
use warnings;
print "hello world, I'm a <tag>\n";
<script a = "it's" b= 'terminated'/>
<object></object>
<script>
// comment me out c++ style
/* now c style
*/
</script>
</script>
<script>
<script />
<notme>

Efficiently searching multiple files	10	May 20, 2010
remove last 10 lines of all files in a directory	6	Sep 26, 2006
Searching for all Occurances and Evaling Them	0	Feb 8, 2007
printing two consecutive lines	9	Aug 21, 2008
Problem Splitting Text String	2	Dec 29, 2022
matching HTML expression across multiple lines	9	May 10, 2006
Multi-level list generation	21	Aug 21, 2010
FAQ 5.3 How do I count the number of lines in a file?	0	Jan 31, 2011

Searching all instances of a pattern across multi-lines

laredotornado

C.DeRykus

sln

sln

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads