Searching all instances of a pattern across multi-lines

L

laredotornado

Hi,

I'm using Perl 5.8.8 on Mac 10.5.6. I found this script online for
matching a pattern across multiple lines. The problem is, it only
prints out one instance of the expression, and I would like it to
print out all instances. What can I change so that it will print out
all instances?


#!/usr/bin/perl
use strict;
use warnings;

open(my $file, "<", "myfile.txt")
or die "Can't open file: $!";
my $text = do { local $/; <$file> };

if ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
print $1;
}



Thanks, - Dave
 
C

C.DeRykus

Hi,

I'm using Perl 5.8.8 on Mac 10.5.6.  I found this script online for
matching a pattern across multiple lines.  The problem is, it only
prints out one instance of the expression, and I would like it to
print out all instances.  What can I change so that it will print out
all instances?

#!/usr/bin/perl
use strict;
use warnings;

open(my $file, "<", "myfile.txt")
    or die "Can't open file: $!";
my $text = do { local $/; <$file> };

if ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
    print $1;

}

print $1 while $text =~ /(<\s*script[^<]*>.*?<\/script>)/gs;
 
S

sln

Hi,

I'm using Perl 5.8.8 on Mac 10.5.6. I found this script online for
matching a pattern across multiple lines. The problem is, it only
prints out one instance of the expression, and I would like it to
print out all instances. What can I change so that it will print out
all instances?


#!/usr/bin/perl
use strict;
use warnings;

open(my $file, "<", "myfile.txt")
or die "Can't open file: $!";
my $text = do { local $/; <$file> };

if ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
print $1;
}



Thanks, - Dave

'while()' should work as others have said.

The above regex should take into account these forms:
<tag>
<tag/>
<tag attr> content </tag>
<tag attr/>

Try this. It takes into account all the above forms
plus handles attributes fairly well, without the need for
[^<]*, where the actual character '<' can exist in the value
part. Handling attrib/vals correctly and taking acccount of all
valid forms are important, it all goes toward partitioning the
data.

Also, this is a complex parse. It includes multiple atomic
markup units, which is debatably <tag> style and content.
Content being the current state that is not markup.
Ideally, the unit is parsed to find the start element 'script',
recording is turned on, then off at the end element 'script'.

As it is now, the regex you are using won't correctly parse the
$text string below.

Good luck!
-sln

------------
use strict;
use warnings;

my $text = <<HTML;
<script />
<notme>
<script attr = "asdf" attr = 'wafsd'/>
<script a = "asdf" b= 'wafsd'>
use strict;
use warnings;
print "hello world, I'm a <tag>\\n";
</script>
<script>
// comment me out c++ style
/* now c style
*/
</script>
HTML

my $name = 'script';

my $rx = qr /
(
< $name (?: \s+ (?: ".*?" | '.*?' | [^>]*? )+ )? \s* \/ >
|
< $name (?: \s+ (?: ".*?" | '.*?' | [^>]*? )+ )* \s* > .*? <\/$name\s*>
)
/xs;

while ( $text =~ /$rx/g) {
print '-'x20,"\n",$1,"\n";
}

__END__
Output:

--------------------
<script />
--------------------
<script attr = "asdf" attr = 'wafsd'/>
--------------------
<script a = "asdf" b= 'wafsd'>
use strict;
use warnings;
print "hello world, I'm a <tag>\n";
</script>
--------------------
<script>
// comment me out c++ style
/* now c style
*/
</script>
 
S

sln

Hi,

I'm using Perl 5.8.8 on Mac 10.5.6. I found this script online for
matching a pattern across multiple lines. The problem is, it only
prints out one instance of the expression, and I would like it to
print out all instances. What can I change so that it will print out
all instances?


#!/usr/bin/perl
use strict;
use warnings;

open(my $file, "<", "myfile.txt")
or die "Can't open file: $!";
my $text = do { local $/; <$file> };

if ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
print $1;
}



Thanks, - Dave

'while()' should work as others have said.

The above regex should take into account these forms:
<tag>
<tag/>
<tag attr> content </tag>
<tag attr/>

Try this. It takes into account all the above forms
plus handles attributes fairly well, without the need for
[^<]*, where the actual character '<' can exist in the value
part. Handling attrib/vals correctly and taking acccount of all
valid forms are important, it all goes toward partitioning the
data.

Also, this is a complex parse. It includes multiple atomic
markup units, which is debatably <tag> style and content.
Content being the current state that is not markup.
Ideally, the unit is parsed to find the start element 'script',
recording is turned on, then off at the end element 'script'.

As it is now, the regex you are using won't correctly parse the
$text string below.

Late addition:

But alas, no simple regex is going to handle nesting correctly
unless there is recursion. Below handles recursive tags, but
requires Perl 5.10 or better.

usage: html_rx.pl [<tag name> [file name]] - default, if no params

Cmd line examples:

*> html_rx.pl form junk.html - Finds 'form' blocks in html file

*> html_rx.pl script junk.html - Finds 'script' blocks in html file

*> html_rx.pl "(?i)script|object" junk.html
- Finds either 'script' or 'object' blocks, case insensitive
(good little markups will be properly nested, ie. those that have terminators)

-sln
-------------

*> perl html_rx.pl "(?i)script|object"

File name: __DATA__
Tag name: (?i)script|object

-------------------- ** type 1
<script attr = "asdf" attr = 'wafsd' />
-------------------- ** type 2
<script>
use strict;
use warnings;
print "hello world, I'm a <tag>\n";
<script a = "it's" b= 'terminated'/>
<object></object>
<script>
// comment me out c++ style
/* now c style
*/
</script>
</script>
-------------------- ** type 1
<script />

====================
Summary
File name: __DATA__
Tag name: (?i)script|object
type1 <tag, tag-attr /> = 2
type2 <tag, tag-attr>..</tag> = 1

*>
-------------

## html_rx.pl
## -sln

use strict;
use warnings;

require 5.010_000;

# usage: html_rx.pl [<tag name> [file name]]
# ---------------------------------------------
my ($tag,$fname) = @ARGV;
my $text;

$tag = 'script' unless defined $tag;
if (defined $fname) {
open my $fh, '<', $fname or die "Can't open file '$fname' : $!";
$text = join '',<$fh>;
close $fh;
} else {
$fname = '__DATA__';
$text = join '',<DATA>;
}

my ($terminated, $open, $close) =
(
qr {< (?:$tag) (?:\s+[^>]*)? />}x,
qr {< (?:$tag) (?:\s+[^>]*? \s*[^/]> | \s*>) }x,
qr {</ (?:$tag) \s*> }x
);

my $rx = qr {
(
(?: $terminated ) # <tag [attr] />
)
| # OR ...
(
(?: $open ) # <tag [attr] >
(?:
(?: (?!$open|$close) . )++ # possessive
|
(?2) # recurse group 2
)*
(?: $close ) # </tag>
)
}xs;

print "\n",<<INFO;
File name: $fname
Tag name: $tag\n
INFO

my ($cnt1,$cnt2) = (0,0);

while ( $text =~ /$rx/g) {
print '-'x20;
if (defined $1) {
print " ** type 1\n",$1,"\n" ;
$cnt1++;
} else {
print " ** type 2\n",$2,"\n" ;
$cnt2++;
}
}

print "\n",'='x20,"\nSummary\n",<<SUMMARY;
File name: $fname
Tag name: $tag
type1 <tag, tag-attr /> = $cnt1
type2 <tag, tag-attr>..</tag> = $cnt2\n
SUMMARY

__DATA__

<script attr = "asdf" attr = 'wafsd' />
</script>
<script>
<script>
<script>
use strict;
use warnings;
print "hello world, I'm a <tag>\n";
<script a = "it's" b= 'terminated'/>
<object></object>
<script>
// comment me out c++ style
/* now c style
*/
</script>
</script>
<script>
<script />
<notme>
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,768
Messages
2,569,575
Members
45,054
Latest member
LucyCarper

Latest Threads

Top