Searching all instances of a pattern across multi-lines

Discussion in 'Perl Misc' started by laredotornado, Dec 13, 2009.

  1. Hi,

    I'm using Perl 5.8.8 on Mac 10.5.6. I found this script online for
    matching a pattern across multiple lines. The problem is, it only
    prints out one instance of the expression, and I would like it to
    print out all instances. What can I change so that it will print out
    all instances?


    #!/usr/bin/perl
    use strict;
    use warnings;

    open(my $file, "<", "myfile.txt")
    or die "Can't open file: $!";
    my $text = do { local $/; <$file> };

    if ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
    print $1;
    }



    Thanks, - Dave
     
    laredotornado, Dec 13, 2009
    #1
    1. Advertising

  2. laredotornado

    C.DeRykus Guest

    On Dec 13, 1:22 pm, laredotornado <> wrote:
    > Hi,
    >
    > I'm using Perl 5.8.8 on Mac 10.5.6.  I found this script online for
    > matching a pattern across multiple lines.  The problem is, it only
    > prints out one instance of the expression, and I would like it to
    > print out all instances.  What can I change so that it will print out
    > all instances?
    >
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    >
    > open(my $file, "<", "myfile.txt")
    >     or die "Can't open file: $!";
    > my $text = do { local $/; <$file> };
    >
    > if ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
    >     print $1;
    >
    > }


    print $1 while $text =~ /(<\s*script[^<]*>.*?<\/script>)/gs;

    --
    Charles DeRykus
     
    C.DeRykus, Dec 13, 2009
    #2
    1. Advertising

  3. laredotornado

    Guest

    On Sun, 13 Dec 2009 13:22:15 -0800 (PST), laredotornado <> wrote:

    >Hi,
    >
    >I'm using Perl 5.8.8 on Mac 10.5.6. I found this script online for
    >matching a pattern across multiple lines. The problem is, it only
    >prints out one instance of the expression, and I would like it to
    >print out all instances. What can I change so that it will print out
    >all instances?
    >
    >
    >#!/usr/bin/perl
    >use strict;
    >use warnings;
    >
    >open(my $file, "<", "myfile.txt")
    > or die "Can't open file: $!";
    >my $text = do { local $/; <$file> };
    >
    >if ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
    > print $1;
    >}
    >
    >
    >
    >Thanks, - Dave


    'while()' should work as others have said.

    The above regex should take into account these forms:
    <tag>
    <tag/>
    <tag attr> content </tag>
    <tag attr/>

    Try this. It takes into account all the above forms
    plus handles attributes fairly well, without the need for
    [^<]*, where the actual character '<' can exist in the value
    part. Handling attrib/vals correctly and taking acccount of all
    valid forms are important, it all goes toward partitioning the
    data.

    Also, this is a complex parse. It includes multiple atomic
    markup units, which is debatably <tag> style and content.
    Content being the current state that is not markup.
    Ideally, the unit is parsed to find the start element 'script',
    recording is turned on, then off at the end element 'script'.

    As it is now, the regex you are using won't correctly parse the
    $text string below.

    Good luck!
    -sln

    ------------
    use strict;
    use warnings;

    my $text = <<HTML;
    <script />
    <notme>
    <script attr = "asdf" attr = 'wafsd'/>
    <script a = "asdf" b= 'wafsd'>
    use strict;
    use warnings;
    print "hello world, I'm a <tag>\\n";
    </script>
    <script>
    // comment me out c++ style
    /* now c style
    */
    </script>
    HTML

    my $name = 'script';

    my $rx = qr /
    (
    < $name (?: \s+ (?: ".*?" | '.*?' | [^>]*? )+ )? \s* \/ >
    |
    < $name (?: \s+ (?: ".*?" | '.*?' | [^>]*? )+ )* \s* > .*? <\/$name\s*>
    )
    /xs;

    while ( $text =~ /$rx/g) {
    print '-'x20,"\n",$1,"\n";
    }

    __END__
    Output:

    --------------------
    <script />
    --------------------
    <script attr = "asdf" attr = 'wafsd'/>
    --------------------
    <script a = "asdf" b= 'wafsd'>
    use strict;
    use warnings;
    print "hello world, I'm a <tag>\n";
    </script>
    --------------------
    <script>
    // comment me out c++ style
    /* now c style
    */
    </script>
     
    , Dec 13, 2009
    #3
  4. laredotornado

    Guest

    On Sun, 13 Dec 2009 15:43:13 -0800, wrote:

    >On Sun, 13 Dec 2009 13:22:15 -0800 (PST), laredotornado <> wrote:
    >
    >>Hi,
    >>
    >>I'm using Perl 5.8.8 on Mac 10.5.6. I found this script online for
    >>matching a pattern across multiple lines. The problem is, it only
    >>prints out one instance of the expression, and I would like it to
    >>print out all instances. What can I change so that it will print out
    >>all instances?
    >>
    >>
    >>#!/usr/bin/perl
    >>use strict;
    >>use warnings;
    >>
    >>open(my $file, "<", "myfile.txt")
    >> or die "Can't open file: $!";
    >>my $text = do { local $/; <$file> };
    >>
    >>if ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
    >> print $1;
    >>}
    >>
    >>
    >>
    >>Thanks, - Dave

    >
    >'while()' should work as others have said.
    >
    >The above regex should take into account these forms:
    > <tag>
    > <tag/>
    > <tag attr> content </tag>
    > <tag attr/>
    >
    >Try this. It takes into account all the above forms
    >plus handles attributes fairly well, without the need for
    >[^<]*, where the actual character '<' can exist in the value
    >part. Handling attrib/vals correctly and taking acccount of all
    >valid forms are important, it all goes toward partitioning the
    >data.
    >
    >Also, this is a complex parse. It includes multiple atomic
    >markup units, which is debatably <tag> style and content.
    >Content being the current state that is not markup.
    >Ideally, the unit is parsed to find the start element 'script',
    >recording is turned on, then off at the end element 'script'.
    >
    >As it is now, the regex you are using won't correctly parse the
    >$text string below.
    >


    Late addition:

    But alas, no simple regex is going to handle nesting correctly
    unless there is recursion. Below handles recursive tags, but
    requires Perl 5.10 or better.

    usage: html_rx.pl [<tag name> [file name]] - default, if no params

    Cmd line examples:

    *> html_rx.pl form junk.html - Finds 'form' blocks in html file

    *> html_rx.pl script junk.html - Finds 'script' blocks in html file

    *> html_rx.pl "(?i)script|object" junk.html
    - Finds either 'script' or 'object' blocks, case insensitive
    (good little markups will be properly nested, ie. those that have terminators)

    -sln
    -------------

    *> perl html_rx.pl "(?i)script|object"

    File name: __DATA__
    Tag name: (?i)script|object

    -------------------- ** type 1
    <script attr = "asdf" attr = 'wafsd' />
    -------------------- ** type 2
    <script>
    use strict;
    use warnings;
    print "hello world, I'm a <tag>\n";
    <script a = "it's" b= 'terminated'/>
    <object></object>
    <script>
    // comment me out c++ style
    /* now c style
    */
    </script>
    </script>
    -------------------- ** type 1
    <script />

    ====================
    Summary
    File name: __DATA__
    Tag name: (?i)script|object
    type1 <tag, tag-attr /> = 2
    type2 <tag, tag-attr>..</tag> = 1

    *>
    -------------

    ## html_rx.pl
    ## -sln

    use strict;
    use warnings;

    require 5.010_000;

    # usage: html_rx.pl [<tag name> [file name]]
    # ---------------------------------------------
    my ($tag,$fname) = @ARGV;
    my $text;

    $tag = 'script' unless defined $tag;
    if (defined $fname) {
    open my $fh, '<', $fname or die "Can't open file '$fname' : $!";
    $text = join '',<$fh>;
    close $fh;
    } else {
    $fname = '__DATA__';
    $text = join '',<DATA>;
    }

    my ($terminated, $open, $close) =
    (
    qr {< (?:$tag) (?:\s+[^>]*)? />}x,
    qr {< (?:$tag) (?:\s+[^>]*? \s*[^/]> | \s*>) }x,
    qr {</ (?:$tag) \s*> }x
    );

    my $rx = qr {
    (
    (?: $terminated ) # <tag [attr] />
    )
    | # OR ...
    (
    (?: $open ) # <tag [attr] >
    (?:
    (?: (?!$open|$close) . )++ # possessive
    |
    (?2) # recurse group 2
    )*
    (?: $close ) # </tag>
    )
    }xs;

    print "\n",<<INFO;
    File name: $fname
    Tag name: $tag\n
    INFO

    my ($cnt1,$cnt2) = (0,0);

    while ( $text =~ /$rx/g) {
    print '-'x20;
    if (defined $1) {
    print " ** type 1\n",$1,"\n" ;
    $cnt1++;
    } else {
    print " ** type 2\n",$2,"\n" ;
    $cnt2++;
    }
    }

    print "\n",'='x20,"\nSummary\n",<<SUMMARY;
    File name: $fname
    Tag name: $tag
    type1 <tag, tag-attr /> = $cnt1
    type2 <tag, tag-attr>..</tag> = $cnt2\n
    SUMMARY

    __DATA__

    <script attr = "asdf" attr = 'wafsd' />
    </script>
    <script>
    <script>
    <script>
    use strict;
    use warnings;
    print "hello world, I'm a <tag>\n";
    <script a = "it's" b= 'terminated'/>
    <object></object>
    <script>
    // comment me out c++ style
    /* now c style
    */
    </script>
    </script>
    <script>
    <script />
    <notme>
     
    , Dec 16, 2009
    #4
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. John Wohlbier
    Replies:
    2
    Views:
    368
    Josiah Carlson
    Feb 22, 2004
  2. Replies:
    6
    Views:
    323
    Victor Bazarov
    Aug 13, 2005
  3. Replies:
    8
    Views:
    469
    James Stroud
    Jan 29, 2009
  4. Nick Bo
    Replies:
    5
    Views:
    168
    Tim Hunter
    Sep 29, 2008
  5. mike
    Replies:
    3
    Views:
    101
Loading...

Share This Page