matching over multiple lines

Discussion in 'Perl Misc' started by cyborg, Nov 21, 2006.

  1. cyborg

    cyborg Guest

    When I was starting to learn regexes in Perl (2 days ago), I picked up
    some books and some websites and read a bunch. When I thought I was
    ready to go, I realized none of those sources taught me how to actually
    write a Perl program from start to end that would open the file I
    wanted to parse and save the parsing results to a second file. That was
    a bummer.

    Bla bla bla etc etc etc all that boring stuff everyone hates to read
    about other people's lives bla bla bla.

    Okay, finally I have created a template for my regexes to parse a file,
    save results to another file, and have its matches work OVER MULTIPLE
    LINES. I know this is far from exciting for you perl hackers, but do
    realize that the books I've read don't teach this. (I've got ADHD so if
    they do and I'm just a poor reader, nevermind that statement).
    Also, please understand that when I say "i have created" I mean "I,
    with the help of loads of other people's work and some people's help"
    (because let's face it, it's not that big of a file to need help from
    loads of people). Of course I don't want credit for this, what I do
    want is help. Everything works but some parts I don't understand why.
    Also, I know there are probably better ways to go about some stuff,
    like I think there's that "or die" stuff that would do what the
    "unless" is doing now.

    There are also some comments to help beginners (actually they're to
    help me, a beginner too, not forget what each of the lines do)
    understand what each part is doing and how it contributes to the
    program.

    So consider this thread as if I were asking you "how do I match over
    multiple lines? could you provide full perl code?" and then you replied
    to me with some code.

    Here it is:

    #############################################
    #*                                         *#
    #     TEMPLATE FOR PERL REGEX PROGRAMS      #
    #                                           #
    # THIS TEMPLATE DOES THE FOLLOWING:         #
    #                                           #
    #=> reads input file and writes output file #
    #=> undefines line terminator so that you   #
    #   can match over multiple lines           #
    #                                           #
    #                                           #
    # > to choose files from the prompt:        #
    #   my $source=$ARGV[0];                    #
    #   my $dest=$ARGV[1];                      #
    #                                           #
    #*                                         *#
    #############################################


    # all variables must be declared
    #______________________perl warns us about anything wrong
    use strict;
    use warnings;

    #______________________these are the filenames
    my $source="r.txt";
    my $dest="r2.txt";

    #______________________to store the lines we'll be reading
    my $line;

    #______________________do away with line breaks
    $/ = undef;
    # comment the above line out and the parser won't
    # match over multiple lines anymore.

    #______________________check file existence and permission
    unless($source and $dest){
        print "Source or destination file missing\n";
    }

    #______________________open input and output files
    open SOURCE, "<$source";
    open DEST, ">$dest";

    #______________________read file till eof
    while($line = <SOURCE>){

        # replace "if" for "while" and it will print the first
        # match and nothing more. don't know why.
        # take away g and it will print the first match infinite
        # times. don't know why.
        # take away s and it won't match over multiple lines
        # anymore. that's because s makes . match \n
        # the $/=undef above is just for the file reading
        # part, i guess. it doesn't nullify \n

        while($line =~ m/<(.*?)>/gs) {
            print DEST "----$1----\n";
        }
    }

    #______________________close input and output files
    close SOURCE;
    close DEST;




    Just save r.txt with this to test it:

    tag"><1b>word<2/div>

    <3div class="okay"><4i>o.
    <5/i> notgood,
    akdjsf jkdmhf djaf =¨?#$
    <6flunk>yes but<7
    this is
    .. a
    .. multiline
    .. string, the kind of which my
    .. template matches
    :) , yes, > maybe we can

    but we should <8be> careful




    Any improvements will be appreciated.
    cyborg, Nov 21, 2006
    #1

  2. cyborg wrote:

    > So consider this thread as if I were asking you "how do I match over
    > multiple lines? could you provide full perl code?" and then you replied
    > to me with some code.



    > $/ = undef;
    > # comment the above line out and the parser won't
    > # match over multiple lines anymore.



    The comment is plain wrong.

    comment the above line out and the string won't have
    multiple lines in it.

    $/ has NO effect on pattern matching.

    It may have an effect on the string that you are matching the
    pattern against however.
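
    A minimal sketch of that distinction, assuming an in-memory filehandle (Perl 5.8 or later) in place of r.txt. The helper name read_records() is made up for this illustration. The pattern is identical in both runs; only the records handed to it differ.

```perl
use strict;
use warnings;

# Read every record from an in-memory file, with $/ set as requested.
# read_records() is a hypothetical helper written for this illustration.
sub read_records {
    my ($text, $separator) = @_;
    local $/ = $separator;              # restored automatically on return
    open my $fh, '<', \$text or die "could not open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my $text = "<a\nb>\n<c>\n";

my @lines = read_records($text, "\n");   # one record per line
my @whole = read_records($text, undef);  # slurp: one record, newlines intact

printf "line mode:  %d records\n", scalar @lines;
printf "slurp mode: %d record(s)\n", scalar @whole;

# The same pattern, applied to each record in turn.
my @from_lines = map { /<(.*?)>/gs } @lines;
my @from_whole = map { /<(.*?)>/gs } @whole;

print "matches in line mode:  ", scalar @from_lines, "\n";
print "matches in slurp mode: ", scalar @from_whole, "\n";
```

    In line mode the tag that spans two lines is never seen as a whole, because no single record contains both its "<" and its ">".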


    > #______________________check file existence and permission



    that comment is misleading, as you do NOT check for existence nor
    for permissions.

    You only check that the 2 variables contain true values.


    > unless($source and $dest){
    > print "Source or destination file missing\n";
    > }



    print "$source does not exist\n" unless -e $source;
    print "you do not have read permission on $source\n"
    unless -r $source;


    > #______________________open input and output files
    > open SOURCE, "<$source";
    > open DEST, ">$dest";



    You should always, yes *always*, check the return value from open():

    open SOURCE, "<$source" or die "could not open '$source' $!";

    Even better, use 3-argument open() and an indirect filehandle:

    open my $src, '<', $source or die "could not open '$source' $!";


    > while($line = <SOURCE>){


    while($line = <$src>){


    > # replace "if" for "while" and it will print the first
    > # match and nothing more. don't know why.



    Because while loops and if does not loop.


    > # take away g and it will print the first match infinite
    > # times. don't know why.



    Because the while() condition is never false.
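
    A small sketch of why that is, assuming a short test string instead of the file contents. With /g in scalar context the match position is remembered between attempts (see pos() in perlfunc); without /g every attempt starts over at offset 0. The counter in the second loop exists only so that the example terminates.

```perl
use strict;
use warnings;

my $text = "<a> <b> <c>";

# With /g in scalar context, each successful match leaves pos($text) just
# after the match, and the next attempt resumes from there.
my @seen;
while ($text =~ /<(.*?)>/g) {
    push @seen, "$1 ends at " . pos($text);
}
print "$_\n" for @seen;

# Without /g every attempt starts from offset 0 again, so the condition is
# true forever. A counter caps the loop here so the sketch terminates.
my ($tries, @repeat) = (0);
while ($text =~ /<(.*?)>/) {
    push @repeat, $1;
    last if ++$tries >= 3;
}
print "without /g: @repeat\n";
```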


    > # the $/=undef above is just for the file reading
    > # part, i guess. it doesn't nullify \n



    Exactly so, but that isn't what you said above...


    > while($line =~ m/<(.*?)>/gs) {
    > print DEST "----$1----\n";
    > }



    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Nov 21, 2006
    #2

  3. cyborg

    cyborg Guest

    I'll number this just for organization's sake.

    1 - $/ has no effect on pattern matching.

    Well, yes, I understand your point of view. But if I take it out it
    won't match multilinely and if I leave it in it will, so do you
    understand my point of view? :)

    -----

    2 - You only check that the 2 variable contain true values.

    yes, very misleading, only now do I realize my mistake. thanks for
    pointing it out.

    -----

    3 - print "$source does not exist\n" unless -e $source;

    -e and -r, nice! probably for Exist and openRead, of course.

    -----

    4 - You should always, yes *always*, check the return value from open()

    yes, that's the "or die" thingy. thanks.

    -----

    5 - Even better, use 3-argument open() and an indirect filehandle

    now why exactly is $src better/safer than <SOURCE>?

    -----

    6 - Because while loops and if does not loop.

    heh, I know the difference between while and if. I'm a c/c++
    programmer. What I don't know is why do I ever need it to loop. What is
    it looping in? And wouldn't the outer loop loop it for me?

    -----

    7 - Because the while() condition is never false.

    which while is never false? outer while or inner while?

    -----

    8 - Exactly so, but that isn't what you said above...

    haha, that proves I knew it all along :p

    -----

    So all that applied leaves me with this:
    Could you please check the last lines, as I'm not sure how to close
    indirect filehandles, and not sure how to print to them?


    use strict;
    use warnings;

    my $source="r.txt";
    my $dest="r2.txt";

    my $line;

    $/ = undef;

    print "$source does not exist\n" unless -e $source;
    print "you do not have read permission on $source\n" unless -r $source;

    open my $src, '<', $source or die "could not open '$source' $!";
    open my $dst, '>', $dest or die "could not open '$dest' $!";

    while($line = <$src>){
        while($line =~ m/<(.*?)>/gs) {
            print $dst "----$1----\n";
        }
    }

    close $src;
    close $dst;


    Thanks a million!
    cyborg, Nov 21, 2006
    #3
  4. Uri Guttman

    Uri Guttman Guest

    >>>>> "c" == cyborg writes:

    c> I'll number this just for organization's sake.
    c> 1 - $/ has no effect on pattern matching.

    c> Well, yes, I understand your point of view. But if I take it out it
    c> won't match multilinely and if I leave it in it will, so do you
    c> understand my point of view? :)

    your point of view is wrong as you don't understand what $/
    does. read perldoc perlvar. it has NOTHING to do with matching. what you
    did was slurp in the file instead of reading it line by line. think a
    bit, if you read it line by line how could you match over multiple
    lines? you never have more than one line in ram!

    and try using File::Slurp for this as it is cleaner and can be much
    faster.


    c> 5 - Even better, use 3-argument open() and an indirect filehandle

    c> now why exactly is $src better/safer than <SOURCE>?

    $src is a lexical and SOURCE is a global and a symref. the former is
    safe and the latter open for possible bugs.
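
    One practical consequence, sketched under the assumption of an in-memory file standing in for r2.txt: a lexical handle is an ordinary scalar, so it can be handed to a sub and it is closed when it goes out of scope. The helper name write_matches() is made up for this example.

```perl
use strict;
use warnings;

# A lexical handle is an ordinary scalar, so it can be passed to a sub like
# any other value. write_matches() is a hypothetical helper for this sketch.
sub write_matches {
    my ($out, @matches) = @_;
    print {$out} "----$_----\n" for @matches;
    return;
}

my $buffer = '';
{
    open my $dst, '>', \$buffer or die "could not open in-memory file: $!";
    write_matches($dst, 'one', 'two');
    # $dst goes out of scope at the closing brace; perl closes the handle
    # then, although an explicit close lets you check for write errors.
}
print $buffer;
```

    Printing to the handle and closing it use the same syntax as with a bareword: print $dst ... and close $dst.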

    c> -----

    c> 6 - Because while loops and if does not loop.

    c> heh, I know the difference between while and if. I'm a c/c++
    c> programmer. What I don't know is why do I ever need it to loop. What is
    c> it looping in? And wouldn't the outer loop loop it for me?

    you don't get file vs line i/o and loops. when you undef'ed $/ you SLURP
    the entire file when you call <>. NO LOOP needed. when you want to run
    the regex over and over you need a loop and the /g modifier. LOOP needed
    (or implied with /g in list context).


    c> -----

    c> 7 - Because the while() condition is never false.

    c> which while is never false? outer while or inner while?

    you mentioned an infinite loop. so you should know which loop that is.

    c> $/ = undef;

    lose that and use File::Slurp. it will clear up your code.

    c> print "$source does not exist\n" unless -e $source;
    c> print "you do not have read permission on $source\n" unless -r $source;

    c> open my $src, '<', $source or die "could not open '$source' $!";

    no need for that with File::Slurp.

    c> open my $dst, '>', $dest or die "could not open '$dest' $!";

    c> while($line = <$src>){

    there is NO line there. you slurp in the entire file the first time you
    call <$src> because you undef'ed $/.

    use File::Slurp ;

    my $file_text = read_file( $source ) ;

    NO LOOP NEEDED AS YOU DO THAT ONE TIME ONLY. you want multiline matches
    so you can't do line by line i/o.

    this should be obvious to any c/c++ coder! :)

    c> while($line =~ m/<(.*?)>/gs) {

    that is the real (and now only) loop of the program.

    c> print $dst "----$1----\n";
    c> }
    c> }

    c> close $src;
    no need for that with File::Slurp.

    c> close $dst;

    this reduces to (untested and missing some code):

    use File::Slurp ;

    my $file_text = read_file( $source ) ;
    print $dst map "----$_----\n", $file_text =~ m/<(.*?)>/gs ;

    look ma! NO (explicit) LOOPS!!
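
    For anyone who does not have File::Slurp installed, roughly the same shape can be sketched with core Perl only. This is a swapped-in technique, not the File::Slurp call above: it localizes $/ inside a do block so the rest of the program still reads line by line. The sub name extract_tags() is made up for this example, and the file names come from the command line.

```perl
use strict;
use warnings;

# extract_tags() is a hypothetical name; it reads $source in one go and
# writes one line per <...> match to $dest, returning the match count.
sub extract_tags {
    my ($source, $dest) = @_;

    open my $src, '<', $source or die "could not open '$source': $!";
    my $text = do { local $/; <$src> };   # $/ is undef only inside this block
    close $src;
    $text = '' unless defined $text;

    my @matches = $text =~ /<(.*?)>/gs;   # list context: every capture at once

    open my $dst, '>', $dest or die "could not open '$dest': $!";
    print {$dst} map { "----$_----\n" } @matches;
    close $dst or die "could not close '$dest': $!";

    return scalar @matches;
}

# Run only when two file names were given, e.g.: perl tags.pl r.txt r2.txt
extract_tags(@ARGV) if @ARGV == 2;
```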

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
    Uri Guttman, Nov 21, 2006
    #4
  5. On 11/20/2006 07:39 PM, cyborg wrote:
    > When I was starting to learn regexes in Perl (2 days ago), I picked up
    > some books and some websites and read a bunch. When I thought I was
    > ready to go, I realized none of those sources taught me how to actually
    > write a Perl program from start to end that would open the file I
    > wanted to parse and save the parsing results to a second file. That was
    > a bummer.
    >
    > Bla bla bla etc etc etc all those boring stuff everyone hates to read
    > about other people's life bla bla bla.
    >
    > Okay, finally I have created a template for my regexes to parse a file,
    > save results to another file, and have its matches work OVER MULTIPLE
    > LINES.
    > [... program snipped ...]
    > Any improvements will be appreciated.
    >


    You could do that, or you could do this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Fatal qw(open close);
    die "need source and destination file names" if (@ARGV < 2);

    open (my $fs, '<', $ARGV[0]);
    open (my $fd, '>', $ARGV[1]);

    my @list = join('',<$fs>) =~ /<(.*?)>/sg;
    print $fd "------$_-------\n" for @list;

    close $fd;
    close $fs;
    __END__


    --
    Mumia W. (reading news), Nov 21, 2006
    #5
