matching over multiple lines

cyborg

When I was starting to learn regexes in Perl (2 days ago), I picked up
some books and some websites and read a bunch. When I thought I was
ready to go, I realized none of those sources taught me how to actually
write a Perl program from start to end that would open the file I
wanted to parse and save the parsing results to a second file. That was
a bummer.

Bla bla bla etc etc etc all those boring stuff everyone hates to read
about other people's life bla bla bla.

Okay, finally I have created a template for my regexes to parse a file,
save results to another file, and have its matches work OVER MULTIPLE
LINES. I know this is far from exciting for you perl hackers, but do
realize that the books I've read don't teach this. (I've got ADHD so if
they do and I'm just a poor reader, nevermind that statement).
Also, please understand that when I say "i have created" I mean "I,
with the help of loads of other people's work and some people's help"
(because let's face it, it's not that big of a file to need help from
loads of people). Of course I don't want credit for this, what I do
want is help. Everything works but some parts I don't understand why.
Also, I know there are probably better ways to go about some stuff,
like I think there's that "or die" stuff that would do what the
"unless" is doing now.

There are also some comments to help beginners (actually they're to
help me, a beginner too, not forget what each of the lines do)
understand what each part is doing and how it contributes to the
program.

So consider this thread as if I were asking you "how do I match over
multiple lines? could you provide full perl code?" and then you replied
me with some code.

Here it is:

#############################################
#*                                         *#
# TEMPLATE FOR PERL REGEX PROGRAMS          #
#                                           #
# THIS TEMPLATE DOES THE FOLLOWING:         #
#                                           #
#=> reads input file and writes output file #
#=> undefines line terminator so that you   #
#   can match across lines automatically    #
#                                           #
#                                           #
# > to choose files from the prompt:        #
#   my $source=$ARGV[0];                    #
#   my $dest=$ARGV[1];                      #
#                                           #
#*                                         *#
#############################################


# all variables must be declared
#______________________perl warns us about anything wrong
use strict;
use warnings;

#______________________these are the filenames
my $source="r.txt";
my $dest="r2.txt";

#______________________to store the lines we'll be reading
my $line;

#______________________do away with line breaks
$/ = undef;
# comment the above line out and the parser won't
# match over multiple lines anymore.

#______________________check file existence and permission
unless($source and $dest){
    print "Source or destination file missing\n";
}

#______________________open input and output files
open SOURCE, "<$source";
open DEST, ">$dest";

#______________________read file till eof
while($line = <SOURCE>){

    # replace "if" for "while" and it will print the first
    # match and nothing more. don't know why.
    # take away g and it will print the first match infinite
    # times. don't know why.
    # take away s and it won't match over multiple lines
    # anymore. that's because s makes . match \n
    # the $/=undef above is just for the file reading
    # part, i guess. it doesn't nullify \n

    while($line =~ m/<(.*?)>/gs) {
        print DEST "----$1----\n";
    }
}

#______________________close input and output files
close SOURCE;
close DEST;




Just save r.txt with this to test it:

tag"><1b>word<2/div>

<3div class="okay"><4i>o.
<5/i> notgood,
akdjsf jkdmhf djaf =¨?#$
<6flunk>yes but<7
this is
.. a
.. multiline
.. string, the kind of which my
.. template matches
:) , yes, > maybe we can

but we should <8be> careful




Any improvements will be appreciated.
 
Tad McClellan

cyborg said:
So consider this thread as if I were asking you "how do I match over
multiple lines? could you provide full perl code?" and then you replied
me with some code.

$/ = undef;
# comment the above line out and the parser won't
# match over multiple lines anymore.


The comment is plain wrong.

comment the above line out and the string won't have
multiple lines in it.

$/ has NO effect on pattern matching.

It may have an effect on the string that you are matching the
pattern against however.
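A small sketch of that distinction, for anyone following along. The sample text and the count_matches helper are made up for illustration; the regex is the one from the template. The only thing that changes between the two calls is $/, and the only thing $/ changes is how much text each read hands to the regex:

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for r.txt.
my $text = "a <one\ntwo> b\n<three> c\n";

# Count /<(.*?)>/gs matches in every record read from an in-memory filehandle.
sub count_matches {
    my ($separator) = @_;
    local $/ = $separator;              # only changes how <$fh> splits records
    open my $fh, '<', \$text or die "could not open in-memory file: $!";
    my $count = 0;
    while (my $record = <$fh>) {
        $count++ while $record =~ /<(.*?)>/gs;
    }
    close $fh;
    return $count;
}

my $by_line = count_matches("\n");      # default: one line per read
my $slurped = count_matches(undef);     # slurp mode: whole file in one read
print "line by line: $by_line, slurped: $slurped\n";
```

Read line by line, the tag that spans two lines is never seen whole, so only one match is counted; slurped, both tags are found. The pattern and its modifiers are identical in both runs.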

#______________________check file existence and permission


that comment is misleading, as you do NOT check for existence nor
for permissions.

You only check that the 2 variables contain true values.

unless($source and $dest){
print "Source or destination file missing\n";
}


print "$source does not exist\n" unless -e $source;
print "you do not have read permission on $source\n"
unless -r $source;
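A minimal sketch of those two file tests wrapped in a helper, so the result can be acted on instead of only printed. The problem_with name and the scratch files are hypothetical. The tests only describe the file at the moment they run, so the return value of open() still needs checking afterwards:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir     = tempdir(CLEANUP => 1);    # hypothetical scratch directory
my $present = "$dir/r.txt";
my $absent  = "$dir/missing.txt";

# Create an empty file so there is something for -e and -r to find.
open my $fh, '>', $present or die "could not open '$present': $!";
close $fh or die "could not close '$present': $!";

# Return a description of what is wrong with $path, or '' if it looks usable.
sub problem_with {
    my ($path) = @_;
    return "$path does not exist" unless -e $path;
    return "you do not have read permission on $path" unless -r $path;
    return '';
}

my $ok_report      = problem_with($present);
my $missing_report = problem_with($absent);
print "present: ", ($ok_report || 'ok'), "\n";
print "absent:  $missing_report\n";
```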

#______________________open input and output files
open SOURCE, "<$source";
open DEST, ">$dest";


You should always, yes *always*, check the return value from open():

open SOURCE, "<$source" or die "could not open '$source' $!";

Even better, use 3-argument open() and an indirect filehandle:

open my $src, '<', $source or die "could not open '$source' $!";
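A short sketch of both points together, using a scratch file made with File::Temp so it can run anywhere (the paths are hypothetical). It writes and reads through lexical handles with 3-argument open(), then shows what a failed open() gives back:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);       # hypothetical scratch directory
my $path = "$dir/r.txt";

# Three-argument open: the mode is separate from the filename, so a name
# that begins with '<', '>' or '|' cannot change how the file is opened.
open my $out, '>', $path or die "could not open '$path' for writing: $!";
print {$out} "<tag>\n";
close $out or die "could not close '$path': $!";

open my $in, '<', $path or die "could not open '$path' for reading: $!";
my $first_line = <$in>;
close $in;

# A failed open returns false and sets $!, which is what "or die" reports.
my $opened = open my $missing, '<', "$dir/no-such-file";
my $error  = $opened ? '' : "$!";
print "read: $first_line";
print "open failed: $error\n" unless $opened;
```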

while($line = <SOURCE>){

# replace "if" for "while" and it will print the first
# match and nothing more. don't know why.


Because while loops and if does not loop.

# take away g and it will print the first match infinite
# times. don't know why.


Because the while() condition is never false.
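To see why, here is a small sketch with a made-up input string. With /g in scalar context the regex remembers where the last match ended, so the condition eventually fails; without /g it restarts from the beginning every time. The second loop is capped with a counter purely so the demonstration ends:

```perl
use strict;
use warnings;

my $line = "<a> <b> <c>";               # hypothetical input

# With /g in scalar context, each successful match starts where the previous
# one ended (see pos()), and the match fails once the string is exhausted.
my @found;
while ($line =~ /<(.*?)>/gs) {
    push @found, $1 . '@' . pos($line);
}
print "with /g: @found\n";

# Without /g every evaluation starts from the beginning, so the condition
# stays true; a counter caps the loop here to keep the demonstration finite.
my @repeats;
my $guard = 0;
while ($line =~ /<(.*?)>/s) {
    push @repeats, $1;
    last if ++$guard == 3;
}
print "without /g: @repeats\n";
```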

# the $/=undef above is just for the file reading
# part, i guess. it doesn't nullify \n


Exactly so, but that isn't what you said above...
 
cyborg

I'll number this just for organization's sake.

1 - $/ has no effect on pattern matching.

Well, yes, I understand your point of view. But if I take it out it
won't match multilinely and if I leave it in it will, so do you
understand my point of view? :)

-----

2 - You only check that the 2 variables contain true values.

yes, very misleading, only now do I realize my mistake. thanks for
pointing it out.

-----

3 - print "$source does not exist\n" unless -e $source;

-e and -r, nice! probably for Exist and openRead, of course.

-----

4 - You should always, yes *always*, check the return value from open()

yes, that's the "or die" thingy. thanks.

-----

5 - Even better, use 3-argument open() and an indirect filehandle

now why exactly is $src better/safer than <SOURCE>?

-----

6 - Because while loops and if does not loop.

heh, I know the difference between while and if. I'm a c/c++
programmer. What I don't know is why do I ever need it to loop. What is
it looping in? And wouldn't the outer loop loop it for me?

-----

7 - Because the while() condition is never false.

which while is never false? outer while or inner while?

-----

8 - Exactly so, but that isn't what you said above...

haha, that proves I knew it all along :p

-----

So all that applied leaves me with this:
Could you please check the last lines, as I'm not sure how to close
indirect filehandles, and not sure how to print into them?


use strict;
use warnings;

my $source="r.txt";
my $dest="r2.txt";

my $line;

$/ = undef;

print "$source does not exist\n" unless -e $source;
print "you do not have read permission on $source\n" unless -r $source;

open my $src, '<', $source or die "could not open '$source' $!";
open my $dst, '>', $dest or die "could not open '$dest' $!";

while($line = <$src>){
    while($line =~ m/<(.*?)>/gs) {
        print $dst "----$1----\n";
    }
}

close $src;
close $dst;


Thanks a million!
 
Uri Guttman

c> I'll number this just for organization's sake.
c> 1 - $/ has no effect on pattern matching.

c> Well, yes, I understand your point of view. But if I take it out it
c> won't match multilinely and if I leave it in it will, so do you
c> understand my point of view? :)

your point of view is wrong as you don't understand what $/
does. read perldoc perlvar. it has NOTHING to do with matching. what you
did was slurp in the file instead of reading it line by line. think a
bit, if you read it line by line how could you match over multiple
lines? you never have more than one line in ram!

and try using File::Slurp for this as it is cleaner and can be much
faster.


c> 5 - Even better, use 3-argument open() and an indirect filehandle

c> now why exactly is $src better/safer than <SOURCE>?

$src is a lexical and SOURCE is a global and a symref. the former is
safe and the latter open for possible bugs.
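A rough sketch of the scoping difference, reading from an in-memory string so no real file is needed (the two sub names are made up). The lexical handle is private to the sub that opened it; the bareword handle stays visible to everything else in the package:

```perl
use strict;
use warnings;

my $text = "first\nsecond\n";           # hypothetical file contents

sub read_first_line {
    # $fh exists only inside this sub and is closed when it goes out of scope.
    open my $fh, '<', \$text or die "could not open in-memory file: $!";
    return scalar <$fh>;
}

sub open_bareword {
    # SOURCE lives in the package symbol table, so any code in the package
    # that uses the same name shares (and can clobber) this handle.
    open SOURCE, '<', \$text or die "could not open in-memory file: $!";
    return;
}

my $lexical_line = read_first_line();
open_bareword();
my $global_line = <SOURCE>;             # still readable outside the sub
close SOURCE;
print "lexical: $lexical_line", "global:  $global_line";
```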

c> -----

c> 6 - Because while loops and if does not loop.

c> heh, I know the difference between while and if. I'm a c/c++
c> programmer. What I don't know is why do I ever need it to loop. What is
c> it looping in? And wouldn't the outer loop loop it for me?

you don't get file vs line i/o and loops. when you undef'ed $/ you SLURP
the entire file when you call <>. NO LOOP needed. when you want to run
the regex over and over you need a loop and the /g modifier. LOOP needed
(or implied with /g in list context).
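A minimal sketch of the two ways of using /g, on a made-up string standing in for the slurped file. In list context one match expression returns every capture; in scalar context the loop pulls them out one at a time:

```perl
use strict;
use warnings;

my $file_text = "<one\ntwo> x <three>"; # hypothetical slurped contents

# List context: one call returns every capture, no explicit loop.
my @all = $file_text =~ /<(.*?)>/gs;

# Scalar context: the loop asks for one match at a time.
my @one_by_one;
while ($file_text =~ /<(.*?)>/gs) {
    push @one_by_one, $1;
}
print scalar(@all), " captures either way\n" if "@all" eq "@one_by_one";
```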


c> -----

c> 7 - Because the while() condition is never false.

c> which while is never false? outer while or inner while?

you mentioned an infinite loop. so you should know which loop that is.

c> $/ = undef;

lose that and use File::Slurp. it will clear up your code.

c> print "$source does not exist\n" unless -e $source;
c> print "you do not have read permission on $source\n" unless -r $source;

c> open my $src, '<', $source or die "could not open '$source' $!";

no need for that with File::Slurp.

c> open my $dst, '>', $dest or die "could not open '$dest' $!";

c> while($line = <$src>){

there is NO line there. you slurp in the entire file the first time you
call <$src> because you undef'ed $/.

use File::Slurp ;

my $file_text = read_file( $source ) ;

NO LOOP NEEDED AS YOU DO THAT ONE TIME ONLY. you want multiline matches
so you can't do line by line i/o.

this should be obvious to any c/c++ coder! :)

c> while($line =~ m/<(.*?)>/gs) {

that is the real (and now only) loop of the program.

c> print $dst "----$1----\n";
c> }
c> }

c> close $src;
no need for that with file::slurp.

c> close $dst;

this reduces to (untested and missing some code):

use File::Slurp ;

my $file_text = read_file( $source ) ;
print $dst map "----$_----\n", $file_text =~ m/<(.*?)>/gs ;

look ma! NO (explicit) LOOPS!!

uri
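For readers who want something they can run as is, here is one possible completion of the reduced version above. It is a sketch under a few assumptions: File::Slurp is not a core module, so read_file is swapped for the core idiom of localizing $/ inside a do block, and the input is a small made-up sample written to a File::Temp scratch directory instead of the real r.txt:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir    = tempdir(CLEANUP => 1);     # hypothetical scratch directory
my $source = "$dir/r.txt";
my $dest   = "$dir/r2.txt";

# Create a small input file so the sketch is self-contained.
open my $seed, '>', $source or die "could not open '$source': $!";
print {$seed} "x <1b>word<2/div>\n<7\nthis is\n> y\n";
close $seed or die "could not close '$source': $!";

# Core-only slurp: $/ is undefined just inside the do block.
my $file_text = do {
    open my $src, '<', $source or die "could not open '$source': $!";
    local $/;
    <$src>;
};

# /g in list context returns every capture; map formats each one.
open my $dst, '>', $dest or die "could not open '$dest': $!";
print {$dst} map "----$_----\n", $file_text =~ /<(.*?)>/gs;
close $dst or die "could not close '$dest': $!";

# Read the result back to show what was written.
my $result = do {
    open my $check, '<', $dest or die "could not open '$dest': $!";
    local $/;
    <$check>;
};
print $result;
```

Because $/ is localized, the rest of the program keeps normal line-by-line reads, which avoids the global side effect of a bare $/ = undef at the top of the file.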
 
Mumia W. (reading news)

When I was starting to learn regexes in Perl (2 days ago), I picked up
some books and some websites and read a bunch. When I thought I was
ready to go, I realized none of those sources taught me how to actually
write a Perl program from start to end that would open the file I
wanted to parse and save the parsing results to a second file. That was
a bummer.

Bla bla bla etc etc etc all those boring stuff everyone hates to read
about other people's life bla bla bla.

Okay, finally I have created a template for my regexes to parse a file,
save results to another file, and have its matches work OVER MULTIPLE
LINES.
[... program snipped ...]
Any improvements will be appreciated.

You could do that, or you could do this:

#!/usr/bin/perl
use strict;
use warnings;
use Fatal qw(open close);
die "need source and destination file names" if (@ARGV < 2);

open (my $fs, '<', $ARGV[0]);
open (my $fd, '>', $ARGV[1]);

my @list = join('',<$fs>) =~ /<(.*?)>/sg;
print $fd "------$_-------\n" for @list;

close $fd;
close $fs;
__END__
 
