Possibly useful perl script to filter lines in one file out of another.


Ben Burch

Hi!

I needed to take the email addresses that bounced out of an original
mailing list. grep -v -f was far too slow, and comm produced unexpected
results, so I just wrote something to do it in perl. Thought this
might be useful to somebody else:

#!/usr/bin/perl
#
# filter $file1 $file2
#
# Filters all lines in file1 against lines in file2, copying only lines
# from file 1 not found in file2 to STDOUT
#

# get arguments

my $file1 = shift;
my $file2 = shift;

if(!defined($file1) || !defined($file2))
{
    print "\nError, must have two arguments.\n";
    print "filter <masterfile> <excludefile>\n";
    exit 1;
}

# Copy all lines from file2 into a hash

open (EXCLUDE, $file2);

my %exclude = ();

while ($line = <EXCLUDE>)
{
    chomp($line);
    $exclude{$line} = 1;
}

close EXCLUDE;

# Now go through input line-by-line comparing to hash and only
# printing lines that do not match

open (DATA, $file1);

while ($line = <DATA>)
{
    chomp($line);
    if(!exists($exclude{$line}))
    {
        print "$line\n";
    }
}

close DATA;

exit 0;
 

Tim McDaniel

Ben Burch said:
I needed to take the email addresses that bounced out of an original
mailing list. grep -v -f was far too slow, and comm produced
unexpected results, and so I just wrote something to do it in perl.

comm requires that both input files be sorted -- presumably in byte
value order rather than by dictionary order. When comm bites me, it's
because I've forgotten that.
 

Uri Guttman

BB> I needed to take the email addresses that bounced out of an original
BB> mailing list. grep -v -f was far too slow, and comm produced unexpected
BB> results, and so I just wrote something to do it in perl. Thought this
BB> might be useful to somebody else;

i find it hard to believe that grep -v -f is slower than perl. did you
benchmark the final results?

BB> #!/usr/bin/perl
BB> #

no warnings or strict. use them.

BB> # get arguments

BB> my $file1 = shift;
BB> my $file2 = shift;

BB> if(!defined($file1) || !defined($file2))
BB> {
BB> print "\nError, must have two arguments.\n";
BB> print "filter <masterfile> <excludefile>\n";
BB> exit 1;
BB> }

much simpler and slightly more accurate is to check @ARGV if it has 2
elements:

unless( @ARGV == 2 ) {
    die 'blah' ;
}

and use better names than file1 and file2. they are files of different data

my( $data_file, $exc_file ) = @ARGV ;
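
A minimal sketch of the argument check described above, assuming the argument order from the original usage message (master file first, exclude file second). The helper name `parse_args` is made up for illustration and is not part of anyone's posted code. Testing for exactly two arguments also rejects extra arguments, which the two `defined` checks would let through.

```perl
use strict;
use warnings;

# Hypothetical helper: validate the argument list and name the two files.
sub parse_args {
    my @args = @_;

    # Exactly two arguments are required; die writes the message to STDERR
    # and the script exits with a non-zero status.
    die "usage: filter <masterfile> <excludefile>\n" unless @args == 2;

    my ( $data_file, $exc_file ) = @args;
    return ( $data_file, $exc_file );
}

# Typical call at the top of the script:
# my ( $data_file, $exc_file ) = parse_args(@ARGV);
```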

BB> # Copy all lines from file2 into a hash

BB> open (EXCLUDE, $file2);

BB> my %exclude = ();

BB> while ($line = <EXCLUDE>)
BB> {
BB> chomp($line);
BB> $exclude{$line} = 1;
BB> }

BB> close EXCLUDE;

use File::Slurp ;

my %exclude = map { chomp; $_ => 1 } read_file( $exc_file ) ;

BB> # Now go through input line-by-line comparing to hash and only
BB> # printing lines that do not match

BB> open (DATA, $file1);

don't use DATA for a file handle as it is the handle name for data in
the source file after the __END__ marker

BB> while ($line = <DATA>)
BB> {
BB> chomp($line);
BB> if(!exists($exclude{$line}))
BB> {
BB> print "$line\n";
BB> }

invert that logic for simpler code:

next if $exclude{ $line } ;
print "$line\n" ;

and if your mailing list file isn't that large (for some definition of
large) you can also slurp it and filter it too.

and since your list and exclude lines are all ending in newline, there
is no need to chomp in either case. it makes this much easier.

<untested entire main code>

my %exclude = map { $_ => 1 } read_file( $exc_file ) ;
print grep { !$exclude{ $_ } } read_file( $data_file ) ;
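
The same idea can be sketched with core Perl only, for anyone without File::Slurp installed. The helper names `filter_lines` and `read_lines` are hypothetical. One assumption worth flagging: skipping the chomp only works if every line, including the last one in each file, ends in a newline, so this sketch strips the line ending before comparing.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of @$data that do not appear in
# @$exclude. Lines are compared without their trailing newline, so a final
# line that lacks one still matches.
sub filter_lines {
    my ( $data, $exclude ) = @_;

    my %exclude;
    for my $line (@$exclude) {
        my $key = $line;
        $key =~ s/\r?\n\z//;
        $exclude{$key} = 1;
    }

    # Keep the original line (newline included) when its key is not excluded.
    return grep { my $key = $_; $key =~ s/\r?\n\z//; !$exclude{$key} } @$data;
}

# Hypothetical helper: read a whole file into a list of lines, dying on failure.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "could not open '$path': $!";
    my @lines = <$fh>;
    close $fh;
    return @lines;
}

# Typical use, with the master file first as in the original usage message:
# print filter_lines( [ read_lines($data_file) ], [ read_lines($exc_file) ] );
```

Both files are held in memory at once here, which is fine for a mailing list but is the trade-off against the line-by-line loop in the original script.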

ain't perl cool! :)

uri
 

sln

Tad said:
use warnings;
use strict;

You should always, yes *always*, check the return value from open():

open my $EXCLUDE, '<', $file2 or die "could not open '$file2' $!";

unless ( exists $exclude{$line} )

or at least make wise use of whitespace and punctuation:

if ( ! exists $exclude{$line} )

Hi Tad.

I've seen the advice here on this NG to always check the return value
of open, then die if it's not true.

Why die if open didn't die? What's the worst thing that can happen?
I think the worst thing is that a read or write doesn't happen.
It won't crash the system or mess up the file allocation tables.

It's funny, if you pass a failed-open filehandle like
open my $fh, 'non-existant-file.txt'
to a read $fh,... the read passively fails. There is no
fatal error.

But if you pass an undefined filehandle to read, it
dies.

Something to consider, since a failed open does not really
cause problems, and apparently an undefined handle is
enough to cause a die from Perl's I/O functions (well, at least read).

So, why is it always, yes always, necessary to check the return
value from open()?

-sln
 

Nathan Keel

sln said:
Hi Tad.

I've seen the advice here on this NG to always check the return value
of open, then die if it's not true.

Why die if open didn't die? What's the worst thing that can happen?
I think the worst thing is that a read or write doesn't happen.
It won't crash the system or mess up the file allocation tables.

It's funny, if you pass a failed-open filehandle like
open my $fh, 'non-existant-file.txt'
to a read $fh,... the read passively fails. There is no
fatal error.

But if you pass an undefined filehandle to read, it
dies.

Something to consider, since a failed open does not really
cause problems, and apparently an undefined handle is
enough to cause a die from Perl's I/O functions (well, at least read).

So, why is it always, yes always, necessary to check the return
value from open()?

-sln

If you want to open/read/write to a file, there's an intended reason.
It doesn't have to be a die, the point is to be aware of the problem
and have it output or log the problem, which helps troubleshoot
problems (and unintended bugs). He said to always check the return
value, he didn't say to always die. If you have a script that doesn't
need to open a file you told it to, why are you opening it?
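
What checking without dying might look like, as a rough sketch. The helper `open_or_log` and the file names are hypothetical; the point is only that the failure is recorded (here with warn, which writes to STDERR) and the caller decides what to do next.

```perl
use strict;
use warnings;

# Hypothetical helper: try to open a file for reading.
# Returns the filehandle on success. On failure it logs the reason and
# returns undef (in scalar context) so the caller can carry on.
sub open_or_log {
    my ($path) = @_;

    if ( open my $fh, '<', $path ) {
        return $fh;
    }

    warn "could not open '$path': $!\n";    # $! holds the OS error text
    return;
}

# Inside a loop over several files, the caller can skip the bad one
# instead of silently reading nothing from it:
# for my $path (@files) {
#     my $fh = open_or_log($path) or next;
#     ...
# }
```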
 

sln

Nathan Keel said:
If you want to open/read/write to a file, there's an intended reason.
It doesn't have to be a die, the point is to be aware of the problem
and have it output or log the problem, which helps troubleshoot
problems (and unintended bugs). He said to always check the return
value, he didn't say to always die. If you have a script that doesn't
need to open a file you told it to, why are you opening it?

I used to work for a company that did a lot of automation using Perl.
I was new to Perl and was hired because of my C++ background, but
ended up having to do everything in Perl.
Looking back on it, their motto was: don't die on anything, do not stop
the automation.
The entire environment was dynamically generated. There was not a die
anywhere in any line of code. The check for existence is fine, but you
can't wrap all your other code in ifs all the time. Definitely logs though,
lots of them, on the chance something didn't work.

They could have used something like this, though they didn't have it.

use strict;
use warnings;

my ($buf,$length) = ('',5);

# Invoke error #1, NON-FATAL error on read.
# File doesn't exist, however, $fh is valid
open my $fh, '<', 'notexists.txt';

# Invoke error #2, FATAL error on read
#my $fh;

open STDERR, '>errors.txt';

{
    local $!;
    my $status = eval { read ($fh, $buf, $length) };
    $@ =~ s/\s+$//;
    if ($@ || (!$status && $!)) {
        print "Error in read: ". ($@ ? $@ : $! ). "\n";
    }
}

print "More code ...\n";

exit;

__END__


-sln
 

Tim McDaniel

You silently get the wrong output and wonder what went wrong.

That's bad, but I think the worst is that you get the wrong output and
DON'T notice (and therefore don't wonder). Instead, you work with bad
or missing data.
 

sln

It's been said often enough.




No, then do "whatever is appropriate for your situation".

The admonition is to check the return value.

It is not to take any particular action if the check fails, though
die is often used, as it is most often the appropriate action.

(the purpose of most programs is to process a file's
contents, so there is no point in continuing if such a program
cannot read the file's contents.)



open() does not die, if it fails it fails silently (which is why
you should always, yes *always*, check its return value).

So I don't know what you mean.

Show me some code where an open() dies...
Yeah, show me, so why die?
You silently get the wrong output and wonder what went wrong.




No it doesn't.

Show me a program where you pass an undefined filehandle to read
and it dies...
use strict;
use warnings;

my ($buf,$length) = ('',5);

# Invoke error #2, FATAL error on read
my $fh;

{
    local $!;
    my $status = eval { read ($fh, $buf, $length) };
    $@ =~ s/\s+$//;
    if ($@ || (!$status && $!)) {
        print "Error in read: ". ($@ ? $@ : $! ). "\n";
    }
}

print "More code ...\n";

__END__
c:\temp>perl ss.pl
Error in read: Can't use an undefined value as a symbol reference at ss.pl line 13.
More code ...

c:\temp>
It does if the purpose of the program is to process that file.
Not if it's a juggernaut program that isn't allowed to die.
Aka automation.
So that it will fail noisily rather than fail silently!

Fail all you want, but please don't die...
-sln
 

Nathan Keel

sln said:
I used to work for a company that did a lot of automation using Perl.
I was new to Perl and was hired because of my C++ background, but
ended up having to do everything in Perl.
Looking back on it, their motto was: don't die on anything, do not stop
the automation.
The entire environment was dynamically generated. There was not a die
anywhere in any line of code. The check for existence is fine, but you
can't wrap all your other code in ifs all the time. Definitely logs
though, lots of them, on the chance something didn't work.

I'm not suggesting any code needs to die. I'm also not suggesting every
read is vital and can't be ignored. Just for the record.
 

sln

Tim McDaniel said:
That's bad, but I think the worst is that you get the wrong output and
DON'T notice (and therefore don't wonder). Instead, you work with bad
or missing data.

In reality, you should never need to check the return value from open().
If you can't program to that spec, you haven't been paid to program.
-sln
 

Mart van de Wege

sln said:
use strict;
use warnings;

my ($buf,$length) = ('',5);

# Invoke error #2, FATAL error on read
my $fh;

{
    local $!;
    my $status = eval { read ($fh, $buf, $length) };
    $@ =~ s/\s+$//;
    if ($@ || (!$status && $!)) {
        print "Error in read: ". ($@ ? $@ : $! ). "\n";
    }
}

print "More code ...\n";

__END__
c:\temp>perl ss.pl
Error in read: Can't use an undefined value as a symbol reference at ss.pl line 13.

Your program is not dying on an undefined filehandle. It is dying on an
undefined scalar. This is not the same.

Mart
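
The two cases being argued over can be put side by side in a short sketch. The path is hypothetical and assumed not to exist, and the variable names are made up for illustration. In the first case open() fails but still leaves a glob reference in the variable; in the second the variable is a plain undefined scalar that open() never touched.

```perl
use strict;
use warnings;

my $buf = '';

# Case 1: open() fails, yet $fh now holds a glob reference.
my $opened = open my $fh, '<', '/no/such/dir/no-such-file.txt';

my $status = do {
    no warnings 'io';    # silence the warning about reading an unopened handle
    read $fh, $buf, 5;
};
# $status is undef here: the read fails, but nothing dies.

# Case 2: a scalar that was never passed to open() at all.
my $never_opened;
my $ok  = eval { read $never_opened, $buf, 5; 1 };
my $err = $@;
# $ok is undef and $err holds an error like the one in the output quoted above.

print "case 1: read returned ", ( defined $status ? $status : 'undef' ), "\n";
print "case 2: ", ( $ok ? 'no error' : "died: $err" );
```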
 

Nathan Keel

sln said:
In reality, you should never need to check the return value from
open(). If you can't program to that spec, you haven't been paid to
program.
-sln

To say one should never need to check the return value (rather than
arguing about dying) just shows how unqualified you are as an
alleged programmer. No one is suggesting to not check other things,
but checking that a file opened is sometimes useful or needed. If you
don't recognize or admit that, then you are a complete failure at
simple programming logic.
 

sln

Mart van de Wege said:
Your program is not dying on an undefined filehandle. It is dying on an
undefined scalar. This is not the same.

Mart

You're wrong, it's dying at runtime, in the function call 'read()'
-sln
 

sln

Nathan Keel said:
To say one should never need to check the return value (rather than
arguing about dying) just shows how unqualified you are as an
alleged programmer. No one is suggesting to not check other things,
but checking that a file opened is sometimes useful or needed. If you
don't recognize or admit that, then you are a complete failure at
simple programming logic.

It's too simple in that logic.
-sln
 

sln

sln said:
You're wrong, it's dying at runtime, in the function call 'read()'
-sln

Furthermore, filehandles from a failed open are valid and don't emit this
error when used in a read.
Go figure... Just when you think you know it all.
-sln
 

Mart van de Wege

sln said:
You're wrong, it's dying at runtime, in the function call 'read()'
-sln

I'm not.

It's dying on the read, yes, but *not* on an undefined filehandle.

If you can't distinguish your referents from your references, you have
no business programming.

Mart
 

sln

Mart van de Wege said:
I'm not.

It's dying on the read, yes, but *not* on an undefined filehandle.

If you can't distinguish your referents from your references, you have
no business programming.

Mart

I guess a filehandle has to be defined or it's not a filehandle,
which is really a reference to GLOB which is undefined.
So the point is that thing passed to the slot called the FILEHANDLE in
the parameter list for read is undefined.

Good call, thanks for schooling me.
Anything else to point out before the real issue, IT DIES man??
Any business on that issue? Re-read the text, stop slurping crap
out of your navel!

-sln
 

Mart van de Wege

sln said:
I guess a filehandle has to be defined or it's not a filehandle,
which is really a reference to GLOB which is undefined.
So the point is that thing passed to the slot called the FILEHANDLE in
the parameter list for read is undefined.

Irrelevant. You're passing a scalar to read(), not a filehandle.

sln said:
Good call, thanks for schooling me.

So just admit you're wrong and shut up.

sln said:
Anything else to point out before the real issue, IT DIES man??

Sod off you whiner. Tad said that Perl doesn't die on an undefined
filehandle. Your example doesn't refute that. The real issue is that you
are *wrong*, and trying to avoid admitting that.

sln said:
Any business on that issue? Re-read the text, stop slurping crap
out of your navel!

Learn to program before you open your mouth about the subject.

Mart
 

Jürgen Exner

Nathan Keel said:
To say one should never need to check the return value (rather than
arguing about dying) just shows how unqualified you are as an
alleged programmer. No one is suggesting to not check other things,
but checking that a file opened is sometimes useful or needed. If you
don't recognize or admit that, then you are a complete failure at
simple programming logic.

Took you a surprisingly long time to recognize that. Everyone else has
filtered Mr. Sln aka Robic0 a loooooong time ago.

jue
 

Nathan Keel

Jürgen Exner said:
Took you a surprisingly long time to recognize that. Everyone else has
filtered Mr. Sln aka Robic0 a loooooong time ago.

jue

No, I've always realized it, but I don't filter people out. I'm able to
just ignore them when I want. I usually ignore him anymore.
 
