Filter content from a list: hard-coded expression or read from a file?


Francois Massion

Newbie question:
I have a list of strings like the following list:

Log file content
a long date
the mandatory check
Mark text to replace

I want to keep only the strings which do not begin with certain words.
So far I have done it with a hard-coded list of words, but this list
may vary and can be very long. I wonder how I could read the list from
a file and achieve the same result.
Here is the code that works:

open(INPUT, 'mytext.txt') || die("File cannot be opened!\n");
@sentence = <INPUT>;
close(INPUT);
foreach $sentence (@sentence) {
    chomp $sentence;
    if ($sentence !~ m/^a |^the |^therefore /i) { # Actually a very long list
        push(@filteredresult, $sentence);
    }
}
 
Dr.Ruud

Newbie question:

See also the beginners list @perl.org.

[...]
open(INPUT,'mytext.txt') || die("File cannot be opened!\n");

my $infile = 'mytext.txt';

open my $input, '<', $infile
    or die "Error opening '$infile': $!\n";

@sentence =<INPUT>;

No need to slurp the file in, when you will process it by line.

my @words = qw/ a the therefore /;

my $re = join '|', @words;

while ( <$input> ) {
next if /^(?:$re)\x{20}/;
...;
}
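
To answer the original question about reading the list from a file, the approach above can be sketched end to end as follows. This is only a sketch under assumptions: the two in-memory strings stand in for a word-list file and a text file (one word per line), and the variable names are hypothetical. The words are chomped and passed through quotemeta before being joined, so a trailing newline or a regex metacharacter in the list does not end up in the pattern.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the two files; in real use, open the
# word-list file and the text file by name instead.
my $word_list = "a\nthe\ntherefore\n";
my $text      = "Log file content\na long date\nthe mandatory check\nMark text to replace\n";

open my $words_fh, '<', \$word_list or die "Cannot open word list: $!\n";
chomp( my @words = <$words_fh> );    # strip newlines before building the regex
close $words_fh;

# Escape each word so metacharacters in the list are matched literally;
# empty lines in the list are ignored.
my $re = join '|', map { quotemeta($_) } grep { length $_ } @words;

open my $input, '<', \$text or die "Cannot open text: $!\n";
my @filtered;
while ( my $line = <$input> ) {
    chomp $line;
    next if $line =~ /^(?:$re)\x{20}/i;    # skip lines starting with a listed word
    push @filtered, $line;
}
close $input;

print "$_\n" for @filtered;
```

The /i flag is kept from the original hard-coded pattern so that "The ..." is filtered as well as "the ...".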
 
Rainer Weikusat

Francois Massion said:
I have a list of strings like the following list:

Log file content
a long date
the mandatory check
Mark text to replace

I want to keep only the strings which do not begin with certain words.
So far I have done it with a hard coded list of words but this list
may vary and can be very long. I wonder how I could read the list from
a file and achieve the same result.
Here the code which works:

open(INPUT, 'mytext.txt') || die("File cannot be opened!\n");
@sentence = <INPUT>;
close(INPUT);
foreach $sentence (@sentence) {
    chomp $sentence;
    if ($sentence !~ m/^a |^the |^therefore /i) { # Actually a very long list
        push(@filteredresult, $sentence);
    }
}

My suggestion would be to put the exclusion list into a hash (this is
uncompiled example code), i.e.,

open($fh, '<', '/path/to/list');
%excls = map { chomp; $_, 1; } <$fh>;

and then check it as follows:

next if $sentence =~ /^(\W*)/ && $excls{lc($1)};

(push coming after this line) or

push(@result, $sentence) unless $sentence =~ /^(\W*)/ && $excls{lc($1)}
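
A self-contained sketch of the hash idea might look like the following. It rests on a few assumptions: the in-memory string stands in for the exclusion file, the sentences are given inline, and the first word is captured with `\w+` (the `\W*` shown above would capture leading non-word characters rather than the first word, so the lookup would never hit). Keys are lower-cased on both sides so the check is case-insensitive.

```perl
use strict;
use warnings;

# Hypothetical contents of the exclusion file, one word per line.
my $list = "a\nthe\ntherefore\n";
open my $fh, '<', \$list or die "Cannot open exclusion list: $!\n";
my %excls = map { chomp; ( lc($_) => 1 ) } <$fh>;
close $fh;

my @sentences = (
    'Log file content',
    'a long date',
    'the mandatory check',
    'Mark text to replace',
);

my @result;
for my $sentence (@sentences) {
    # Capture the first word; treating \w+ as "a word" is an assumption.
    next if $sentence =~ /^(\w+)/ && $excls{ lc $1 };
    push @result, $sentence;
}

print "$_\n" for @result;
```

With a hash, each sentence costs one lookup however long the exclusion list grows, whereas an alternation regex has to be rebuilt whenever the list changes.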
 
Francois Massion

My suggestion would be to put the exclusion list into a hash (this is
uncompiled example code), i.e.,

open($fh, '<', '/path/to/list');
%excls = map { chomp; $_, 1; } <$fh>;

and then check it as follows:

next if $sentence =~ /^(\W*)/ && $excls{lc($1)};

(push coming after this line) or

push(@result, $sentence) unless $sentence =~ /^(\W*)/ && $excls{lc($1)}

I have tested 2 versions, unsuccessfully:

Version # 1 (based on Rainer's suggestion):
#!/usr/bin/perl -w

my $infile = 'a.txt';
open my $input, '<', $infile;
open($fh, '<', 'b.txt');
%excls = map { chomp; $_, 1; } <$fh>;
next if $input =~ /^(\W*)/ && $excls{lc($1)};
push(@result, $input) unless $input =~ /^(\W*)/ && $excls{lc($1)} ;
foreach (@result) {
print "$_\n";
}

RESULT: GLOB(0x36f178)
(No idea what this means)

Version # 2 (based on Dr Ruud and Ben's suggestion; sorry if I messed
it up):

#!/usr/bin/perl -w

my $infile = 'a.txt';

open my $input, '<', $infile;
open my $WORDS, '<', 'b.txt';
my @words = <$WORDS>;
my $re = join "|", map quotemeta, @words;
while ( <$input> ) {
next if /^(?:$re)\x{20}/;
push (@filteredresult,$input);

foreach (@filteredresult) {
print "$_\n";
}}

RESULT:
GLOB(0x1ff178)
GLOB(0x1ff178)
GLOB(0x1ff178)
....
 
Rainer Weikusat

Francois Massion said:
I have tested 2 versions, unsuccessfully:

Version # 1 (based on Rainer's suggestion):
#!/usr/bin/perl -w

my $infile = 'a.txt';
open my $input, '<', $infile;
open($fh, '<', 'b.txt');
%excls = map { chomp; $_, 1; } <$fh>;
next if $input =~ /^(\W*)/ && $excls{lc($1)};
push(@result, $input) unless $input =~ /^(\W*)/ && $excls{lc($1)} ;
foreach (@result) {
print "$_\n";
}

RESULT: GLOB(0x36f178)
(No idea what this means)

The reason why I wrote 'you can do this OR that' was that these were
supposed to be mutually exclusive options. Also, you obviously need
some kind of input processing loop and test the condition against the
sentences, NOT against the result of stringifying the input file handle
(which is 'some glob').
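
The difference between the handle and the lines read from it can be seen in a small sketch. The in-memory string is a hypothetical stand-in for the input file. Using the handle itself as a string gives a GLOB(0x...) value, while the readline operator in a loop gives the actual sentences.

```perl
use strict;
use warnings;

# Hypothetical file contents, opened as an in-memory file.
my $text = "a long date\nMark text to replace\n";
open my $input, '<', \$text or die "Cannot open: $!\n";

# Using the handle itself as a string yields something like GLOB(0x36f178).
my $as_string = "$input";
print "handle as string: $as_string\n";

# Reading from the handle yields the lines of the file.
my @lines;
while ( my $line = <$input> ) {
    chomp $line;
    push @lines, $line;
}
close $input;

print "line: $_\n" for @lines;
```

Any test or push has to go inside such a loop and operate on the line variable, not on $input.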
 
ccc31807

Newbie question:
I have a list of strings like the following list:

Log file content
a long date
the mandatory check
Mark text to replace

I want to keep only the strings which do not begin with certain words.

It would have been more helpful (for me, anyway) if you had posted
your actual data, but that's okay.

I have found that these kinds of tasks often decompose into a
particular pattern, illustrated below. The pattern has three phases:
(1) read the file contents into a data structure, (2) munge the data,
and (3) write the data to a file. The following (hypothetical) script
illustrates this:

#! perl
use strict;
use warnings;

my %data;
read_file_contents();
munge_data();
write_data_to_file();
exit(0);

sub read_file_contents
{
open FILE, '<', 'data_file.csv' or die "$!";
while (<FILE>)
{
next unless /\w/; #skip empty lines
next if /your REGEX to skip/; #skip unneeded lines
chomp;
my ($val1, $val2, $val3, ...) = split(/?/, $_); #replace ? with your delimiter
$data{$val1} = {
KEY2 => $val2,
KEY3 => $val3,
KEY4 => $val4,
...,
};
}
close FILE;
}
sub munge_data
{
#you now have your data in a convenient structure
#so you can manipulate it how you please
foreach my $key (keys %data) { munge_record($data{$key}); }
}
sub write_data_to_file
{
open OUT, '>', 'output.csv' or die "$!";
print OUT qq("COL1","COL2","COL3", ...);
foreach my $key (keys %data)
{
print OUT qq("$key","$data{$key}{KEY2}"," ...);
}
close OUT;
}
sub munge_record
{
my $record = shift;
# munge here
}
 
Ted Zlatanov

FM> I have tested 2 versions, unsuccessfully:

Hi Francois,

If you're OK with using different tools, maybe try the GNU egrep tool.

Given files a and b:

% grep . a b
a:1
a:2
a:3
a:4
a:5
b:^[12]
b:^[4]

You can just use the -f option to read patterns from b to filter a:

% egrep -f b a
1
2
4

This approach may work better for you, depending on the OS platforms you
have to support, the size of the file, and the complexity of the regular
expressions. Try it out.

Ted
 
