Filter content from a list: hard-coded expression or read from a file?

Discussion in 'Perl Misc' started by Francois Massion, Mar 26, 2012.

  1. Newbie question:
    I have a list of strings like the following list:

    Log file content
    a long date
    the mandatory check
    Mark text to replace

    I want to keep only the strings which do not begin with certain words.
    So far I have done it with a hard coded list of words but this list
    may vary and can be very long. I wonder how I could read the list from
    a file and achieve the same result.
    Here is the code that works:

    open(INPUT,'mytext.txt') || die("File cannot be opened!\n");
    @sentence = <INPUT>;
    close(INPUT);
    foreach $sentence (@sentence) {
    chomp $sentence;
    # Actually a very long list
    if ($sentence !~ m/^a |^the |^therefore /i) {
    push (@filteredresult,$sentence);
    }
    }
     
    Francois Massion, Mar 26, 2012
    #1

  2. Dr.Ruud Guest

    Re: Filter content from a list: hard-coded expression or read from a file?

    On 2012-03-26 08:00, Francois Massion wrote:

    > Newbee question:


    See also the beginners list @perl.org.


    > [...]
    > open(INPUT,'mytext.txt') || die("File cannot be opened!\n");


    my $infile = 'mytext.txt';

    open my $input, '<', $infile
    or die "Error opening '$infile': $!\n";


    > @sentence =<INPUT>;


    No need to slurp the whole file in when you process it line by line.

    my @words = qw/ a the therefore /;

    my $re = join '|', @words;

    while ( <$input> ) {
    next if /^(?:$re)\x{20}/;
    ...;
    }
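
    A minimal sketch of how the word list could come from a file instead
    of qw//, assuming one word per line. The helper names (read_words,
    build_filter, filter_lines) and the file name words.txt are
    hypothetical, not from the thread. Two details matter here: chomp the
    words so the trailing newline does not end up inside the pattern, and
    quotemeta them so regex metacharacters in a word are matched literally.

```perl
use strict;
use warnings;

# Read the exclusion words, one per line, from a file.
# (Hypothetical helper; the file layout is an assumption.)
sub read_words {
    my ($file) = @_;
    open my $fh, '<', $file or die "Error opening '$file': $!\n";
    chomp( my @words = <$fh> );   # strip the newline from every word
    close $fh;
    return @words;
}

# Build one anchored, case-insensitive regex from a list of words.
sub build_filter {
    my @words = grep { length } @_;      # ignore empty entries
    return qr/(?!)/ unless @words;       # no words: a regex that never matches
    my $alt = join '|', map { quotemeta($_) } @words;
    return qr/^(?:$alt)\x{20}/i;         # word at line start, followed by a space
}

# Keep only the lines that do not start with an excluded word.
sub filter_lines {
    my ($re, @lines) = @_;
    return grep { $_ !~ $re } @lines;
}
```

    Typical use would be something like
    my $re = build_filter(read_words('words.txt'));
    followed by a loop over filter_lines($re, @sentences).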

    --
    Ruud
     
    Dr.Ruud, Mar 26, 2012
    #2

  3. Francois Massion <> writes:
    > I have a list of strings like the following list:
    >
    > Log file content
    > a long date
    > the mandatory check
    > Mark text to replace
    >
    > I want to keep only the strings which do not begin with certain words.
    > So far I have done it with a hard coded list of words but this list
    > may vary and can be very long. I wonder how I could read the list from
    > a file and achieve the same result.
    > Here the code which works:
    >
    > open(INPUT,'mytext.txt') || die("File cannot be opened!\n");
    > @sentence = <INPUT>;
    > close(INPUT);
    > foreach $sentence (@sentence) {
    > chomp $sentence;
    > if ($sentence !~ m/^a |^the |^therefore /i) { # Actually a very long
    > list
    > push (@filteredresult,$sentence);
    > }


    My suggestion would be to put the exclusion list into a hash (this is
    uncompiled example code), i.e.,

    open($fh, '<', '/path/to/list') or die "Cannot open list: $!";
    %excls = map { chomp; $_, 1; } <$fh>;

    and then check it as follows:

    next if $sentence =~ /^(\w+)/ && $excls{lc($1)};

    (push coming after this line) or

    push(@result, $sentence) unless $sentence =~ /^(\w+)/ && $excls{lc($1)};
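
    A small self-contained sketch of the hash lookup, under the assumption
    that the exclusion words are plain words and that matching should be
    case-insensitive. The names make_exclusions and starts_with_excluded
    are hypothetical. Note that the capture uses \w+ (word characters); a
    capital \W would capture non-word characters and the lookup would
    never find anything.

```perl
use strict;
use warnings;

# Turn a list of words into a lookup hash with lower-cased keys.
sub make_exclusions {
    my %excls = map { ( lc($_) => 1 ) } grep { length } @_;
    return \%excls;
}

# True if the first word of the sentence is in the exclusion hash.
sub starts_with_excluded {
    my ($excls, $sentence) = @_;
    return ( $sentence =~ /^(\w+)/ && $excls->{ lc($1) } ) ? 1 : 0;
}
```

    Unlike a long regex alternation, the cost of each check does not grow
    with the number of words, which may matter if the list is very long.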
     
    Rainer Weikusat, Mar 26, 2012
    #3
  4. On 26 Mar., 15:59, Rainer Weikusat <> wrote:
    > My suggestion would be to put the exclusion list into a hash (this is
    > uncompiled example code), ie,
    >
    > open($fh, '<', '/path/to/list');
    > %excls = map { chomp; $_, 1; } <$fh>;
    >
    > and then check it as follows:
    >
    > next if $sentence =~ /^(\W*)/ && $excls{lc($1));
    >
    > (push coming after this line) or
    >
    > push(@result, $sentence) unless $sentence =~ /^(\W*)/ && $excls{lc($1)}
    >
    >


    I have tested 2 versions, unsuccessfully:

    Version # 1 (based on Rainer's suggestion):
    #!/usr/bin/perl -w

    my $infile = 'a.txt';
    open my $input, '<', $infile;
    open($fh, '<', 'b.txt');
    %excls = map { chomp; $_, 1; } <$fh>;
    next if $input =~ /^(\W*)/ && $excls{lc($1)};
    push(@result, $input) unless $input =~ /^(\W*)/ && $excls{lc($1)} ;
    foreach (@result) {
    print "$_\n";
    }

    RESULT: GLOB(0x36f178)
    (No idea what this means)

    Version # 2 (based on Dr Ruud and Ben's suggestion; sorry if I messed
    it up):

    #!/usr/bin/perl -w

    my $infile = 'a.txt';

    open my $input, '<', $infile;
    open my $WORDS, '<', 'b.txt';
    my @words = <$WORDS>;
    my $re = join "|", map quotemeta, @words;
    while ( <$input> ) {
    next if /^(?:$re)\x{20}/;
    push (@filteredresult,$input);

    foreach (@filteredresult) {
    print "$_\n";
    }}

    RESULT:
    GLOB(0x1ff178)
    GLOB(0x1ff178)
    GLOB(0x1ff178)
    ....
     
    Francois Massion, Mar 26, 2012
    #4
  5. Francois Massion <> writes:
    >
    > I have tested 2 versions, unsuccessfully:
    >
    > Version # 1 (based on Rainer's suggestion):
    > #!/usr/bin/perl -w
    >
    > my $infile = 'a.txt';
    > open my $input, '<', $infile;
    > open($fh, '<', 'b.txt');
    > %excls = map { chomp; $_, 1; } <$fh>;
    > next if $input =~ /^(\W*)/ && $excls{lc($1)};
    > push(@result, $input) unless $input =~ /^(\W*)/ && $excls{lc($1)} ;
    > foreach (@result) {
    > print "$_\n";
    > }
    >
    > RESULT: GLOB(0x36f178)
    > (No idea what this means)


    The reason I wrote 'you can do this OR that' was that these were
    supposed to be mutually exclusive options. Also, you obviously need
    some kind of input processing loop, and you need to test the condition
    against the sentences, NOT against the result of stringifying the input
    file handle (which is 'some glob').
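
    To illustrate the point about the processing loop, here is a hedged
    sketch, not the original poster's code: filter_handle is a hypothetical
    helper that takes two already opened file handles (sentences and
    exclusion words, one per line) and tests each line read from the
    handle rather than the handle itself. It uses only one of the two
    alternatives (next if ...), followed by the push.

```perl
use strict;
use warnings;

# Read exclusion words from $words_fh and sentences from $input;
# return the sentences whose first word is not excluded.
sub filter_handle {
    my ($input, $words_fh) = @_;

    # One word per line; chomp so the newline is not part of the key.
    my %excls;
    while ( my $word = <$words_fh> ) {
        chomp $word;
        $excls{ lc($word) } = 1 if length $word;
    }

    my @result;
    while ( my $sentence = <$input> ) {    # a line from the handle, not the handle
        chomp $sentence;
        next if $sentence =~ /^(\w+)/ && $excls{ lc($1) };
        push @result, $sentence;
    }
    return @result;
}
```

    Printing "$input" directly shows something like GLOB(0x...) because a
    lexical file handle is a reference to a glob; the lines only appear
    once the handle is read with <$input>.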
     
    Rainer Weikusat, Mar 26, 2012
    #5
  6. ccc31807 Guest

    On Mar 26, 2:00 am, Francois Massion <> wrote:
    > Newbee question:
    > I have a list of strings like the following list:
    >
    > Log file content
    > a long date
    > the mandatory check
    > Mark text to replace
    >
    > I want to keep only the strings which do not begin with certain words.


    It would have been more helpful (for me, anyway) if you had posted
    your actual data, but that's okay.

    I have found that these kinds of tasks often decompose into a
    particular pattern, illustrated below. The pattern has three phases:
    (1) read the file contents into a data structure, (2) munge the data,
    and (3) write the data to a file. The following (hypothetical) script
    illustrates this:

    #! perl
    use strict;
    use warnings;

    my %data;
    read_file_contents();
    munge_data();
    write_data_to_file();
    exit(0);

    sub read_file_contents
    {
    open FILE, '<', 'data_file.csv' or die "$!";
    while (<FILE>)
    {
    next unless /\w/; #skip empty lines
    next if /your REGEX to skip/; #skip unneeded lines
    chomp;
    my ($val1, $val2, $val3, ...) = split(/?/, $_); #replace ? with your delimiter
    $data{$val1} = {
    KEY2 => $val2,
    KEY3 => $val3,
    KEY4 => $val4,
    ...,
    };
    }
    close FILE;
    }
    sub munge_data
    {
    #you now have your data in a convenient structure
    #so you can manipulate it how you please
    foreach my $key (keys %data) { munge_record($data{$key}); }
    }
    sub write_data_to_file
    {
    open OUT, '>', 'output.csv' or die "$!";
    print OUT qq("COL1","COL2","COL3", ...);
    foreach my $key (keys %data)
    {
    print OUT qq("$key","$data{$key}{KEY2}"," ...);
    }
    close OUT;
    }
    sub munge_record
    {
    my $record = shift;
    # munge here
    }
     
    ccc31807, Mar 26, 2012
    #6
  7. Ted Zlatanov Guest

    On Mon, 26 Mar 2012 07:41:37 -0700 (PDT) Francois Massion <> wrote:

    FM> I have tested 2 versions, unsuccessfully:

    Hi Francois,

    if you're OK with using different tools, maybe try the GNU egrep tool.

    Given files a and b:

    % grep . a b
    a:1
    a:2
    a:3
    a:4
    a:5
    b:^[12]
    b:^[4]

    You can just use the -f option to read patterns from b to filter a
    (adding -v would instead print the lines that match none of the
    patterns):

    % egrep -f b a
    1
    2
    4

    This approach may work better for you, depending on the OS platforms you
    have to support, the size of the file, and the complexity of the regular
    expressions. Try it out.
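
    For comparison, a rough Perl sketch of the same idea, assuming the
    patterns are valid Perl regular expressions (egrep's syntax overlaps
    with Perl's but is not identical). grep_f is a hypothetical helper,
    not an existing function; a true third argument mimics the -v option.

```perl
use strict;
use warnings;

# Select the lines matched by at least one pattern (or, with $invert,
# the lines matched by none of them).
sub grep_f {
    my ($patterns, $lines, $invert) = @_;
    my @res = map { qr/$_/ } @$patterns;          # compile each pattern once
    my @out;
    for my $line (@$lines) {
        my $hit = grep { $line =~ $_ } @res;      # number of matching patterns
        push @out, $line if $invert ? !$hit : $hit;
    }
    return @out;
}
```

    Here the entries of the pattern list are used as regexes, not as
    literal words, so they are deliberately not passed through quotemeta.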

    Ted
     
    Ted Zlatanov, Mar 27, 2012
    #7