Fastest way to remove common substrings from a list of strings

Discussion in 'Perl Misc' started by freesoft12@gmail.com, Jul 15, 2008.

  1. Guest

    Hi,

    Given a list containing various UNIX directory paths:

    @list = qw(
    /home/bart
    /home/bart/foo
    /usr/lib/
    /home/bart/foo/xyz
    /home/bart/foo/xyz/live.cpp )

    I want to trim this list to contain:

    @list = qw(
    /usr/lib/
    /home/bart/foo/xyz/live.cpp )

    Is there a better way than this possibly n^2 algorithm below.

    my @to_be_removed_indices = ();
    for (my $index=0; $index < #list; ++$index) {
    my $current_item = $list[$index];
    ++$index;
    for (my $j = $index; $j < $#list; ++$j) {
    if ($current_item =~ m/$list[$j]/) { # is the current item a
    substring of another
    push (@to_be_removed_indices, $index); # schedule its removal
    last; # break
    }
    }
    # Remove matched items

    Regards
    John
     
    , Jul 15, 2008
    #1
    1. Advertising

  2. Mirco Wahab Guest

    wrote:
    > Hi,
    >
    > Given a list containing various UNIX directory paths:
    >
    > @list = qw(
    > /home/bart
    > /home/bart/foo
    > /usr/lib/
    > /home/bart/foo/xyz
    > /home/bart/foo/xyz/live.cpp )
    >
    > I want to trim this list to contain:
    >
    > @list = qw(
    > /usr/lib/
    > /home/bart/foo/xyz/live.cpp )
    >
    > Is there a better way than this possibly n^2 algorithm below.


    Still pseudo O(2), but much shorter:

    ...
    foreach my $k (@list) {
    $k = undef if grep defined && /^$k./, @list;
    }

    print join "\n", grep defined, @list;
    ...

    Regards

    M.
     
    Mirco Wahab, Jul 15, 2008
    #2
    1. Advertising

  3. Dave B Guest

    wrote:
    > Hi,
    >
    > Given a list containing various UNIX directory paths:
    >
    > @list = qw(
    > /home/bart
    > /home/bart/foo
    > /usr/lib/
    > /home/bart/foo/xyz
    > /home/bart/foo/xyz/live.cpp )
    >
    > I want to trim this list to contain:
    >
    > @list = qw(
    > /usr/lib/
    > /home/bart/foo/xyz/live.cpp )
    >
    > Is there a better way than this possibly n^2 algorithm below.
    >
    > my @to_be_removed_indices = ();
    > for (my $index=0; $index < #list; ++$index) {
    > my $current_item = $list[$index];
    > ++$index;
    > for (my $j = $index; $j < $#list; ++$j) {
    > if ($current_item =~ m/$list[$j]/) { # is the current item a
    > substring of another
    > push (@to_be_removed_indices, $index); # schedule its removal
    > last; # break
    > }
    > }
    > # Remove matched items


    If your condition is "remove a string if it's a substring of another,
    matching *at the beginning*", then the following code seems to work:

    my @list = sort qw(
    /home/bart
    /home/bart/foo
    /usr/lib/
    /home/bart/foo/xyz
    /home/bart/foo/xyz/live.cpp );

    my $s=$list[0];
    my @newlist;

    for (@list, substr($list[-1],1)) {
    push(@newlist,$s) unless /$s/;
    $s=$_;
    }

    print "@newlist\n";

    # now @newlist has the values that are not
    # substring of anything else

    I hope I understood the problem correctly.

    --
    D.
     
    Dave B, Jul 15, 2008
    #3
  4. Dave B Guest

    Dave B wrote:

    > push(@newlist,$s) unless /$s/;


    This is probably better and safer:

    push(@newlist,$s) unless /^\Q$s\E/;

    --
    D.
     
    Dave B, Jul 15, 2008
    #4
  5. Ted Zlatanov Guest

    On Tue, 15 Jul 2008 06:02:38 -0700 (PDT) "" <> wrote:

    fc> Given a list containing various UNIX directory paths:

    fc> @list = qw(
    fc> /home/bart
    fc> /home/bart/foo
    fc> /usr/lib/
    fc> /home/bart/foo/xyz
    fc> /home/bart/foo/xyz/live.cpp )

    fc> I want to trim this list to contain:

    fc> @list = qw(
    fc> /usr/lib/
    fc> /home/bart/foo/xyz/live.cpp )

    fc> Is there a better way than this possibly n^2 algorithm below.

    fc> my @to_be_removed_indices = ();
    fc> for (my $index=0; $index < #list; ++$index) {
    fc> my $current_item = $list[$index];
    fc> ++$index;
    fc> for (my $j = $index; $j < $#list; ++$j) {
    fc> if ($current_item =~ m/$list[$j]/) { # is the current item a
    fc> substring of another
    fc> push (@to_be_removed_indices, $index); # schedule its removal
    fc> last; # break
    fc> }
    fc> }
    fc> # Remove matched items

    If @to_be_removed_indices has less than 5 elements, don't worry about
    it, your complexity is too low to matter. Even 10 elements won't make a
    big difference. Complexity measurements matter when you get into
    thousands and millions of items ($disclaimers_go_here).

    Otherwise, you can do many optimizations. First one I see is that you
    should use index() instead of a match if you want to check for
    substrings. Another is to find the ASCII characters not in
    @to_be_removed_indices, let's say they are qw/q w p/, and then exclude
    any string that doesn't have those characters. You can also do a trie
    of your search strings, or build an regex alternation of them and hope
    the regex engine can optimize it for you. The general idea is to
    pre-process as much as possible, and let Perl use index() instead of
    regular expressions when possible.

    Ted
     
    Ted Zlatanov, Jul 15, 2008
    #5
  6. Mirco Wahab Guest

    Dave B wrote:
    > Dave B wrote:
    >
    >> push(@newlist,$s) unless /$s/;

    >
    > This is probably better and safer:
    >
    > push(@newlist,$s) unless /^\Q$s\E/;


    You won't need all that if you start from an already *sorted* list,
    a single map will do:

    my @result = map { $list[$_]=~/^\Q$list[$_-1]\E./ ? () : $list[$_] } 1..$#list;

    Regards

    M.
     
    Mirco Wahab, Jul 15, 2008
    #6
  7. Dave B Guest

    Mirco Wahab wrote:

    >>> push(@newlist,$s) unless /$s/;

    >> This is probably better and safer:
    >>
    >> push(@newlist,$s) unless /^\Q$s\E/;

    >
    > You won't need all that if you start from an already *sorted* list,
    > a single map will do:
    >
    > my @result = map { $list[$_]=~/^\Q$list[$_-1]\E./ ? () : $list[$_] } 1..$#list;


    I agree that using map is more Perl-style, but the above statement alone
    doesn't work. You need at least this:

    my @result=map{$list[$_]=~/^\Q$list[$_-1]\E./ ?():$list[$_-1] } 1..$#list;
    push @result, $list[-1];

    --
    D.
     
    Dave B, Jul 15, 2008
    #7
  8. Guest

    "" <> wrote:
    > Hi,
    >
    > Given a list containing various UNIX directory paths:
    >
    > @list = qw(
    > /home/bart
    > /home/bart/foo
    > /usr/lib/
    > /home/bart/foo/xyz
    > /home/bart/foo/xyz/live.cpp )
    >
    > I want to trim this list to contain:
    >
    > @list = qw(
    > /usr/lib/
    > /home/bart/foo/xyz/live.cpp )



    That isn't a complete enough specification. What if there is a
    /foo/home/bart? How about a /home/bartzilla?


    >
    > Is there a better way than this possibly n^2 algorithm below.


    Almost anything would be better than the below. It doesn't compile.
    And once #list is turned to $#list, it does compile but leaves
    @to_be_removed_indices empty, so it doesn't what you say you want done.
    If you are happy with a wrong answer, then there are much better ways than
    n^2 to achieve that.

    >
    > my @to_be_removed_indices = ();
    > for (my $index=0; $index < #list; ++$index) {
    > my $current_item = $list[$index];
    > ++$index;


    You are incrementing ++$index twice, so only even numbered indexes are
    being "processed".

    etc.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Jul 15, 2008
    #8
  9. Mirco Wahab Guest

    Dave B wrote:
    > Mirco Wahab wrote:
    >> my @result = map { $list[$_]=~/^\Q$list[$_-1]\E./ ? () : $list[$_] } 1..$#list;

    >
    > I agree that using map is more Perl-style, but the above statement alone
    > doesn't work. You need at least this:
    >
    > my @result=map{$list[$_]=~/^\Q$list[$_-1]\E./ ?():$list[$_-1] } 1..$#list;
    > push @result, $list[-1];
    >


    Yes, you are right. I saw the mistake and corrected it (tried to):

    my @result =
    (
    (map { $list[$_+1] =~ /^\Q$list[$_]\E./ ? () : $list[$_] } 0..$#list-1),
    $list[-1]
    );

    but you have been faster. The last element was missing.

    Thanks & Regards

    M.
     
    Mirco Wahab, Jul 15, 2008
    #9
  10. Dave B Guest

    Leon Timmermans wrote:

    >> push(@newlist,$s) unless /$s/;

    >
    > You should really quotemeta $s or else weird things can happen if it
    > contains regexp meta characters. Better yet I would write that line as:
    >
    > push(@newlist, $s) if substr($_, 0, length $s) ne $s.
    >
    > Since that doesn't waste time looking for a substring if it isn't at the
    > start.


    I fixed both issues in a subsequent post.

    --
    D.
     
    Dave B, Jul 15, 2008
    #10
  11. On Jul 15, 8:02 am, Dave B <> wrote:
    > Dave B wrote:
    > > push(@newlist,$s) unless /$s/;

    >
    > This is probably better and safer:
    >
    > push(@newlist,$s) unless /^\Q$s\E/;
    >



    here's a shorter (and tad more obfuscated)
    variant:

    my @list = sort qw( ... );
    my ($s, @newlist);
    for (@list, undef) {
    push(@newlist,$s) unless /\Q$s\E/;
    $s=$_;
    }

    --
    Charles DeRykus
     
    comp.llang.perl.moderated, Jul 15, 2008
    #11
  12. szr Guest

    Leon Timmermans wrote:
    > On Tue, 15 Jul 2008 16:47:17 +0200, Dave B wrote:
    >
    >> wrote:
    >>
    >> If your condition is "remove a string if it's a substring of another,
    >> matching *at the beginning*", then the following code seems to work:
    >>

    >
    >> push(@newlist,$s) unless /$s/;

    >
    > You should really quotemeta $s or else weird things can happen if it
    > contains regexp meta characters. Better yet I would write that line
    > as:
    >
    > push(@newlist, $s) if substr($_, 0, length $s) ne $s.
    >
    > Since that doesn't waste time looking for a substring if it isn't at
    > the start.


    Or he just change that regex from /$s/ to /^$s/ to also achor it at the
    beginning. I think the substr might be ever so slightly faster though.

    --
    szr
     
    szr, Jul 15, 2008
    #12
  13. Guest

    thanks for your help! I will try out your suggestions!

    Regards
    John
     
    , Jul 16, 2008
    #13
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. SIgnOff
    Replies:
    7
    Views:
    608
    SIgnOff
    Mar 21, 2005
  2. Chris Brat
    Replies:
    7
    Views:
    378
    Marc 'BlackJack' Rintsch
    Sep 6, 2006
  3. amadain
    Replies:
    11
    Views:
    438
    Paul McGuire
    Feb 14, 2007
  4. Dun Peal
    Replies:
    2
    Views:
    263
    Carl Banks
    Oct 18, 2010
  5. Junkone
    Replies:
    2
    Views:
    113
    Jano Svitok
    Dec 26, 2007
Loading...

Share This Page