How to chck for string in array elements

Discussion in 'Perl Misc' started by folkvord@gmail.com, Feb 23, 2006.

  1. Guest

    Hi!

    I am trying to make an index of a lot of textfiles. Since the index
    thends to get quite large, I want to optimize my algorithm som that if
    $string is a part of a element already stored in @words it won't be
    added to the array.

    As an example:

    my @words=qw(this is just an example of words);
    $string = "his";

    Since "his" is a substring of "this", it shouldnt be neccessary to
    index it. But I cant figure out an efficient way to check if $string is
    a substring of an element in the array.

    Can anyone give me a hint her, please ?
     
    , Feb 23, 2006
    #1
    1. Advertising

  2. wrote:
    > Hi!
    >
    > I am trying to make an index of a lot of textfiles. Since the index
    > thends to get quite large, I want to optimize my algorithm som that if
    > $string is a part of a element already stored in @words it won't be
    > added to the array.
    >
    > As an example:
    >
    > my @words=qw(this is just an example of words);
    > $string = "his";
    >
    > Since "his" is a substring of "this", it shouldnt be neccessary to
    > index it. But I cant figure out an efficient way to check if $string is
    > a substring of an element in the array.
    >
    > Can anyone give me a hint her, please ?


    fastest way, probably use the index() function. if speed isn't that
    much of an issue, regex matching m// is the most complete solution.
     
    it_says_BALLS_on_your_forehead, Feb 23, 2006
    #2
    1. Advertising

  3. Ian Wilson Guest

    wrote:
    > Hi!
    >
    > I am trying to make an index of a lot of textfiles. Since the index
    > thends to get quite large, I want to optimize my algorithm som that if
    > $string is a part of a element already stored in @words it won't be
    > added to the array.
    >
    > As an example:
    >
    > my @words=qw(this is just an example of words);
    > $string = "his";
    >
    > Since "his" is a substring of "this", it shouldnt be neccessary to
    > index it.


    Your assertion seems strange to me, "his" is a valid English word that
    I'd expect to be able to look up in an index. Presumably you know what
    you are doing?

    > But I cant figure out an efficient way to check if $string is
    > a substring of an element in the array.
    >
    > Can anyone give me a hint her, please ?
    >


    I've no idea how efficient it is, but maybe this counts as a hint?

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @words=qw(this is just an example of words);
    my $string = "his";

    for my $word (@words) {
    if ($word =~ /$string/) {
    print "$string is a substring of $word\n";
    }
    }

    You may need to post a minimal but complete working example of your
    current code if you need help optimising it.

    HTH
     
    Ian Wilson, Feb 23, 2006
    #3
  4. Guest

    Hi!

    Thanks for your input, both of you. I will check out the index()
    solution to see if it helps.

    Yes, I know what I am doing, and that this might cause the effect you
    are indicating. Even though, this is the way I want it to be.
     
    , Feb 23, 2006
    #4
  5. it_says_BALLS_on_your_forehead wrote:
    > wrote:
    > > Hi!
    > >
    > > I am trying to make an index of a lot of textfiles. Since the index
    > > thends to get quite large, I want to optimize my algorithm som that if
    > > $string is a part of a element already stored in @words it won't be
    > > added to the array.
    > >
    > > As an example:
    > >
    > > my @words=qw(this is just an example of words);
    > > $string = "his";
    > >
    > > Since "his" is a substring of "this", it shouldnt be neccessary to
    > > index it. But I cant figure out an efficient way to check if $string is
    > > a substring of an element in the array.
    > >
    > > Can anyone give me a hint her, please ?

    >
    > fastest way, probably use the index() function. if speed isn't that
    > much of an issue, regex matching m// is the most complete solution.


    this is surprising (goes to show you, always benchmark :))

    ======
    use strict; use warnings;

    use Benchmark qw/cmpthese/;

    sub reg {
    my @words = qw/this is just an example of words/;
    my $string = 'his';
    for my $word ( @words ) {
    $word !~ m/$string/o and push @words, $string;
    }
    }


    sub ind {
    my @words = qw/this is just an example of words/;
    my $string = 'his';
    for my $word ( @words ) {
    my $pos = index( $word, $string );
    $pos == -1 and push @words, $string;
    }
    }

    cmpthese( -1, {
    r => \&reg,
    i => \&ind,
    });


    __END__
    Rate i r
    i 64742/s -- -8%
    r 70314/s 9% --


    ....of course, bear in mind that the data set influences efficiency a
    lot, contents, size, order, etc. typically however, i think it's safe
    to say that index() is faster than regexes. regexes provide you with
    far greater versatility though. you can ignore case for one thing,
    which could be huge.
     
    it_says_BALLS_on_your_forehead, Feb 23, 2006
    #5
  6. wrote:
    > Hi!
    >
    > I am trying to make an index of a lot of textfiles. Since the index
    > thends to get quite large, I want to optimize my algorithm som that if
    > $string is a part of a element already stored in @words it won't be
    > added to the array.
    >
    > As an example:
    >
    > my @words=qw(this is just an example of words);
    > $string = "his";
    >
    > Since "his" is a substring of "this", it shouldnt be neccessary to
    > index it. But I cant figure out an efficient way to check if $string is
    > a substring of an element in the array.
    >
    > Can anyone give me a hint her, please ?


    btw, your example fails to match your spec, since 'is' is a substring
    of 'this'.
     
    it_says_BALLS_on_your_forehead, Feb 23, 2006
    #6
  7. Guest

    You are right!

    But as this was just an example, I obviously didn't put enough effort
    in it ;)

    The reason I want to index this way is that I am making an index of
    each file. If the file contains the words "this", "his" and "is", only
    "this" will be neccessary to store to get the correct match. This way I
    should be able to save a lot of space, and that is my primary concern.
    I am sorry the example faild to match this spec.
     
    , Feb 23, 2006
    #7
  8. Anno Siegel Guest

    <> wrote in comp.lang.perl.misc:
    > Hi!
    >
    > I am trying to make an index of a lot of textfiles. Since the index
    > thends to get quite large, I want to optimize my algorithm som that if
    > $string is a part of a element already stored in @words it won't be
    > added to the array.
    >
    > As an example:
    >
    > my @words=qw(this is just an example of words);
    > $string = "his";
    >
    > Since "his" is a substring of "this", it shouldnt be neccessary to
    > index it. But I cant figure out an efficient way to check if $string is
    > a substring of an element in the array.
    >
    > Can anyone give me a hint her, please ?


    That's going to cost a lot of time. You have to check each new word
    basically against all words yet collected. That will grow quadratically
    with the number of words.

    To be consistent, you'd also have to throw out words that are a substring
    of a given new word.

    If I had to do it, I'd first build a list of (unique) words and sort it
    by descending length. Then collect words by joining them together with
    a "safe" character that doesn't appear in any of the words. That way,
    a single index() operation can decide if a new word is a substring of
    a word already in the list. Since you are adding short words after
    long ones, you'll detect all substrings that way. Untested:

    my $coll = '';
    for ( sort { length( $b) <=> length( $a) } @words ) {
    next if index( $coll, $_) >= 0;
    $coll .= "$_:";
    }
    my @selection = split /:/, $coll;

    Anno
    --
    If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers.
     
    Anno Siegel, Feb 23, 2006
    #8
  9. wrote:
    > You are right!
    >
    > But as this was just an example, I obviously didn't put enough effort
    > in it ;)
    >
    > The reason I want to index this way is that I am making an index of
    > each file. If the file contains the words "this", "his" and "is", only
    > "this" will be neccessary to store to get the correct match. This way I
    > should be able to save a lot of space, and that is my primary concern.
    > I am sorry the example faild to match this spec.


    i'm not exactly sure what you mean by "index" in this context. if you
    want to look up a word, you couldn't if it was a substring of the word,
    since the smaller word would not have its own entry in the index. i
    would just throw every new word into a hash. otherwise, many problems
    could arise.

    consider this: you being your "algorithm", and your array is initially
    empty. the first word you add is 'is'. the next word you come across is
    'his'. you check, and sure enough, 'his' is NOT a substring of 'is', so
    you add 'his' to your array. then, you see 'this', which is neither a
    substring of his or is, so you add that to the array...you can see
    where i'm going with this. to avoid this, you would have to constantly
    iterate over the entire array and "clean it up"--purge entries that are
    subsets of others. this would be prohibitive performance-wise, since
    you would need to check every entry against every other entry in the
    array--O(n^2). and that's just for the cleanup! your index wouldn't
    really serve as an index for words that are substrings of true entries.


    i don't know why space would be an issue, disk is cheap (if you are
    referring to disk space--will you be saving this "index" to a flat
    file? a database?) go with a hash. it's faster, easier, and most
    importantly, correct. to save some space, you can force lower or force
    upper entries to avoid duplicating the same word.

    ====
    use strict; use warnings;

    my $txt = 'It was the best of times it was the worst of times';
    my %words;
    while ( $txt =~ m/(\w+)/g ) {
    $words{"\L$1"}++;
    }


    for my $key ( keys %words ) {
    print $key, ' => ', $words{$key}, "\n";
    }

    __END__

    worst => 1
    the => 2
    was => 2
    it => 2
    of => 2
    best => 1
    times => 2
     
    it_says_BALLS_on_your_forehead, Feb 23, 2006
    #9
  10. it_says_BALLS_on_your_forehead wrote:
    > > fastest way, probably use the index() function. if speed isn't that
    > > much of an issue, regex matching m// is the most complete solution.

    >
    > this is surprising (goes to show you, always benchmark :))
    >
    > ======
    > use strict; use warnings;
    >
    > use Benchmark qw/cmpthese/;
    >
    > sub reg {
    > my @words = qw/this is just an example of words/;
    > my $string = 'his';
    > for my $word ( @words ) {
    > $word !~ m/$string/o and push @words, $string;
    > }
    > }
    >
    >
    > sub ind {
    > my @words = qw/this is just an example of words/;
    > my $string = 'his';
    > for my $word ( @words ) {
    > my $pos = index( $word, $string );
    > $pos == -1 and push @words, $string;
    > }
    > }
    >
    > cmpthese( -1, {
    > r => \&reg,
    > i => \&ind,
    > });
    >
    >
    > __END__
    > Rate i r
    > i 64742/s -- -8%
    > r 70314/s 9% --
    >
    >
    > ...of course, bear in mind that the data set influences efficiency a
    > lot, contents, size, order, etc. typically however, i think it's safe
    > to say that index() is faster than regexes. regexes provide you with
    > far greater versatility though. you can ignore case for one thing,
    > which could be huge.


    bad coding also affects efficiency. i condensed the two statements in
    the for loop of my ind sub to one, and now ind is faster than reg.
    =====

    use strict; use warnings;

    use Benchmark qw/cmpthese/;

    sub reg {
    my @words = qw/this is just an example of words/;
    my $string = 'his';
    for my $word ( @words ) {
    $word !~ m/$string/o and push @words, $string;
    }
    }

    sub ind {
    my @words = qw/this is just an example of words/;
    my $string = 'his';
    for my $word ( @words ) {
    (index( $word, $string )) == -1 and push @words, $string;
    }
    }

    cmpthese( -1, {
    r => \&reg,
    i => \&ind,


    });

    __END__
    Rate r i
    r 70382/s -- -6%
    i 75044/s 7% --
     
    it_says_BALLS_on_your_forehead, Feb 23, 2006
    #10
  11. Anno Siegel wrote:
    > <> wrote in comp.lang.perl.misc:
    > > Hi!
    > >
    > > I am trying to make an index of a lot of textfiles. Since the index
    > > thends to get quite large, I want to optimize my algorithm som that if
    > > $string is a part of a element already stored in @words it won't be
    > > added to the array.
    > >
    > > As an example:
    > >
    > > my @words=qw(this is just an example of words);
    > > $string = "his";
    > >
    > > Since "his" is a substring of "this", it shouldnt be neccessary to
    > > index it. But I cant figure out an efficient way to check if $string is
    > > a substring of an element in the array.
    > >
    > > Can anyone give me a hint her, please ?

    >
    > That's going to cost a lot of time. You have to check each new word
    > basically against all words yet collected. That will grow quadratically
    > with the number of words.
    >
    > To be consistent, you'd also have to throw out words that are a substring
    > of a given new word.
    >
    > If I had to do it, I'd first build a list of (unique) words and sort it
    > by descending length. Then collect words by joining them together with
    > a "safe" character that doesn't appear in any of the words. That way,
    > a single index() operation can decide if a new word is a substring of
    > a word already in the list. Since you are adding short words after
    > long ones, you'll detect all substrings that way. Untested:
    >
    > my $coll = '';
    > for ( sort { length( $b) <=> length( $a) } @words ) {
    > next if index( $coll, $_) >= 0;
    > $coll .= "$_:";
    > }
    > my @selection = split /:/, $coll;
    >


    clever, i like it :).
     
    it_says_BALLS_on_your_forehead, Feb 23, 2006
    #11
  12. Guest

    "it_says_BALLS_on_your_forehead" <> wrote:
    ....
    > >
    > > ...of course, bear in mind that the data set influences efficiency a
    > > lot, contents, size, order, etc. typically however, i think it's safe
    > > to say that index() is faster than regexes. regexes provide you with
    > > far greater versatility though. you can ignore case for one thing,
    > > which could be huge.

    >
    > bad coding also affects efficiency. i condensed the two statements in
    > the for loop of my ind sub to one, and now ind is faster than reg.


    But you also need to benchamrk something that does what it is supposed
    to do. So you should always have a print statement or an assertion or
    something in the subroutine to make sure it works, then comment it out
    for the real benchmarking.

    > =====
    >
    > use strict; use warnings;
    >
    > use Benchmark qw/cmpthese/;
    >
    > sub reg {
    > my @words = qw/this is just an example of words/;
    > my $string = 'his';
    > for my $word ( @words ) {
    > $word !~ m/$string/o and push @words, $string;
    > }


    print "reg: @words\n";

    > }

    ......

    __END__
    reg: this is just an example of words his his his his his his
    reg: this is just an example of words his his his his his his
    reg: this is just an example of words his his his his his his
    reg: this is just an example of words his his his his his his


    Hmmm....


    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Feb 23, 2006
    #12
  13. wrote:
    > "it_says_BALLS_on_your_forehead" <> wrote:
    > ...
    > > >
    > > > ...of course, bear in mind that the data set influences efficiency a
    > > > lot, contents, size, order, etc. typically however, i think it's safe
    > > > to say that index() is faster than regexes. regexes provide you with
    > > > far greater versatility though. you can ignore case for one thing,
    > > > which could be huge.

    > >
    > > bad coding also affects efficiency. i condensed the two statements in
    > > the for loop of my ind sub to one, and now ind is faster than reg.

    >
    > But you also need to benchamrk something that does what it is supposed
    > to do.


    well sure, if you want to get all logical about it. ;-).


    > So you should always have a print statement or an assertion or
    > something in the subroutine to make sure it works, then comment it out
    > for the real benchmarking.
    >



    > > =====
    > >
    > > use strict; use warnings;
    > >
    > > use Benchmark qw/cmpthese/;
    > >
    > > sub reg {
    > > my @words = qw/this is just an example of words/;
    > > my $string = 'his';
    > > for my $word ( @words ) {
    > > $word !~ m/$string/o and push @words, $string;
    > > }

    >
    > print "reg: @words\n";
    >
    > > }

    > .....
    >
    > __END__
    > reg: this is just an example of words his his his his his his
    > reg: this is just an example of words his his his his his his
    > reg: this is just an example of words his his his his his his
    > reg: this is just an example of words his his his his his his
    >
    >
    > Hmmm....


    right, i should have checked if ALL of the matches/index checks failed,
    THEN pushed the word into the array. thx for the correction :)
     
    it_says_BALLS_on_your forehead, Feb 23, 2006
    #13
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. diffused
    Replies:
    9
    Views:
    752
    Oscar kind
    Aug 1, 2004
  2. Shalini
    Replies:
    2
    Views:
    517
    Brian Genisio
    Jan 9, 2004
  3. P
    Replies:
    1
    Views:
    1,207
    Joe Kesselman
    Jul 7, 2006
  4. Replies:
    10
    Views:
    692
  5. Arti Singh
    Replies:
    2
    Views:
    145
    Arti Singh
    Jul 26, 2010
Loading...

Share This Page