How to chck for string in array elements

F

folkvord

Hi!

I am trying to make an index of a lot of textfiles. Since the index
thends to get quite large, I want to optimize my algorithm som that if
$string is a part of a element already stored in @words it won't be
added to the array.

As an example:

my @words=qw(this is just an example of words);
$string = "his";

Since "his" is a substring of "this", it shouldnt be neccessary to
index it. But I cant figure out an efficient way to check if $string is
a substring of an element in the array.

Can anyone give me a hint her, please ?
 
I

it_says_BALLS_on_your_forehead

Hi!

I am trying to make an index of a lot of textfiles. Since the index
thends to get quite large, I want to optimize my algorithm som that if
$string is a part of a element already stored in @words it won't be
added to the array.

As an example:

my @words=qw(this is just an example of words);
$string = "his";

Since "his" is a substring of "this", it shouldnt be neccessary to
index it. But I cant figure out an efficient way to check if $string is
a substring of an element in the array.

Can anyone give me a hint her, please ?

fastest way, probably use the index() function. if speed isn't that
much of an issue, regex matching m// is the most complete solution.
 
I

Ian Wilson

Hi!

I am trying to make an index of a lot of textfiles. Since the index
thends to get quite large, I want to optimize my algorithm som that if
$string is a part of a element already stored in @words it won't be
added to the array.

As an example:

my @words=qw(this is just an example of words);
$string = "his";

Since "his" is a substring of "this", it shouldnt be neccessary to
index it.

Your assertion seems strange to me, "his" is a valid English word that
I'd expect to be able to look up in an index. Presumably you know what
you are doing?
But I cant figure out an efficient way to check if $string is
a substring of an element in the array.

Can anyone give me a hint her, please ?

I've no idea how efficient it is, but maybe this counts as a hint?

#!/usr/bin/perl
use strict;
use warnings;

my @words=qw(this is just an example of words);
my $string = "his";

for my $word (@words) {
if ($word =~ /$string/) {
print "$string is a substring of $word\n";
}
}

You may need to post a minimal but complete working example of your
current code if you need help optimising it.

HTH
 
F

folkvord

Hi!

Thanks for your input, both of you. I will check out the index()
solution to see if it helps.

Yes, I know what I am doing, and that this might cause the effect you
are indicating. Even though, this is the way I want it to be.
 
I

it_says_BALLS_on_your_forehead

it_says_BALLS_on_your_forehead said:
fastest way, probably use the index() function. if speed isn't that
much of an issue, regex matching m// is the most complete solution.

this is surprising (goes to show you, always benchmark :))

======
use strict; use warnings;

use Benchmark qw/cmpthese/;

sub reg {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
$word !~ m/$string/o and push @words, $string;
}
}


sub ind {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
my $pos = index( $word, $string );
$pos == -1 and push @words, $string;
}
}

cmpthese( -1, {
r => \&reg,
i => \&ind,
});


__END__
Rate i r
i 64742/s -- -8%
r 70314/s 9% --


....of course, bear in mind that the data set influences efficiency a
lot, contents, size, order, etc. typically however, i think it's safe
to say that index() is faster than regexes. regexes provide you with
far greater versatility though. you can ignore case for one thing,
which could be huge.
 
I

it_says_BALLS_on_your_forehead

Hi!

I am trying to make an index of a lot of textfiles. Since the index
thends to get quite large, I want to optimize my algorithm som that if
$string is a part of a element already stored in @words it won't be
added to the array.

As an example:

my @words=qw(this is just an example of words);
$string = "his";

Since "his" is a substring of "this", it shouldnt be neccessary to
index it. But I cant figure out an efficient way to check if $string is
a substring of an element in the array.

Can anyone give me a hint her, please ?

btw, your example fails to match your spec, since 'is' is a substring
of 'this'.
 
F

folkvord

You are right!

But as this was just an example, I obviously didn't put enough effort
in it ;)

The reason I want to index this way is that I am making an index of
each file. If the file contains the words "this", "his" and "is", only
"this" will be neccessary to store to get the correct match. This way I
should be able to save a lot of space, and that is my primary concern.
I am sorry the example faild to match this spec.
 
A

Anno Siegel

Hi!

I am trying to make an index of a lot of textfiles. Since the index
thends to get quite large, I want to optimize my algorithm som that if
$string is a part of a element already stored in @words it won't be
added to the array.

As an example:

my @words=qw(this is just an example of words);
$string = "his";

Since "his" is a substring of "this", it shouldnt be neccessary to
index it. But I cant figure out an efficient way to check if $string is
a substring of an element in the array.

Can anyone give me a hint her, please ?

That's going to cost a lot of time. You have to check each new word
basically against all words yet collected. That will grow quadratically
with the number of words.

To be consistent, you'd also have to throw out words that are a substring
of a given new word.

If I had to do it, I'd first build a list of (unique) words and sort it
by descending length. Then collect words by joining them together with
a "safe" character that doesn't appear in any of the words. That way,
a single index() operation can decide if a new word is a substring of
a word already in the list. Since you are adding short words after
long ones, you'll detect all substrings that way. Untested:

my $coll = '';
for ( sort { length( $b) <=> length( $a) } @words ) {
next if index( $coll, $_) >= 0;
$coll .= "$_:";
}
my @selection = split /:/, $coll;

Anno
 
I

it_says_BALLS_on_your_forehead

You are right!

But as this was just an example, I obviously didn't put enough effort
in it ;)

The reason I want to index this way is that I am making an index of
each file. If the file contains the words "this", "his" and "is", only
"this" will be neccessary to store to get the correct match. This way I
should be able to save a lot of space, and that is my primary concern.
I am sorry the example faild to match this spec.

i'm not exactly sure what you mean by "index" in this context. if you
want to look up a word, you couldn't if it was a substring of the word,
since the smaller word would not have its own entry in the index. i
would just throw every new word into a hash. otherwise, many problems
could arise.

consider this: you being your "algorithm", and your array is initially
empty. the first word you add is 'is'. the next word you come across is
'his'. you check, and sure enough, 'his' is NOT a substring of 'is', so
you add 'his' to your array. then, you see 'this', which is neither a
substring of his or is, so you add that to the array...you can see
where i'm going with this. to avoid this, you would have to constantly
iterate over the entire array and "clean it up"--purge entries that are
subsets of others. this would be prohibitive performance-wise, since
you would need to check every entry against every other entry in the
array--O(n^2). and that's just for the cleanup! your index wouldn't
really serve as an index for words that are substrings of true entries.


i don't know why space would be an issue, disk is cheap (if you are
referring to disk space--will you be saving this "index" to a flat
file? a database?) go with a hash. it's faster, easier, and most
importantly, correct. to save some space, you can force lower or force
upper entries to avoid duplicating the same word.

====
use strict; use warnings;

my $txt = 'It was the best of times it was the worst of times';
my %words;
while ( $txt =~ m/(\w+)/g ) {
$words{"\L$1"}++;
}


for my $key ( keys %words ) {
print $key, ' => ', $words{$key}, "\n";
}

__END__

worst => 1
the => 2
was => 2
it => 2
of => 2
best => 1
times => 2
 
I

it_says_BALLS_on_your_forehead

it_says_BALLS_on_your_forehead said:
this is surprising (goes to show you, always benchmark :))

======
use strict; use warnings;

use Benchmark qw/cmpthese/;

sub reg {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
$word !~ m/$string/o and push @words, $string;
}
}


sub ind {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
my $pos = index( $word, $string );
$pos == -1 and push @words, $string;
}
}

cmpthese( -1, {
r => \&reg,
i => \&ind,
});


__END__
Rate i r
i 64742/s -- -8%
r 70314/s 9% --


...of course, bear in mind that the data set influences efficiency a
lot, contents, size, order, etc. typically however, i think it's safe
to say that index() is faster than regexes. regexes provide you with
far greater versatility though. you can ignore case for one thing,
which could be huge.

bad coding also affects efficiency. i condensed the two statements in
the for loop of my ind sub to one, and now ind is faster than reg.
=====

use strict; use warnings;

use Benchmark qw/cmpthese/;

sub reg {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
$word !~ m/$string/o and push @words, $string;
}
}

sub ind {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
(index( $word, $string )) == -1 and push @words, $string;
}
}

cmpthese( -1, {
r => \&reg,
i => \&ind,


});

__END__
Rate r i
r 70382/s -- -6%
i 75044/s 7% --
 
I

it_says_BALLS_on_your_forehead

Anno said:
That's going to cost a lot of time. You have to check each new word
basically against all words yet collected. That will grow quadratically
with the number of words.

To be consistent, you'd also have to throw out words that are a substring
of a given new word.

If I had to do it, I'd first build a list of (unique) words and sort it
by descending length. Then collect words by joining them together with
a "safe" character that doesn't appear in any of the words. That way,
a single index() operation can decide if a new word is a substring of
a word already in the list. Since you are adding short words after
long ones, you'll detect all substrings that way. Untested:

my $coll = '';
for ( sort { length( $b) <=> length( $a) } @words ) {
next if index( $coll, $_) >= 0;
$coll .= "$_:";
}
my @selection = split /:/, $coll;

clever, i like it :).
 
X

xhoster

it_says_BALLS_on_your_forehead said:
bad coding also affects efficiency. i condensed the two statements in
the for loop of my ind sub to one, and now ind is faster than reg.

But you also need to benchamrk something that does what it is supposed
to do. So you should always have a print statement or an assertion or
something in the subroutine to make sure it works, then comment it out
for the real benchmarking.
=====

use strict; use warnings;

use Benchmark qw/cmpthese/;

sub reg {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
$word !~ m/$string/o and push @words, $string;
}

print "reg: @words\n";
......

__END__
reg: this is just an example of words his his his his his his
reg: this is just an example of words his his his his his his
reg: this is just an example of words his his his his his his
reg: this is just an example of words his his his his his his


Hmmm....


Xho
 
I

it_says_BALLS_on_your forehead

But you also need to benchamrk something that does what it is supposed
to do.

well sure, if you want to get all logical about it. ;-).

So you should always have a print statement or an assertion or
something in the subroutine to make sure it works, then comment it out
for the real benchmarking.



print "reg: @words\n";

.....

__END__
reg: this is just an example of words his his his his his his
reg: this is just an example of words his his his his his his
reg: this is just an example of words his his his his his his
reg: this is just an example of words his his his his his his


Hmmm....

right, i should have checked if ALL of the matches/index checks failed,
THEN pushed the word into the array. thx for the correction :)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Staff online

Members online

Forum statistics

Threads
473,767
Messages
2,569,571
Members
45,045
Latest member
DRCM

Latest Threads

Top