How to chck for string in array elements

folkvord · Feb 23, 2006

Hi!

I am trying to make an index of a lot of textfiles. Since the index
thends to get quite large, I want to optimize my algorithm som that if
$string is a part of a element already stored in @words it won't be
added to the array.

As an example:

my @words=qw(this is just an example of words);
$string = "his";

Since "his" is a substring of "this", it shouldnt be neccessary to
index it. But I cant figure out an efficient way to check if $string is
a substring of an element in the array.

Can anyone give me a hint her, please ?

it_says_BALLS_on_your_forehead · Feb 23, 2006

[email protected] said:
Hi!

I am trying to make an index of a lot of textfiles. Since the index
thends to get quite large, I want to optimize my algorithm som that if
$string is a part of a element already stored in @words it won't be
added to the array.

As an example:

my @words=qw(this is just an example of words);
$string = "his";

Since "his" is a substring of "this", it shouldnt be neccessary to
index it. But I cant figure out an efficient way to check if $string is
a substring of an element in the array.

Can anyone give me a hint her, please ?

fastest way, probably use the index() function. if speed isn't that
much of an issue, regex matching m// is the most complete solution.

Ian Wilson · Feb 23, 2006

Hi!

I am trying to make an index of a lot of textfiles. Since the index
thends to get quite large, I want to optimize my algorithm som that if
$string is a part of a element already stored in @words it won't be
added to the array.

As an example:

my @words=qw(this is just an example of words);
$string = "his";

Since "his" is a substring of "this", it shouldnt be neccessary to
index it.

Your assertion seems strange to me, "his" is a valid English word that
I'd expect to be able to look up in an index. Presumably you know what
you are doing?

But I cant figure out an efficient way to check if $string is
a substring of an element in the array.

Can anyone give me a hint her, please ?

I've no idea how efficient it is, but maybe this counts as a hint?

#!/usr/bin/perl
use strict;
use warnings;

my @words=qw(this is just an example of words);
my $string = "his";

for my $word (@words) {
if ($word =~ /$string/) {
print "$string is a substring of $word\n";
}
}

You may need to post a minimal but complete working example of your
current code if you need help optimising it.

HTH

folkvord · Feb 23, 2006

Hi!

Thanks for your input, both of you. I will check out the index()
solution to see if it helps.

Yes, I know what I am doing, and that this might cause the effect you
are indicating. Even though, this is the way I want it to be.

it_says_BALLS_on_your_forehead · Feb 23, 2006

it_says_BALLS_on_your_forehead said:
fastest way, probably use the index() function. if speed isn't that
much of an issue, regex matching m// is the most complete solution.

this is surprising (goes to show you, always benchmark

)

======
use strict; use warnings;

use Benchmark qw/cmpthese/;

sub reg {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
$word !~ m/$string/o and push @words, $string;
}
}

sub ind {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
my $pos = index( $word, $string );
$pos == -1 and push @words, $string;
}
}

cmpthese( -1, {
r => \&reg,
i => \&ind,
});

__END__
Rate i r
i 64742/s -- -8%
r 70314/s 9% --

....of course, bear in mind that the data set influences efficiency a
lot, contents, size, order, etc. typically however, i think it's safe
to say that index() is faster than regexes. regexes provide you with
far greater versatility though. you can ignore case for one thing,
which could be huge.

it_says_BALLS_on_your_forehead · Feb 23, 2006

Hi!

I am trying to make an index of a lot of textfiles. Since the index
thends to get quite large, I want to optimize my algorithm som that if
$string is a part of a element already stored in @words it won't be
added to the array.

As an example:

my @words=qw(this is just an example of words);
$string = "his";

Since "his" is a substring of "this", it shouldnt be neccessary to
index it. But I cant figure out an efficient way to check if $string is
a substring of an element in the array.

Can anyone give me a hint her, please ?

btw, your example fails to match your spec, since 'is' is a substring
of 'this'.

folkvord · Feb 23, 2006

You are right!

But as this was just an example, I obviously didn't put enough effort
in it

The reason I want to index this way is that I am making an index of
each file. If the file contains the words "this", "his" and "is", only
"this" will be neccessary to store to get the correct match. This way I
should be able to save a lot of space, and that is my primary concern.
I am sorry the example faild to match this spec.

Anno Siegel · Feb 23, 2006

Hi!

I am trying to make an index of a lot of textfiles. Since the index
thends to get quite large, I want to optimize my algorithm som that if
$string is a part of a element already stored in @words it won't be
added to the array.

As an example:

my @words=qw(this is just an example of words);
$string = "his";

Since "his" is a substring of "this", it shouldnt be neccessary to
index it. But I cant figure out an efficient way to check if $string is
a substring of an element in the array.

Can anyone give me a hint her, please ?

That's going to cost a lot of time. You have to check each new word
basically against all words yet collected. That will grow quadratically
with the number of words.

To be consistent, you'd also have to throw out words that are a substring
of a given new word.

If I had to do it, I'd first build a list of (unique) words and sort it
by descending length. Then collect words by joining them together with
a "safe" character that doesn't appear in any of the words. That way,
a single index() operation can decide if a new word is a substring of
a word already in the list. Since you are adding short words after
long ones, you'll detect all substrings that way. Untested:

my $coll = '';
for ( sort { length( $b) <=> length( $a) } @words ) {
next if index( $coll, $_) >= 0;
$coll .= "$_:";
}
my @selection = split /:/, $coll;

Anno

it_says_BALLS_on_your_forehead · Feb 23, 2006

You are right!

But as this was just an example, I obviously didn't put enough effort
in it

The reason I want to index this way is that I am making an index of
each file. If the file contains the words "this", "his" and "is", only
"this" will be neccessary to store to get the correct match. This way I
should be able to save a lot of space, and that is my primary concern.
I am sorry the example faild to match this spec.

i'm not exactly sure what you mean by "index" in this context. if you
want to look up a word, you couldn't if it was a substring of the word,
since the smaller word would not have its own entry in the index. i
would just throw every new word into a hash. otherwise, many problems
could arise.

consider this: you being your "algorithm", and your array is initially
empty. the first word you add is 'is'. the next word you come across is
'his'. you check, and sure enough, 'his' is NOT a substring of 'is', so
you add 'his' to your array. then, you see 'this', which is neither a
substring of his or is, so you add that to the array...you can see
where i'm going with this. to avoid this, you would have to constantly
iterate over the entire array and "clean it up"--purge entries that are
subsets of others. this would be prohibitive performance-wise, since
you would need to check every entry against every other entry in the
array--O(n^2). and that's just for the cleanup! your index wouldn't
really serve as an index for words that are substrings of true entries.

i don't know why space would be an issue, disk is cheap (if you are
referring to disk space--will you be saving this "index" to a flat
file? a database?) go with a hash. it's faster, easier, and most
importantly, correct. to save some space, you can force lower or force
upper entries to avoid duplicating the same word.

====
use strict; use warnings;

my $txt = 'It was the best of times it was the worst of times';
my %words;
while ( $txt =~ m/(\w+)/g ) {
$words{"\L$1"}++;
}

for my $key ( keys %words ) {
print $key, ' => ', $words{$key}, "\n";
}

__END__

worst => 1
the => 2
was => 2
it => 2
of => 2
best => 1
times => 2

it_says_BALLS_on_your_forehead · Feb 23, 2006

it_says_BALLS_on_your_forehead said:
this is surprising (goes to show you, always benchmark )

======
use strict; use warnings;

use Benchmark qw/cmpthese/;

sub reg {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
$word !~ m/$string/o and push @words, $string;
}
}

sub ind {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
my $pos = index( $word, $string );
$pos == -1 and push @words, $string;
}
}

cmpthese( -1, {
r => \&reg,
i => \&ind,
});

__END__
Rate i r
i 64742/s -- -8%
r 70314/s 9% --

...of course, bear in mind that the data set influences efficiency a
lot, contents, size, order, etc. typically however, i think it's safe
to say that index() is faster than regexes. regexes provide you with
far greater versatility though. you can ignore case for one thing,
which could be huge.

bad coding also affects efficiency. i condensed the two statements in
the for loop of my ind sub to one, and now ind is faster than reg.
=====

use strict; use warnings;

use Benchmark qw/cmpthese/;

sub reg {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
$word !~ m/$string/o and push @words, $string;
}
}

sub ind {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
(index( $word, $string )) == -1 and push @words, $string;
}
}

cmpthese( -1, {
r => \&reg,
i => \&ind,

});

__END__
Rate r i
r 70382/s -- -6%
i 75044/s 7% --

it_says_BALLS_on_your_forehead · Feb 23, 2006

Anno said:
That's going to cost a lot of time. You have to check each new word
basically against all words yet collected. That will grow quadratically
with the number of words.

To be consistent, you'd also have to throw out words that are a substring
of a given new word.

If I had to do it, I'd first build a list of (unique) words and sort it
by descending length. Then collect words by joining them together with
a "safe" character that doesn't appear in any of the words. That way,
a single index() operation can decide if a new word is a substring of
a word already in the list. Since you are adding short words after
long ones, you'll detect all substrings that way. Untested:

my $coll = '';
for ( sort { length( $b) <=> length( $a) } @words ) {
next if index( $coll, $_) >= 0;
$coll .= "$_:";
}
my @selection = split /:/, $coll;

clever, i like it

.

xhoster · Feb 23, 2006

it_says_BALLS_on_your_forehead said:
bad coding also affects efficiency. i condensed the two statements in
the for loop of my ind sub to one, and now ind is faster than reg.

But you also need to benchamrk something that does what it is supposed
to do. So you should always have a print statement or an assertion or
something in the subroutine to make sure it works, then comment it out
for the real benchmarking.

=====

use strict; use warnings;

use Benchmark qw/cmpthese/;

sub reg {
my @words = qw/this is just an example of words/;
my $string = 'his';
for my $word ( @words ) {
$word !~ m/$string/o and push @words, $string;
}

print "reg: @words\n";

}

......

__END__
reg: this is just an example of words his his his his his his
reg: this is just an example of words his his his his his his
reg: this is just an example of words his his his his his his
reg: this is just an example of words his his his his his his

Hmmm....

Xho

it_says_BALLS_on_your forehead · Feb 23, 2006

But you also need to benchamrk something that does what it is supposed
to do.

well sure, if you want to get all logical about it. ;-).

So you should always have a print statement or an assertion or
something in the subroutine to make sure it works, then comment it out
for the real benchmarking.

print "reg: @words\n";

.....

__END__
reg: this is just an example of words his his his his his his
reg: this is just an example of words his his his his his his
reg: this is just an example of words his his his his his his
reg: this is just an example of words his his his his his his

Hmmm....

right, i should have checked if ALL of the matches/index checks failed,
THEN pushed the word into the array. thx for the correction

Converting an Array to a String in JavaScript	7	Sep 22, 2023
Copy string from 2D array to a 1D array in C	1	Nov 1, 2023
Hello guys ! How do I convert a string from an array into numbers ? Javascript	3	Dec 19, 2022
How to add a className to an object in array in js?	1	Dec 7, 2021
Accessing array index addresses with custom datatype in a function	0	Jun 2, 2022
Outputting signal values to terminal Within Character Array	0	Dec 10, 2021
Trying to get the average value of the elements, please help ! JavaScript	3	Dec 13, 2022
Hi, I am a webflow user. I am looking for CSS code that can KEEP ALL ELEMENTS POSITIONED in the SAME spot across all resolutions	0	Oct 27, 2023

How to chck for string in array elements

folkvord

it_says_BALLS_on_your_forehead

Ian Wilson

folkvord

it_says_BALLS_on_your_forehead

it_says_BALLS_on_your_forehead

folkvord

Anno Siegel

it_says_BALLS_on_your_forehead

it_says_BALLS_on_your_forehead

it_says_BALLS_on_your_forehead

xhoster

it_says_BALLS_on_your forehead

Ask a Question

Similar Threads

Staff online

Members online

Forum statistics

Latest Threads