Comparing values of multiple hash keys

Jason

In the past I've written a simple search engine and have been using it
for a while, but now I'm trying to make it a little more intuitive.

Originally, the script was simple. Take a keyword entered by a form,
compare it to each value of an array (@data), and return any value
containing the entry:

$var = param('keyword');
foreach $key (@data) {
if ($key =~ /$var/i) { push (@founddata, $key); }
}


But now I'm trying to allow for multi-word phrases, which is a bit more
complex. I couldn't find how others are doing it, so I'm winging it on
my own. I started by splitting $var by the whitespace into an array,
counting the number of instances that $var appeared in the $key, then
adding the results to a hash value:

my (%founddata, @keywords);
my $var = param('keyword');

$var =~ s/(?:,|'|\.)//g; # Remove comma, apostrophe, or period
@keywords = split(/ /, $var);

foreach my $key (@data) {
    foreach my $term (@keywords) {
        my @matches = $key =~ /($term)/ig;   # count hits in the data item, not in the query
        my $size = @matches;                 # number of times $term appears in $key
        if ($size > 0) { $founddata{$term} .= "${size}::$key|"; }
    }
}


Let's say that @data = ("Men", "Women", "Children", "Pets",
"Monsters"), and someone searches for "m n" (without the quotes, of
course, because so far I'm not worrying about strict phrases). The
results would be:

$keywords[0] = m;
$keywords[1] = n;

$founddata{'m'} = 1::Men|1::Women|1::Monsters;
$founddata{'n'} = 1::Men|1::Women|1::Children|2::Monsters;


Now, how do I compare the two keys "m" and "n" to both remove any value
that's not in both (i.e., Children), and then add the rest together to
create something like 2::Men|2::Women|3::Monsters?

In my mind, I would take the result of this, split it by | into an
array, sort it so that Monsters is first, then split it by :: to remove
the numbers (leaving the final values sorted by the number of
instances), unless you guys know an easier route.

TIA,

Jason
 
Jason

$founddata{'n'} = 1::Men|1::Women|1::Children|2::Monsters;

Whoops, I screwed up there. I was trying to think of a word with the
letter "n" in it twice; don't know why I thought that Monsters did
(other than the fact that I posted at 3am).

So obviously this would not be 2::Monsters, it would be 1::Monsters.
But for the sake of discussion, pretend that "Monsters" was a word with
2 n's in it ;-)

- J
 
usenet

Jason said:
$founddata{'m'} = 1::Men|1::Women|1::Monsters;
$founddata{'n'} = 1::Men|1::Women|1::Children|2::Monsters;


Now, how do I compare the two keys "m" and "n" to both remove any value
that's not in both (i.e., Children), and then add the rest together to
create something like 2::Men|2::Women|3::Monsters?

In my mind, I would take the result of this, split it by | into an
array, sort it so that Monsters is first, then split it by :: to remove
the numbers (leaving the final values sorted by the number of
instances), unless you guys know an easier route.

I think you are making this harder than it needs to be (and less
efficient, too).

A common mistake, especially among less experienced programmers, is to
try to build up a bunch of stuff and then post-process it (this is
often manifested by slurping a file into an array and then looping over
the array, which is almost always silly). In your case, you are taking
multiple input values and looping them against your data list to build
up individual arrays, and then trying to post-process those arrays to
get meaningful data. This is rather inefficient, because you must loop
over your data list multiple times (which is backwards, because your
data list is presumably large and your input list is presumably small).

Instead, turn it inside-out. Don't construct outside loops over your
input lists. Instead, loop over your data list (once). For each item
in the data list, calculate the value of its match to your criteria
(and ignore that element if it fails to meet your criteria).

This way, you fully resolve (or ignore) the value of each item in @data
one element at a time. Thus, you never need to post-process anything,
because everything you want to know is known when you finish the
outside loop around @data.

Consider this code:

#!/usr/bin/perl
use strict; use warnings;

my @data = qw{ Men Women Children Pets Monnsters };
my %keywords = ( 0 => 'm', 1 => 'n');
my @results = ();

WORD:
foreach my $word (@data) {                          # outside loop around word database
    my $count = 0;
    foreach my $keyword ( values %keywords ) {      # inside loop
        next WORD unless $word =~ /$keyword/i;      # fail criteria-reject
        while ($word =~ /$keyword/ig) { $count++ }  # perldoc -q count
    }
    push @results, "${count}::$word";
}
print join("|", @results), "\n";

__END__
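
The remaining step from the original question, ordering the hits by how often they matched and then dropping the counts, could be sketched roughly as follows. This is only an illustration: the @results list is hard-coded here to stand in for what the loop above builds, and the alphabetical tie-break is an assumption, not something the original post asked for.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Hypothetical input: "count::word" strings in the format built above.
my @results = ('2::Men', '2::Women', '3::Monnsters');

# Split each entry once into [count, word], sort by count (highest first),
# break ties alphabetically (an assumption), then keep only the word.
my @ranked =
    map  { $_->[1] }
    sort { $b->[0] <=> $a->[0] or $a->[1] cmp $b->[1] }
    map  { [ split /::/, $_, 2 ] }
    @results;

print join("|", @ranked), "\n";    # Monnsters|Men|Women
```

If the counts are kept in a hash or in separate variables instead of being glued into a string, the split step disappears entirely and only the sort is left.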
 
Dr.Ruud

(e-mail address removed) wrote:
push @results, "${count}::$word";

OR-Variant:

push(@results, "${count}::$word")
if $count ;

AND-Variant:

push(@results, "${count}::$word")
if $count == keys %keywords ;
 
Tad McClellan

Jason said:
Originally, the script was simple. Take a keyword entered by a form,
compare it to each value of an array (@data), and return any value
containing the entry:

$var = param('keyword');
foreach $key (@data) {
if ($key =~ /$var/i) { push (@founddata, $key); }
}


But now I'm trying to allow for multi-word phrases, which is a bit more
complex. I couldn't find how others are doing it,


Do just what you are already doing (but a hash is more natural for
representing a set), but do it once for each term in the multi-word phrase.

You then have N sets for your N-term phrase.

Find the intersection of the N sets, and you have the answer.

perldoc -q intersection

How do I compute the difference of two arrays? How do I compute the
intersection of two arrays?
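
A minimal sketch of that set-intersection idea, in the spirit of the FAQ entry, might look like the following. It assumes the data items and the search terms are each unique; the variable names and the \Q...\E quoting (which keeps regex metacharacters in user input literal) are additions for illustration, not part of the original script.

```perl
#!/usr/bin/perl
use strict; use warnings;

my @data     = qw{ Men Women Children Pets Monnsters };
my @keywords = qw{ m n };    # hypothetical search terms

# For each data item, count how many of the terms it matches at least once.
my %seen;
foreach my $term (@keywords) {
    $seen{$_}++ for grep { /\Q$term\E/i } @data;
}

# An item is in the intersection when every term matched it.
my @both = grep { ($seen{$_} || 0) == @keywords } @data;

print join("|", @both), "\n";    # Men|Women|Monnsters
```

Going through @data in the final grep keeps the hits in their original order, which a plain "keys %seen" would not.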


$var =~ s/(?:,|'|\.)//g; # Remove comma, apostrophe, or period


$var =~ tr/,'.//d; # Remove comma, apostrophe, or period
# only clearer and faster
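
As a quick check that the two forms do the same job, a small sketch with a made-up input string: tr///d deletes the listed characters and also returns how many it deleted.

```perl
#!/usr/bin/perl
use strict; use warnings;

my $with_regex = "Men, women's pets.";    # hypothetical input
my $with_tr    = $with_regex;

$with_regex =~ s/(?:,|'|\.)//g;           # the original substitution
my $removed = ($with_tr =~ tr/,'.//d);    # tr returns the number of characters deleted

print "$with_tr\n";                       # Men womens pets
print "removed $removed\n";               # removed 3
print "same result\n" if $with_regex eq $with_tr;
```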
 
Ted Zlatanov

Do just what you are already doing (but a hash is more natural for
representing a set), but do it once for each term in the multi-word phrase.

You then have N sets for your N-term phrase.

Find the intersection of the N sets, and you have the answer.

perldoc -q intersection

How do I compute the difference of two arrays? How do I compute the
intersection of two arrays?

The OP may want the union of the sets, based on his imprecise
specifications :)

Ted
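
If the union is what is wanted (an item qualifies when any term matches), one possible sketch is below. The data and terms are the sample values from earlier in the thread; the %count hash and the \Q...\E quoting are assumptions added for illustration.

```perl
#!/usr/bin/perl
use strict; use warnings;

my @data     = qw{ Men Women Children Pets Monnsters };
my @keywords = qw{ m n };    # hypothetical search terms

# Total number of occurrences of any term, per data item.
my %count;
foreach my $word (@data) {
    foreach my $term (@keywords) {
        my $hits = () = $word =~ /\Q$term\E/ig;    # count all matches
        $count{$word} += $hits if $hits;
    }
}

# The union is every item that picked up at least one hit.
my @union = grep { exists $count{$_} } @data;

print join("|", map { join("::", $count{$_}, $_) } @union), "\n";
# 2::Men|2::Women|1::Children|3::Monnsters
```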
 
usenet

Dr.Ruud said:
(e-mail address removed) wrote:

OR-Variant:
push(@results, "${count}::$word")
if $count ;
AND-Variant:
push(@results, "${count}::$word")
if $count == keys %keywords ;

The push statement in my code could never be executed unless the
keyword was found at least once in all data elements (as per the OP's
specification) so it would not be necessary to use conditional
expressions on the push as I coded it (interestingly, I sat in on a
great presentation at OSCON that Geoffrey Young gave just a few hours
ago on Devel::Cover which discussed how that module can help identify
redundant secondary checks).

However, the AND variant might not produce correct results. The OP
stipulated that all occurrences of each search term be counted within
the data space, so $count could easily exceed keys(), and keys() might
be equal to $count even though the term was not found in all data terms
(because some terms matched more than once).
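
One way around that ambiguity, sketched here under assumptions, is to keep two counters per data item: how many distinct keywords matched, and how many occurrences were found in total. The word "Nun" is added to the sample data purely to show the problem case (two hits for "n", none for "m"); the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict; use warnings;

my @data     = qw{ Men Women Children Pets Monnsters Nun };   # "Nun" added for illustration
my @keywords = qw{ m n };
my (@or_results, @and_results);

foreach my $word (@data) {
    my ($matched, $total) = (0, 0);
    foreach my $keyword (@keywords) {
        my $hits = () = $word =~ /\Q$keyword\E/ig;   # all occurrences of this keyword
        next unless $hits;
        $matched++;          # distinct keywords found in this item
        $total += $hits;     # total occurrences across keywords
    }
    push @or_results,  "${total}::$word" if $matched;               # any keyword
    push @and_results, "${total}::$word" if $matched == @keywords;  # every keyword
}

print "OR:  ", join("|", @or_results),  "\n";
print "AND: ", join("|", @and_results), "\n";
```

Here "Nun" has a total of 2, equal to the number of keywords, but only one keyword matched it, so it shows up in the OR list and not in the AND list.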
 
anno4000

Ted Zlatanov said:
The OP may want the union of the sets, based on his imprecise
specifications :)

Maybe. The answer is still sufficient. The union isn't in the title,
but the FAQ explains how to compute it.

Anno
 
Dr.Ruud

(e-mail address removed) wrote:
Dr.Ruud:

The push statement in my code could never be executed unless the
keyword was found at least once in all data elements (as per the OP's
specification) so it would not be necessary to use conditional
expressions on the push as I coded it (interestingly, I sat in on a
great presentation at OSCON that Geoffrey Young gave just a few hours
ago on Devel::Cover which discussed how that module can help identify
redundant secondary checks).

However, the AND variant might not produce correct results. The OP
stipulated that all occurrences of each search term be counted within
the data space, so $count could easily exceed keys(), and keys() might
be equal to $count even though the term was not found in all data
terms (because some terms matched more than once).

Oops yes, I overlooked the while.
 
