Converting a string to multiple search patterns

T

Tore Aursand

Hi all!

I'm stumped on this one: I have an application where I need to refine the
search mechanism. The concept is quite simple: Get a string, convert it
to separate words, count (and "score") each word for each document, and
then display the result based on the score;

my $query = 'A B C D';
my @words = split( /\s+/, $query );
foreach ( @documents ) {
# ...
}

I need to refine it, as said. I want a higher score for word sequences,
and in a particular order. For the example above ('A B C D'), I want to
match in this order:

1. A B C D
2. A B C
3. B C D
4. A B
5. C D
6. A C
7. B D
9. A D
9. A
10. B
11. C
12. D

Anyone know of a module which can accomplis this? I really haven't tried
with anything yet, 'cause I have no clue on how to do it. The closest
thing I've been, has been with the Algorithm::permute module. It doesn't
give me what I want "out of the box", though...

Any ideas?
 
A

Anno Siegel

Tore Aursand said:
Hi all!

I'm stumped on this one: I have an application where I need to refine the
search mechanism. The concept is quite simple: Get a string, convert it
to separate words, count (and "score") each word for each document, and
then display the result based on the score;

my $query = 'A B C D';
my @words = split( /\s+/, $query );
foreach ( @documents ) {
# ...
}

I need to refine it, as said. I want a higher score for word sequences,
and in a particular order. For the example above ('A B C D'), I want to
match in this order:

1. A B C D
2. A B C
3. B C D
4. A B
5. C D
6. A C
7. B D
9. A D
9. A
10. B
11. C
12. D

I'm missing "A B D", "A C D", and " B C " from the collection.
Are these entirely arbitrary?
Anyone know of a module which can accomplis this? I really haven't tried
with anything yet, 'cause I have no clue on how to do it. The closest
thing I've been, has been with the Algorithm::permute module. It doesn't
give me what I want "out of the box", though...

I'm not sure what you are asking. Is it the generation of all selections
of 1 .. 4 objects from a set of 4? These don't correspond to permutations,
but to four-digit binary numbers (so there are 2**4 - 1 = 15 of them,
not counting the empty selection). I'm sure there is a module on CPAN
to generate them, but ad-hoc solutions aren't too hard either.

Or is the issue how to assign a score to each of a collection of
regexes and retrieve the score after each match? This can be done
using the (?{}) construct to execute code at match time.

Starting from your list (@lines, say) above, I'd generate a list @score
of pairs where each pair holds a score and a string to match:

my @score = map [ split /\./], @lines;
$_->[ 1] =~ tr/ //d for @score;

The second line simplifies things by deleting all blanks from the strings
to match. Your practical regexes may look different.

Build an alternation of patterns where each pattern includes code
to set a variable ($scored) to the corresponding score:

my $rex = join '|', map "$_->[ 1](?\{ \$scored = $_->[ 0] \})", @score;

Generate a test string and check it.

my $text = join '', map qw( A B C D E)[ rand 5], 1 .. 100;

my $scored;
use re 'eval';
while ( $text =~ /($rex)/g ) {
print "score $scored: $1\n";
}

Anno
 
J

Jeff 'japhy' Pinyan

[posted & mailed]

I'm stumped on this one: I have an application where I need to refine the
search mechanism. The concept is quite simple: Get a string, convert it
to separate words, count (and "score") each word for each document, and
then display the result based on the score;

my $query = 'A B C D';
my @words = split( /\s+/, $query );
foreach ( @documents ) {
# ...
}

I need to refine it, as said. I want a higher score for word sequences,
and in a particular order. For the example above ('A B C D'), I want to
match in this order:

1. A B C D
2. A B C
3. B C D
4. A B
5. C D
6. A C
7. B D
9. A D
9. A
10. B
11. C
12. D

As Anno said, you're missing three (non-empty) strings:

A B D
A C D
B C

Now, does this mean you want to attempt matching them in that order?
Like, you want your regex to do:

/ABCD|ABC|BCD|ABD|ACD|AB|BC|CD|AC|AD|BD|A|B|C|D/

Is that what you want your regex to do?
Anyone know of a module which can accomplis this? I really haven't tried
with anything yet, 'cause I have no clue on how to do it. The closest
thing I've been, has been with the Algorithm::permute module. It doesn't
give me what I want "out of the box", though...

I think you can accomplish it by way of embedded code in your regex, but
I'd need a little more information about the aim of the regex before I
could write one.
 
T

Tore Aursand

I'm missing "A B D", "A C D", and " B C " from the collection.

Doh! You're right;

1. A B C D
2. A B C
3. B C D
4. A B D <--
5. A C D <--
6. A B
7. B C <--
8. C D
9. A C
10. B D
11. A D
12. A
13. B
14. C
15. D
I'm not sure what you are asking. Is it the generation of all selections
of 1 .. 4 objects from a set of 4?

Yes. I have a string containing 'A B C D'. I split those to an array,
and I want to generate a new list (as above), ie. and AoA;

$VAR1 = [
[ A, B, C, D ],
[ A, B, C ],
[ B, C, D ],
 
T

Tore Aursand

[...]
Now, does this mean you want to attempt matching them in that order?

Yes, As mentioned, the matching isn't the problem. It's generating the
structure/pattern which is the problem. See my reply to Anno for more
information.
 
A

Anno Siegel

Tore Aursand said:
I'm missing "A B D", "A C D", and " B C " from the collection.

Doh! You're right;

1. A B C D
2. A B C
3. B C D
4. A B D <--
5. A C D <--
6. A B
7. B C <--
8. C D
9. A C
10. B D
11. A D
12. A
13. B
14. C
15. D
I'm not sure what you are asking. Is it the generation of all selections
of 1 .. 4 objects from a set of 4?

Yes. I have a string containing 'A B C D'. I split those to an array,
and I want to generate a new list (as above), ie. and AoA;

$VAR1 = [
[ A, B, C, D ],
[ A, B, C ],
[ B, C, D ],
.
.
.
]

Once I have this structure, it's quite easy to do the scoring. But I'm
really stuck on how to generate this structure... :(

I found the other part harder.

sub selections {
my @sel = [];
for my $elem ( @_ ) {
unshift @sel, map [ $elem, @$_], @sel;
}
@sel;
}

Anno
 
B

Brad Baxter

Doh! You're right;

1. A B C D
2. A B C
3. B C D
4. A B D <--
5. A C D <--
6. A B
7. B C <--
8. C D
9. A C
10. B D
11. A D
12. A
13. B
14. C
15. D


Yes. I have a string containing 'A B C D'. I split those to an array,
and I want to generate a new list (as above), ie. and AoA;

The order you list isn't the order I'd expect from a simple generation of
all selections, so tell me if my interpretation is correct:

o More terms = higher score
o More adjacent terms = higher score
o Leftward terms = higher score than those to the right

Right?

Regards,

Brad
 
J

Jeff 'japhy' Pinyan

[posted & mailed]

[posted & mailed]

Now, does this mean you want to attempt matching them in that order?
Like, you want your regex to do:

/ABCD|ABC|BCD|ABD|ACD|AB|BC|CD|AC|AD|BD|A|B|C|D/

Is that what you want your regex to do?

I *think* this is what you want your regex to do:

my $rx = qr{
(?{ local ($s, $f) = (0, 1) })
^ \s*
(?: A (?{ $s += $f <<= 1 }) | (?{ $f = 1 }) ) \s*
(?: B (?{ $s += $f <<= 1 }) | (?{ $f = 1 }) ) \s*
(?: C (?{ $s += $f <<= 1 }) | (?{ $f = 1 }) ) \s*
(?: D (?{ $s += $f <<= 1 }) | (?{ $f = 1 }) ) \s*
$
(?{ $s })
}x;

You don't need to worry about the set logic now -- the regex engine will
take care of that.

That regex will match "ABCD", "ABC", "BCD" (and " BCD"), etc.
Specifically, it will match the 15 non-empty strings listed, with optional
whitespace throughout. Modification of HOW it matches is simple. What's
important to see is the scoring algorithm.

Score ($s) starts out at 0, and the factor ($f) starts out at 1. Every
time we hit a consecutive keyword, we left-shift the factor by 1 (that is,
we multiply it by 2), and add it to the score. Every time the next
keyword is missing, we reset the factor to 1. This means that the string
"ABCD" has the maximum score of 30 (2 + 4 + 8 + 16), and single-character
strings like "B" have the minimum non-zero score of 2.

To *use* this, we would do something like:

# this could be made more efficient via a Guttman-Rosler
# Transform, but that's not the point here

@sorted_by_score =
map { $_->[1] }
sort { $b->[0] <=> $a->[0] }
map { /$rx/ ? [ $^R, $_ ] : () }
@data;

The $^R variable contains the return value of the most recent (?{ ... })
in a regex -- for us, this is the one that merely holds the value of $s.
We then sort the data by its score, and extract the data.

If this is totally off-base, I apologize. On the other hand, if this is
exactly what you're looking for, then the process of constructing the
regex is simple:

use re 'eval';
my @kw = qw( A B C D );
my $rx = qr{
(?{ local ($s, $f) = (0, 1) })
^ \s*
@{[ map "(?:\Q$_\E(?{ \$s += \$f <<= 1 })|(?{ \$f = 1 }) )\\s*", @kw ]}
$
(?{ $s })
}x;

That does it.
 
T

Tore Aursand

[...]
o More terms = higher score
o More adjacent terms = higher score
o Leftward terms = higher score than those to the right

That's correct. Am I missing something? Is my list wrong? Please
correct me or make any suggestions.
 
T

Tore Aursand

sub selections {
my @sel = [];
for my $elem ( @_ ) {
unshift @sel, map [ $elem, @$_], @sel;
}
@sel;
}

This one almost does the job, but it doesn't output what I want; The
elements are mostly reversed, so reverse()'ing @_ makes it a bit better.

Still, 'A B D' comes before 'B C D'. Take a look at my previous reply to
you:

1. A B C D
2. A B C
3. B C D
4. A B D
5. A C D
6. A B
7. B C
8. C D
9. A C
10. B D
11. A D
12. A
13. B
14. C
15. D

Try this simple script using your subroutine (rewritten, 'cause
"sometimes" Pan won't let me paste thing I mark in xterm):

my @array = selections( qw(A B C D) );
for ( 0..$#array ) {
print "$_. " . join(' ', @{$array[$_]}) . "\n";
}

But you're close, Anno! :)
 
P

Peter Hickman

Tore said:
[...]
Now, does this mean you want to attempt matching them in that order?


Yes, As mentioned, the matching isn't the problem. It's generating the
structure/pattern which is the problem. See my reply to Anno for more
information.

Is this what you are after?

[peter@wasabi peter]$ cat xx
my @letters = qw/A B C D/;

foreach my $x ((0..15)) {
my %z;
@z{@letters} = split('', sprintf("%04b", $x));

printf "%2d: ", $x;
foreach my $letter (@letters) {
printf "%s ", (($z{$letter} == 1) ? $letter : ' ');
}
print "\n";
}

[peter@wasabi peter]$ perl xx
0:
1: D
2: C
3: C D
4: B
5: B D
6: B C
7: B C D
8: A
9: A D
10: A C
11: A C D
12: A B
13: A B D
14: A B C
15: A B C D
[peter@wasabi peter]$

More caffine than time ;-)
 
T

Tore Aursand

Is this what you are after?
[...]

No. It still doesn't create the pattern excactly how I want it. Maybe
there's no logic in my pattern, but... :)

Take a look at one of my previous posts (a reply to Anno) where I outline
the pattern the way I want it. Example: 'ABC' and 'BCD' should follow
right after 'ABCD'.


--
Tore Aursand <[email protected]>
"Scientists are complaining that the new "Dinosaur" movie shows
dinosaurs with lemurs, who didn't evolve for another million years.
They're afraid the movie will give kids a mistaken impression. What
about the fact that the dinosaurs are singing and dancing?" (Jay Leno)
 
A

Anno Siegel

Tore Aursand said:
sub selections {
my @sel = [];
for my $elem ( @_ ) {
unshift @sel, map [ $elem, @$_], @sel;
}
@sel;
}

This one almost does the job, but it doesn't output what I want; The
elements are mostly reversed, so reverse()'ing @_ makes it a bit better.

Still, 'A B D' comes before 'B C D'. Take a look at my previous reply to
you:

1. A B C D
2. A B C
3. B C D
4. A B D
5. A C D
6. A B
7. B C
8. C D
9. A C
10. B D
11. A D
12. A
13. B
14. C
15. D

Try this simple script using your subroutine (rewritten, 'cause
"sometimes" Pan won't let me paste thing I mark in xterm):

my @array = selections( qw(A B C D) );
for ( 0..$#array ) {
print "$_. " . join(' ', @{$array[$_]}) . "\n";
}

But you're close, Anno! :)

I don't think I am. Unless I'm missing something obvious, there is
no simple logic that generates exactly the sequence you want. Quoting
the exchange with Brad Baxter
o More terms = higher score
o More adjacent terms = higher score
o Leftward terms = higher score than those to the right

the resulting sequence looks rather arbitrary, best realized as a three-
level sort that follows the description.

Even generating all n-element selections before the (n+1)-element ones
isn't quite trivial, let alone generating those with many adjacent
elements before those with fewer, etc.

So, unless you can reveal a simple logic behind your sequence, my
recipe is, generate them in any sequence, then sort them into the
required order.

The sorting step could be realized as a hash slice through pre-sorted
sequences of integers. Each sequence is a permutation of 0 .. 2**(n-1),
where n is the total number of terms, four in the example. I think
Japhy was doing something similar at one stage of his Perl artistics
in the other subthread.

Anno
 
J

Jeff 'japhy' Pinyan

The sorting step could be realized as a hash slice through pre-sorted
sequences of integers. Each sequence is a permutation of 0 .. 2**(n-1),
where n is the total number of terms, four in the example. I think
Japhy was doing something similar at one stage of his Perl artistics
in the other subthread.

The concept of my solution is to match them arbitrarily, but assign each
match a score, and sort based on the scores. I think that's the most
efficient way (even if my regex is not efficient because of its code
evaluation assertions).
 
D

David K. Wall

Anno Siegel said:
1. A B C D
2. A B C
3. B C D
4. A B D
5. A C D
6. A B
7. B C
8. C D
9. A C
10. B D
11. A D
12. A
13. B
14. C
15. D
[snip]
But you're close, Anno! :)

I don't think I am. Unless I'm missing something obvious, there
is no simple logic that generates exactly the sequence you want.


# building on previously posted code
use strict;
use warnings;

my @array = 'A' .. 'D';

my @subsets = subsets( 0 .. $#array );

my @sequences;
for my $subset ( @subsets ) {
print join(' ', @array[ @$subset ]), "\n" ;
}

sub subsets {
my @sel = [];
for my $elem ( @_ ) {
push @sel, map [ @$_, $elem], @sel unless @sel == 0;
}
return sort {
@$b <=> @$a
||
($a->[-1] - $a->[0]) <=> ($b->[-1] - $b->[0])
||
join('', @$a) cmp join('', @$b)
} @sel;
}
 
D

David K. Wall

I said:
# building on previously posted code
use strict;
use warnings;

my @array = 'A' .. 'D';

my @subsets = subsets( 0 .. $#array );

my @sequences;
for my $subset ( @subsets ) {
print join(' ', @array[ @$subset ]), "\n" ;
}

sub subsets {
my @sel = [];
for my $elem ( @_ ) {
push @sel, map [ @$_, $elem], @sel unless @sel == 0;
}
return sort {
@$b <=> @$a
||
($a->[-1] - $a->[0]) <=> ($b->[-1] - $b->[0])
||
join('', @$a) cmp join('', @$b)

I should point out that this will break if $#array > 9.
 
B

Brad Baxter

Anno Siegel said:
1. A B C D
2. A B C
3. B C D
4. A B D
5. A C D
6. A B
7. B C
8. C D
9. A C
10. B D
11. A D
12. A
13. B
14. C
15. D
[snip]
But you're close, Anno! :)

I don't think I am. Unless I'm missing something obvious, there
is no simple logic that generates exactly the sequence you want.


# building on previously posted code
use strict;
use warnings;

my @array = 'A' .. 'D';

my @subsets = subsets( 0 .. $#array );

my @sequences;
for my $subset ( @subsets ) {
print join(' ', @array[ @$subset ]), "\n" ;
}

sub subsets {
my @sel = [];
for my $elem ( @_ ) {
push @sel, map [ @$_, $elem], @sel unless @sel == 0;
}
return sort {
@$b <=> @$a
||
($a->[-1] - $a->[0]) <=> ($b->[-1] - $b->[0])
||
join('', @$a) cmp join('', @$b)
} @sel;
}

I think I disagree with this. While it agrees with the OP's stated specs,
I'm not sure it agrees with the spirit of the specs. :) Of course, my
interpretation may simply be wrong.

Where I disagree first is with the stated specs:

I think 11. A D should come before 10. B D, because all else being equal,
A comes before B. In addition, when your code expands the terms 'A'..'E',
you get:

A B C D E
A B C D
B C D E
A B C E
A B D E
A C D E
A B C
B C D
C D E
....

While 'A B D E' has more terms, I think 'A B C', 'B C D', and 'C D E'
should outrank it, because they have more adjacent terms in a row.

So, while I think your code is MUCH prettier, below is my take on this
problem. The scoring is bizarre--I just want to weight the right things
while eliminating dupes. Seems to work, but I can't give you a
mathematical proof, so it's probably flawed. :)


use strict;
use warnings;

my @sets = subsets( 'A' .. 'D' );

print "@$_\n" for @sets;

sub subsets {
my @words = @_;
my $n = @words;
my @sets;
my @scored;
my %seen;

# create "binary" sets, '1's represent words present
push @sets, sprintf "%0${n}b", $_ for 0 .. 2**$n-1;

# $x is for unique sort keys ("scores")
my $x = $n - 1;

for my $si ( 0 .. $#sets ) { # need $si in score
my $set = $sets[ $si ];

# split into groups of adjacent terms
my @groups = split( /0/, $set);

my $score = 0;
for my $gi ( 0 .. $#groups ) { # need $gi in score
my $group = $groups[ $gi ];
next unless $group;

# sets are scored by length, number,
# and lefthandedness of their groups
my $len = length $group;
$score += $x**(2*$len) + ($x-$gi)*$len + $si;
}

# convert "binary" sets to sets of words
my @wordset;
my $i = 0;
for( split //, $set) {
my $word = $words[$i++];
$_ && push @wordset, $word;
}
push @scored, [$score, [@wordset]];

$seen{$score}++; # to prove our scores are unique
}

for ( sort keys %seen ) { die "Dupe: $_" if $seen{$_}>1 }

map { $_->[1] }
sort { $b->[0] <=> $a->[0] }
@scored;

}

__END__


Regards,

Brad
 
J

Jeff 'japhy' Pinyan

I think 11. A D should come before 10. B D, because all else being equal,
A comes before B. In addition, when your code expands the terms 'A'..'E',
you get:

A B C D E
A B C D
B C D E
A B C E
A B D E
A C D E
A B C
B C D
C D E

While 'A B D E' has more terms, I think 'A B C', 'B C D', and 'C D E'
should outrank it, because they have more adjacent terms in a row.

Well, my regex solution agrees with you, and while the insides of the
regex are a bit ugly to look at, the algorithm is far simpler than it
seems.

#!/usr/bin/perl -l

my $rx;

{
use re 'eval';
my @kw = qw( A B C D E );
$rx = qr{
(?{ local ($s, $f) = (0, 1) })
^ \s*
@{[map qq{ (?:
\Q$_\E (?{ \$s += \$f <<= 1 }) |
(?{ \$f = 1 })
) \\s*
}, @kw ]}
$
(?{ $s })
}x;
}

while (<{A,}{B,}{C,}{D,}{E,}>) {
chomp;
print "$^R\t$_" if /$rx/ and $^R;
}

Run that code through '| sort -n', and you'll get:

2 A
2 B
2 C
2 D
2 E
4 AC
4 AD
4 AE
4 BD
4 BE
4 CE
6 AB
6 ACE
6 BC
6 CD
6 DE
8 ABD
8 ABE
8 ACD
8 ADE
8 BCE
8 BDE
12 ABDE
14 ABC
14 BCD
14 CDE
16 ABCE
16 ACDE
30 ABCD
30 BCDE
62 ABCDE

which I think is consistent with the rules.
 
D

David K. Wall

Brad Baxter said:
On Tue, 8 Jun 2004, David K. Wall wrote:
[snip code]

(BTW, I don't like the way my posted code (mis)handles the empty set,
but never mind)
I think I disagree with this. While it agrees with the OP's
stated specs, I'm not sure it agrees with the spirit of the specs.
:)
Heh.

Of course, my interpretation may simply be wrong.

Or mine. It's Tore's problem, let him worry about it. :)

Where I disagree first is with the stated specs:


I think 11. A D should come before 10. B D, because all else being
equal, A comes before B.

But all else isn't equal. The "distance" between A and D is greater
than the "distance" between B and D. I'm not sure how to express
this clearly other than in code, but the way I understood Tore was
this: the combinations are grouped

first by the number of terms/elements in a combination,
then by the "range" of the combination,
then by the order of the original set.

-- but maybe I was reading too much into the choice of 'A'..'D' for
the example?
In addition, when your code expands the
terms 'A'..'E', you get:

A B C D E
A B C D
B C D E
A B C E
A B D E
A C D E
A B C
B C D
C D E
...

While 'A B D E' has more terms, I think 'A B C', 'B C D', and 'C D
E' should outrank it, because they have more adjacent terms in a
row.

That's a good point -- I certainly won't argue against it. What Would
Google Do? :)
 
T

Tore Aursand

It appears as though you are sorting on "edit distance", which is a
well-defined term, and even has a module, String::Approx, to compute
it.
[...]

Hmm. I've already had a look at String::Approx, but I don't think it will
help me solve this.

Remember that 'A B C D' really are four _words_;

my $query = 'A B C D'; # What the user wants to search for
my @words = split( /\s+/, $query );

Is there really no module which lets you do something like the following?

my @words = ( whatever );
foreach ( @Documents ) {
my $score = search( $_->text(), \@words );
}

Or something? I want it! :)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,581
Members
45,057
Latest member
KetoBeezACVGummies

Latest Threads

Top