Converting a string to multiple search patterns

Discussion in 'Perl Misc' started by Tore Aursand, Jun 8, 2004.

  1. Tore Aursand

    Tore Aursand Guest

    Hi all!

    I'm stumped on this one: I have an application where I need to refine the
    search mechanism. The concept is quite simple: Get a string, convert it
    to separate words, count (and "score") each word for each document, and
    then display the result based on the score;

    my $query = 'A B C D';
    my @words = split( /\s+/, $query );
    foreach ( @documents ) {
    # ...
    }

    I need to refine it, as said. I want a higher score for word sequences,
    and in a particular order. For the example above ('A B C D'), I want to
    match in this order:

    1. A B C D
    2. A B C
    3. B C D
    4. A B
    5. C D
    6. A C
    7. B D
    9. A D
    9. A
    10. B
    11. C
    12. D

    Anyone know of a module which can accomplis this? I really haven't tried
    with anything yet, 'cause I have no clue on how to do it. The closest
    thing I've been, has been with the Algorithm::permute module. It doesn't
    give me what I want "out of the box", though...

    Any ideas?


    --
    Tore Aursand <>
    "Progress is made by lazy men looking for easier ways to do things."
    (Robert Heinlein)
     
    Tore Aursand, Jun 8, 2004
    #1
    1. Advertising

  2. Tore Aursand

    Anno Siegel Guest

    Tore Aursand <> wrote in comp.lang.perl.misc:
    > Hi all!
    >
    > I'm stumped on this one: I have an application where I need to refine the
    > search mechanism. The concept is quite simple: Get a string, convert it
    > to separate words, count (and "score") each word for each document, and
    > then display the result based on the score;
    >
    > my $query = 'A B C D';
    > my @words = split( /\s+/, $query );
    > foreach ( @documents ) {
    > # ...
    > }
    >
    > I need to refine it, as said. I want a higher score for word sequences,
    > and in a particular order. For the example above ('A B C D'), I want to
    > match in this order:
    >
    > 1. A B C D
    > 2. A B C
    > 3. B C D
    > 4. A B
    > 5. C D
    > 6. A C
    > 7. B D
    > 9. A D
    > 9. A
    > 10. B
    > 11. C
    > 12. D


    I'm missing "A B D", "A C D", and " B C " from the collection.
    Are these entirely arbitrary?

    > Anyone know of a module which can accomplis this? I really haven't tried
    > with anything yet, 'cause I have no clue on how to do it. The closest
    > thing I've been, has been with the Algorithm::permute module. It doesn't
    > give me what I want "out of the box", though...


    I'm not sure what you are asking. Is it the generation of all selections
    of 1 .. 4 objects from a set of 4? These don't correspond to permutations,
    but to four-digit binary numbers (so there are 2**4 - 1 = 15 of them,
    not counting the empty selection). I'm sure there is a module on CPAN
    to generate them, but ad-hoc solutions aren't too hard either.

    Or is the issue how to assign a score to each of a collection of
    regexes and retrieve the score after each match? This can be done
    using the (?{}) construct to execute code at match time.

    Starting from your list (@lines, say) above, I'd generate a list @score
    of pairs where each pair holds a score and a string to match:

    my @score = map [ split /\./], @lines;
    $_->[ 1] =~ tr/ //d for @score;

    The second line simplifies things by deleting all blanks from the strings
    to match. Your practical regexes may look different.

    Build an alternation of patterns where each pattern includes code
    to set a variable ($scored) to the corresponding score:

    my $rex = join '|', map "$_->[ 1](?\{ \$scored = $_->[ 0] \})", @score;

    Generate a test string and check it.

    my $text = join '', map qw( A B C D E)[ rand 5], 1 .. 100;

    my $scored;
    use re 'eval';
    while ( $text =~ /($rex)/g ) {
    print "score $scored: $1\n";
    }

    Anno
     
    Anno Siegel, Jun 8, 2004
    #2
    1. Advertising

  3. [posted & mailed]

    On Tue, 8 Jun 2004, Tore Aursand wrote:

    >I'm stumped on this one: I have an application where I need to refine the
    >search mechanism. The concept is quite simple: Get a string, convert it
    >to separate words, count (and "score") each word for each document, and
    >then display the result based on the score;
    >
    > my $query = 'A B C D';
    > my @words = split( /\s+/, $query );
    > foreach ( @documents ) {
    > # ...
    > }
    >
    >I need to refine it, as said. I want a higher score for word sequences,
    >and in a particular order. For the example above ('A B C D'), I want to
    >match in this order:
    >
    > 1. A B C D
    > 2. A B C
    > 3. B C D
    > 4. A B
    > 5. C D
    > 6. A C
    > 7. B D
    > 9. A D
    > 9. A
    > 10. B
    > 11. C
    > 12. D


    As Anno said, you're missing three (non-empty) strings:

    A B D
    A C D
    B C

    Now, does this mean you want to attempt matching them in that order?
    Like, you want your regex to do:

    /ABCD|ABC|BCD|ABD|ACD|AB|BC|CD|AC|AD|BD|A|B|C|D/

    Is that what you want your regex to do?

    >Anyone know of a module which can accomplis this? I really haven't tried
    >with anything yet, 'cause I have no clue on how to do it. The closest
    >thing I've been, has been with the Algorithm::permute module. It doesn't
    >give me what I want "out of the box", though...


    I think you can accomplish it by way of embedded code in your regex, but
    I'd need a little more information about the aim of the regex before I
    could write one.

    --
    Jeff Pinyan RPI Acacia Brother #734 RPI Acacia Corp Secretary
    "And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
    years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
    Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)
     
    Jeff 'japhy' Pinyan, Jun 8, 2004
    #3
  4. Tore Aursand

    Tore Aursand Guest

    On Tue, 08 Jun 2004 11:53:53 +0000, Anno Siegel wrote:
    >> 1. A B C D
    >> 2. A B C
    >> 3. B C D
    >> 4. A B
    >> 5. C D
    >> 6. A C
    >> 7. B D
    >> 9. A D
    >> 9. A
    >> 10. B
    >> 11. C
    >> 12. D


    > I'm missing "A B D", "A C D", and " B C " from the collection.


    Doh! You're right;

    1. A B C D
    2. A B C
    3. B C D
    4. A B D <--
    5. A C D <--
    6. A B
    7. B C <--
    8. C D
    9. A C
    10. B D
    11. A D
    12. A
    13. B
    14. C
    15. D

    > I'm not sure what you are asking. Is it the generation of all selections
    > of 1 .. 4 objects from a set of 4?


    Yes. I have a string containing 'A B C D'. I split those to an array,
    and I want to generate a new list (as above), ie. and AoA;

    $VAR1 = [
    [ A, B, C, D ],
    [ A, B, C ],
    [ B, C, D ],
     
    Tore Aursand, Jun 8, 2004
    #4
  5. Tore Aursand

    Tore Aursand Guest

    On Tue, 08 Jun 2004 09:31:39 -0400, Jeff 'japhy' Pinyan wrote:
    > [...]
    > Now, does this mean you want to attempt matching them in that order?


    Yes, As mentioned, the matching isn't the problem. It's generating the
    structure/pattern which is the problem. See my reply to Anno for more
    information.


    --
    Tore Aursand <>
    "First get your facts; then you can distort them at your leisure."
    (Mark Twain)
     
    Tore Aursand, Jun 8, 2004
    #5
  6. Tore Aursand

    Anno Siegel Guest

    Tore Aursand <> wrote in comp.lang.perl.misc:
    > On Tue, 08 Jun 2004 11:53:53 +0000, Anno Siegel wrote:
    > >> 1. A B C D
    > >> 2. A B C
    > >> 3. B C D
    > >> 4. A B
    > >> 5. C D
    > >> 6. A C
    > >> 7. B D
    > >> 9. A D
    > >> 9. A
    > >> 10. B
    > >> 11. C
    > >> 12. D

    >
    > > I'm missing "A B D", "A C D", and " B C " from the collection.

    >
    > Doh! You're right;
    >
    > 1. A B C D
    > 2. A B C
    > 3. B C D
    > 4. A B D <--
    > 5. A C D <--
    > 6. A B
    > 7. B C <--
    > 8. C D
    > 9. A C
    > 10. B D
    > 11. A D
    > 12. A
    > 13. B
    > 14. C
    > 15. D
    >
    > > I'm not sure what you are asking. Is it the generation of all selections
    > > of 1 .. 4 objects from a set of 4?

    >
    > Yes. I have a string containing 'A B C D'. I split those to an array,
    > and I want to generate a new list (as above), ie. and AoA;
    >
    > $VAR1 = [
    > [ A, B, C, D ],
    > [ A, B, C ],
    > [ B, C, D ],
    > .
    > .
    > .
    > ]
    >
    > Once I have this structure, it's quite easy to do the scoring. But I'm
    > really stuck on how to generate this structure... :(


    I found the other part harder.

    sub selections {
    my @sel = [];
    for my $elem ( @_ ) {
    unshift @sel, map [ $elem, @$_], @sel;
    }
    @sel;
    }

    Anno
     
    Anno Siegel, Jun 8, 2004
    #6
  7. Tore Aursand

    Brad Baxter Guest

    On Tue, 8 Jun 2004, Tore Aursand wrote:
    > Doh! You're right;
    >
    > 1. A B C D
    > 2. A B C
    > 3. B C D
    > 4. A B D <--
    > 5. A C D <--
    > 6. A B
    > 7. B C <--
    > 8. C D
    > 9. A C
    > 10. B D
    > 11. A D
    > 12. A
    > 13. B
    > 14. C
    > 15. D
    >
    > > I'm not sure what you are asking. Is it the generation of all selections
    > > of 1 .. 4 objects from a set of 4?

    >
    > Yes. I have a string containing 'A B C D'. I split those to an array,
    > and I want to generate a new list (as above), ie. and AoA;


    The order you list isn't the order I'd expect from a simple generation of
    all selections, so tell me if my interpretation is correct:

    o More terms = higher score
    o More adjacent terms = higher score
    o Leftward terms = higher score than those to the right

    Right?

    Regards,

    Brad
     
    Brad Baxter, Jun 8, 2004
    #7
  8. [posted & mailed]

    On Tue, 8 Jun 2004, Jeff 'japhy' Pinyan wrote:

    >[posted & mailed]
    >
    >Now, does this mean you want to attempt matching them in that order?
    >Like, you want your regex to do:
    >
    > /ABCD|ABC|BCD|ABD|ACD|AB|BC|CD|AC|AD|BD|A|B|C|D/
    >
    >Is that what you want your regex to do?


    I *think* this is what you want your regex to do:

    my $rx = qr{
    (?{ local ($s, $f) = (0, 1) })
    ^ \s*
    (?: A (?{ $s += $f <<= 1 }) | (?{ $f = 1 }) ) \s*
    (?: B (?{ $s += $f <<= 1 }) | (?{ $f = 1 }) ) \s*
    (?: C (?{ $s += $f <<= 1 }) | (?{ $f = 1 }) ) \s*
    (?: D (?{ $s += $f <<= 1 }) | (?{ $f = 1 }) ) \s*
    $
    (?{ $s })
    }x;

    You don't need to worry about the set logic now -- the regex engine will
    take care of that.

    That regex will match "ABCD", "ABC", "BCD" (and " BCD"), etc.
    Specifically, it will match the 15 non-empty strings listed, with optional
    whitespace throughout. Modification of HOW it matches is simple. What's
    important to see is the scoring algorithm.

    Score ($s) starts out at 0, and the factor ($f) starts out at 1. Every
    time we hit a consecutive keyword, we left-shift the factor by 1 (that is,
    we multiply it by 2), and add it to the score. Every time the next
    keyword is missing, we reset the factor to 1. This means that the string
    "ABCD" has the maximum score of 30 (2 + 4 + 8 + 16), and single-character
    strings like "B" have the minimum non-zero score of 2.

    To *use* this, we would do something like:

    # this could be made more efficient via a Guttman-Rosler
    # Transform, but that's not the point here

    @sorted_by_score =
    map { $_->[1] }
    sort { $b->[0] <=> $a->[0] }
    map { /$rx/ ? [ $^R, $_ ] : () }
    @data;

    The $^R variable contains the return value of the most recent (?{ ... })
    in a regex -- for us, this is the one that merely holds the value of $s.
    We then sort the data by its score, and extract the data.

    If this is totally off-base, I apologize. On the other hand, if this is
    exactly what you're looking for, then the process of constructing the
    regex is simple:

    use re 'eval';
    my @kw = qw( A B C D );
    my $rx = qr{
    (?{ local ($s, $f) = (0, 1) })
    ^ \s*
    @{[ map "(?:\Q$_\E(?{ \$s += \$f <<= 1 })|(?{ \$f = 1 }) )\\s*", @kw ]}
    $
    (?{ $s })
    }x;

    That does it.

    --
    Jeff Pinyan RPI Acacia Brother #734 RPI Acacia Corp Secretary
    "And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
    years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
    Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)
     
    Jeff 'japhy' Pinyan, Jun 8, 2004
    #8
  9. Tore Aursand

    Tore Aursand Guest

    On Tue, 08 Jun 2004 11:02:29 -0400, Brad Baxter wrote:
    > [...]
    > o More terms = higher score
    > o More adjacent terms = higher score
    > o Leftward terms = higher score than those to the right


    That's correct. Am I missing something? Is my list wrong? Please
    correct me or make any suggestions.


    --
    Tore Aursand <>
    "What we anticipate seldom occurs. What we least expected generally
    happens." (Benjamin Disraeli)
     
    Tore Aursand, Jun 8, 2004
    #9
  10. Tore Aursand

    Tore Aursand Guest

    On Tue, 08 Jun 2004 14:52:41 +0000, Anno Siegel wrote:
    > sub selections {
    > my @sel = [];
    > for my $elem ( @_ ) {
    > unshift @sel, map [ $elem, @$_], @sel;
    > }
    > @sel;
    > }


    This one almost does the job, but it doesn't output what I want; The
    elements are mostly reversed, so reverse()'ing @_ makes it a bit better.

    Still, 'A B D' comes before 'B C D'. Take a look at my previous reply to
    you:

    1. A B C D
    2. A B C
    3. B C D
    4. A B D
    5. A C D
    6. A B
    7. B C
    8. C D
    9. A C
    10. B D
    11. A D
    12. A
    13. B
    14. C
    15. D

    Try this simple script using your subroutine (rewritten, 'cause
    "sometimes" Pan won't let me paste thing I mark in xterm):

    my @array = selections( qw(A B C D) );
    for ( 0..$#array ) {
    print "$_. " . join(' ', @{$array[$_]}) . "\n";
    }

    But you're close, Anno! :)


    --
    Tore Aursand <>
    "What we do is never understood, but only praised and blamed."
    (Friedrich Nietzsche)
     
    Tore Aursand, Jun 8, 2004
    #10
  11. Tore Aursand wrote:

    > On Tue, 08 Jun 2004 09:31:39 -0400, Jeff 'japhy' Pinyan wrote:
    >
    >>[...]
    >>Now, does this mean you want to attempt matching them in that order?

    >
    >
    > Yes, As mentioned, the matching isn't the problem. It's generating the
    > structure/pattern which is the problem. See my reply to Anno for more
    > information.
    >
    >


    Is this what you are after?

    [peter@wasabi peter]$ cat xx
    my @letters = qw/A B C D/;

    foreach my $x ((0..15)) {
    my %z;
    @z{@letters} = split('', sprintf("%04b", $x));

    printf "%2d: ", $x;
    foreach my $letter (@letters) {
    printf "%s ", (($z{$letter} == 1) ? $letter : ' ');
    }
    print "\n";
    }

    [peter@wasabi peter]$ perl xx
    0:
    1: D
    2: C
    3: C D
    4: B
    5: B D
    6: B C
    7: B C D
    8: A
    9: A D
    10: A C
    11: A C D
    12: A B
    13: A B D
    14: A B C
    15: A B C D
    [peter@wasabi peter]$

    More caffine than time ;-)
     
    Peter Hickman, Jun 8, 2004
    #11
  12. Tore Aursand

    Tore Aursand Guest

    On Tue, 08 Jun 2004 17:30:52 +0100, Peter Hickman wrote:
    > Is this what you are after?
    > [...]


    No. It still doesn't create the pattern excactly how I want it. Maybe
    there's no logic in my pattern, but... :)

    Take a look at one of my previous posts (a reply to Anno) where I outline
    the pattern the way I want it. Example: 'ABC' and 'BCD' should follow
    right after 'ABCD'.


    --
    Tore Aursand <>
    "Scientists are complaining that the new "Dinosaur" movie shows
    dinosaurs with lemurs, who didn't evolve for another million years.
    They're afraid the movie will give kids a mistaken impression. What
    about the fact that the dinosaurs are singing and dancing?" (Jay Leno)
     
    Tore Aursand, Jun 8, 2004
    #12
  13. Tore Aursand

    Anno Siegel Guest

    Tore Aursand <> wrote in comp.lang.perl.misc:
    > On Tue, 08 Jun 2004 14:52:41 +0000, Anno Siegel wrote:
    > > sub selections {
    > > my @sel = [];
    > > for my $elem ( @_ ) {
    > > unshift @sel, map [ $elem, @$_], @sel;
    > > }
    > > @sel;
    > > }

    >
    > This one almost does the job, but it doesn't output what I want; The
    > elements are mostly reversed, so reverse()'ing @_ makes it a bit better.
    >
    > Still, 'A B D' comes before 'B C D'. Take a look at my previous reply to
    > you:
    >
    > 1. A B C D
    > 2. A B C
    > 3. B C D
    > 4. A B D
    > 5. A C D
    > 6. A B
    > 7. B C
    > 8. C D
    > 9. A C
    > 10. B D
    > 11. A D
    > 12. A
    > 13. B
    > 14. C
    > 15. D
    >
    > Try this simple script using your subroutine (rewritten, 'cause
    > "sometimes" Pan won't let me paste thing I mark in xterm):
    >
    > my @array = selections( qw(A B C D) );
    > for ( 0..$#array ) {
    > print "$_. " . join(' ', @{$array[$_]}) . "\n";
    > }
    >
    > But you're close, Anno! :)


    I don't think I am. Unless I'm missing something obvious, there is
    no simple logic that generates exactly the sequence you want. Quoting
    the exchange with Brad Baxter

    > o More terms = higher score
    > o More adjacent terms = higher score
    > o Leftward terms = higher score than those to the right


    the resulting sequence looks rather arbitrary, best realized as a three-
    level sort that follows the description.

    Even generating all n-element selections before the (n+1)-element ones
    isn't quite trivial, let alone generating those with many adjacent
    elements before those with fewer, etc.

    So, unless you can reveal a simple logic behind your sequence, my
    recipe is, generate them in any sequence, then sort them into the
    required order.

    The sorting step could be realized as a hash slice through pre-sorted
    sequences of integers. Each sequence is a permutation of 0 .. 2**(n-1),
    where n is the total number of terms, four in the example. I think
    Japhy was doing something similar at one stage of his Perl artistics
    in the other subthread.

    Anno
     
    Anno Siegel, Jun 8, 2004
    #13
  14. On 8 Jun 2004, Anno Siegel wrote:

    >Tore Aursand <> wrote in comp.lang.perl.misc:
    >
    >> 1. A B C D
    >> 2. A B C
    >> 3. B C D
    >> 4. A B D
    >> 5. A C D
    >> 6. A B
    >> 7. B C
    >> 8. C D
    >> 9. A C
    >> 10. B D
    >> 11. A D
    >> 12. A
    >> 13. B
    >> 14. C
    >> 15. D

    >
    >The sorting step could be realized as a hash slice through pre-sorted
    >sequences of integers. Each sequence is a permutation of 0 .. 2**(n-1),
    >where n is the total number of terms, four in the example. I think
    >Japhy was doing something similar at one stage of his Perl artistics
    >in the other subthread.


    The concept of my solution is to match them arbitrarily, but assign each
    match a score, and sort based on the scores. I think that's the most
    efficient way (even if my regex is not efficient because of its code
    evaluation assertions).

    --
    Jeff Pinyan RPI Acacia Brother #734 RPI Acacia Corp Secretary
    "And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
    years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
    Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)
     
    Jeff 'japhy' Pinyan, Jun 8, 2004
    #14
  15. Anno Siegel <-berlin.de> wrote:

    > Tore Aursand <> wrote in comp.lang.perl.misc:

    [snip]

    >> 1. A B C D
    >> 2. A B C
    >> 3. B C D
    >> 4. A B D
    >> 5. A C D
    >> 6. A B
    >> 7. B C
    >> 8. C D
    >> 9. A C
    >> 10. B D
    >> 11. A D
    >> 12. A
    >> 13. B
    >> 14. C
    >> 15. D


    [snip]

    >> But you're close, Anno! :)

    >
    > I don't think I am. Unless I'm missing something obvious, there
    > is no simple logic that generates exactly the sequence you want.



    # building on previously posted code
    use strict;
    use warnings;

    my @array = 'A' .. 'D';

    my @subsets = subsets( 0 .. $#array );

    my @sequences;
    for my $subset ( @subsets ) {
    print join(' ', @array[ @$subset ]), "\n" ;
    }

    sub subsets {
    my @sel = [];
    for my $elem ( @_ ) {
    push @sel, map [ @$_, $elem], @sel unless @sel == 0;
    }
    return sort {
    @$b <=> @$a
    ||
    ($a->[-1] - $a->[0]) <=> ($b->[-1] - $b->[0])
    ||
    join('', @$a) cmp join('', @$b)
    } @sel;
    }
     
    David K. Wall, Jun 8, 2004
    #15
  16. I wrote:

    > # building on previously posted code
    > use strict;
    > use warnings;
    >
    > my @array = 'A' .. 'D';
    >
    > my @subsets = subsets( 0 .. $#array );
    >
    > my @sequences;
    > for my $subset ( @subsets ) {
    > print join(' ', @array[ @$subset ]), "\n" ;
    >}
    >
    > sub subsets {
    > my @sel = [];
    > for my $elem ( @_ ) {
    > push @sel, map [ @$_, $elem], @sel unless @sel == 0;
    > }
    > return sort {
    > @$b <=> @$a
    > ||
    > ($a->[-1] - $a->[0]) <=> ($b->[-1] - $b->[0])
    > ||
    > join('', @$a) cmp join('', @$b)


    I should point out that this will break if $#array > 9.

    > } @sel;
    >}
     
    David K. Wall, Jun 8, 2004
    #16
  17. Tore Aursand

    Brad Baxter Guest

    On Tue, 8 Jun 2004, David K. Wall wrote:

    > Anno Siegel <-berlin.de> wrote:
    >
    > > Tore Aursand <> wrote in comp.lang.perl.misc:

    > [snip]
    >
    > >> 1. A B C D
    > >> 2. A B C
    > >> 3. B C D
    > >> 4. A B D
    > >> 5. A C D
    > >> 6. A B
    > >> 7. B C
    > >> 8. C D
    > >> 9. A C
    > >> 10. B D
    > >> 11. A D
    > >> 12. A
    > >> 13. B
    > >> 14. C
    > >> 15. D

    >
    > [snip]
    >
    > >> But you're close, Anno! :)

    > >
    > > I don't think I am. Unless I'm missing something obvious, there
    > > is no simple logic that generates exactly the sequence you want.

    >
    >
    > # building on previously posted code
    > use strict;
    > use warnings;
    >
    > my @array = 'A' .. 'D';
    >
    > my @subsets = subsets( 0 .. $#array );
    >
    > my @sequences;
    > for my $subset ( @subsets ) {
    > print join(' ', @array[ @$subset ]), "\n" ;
    > }
    >
    > sub subsets {
    > my @sel = [];
    > for my $elem ( @_ ) {
    > push @sel, map [ @$_, $elem], @sel unless @sel == 0;
    > }
    > return sort {
    > @$b <=> @$a
    > ||
    > ($a->[-1] - $a->[0]) <=> ($b->[-1] - $b->[0])
    > ||
    > join('', @$a) cmp join('', @$b)
    > } @sel;
    > }
    >


    I think I disagree with this. While it agrees with the OP's stated specs,
    I'm not sure it agrees with the spirit of the specs. :) Of course, my
    interpretation may simply be wrong.

    Where I disagree first is with the stated specs:

    > >> 1. A B C D
    > >> 2. A B C
    > >> 3. B C D
    > >> 4. A B D
    > >> 5. A C D
    > >> 6. A B
    > >> 7. B C
    > >> 8. C D
    > >> 9. A C
    > >> 10. B D
    > >> 11. A D
    > >> 12. A
    > >> 13. B
    > >> 14. C
    > >> 15. D


    I think 11. A D should come before 10. B D, because all else being equal,
    A comes before B. In addition, when your code expands the terms 'A'..'E',
    you get:

    A B C D E
    A B C D
    B C D E
    A B C E
    A B D E
    A C D E
    A B C
    B C D
    C D E
    ....

    While 'A B D E' has more terms, I think 'A B C', 'B C D', and 'C D E'
    should outrank it, because they have more adjacent terms in a row.

    So, while I think your code is MUCH prettier, below is my take on this
    problem. The scoring is bizarre--I just want to weight the right things
    while eliminating dupes. Seems to work, but I can't give you a
    mathematical proof, so it's probably flawed. :)


    use strict;
    use warnings;

    my @sets = subsets( 'A' .. 'D' );

    print "@$_\n" for @sets;

    sub subsets {
    my @words = @_;
    my $n = @words;
    my @sets;
    my @scored;
    my %seen;

    # create "binary" sets, '1's represent words present
    push @sets, sprintf "%0${n}b", $_ for 0 .. 2**$n-1;

    # $x is for unique sort keys ("scores")
    my $x = $n - 1;

    for my $si ( 0 .. $#sets ) { # need $si in score
    my $set = $sets[ $si ];

    # split into groups of adjacent terms
    my @groups = split( /0/, $set);

    my $score = 0;
    for my $gi ( 0 .. $#groups ) { # need $gi in score
    my $group = $groups[ $gi ];
    next unless $group;

    # sets are scored by length, number,
    # and lefthandedness of their groups
    my $len = length $group;
    $score += $x**(2*$len) + ($x-$gi)*$len + $si;
    }

    # convert "binary" sets to sets of words
    my @wordset;
    my $i = 0;
    for( split //, $set) {
    my $word = $words[$i++];
    $_ && push @wordset, $word;
    }
    push @scored, [$score, [@wordset]];

    $seen{$score}++; # to prove our scores are unique
    }

    for ( sort keys %seen ) { die "Dupe: $_" if $seen{$_}>1 }

    map { $_->[1] }
    sort { $b->[0] <=> $a->[0] }
    @scored;

    }

    __END__


    Regards,

    Brad
     
    Brad Baxter, Jun 8, 2004
    #17
  18. On Tue, 8 Jun 2004, Brad Baxter wrote:

    >I think 11. A D should come before 10. B D, because all else being equal,
    >A comes before B. In addition, when your code expands the terms 'A'..'E',
    >you get:
    >
    >A B C D E
    >A B C D
    >B C D E
    >A B C E
    >A B D E
    >A C D E
    >A B C
    >B C D
    >C D E
    >
    >While 'A B D E' has more terms, I think 'A B C', 'B C D', and 'C D E'
    >should outrank it, because they have more adjacent terms in a row.


    Well, my regex solution agrees with you, and while the insides of the
    regex are a bit ugly to look at, the algorithm is far simpler than it
    seems.

    #!/usr/bin/perl -l

    my $rx;

    {
    use re 'eval';
    my @kw = qw( A B C D E );
    $rx = qr{
    (?{ local ($s, $f) = (0, 1) })
    ^ \s*
    @{[map qq{ (?:
    \Q$_\E (?{ \$s += \$f <<= 1 }) |
    (?{ \$f = 1 })
    ) \\s*
    }, @kw ]}
    $
    (?{ $s })
    }x;
    }

    while (<{A,}{B,}{C,}{D,}{E,}>) {
    chomp;
    print "$^R\t$_" if /$rx/ and $^R;
    }

    Run that code through '| sort -n', and you'll get:

    2 A
    2 B
    2 C
    2 D
    2 E
    4 AC
    4 AD
    4 AE
    4 BD
    4 BE
    4 CE
    6 AB
    6 ACE
    6 BC
    6 CD
    6 DE
    8 ABD
    8 ABE
    8 ACD
    8 ADE
    8 BCE
    8 BDE
    12 ABDE
    14 ABC
    14 BCD
    14 CDE
    16 ABCE
    16 ACDE
    30 ABCD
    30 BCDE
    62 ABCDE

    which I think is consistent with the rules.

    --
    Jeff Pinyan RPI Acacia Brother #734 RPI Acacia Corp Secretary
    "And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
    years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
    Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)
     
    Jeff 'japhy' Pinyan, Jun 8, 2004
    #18
  19. Brad Baxter <> wrote:

    > On Tue, 8 Jun 2004, David K. Wall wrote:
    >

    [snip code]

    (BTW, I don't like the way my posted code (mis)handles the empty set,
    but never mind)

    > I think I disagree with this. While it agrees with the OP's
    > stated specs, I'm not sure it agrees with the spirit of the specs.
    > :)


    Heh.

    > Of course, my interpretation may simply be wrong.


    Or mine. It's Tore's problem, let him worry about it. :)


    > Where I disagree first is with the stated specs:
    >
    >> >> 1. A B C D
    >> >> 2. A B C
    >> >> 3. B C D
    >> >> 4. A B D
    >> >> 5. A C D
    >> >> 6. A B
    >> >> 7. B C
    >> >> 8. C D
    >> >> 9. A C
    >> >> 10. B D
    >> >> 11. A D
    >> >> 12. A
    >> >> 13. B
    >> >> 14. C
    >> >> 15. D

    >
    > I think 11. A D should come before 10. B D, because all else being
    > equal, A comes before B.


    But all else isn't equal. The "distance" between A and D is greater
    than the "distance" between B and D. I'm not sure how to express
    this clearly other than in code, but the way I understood Tore was
    this: the combinations are grouped

    first by the number of terms/elements in a combination,
    then by the "range" of the combination,
    then by the order of the original set.

    -- but maybe I was reading too much into the choice of 'A'..'D' for
    the example?

    > In addition, when your code expands the
    > terms 'A'..'E', you get:
    >
    > A B C D E
    > A B C D
    > B C D E
    > A B C E
    > A B D E
    > A C D E
    > A B C
    > B C D
    > C D E
    > ...
    >
    > While 'A B D E' has more terms, I think 'A B C', 'B C D', and 'C D
    > E' should outrank it, because they have more adjacent terms in a
    > row.


    That's a good point -- I certainly won't argue against it. What Would
    Google Do? :)
     
    David K. Wall, Jun 8, 2004
    #19
  20. Tore Aursand

    Tore Aursand Guest

    On Tue, 08 Jun 2004 14:18:22 -0700, Randal L. Schwartz wrote:
    > It appears as though you are sorting on "edit distance", which is a
    > well-defined term, and even has a module, String::Approx, to compute
    > it.
    > [...]


    Hmm. I've already had a look at String::Approx, but I don't think it will
    help me solve this.

    Remember that 'A B C D' really are four _words_;

    my $query = 'A B C D'; # What the user wants to search for
    my @words = split( /\s+/, $query );

    Is there really no module which lets you do something like the following?

    my @words = ( whatever );
    foreach ( @Documents ) {
    my $score = search( $_->text(), \@words );
    }

    Or something? I want it! :)


    --
    Tore Aursand <>
    "Those people who think they know everything are a great annoyance to
    those of us who do." (Isaac Asimov)
     
    Tore Aursand, Jun 8, 2004
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. crichmon
    Replies:
    4
    Views:
    499
    Mabden
    Jul 7, 2004
  2. Xah Lee
    Replies:
    0
    Views:
    631
    Xah Lee
    Jun 14, 2006
  3. Xah Lee
    Replies:
    0
    Views:
    342
    Xah Lee
    Jun 14, 2006
  4. tomasz
    Replies:
    7
    Views:
    354
    Jonathan Gardner
    Dec 18, 2007
  5. Xah Lee
    Replies:
    0
    Views:
    282
    Xah Lee
    Jun 14, 2006
Loading...

Share This Page