Matching mixed up words

Discussion in 'Perl Misc' started by Michael T. Davis, Apr 12, 2005.

  1. Say I want to match "gremlin" or the letters that compose the word
    "gremlin", but in any order. Note that once "g" is consumed, the set of
    available letters no longer includes "g". (Also, "g" isn't necessarily
    going to be the first letter.) I would anticipate that a proper solution
    for a word of <N> letters would approach a complexity (or "big O") of N!
    (read "N factorial"). Is there a solution which could be implemented as
    a single match, or would this require some extra code around a match?

    Thanks,
    Mike
    --
    | Systems Specialist: CBE,MSE
    Michael T. Davis | Departmental Networking/Computing
    http://www.ecr6.ohio-state.edu/~davism/ | The Ohio State University
    | 197 Watts, (614) 292-6928
    Michael T. Davis, Apr 12, 2005
    #1
    1. Advertising

  2. -state.edu (Michael T. Davis) wrote in
    news:d3h7bu$hbt$-state.edu:

    > Say I want to match "gremlin" or the letters that compose the
    > word "gremlin", but in any order. Note that once "g" is consumed,
    > the set of available letters no longer includes "g". (Also, "g" isn't
    > necessarily going to be the first letter.) I would anticipate that a
    > proper solution for a word of <N> letters would approach a complexity
    > (or "big O") of N! (read "N factorial").


    You are too pessimistic :)

    > Is there a solution which could be implemented as a single match,
    > or would this require some extra code around a match?


    I don't see any mention of regexes in your post. I am not sure if that
    is what you are after. There is a simple solution to this that falls
    directly from your explanation of the problem:

    use strict;
    use warnings;

    use Data::Dumper;

    sub check {
    my ($orig, $target) = @_;

    my %c;

    use integer;
    my @l = split //, $orig;
    ++$c{$_} for @l;

    @l = split //, $target;
    for (@l) {
    if(exists($c{$_}) and $c{$_}) {
    --$c{$_};
    }
    }

    @l = grep { $_ > 0 } values %c;
    scalar @l ? 0 : 1;
    }

    my %check = (
    sinan => [ 'nasin', 'nasina', 'lasin' ],
    gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
    );

    for my $k (keys %check) {
    for my $t (@{ $check{$k} }) {
    print "$k matches $t?: ";
    if(check($k, $t)) {
    print "Yes\n";
    } else {
    print "No\n";
    }
    }
    }


    __END__


    I am sure someone will show a regex solution that I have overlooked.

    By the way, your signature is not formatted properly:

    > --


    The proper signature marker is two dashes followed by a space and a
    newline. Please do use that.

    Sinan


    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
    A. Sinan Unur, Apr 12, 2005
    #2
    1. Advertising

  3. Michael T. Davis

    thundergnat Guest

    Michael T. Davis wrote:
    > Say I want to match "gremlin" or the letters that compose the word
    > "gremlin", but in any order. Note that once "g" is consumed, the set of
    > available letters no longer includes "g". (Also, "g" isn't necessarily
    > going to be the first letter.) I would anticipate that a proper solution
    > for a word of <N> letters would approach a complexity (or "big O") of N!
    > (read "N factorial"). Is there a solution which could be implemented as
    > a single match, or would this require some extra code around a match?
    >


    I'm sure it could be done more efficiently but it was an interesting
    little diversion. I wandered a little from the OPs spec since I am
    ignoring spaces, punctuation and case, I guess.


    use warnings;
    use strict;

    my $phrase = 'George W. Bush';

    my %letters;

    for (split//, $phrase){
    $letters{lc($_)}++ if /[a-zA-Z]/;
    }

    while (<DATA>){
    chomp (my $test_phrase = $_);
    my $no_match;
    my %testhash = %letters;
    for (split//, $test_phrase){
    if (/[a-zA-Z]/){
    if (--$testhash{lc($_)} < 0){
    $no_match++;
    last;
    }
    }
    }
    for (values %testhash){
    last if $no_match;
    if ($_ < 0){
    $no_match++;
    }
    }
    print "Phrase \"$test_phrase\" ".($no_match ?
    'does not match' : 'matches')." $phrase.\n";
    }

    __DATA__
    NOT A MATCH
    SHRUB EGG WOE
    BUG GORE HEWS
    GOB SEWER HUG
    WEB USER GOGH
    RUBES EGG WHO
    BUG GREW HOSE
    WHOSE BUGGER
    BEG WORSE UGH
    A BOGUS ENTRY
    thundergnat, Apr 12, 2005
    #3
  4. Michael T. Davis

    Guest

    -state.edu (Michael T. Davis) wrote:
    > Say I want to match "gremlin" or the letters that compose the
    > word "gremlin", but in any order. Note that once "g" is consumed, the
    > set of available letters no longer includes "g". (Also, "g" isn't
    > necessarily going to be the first letter.) I would anticipate that a
    > proper solution for a word of <N> letters would approach a complexity (or
    > "big O") of N! (read "N factorial"). Is there a solution which could be
    > implemented as a single match, or would this require some extra code
    > around a match?


    canon("gremlin") eq canon($foo) or die;

    sub canon {
    join "", sort split //, $_[0];
    };

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    , Apr 12, 2005
    #4
  5. wrote in news:20050412170728.908$:

    > -state.edu (Michael T. Davis) wrote:
    >> Say I want to match "gremlin" or the letters that compose the
    >> word "gremlin", but in any order match?


    ....

    > canon("gremlin") eq canon($foo) or die;
    >
    > sub canon {
    > join "", sort split //, $_[0];
    > };


    That's what I call the power of a clear mind :)

    Simple and elegant.

    Sinan

    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
    A. Sinan Unur, Apr 12, 2005
    #5
  6. Also sprach A. Sinan Unur:

    > -state.edu (Michael T. Davis) wrote in
    > news:d3h7bu$hbt$-state.edu:
    >
    >> Say I want to match "gremlin" or the letters that compose the
    >> word "gremlin", but in any order. Note that once "g" is consumed,
    >> the set of available letters no longer includes "g". (Also, "g" isn't
    >> necessarily going to be the first letter.) I would anticipate that a
    >> proper solution for a word of <N> letters would approach a complexity
    >> (or "big O") of N! (read "N factorial").

    >
    > You are too pessimistic :)
    >
    >> Is there a solution which could be implemented as a single match,
    >> or would this require some extra code around a match?

    >
    > I don't see any mention of regexes in your post. I am not sure if that
    > is what you are after. There is a simple solution to this that falls
    > directly from your explanation of the problem:
    >
    > use strict;
    > use warnings;
    >
    > use Data::Dumper;
    >
    > sub check {
    > my ($orig, $target) = @_;
    >
    > my %c;
    >
    > use integer;
    > my @l = split //, $orig;
    > ++$c{$_} for @l;
    >
    > @l = split //, $target;
    > for (@l) {
    > if(exists($c{$_}) and $c{$_}) {
    > --$c{$_};
    > }
    > }
    >
    > @l = grep { $_ > 0 } values %c;
    > scalar @l ? 0 : 1;
    > }
    >
    > my %check = (
    > sinan => [ 'nasin', 'nasina', 'lasin' ],
    > gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
    > );
    >
    > for my $k (keys %check) {
    > for my $t (@{ $check{$k} }) {
    > print "$k matches $t?: ";
    > if(check($k, $t)) {
    > print "Yes\n";
    > } else {
    > print "No\n";
    > }
    > }
    > }


    A faster solution appears to involve sort(): split both strings, sort
    them and compare for equality. According to a benchmark:

    use strict;
    use Benchmark qw/cmpthese/;

    sub check {
    my ($orig, $target) = @_;
    my %c;
    use integer;
    my @l = split //, $orig;
    ++$c{$_} for @l;
    @l = split //, $target;
    for (@l) {
    if(exists($c{$_}) and $c{$_}) {
    --$c{$_};
    }
    }
    @l = grep { $_ > 0 } values %c;
    scalar @l ? 0 : 1;
    }

    sub check_sort {
    my ($orig, $target) = @_;
    my $o = join '', sort split //, $orig;
    my $t = join '', sort split //, $target;
    return $o eq $t;
    }

    my %check = (
    sinan => [ 'nasin', 'nasina', 'lasin', ],
    gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
    );

    cmpthese(-2, {
    histo => sub {
    for my $k (keys %check) {
    for my $t (@{ $check{$k} }) {
    check($k, $t);
    }
    }
    },
    sort => sub {
    for my $k (keys %check) {
    for my $t (@{ $check{$k} }) {
    check_sort($k, $t);
    }
    }
    },
    });
    __END__
    Rate histo sort
    histo 2280/s -- -43%
    sort 3992/s 75% --

    This might however be due to a denser implementation of check_sort()
    avoiding temporary variables etc.

    Also, check_sort() is more correct as it wont falsely report 'sinan' and
    'nasina' as matching, which check() does. ;-) I'd write check() thusly:

    sub check {
    my ($orig, $target) = @_;
    my %c;
    ++$c{$_} for split //, $orig;
    --$c{$_} for split //, $target;
    return ! grep $_, values %c;
    }

    This is still slower by roughly 25% than using sort. The 'use integer'
    appears to have no effect on the benchmark.

    Tassilo
    --
    use bigint;
    $n=71423350343770280161397026330337371139054411854220053437565440;
    $m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
    Tassilo v. Parseval, Apr 12, 2005
    #6
  7. "Tassilo v. Parseval" <> wrote in
    news::

    > sub check_sort {
    > my ($orig, $target) = @_;
    > my $o = join '', sort split //, $orig;
    > my $t = join '', sort split //, $target;
    > return $o eq $t;
    > }

    ....

    > Also, check_sort() is more correct as it wont falsely report 'sinan'
    > and 'nasina' as matching, which check() does. ;-)


    And to think that I actually look at the output, and somehow did not
    notice my error. Thank you for catching that.

    Sinan

    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
    A. Sinan Unur, Apr 13, 2005
    #7
  8. Just to be clear, I'm looking for a regex-based mechanism that will
    work within the confines of "m/.../". I would imagine it's going to need to
    rely on the "(${code})" construct.

    BTW, my signature includes a trailing space at the end of the first
    line, but the gateway I'm using apparently strips it off. I have alerted
    them to the mistake.

    Regards,
    Mike
    --
    | Systems Specialist: CBE,MSE
    Michael T. Davis | Departmental Networking/Computing
    http://www.ecr6.ohio-state.edu/~davism/ | The Ohio State University
    | 197 Watts, (614) 292-6928
    Michael T. Davis, Apr 13, 2005
    #8
  9. Also sprach Michael T. Davis:

    > Just to be clear, I'm looking for a regex-based mechanism that will
    > work within the confines of "m/.../". I would imagine it's going to need to
    > rely on the "(${code})" construct.


    Most likely even (??{CODE}). However, any of my attempts so far ended up
    in a segmentation fault or 'panic: '. I knew that some of these extended
    patterns are flagged as experimental but I didn't expect them to be that
    fragile. It's tricky enough coming up with a pure regex solution but
    here you'll also need to find one that wont crash perl. So I wouldn't
    bother.

    Tassilo
    --
    use bigint;
    $n=71423350343770280161397026330337371139054411854220053437565440;
    $m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
    Tassilo v. Parseval, Apr 13, 2005
    #9
  10. Michael T. Davis

    Anno Siegel Guest

    A. Sinan Unur <> wrote in comp.lang.perl.misc:
    > -state.edu (Michael T. Davis) wrote in
    > news:d3h7bu$hbt$-state.edu:


    [how to test for anagrams]

    > use strict;
    > use warnings;
    >
    > use Data::Dumper;
    >
    > sub check {
    > my ($orig, $target) = @_;
    >
    > my %c;
    >
    > use integer;
    > my @l = split //, $orig;
    > ++$c{$_} for @l;
    >
    > @l = split //, $target;
    > for (@l) {
    > if(exists($c{$_}) and $c{$_}) {
    > --$c{$_};
    > }
    > }
    >
    > @l = grep { $_ > 0 } values %c;
    > scalar @l ? 0 : 1;
    > }
    >
    > my %check = (
    > sinan => [ 'nasin', 'nasina', 'lasin' ],
    > gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
    > );
    >
    > for my $k (keys %check) {
    > for my $t (@{ $check{$k} }) {
    > print "$k matches $t?: ";
    > if(check($k, $t)) {
    > print "Yes\n";
    > } else {
    > print "No\n";
    > }
    > }
    > }
    >
    >
    > __END__
    >
    >
    > I am sure someone will show a regex solution that I have overlooked.


    A regex solution seems unlikely. It would require jumping back and
    forth in a string while keeping track of what was matched where.
    Regexes aren't very good at that.

    Using a hash for counting is just fine. It is basically a well known
    data structure that implements what has been called "bags". Bags are
    like sets, but each element (a hash key) can be contained multiple times
    (the hash value). Containment and equality of bags are defined in the
    obvious way. Then, to check if two strings are anagrams, create the
    corresponding bags and test for equality. Code:

    my %check = (
    sinan => [ 'nasin', 'nasina', 'lasin' ],
    gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
    );

    for my $k (keys %check) {
    my $bk = Bag->embag( $k);
    for my $t (@{ $check{$k} }) {
    print "$k matches $t?: ";
    print $bk eq Bag->embag( $t) ? "Yes\n" : "No\n";
    }
    }

    #########################################################################

    package Bag;

    sub embag { # create a bag of letters from a string
    my $class = shift;
    my %bag;
    $bag{ $_} ++ for split //, shift;
    bless \ %bag, $class;
    }

    sub contained {
    my ( $b1, $b2) = @_;
    $b2->{ $_} and $b1->{ $_} > $b2->{ $_} and return 0 for keys %$b1;
    1;
    }

    use overload(
    le => 'contained',
    eq => sub { $_[ 0] le $_[ 1] and $_[ 1] le $_[ 0] },
    );

    __END__

    Anno
    Anno Siegel, Apr 13, 2005
    #10
  11. Michael T. Davis

    Anno Siegel Guest

    Tassilo v. Parseval <> wrote in comp.lang.perl.misc:
    > Also sprach A. Sinan Unur:
    >
    > > -state.edu (Michael T. Davis) wrote in
    > > news:d3h7bu$hbt$-state.edu:


    [anagram detection by counting letters]

    > A faster solution appears to involve sort(): split both strings, sort
    > them and compare for equality. According to a benchmark:


    [benchmark snipped]

    Interesting, since counting is linear and sorting is n*log n. Presumably,
    with huge words, counting would win in the end, but there probably
    never was a language (not even German) with words long enough to bring
    out the difference.

    Anno
    Anno Siegel, Apr 13, 2005
    #11
  12. Also sprach Anno Siegel:
    > Tassilo v. Parseval <> wrote in comp.lang.perl.misc:
    >> Also sprach A. Sinan Unur:
    >>
    >> > -state.edu (Michael T. Davis) wrote in
    >> > news:d3h7bu$hbt$-state.edu:

    >
    > [anagram detection by counting letters]
    >
    >> A faster solution appears to involve sort(): split both strings, sort
    >> them and compare for equality. According to a benchmark:

    >
    > [benchmark snipped]
    >
    > Interesting, since counting is linear and sorting is n*log n. Presumably,
    > with huge words, counting would win in the end, but there probably
    > never was a language (not even German) with words long enough to bring
    > out the difference.


    Altering the benchmark a little so that we can change the length of the
    words more easily:

    use Benchmark qw/cmpthese/;

    sub check {
    my ($orig, $target) = @_;
    my %c;
    ++$c{$_} for split //, $orig;
    --$c{$_} for split //, $target;
    return ! grep $_, values %c;
    }

    sub check_sort {
    my ($orig, $target) = @_;
    my $o = join '', sort split //, $orig;
    my $t = join '', sort split //, $target;
    return $o eq $t;
    }

    my $len = shift;
    my $key = join '', map { ['a'..'z']->[rand 26] } 1 .. $len;

    cmpthese(-2, {
    histo => sub {
    check($key, scalar reverse $key);
    },
    sort => sub {
    check_sort($key, scalar reverse $key);
    },
    });

    $len = 20:

    Rate histo sort
    histo 8029/s -- -19%
    sort 9962/s 24% --

    $len = 50:

    Rate histo sort
    histo 3600/s -- -10%
    sort 4015/s 12% --

    $len = 100:

    Rate histo sort
    histo 1912/s -- -3%
    sort 1981/s 4% --

    $len = 200:

    Rate sort histo
    sort 972/s -- -5%
    histo 1018/s 5% --

    Aha! So the words need to be unrealistically long in order for the
    linear method to win. Which says quite something about the efficiency of
    perl's sort implementations. Of course, check() could be made to return
    earlier, for instance when a negative value shows up in the second
    for-loop. The same is true for the final grep().

    Still, for real-world words I suspect using sort() is still a very
    efficient (both coding- and runtime-wise) solution.

    Tassilo
    --
    use bigint;
    $n=71423350343770280161397026330337371139054411854220053437565440;
    $m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
    Tassilo v. Parseval, Apr 13, 2005
    #12
  13. Michael T. Davis

    Anno Siegel Guest

    Tassilo v. Parseval <> wrote in comp.lang.perl.misc:
    > Also sprach Anno Siegel:
    > > Tassilo v. Parseval <> wrote in

    > comp.lang.perl.misc:
    > >> Also sprach A. Sinan Unur:
    > >>
    > >> > -state.edu (Michael T. Davis) wrote in
    > >> > news:d3h7bu$hbt$-state.edu:

    > >
    > > [anagram detection by counting letters]
    > >
    > >> A faster solution appears to involve sort(): split both strings, sort
    > >> them and compare for equality. According to a benchmark:

    > >
    > > [benchmark snipped]
    > >
    > > Interesting, since counting is linear and sorting is n*log n. Presumably,
    > > with huge words, counting would win in the end, but there probably
    > > never was a language (not even German) with words long enough to bring
    > > out the difference.

    >
    > Altering the benchmark a little so that we can change the length of the
    > words more easily:


    [shortened]

    > $len = 100:
    >
    > Rate histo sort
    > histo 1912/s -- -3%
    > sort 1981/s 4% --
    >
    > $len = 200:
    >
    > Rate sort histo
    > sort 972/s -- -5%
    > histo 1018/s 5% --
    >
    > Aha! So the words need to be unrealistically long in order for the
    > linear method to win.


    I wouldn't have been amazed to find the crossover length even higher,
    at 1000 or so.

    > Which says quite something about the efficiency of
    > perl's sort implementations. Of course, check() could be made to return
    > earlier, for instance when a negative value shows up in the second
    > for-loop. The same is true for the final grep().


    List::Util::first is the grep replacement for that. How much it
    saves depends heavily on the distribution of the strings. If
    strings vary wildly, it can save a lot, if most comparisons are
    for almost-anagrams it won't save so much.

    > Still, for real-world words I suspect using sort() is still a very
    > efficient (both coding- and runtime-wise) solution.


    In one implementation I used byte vectors for letter counting, a la

    embag {
    my $bag = '';
    ++ vec( $bag, ord $_, 8) for split //, shift;
    $bag;
    }

    which is easily Inline-able. Equality of counts is 'eq', like with
    sorting. The count vectors can be compacted using another level of
    indirection ( $charno[ ord $_] instead of ord $_), which is still
    easily Inlined. Sorting was no option for the application, so I never
    benchmarked against it, but I'd expect the Inlined code to be in the
    same ballpark, even for short strings.

    Anno
    Anno Siegel, Apr 13, 2005
    #13
  14. Anno Siegel <-berlin.de> kirjoitti 13.04.2005:
    > A. Sinan Unur <> wrote in comp.lang.perl.misc:
    >> -state.edu (Michael T. Davis) wrote in
    >> news:d3h7bu$hbt$-state.edu:

    >
    > [how to test for anagrams]
    >
    >> I am sure someone will show a regex solution that I have overlooked.

    >
    > A regex solution seems unlikely. It would require jumping back and
    > forth in a string while keeping track of what was matched where.
    > Regexes aren't very good at that.


    Nonetheless, here's a regex solution:

    sub anagram_re {
    my $word = shift;
    return "" if $word eq "";
    my (@re, %seen);
    foreach my $i (0 .. length($word)-1) {
    my $temp = $word;
    my $ch = substr($temp, $i, 1, "");
    next if $seen{$ch}++;
    push @re, quotemeta($ch) . anagram_re($temp);
    }
    return @re > 1 ? "(?:".join("|", @re).")" : $re[0];
    }

    Give it a word, and it will return a regex to match any anagram of it.
    For example, here's the regex for "food" (sans "?:" modifiers):

    (f(o(od|do)|doo)|o(f(od|do)|o(fd|df)|d(fo|of))|d(foo|o(fo|of)))

    The corresponding regex for "gremlin" is 33218 characters long with
    the "?:" modifiers, or 25978 without them.

    --
    Ilmari Karonen
    To reply by e-mail, please replace ".invalid" with ".net" in address.
    Ilmari Karonen, Apr 13, 2005
    #14
  15. Michael T. Davis

    Anno Siegel Guest

    Ilmari Karonen <> wrote in comp.lang.perl.misc:
    > Anno Siegel <-berlin.de> kirjoitti 13.04.2005:
    > > A. Sinan Unur <> wrote in comp.lang.perl.misc:
    > >> -state.edu (Michael T. Davis) wrote in
    > >> news:d3h7bu$hbt$-state.edu:

    > >
    > > [how to test for anagrams]
    > >
    > >> I am sure someone will show a regex solution that I have overlooked.

    > >
    > > A regex solution seems unlikely. It would require jumping back and
    > > forth in a string while keeping track of what was matched where.
    > > Regexes aren't very good at that.

    >
    > Nonetheless, here's a regex solution:
    >
    > sub anagram_re {
    > my $word = shift;
    > return "" if $word eq "";
    > my (@re, %seen);
    > foreach my $i (0 .. length($word)-1) {
    > my $temp = $word;
    > my $ch = substr($temp, $i, 1, "");
    > next if $seen{$ch}++;
    > push @re, quotemeta($ch) . anagram_re($temp);
    > }
    > return @re > 1 ? "(?:".join("|", @re).")" : $re[0];
    > }
    >
    > Give it a word, and it will return a regex to match any anagram of it.
    > For example, here's the regex for "food" (sans "?:" modifiers):
    >
    > (f(o(od|do)|doo)|o(f(od|do)|o(fd|df)|d(fo|of))|d(foo|o(fo|of)))


    Nice hack...

    > The corresponding regex for "gremlin" is 33218 characters long with
    > the "?:" modifiers, or 25978 without them.


    ....and mostly useless. I like it :)

    Anno
    Anno Siegel, Apr 14, 2005
    #15
  16. Anno Siegel <-berlin.de> kirjoitti 13.04.2005:
    > Ilmari Karonen <> wrote in comp.lang.perl.misc:
    >>
    >> Give it a word, and it will return a regex to match any anagram of it.

    >
    > Nice hack...
    >
    >> The corresponding regex for "gremlin" is 33218 characters long with
    >> the "?:" modifiers, or 25978 without them.

    >
    > ...and mostly useless. I like it :)


    It does, however, have one advantage -- it's fast. Really fast. Over
    an order of magnitude faster than any other solution in this thread so
    far, in fact.

    Of course, that's only if you ignore the time to build and compile the
    regex. And it only works for fairly short words anyway.

    But if you want a solution that both runs _and_ starts fast, here's
    something adapted from an earlier thread titled "perl scramble":

    #!/usr/bin/perl -w
    use strict;

    my $word = shift;
    my $canon = join "", sort split //, "$word\n";

    my $code = q{
    while (<>) {
    print if length == length $canon
    and !tr/LETTERS//c
    and $canon eq join "", sort split //;
    }
    };
    $code =~ s/LETTERS/\Q$canon/;
    eval $code; die if $@;

    This assumes input comes from a file (or stdin), but it can be easily
    modified to, say, grep an array.

    --
    Ilmari Karonen
    To reply by e-mail, please replace ".invalid" with ".net" in address.
    Ilmari Karonen, Apr 14, 2005
    #16
  17. Michael T. Davis

    Anno Siegel Guest

    Ilmari Karonen <> wrote in comp.lang.perl.misc:
    > Anno Siegel <-berlin.de> kirjoitti 13.04.2005:
    > > Ilmari Karonen <> wrote in comp.lang.perl.misc:
    > >>
    > >> Give it a word, and it will return a regex to match any anagram of it.

    > >
    > > Nice hack...


    [...]

    > But if you want a solution that both runs _and_ starts fast, here's
    > something adapted from an earlier thread titled "perl scramble":
    >
    > #!/usr/bin/perl -w
    > use strict;
    >
    > my $word = shift;
    > my $canon = join "", sort split //, "$word\n";
    >
    > my $code = q{
    > while (<>) {
    > print if length == length $canon
    > and !tr/LETTERS//c
    > and $canon eq join "", sort split //;
    > }
    > };
    > $code =~ s/LETTERS/\Q$canon/;
    > eval $code; die if $@;
    >
    > This assumes input comes from a file (or stdin), but it can be easily
    > modified to, say, grep an array.


    Ah, that's basically the "sort-solution", but the length and tr/// tests
    speed it up. Sorting only happens when a word is entirely made of the
    same letters, but in different numbers (with the same total). That helps
    a lot in typical situations when most candidates are not anagrams. Nifty.

    Anno
    Anno Siegel, Apr 14, 2005
    #17
  18. Also sprach Abigail:

    > Tassilo v. Parseval () wrote on
    > MMMMCCXLIII September MCMXCIII in <URL:news:>:
    >|| Also sprach Michael T. Davis:
    >||
    >|| > Just to be clear, I'm looking for a regex-based mechanism that will
    >|| > work within the confines of "m/.../". I would imagine it's going to need to
    >|| > rely on the "(${code})" construct.
    >||
    >|| Most likely even (??{CODE}). However, any of my attempts so far ended up
    >|| in a segmentation fault or 'panic: '. I knew that some of these extended
    >|| patterns are flagged as experimental but I didn't expect them to be that
    >|| fragile. It's tricky enough coming up with a pure regex solution but
    >|| here you'll also need to find one that wont crash perl. So I wouldn't
    >|| bother.
    >
    >
    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    > no warnings qw /syntax/;
    >
    > my $word = "gremlin";
    > my $ana = "nlmregi";
    >
    > my (%h);
    > print $word =~
    > /^(?{%h = ()})
    > (?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
    > (?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
    > $(?(?{grep {$_} values %h}})(?!)|)/x ? "match\n" : "no match\n";


    Hmmh, this doesn't compile:

    Sequence (?{...}) not terminated or not {}-balanced in regex; marked by <-- HERE in m/^(?{%h = ()})
    (?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
    (?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
    $(?(?{ <-- HERE grep {$_} values %h}})(?!)|)/ at - line 13.

    I can get rid of these by strategically inserting a few spaces here and
    there, but then it eventually complains about an "Unknown switch
    condition".

    Tassilo
    --
    use bigint;
    $n=71423350343770280161397026330337371139054411854220053437565440;
    $m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
    Tassilo v. Parseval, Apr 15, 2005
    #18
  19. * Tassilo v. Parseval schrieb:
    > Also sprach Abigail:
    > >
    > > my (%h);
    > > print $word =~
    > > /^(?{%h = ()})
    > > (?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
    > > (?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
    > > $(?(?{grep {$_} values %h}})(?!)|)/x ? "match\n" : "no match\n";

    ^^
    >
    > Hmmh, this doesn't compile:


    Delete one of those marked curly parentheses and it'll work fine.

    regards,
    fabian
    Fabian Pilkowski, Apr 15, 2005
    #19
  20. Also sprach Abigail:

    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    > no warnings qw /syntax/;
    >
    > my $word = "gremlin";
    > my $ana = "nlmregi";
    >
    > my (%h);
    > print $word =~
    > /^(?{%h = ()})
    > (?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
    > (?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
    > $(?(?{grep {$_} values %h})(?!)|)/x
    > ? "match\n" : "no match\n";


    Indeed, this is much better. Interestingly enough, it stops working when
    using split:

    use strict;
    use warnings;
    no warnings qw /syntax/;

    my $word = "gremlin";
    my $ana = "nlmregi";

    my (%h);
    print $word =~ m#
    ^(?{%h = ()})
    (?{$h {$_} ++ for split //, $ana})
    (?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
    $(?(?{grep {$_} values %h})(?!)|)#x
    ? "match\n" : "no match\n";

    As far as I see it, this code should be functionally equivalent to
    yours. Probably these extended patterns don't work too well when another
    pattern match happens inside.

    Tassilo
    --
    use bigint;
    $n=71423350343770280161397026330337371139054411854220053437565440;
    $m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
    Tassilo v. Parseval, Apr 15, 2005
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Peter Strøiman
    Replies:
    1
    Views:
    2,070
    Peter Strøiman
    Aug 23, 2005
  2. Richard Heathfield
    Replies:
    7
    Views:
    350
    Barry Schwarz
    Oct 5, 2003
  3. phaeton123
    Replies:
    2
    Views:
    390
    Alain Frisch
    Aug 29, 2006
  4. utab

    Words Words

    utab, Feb 16, 2006, in forum: C++
    Replies:
    6
    Views:
    415
    Daniel T.
    Feb 16, 2006
  5. BerlinBrown
    Replies:
    6
    Views:
    4,431
Loading...

Share This Page