Matching mixed up words

Michael T. Davis · Apr 12, 2005

Say I want to match "gremlin" or the letters that compose the word
"gremlin", but in any order. Note that once "g" is consumed, the set of
available letters no longer includes "g". (Also, "g" isn't necessarily
going to be the first letter.) I would anticipate that a proper solution
for a word of <N> letters would approach a complexity (or "big O") of N!
(read "N factorial"). Is there a solution which could be implemented as
a single match, or would this require some extra code around a match?

Thanks,
Mike

A. Sinan Unur · Apr 12, 2005

(e-mail address removed)-state.edu (Michael T. Davis) wrote in

Say I want to match "gremlin" or the letters that compose the
word "gremlin", but in any order. Note that once "g" is consumed,
the set of available letters no longer includes "g". (Also, "g" isn't
necessarily going to be the first letter.) I would anticipate that a
proper solution for a word of <N> letters would approach a complexity
(or "big O") of N! (read "N factorial").

You are too pessimistic

Is there a solution which could be implemented as a single match,
or would this require some extra code around a match?

I don't see any mention of regexes in your post. I am not sure if that
is what you are after. There is a simple solution to this that falls
directly from your explanation of the problem:

use strict;
use warnings;

use Data:

umper;

sub check {
my ($orig, $target) = @_;

my %c;

use integer;
my @l = split //, $orig;
++$c{$_} for @l;

@l = split //, $target;
for (@l) {
if(exists($c{$_}) and $c{$_}) {
--$c{$_};
}
}

@l = grep { $_ > 0 } values %c;
scalar @l ? 0 : 1;
}

my %check = (
sinan => [ 'nasin', 'nasina', 'lasin' ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);

for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
print "$k matches $t?: ";
if(check($k, $t)) {
print "Yes\n";
} else {
print "No\n";
}
}
}

__END__

I am sure someone will show a regex solution that I have overlooked.

By the way, your signature is not formatted properly:

--

The proper signature marker is two dashes followed by a space and a
newline. Please do use that.

Sinan

thundergnat · Apr 12, 2005

Michael said:
Say I want to match "gremlin" or the letters that compose the word
"gremlin", but in any order. Note that once "g" is consumed, the set of
available letters no longer includes "g". (Also, "g" isn't necessarily
going to be the first letter.) I would anticipate that a proper solution
for a word of <N> letters would approach a complexity (or "big O") of N!
(read "N factorial"). Is there a solution which could be implemented as
a single match, or would this require some extra code around a match?

I'm sure it could be done more efficiently but it was an interesting
little diversion. I wandered a little from the OPs spec since I am
ignoring spaces, punctuation and case, I guess.

use warnings;
use strict;

my $phrase = 'George W. Bush';

my %letters;

for (split//, $phrase){
$letters{lc($_)}++ if /[a-zA-Z]/;
}

while (<DATA>){
chomp (my $test_phrase = $_);
my $no_match;
my %testhash = %letters;
for (split//, $test_phrase){
if (/[a-zA-Z]/){
if (--$testhash{lc($_)} < 0){
$no_match++;
last;
}
}
}
for (values %testhash){
last if $no_match;
if ($_ < 0){
$no_match++;
}
}
print "Phrase \"$test_phrase\" ".($no_match ?
'does not match' : 'matches')." $phrase.\n";
}

__DATA__
NOT A MATCH
SHRUB EGG WOE
BUG GORE HEWS
GOB SEWER HUG
WEB USER GOGH
RUBES EGG WHO
BUG GREW HOSE
WHOSE BUGGER
BEG WORSE UGH
A BOGUS ENTRY

xhoster · Apr 12, 2005

Say I want to match "gremlin" or the letters that compose the
word "gremlin", but in any order. Note that once "g" is consumed, the
set of available letters no longer includes "g". (Also, "g" isn't
necessarily going to be the first letter.) I would anticipate that a
proper solution for a word of <N> letters would approach a complexity (or
"big O") of N! (read "N factorial"). Is there a solution which could be
implemented as a single match, or would this require some extra code
around a match?

canon("gremlin") eq canon($foo) or die;

sub canon {
join "", sort split //, $_[0];
};

Xho

A. Sinan Unur · Apr 12, 2005

Say I want to match "gremlin" or the letters that compose the
word "gremlin", but in any order match?

Click to expand...

....

canon("gremlin") eq canon($foo) or die;

sub canon {
join "", sort split //, $_[0];
};

That's what I call the power of a clear mind

Simple and elegant.

Sinan

Tassilo v. Parseval · Apr 12, 2005

Also sprach A. Sinan Unur:

(e-mail address removed)-state.edu (Michael T. Davis) wrote in

Say I want to match "gremlin" or the letters that compose the
word "gremlin", but in any order. Note that once "g" is consumed,
the set of available letters no longer includes "g". (Also, "g" isn't
necessarily going to be the first letter.) I would anticipate that a
proper solution for a word of <N> letters would approach a complexity
(or "big O") of N! (read "N factorial").

Click to expand...

You are too pessimistic

Is there a solution which could be implemented as a single match,
or would this require some extra code around a match?

Click to expand...

I don't see any mention of regexes in your post. I am not sure if that
is what you are after. There is a simple solution to this that falls
directly from your explanation of the problem:

use strict;
use warnings;

use Data:umper;

sub check {
my ($orig, $target) = @_;

my %c;

use integer;
my @l = split //, $orig;
++$c{$_} for @l;

@l = split //, $target;
for (@l) {
if(exists($c{$_}) and $c{$_}) {
--$c{$_};
}
}

@l = grep { $_ > 0 } values %c;
scalar @l ? 0 : 1;
}

my %check = (
sinan => [ 'nasin', 'nasina', 'lasin' ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);

for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
print "$k matches $t?: ";
if(check($k, $t)) {
print "Yes\n";
} else {
print "No\n";
}
}
}

A faster solution appears to involve sort(): split both strings, sort
them and compare for equality. According to a benchmark:

use strict;
use Benchmark qw/cmpthese/;

sub check {
my ($orig, $target) = @_;
my %c;
use integer;
my @l = split //, $orig;
++$c{$_} for @l;
@l = split //, $target;
for (@l) {
if(exists($c{$_}) and $c{$_}) {
--$c{$_};
}
}
@l = grep { $_ > 0 } values %c;
scalar @l ? 0 : 1;
}

sub check_sort {
my ($orig, $target) = @_;
my $o = join '', sort split //, $orig;
my $t = join '', sort split //, $target;
return $o eq $t;
}

my %check = (
sinan => [ 'nasin', 'nasina', 'lasin', ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);

cmpthese(-2, {
histo => sub {
for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
check($k, $t);
}
}
},
sort => sub {
for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
check_sort($k, $t);
}
}
},
});
__END__
Rate histo sort
histo 2280/s -- -43%
sort 3992/s 75% --

This might however be due to a denser implementation of check_sort()
avoiding temporary variables etc.

Also, check_sort() is more correct as it wont falsely report 'sinan' and
'nasina' as matching, which check() does. ;-) I'd write check() thusly:

sub check {
my ($orig, $target) = @_;
my %c;
++$c{$_} for split //, $orig;
--$c{$_} for split //, $target;
return ! grep $_, values %c;
}

This is still slower by roughly 25% than using sort. The 'use integer'
appears to have no effect on the benchmark.

Tassilo

A. Sinan Unur · Apr 12, 2005

sub check_sort {
my ($orig, $target) = @_;
my $o = join '', sort split //, $orig;
my $t = join '', sort split //, $target;
return $o eq $t;
} ....

Also, check_sort() is more correct as it wont falsely report 'sinan'
and 'nasina' as matching, which check() does. ;-)

And to think that I actually look at the output, and somehow did not
notice my error. Thank you for catching that.

Sinan

Michael T. Davis · Apr 12, 2005

Just to be clear, I'm looking for a regex-based mechanism that will
work within the confines of "m/.../". I would imagine it's going to need to
rely on the "(${code})" construct.

BTW, my signature includes a trailing space at the end of the first
line, but the gateway I'm using apparently strips it off. I have alerted
them to the mistake.

Regards,
Mike

Tassilo v. Parseval · Apr 13, 2005

Also sprach Michael T. Davis:

Just to be clear, I'm looking for a regex-based mechanism that will
work within the confines of "m/.../". I would imagine it's going to need to
rely on the "(${code})" construct.

Most likely even (??{CODE}). However, any of my attempts so far ended up
in a segmentation fault or 'panic: '. I knew that some of these extended
patterns are flagged as experimental but I didn't expect them to be that
fragile. It's tricky enough coming up with a pure regex solution but
here you'll also need to find one that wont crash perl. So I wouldn't
bother.

Tassilo

Anno Siegel · Apr 13, 2005

A. Sinan Unur said:
(e-mail address removed)-state.edu (Michael T. Davis) wrote in
news:[email protected]:

[how to test for anagrams]

use strict;
use warnings;

use Data:umper;

sub check {
my ($orig, $target) = @_;

my %c;

use integer;
my @l = split //, $orig;
++$c{$_} for @l;

@l = split //, $target;
for (@l) {
if(exists($c{$_}) and $c{$_}) {
--$c{$_};
}
}

@l = grep { $_ > 0 } values %c;
scalar @l ? 0 : 1;
}

my %check = (
sinan => [ 'nasin', 'nasina', 'lasin' ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);

for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
print "$k matches $t?: ";
if(check($k, $t)) {
print "Yes\n";
} else {
print "No\n";
}
}
}

__END__

I am sure someone will show a regex solution that I have overlooked.

A regex solution seems unlikely. It would require jumping back and
forth in a string while keeping track of what was matched where.
Regexes aren't very good at that.

Using a hash for counting is just fine. It is basically a well known
data structure that implements what has been called "bags". Bags are
like sets, but each element (a hash key) can be contained multiple times
(the hash value). Containment and equality of bags are defined in the
obvious way. Then, to check if two strings are anagrams, create the
corresponding bags and test for equality. Code:

my %check = (
sinan => [ 'nasin', 'nasina', 'lasin' ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);

for my $k (keys %check) {
my $bk = Bag->embag( $k);
for my $t (@{ $check{$k} }) {
print "$k matches $t?: ";
print $bk eq Bag->embag( $t) ? "Yes\n" : "No\n";
}
}

#########################################################################

package Bag;

sub embag { # create a bag of letters from a string
my $class = shift;
my %bag;
$bag{ $_} ++ for split //, shift;
bless \ %bag, $class;
}

sub contained {
my ( $b1, $b2) = @_;
$b2->{ $_} and $b1->{ $_} > $b2->{ $_} and return 0 for keys %$b1;
1;
}

use overload(
le => 'contained',
eq => sub { $_[ 0] le $_[ 1] and $_[ 1] le $_[ 0] },
);

__END__

Anno

Anno Siegel · Apr 13, 2005

Tassilo v. Parseval said:
Also sprach A. Sinan Unur:

[anagram detection by counting letters]

A faster solution appears to involve sort(): split both strings, sort
them and compare for equality. According to a benchmark:

[benchmark snipped]

Interesting, since counting is linear and sorting is n*log n. Presumably,
with huge words, counting would win in the end, but there probably
never was a language (not even German) with words long enough to bring
out the difference.

Anno

Tassilo v. Parseval · Apr 13, 2005

Also sprach Anno Siegel:

Tassilo v. Parseval said:
Tassilo v. Parseval said:

Also sprach A. Sinan Unur:

Click to expand...

[anagram detection by counting letters]

A faster solution appears to involve sort(): split both strings, sort
them and compare for equality. According to a benchmark:

Click to expand...

[benchmark snipped]

Interesting, since counting is linear and sorting is n*log n. Presumably,
with huge words, counting would win in the end, but there probably
never was a language (not even German) with words long enough to bring
out the difference.

Altering the benchmark a little so that we can change the length of the
words more easily:

use Benchmark qw/cmpthese/;

sub check {
my ($orig, $target) = @_;
my %c;
++$c{$_} for split //, $orig;
--$c{$_} for split //, $target;
return ! grep $_, values %c;
}

sub check_sort {
my ($orig, $target) = @_;
my $o = join '', sort split //, $orig;
my $t = join '', sort split //, $target;
return $o eq $t;
}

my $len = shift;
my $key = join '', map { ['a'..'z']->[rand 26] } 1 .. $len;

cmpthese(-2, {
histo => sub {
check($key, scalar reverse $key);
},
sort => sub {
check_sort($key, scalar reverse $key);
},
});

$len = 20:

Rate histo sort
histo 8029/s -- -19%
sort 9962/s 24% --

$len = 50:

Rate histo sort
histo 3600/s -- -10%
sort 4015/s 12% --

$len = 100:

Rate histo sort
histo 1912/s -- -3%
sort 1981/s 4% --

$len = 200:

Rate sort histo
sort 972/s -- -5%
histo 1018/s 5% --

Aha! So the words need to be unrealistically long in order for the
linear method to win. Which says quite something about the efficiency of
perl's sort implementations. Of course, check() could be made to return
earlier, for instance when a negative value shows up in the second
for-loop. The same is true for the final grep().

Still, for real-world words I suspect using sort() is still a very
efficient (both coding- and runtime-wise) solution.

Tassilo

Anno Siegel · Apr 13, 2005

Tassilo v. Parseval said:
Also sprach Anno Siegel:

Also sprach A. Sinan Unur:

(e-mail address removed)-state.edu (Michael T. Davis) wrote in
news:[email protected]:

Click to expand...

[anagram detection by counting letters]

A faster solution appears to involve sort(): split both strings, sort
them and compare for equality. According to a benchmark:

Click to expand...

[benchmark snipped]

Interesting, since counting is linear and sorting is n*log n. Presumably,
with huge words, counting would win in the end, but there probably
never was a language (not even German) with words long enough to bring
out the difference.

Click to expand...

Altering the benchmark a little so that we can change the length of the
words more easily:
[shortened]

$len = 100:

Rate histo sort
histo 1912/s -- -3%
sort 1981/s 4% --

$len = 200:

Rate sort histo
sort 972/s -- -5%
histo 1018/s 5% --

Aha! So the words need to be unrealistically long in order for the
linear method to win.

I wouldn't have been amazed to find the crossover length even higher,
at 1000 or so.

Which says quite something about the efficiency of
perl's sort implementations. Of course, check() could be made to return
earlier, for instance when a negative value shows up in the second
for-loop. The same is true for the final grep().

List::Util::first is the grep replacement for that. How much it
saves depends heavily on the distribution of the strings. If
strings vary wildly, it can save a lot, if most comparisons are
for almost-anagrams it won't save so much.

Still, for real-world words I suspect using sort() is still a very
efficient (both coding- and runtime-wise) solution.

In one implementation I used byte vectors for letter counting, a la

embag {
my $bag = '';
++ vec( $bag, ord $_, 8) for split //, shift;
$bag;
}

which is easily Inline-able. Equality of counts is 'eq', like with
sorting. The count vectors can be compacted using another level of
indirection ( $charno[ ord $_] instead of ord $_), which is still
easily Inlined. Sorting was no option for the application, so I never
benchmarked against it, but I'd expect the Inlined code to be in the
same ballpark, even for short strings.

Anno

Ilmari Karonen · Apr 13, 2005

Anno Siegel said:
A. Sinan Unur said:

(e-mail address removed)-state.edu (Michael T. Davis) wrote in
news:[email protected]:

Click to expand...

[how to test for anagrams]

I am sure someone will show a regex solution that I have overlooked.

Click to expand...

A regex solution seems unlikely. It would require jumping back and
forth in a string while keeping track of what was matched where.
Regexes aren't very good at that.

Nonetheless, here's a regex solution:

sub anagram_re {
my $word = shift;
return "" if $word eq "";
my (@re, %seen);
foreach my $i (0 .. length($word)-1) {
my $temp = $word;
my $ch = substr($temp, $i, 1, "");
next if $seen{$ch}++;
push @re, quotemeta($ch) . anagram_re($temp);
}
return @re > 1 ? "(?:".join("|", @re).")" : $re[0];
}

Give it a word, and it will return a regex to match any anagram of it.
For example, here's the regex for "food" (sans "?:" modifiers):

(f(o(od|do)|doo)|o(f(od|do)|o(fd|df)|d(fo|of))|d(foo|o(fo|of)))

The corresponding regex for "gremlin" is 33218 characters long with
the "?:" modifiers, or 25978 without them.

Anno Siegel · Apr 13, 2005

Ilmari Karonen said:
Anno Siegel said:

A. Sinan Unur said:

(e-mail address removed)-state.edu (Michael T. Davis) wrote in
news:[email protected]:

Click to expand...

[how to test for anagrams]

I am sure someone will show a regex solution that I have overlooked.

Click to expand...

A regex solution seems unlikely. It would require jumping back and
forth in a string while keeping track of what was matched where.
Regexes aren't very good at that.

Click to expand...

Nonetheless, here's a regex solution:

sub anagram_re {
my $word = shift;
return "" if $word eq "";
my (@re, %seen);
foreach my $i (0 .. length($word)-1) {
my $temp = $word;
my $ch = substr($temp, $i, 1, "");
next if $seen{$ch}++;
push @re, quotemeta($ch) . anagram_re($temp);
}
return @re > 1 ? "(?:".join("|", @re).")" : $re[0];
}

Give it a word, and it will return a regex to match any anagram of it.
For example, here's the regex for "food" (sans "?:" modifiers):

(f(o(od|do)|doo)|o(f(od|do)|o(fd|df)|d(fo|of))|d(foo|o(fo|of)))

Nice hack...

The corresponding regex for "gremlin" is 33218 characters long with
the "?:" modifiers, or 25978 without them.

....and mostly useless. I like it

Anno

Ilmari Karonen · Apr 14, 2005

Anno Siegel said:
Nice hack...

...and mostly useless. I like it

It does, however, have one advantage -- it's fast. Really fast. Over
an order of magnitude faster than any other solution in this thread so
far, in fact.

Of course, that's only if you ignore the time to build and compile the
regex. And it only works for fairly short words anyway.

But if you want a solution that both runs _and_ starts fast, here's
something adapted from an earlier thread titled "perl scramble":

#!/usr/bin/perl -w
use strict;

my $word = shift;
my $canon = join "", sort split //, "$word\n";

my $code = q{
while (<>) {
print if length == length $canon
and !tr/LETTERS//c
and $canon eq join "", sort split //;
}
};
$code =~ s/LETTERS/\Q$canon/;
eval $code; die if $@;

This assumes input comes from a file (or stdin), but it can be easily
modified to, say, grep an array.

Anno Siegel · Apr 14, 2005

Ilmari Karonen said:
Anno Siegel said:

Nice hack...

Click to expand...

[...]

But if you want a solution that both runs _and_ starts fast, here's
something adapted from an earlier thread titled "perl scramble":

#!/usr/bin/perl -w
use strict;

my $word = shift;
my $canon = join "", sort split //, "$word\n";

my $code = q{
while (<>) {
print if length == length $canon
and !tr/LETTERS//c
and $canon eq join "", sort split //;
}
};
$code =~ s/LETTERS/\Q$canon/;
eval $code; die if $@;

This assumes input comes from a file (or stdin), but it can be easily
modified to, say, grep an array.

Ah, that's basically the "sort-solution", but the length and tr/// tests
speed it up. Sorting only happens when a word is entirely made of the
same letters, but in different numbers (with the same total). That helps
a lot in typical situations when most candidates are not anagrams. Nifty.

Anno

Tassilo v. Parseval · Apr 14, 2005

Also sprach Abigail:

Tassilo v. Parseval ([email protected]) wrote on
MMMMCCXLIII September MCMXCIII in <URL:|| Also sprach Michael T. Davis:
||
|| > Just to be clear, I'm looking for a regex-based mechanism that will
|| > work within the confines of "m/.../". I would imagine it's going to need to
|| > rely on the "(${code})" construct.
||
|| Most likely even (??{CODE}). However, any of my attempts so far ended up
|| in a segmentation fault or 'panic: '. I knew that some of these extended
|| patterns are flagged as experimental but I didn't expect them to be that
|| fragile. It's tricky enough coming up with a pure regex solution but
|| here you'll also need to find one that wont crash perl. So I wouldn't
|| bother.

#!/usr/bin/perl

use strict;
use warnings;
no warnings qw /syntax/;

my $word = "gremlin";
my $ana = "nlmregi";

my (%h);
print $word =~
/^(?{%h = ()})
(?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
(?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
$(?(?{grep {$_} values %h}})(?!)|)/x ? "match\n" : "no match\n";

Hmmh, this doesn't compile:

Sequence (?{...}) not terminated or not {}-balanced in regex; marked by <-- HERE in m/^(?{%h = ()})
(?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
(?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
$(?(?{ <-- HERE grep {$_} values %h}})(?!)|)/ at - line 13.

I can get rid of these by strategically inserting a few spaces here and
there, but then it eventually complains about an "Unknown switch
condition".

Tassilo

Fabian Pilkowski · Apr 14, 2005

* Tassilo v. Parseval said:
Also sprach Abigail: ^^

Hmmh, this doesn't compile:

Delete one of those marked curly parentheses and it'll work fine.

regards,
fabian

Tassilo v. Parseval · Apr 15, 2005

Also sprach Abigail:

#!/usr/bin/perl

use strict;
use warnings;
no warnings qw /syntax/;

my $word = "gremlin";
my $ana = "nlmregi";

my (%h);
print $word =~
/^(?{%h = ()})
(?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
(?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
$(?(?{grep {$_} values %h})(?!)|)/x
? "match\n" : "no match\n";

Indeed, this is much better. Interestingly enough, it stops working when
using split:

use strict;
use warnings;
no warnings qw /syntax/;

my $word = "gremlin";
my $ana = "nlmregi";

my (%h);
print $word =~ m#
^(?{%h = ()})
(?{$h {$_} ++ for split //, $ana})
(?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
$(?(?{grep {$_} values %h})(?!)|)#x
? "match\n" : "no match\n";

As far as I see it, this code should be functionally equivalent to
yours. Probably these extended patterns don't work too well when another
pattern match happens inside.

Tassilo

Word matching with specific parameters	1	Jan 26, 2025
Adobe Acrobat JavaScript PDF Script Issues: File Matching and Dynamic Retrieval	0	Nov 29, 2024
Matching words and letters	7	Dec 8, 2004
FAQ 6.2 I'm having trouble matching over more than one line. What's wrong?	0	Jan 30, 2011
find words that contains some specific letters	60	Jun 1, 2009
FAQ 6.20 What good is "\G" in a regular expression?	0	Mar 3, 2011
A script to flag commonly misused words	2	Jul 31, 2007
[SUMMARY] Rhyming Words (#195)	0	Mar 15, 2009

Matching mixed up words

Michael T. Davis

A. Sinan Unur

thundergnat

xhoster

A. Sinan Unur

Tassilo v. Parseval

A. Sinan Unur

Michael T. Davis

Tassilo v. Parseval

Anno Siegel

Anno Siegel

Tassilo v. Parseval

Anno Siegel

Ilmari Karonen

Anno Siegel

Ilmari Karonen

Anno Siegel

Tassilo v. Parseval

Fabian Pilkowski

Tassilo v. Parseval

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads