Matching mixed up words

M

Michael T. Davis

Say I want to match "gremlin" or the letters that compose the word
"gremlin", but in any order. Note that once "g" is consumed, the set of
available letters no longer includes "g". (Also, "g" isn't necessarily
going to be the first letter.) I would anticipate that a proper solution
for a word of <N> letters would approach a complexity (or "big O") of N!
(read "N factorial"). Is there a solution which could be implemented as
a single match, or would this require some extra code around a match?

Thanks,
Mike
 
A

A. Sinan Unur

(e-mail address removed)-state.edu (Michael T. Davis) wrote in
Say I want to match "gremlin" or the letters that compose the
word "gremlin", but in any order. Note that once "g" is consumed,
the set of available letters no longer includes "g". (Also, "g" isn't
necessarily going to be the first letter.) I would anticipate that a
proper solution for a word of <N> letters would approach a complexity
(or "big O") of N! (read "N factorial").

You are too pessimistic :)
Is there a solution which could be implemented as a single match,
or would this require some extra code around a match?

I don't see any mention of regexes in your post. I am not sure if that
is what you are after. There is a simple solution to this that falls
directly from your explanation of the problem:

use strict;
use warnings;

use Data::Dumper;

sub check {
my ($orig, $target) = @_;

my %c;

use integer;
my @l = split //, $orig;
++$c{$_} for @l;

@l = split //, $target;
for (@l) {
if(exists($c{$_}) and $c{$_}) {
--$c{$_};
}
}

@l = grep { $_ > 0 } values %c;
scalar @l ? 0 : 1;
}

my %check = (
sinan => [ 'nasin', 'nasina', 'lasin' ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);

for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
print "$k matches $t?: ";
if(check($k, $t)) {
print "Yes\n";
} else {
print "No\n";
}
}
}


__END__


I am sure someone will show a regex solution that I have overlooked.

By the way, your signature is not formatted properly:

The proper signature marker is two dashes followed by a space and a
newline. Please do use that.

Sinan
 
T

thundergnat

Michael said:
Say I want to match "gremlin" or the letters that compose the word
"gremlin", but in any order. Note that once "g" is consumed, the set of
available letters no longer includes "g". (Also, "g" isn't necessarily
going to be the first letter.) I would anticipate that a proper solution
for a word of <N> letters would approach a complexity (or "big O") of N!
(read "N factorial"). Is there a solution which could be implemented as
a single match, or would this require some extra code around a match?

I'm sure it could be done more efficiently but it was an interesting
little diversion. I wandered a little from the OPs spec since I am
ignoring spaces, punctuation and case, I guess.


use warnings;
use strict;

my $phrase = 'George W. Bush';

my %letters;

for (split//, $phrase){
$letters{lc($_)}++ if /[a-zA-Z]/;
}

while (<DATA>){
chomp (my $test_phrase = $_);
my $no_match;
my %testhash = %letters;
for (split//, $test_phrase){
if (/[a-zA-Z]/){
if (--$testhash{lc($_)} < 0){
$no_match++;
last;
}
}
}
for (values %testhash){
last if $no_match;
if ($_ < 0){
$no_match++;
}
}
print "Phrase \"$test_phrase\" ".($no_match ?
'does not match' : 'matches')." $phrase.\n";
}

__DATA__
NOT A MATCH
SHRUB EGG WOE
BUG GORE HEWS
GOB SEWER HUG
WEB USER GOGH
RUBES EGG WHO
BUG GREW HOSE
WHOSE BUGGER
BEG WORSE UGH
A BOGUS ENTRY
 
X

xhoster

Say I want to match "gremlin" or the letters that compose the
word "gremlin", but in any order. Note that once "g" is consumed, the
set of available letters no longer includes "g". (Also, "g" isn't
necessarily going to be the first letter.) I would anticipate that a
proper solution for a word of <N> letters would approach a complexity (or
"big O") of N! (read "N factorial"). Is there a solution which could be
implemented as a single match, or would this require some extra code
around a match?

canon("gremlin") eq canon($foo) or die;

sub canon {
join "", sort split //, $_[0];
};

Xho
 
A

A. Sinan Unur

Say I want to match "gremlin" or the letters that compose the
word "gremlin", but in any order match?
....

canon("gremlin") eq canon($foo) or die;

sub canon {
join "", sort split //, $_[0];
};

That's what I call the power of a clear mind :)

Simple and elegant.

Sinan
 
T

Tassilo v. Parseval

Also sprach A. Sinan Unur:
(e-mail address removed)-state.edu (Michael T. Davis) wrote in
Say I want to match "gremlin" or the letters that compose the
word "gremlin", but in any order. Note that once "g" is consumed,
the set of available letters no longer includes "g". (Also, "g" isn't
necessarily going to be the first letter.) I would anticipate that a
proper solution for a word of <N> letters would approach a complexity
(or "big O") of N! (read "N factorial").

You are too pessimistic :)
Is there a solution which could be implemented as a single match,
or would this require some extra code around a match?

I don't see any mention of regexes in your post. I am not sure if that
is what you are after. There is a simple solution to this that falls
directly from your explanation of the problem:

use strict;
use warnings;

use Data::Dumper;

sub check {
my ($orig, $target) = @_;

my %c;

use integer;
my @l = split //, $orig;
++$c{$_} for @l;

@l = split //, $target;
for (@l) {
if(exists($c{$_}) and $c{$_}) {
--$c{$_};
}
}

@l = grep { $_ > 0 } values %c;
scalar @l ? 0 : 1;
}

my %check = (
sinan => [ 'nasin', 'nasina', 'lasin' ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);

for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
print "$k matches $t?: ";
if(check($k, $t)) {
print "Yes\n";
} else {
print "No\n";
}
}
}

A faster solution appears to involve sort(): split both strings, sort
them and compare for equality. According to a benchmark:

use strict;
use Benchmark qw/cmpthese/;

sub check {
my ($orig, $target) = @_;
my %c;
use integer;
my @l = split //, $orig;
++$c{$_} for @l;
@l = split //, $target;
for (@l) {
if(exists($c{$_}) and $c{$_}) {
--$c{$_};
}
}
@l = grep { $_ > 0 } values %c;
scalar @l ? 0 : 1;
}

sub check_sort {
my ($orig, $target) = @_;
my $o = join '', sort split //, $orig;
my $t = join '', sort split //, $target;
return $o eq $t;
}

my %check = (
sinan => [ 'nasin', 'nasina', 'lasin', ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);

cmpthese(-2, {
histo => sub {
for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
check($k, $t);
}
}
},
sort => sub {
for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
check_sort($k, $t);
}
}
},
});
__END__
Rate histo sort
histo 2280/s -- -43%
sort 3992/s 75% --

This might however be due to a denser implementation of check_sort()
avoiding temporary variables etc.

Also, check_sort() is more correct as it wont falsely report 'sinan' and
'nasina' as matching, which check() does. ;-) I'd write check() thusly:

sub check {
my ($orig, $target) = @_;
my %c;
++$c{$_} for split //, $orig;
--$c{$_} for split //, $target;
return ! grep $_, values %c;
}

This is still slower by roughly 25% than using sort. The 'use integer'
appears to have no effect on the benchmark.

Tassilo
 
A

A. Sinan Unur

sub check_sort {
my ($orig, $target) = @_;
my $o = join '', sort split //, $orig;
my $t = join '', sort split //, $target;
return $o eq $t;
} ....

Also, check_sort() is more correct as it wont falsely report 'sinan'
and 'nasina' as matching, which check() does. ;-)

And to think that I actually look at the output, and somehow did not
notice my error. Thank you for catching that.

Sinan
 
M

Michael T. Davis

Just to be clear, I'm looking for a regex-based mechanism that will
work within the confines of "m/.../". I would imagine it's going to need to
rely on the "(${code})" construct.

BTW, my signature includes a trailing space at the end of the first
line, but the gateway I'm using apparently strips it off. I have alerted
them to the mistake.

Regards,
Mike
 
T

Tassilo v. Parseval

Also sprach Michael T. Davis:
Just to be clear, I'm looking for a regex-based mechanism that will
work within the confines of "m/.../". I would imagine it's going to need to
rely on the "(${code})" construct.

Most likely even (??{CODE}). However, any of my attempts so far ended up
in a segmentation fault or 'panic: '. I knew that some of these extended
patterns are flagged as experimental but I didn't expect them to be that
fragile. It's tricky enough coming up with a pure regex solution but
here you'll also need to find one that wont crash perl. So I wouldn't
bother.

Tassilo
 
A

Anno Siegel

A. Sinan Unur said:
(e-mail address removed)-state.edu (Michael T. Davis) wrote in
news:[email protected]:

[how to test for anagrams]
use strict;
use warnings;

use Data::Dumper;

sub check {
my ($orig, $target) = @_;

my %c;

use integer;
my @l = split //, $orig;
++$c{$_} for @l;

@l = split //, $target;
for (@l) {
if(exists($c{$_}) and $c{$_}) {
--$c{$_};
}
}

@l = grep { $_ > 0 } values %c;
scalar @l ? 0 : 1;
}

my %check = (
sinan => [ 'nasin', 'nasina', 'lasin' ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);

for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
print "$k matches $t?: ";
if(check($k, $t)) {
print "Yes\n";
} else {
print "No\n";
}
}
}


__END__


I am sure someone will show a regex solution that I have overlooked.

A regex solution seems unlikely. It would require jumping back and
forth in a string while keeping track of what was matched where.
Regexes aren't very good at that.

Using a hash for counting is just fine. It is basically a well known
data structure that implements what has been called "bags". Bags are
like sets, but each element (a hash key) can be contained multiple times
(the hash value). Containment and equality of bags are defined in the
obvious way. Then, to check if two strings are anagrams, create the
corresponding bags and test for equality. Code:

my %check = (
sinan => [ 'nasin', 'nasina', 'lasin' ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);

for my $k (keys %check) {
my $bk = Bag->embag( $k);
for my $t (@{ $check{$k} }) {
print "$k matches $t?: ";
print $bk eq Bag->embag( $t) ? "Yes\n" : "No\n";
}
}

#########################################################################

package Bag;

sub embag { # create a bag of letters from a string
my $class = shift;
my %bag;
$bag{ $_} ++ for split //, shift;
bless \ %bag, $class;
}

sub contained {
my ( $b1, $b2) = @_;
$b2->{ $_} and $b1->{ $_} > $b2->{ $_} and return 0 for keys %$b1;
1;
}

use overload(
le => 'contained',
eq => sub { $_[ 0] le $_[ 1] and $_[ 1] le $_[ 0] },
);

__END__

Anno
 
A

Anno Siegel

Tassilo v. Parseval said:
Also sprach A. Sinan Unur:

[anagram detection by counting letters]
A faster solution appears to involve sort(): split both strings, sort
them and compare for equality. According to a benchmark:

[benchmark snipped]

Interesting, since counting is linear and sorting is n*log n. Presumably,
with huge words, counting would win in the end, but there probably
never was a language (not even German) with words long enough to bring
out the difference.

Anno
 
T

Tassilo v. Parseval

Also sprach Anno Siegel:
Tassilo v. Parseval said:
Also sprach A. Sinan Unur:

[anagram detection by counting letters]
A faster solution appears to involve sort(): split both strings, sort
them and compare for equality. According to a benchmark:

[benchmark snipped]

Interesting, since counting is linear and sorting is n*log n. Presumably,
with huge words, counting would win in the end, but there probably
never was a language (not even German) with words long enough to bring
out the difference.

Altering the benchmark a little so that we can change the length of the
words more easily:

use Benchmark qw/cmpthese/;

sub check {
my ($orig, $target) = @_;
my %c;
++$c{$_} for split //, $orig;
--$c{$_} for split //, $target;
return ! grep $_, values %c;
}

sub check_sort {
my ($orig, $target) = @_;
my $o = join '', sort split //, $orig;
my $t = join '', sort split //, $target;
return $o eq $t;
}

my $len = shift;
my $key = join '', map { ['a'..'z']->[rand 26] } 1 .. $len;

cmpthese(-2, {
histo => sub {
check($key, scalar reverse $key);
},
sort => sub {
check_sort($key, scalar reverse $key);
},
});

$len = 20:

Rate histo sort
histo 8029/s -- -19%
sort 9962/s 24% --

$len = 50:

Rate histo sort
histo 3600/s -- -10%
sort 4015/s 12% --

$len = 100:

Rate histo sort
histo 1912/s -- -3%
sort 1981/s 4% --

$len = 200:

Rate sort histo
sort 972/s -- -5%
histo 1018/s 5% --

Aha! So the words need to be unrealistically long in order for the
linear method to win. Which says quite something about the efficiency of
perl's sort implementations. Of course, check() could be made to return
earlier, for instance when a negative value shows up in the second
for-loop. The same is true for the final grep().

Still, for real-world words I suspect using sort() is still a very
efficient (both coding- and runtime-wise) solution.

Tassilo
 
A

Anno Siegel

Tassilo v. Parseval said:
Also sprach Anno Siegel:
Also sprach A. Sinan Unur:

(e-mail address removed)-state.edu (Michael T. Davis) wrote in
news:[email protected]:

[anagram detection by counting letters]
A faster solution appears to involve sort(): split both strings, sort
them and compare for equality. According to a benchmark:

[benchmark snipped]

Interesting, since counting is linear and sorting is n*log n. Presumably,
with huge words, counting would win in the end, but there probably
never was a language (not even German) with words long enough to bring
out the difference.

Altering the benchmark a little so that we can change the length of the
words more easily:
[shortened]

$len = 100:

Rate histo sort
histo 1912/s -- -3%
sort 1981/s 4% --

$len = 200:

Rate sort histo
sort 972/s -- -5%
histo 1018/s 5% --

Aha! So the words need to be unrealistically long in order for the
linear method to win.

I wouldn't have been amazed to find the crossover length even higher,
at 1000 or so.
Which says quite something about the efficiency of
perl's sort implementations. Of course, check() could be made to return
earlier, for instance when a negative value shows up in the second
for-loop. The same is true for the final grep().

List::Util::first is the grep replacement for that. How much it
saves depends heavily on the distribution of the strings. If
strings vary wildly, it can save a lot, if most comparisons are
for almost-anagrams it won't save so much.
Still, for real-world words I suspect using sort() is still a very
efficient (both coding- and runtime-wise) solution.

In one implementation I used byte vectors for letter counting, a la

embag {
my $bag = '';
++ vec( $bag, ord $_, 8) for split //, shift;
$bag;
}

which is easily Inline-able. Equality of counts is 'eq', like with
sorting. The count vectors can be compacted using another level of
indirection ( $charno[ ord $_] instead of ord $_), which is still
easily Inlined. Sorting was no option for the application, so I never
benchmarked against it, but I'd expect the Inlined code to be in the
same ballpark, even for short strings.

Anno
 
I

Ilmari Karonen

Anno Siegel said:
A. Sinan Unur said:
(e-mail address removed)-state.edu (Michael T. Davis) wrote in
news:[email protected]:

[how to test for anagrams]
I am sure someone will show a regex solution that I have overlooked.

A regex solution seems unlikely. It would require jumping back and
forth in a string while keeping track of what was matched where.
Regexes aren't very good at that.

Nonetheless, here's a regex solution:

sub anagram_re {
my $word = shift;
return "" if $word eq "";
my (@re, %seen);
foreach my $i (0 .. length($word)-1) {
my $temp = $word;
my $ch = substr($temp, $i, 1, "");
next if $seen{$ch}++;
push @re, quotemeta($ch) . anagram_re($temp);
}
return @re > 1 ? "(?:".join("|", @re).")" : $re[0];
}

Give it a word, and it will return a regex to match any anagram of it.
For example, here's the regex for "food" (sans "?:" modifiers):

(f(o(od|do)|doo)|o(f(od|do)|o(fd|df)|d(fo|of))|d(foo|o(fo|of)))

The corresponding regex for "gremlin" is 33218 characters long with
the "?:" modifiers, or 25978 without them.
 
A

Anno Siegel

Ilmari Karonen said:
Anno Siegel said:
A. Sinan Unur said:
(e-mail address removed)-state.edu (Michael T. Davis) wrote in
news:[email protected]:

[how to test for anagrams]
I am sure someone will show a regex solution that I have overlooked.

A regex solution seems unlikely. It would require jumping back and
forth in a string while keeping track of what was matched where.
Regexes aren't very good at that.

Nonetheless, here's a regex solution:

sub anagram_re {
my $word = shift;
return "" if $word eq "";
my (@re, %seen);
foreach my $i (0 .. length($word)-1) {
my $temp = $word;
my $ch = substr($temp, $i, 1, "");
next if $seen{$ch}++;
push @re, quotemeta($ch) . anagram_re($temp);
}
return @re > 1 ? "(?:".join("|", @re).")" : $re[0];
}

Give it a word, and it will return a regex to match any anagram of it.
For example, here's the regex for "food" (sans "?:" modifiers):

(f(o(od|do)|doo)|o(f(od|do)|o(fd|df)|d(fo|of))|d(foo|o(fo|of)))

Nice hack...
The corresponding regex for "gremlin" is 33218 characters long with
the "?:" modifiers, or 25978 without them.

....and mostly useless. I like it :)

Anno
 
I

Ilmari Karonen

Anno Siegel said:
Nice hack...


...and mostly useless. I like it :)

It does, however, have one advantage -- it's fast. Really fast. Over
an order of magnitude faster than any other solution in this thread so
far, in fact.

Of course, that's only if you ignore the time to build and compile the
regex. And it only works for fairly short words anyway.

But if you want a solution that both runs _and_ starts fast, here's
something adapted from an earlier thread titled "perl scramble":

#!/usr/bin/perl -w
use strict;

my $word = shift;
my $canon = join "", sort split //, "$word\n";

my $code = q{
while (<>) {
print if length == length $canon
and !tr/LETTERS//c
and $canon eq join "", sort split //;
}
};
$code =~ s/LETTERS/\Q$canon/;
eval $code; die if $@;

This assumes input comes from a file (or stdin), but it can be easily
modified to, say, grep an array.
 
A

Anno Siegel

Ilmari Karonen said:
Anno Siegel said:
Nice hack...
[...]

But if you want a solution that both runs _and_ starts fast, here's
something adapted from an earlier thread titled "perl scramble":

#!/usr/bin/perl -w
use strict;

my $word = shift;
my $canon = join "", sort split //, "$word\n";

my $code = q{
while (<>) {
print if length == length $canon
and !tr/LETTERS//c
and $canon eq join "", sort split //;
}
};
$code =~ s/LETTERS/\Q$canon/;
eval $code; die if $@;

This assumes input comes from a file (or stdin), but it can be easily
modified to, say, grep an array.

Ah, that's basically the "sort-solution", but the length and tr/// tests
speed it up. Sorting only happens when a word is entirely made of the
same letters, but in different numbers (with the same total). That helps
a lot in typical situations when most candidates are not anagrams. Nifty.

Anno
 
T

Tassilo v. Parseval

Also sprach Abigail:
Tassilo v. Parseval ([email protected]) wrote on
MMMMCCXLIII September MCMXCIII in <URL:|| Also sprach Michael T. Davis:
||
|| > Just to be clear, I'm looking for a regex-based mechanism that will
|| > work within the confines of "m/.../". I would imagine it's going to need to
|| > rely on the "(${code})" construct.
||
|| Most likely even (??{CODE}). However, any of my attempts so far ended up
|| in a segmentation fault or 'panic: '. I knew that some of these extended
|| patterns are flagged as experimental but I didn't expect them to be that
|| fragile. It's tricky enough coming up with a pure regex solution but
|| here you'll also need to find one that wont crash perl. So I wouldn't
|| bother.


#!/usr/bin/perl

use strict;
use warnings;
no warnings qw /syntax/;

my $word = "gremlin";
my $ana = "nlmregi";

my (%h);
print $word =~
/^(?{%h = ()})
(?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
(?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
$(?(?{grep {$_} values %h}})(?!)|)/x ? "match\n" : "no match\n";

Hmmh, this doesn't compile:

Sequence (?{...}) not terminated or not {}-balanced in regex; marked by <-- HERE in m/^(?{%h = ()})
(?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
(?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
$(?(?{ <-- HERE grep {$_} values %h}})(?!)|)/ at - line 13.

I can get rid of these by strategically inserting a few spaces here and
there, but then it eventually complains about an "Unknown switch
condition".

Tassilo
 
F

Fabian Pilkowski

* Tassilo v. Parseval said:
Also sprach Abigail: ^^

Hmmh, this doesn't compile:

Delete one of those marked curly parentheses and it'll work fine.

regards,
fabian
 
T

Tassilo v. Parseval

Also sprach Abigail:
#!/usr/bin/perl

use strict;
use warnings;
no warnings qw /syntax/;

my $word = "gremlin";
my $ana = "nlmregi";

my (%h);
print $word =~
/^(?{%h = ()})
(?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
(?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
$(?(?{grep {$_} values %h})(?!)|)/x
? "match\n" : "no match\n";

Indeed, this is much better. Interestingly enough, it stops working when
using split:

use strict;
use warnings;
no warnings qw /syntax/;

my $word = "gremlin";
my $ana = "nlmregi";

my (%h);
print $word =~ m#
^(?{%h = ()})
(?{$h {$_} ++ for split //, $ana})
(?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
$(?(?{grep {$_} values %h})(?!)|)#x
? "match\n" : "no match\n";

As far as I see it, this code should be functionally equivalent to
yours. Probably these extended patterns don't work too well when another
pattern match happens inside.

Tassilo
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top