Pattern Searching

Shiraz · Jun 22, 2005

This is my goal, if i have an input file:

BEGIN----------
Smith
Johnathan
Smiths
Mona
Moynaham
END------------

i want to generate a minimum match so that with the minimum match
string, each entry can be identified. the output would be:

BEGIN----------
Smith - Smith
Smiths - Smiths or hs
Mona - Mon or ona
Moynaham - Moy or nah or aha or ham
END------------

Is there any built-in function to do such analysis in perl?

A. Sinan Unur · Jun 23, 2005

Shiraz said:
This is my goal, if i have an input file:

BEGIN----------
Smith
Johnathan
Smiths
Mona
Moynaham
END------------

i want to generate a minimum match so that with the minimum match
string, each entry can be identified. the output would be:

BEGIN----------
Smith - Smith
Smiths - Smiths or hs
Mona - Mon or ona
Moynaham - Moy or nah or aha or ham
END------------

Is there any built-in function to do such analysis in perl?

No.

But you can always browse through CPAN to see if someone has written a
module to help with this task.

Sinan

Arne Ruhnau · Jun 23, 2005

Shiraz said:
This is my goal, if i have an input file:

BEGIN----------
Smith
Johnathan
Smiths
Mona
Moynaham
END------------

i want to generate a minimum match so that with the minimum match
string, each entry can be identified. the output would be:

BEGIN----------
Smith - Smith
Smiths - Smiths or hs
Mona - Mon or ona
Moynaham - Moy or nah or aha or ham
END------------

Is there any built-in function to do such analysis in perl?

Hmm... my first idea would be to, for every entry, extract every possible
2..length($entry) - Gram (i.e. each sequence of 2..length($entry) letters)
and count them.
For Smith, this would become
Sm, mi, it, th, Smi, mit, ith, Smit, mith, Smith
and for Smiths
Sm, mi, it, th, hs, Smi, mit, ith, ths, Smit, mith, iths, Smith, miths, Smiths

(thus, there are more unique-to-the-entry substrings than you listed above)

Of course, one would have to remember which entry generated the ngram just
counted. Afterwards, you can grep all those from the resulting tree whose
terminal nodes (i.e. counts) are exactly one, i.e. which were unique to the
associated word.
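
The idea above can be sketched roughly as follows. This is only an illustration, not code posted in the thread: the helper name unique_ngrams is hypothetical, and the word list is the one from the original question. It counts each n-gram once per word and remembers the word that produced it.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref mapping every n-gram of
# length 2 .. length($word) that occurs in exactly one word to that word.
sub unique_ngrams {
    my @words = @_;
    my (%count, %owner);
    for my $word (@words) {
        my %seen;    # count each n-gram at most once per word
        for my $len (2 .. length $word) {
            for my $start (0 .. length($word) - $len) {
                $seen{ substr $word, $start, $len } = 1;
            }
        }
        for my $gram (keys %seen) {
            $count{$gram}++;
            $owner{$gram} = $word;    # only meaningful while the count stays at 1
        }
    }
    my %unique;
    for my $gram (keys %count) {
        $unique{$gram} = $owner{$gram} if $count{$gram} == 1;
    }
    return \%unique;
}

my $unique = unique_ngrams(qw/Smith Johnathan Smiths Mona Moynaham/);
print "$_ => $unique->{$_}\n" for sort keys %$unique;
```

With this list, "hs" maps to Smiths and "Mon" maps to Mona, while Smith gets no entry at all, because every one of its n-grams also occurs in Smiths.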

hth,

Arne Ruhnau

Anno Siegel · Jun 23, 2005

Arne Ruhnau said:
Hmm... my first idea would be to, for every entry, extract every possible
2..length($entry) - Gram (i.e. each sequence of 2..length($entry) letters)

Why only strings of length 2 or more? "y" uniquely identifies "Moynaham"
in the list above.

and count them.

For Smith, this would become
Sm, mi, it, th, Smi, mit, ith, Smit, mith, Smith
and for Smiths
Sm, mi, it, th, hs, Smi, mit, ith, ths, Smit, mith, iths, Smith, miths, Smiths

(thus, there are more unique-to-the-entry substrings than you listed above)

Of course, one would have to remember which entry generated the ngram just
counted. Afterwards, you can grep all those from the resulting tree whose
terminal nodes (i.e. counts) are exactly one, i.e. which were unique to the
associated word.

If you know the substring is unique (i.e. is a substring of only one of
the names) you don't *have* to remember which name generated which
substring. You can find it by looking which of the names contains the
substring.
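
Both points can be illustrated with a small sketch. It is not code from the thread, and the helper names unique_substrings and owner_of are made up for the example. Substrings of length 1 are included, no origin is recorded while counting, and the owner is recovered afterwards with index.

```perl
use strict;
use warnings;

# Hypothetical helper: every substring (length 1 and up) that is
# contained in exactly one of the given names.
sub unique_substrings {
    my @names = @_;
    my %names_containing;
    for my $name (@names) {
        my %seen;    # distinct substrings of this name
        for my $start (0 .. length($name) - 1) {
            for my $len (1 .. length($name) - $start) {
                $seen{ substr $name, $start, $len } = 1;
            }
        }
        $names_containing{$_}++ for keys %seen;
    }
    return grep { $names_containing{$_} == 1 } keys %names_containing;
}

# Hypothetical helper: find the owner afterwards by scanning the names.
sub owner_of {
    my ($substring, @names) = @_;
    my ($owner) = grep { index($_, $substring) >= 0 } @names;
    return $owner;
}

my @names = qw/Smith Johnathan Smiths Mona Moynaham/;
my %is_unique = map { $_ => 1 } unique_substrings(@names);
print "y identifies ", owner_of('y', @names), "\n" if $is_unique{y};
```

The scan in owner_of costs one pass over the list per query, which is fine for occasional lookups but not for many of them.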

Anno

Arne Ruhnau · Jun 23, 2005

Anno said:
Why only strings of length 2 or more? "y" uniquely identifies "Moynaham"
in the list above.

Well, I thought unique unigrams could be too sparse / non-existent, so that
there is no point in looking at them anyway. But for completeness/maximally
short unique substrings, they should be included.

If you know the substring is unique (i.e. is a substring of only one of
the names) you don't *have* to remember which name generated which
substring. You can find it by looking which of the names contains the
substring.

Correct, hadn't thought of that... But if you want to make some kind of
look-up based on the unique substring, returning the string identified by
it, you have to know it (to prevent yourself from scanning the original
list). Or am I getting it wrong?

Arne

Anno Siegel · Jun 23, 2005

Arne Ruhnau said:
[...]

If you know the substring is unique (i.e. is a substring of only one of
the names) you don't *have* to remember which name generated which
substring. You can find it by looking which of the names contains the
substring.

Correct, hadn't thought of that... But if you want to make some kind of
look-up based on the unique substring, returning the string identified by
it, you have to know it (to prevent yourself from scanning the original
list). Or am I getting it wrong?

Sure, you probably will eventually build a lookup table for the unique
substrings, but you don't have to keep track of each substring's origin
while you find them.
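
A two-pass version of that might look like the sketch below. It is an illustration only, with hypothetical helper names (substrings_of, build_lookup): the first pass just counts, and the lookup table is filled in a second pass once the counts are known. The // operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: the distinct substrings of one string.
sub substrings_of {
    my ($s) = @_;
    my %seen;
    for my $start (0 .. length($s) - 1) {
        $seen{ substr $s, $start, $_ } = 1 for 1 .. length($s) - $start;
    }
    return keys %seen;
}

# Hypothetical helper: returns a hash ref, unique substring => name.
sub build_lookup {
    my @names = @_;
    my %count;
    # Pass 1: count how many names contain each substring.
    for my $name (@names) {
        $count{$_}++ for substrings_of($name);
    }
    # Pass 2: record the owner of every substring counted exactly once.
    my %name_for;
    for my $name (@names) {
        $name_for{$_} = $name
            for grep { $count{$_} == 1 } substrings_of($name);
    }
    return \%name_for;
}

my $name_for = build_lookup(qw/Smith Johnathan Smiths Mona Moynaham/);
print "$_ => ", $name_for->{$_} // '(not unique)', "\n"
    for qw/hs y ona Smith/;
```

After the table is built, each lookup is a single hash access and the original list does not need to be scanned again.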

Anno

Shiraz · Jun 23, 2005

Arne, i appreciate the input.... i'll see what kind of tuning can be
done to accomplish that fast, since the input file may be as big as 50000
lines at a time.....

i'll put up the code in a couple of days.

Thanks all

A. Sinan Unur · Jun 23, 2005

Shiraz said:
Arne, i appreciate the input....

What input was that? Please quote some context when posting replies.

done to accomplish that fast, since the input file may be as big as 50000
lines at a time.....

50000 is really not that huge a number.

i'll put up the code in a couple of days.

Please read the posting guidelines for this group by then.

Sinan

xhoster · Jun 23, 2005

Shiraz said:
This is my goal, if i have an input file:

BEGIN----------
Smith
Johnathan
Smiths
Mona
Moynaham
END------------

i want to generate a minimum match so that with the minimum match
string, each entry can be identified. the output would be:

BEGIN----------
Smith - Smith
Smiths - Smiths or hs
Mona - Mon or ona
Moynaham - Moy or nah or aha or ham
END------------

Smith has *no* unambiguous match string!

What do you mean by "minimum"? "hs" is shorter than "Smiths" so why
is "Smiths" included if you only want the minimum?

Is there any built-in function to do such analysis in perl?

No.

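use strict;
use warnings;

# Return a flat list of (substring, name) pairs, one pair for every
# substring that occurs in exactly one of the given names.
sub foo {
    my %h2;    # substring => number of names that contain it
    foreach (@_) {
        my %h;    # distinct substrings of the current name
        foreach my $p1 (0 .. length($_) - 1) {
            foreach my $p2 ($p1 .. length($_) - 1) {
                $h{ substr $_, $p1, $p2 - $p1 + 1 } = 1;
            }
        }
        $h2{$_}++ foreach keys %h;
    }
    # Keep the substrings seen in one name only, then find that name.
    return map { my $t = $_; $t, grep -1 != index($_, $t), @_ }
        grep $h2{$_} == 1, keys %h2;
}

print join "\t", foo(qw/Smith Smiths Mona Moynaham/);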
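
If "minimum" is taken to mean "shortest", the same counting can be narrowed to the shortest unique substrings per name. The following is a sketch under that assumption only; the helper name shortest_unique is hypothetical and this reading of "minimum" was not confirmed in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref, name => array ref of that
# name's shortest unique substrings (empty if it has none).
sub shortest_unique {
    my @names = @_;
    my (%count, %subs_of);
    for my $name (@names) {
        my %seen;    # distinct substrings of this name
        for my $start (0 .. length($name) - 1) {
            $seen{ substr $name, $start, $_ } = 1
                for 1 .. length($name) - $start;
        }
        $subs_of{$name} = [ keys %seen ];
        $count{$_}++ for keys %seen;
    }
    my %best;
    for my $name (@names) {
        my @unique = sort { length($a) <=> length($b) or $a cmp $b }
                     grep { $count{$_} == 1 } @{ $subs_of{$name} };
        # Keep only the substrings that tie for the shortest length.
        $best{$name} = [ grep { length($_) == length($unique[0]) } @unique ];
    }
    return \%best;
}

my $best = shortest_unique(qw/Smith Smiths Mona Moynaham/);
for my $name (sort keys %$best) {
    my @subs = @{ $best->{$name} };
    print "$name - ",
        (@subs ? join(' or ', @subs) : '(no unique substring)'), "\n";
}
```

For this four-name list the shortest identifiers come out as "s" for Smiths, "on" for Mona and "y" for Moynaham, and Smith is reported as having none.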

Xho

Shiraz · Jun 23, 2005

Arne,
I meant your input on how to analyze the problem: look for strings of
various lengths and see which ones are unique to an entry.

Xho,
if the list I get contains both Smith and Smiths, there is nothing I
can do about it. To identify them separately, I need a total match of
'Smith' to identify 'Smith', and I need 'hs' to identify 'Smiths'.
You were correct about the hs.

Shiraz · Jun 25, 2005

Hey guys there is a package called Text-Ngrams which solved my
problems.....

Tad McClellan · Jun 25, 2005

Shiraz said:
Hey guys there is a package called Text-Ngrams which solved my
problems.....

What were your problems that the module solved?

Shiraz · Jun 26, 2005

i apologize, i should be more verbose... anyway, i was looking for a
way to come up with a minimum match string (given a list of strings) by
which each entry can be uniquely identified.

Smith - Smith
Smiths - Smiths or hs
Mona - Mon or ona
Moynaham - Moy or nah or aha or ham

just using
use Text::Ngrams;
my $ng3 = Text::Ngrams->new( windowsize => 6 );

$ng3->process_files('rates/atsi-BF.csv');
print F_OUT_LOG $ng3->to_string( out);

Gunnar Hjalmarsson · Jun 26, 2005

Shiraz said:
i apologize, i should be more verbose...

No, you should rather have provided context by quoting appropriate part
of previous messages. ;-)
