Pattern Searching

S

Shiraz

This is my goal, if i have an input file:

BEGIN----------
Smith
Johnathan
Smiths
Mona
Moynaham
END------------

i want to generate a minimum match so that with the minimum match
string, each entry can be identified. the output would be:

BEGIN----------
Smith - Smith
Smiths - Smiths or hs
Mona - Mon or ona
Moynaham - Moy or nah or aha or ham
END------------

Is there any built-in function to do such analysis in perl?
 
A

A. Sinan Unur

This is my goal, if i have an input file:

BEGIN----------
Smith
Johnathan
Smiths
Mona
Moynaham
END------------

i want to generate a minimum match so that with the minimum match
string, each entry can be identified. the output would be:

BEGIN----------
Smith - Smith
Smiths - Smiths or hs
Mona - Mon or ona
Moynaham - Moy or nah or aha or ham
END------------

Is there any built-in function to do such analysis in perl?

No.

But you can always browse through CPAN to see if someone has written a
module to help with this task.

Sinan
 
A

Arne Ruhnau

Shiraz said:
This is my goal, if i have an input file:

BEGIN----------
Smith
Johnathan
Smiths
Mona
Moynaham
END------------

i want to generate a minimum match so that with the minimum match
string, each entry can be identified. the output would be:

BEGIN----------
Smith - Smith
Smiths - Smiths or hs
Mona - Mon or ona
Moynaham - Moy or nah or aha or ham
END------------

Is there any built-in function to do such analysis in perl?

Hmm... my first idea would be to, for every entry, extract every possible
2..length($entry) - Gram (i.e. each sequence of 2..length($entry) letters)
and count them.
For Smith, this would become
Sm, mi, it, th, Smi, mit, ith, Smit, mith, Smith
and for Smiths
Sm, mi, it, th, hs, Smi, mit, ith, ths, Smit, mith, iths, Smith, miths, Smiths

(thus, there are more unique-to-the-entry substrings than you listed above)

Of course, one would have to remember which entry generated the ngram just
counted. Afterwards, you can grep all those from the resulting tree whose
terminal nodes (i.e. counts) are exactly one, i.e. which were unique to the
associated word.

hth,

Arne Ruhnau
 
A

Anno Siegel

Arne Ruhnau said:
Hmm... my first idea would be to, for every entry, extract every possible
2..length($entry) - Gram (i.e. each sequence of 2..length($entry) letters)

Why only strings of length 2 or more? "y" uniquely identifies "Moynaham"
in the list above.
and count them.
For Smith, this would become
Sm, mi, it, th, Smi, mit, ith, Smit, mith, Smith
and for Smiths
Sm, mi, it, th, hs, Smi, mit, ith, ths, Smit, mith, iths, Smith, miths, Smiths

(thus, there are more unique-to-the-entry substrings than you listed above)

Of course, one would have to remember which entry generated the ngram just
counted. Afterwards, you can grep all those from the resulting tree whose
terminal nodes (i.e. counts) are exactly one, i.e. which were unique to the
associated word.

If you know the substring is unique (i.e. is a substring of only one of
the names) you don't *have* to remember which name generated which
substring. You can find it by looking which of the names contains the
substring.

Anno
 
A

Arne Ruhnau

Anno said:
Why only strings of length 2 or more? "y" uniquely identifies "Moynaham"
in the list above.

Well, I thought unique unigrams could be too sparse / non-existent, so that
there is no point in looking at them anyway. But for completeness/maximally
short unique substrings, they should be included.

If you know the substring is unique (i.e. is a substring of only one of
the names) you don't *have* to remember which name generated which
substring. You can find it by looking which of the names contains the
substring.

Correct, hadn't thought of that... But if you want to make some kind of
look-up based on the unique substring, returning the string identified by
it, you have to know it (to prevent yourself from scanning the original
list). Or am I getting it wrong?

Arne
 
A

Anno Siegel

Arne Ruhnau said:
[...]
If you know the substring is unique (i.e. is a substring of only one of
the names) you don't *have* to remember which name generated which
substring. You can find it by looking which of the names contains the
substring.

Correct, hadn't thought of that... But if you want to make some kind of
look-up based on the unique substring, returning the string identified by
it, you have to know it (to prevent yourself from scanning the original
list). Or am I getting it wrong?

Sure, you probably will eventually build a lookup table for the unique
substrings, but you don't have to keep track of each substring's origin
while you find them.

Anno
 
S

Shiraz

Arne, i appreciate the input.... i'll see what kind of tuning can be
done to accomplish that fast since the input file maybe as big as 50000
lines at a time.....

i'll put up the code in a couple of days.

Thanks all
 
A

A. Sinan Unur

Shiraz said:
Arne, i appreciate the input....

What input was that? Please quote some context when posting replies.
done to accomplish that fast since the input file maybe as big as 50000
lines at a time.....

50000 is really not that huge a number.
i'll put up the code in a couple of days.

Please read the posting guidelines for this group by then.

Sinan
 
X

xhoster

Shiraz said:
This is my goal, if i have an input file:

BEGIN----------
Smith
Johnathan
Smiths
Mona
Moynaham
END------------

i want to generate a minimum match so that with the minimum match
string, each entry can be identified. the output would be:

BEGIN----------
Smith - Smith
Smiths - Smiths or hs
Mona - Mon or ona
Moynaham - Moy or nah or aha or ham
END------------

Smith has *no* unambiguous match string!

What do you mean by "minimum"? "hs" is shorter than "Smiths" so why
is "Smiths" included if you only want the minimum?
Is there any built-in function to do such analysis in perl?

No.

sub foo {
my %h2;
foreach (@_) {
my %h;
foreach my $p1 (0..length($_)-1) {
foreach my $p2 ($p1 .. length($_)-1) {
$h{substr $_, $p1, $p2-$p1+1}=1;
}
};
$h2{$_}++ foreach keys %h;
};
return map { my $t=$_; $t, grep -1!=index($_,$t),@_}
grep $h2{$_}==1, keys %h2;
};

print join "\t", foo(qw/Smith Smiths Mona Moynaham/);

Xho
 
S

Shiraz

Arne,
the input about how to analyze the problem, look for strings of various
lengths and see which one is unique to an entry

xhos
if in the list that i get contains, both smith and smiths, there is
nothing i can do about it, so to identify them separately, i need a
total match of smith, to identify 'smith' and i need 'hs' to identify
'Smiths';
you were correct about the hs;
 
S

Shiraz

i appologize, i should be more verbose... any ways i was looking for a
way to come up with a minimum match string, (given a list of stings) by
which each entry can be uniquely identified.


Smith - Smith
Smiths - Smiths or hs
Mona - Mon or ona
Moynaham - Moy or nah or aha or ham

just using
use Text::Ngrams;
my $ng3 = Text::Ngrams->new( windowsize => 6 );

$ng3->process_files('rates/atsi-BF.csv');
print F_OUT_LOG $ng3->to_string( out);
 
G

Gunnar Hjalmarsson

Shiraz said:
i appologize, i should be more verbose...

No, you should rather have provided context by quoting appropriate part
of previous messages. ;-)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,764
Messages
2,569,566
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top