Matching the most common

P

Perl Learner

Okay, I know there should be something in perl that lets me do this,
but I tried all kinds of google search but ended up with normal pattern
matching.

My problem is .. i have a bunch of names.. in fact, 500 of those.. and
i want to put them in groups...

for example:

INVX1
INVX2
...
...
INVXLH
INVLH
CLKINV2X1
CLKINV
...
...
NAND2X2
NAND2X4
...
...
NANDSOMETHINGELSE
...
...
AND_OTHER_GROUPS

What I would like to do is match all "INV"s .. and match all "NAND"s
...etc
I know you can match all "INVs" by doing /^INV/ .. but that's not what
I meant..

I want it to read all INV*** and come up with INV (most common
starting name)
(Notice some names have INVX* and one name has just INVLH .. so the
most common is INV) .. name case with NANDs and other "groups"

The problem is that I dont know how long this "root name" could be (for
INVs it is only 3, for NANDs, it is 4.. it all depends on the list of
names given), if i knew the length, the problem would be much simpler.

Is there a function in perl that can match the longest "root name"
possible?
If there isn't one, I have to write it myself. But if there is, i'd be
glad if someone points me to it.
 
G

Gunnar Hjalmarsson

Perl said:
The problem is that I dont know how long this "root name" could be (for
INVs it is only 3, for NANDs, it is 4.. it all depends on the list of
names given), if i knew the length, the problem would be much simpler.

Is there a function in perl that can match the longest "root name"
possible?

If you can't tell Perl how you identify the "root name", how on earth
would such a function work?
If there isn't one, I have to write it myself.

Er.. Good luck! ;-)
 
B

Brian Wakem

Perl said:
Okay, I know there should be something in perl that lets me do this,
but I tried all kinds of google search but ended up with normal pattern
matching.

My problem is .. i have a bunch of names.. in fact, 500 of those.. and
i want to put them in groups...

for example:

INVX1
INVX2
..
..
INVXLH
INVLH
CLKINV2X1
CLKINV
..
..
NAND2X2
NAND2X4
..
..
NANDSOMETHINGELSE
..
..
AND_OTHER_GROUPS

What I would like to do is match all "INV"s .. and match all "NAND"s
..etc
I know you can match all "INVs" by doing /^INV/ .. but that's not what
I meant..

I want it to read all INV*** and come up with INV (most common
starting name)
(Notice some names have INVX* and one name has just INVLH .. so the
most common is INV) .. name case with NANDs and other "groups"

The problem is that I dont know how long this "root name" could be (for
INVs it is only 3, for NANDs, it is 4.. it all depends on the list of
names given), if i knew the length, the problem would be much simpler.

Is there a function in perl that can match the longest "root name"
possible?
If there isn't one, I have to write it myself. But if there is, i'd be
glad if someone points me to it.


Your description is not very concise. It sounds like all you need to do is
'sort' them.
 
J

Jim Keenan

Perl said:
Okay, I know there should be something in perl that lets me do this,
but I tried all kinds of google search but ended up with normal pattern
matching.

My problem is .. i have a bunch of names.. in fact, 500 of those.. and
i want to put them in groups...

for example:

INVX1
INVX2
..
..
INVXLH
INVLH
CLKINV2X1
CLKINV
..
..
NAND2X2
NAND2X4
..
..
NANDSOMETHINGELSE
..
..
AND_OTHER_GROUPS

What I would like to do is match all "INV"s .. and match all "NAND"s
..etc
I know you can match all "INVs" by doing /^INV/ .. but that's not what
I meant..

I want it to read all INV*** and come up with INV (most common
starting name)
(Notice some names have INVX* and one name has just INVLH .. so the
most common is INV) .. name case with NANDs and other "groups"

The problem is that I dont know how long this "root name" could be (for
INVs it is only 3, for NANDs, it is 4.. it all depends on the list of
names given), if i knew the length, the problem would be much simpler.
Since there is a significant aspect to the data which you can't pin
down, I can't provide a simple solution either.

But, here's a less-than-elegant approach: Define a number of hashes,
one of which will have keys consisting of the first 3 letters; one of
the first 4 letters; etc. Examine each datum with 'substr' iteratively
for each hash so defined. Increment the value for each hash element
when a particular 3-, 4-, etc initial substring in seen.

jimk
 
A

Arne Ruhnau

Perl said:
My problem is .. i have a bunch of names.. in fact, 500 of those.. and
i want to put them in groups...
What I would like to do is match all "INV"s .. and match all "NAND"s
..etc
I know you can match all "INVs" by doing /^INV/ .. but that's not what
I meant..

I want it to read all INV*** and come up with INV (most common
starting name)
(Notice some names have INVX* and one name has just INVLH .. so the
most common is INV) .. name case with NANDs and other "groups"

The problem is that I dont know how long this "root name" could be (for
INVs it is only 3, for NANDs, it is 4.. it all depends on the list of
names given), if i knew the length, the problem would be much simpler.

You could use something like the Tree::Trie - Module to store the data you
have in a tree. Afterwards, you walk through this datastructure, and as
soon as you have more than one possible daughter to your current node, you
have your longest prefix.

Of course, you have to mess with the internal representation of the
Tree::Trie-Module, but as it was the first one that popped into my mind, I
mentioned it. Every other Module / own Code that constructs the right tree
is in order. You might want to do it similar to this way, which has the
sideeffect of counting what it gets...

sub addData {
my ($tree, $data) = @_; # $data holds [split //, $lineOfYourFile
# $tree is just a hashref
my $ref_to_former_value = $tree;
my $limit = scalar(@$data)-1; # the length of your the string to insert
map {
if($_ == $limit) {
$ref_to_former_value->{$data->[$_]}++;
}
else {
$ref_to_former_value->{$data->[$_]} ||= {};
$ref_to_former_value = $ref_to_former_value->{$data->[$_]};
}
} (0..$limit);
return;
}

A recursive method would then walk down the hash until there is more than
one key in the current hash, remember which nodes it visited and put this
path on a list.
Afterwards, this list holds your longest unique prefixes.

hth,

Arne Ruhnau
 
A

Arne Ruhnau

Arne said:
Perl said:
What I would like to do is match all "INV"s .. and match all "NAND"s
..etc
I know you can match all "INVs" by doing /^INV/ .. but that's not what
I meant..

I want it to read all INV*** and come up with INV (most common
starting name)
(Notice some names have INVX* and one name has just INVLH .. so the
most common is INV) .. name case with NANDs and other "groups"
Of course, you have to mess with the internal representation of the
Tree::Trie-Module, but as it was the first one that popped into my mind, I
mentioned it. Every other Module / own Code that constructs the right tree
is in order. You might want to do it similar to this way, which has the
sideeffect of counting what it gets...

sub addData {
my ($tree, $data) = @_; # $data holds [split //, $lineOfYourFile
# $tree is just a hashref
my $ref_to_former_value = $tree;
my $limit = scalar(@$data)-1; # the length of your the string to insert
map {
if($_ == $limit) {
$ref_to_former_value->{$data->[$_]}++;
}
else {
$ref_to_former_value->{$data->[$_]} ||= {};
$ref_to_former_value = $ref_to_former_value->{$data->[$_]};
}
} (0..$limit);
return;
}

Er, sorry. This will certainly work only for equal-length strings. Better
use the Tree::Trie-Module instead.

Arne "If I would only read before I post" Ruhnau
 
A

Ala Qumsieh

Perl said:
The problem is that I dont know how long this "root name" could be (for
INVs it is only 3, for NANDs, it is 4.. it all depends on the list of
names given), if i knew the length, the problem would be much simpler.

Why not go the other way around? Strip out all know suffixes:

s/X\d+$|X?LH$/;

that should leave you with the root names. While this simplistic
solution might not cover all your cases, it will narrow it down
considerably, and you will be able to handle special cases independently.

--Ala
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,780
Messages
2,569,611
Members
45,280
Latest member
BGBBrock56

Latest Threads

Top