Finding common words

viv2k

I'm new to Perl but a friend has told me that the task I want to do can
be done very efficiently in Perl.

Right now, I have two very long lists of words with comments attached
to each of the words and I want to find the words that are common in
both of them.

example:

ListA           ListB

apple 1.1       apple 100
banana 2.2      boy 500
cat 3.3         cat 1000

And I want the result to look something like:

ListA           ListB
apple 1.1       apple 100
cat 3.3         cat 1000

Any idea how to tackle this? I'm off to read some introductory Perl
books now because I really need that list ASAP. Any help
would be greatly appreciated.

thanks
viv
 
Gunnar Hjalmarsson

viv2k said:
Right now, I have two very long lists of words with comments
attached to each of the words and I want to find the words that are
common in both of them.

example:

ListA           ListB

apple 1.1       apple 100
banana 2.2      boy 500
cat 3.3         cat 1000

And I want the result to look something like:

ListA           ListB
apple 1.1       apple 100
cat 3.3         cat 1000

my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
my %ListB = ( apple => 100, boy => 500, cat => 1000 );

for (keys %ListA) {
    delete $ListA{$_} unless exists $ListB{$_};
}
for (keys %ListB) {
    delete $ListB{$_} unless exists $ListA{$_};
}

print "ListA\n";
print "$_\t$ListA{$_}\n" for sort keys %ListA;
print "\n";
print "ListB\n";
print "$_\t$ListB{$_}\n" for sort keys %ListB;
 
David K. Wall

Right now, I have two very long lists of words with comments attached
to each of the words and I want to find the words that are common in
both of them.

perldoc -q intersection

Gunnar Hjalmarsson has already posted working code. I just wanted to point to
the FAQ entry, as the code there is readily adapted to the above problem.
 
Matt Garrish

Gunnar Hjalmarsson said:
my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
my %ListB = ( apple => 100, boy => 500, cat => 1000 );

for (keys %ListA) {
    delete $ListA{$_} unless exists $ListB{$_};
}
for (keys %ListB) {
    delete $ListB{$_} unless exists $ListA{$_};
}

I've never been a big fan of looping over both sets like that (especially if
you want to retain the original hashes). If you just want to extract the
common elements, my personal preference would be to do something like the
following instead:

my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
my %ListB = ( apple => 100, boy => 500, cat => 1000 );
my %common;

for (keys %ListA) {
    # exists() rather than a truth test, so a key whose value is 0 still counts
    if (exists $ListB{$_}) { $common{$_} = [$ListA{$_}, $ListB{$_}] }
}

print "ListA\n";
print "$_\t$common{$_}[0]\n" for sort keys %common;

print "\nListB\n";
print "$_\t$common{$_}[1]\n" for sort keys %common;


But there's certainly nothing wrong with your code...

Matt
 
Gunnar Hjalmarsson

David said:
perldoc -q intersection

Gunnar Hjalmarsson has already posted working code. I just wanted
to point to the FAQ entry, as the code there is readily adapted to
the above problem.

Well, applying that FAQ entry, you could do something like this:

my ($elem, %count, %intersection);
for $elem (keys %ListA, keys %ListB) { $count{$elem}++ }
for $elem (keys %count) {
    $intersection{$elem} = "$ListA{$elem} : $ListB{$elem}"
        if $count{$elem} > 1;
}
print "$_\t$intersection{$_}\n" for sort keys %intersection;

But provided that it makes sense to start with populating two hashes
with the lists of words + comments, I don't really see that the FAQ
entry is very well adapted, since applying it would not take advantage
of the initial hashes.
 
Gunnar Hjalmarsson

Matt said:
I've never been a big fan of looping over both sets like that
(especially if you want to retain the original hashes). If you just
want to extract the common elements, my personal preference would
be to do something like the following instead:

my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
my %ListB = ( apple => 100, boy => 500, cat => 1000 );
my %common;

for (keys %ListA) {
    # exists() rather than a truth test, so a key whose value is 0 still counts
    if (exists $ListB{$_}) { $common{$_} = [$ListA{$_}, $ListB{$_}] }
}

print "ListA\n";
print "$_\t$common{$_}[0]\n" for sort keys %common;

print "\nListB\n";
print "$_\t$common{$_}[1]\n" for sort keys %common;


But there's certainly nothing wrong with your code...

Maybe not, but I like your solution. It would only require looping
through one of the lists.
 
David K. Wall

Gunnar Hjalmarsson said:
Well, applying that FAQ entry, you could do something like this:

my ($elem, %count, %intersection);
for $elem (keys %ListA, keys %ListB) { $count{$elem}++ }
for $elem (keys %count) {
    $intersection{$elem} = "$ListA{$elem} : $ListB{$elem}"
        if $count{$elem} > 1;
}
print "$_\t$intersection{$_}\n" for sort keys %intersection;

But provided that it makes sense to start with populating two hashes
with the lists of words + comments, I don't really see that the FAQ
entry is very well adapted, since applying it would not take advantage
of the initial hashes.

How about this?

my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
my %ListB = ( apple => 100, boy => 500, cat => 1000 );
my %count;
my (%count, @intersection);
for my $element (keys %ListA, keys %ListB) {
    push @intersection, $element if ++$count{$element} > 1;
}
my (%new_ListA, %new_ListB);
@new_ListA{@intersection} = @ListA{@intersection};
@new_ListB{@intersection} = @ListB{@intersection};

It's a bit different from the FAQ, but was directly inspired by it...
<shrug>

If memory is a consideration I'd go with deleting the non-intersecting
elements. Neat idea, and one that didn't occur to me.
 
David K. Wall

David K. Wall said:
my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
my %ListB = ( apple => 100, boy => 500, cat => 1000 );
my %count;
my (%count, @intersection);

Oops, extra declaration. I originally wrote it as a for() loop followed by a
grep(), but then saw that the grep() could be eliminated and just re-pasted
the part I changed.
 
Gunnar Hjalmarsson

David said:
How about this?

my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
my %ListB = ( apple => 100, boy => 500, cat => 1000 );
my (%count, @intersection);
for my $element (keys %ListA, keys %ListB) {
    push @intersection, $element if ++$count{$element} > 1;
}
my (%new_ListA, %new_ListB);
@new_ListA{@intersection} = @ListA{@intersection};
@new_ListB{@intersection} = @ListB{@intersection};

It's a bit different from the FAQ, but was directly inspired by
it... <shrug>

I still don't think that the FAQ approach is good here. The FAQ deals
with arrays, and since we are starting with hashes here, you'd better
take advantage of the ability to look up elements in a hash.

If memory is a consideration

The FAQ approach is indeed memory expensive. OP mentioned "two very
long lists of words", and this solution creates 6(!) variables with
lists: %ListA, %ListB, %count, @intersection, %new_ListA and
%new_ListB.

I'd go with deleting the non-intersecting elements. Neat idea, and
one that didn't occur to me.

If you want to keep the original hashes intact, I personally find
Matt's solution to be the neatest.
 
viv2k

Thanks for the tips. But does that mean I will have to manually write
and create the two lists before comparing them?

Because my list is actually a txt file where after each word there is
a space and then the number associated and then a comma and then the
next word and so on. Example:

apple 1.1, banana 2.2, cat 3.3, etc

Is there a way of taking the whole text file as input and using
'space' and 'comma' as delimiters to do the task?

Thanks
viv
 
Tore Aursand

ListA           ListB

apple 1.1       apple 100
banana 2.2      boy 500
cat 3.3         cat 1000

And I want the result to look something like:

ListA           ListB
apple 1.1       apple 100
cat 3.3         cat 1000

You have probably already seen a few very good solutions from other
people, but I thought I would give you mine;

my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
my %ListB = ( apple => 100, boy => 500, cat => 1000 );
my @common = grep { $ListB{$_} } keys %ListA;
for ( sort @common ) {
    print "$_\t$ListA{$_}\t$_\t$ListB{$_}\n";
}

Anything wrong with this one? I thought about using 'map' for the 'grep'
part above, but couldn't find a nice way to have it _not_ return anything
when there's no match, ie;

my @common = map { (exists $ListB{$_}) ? $_ : undef } keys %ListA;
foreach ( sort @common ) {
    next unless defined;
    # ...
}

Comments/suggestions/corrections are appreciated!


--
Tore Aursand
"Writing is a lot like sex. At first you do it because you like it.
Then you find yourself doing it for a few close friends and people you
like. But if you're any good at all, you end up doing it for money."
-- Unknown
 
Gunnar Hjalmarsson

[ Please respond below the quoted text. ]
Thanks for the tips. But does that mean I will have to manually
write and create the two lists before comparing them?

Only if they are only on a sheet of paper, or something. :)

Because my list is actually a txt file where after each word there
is a space and then the number associated and then a comma and then
the next word and so on. Example:

apple 1.1, banana 2.2, cat 3.3, etc

Is there a way of taking the whole text file as input and using
'space' and 'comma' as delimiters to do the task?

Yes, of course. Assuming that neither the words nor the numbers can
include spaces, this is one way:

my %ListA;
open my $fh, '<', 'ListA.txt' or die "Couldn't open ListA.txt: $!";
while (<$fh>) {
    for (split /,\s*/) {
        my ($key, $value) = split;
        $ListA{$key} = $value;
    }
}
close $fh;
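
Putting the pieces together, a minimal sketch might look like the following. It assumes the same "word number, word number, ..." format. The helper name read_list is made up for illustration, and in-memory strings stand in for ListA.txt and ListB.txt so that the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "word number, word number, ..." from an
# open filehandle into a hash of word => number.
sub read_list {
    my ($fh) = @_;
    my %list;
    while ( my $line = <$fh> ) {
        chomp $line;
        for my $pair ( split /,\s*/, $line ) {
            my ( $key, $value ) = split ' ', $pair;
            next unless defined $key;    # skip empty fields
            $list{$key} = $value;
        }
    }
    return %list;
}

# In-memory stand-ins for the two text files (sample data from the thread).
my $textA = "apple 1.1, banana 2.2, cat 3.3\n";
my $textB = "apple 100, boy 500, cat 1000\n";

open my $fhA, '<', \$textA or die "Couldn't open list A: $!";
open my $fhB, '<', \$textB or die "Couldn't open list B: $!";
my %ListA = read_list($fhA);
my %ListB = read_list($fhB);
close $fhA;
close $fhB;

# Words present in both lists, printed side by side.
my @common = grep { exists $ListB{$_} } sort keys %ListA;
print "$_\t$ListA{$_}\t$_\t$ListB{$_}\n" for @common;
```

With real files, the two in-memory opens would be replaced by opens on the actual file names.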
 
Gunnar Hjalmarsson

Tore said:
You have probably already seen a few very good solutions from other
people, but I thought I would give you mine;

my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
my %ListB = ( apple => 100, boy => 500, cat => 1000 );
my @common = grep { $ListB{$_} } keys %ListA;
for ( sort @common ) {
    print "$_\t$ListA{$_}\t$_\t$ListB{$_}\n";
}

Comments/suggestions/corrections are appreciated!

Nice. :)
 
Matt Garrish

Tore Aursand said:
You have probably already seen a few very good solutions from other
people, but I thought I would give you mine;

my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
my %ListB = ( apple => 100, boy => 500, cat => 1000 );
my @common = grep { $ListB{$_} } keys %ListA;
for ( sort @common ) {
    print "$_\t$ListA{$_}\t$_\t$ListB{$_}\n";
}

Anything wrong with this one? I thought about using 'map' for the 'grep'
part above, but couldn't find a nice way to have it _not_ return anything
when there's no match, ie;

my @common = map { (exists $ListB{$_}) ? $_ : undef } keys %ListA;
foreach ( sort @common ) {
    next unless defined;
    # ...
}

I like your first solution (I wasn't thinking about *not* using a hash,
myself). As for the second, the best way I can see to use map would be
without checking the return value:

my @common;
map { push @common, $_ if exists $ListB{$_} } keys %ListA;

Matt
 
David K. Wall

Tore Aursand said:
my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
my %ListB = ( apple => 100, boy => 500, cat => 1000 );
my @common = grep { $ListB{$_} } keys %ListA;
for ( sort @common ) {
    print "$_\t$ListA{$_}\t$_\t$ListB{$_}\n";
}

Anything wrong with this one?

One nitpick: make it

my @common = grep { exists $ListB{$_} } keys %ListA;

in case %ListB has a key whose value is false.
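
A small sketch of why the distinction matters. The 'dog' entry is a made-up addition to the sample data, given a false value (0) in %ListB: a plain truth test silently drops it, while exists keeps it.

```perl
use strict;
use warnings;

# 'dog' is a hypothetical entry whose value in %ListB is false (0).
my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3, dog => 4.4 );
my %ListB = ( apple => 100, boy => 500, cat => 1000, dog => 0 );

# Truth test: skips keys whose value is 0, '' or undef.
my @by_truth = grep { $ListB{$_} } sort keys %ListA;

# Existence test: keeps every key present in %ListB, whatever its value.
my @by_exists = grep { exists $ListB{$_} } sort keys %ListA;

print "truth:  @by_truth\n";     # apple cat
print "exists: @by_exists\n";    # apple cat dog
```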

It's interesting to see so many different ways to do it. :)
 
David K. Wall

David K. Wall said:
One nitpick: make it

my @common = grep { exists $ListB{$_} } keys %ListA;

in case %ListB has a key whose value is false.

Oh, I see you already did that in another part of your post. Never mind.

I like it better than my adaptation of the 'intersection' FAQ answer.
 
gnari

Gunnar Hjalmarsson said:
[ Please respond below the quoted text. ]
Because my list is actually a txt file where after each word there
is a space and then the number associated and then a comma and then
the next word and so on. Example:

apple 1.1, banana 2.2, cat 3.3, etc

Yes, of course. Assuming that neither the words nor the numbers can
include spaces, this is one way:

my %ListA;
open my $fh, '<', 'ListA.txt' or die "Couldn't open ListA.txt: $!";
while (<$fh>) {
    for (split /,\s*/) {
        my ($key, $value) = split;
        $ListA{$key} = $value;
    }
}
close $fh;

or with same assumptions:

my %ListA;
open my $fh, '<', 'ListA.txt' or die "Couldn't open ListA.txt: $!";
while (<$fh>) {
    %ListA = (%ListA, split /[\s,]+/);
}
close $fh;

gnari
 
Anno Siegel

Gunnar Hjalmarsson said:
I still don't think that the FAQ approach is good here. The FAQ deals
with arrays, and since we are starting with hashes here, you'd better
take advantage of the ability to look up elements in a hash.

In another branch of the thread we learned (unsurprisingly) that the
data comes in disk files. This suggests we read one of the files
into a hash (the smaller one, if there is a difference), and process
the other one record-by-record.

The FAQ approach is indeed memory expensive. OP mentioned "two very
long lists of words", and this solution creates 6(!) variables with
lists: %ListA, %ListB, %count, @intersection, %new_ListA and
%new_ListB.

Well, that's because the FAQ is over-explicit to get the idea across.
Once you've grasped that, you're free to cut the cruft. Some FAQ answers
aren't meant to be direct copy/paste solutions.

Unfortunately it requires both hashes in memory.

If you want to keep the original hashes intact, I personally find
Matt's solution to be the neatest.

As it seems, there are no original hashes, that was an artifact of the
first reply in the thread.

Assuming filehandles $listA and $listB opened (and $/ set to ',', IIRC),
I'd do the obvious (untested):

my %listA;
while ( <$listA> ) {
    chomp; # non-standard $/, split won't do the job
    my( $key, $val) = split;
    $listA{ $key} = $val;
}

my %new_listB;
while ( <$listB> ) {
    chomp;
    my ( $key, $val) = split;
    $new_listB{ $key} = $val if exists $listA{ $key};
}

If %new_listA is explicitly needed, it can be derived from the above.
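
For anyone who wants to try this without setting up files, here is a self-contained sketch of the same idea. It is only an illustration: the in-memory strings stand in for the two disk files, and the "next unless defined" guards are an added assumption to cope with empty records such as a trailing comma.

```perl
use strict;
use warnings;

# In-memory stand-ins for the two files (sample data from the thread).
my $textA = "apple 1.1, banana 2.2, cat 3.3\n";
my $textB = "apple 100, boy 500, cat 1000\n";
open my $listA, '<', \$textA or die "Couldn't open list A: $!";
open my $listB, '<', \$textB or die "Couldn't open list B: $!";

local $/ = ',';    # read one "word number" record at a time

# Load the first list completely.
my %listA;
while ( my $record = <$listA> ) {
    chomp $record;                             # strips the trailing comma
    my ( $key, $val ) = split ' ', $record;    # also drops spaces and newlines
    next unless defined $key;
    $listA{$key} = $val;
}

# Stream the second list, keeping only words already seen in the first.
my %new_listB;
while ( my $record = <$listB> ) {
    chomp $record;
    my ( $key, $val ) = split ' ', $record;
    next unless defined $key;
    $new_listB{$key} = $val if exists $listA{$key};
}

print "$_\t$listA{$_}\t$_\t$new_listB{$_}\n" for sort keys %new_listB;
```

Only the first list is held in full. The second contributes just the matching entries, which is the point of processing it record by record.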

Anno
 
Anno Siegel

Tore Aursand said:
You have probably already seen a few very good solutions from other
people, but I thought I would give you mine;

my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
my %ListB = ( apple => 100, boy => 500, cat => 1000 );
my @common = grep { $ListB{$_} } keys %ListA;
for ( sort @common ) {
    print "$_\t$ListA{$_}\t$_\t$ListB{$_}\n";
}

Anything wrong with this one?

Nothing, at visual inspection. It's a good high-level approach. Like
most of the thread, it assumes both hashes in memory. We may want to
avoid that.

I thought about using 'map' for the 'grep'
part above, but couldn't find a nice way to have it _not_ return anything
when there's no match, ie;

my @common = map { (exists $ListB{$_}) ? $_ : undef } keys %ListA;
foreach ( sort @common ) {
    next unless defined;
    # ...
}

Use "()" where you have tried "undef". map evaluates its block in
list context, so if the block returns an empty list, nothing is added
to the result.
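
Applied to the snippet quoted above, that might look like this (a sketch using the sample hashes from the thread, with the keys sorted first so that the output order is predictable):

```perl
use strict;
use warnings;

my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
my %ListB = ( apple => 100, boy => 500, cat => 1000 );

# Returning the empty list () for non-matches means map contributes
# nothing for those keys, so no undef placeholders need filtering later.
my @common = map { ( exists $ListB{$_} ) ? $_ : () } sort keys %ListA;

print "@common\n";    # apple cat
```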

Anno
 
Tore Aursand

my @common;
map { push @common, $_ if exists $ListB{$_} } keys %ListA;

I know this approach, but I also know that one shouldn't use 'map' in void
context.


--
Tore Aursand
"Anyone who slaps a 'this page is best viewed with Browser X'-label on
a web page appears to be yearning for the bad old days, before the
web, when you had very little chance of reading a document written on
another computer, another word processor or another network." -- Tim
Berners-Lee, July 1996
 
