finding common words

Discussion in 'Perl Misc' started by viv2k, Feb 22, 2004.

  1. viv2k

    viv2k Guest

    I'm new to Perl but a friend has told me that the task I wana do can
    be done very efficiently by Perl.

    Right now, I have two very long lists of words with comments attached
    to each of the words and I want to find the words that are common in
    both of them.

    example:

    ListA ListB

    apple 1.1 apple 100
    banana 2.2 boy 500
    cat 3.3 cat 1000

    And I want the result to look something like:

    ListA ListB
    apple 1.1 apple 100
    cat 3.3 cat 1000

    any idea on how to tackle this? I'm off to reading some Perl
    introductory books now koz I really need that list asap. Any help
    would be greatly appreciated.

    thanks
    viv
    viv2k, Feb 22, 2004
    #1
    1. Advertising

  2. viv2k wrote:
    > Right now, I have two very long lists of words with comments
    > attached to each of the words and I want to find the words that are
    > common in both of them.
    >
    > example:
    >
    > ListA ListB
    >
    > apple 1.1 apple 100
    > banana 2.2 boy 500
    > cat 3.3 cat 1000
    >
    > And I want the result to look something like:
    >
    > ListA ListB
    > apple 1.1 apple 100
    > cat 3.3 cat 1000


    my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    my %ListB = ( apple => 100, boy => 500, cat => 1000 );

    for (keys %ListA) {
    delete $ListA{$_} unless exists $ListB{$_};
    }
    for (keys %ListB) {
    delete $ListB{$_} unless exists $ListA{$_};
    }

    print "ListA\n";
    print "$_\t$ListA{$_}\n" for sort keys %ListA;
    print "\n";
    print "ListB\n";
    print "$_\t$ListB{$_}\n" for sort keys %ListB;

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Feb 22, 2004
    #2
    1. Advertising

  3. (viv2k) wrote:

    > Right now, I have two very long lists of words with comments attached
    > to each of the words and I want to find the words that are common in
    > both of them.


    perldoc -q intersection

    Gunnar Hjalmarsson has already posted working code. I just wanted to point to
    the FAQ entry, as the code there is readily adapted to the above problem.

    --
    David Wall
    David K. Wall, Feb 23, 2004
    #3
  4. viv2k

    Matt Garrish Guest

    "Gunnar Hjalmarsson" <> wrote in message
    news:c1b7qa$1fpfks$-berlin.de...
    > viv2k wrote:
    > > Right now, I have two very long lists of words with comments
    > > attached to each of the words and I want to find the words that are
    > > common in both of them.
    > >
    > > example:
    > >
    > > ListA ListB
    > >
    > > apple 1.1 apple 100
    > > banana 2.2 boy 500
    > > cat 3.3 cat 1000
    > >
    > > And I want the result to look something like:
    > >
    > > ListA ListB
    > > apple 1.1 apple 100
    > > cat 3.3 cat 1000

    >
    > my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    > my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    >
    > for (keys %ListA) {
    > delete $ListA{$_} unless exists $ListB{$_};
    > }
    > for (keys %ListB) {
    > delete $ListB{$_} unless exists $ListA{$_};
    > }
    >


    I've never been a big fan of looping over both sets like that (especially if
    you want to retain the original hashes). If you just want to extract the
    common elements, my personal preference would be to do something like the
    following instead:

    my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    my %common;

    for (keys %ListA) {
    if ($ListB{$_}) { $common{$_} = [$ListA{$_}, $ListB{$_}] };
    }

    print "ListA\n";
    print "$_\t$common{$_}[0]\n" for sort keys %common;

    print "\nListB\n";
    print "$_\t$common{$_}[1]\n" for sort keys %common;


    But there's certainly nothing wrong with your code...

    Matt
    Matt Garrish, Feb 23, 2004
    #4
  5. David K. Wall wrote:
    > (viv2k) wrote:
    >> Right now, I have two very long lists of words with comments
    >> attached to each of the words and I want to find the words that
    >> are common in both of them.

    >
    > perldoc -q intersection
    >
    > Gunnar Hjalmarsson has already posted working code. I just wanted
    > to point to the FAQ entry, as the code there is readily adapted to
    > the above problem.


    Well, applying that FAQ entry, you could do something like this:

    my ($elem, %count, %intersection);
    for $elem (keys %ListA, keys %ListB) { $count{$elem}++ }
    for $elem (keys %count) {
    $intersection{$elem} = "$ListA{$elem} : $ListB{$elem}"
    if $count{$elem} > 1;
    }
    print "$_\t$intersection{$_}\n" for sort keys %intersection;

    But provided that it makes sense to start with populating two hashes
    with the lists of words + comments, I don't really see that the FAQ
    entry is very well adapted, since applying it would not take advantage
    of the initial hashes.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Feb 23, 2004
    #5
  6. Matt Garrish wrote:
    > I've never been a big fan of looping over both sets like that
    > (especially if you want to retain the original hashes). If you just
    > want to extract the common elements, my personal preference would
    > be to do something like the following instead:
    >
    > my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    > my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    > my %common;
    >
    > for (keys %ListA) {
    > if ($ListB{$_}) { $common{$_} = [$ListA{$_}, $ListB{$_}] };
    > }
    >
    > print "ListA\n";
    > print "$_\t$common{$_}[0]\n" for sort keys %common;
    >
    > print "\nListB\n";
    > print "$_\t$common{$_}[1]\n" for sort keys %common;
    >
    >
    > But there's certainly nothing wrong with your code...


    Maybe not, but I like your solution. It would only require looping
    through one of the lists.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Feb 23, 2004
    #6
  7. Gunnar Hjalmarsson <> wrote:

    > David K. Wall wrote:
    >> (viv2k) wrote:
    >>> Right now, I have two very long lists of words with comments
    >>> attached to each of the words and I want to find the words that
    >>> are common in both of them.

    >>
    >> perldoc -q intersection
    >>
    >> Gunnar Hjalmarsson has already posted working code. I just wanted
    >> to point to the FAQ entry, as the code there is readily adapted to
    >> the above problem.

    >
    > Well, applying that FAQ entry, you could do something like this:
    >
    > my ($elem, %count, %intersection);
    > for $elem (keys %ListA, keys %ListB) { $count{$elem}++ }
    > for $elem (keys %count) {
    > $intersection{$elem} = "$ListA{$elem} : $ListB{$elem}"
    > if $count{$elem} > 1;
    > }
    > print "$_\t$intersection{$_}\n" for sort keys %intersection;
    >
    > But provided that it makes sense to start with populating two hashes
    > with the lists of words + comments, I don't really see that the FAQ
    > entry is very well adapted, since applying it would not take advantage
    > of the initial hashes.


    How about this?

    my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    my %count;
    my (%count, @intersection);
    for my $element (keys %ListA, keys %ListB) {
    push @intersection, $element if ++$count{$element} > 1;
    }
    my (%new_ListA, %new_ListB);
    @new_ListA{@intersection} = @ListA{@intersection};
    @new_ListB{@intersection} = @ListB{@intersection};

    It's a bit different from the FAQ, but was directly inspired by it...
    <shrug>

    If memory is a consideration I'd go with deleting the non-intersecting
    elements. Neat idea, and one that didn't occur to me.

    --
    David Wall
    David K. Wall, Feb 23, 2004
    #7
  8. "David K. Wall" <> wrote:

    > my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    > my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    > my %count;
    > my (%count, @intersection);


    Oops, extra declaration. I originally wrote it as a for() loop followed by a
    grep(), but then saw that the grep() could be eliminated and just re-pasted
    the part I changed.

    > for my $element (keys %ListA, keys %ListB) {
    > push @intersection, $element if ++$count{$element} > 1;
    > }
    > my (%new_ListA, %new_ListB);
    > @new_ListA{@intersection} = @ListA{@intersection};
    > @new_ListB{@intersection} = @ListB{@intersection};
    David K. Wall, Feb 23, 2004
    #8
  9. David K. Wall wrote:
    > How about this?
    >
    > my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    > my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    > my (%count, @intersection);
    > for my $element (keys %ListA, keys %ListB) {
    > push @intersection, $element if ++$count{$element} > 1;
    > }
    > my (%new_ListA, %new_ListB);
    > @new_ListA{@intersection} = @ListA{@intersection};
    > @new_ListB{@intersection} = @ListB{@intersection};
    >
    > It's a bit different from the FAQ, but was directly inspired by
    > it... <shrug>


    I still don't think that the FAQ approach is good here. The FAQ deals
    with arrays, and since we are starting with hashes here, you'd better
    take advantage of the ability to look up elements in a hash.

    > If memory is a consideration


    The FAQ approach is indeed memory expensive. OP mentioned "two very
    long lists of words", and this solution creates 6(!) variables with
    lists: %ListA, %ListB, %count, @intersection, %new_ListA and
    %new_ListB.

    > I'd go with deleting the non-intersecting elements. Neat idea, and
    > one that didn't occur to me.


    If you want to keep the original hashes intact, I personally find
    Matt's solution to be the neatest.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Feb 23, 2004
    #9
  10. viv2k

    viv2k Guest

    Thanks for the tips. But does that mean I will have to manually write
    and create the two lists before comparing them?

    Because my list is actually a txt file where after each word there is
    a space and then the number associated and then a comma and then the
    next word and so on. Example:

    apple 1.1, banana 2.2, cat 3.3, etc

    Is there a way of taking the whole text file as input and using
    'space' and 'comma' as delimiters to do the task?

    Thanks
    viv

    Gunnar Hjalmarsson <> wrote in message news:<c1b7qa$1fpfks$-berlin.de>...
    > my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    > my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    >
    > for (keys %ListA) {
    > delete $ListA{$_} unless exists $ListB{$_};
    > }
    > for (keys %ListB) {
    > delete $ListB{$_} unless exists $ListA{$_};
    > }
    >
    > print "ListA\n";
    > print "$_\t$ListA{$_}\n" for sort keys %ListA;
    > print "\n";
    > print "ListB\n";
    > print "$_\t$ListB{$_}\n" for sort keys %ListB;
    viv2k, Feb 23, 2004
    #10
  11. viv2k

    Tore Aursand Guest

    On Sun, 22 Feb 2004 13:20:54 -0800, viv2k wrote:
    > ListA ListB
    >
    > apple 1.1 apple 100
    > banana 2.2 boy 500
    > cat 3.3 cat 1000
    >
    > And I want the result to look something like:
    >
    > ListA ListB
    > apple 1.1 apple 100
    > cat 3.3 cat 1000


    You have probably already seen a few very good solutions from other
    people, but I thought I would give you mine;

    my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    my @common = grep { $ListB{$_} } keys %ListA;
    for ( sort @common ) {
    print "$_\t$ListA{$_}\t$_\t$ListB{$_}\n";
    }

    Anything wrong with this one? I thought about using 'map' for the 'grep'
    part above, but couldn't find a nice way to have it _not_ return anything
    when there's no match, ie;

    my @common = map { (exists $ListB{$_}) ? $_ : undef } keys %ListA;
    foreach ( sort @common ) {
    next unless defined;
    # ...
    }

    Comments/suggestions/corrections are appreciated!


    --
    Tore Aursand <>
    "Writing is a lot like sex. At first you do it because you like it.
    Then you find yourself doing it for a few close friends and people you
    like. But if you're any good at all, you end up doing it for money."
    -- Unknown
    Tore Aursand, Feb 23, 2004
    #11
  12. [ Please respond below the quoted text. ]

    viv2k wrote:
    > Thanks for the tips. But does that mean I will have to manually
    > write and create the two lists before comparing them?


    Only if they are only on a sheet of paper, or something. :)

    > Because my list is actually a txt file where after each word there
    > is a space and then the number associated and then a comma and then
    > the next word and so on. Example:
    >
    > apple 1.1, banana 2.2, cat 3.3, etc
    >
    > Is there a way of taking the whole text file as input and using
    > 'space' and 'comma' as delimiters to do the task?


    Yes, of course. Assuming that neither the words nor the numbers can
    include spaces, this is one way:

    my %ListA;
    open my $fh, '< ListA.txt' or die "Couldn't open ListA.txt $!";
    while (<$fh>) {
    for (split /,\s*/) {
    my ($key, $value) = split;
    $ListA{$key} = $value;
    }
    }
    close $fh;

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Feb 23, 2004
    #12
  13. Tore Aursand wrote:
    > You have probably already seen a few very good solutions from other
    > people, but I thought I would give you mine;
    >
    > my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    > my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    > my @common = grep { $ListB{$_} } keys %ListA;
    > for ( sort @common ) {
    > print "$_\t$ListA{$_}\t$_\t$ListB{$_}\n";
    > }


    <snip>

    > Comments/suggestions/corrections are appreciated!


    Nice. :)

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Feb 23, 2004
    #13
  14. viv2k

    Matt Garrish Guest

    "Tore Aursand" <> wrote in message
    news:p...
    > On Sun, 22 Feb 2004 13:20:54 -0800, viv2k wrote:
    > > ListA ListB
    > >
    > > apple 1.1 apple 100
    > > banana 2.2 boy 500
    > > cat 3.3 cat 1000
    > >
    > > And I want the result to look something like:
    > >
    > > ListA ListB
    > > apple 1.1 apple 100
    > > cat 3.3 cat 1000

    >
    > You have probably already seen a few very good solutions from other
    > people, but I thought I would give you mine;
    >
    > my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    > my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    > my @common = grep { $ListB{$_} } keys %ListA;
    > for ( sort @common ) {
    > print "$_\t$ListA{$_}\t$_\t$ListB{$_}\n";
    > }
    >
    > Anything wrong with this one? I thought about using 'map' for the 'grep'
    > part above, but couldn't find a nice way to have it _not_ return anything
    > when there's no match, ie;
    >
    > my @common = map { (exists $ListB{$_}) ? $_ : undef } keys %ListA;
    > foreach ( sort @common ) {
    > next unless defined;
    > # ...
    > }
    >


    I like your first solution (I wasn't thinking about *not* using a hash,
    myself). As for the second, the best way I can see to use map would be
    without checking the return value:

    my @common;
    map { push @common, $_ if exists $ListB{$_} } keys %ListA;

    Matt
    Matt Garrish, Feb 23, 2004
    #14
  15. Tore Aursand <> wrote:

    > my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    > my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    > my @common = grep { $ListB{$_} } keys %ListA;
    > for ( sort @common ) {
    > print "$_\t$ListA{$_}\t$_\t$ListB{$_}\n";
    > }
    >
    > Anything wrong with this one?


    One nitpick: make it

    my @common = grep { exists $ListB{$_} } keys %ListA;

    in case %listB has a key whose value is false.

    It's interesting to see so many different ways to do it. :)

    --
    David Wall
    David K. Wall, Feb 23, 2004
    #15
  16. David K. Wall <> wrote:

    > Tore Aursand <> wrote:
    >
    >> my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    >> my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    >> my @common = grep { $ListB{$_} } keys %ListA;
    >> for ( sort @common ) {
    >> print "$_\t$ListA{$_}\t$_\t$ListB{$_}\n";
    >> }
    >>
    >> Anything wrong with this one?

    >
    > One nitpick: make it
    >
    > my @common = grep { exists $ListB{$_} } keys %ListA;
    >
    > in case %listB has a key whose value is false.


    Oh, I see you already did that in another part of your post. Never mind.

    I like it better than my adaptation of the 'intersection' FAQ answer.

    --
    David Wall
    David K. Wall, Feb 23, 2004
    #16
  17. viv2k

    gnari Guest

    "Gunnar Hjalmarsson" <> wrote in message
    news:c1ckja$1h59e6$-berlin.de...
    > [ Please respond below the quoted text. ]
    >
    > viv2k wrote:
    >
    > > Because my list is actually a txt file where after each word there
    > > is a space and then the number associated and then a comma and then
    > > the next word and so on. Example:
    > >
    > > apple 1.1, banana 2.2, cat 3.3, etc

    > Yes, of course. Assuming that neither the words nor the numbers can
    > include spaces, this is one way:
    >
    > my %ListA;
    > open my $fh, '< ListA.txt' or die "Couldn't open ListA.txt $!";
    > while (<$fh>) {
    > for (split /,\s*/) {
    > my ($key, $value) = split;
    > $ListA{$key} = $value;
    > }
    > }
    > close $fh;


    or with same assumptions:

    my %ListA;
    open my $fh, '< ListA.txt' or die "Couldn't open ListA.txt $!";
    while (<$fh>) {
    %ListA=(%ListA,split /[\s,]+/);
    }
    close $fh;

    gnari
    gnari, Feb 23, 2004
    #17
  18. viv2k

    Anno Siegel Guest

    Gunnar Hjalmarsson <> wrote in comp.lang.perl.misc:
    > David K. Wall wrote:
    > > How about this?
    > >
    > > my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    > > my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    > > my (%count, @intersection);
    > > for my $element (keys %ListA, keys %ListB) {
    > > push @intersection, $element if ++$count{$element} > 1;
    > > }
    > > my (%new_ListA, %new_ListB);
    > > @new_ListA{@intersection} = @ListA{@intersection};
    > > @new_ListB{@intersection} = @ListB{@intersection};
    > >
    > > It's a bit different from the FAQ, but was directly inspired by
    > > it... <shrug>

    >
    > I still don't think that the FAQ approach is good here. The FAQ deals
    > with arrays, and since we are starting with hashes here, you'd better
    > take advantage of the ability to look up elements in a hash.


    In another branch of the thread we learned (unsurprisingly) that the
    data comes in disk files. This suggests we read one of the files
    into a hash (the smaller one, if there is a difference), and process
    the other one record-by-record.

    > > If memory is a consideration

    >
    > The FAQ approach is indeed memory expensive. OP mentioned "two very
    > long lists of words", and this solution creates 6(!) variables with
    > lists: %ListA, %ListB, %count, @intersection, %new_ListA and
    > %new_ListB.


    Well, that's because the FAQ is over-explicit to get the idea across.
    Once you've grasped that, you're free to cut the cruft. Some FAQ answers
    aren't meant to be direct copy/paste solutions.

    > > I'd go with deleting the non-intersecting elements. Neat idea, and
    > > one that didn't occur to me.


    Unfortunately it requires both hashes in memory.

    > If you want to keep the original hashes intact, I personally find
    > Matt's solution to be the neatest.


    As it seems, there are no original hashes, that was an artifact of the
    first reply in the thread.

    Assuming filehandles $listA and $listB opened (and $/ set to ',', IIRC),
    I'd do the obvious (untested):

    my %listA;
    while ( <$listA> ) {
    chomp; # non-standard $/, split won't do the job
    my( $key, $val) = split;
    $lista{ $key} = $val;
    }

    my %new_listB;
    while ( <$listB> ) {
    chomp;
    my ( $key, $val) = split;
    $new_listB{ $key} = $val if exists $listA{ $key};
    }

    If %new_listA is explicitly needed, it can be derived from the above.

    Anno
    Anno Siegel, Feb 23, 2004
    #18
  19. viv2k

    Anno Siegel Guest

    Tore Aursand <> wrote in comp.lang.perl.misc:
    > On Sun, 22 Feb 2004 13:20:54 -0800, viv2k wrote:
    > > ListA ListB
    > >
    > > apple 1.1 apple 100
    > > banana 2.2 boy 500
    > > cat 3.3 cat 1000
    > >
    > > And I want the result to look something like:
    > >
    > > ListA ListB
    > > apple 1.1 apple 100
    > > cat 3.3 cat 1000

    >
    > You have probably already seen a few very good solutions from other
    > people, but I thought I would give you mine;
    >
    > my %ListA = ( apple => 1.1, banana => 2.2, cat => 3.3 );
    > my %ListB = ( apple => 100, boy => 500, cat => 1000 );
    > my @common = grep { $ListB{$_} } keys %ListA;
    > for ( sort @common ) {
    > print "$_\t$ListA{$_}\t$_\t$ListB{$_}\n";
    > }
    >
    > Anything wrong with this one?


    Nothing, at visual inspection. It's a good high-level approach. Like
    most of the thread, it assumes both hashes in memory. We may want to
    avoid that.

    > I thought about using 'map' for the 'grep'
    > part above, but couldn't find a nice way to have it _not_ return anything
    > when there's no match, ie;
    >
    > my @common = map { (exists $ListB{$_}) ? $_ : undef } keys %ListA;
    > foreach ( sort @common ) {
    > next unless defined;
    > # ...
    > }
    >


    Use "()" where you have tried "undef". Map evaluates its arguments in
    list context, so if one returns nothing, it adds nothing..

    Anno
    Anno Siegel, Feb 23, 2004
    #19
  20. viv2k

    Tore Aursand Guest

    On Mon, 23 Feb 2004 07:58:07 -0500, Matt Garrish wrote:
    > my @common;
    > map { push @common, $_ if exists $ListB{$_} } keys %ListA;


    I know this approach, but I also know that one shouldn't use 'map' in void
    context.


    --
    Tore Aursand <>
    "Anyone who slaps a 'this page is best viewed with Browser X'-label on
    a web page appears to be yearning for the bad old days, before the
    web, when you had very little chance of reading a document written on
    another computer, another word processor or another network." -- Tim
    Berners-Lee, July 1996
    Tore Aursand, Feb 23, 2004
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. sfu
    Replies:
    15
    Views:
    12,339
    William Brogden
    Sep 14, 2003
  2. Denny
    Replies:
    1
    Views:
    766
  3. AbidDF
    Replies:
    0
    Views:
    953
    AbidDF
    Feb 19, 2010
  4. John Reye
    Replies:
    28
    Views:
    1,344
    Tim Rentsch
    May 8, 2012
  5. Jason Mellone
    Replies:
    3
    Views:
    81
    Jurko Gospodnetić
    May 7, 2014
Loading...

Share This Page