A twist on Cookbook recipe 4.6

Discussion in 'Perl Misc' started by Amer Neely, Mar 22, 2007.

  1. Amer Neely

    Amer Neely Guest

    I'm stuck trying to get a count of duplicate emails in a HoH.

    Input is a pipe-delimited file of users with their email addresses.
    Element 0 is a unique ID number; 10 is their email.

    The problem is some users are present multiple times, so that their ID
    number is unique but their email remains the same.

    I'm trying to prepend a string to the duplicated emails, so I can use
    them to populate a MySQL table along with the ID numbers. The email
    addresses will not (likely) work. Not my problem.

    What I have is a working script from the 'Perl Cookbook' (recipe #4.6)
    that shows me which emails are duplicated.

    I'm trying to get a count of those emails ALONG with their respective IDs.

    My attempts so far:

    #! /usr/bin/perl
    use strict;
    use warnings;

    my ($ID,$Email);
    my %SeenEmails=();

    open IN,"<",$InFile or die "Can't open $InFile: $!\n";
    foreach my $line (<IN>)
    {
    ($ID,$Email) = ((split /\|/, $line)[0,10]);
    $SeenEmails{$Email}++;
    }
    close IN or die "Can't close $InFile: $!\n";;

    for (sort keys %SeenEmails)
    {
    if ($SeenEmails{$_} > 1)
    {
    print "$_ -> $SeenEmails{$_}\n";
    for my $i (1..($SeenEmails{$_}-1))
    {
    my $thisone=$_;
    $thisone = 'dupe' . $i . '_' . $_;
    print "\t$thisone\n";
    }
    }
    }

    This prepends the email with a dynamically generated string, but I'm
    missing the ID number when I populate the hash initially.

    If I use
    $SeenEmails{$ID}{$Email}++;
    I get the ID number, but I'm counting the wrong thing.

    How can I get the duplicated emails (and how many there are) with their
    respective IDs, so I can prepend my string?

    --
    Amer Neely
    w: www.softouch.on.ca/
     
    Amer Neely, Mar 22, 2007
    #1
    1. Advertising

  2. Amer Neely

    Ben Morrow Guest

    Quoth Amer Neely <>:
    >
    > I'm trying to get a count of those [duplicated] emails ALONG with
    > their respective IDs.
    >
    > My attempts so far:
    >
    > #! /usr/bin/perl
    > use strict;
    > use warnings;
    >
    > my ($ID,$Email);


    Don't declare variables before you need them. These should go inside the
    loop.

    > my %SeenEmails=();
    >
    > open IN,"<",$InFile or die "Can't open $InFile: $!\n";


    I would suggest

    open my $IN, ...

    instead. Global filehandles are as bad as any other global variable.

    > foreach my $line (<IN>)


    This will work, but is a terrible waste of time. <FH> in list context
    returns a list of lines, so you are reading in the file, splitting that
    into lines, and then going through them one at a time. If you want to go
    through one at a time, use while instead:

    while (my $line = <IN>) {

    > {
    > ($ID,$Email) = ((split /\|/, $line)[0,10]);
    > $SeenEmails{$Email}++;


    Here you want to store a list of the IDs, instead of just incrementing.
    Something like

    $SeenEmails{$Email} ||= [];
    push @{ $SeenEmails{$Email} }, $ID;

    which could be shortend to

    push @{ $SeenEmails{$Email} ||= [] }, $ID;

    once you're comfortable with the syntax.

    If you don't understand refs (the [] and @{...} stuff) read perldoc
    perlreftut.

    > }
    > close IN or die "Can't close $InFile: $!\n";;
    >
    > for (sort keys %SeenEmails)
    > {
    > if ($SeenEmails{$_} > 1)


    my $seen = $SeenEmails{$_};

    if (@$seen > 1) {

    I don't really understand what you want to generate from this, but
    perhaps you want something like

    print "$_ -> " . scalar(@$seen) . "\n";

    for my $id (@$seen) {
    print "\tdupe${id}_$_\n";
    }

    That print string is quite complicated; it can be split into

    "\t" . 'dupe' . $id . '_' . $_ . "\n"

    .. A decent highlighting editor will help, of course.

    FWIW, while I often use $_ as an implicit 'for' loop variable, I only
    use it on the *innermost* loop. I find that once a loop gets large
    enough to have sub-loops of its own, it probably deserves a real
    variable; and having a $_ around that isn't 'the current item' is
    unhelpful and confusing :).

    > {
    > print "$_ -> $SeenEmails{$_}\n";
    > for my $i (1..($SeenEmails{$_}-1))
    > {
    > my $thisone=$_;
    > $thisone = 'dupe' . $i . '_' . $_;
    > print "\t$thisone\n";
    > }
    > }
    > }


    Ben

    --
    2 + 2 = 5 for sufficiently large values of 2
    []
     
    Ben Morrow, Mar 22, 2007
    #2
    1. Advertising

  3. Ben Morrow wrote:
    >
    > Quoth Amer Neely <>:
    >>
    >> {
    >> ($ID,$Email) = ((split /\|/, $line)[0,10]);
    >> $SeenEmails{$Email}++;

    >
    > Here you want to store a list of the IDs, instead of just incrementing.
    > Something like
    >
    > $SeenEmails{$Email} ||= [];
    > push @{ $SeenEmails{$Email} }, $ID;
    >
    > which could be shortend to
    >
    > push @{ $SeenEmails{$Email} ||= [] }, $ID;


    All you really need is:

    push @{ $SeenEmails{ $Email } }, $ID;


    (Lookup autovivification in the docs.)



    John
    --
    Perl isn't a toolbox, but a small machine shop where you can special-order
    certain sorts of tools at low cost and in short order. -- Larry Wall
     
    John W. Krahn, Mar 22, 2007
    #3
  4. Amer Neely

    Amer Neely Guest

    Ben Morrow wrote:
    > Quoth Amer Neely <>:
    >> I'm trying to get a count of those [duplicated] emails ALONG with
    >> their respective IDs.
    >>
    >> My attempts so far:
    >>
    >> #! /usr/bin/perl
    >> use strict;
    >> use warnings;
    >>
    >> my ($ID,$Email);

    >
    > Don't declare variables before you need them. These should go inside the
    > loop.
    >
    >> my %SeenEmails=();
    >>
    >> open IN,"<",$InFile or die "Can't open $InFile: $!\n";

    >
    > I would suggest
    >
    > open my $IN, ...
    >
    > instead. Global filehandles are as bad as any other global variable.
    >
    >> foreach my $line (<IN>)

    >
    > This will work, but is a terrible waste of time. <FH> in list context
    > returns a list of lines, so you are reading in the file, splitting that
    > into lines, and then going through them one at a time. If you want to go
    > through one at a time, use while instead:
    >
    > while (my $line = <IN>) {
    >
    >> {
    >> ($ID,$Email) = ((split /\|/, $line)[0,10]);
    >> $SeenEmails{$Email}++;

    >
    > Here you want to store a list of the IDs, instead of just incrementing.
    > Something like
    >
    > $SeenEmails{$Email} ||= [];
    > push @{ $SeenEmails{$Email} }, $ID;
    >
    > which could be shortend to
    >
    > push @{ $SeenEmails{$Email} ||= [] }, $ID;
    >
    > once you're comfortable with the syntax.
    >
    > If you don't understand refs (the [] and @{...} stuff) read perldoc
    > perlreftut.
    >
    >> }
    >> close IN or die "Can't close $InFile: $!\n";;
    >>
    >> for (sort keys %SeenEmails)
    >> {
    >> if ($SeenEmails{$_} > 1)

    >
    > my $seen = $SeenEmails{$_};
    >
    > if (@$seen > 1) {
    >
    > I don't really understand what you want to generate from this, but
    > perhaps you want something like
    >
    > print "$_ -> " . scalar(@$seen) . "\n";
    >
    > for my $id (@$seen) {
    > print "\tdupe${id}_$_\n";
    > }
    >
    > That print string is quite complicated; it can be split into
    >
    > "\t" . 'dupe' . $id . '_' . $_ . "\n"
    >
    > . A decent highlighting editor will help, of course.
    >
    > FWIW, while I often use $_ as an implicit 'for' loop variable, I only
    > use it on the *innermost* loop. I find that once a loop gets large
    > enough to have sub-loops of its own, it probably deserves a real
    > variable; and having a $_ around that isn't 'the current item' is
    > unhelpful and confusing :).
    >
    >> {
    >> print "$_ -> $SeenEmails{$_}\n";
    >> for my $i (1..($SeenEmails{$_}-1))
    >> {
    >> my $thisone=$_;
    >> $thisone = 'dupe' . $i . '_' . $_;
    >> print "\t$thisone\n";
    >> }
    >> }
    >> }

    >
    > Ben


    Ben, thanks for the advice and direction. I can see I need to get a good
    handle on references.

    --
    Amer Neely
    w: www.softouch.on.ca/
     
    Amer Neely, Mar 22, 2007
    #4
  5. Amer Neely

    Amer Neely Guest

    John W. Krahn wrote:
    > Ben Morrow wrote:
    >> Quoth Amer Neely <>:
    >>> {
    >>> ($ID,$Email) = ((split /\|/, $line)[0,10]);
    >>> $SeenEmails{$Email}++;

    >> Here you want to store a list of the IDs, instead of just incrementing.
    >> Something like
    >>
    >> $SeenEmails{$Email} ||= [];
    >> push @{ $SeenEmails{$Email} }, $ID;
    >>
    >> which could be shortend to
    >>
    >> push @{ $SeenEmails{$Email} ||= [] }, $ID;

    >
    > All you really need is:
    >
    > push @{ $SeenEmails{ $Email } }, $ID;
    >
    >
    > (Lookup autovivification in the docs.)
    >
    >
    >
    > John


    Thank you for the alternative John, I'll do more reading :)

    --
    Amer Neely
    w: www.softouch.on.ca/
     
    Amer Neely, Mar 22, 2007
    #5
  6. Amer Neely

    Ben Morrow Guest

    Quoth "John W. Krahn" <>:
    > Ben Morrow wrote:
    > >
    > > push @{ $SeenEmails{$Email} ||= [] }, $ID;

    >
    > All you really need is:
    >
    > push @{ $SeenEmails{ $Email } }, $ID;
    >
    > (Lookup autovivification in the docs.)


    Duh... I even knew that, I'd just temporarily forgotten (and wasn't
    trusting Perl to DWIM :) ). Thanks for the reminder...

    Ben

    --
    Razors pain you / Rivers are damp
    Acids stain you / And drugs cause cramp. [Dorothy Parker]
    Guns aren't lawful / Nooses give
    Gas smells awful / You might as well live.
     
    Ben Morrow, Mar 23, 2007
    #6
  7. Ben Morrow wrote:
    > Quoth "John W. Krahn" <>:
    >>Ben Morrow wrote:
    >>> push @{ $SeenEmails{$Email} ||= [] }, $ID;

    >>All you really need is:
    >>
    >> push @{ $SeenEmails{ $Email } }, $ID;
    >>
    >>(Lookup autovivification in the docs.)

    >
    > Duh... I even knew that, I'd just temporarily forgotten (and wasn't
    > trusting Perl to DWIM :) ). Thanks for the reminder...


    No problemo. :)



    John
    --
    Perl isn't a toolbox, but a small machine shop where you can special-order
    certain sorts of tools at low cost and in short order. -- Larry Wall
     
    John W. Krahn, Mar 23, 2007
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Raaijmakers, Vincent \(GE Infrastructure\)

    Looking for cookbook/recipe generating multipart web data

    Raaijmakers, Vincent \(GE Infrastructure\), Apr 5, 2004, in forum: Python
    Replies:
    0
    Views:
    293
    Raaijmakers, Vincent \(GE Infrastructure\)
    Apr 5, 2004
  2. Gabriel Rossetti
    Replies:
    0
    Views:
    360
    Gabriel Rossetti
    Mar 18, 2008
  3. Gabriel Rossetti
    Replies:
    1
    Views:
    325
    wesley chun
    Mar 19, 2008
  4. Gabriel Rossetti
    Replies:
    1
    Views:
    350
    George Sakkis
    Mar 18, 2008
  5. Len Lawrence

    Ruby Cookbook - problem recipe

    Len Lawrence, Jan 21, 2007, in forum: Ruby
    Replies:
    0
    Views:
    89
    Len Lawrence
    Jan 21, 2007
Loading...

Share This Page