A twist on Cookbook recipe 4.6

A

Amer Neely

I'm stuck trying to get a count of duplicate emails in a HoH.

Input is a pipe-delimited file of users with their email addresses.
Element 0 is a unique ID number; 10 is their email.

The problem is some users are present multiple times, so that their ID
number is unique but their email remains the same.

I'm trying to prepend a string to the duplicated emails, so I can use
them to populate a MySQL table along with the ID numbers. The email
addresses will not (likely) work. Not my problem.

What I have is a working script from the 'Perl Cookbook' (recipe #4.6)
that shows me which emails are duplicated.

I'm trying to get a count of those emails ALONG with their respective IDs.

My attempts so far:

#! /usr/bin/perl
use strict;
use warnings;

my ($ID,$Email);
my %SeenEmails=();

open IN,"<",$InFile or die "Can't open $InFile: $!\n";
foreach my $line (<IN>)
{
($ID,$Email) = ((split /\|/, $line)[0,10]);
$SeenEmails{$Email}++;
}
close IN or die "Can't close $InFile: $!\n";;

for (sort keys %SeenEmails)
{
if ($SeenEmails{$_} > 1)
{
print "$_ -> $SeenEmails{$_}\n";
for my $i (1..($SeenEmails{$_}-1))
{
my $thisone=$_;
$thisone = 'dupe' . $i . '_' . $_;
print "\t$thisone\n";
}
}
}

This prepends the email with a dynamically generated string, but I'm
missing the ID number when I populate the hash initially.

If I use
$SeenEmails{$ID}{$Email}++;
I get the ID number, but I'm counting the wrong thing.

How can I get the duplicated emails (and how many there are) with their
respective IDs, so I can prepend my string?
 
B

Ben Morrow

Quoth Amer Neely said:
I'm trying to get a count of those [duplicated] emails ALONG with
their respective IDs.

My attempts so far:

#! /usr/bin/perl
use strict;
use warnings;

my ($ID,$Email);

Don't declare variables before you need them. These should go inside the
loop.
my %SeenEmails=();

open IN,"<",$InFile or die "Can't open $InFile: $!\n";

I would suggest

open my $IN, ...

instead. Global filehandles are as bad as any other global variable.
foreach my $line (<IN>)

This will work, but is a terrible waste of time. <FH> in list context
returns a list of lines, so you are reading in the file, splitting that
into lines, and then going through them one at a time. If you want to go
through one at a time, use while instead:

while (my $line = said:
{
($ID,$Email) = ((split /\|/, $line)[0,10]);
$SeenEmails{$Email}++;

Here you want to store a list of the IDs, instead of just incrementing.
Something like

$SeenEmails{$Email} ||= [];
push @{ $SeenEmails{$Email} }, $ID;

which could be shortend to

push @{ $SeenEmails{$Email} ||= [] }, $ID;

once you're comfortable with the syntax.

If you don't understand refs (the [] and @{...} stuff) read perldoc
perlreftut.
}
close IN or die "Can't close $InFile: $!\n";;

for (sort keys %SeenEmails)
{
if ($SeenEmails{$_} > 1)

my $seen = $SeenEmails{$_};

if (@$seen > 1) {

I don't really understand what you want to generate from this, but
perhaps you want something like

print "$_ -> " . scalar(@$seen) . "\n";

for my $id (@$seen) {
print "\tdupe${id}_$_\n";
}

That print string is quite complicated; it can be split into

"\t" . 'dupe' . $id . '_' . $_ . "\n"

.. A decent highlighting editor will help, of course.

FWIW, while I often use $_ as an implicit 'for' loop variable, I only
use it on the *innermost* loop. I find that once a loop gets large
enough to have sub-loops of its own, it probably deserves a real
variable; and having a $_ around that isn't 'the current item' is
unhelpful and confusing :).
{
print "$_ -> $SeenEmails{$_}\n";
for my $i (1..($SeenEmails{$_}-1))
{
my $thisone=$_;
$thisone = 'dupe' . $i . '_' . $_;
print "\t$thisone\n";
}
}
}

Ben
 
J

John W. Krahn

Ben said:
Quoth Amer Neely said:
{
($ID,$Email) = ((split /\|/, $line)[0,10]);
$SeenEmails{$Email}++;

Here you want to store a list of the IDs, instead of just incrementing.
Something like

$SeenEmails{$Email} ||= [];
push @{ $SeenEmails{$Email} }, $ID;

which could be shortend to

push @{ $SeenEmails{$Email} ||= [] }, $ID;

All you really need is:

push @{ $SeenEmails{ $Email } }, $ID;


(Lookup autovivification in the docs.)



John
 
A

Amer Neely

Ben said:
Quoth Amer Neely said:
I'm trying to get a count of those [duplicated] emails ALONG with
their respective IDs.

My attempts so far:

#! /usr/bin/perl
use strict;
use warnings;

my ($ID,$Email);

Don't declare variables before you need them. These should go inside the
loop.
my %SeenEmails=();

open IN,"<",$InFile or die "Can't open $InFile: $!\n";

I would suggest

open my $IN, ...

instead. Global filehandles are as bad as any other global variable.
foreach my $line (<IN>)

This will work, but is a terrible waste of time. <FH> in list context
returns a list of lines, so you are reading in the file, splitting that
into lines, and then going through them one at a time. If you want to go
through one at a time, use while instead:

while (my $line = said:
{
($ID,$Email) = ((split /\|/, $line)[0,10]);
$SeenEmails{$Email}++;

Here you want to store a list of the IDs, instead of just incrementing.
Something like

$SeenEmails{$Email} ||= [];
push @{ $SeenEmails{$Email} }, $ID;

which could be shortend to

push @{ $SeenEmails{$Email} ||= [] }, $ID;

once you're comfortable with the syntax.

If you don't understand refs (the [] and @{...} stuff) read perldoc
perlreftut.
}
close IN or die "Can't close $InFile: $!\n";;

for (sort keys %SeenEmails)
{
if ($SeenEmails{$_} > 1)

my $seen = $SeenEmails{$_};

if (@$seen > 1) {

I don't really understand what you want to generate from this, but
perhaps you want something like

print "$_ -> " . scalar(@$seen) . "\n";

for my $id (@$seen) {
print "\tdupe${id}_$_\n";
}

That print string is quite complicated; it can be split into

"\t" . 'dupe' . $id . '_' . $_ . "\n"

. A decent highlighting editor will help, of course.

FWIW, while I often use $_ as an implicit 'for' loop variable, I only
use it on the *innermost* loop. I find that once a loop gets large
enough to have sub-loops of its own, it probably deserves a real
variable; and having a $_ around that isn't 'the current item' is
unhelpful and confusing :).
{
print "$_ -> $SeenEmails{$_}\n";
for my $i (1..($SeenEmails{$_}-1))
{
my $thisone=$_;
$thisone = 'dupe' . $i . '_' . $_;
print "\t$thisone\n";
}
}
}

Ben

Ben, thanks for the advice and direction. I can see I need to get a good
handle on references.
 
A

Amer Neely

John said:
Ben said:
Quoth Amer Neely said:
{
($ID,$Email) = ((split /\|/, $line)[0,10]);
$SeenEmails{$Email}++;
Here you want to store a list of the IDs, instead of just incrementing.
Something like

$SeenEmails{$Email} ||= [];
push @{ $SeenEmails{$Email} }, $ID;

which could be shortend to

push @{ $SeenEmails{$Email} ||= [] }, $ID;

All you really need is:

push @{ $SeenEmails{ $Email } }, $ID;


(Lookup autovivification in the docs.)



John

Thank you for the alternative John, I'll do more reading :)
 
B

Ben Morrow

Quoth "John W. Krahn said:
Ben said:
push @{ $SeenEmails{$Email} ||= [] }, $ID;

All you really need is:

push @{ $SeenEmails{ $Email } }, $ID;

(Lookup autovivification in the docs.)

Duh... I even knew that, I'd just temporarily forgotten (and wasn't
trusting Perl to DWIM :) ). Thanks for the reminder...

Ben
 
J

John W. Krahn

Ben said:
Quoth "John W. Krahn said:
Ben said:
push @{ $SeenEmails{$Email} ||= [] }, $ID;
All you really need is:

push @{ $SeenEmails{ $Email } }, $ID;

(Lookup autovivification in the docs.)

Duh... I even knew that, I'd just temporarily forgotten (and wasn't
trusting Perl to DWIM :) ). Thanks for the reminder...

No problemo. :)



John
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,776
Messages
2,569,603
Members
45,190
Latest member
Martindap

Latest Threads

Top