Scanning @array elements for similar content


Randy

Hello,

I have a text file that stores names and email addresses. This data is built
from a feedback form on my website. Here is the format of my text file
entries:

Dan Smith,[email protected]
Mike Roberts,[email protected]
Steve Anderson,[email protected]

and so on.

As you can see, it's pretty much a standard CSV text file. Over time, this
database has grown very big, and there are several duplicate email addresses
in the data. Until recently I have had to visually go through the data and
remove duplicate email addresses I can find, regardless of what is found in
the name field. I am seeking assistance on how I could write a script that
would scan each line, separate the names field from the email address field,
then scan and remove duplicates. So far all I have is the following:

#!/usr/bin/perl

use CGI;
use CGI::Carp qw(fatalsToBrowser);
use strict;

my (@data, $data, $name, $email);

open (FH, "<data.txt") or die "Can't open file: $!";
@data=<FH>;
close(FH);

foreach $data (@data) {
    chomp ($data);
    ($name,$email)=split(/\,/,$data);

    # Missing scan for duplicates and removal code here
}

open (FH, ">data.txt") or die "Can't open file: $!";
print FH @data;
close(FH);

Yes I am a newbie Perl programmer. I'm not very good at brainstorming an
approach to sorting/matching routines. I would very much appreciate some
help understanding and building the final element. Another complication is
what if there are two identical email addresses but one is all caps and the
other isn't. I'm not looking for someone to write me the code I need,
instead to point me in the right direction so that I actually learn
something and further my Perl skills. Thanks everyone.

Robert
 

Tad McClellan

Randy said:
I have a text file that stores names and email addresses.

I am seeking assistance on how I could write a script that
would scan each line,


You have that already!

(though poorly done)

separate the names field from the email address field,


You have that already too.

then scan and remove duplicates.


perldoc -q duplicate

How can I remove duplicate elements from a list or array?

(pay particular attention to the last sentence of the answer given there.)
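
For reference, the answer there builds up to a hash-based filter, roughly
(with @list standing in for whatever array you have):

my %seen;
my @unique = grep { ! $seen{$_}++ } @list;   # keeps the first occurrence of each element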

You are expected to check the Perl FAQ *before* posting to
the Perl newsgroup, you know.

use strict;


Very good, but you should also have:

use warnings;

(and look in your server error logs for its output, or,
even better, run your CGI program from the command line
during early development, rather than in the CGI environment)

open (FH, "<data.txt") or die "Can't open file: $!";
@data=<FH>;
close(FH);

foreach $data (@data) {


It is bad practice to read an entire file into memory only to
process it line-by-line anyway.

Why not just read and process line-by-line?

chomp ($data);
($name,$email)=split(/\,/,$data);


Whitespace is not a scarce resource; feel free to use as much of
it as you like to make your code easier to read.

# Missing scan for duplicates and removal code here
}


my %emails;
while ( my $data = <INPUT> ) { # untested
    chomp $data;
    my($name, $email) = split(/\,/, $data);
    $emails{$email} = $name;
}

foreach my $adr ( sort keys %emails ) {
    print OUTPUT "$emails{$adr},$adr\n";
}

Yes I am a newbie Perl programmer.


I was thinking that you are a newbie to programming itself.

I would very much appreciate some
help understanding and building the final element.


Use a hash to eliminate duplicates.

Another complication is
what if there are two identical email addresses but one is all caps and the
other isn't.


You need to decide what to do, then we can help you write Perl
code that does that.

You could perhaps just normalize them all to a single case
before storing or searching the hash:

perldoc -f uc
perldoc -f lc
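
Applied to the loop sketched above, that could be as little as (untested):

my $key = lc $email;    # fold case so Foo@Example.com and foo@example.com collide
$emails{$key} = $name;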

I'm not looking for someone to write me the code I need,


Oops, too late. :)

instead to point me in the right direction so that I actually learn
something and further my Perl skills.


A depressingly infrequent display of Good Attitude for this here group.

Good for you! (and us)
 

Randy

Tad McClellan said:
perldoc -q duplicate

How can I remove duplicate elements from a list or array?

(pay particular attention to the last sentence of the answer given there.)

You are expected to check the Perl FAQ *before* posting to
the Perl newsgroup you know.


I had actually checked the perldoc for this and did find ways to remove
duplicate array entries, but didn't know what to do when I wanted to match
the specific split item, i.e. match the email address only, not the
entire array element.

Very good, but you should also have:

use warnings;

(and look in your server error logs for its output, or,
even better, run your CGI program from the command line
during early development, rather than in the CGI environment)


I have also added 'use warnings'. Thank you.

It is bad practice to read an entire file into memory only to
process it line-by-line anyway.

Why not just read and process line-by-line?


Agreed, I now use the CPAN module File::Slurp to read text file entries into
an @array efficiently.
http://search.cpan.org/~uri/File-Slurp-9999.09/lib/File/Slurp.pm
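
A minimal read with that module is roughly:

use File::Slurp;
my @lines = read_file('data.txt');   # in list context, one element per line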

Whitespace is not a scarce resource, feel free to use as much of
it as you like to make your code easier to read.


Point noted.

You need to decide what to do, then we can help you write Perl
code that does that.

You could perhaps just normalize them all to a single case
before storing or searching the hash:


During the split phase where I separate the $name from the $email, I now use
this regex: $email =~ tr/A-Z/a-z/;

A depressingly infrequent display of Good Attitude for this here group.

Good for you! (and us)


You are correct Tad, I am new to programming in general. I'm trying my best
to better understand the basics. Here is the final code I use to remove the
duplicate entries, and it does do its job:

#!/usr/bin/perl

use CGI;
use CGI::Carp qw(fatalsToBrowser);

use strict;
use warnings;

open (INPUT, "<data.txt") or die "Can't open file: $!";
my %entries;
while ( my $data = <INPUT> ) {
    chomp $data;
    my ($name, $email) = split(/\,/, $data);
    $name =~ s/(\w+)/\u\L$1/g;
    $email =~ tr/A-Z/a-z/;
    $entries{$email} = $name;
}
close(INPUT);

foreach my $adr ( sort keys %entries ) {
    print "$entries{$adr},$adr\n";
}

exit;

That said, I'm not entirely certain what part of the code IS actually
detecting and removing the duplicate entries. I have a hunch that this is
taking place in the foreach loop. I created a test data.txt file and
manually entered several duplicate email addresses. When the script is run,
any duplicate is removed; it seems to kill duplicates from the top down, i.e.
if (e-mail address removed) was found on 5 lines, it keeps the last occurrence... or
maybe it removes all but the last alphabetically sorted item.

Tad, thank you for this. I would like to ask one final question on this
matter ... right now, when the script is run, it prints to screen all remaining
hash entries without any duplicates. Under that I would like it to show
which entries got removed. I assume that to do this, I would need to modify the
script to push any matched duplicates into a secondary array or hash and
then print that last. Perhaps not. Your thoughts are appreciated.

Robert

P.S. You're going to laugh at this, but until recently I have never used the
command 'use strict'. To be honest, I'm not 100% certain what exactly this
does, or how it is benefiting me or the script. All I know for certain is
that without adding "my" to variable definitions, the script doesn't
work/run. Most articles I have read online highly recommend using this
command but don't go into great detail why. I ask you this because I wish to
better my understanding of Perl and to ensure I write proper scripts in the
future.
 

Jürgen Exner

Randy said:
During the split phase where I separate the $name from the $email, I now use
this regex: $email =~ tr/A-Z/a-z/;

Just to be nitpicking: tr/// does not use REs. That's one big difference
from s///.

And it's better to use the function lc() instead of your tr/// code because
lc() handles non-English characters correctly too, while your code fails
for anything outside the basic 26 Latin characters.
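
For example, with an umlaut in the address (assuming a UTF-8 source file
and 'use utf8'):

use utf8;
my $addr = 'MÜLLER@EXAMPLE.COM';
(my $t = $addr) =~ tr/A-Z/a-z/;   # 'mÜller@example.com' - the Ü survives
my $lc  = lc $addr;               # 'müller@example.com' - lc folds it too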

jue
 

Randy

Randy said:
That said, I'm not entirely certain what part of the code IS actually
detecting and removing the duplicate entries. I have a hunch that this is
taking place in the foreach loop.

Tad, I did a little more research on hashes. I now think the duplicate
elimination is NOT happening during the foreach loop; that loop is just
sorting the hash and printing it. Instead it is occurring when you are
defining each hash element in the initial while loop. I think this happens
because you are assigning (your method) the key as the email address and the
value as the name. In doing so you can't have duplicate key names?!? So
the hash just ignores a request for a duplicate key name?!?

If I'm wrong about this, I hope you don't think less of me ... I really am
trying to learn.

Robert
 

Damian James

... so
the hash just ignores a request for a duplicate key name?!?

No, the second assignment simply overwrites the first one:

my %blah;
$blah{ x } = 'test 1';
$blah{ x } = 'test 2';
print "$blah{ x }\n";    # prints "test 2"

If I'm wrong about this, I hope you don't think less of me ... I really am
trying to learn.

Pfft, we've all been there. Never care about seeming foolish when
the object is to learn. It's the folks who try to look like they
already know everything who are foolish.

--damian
 

Brian McCauley

Randy said:
I had actually checked the perldoc for this and did find ways to remove
duplicate array entries, but didn't know what to do when I wanted to match
the specific split item, i.e. match the email address only, not the
entire array element.

Agreed, I now use the CPAN module File::Slurp to read text file entries into
an @array efficiently.

If you have a need to slurp then File::Slurp will do so efficiently, but
you have no need. It is better to read a line at a time, as Tad showed.
I see, looking at the end of the post, that you have indeed done so. Good.
During the split phase where I separate the $name from the $email, I now use
this regex: $email =~ tr/A-Z/a-z/;

There is no regex there. I agree with Tad that the lc() function would
be better than tr///.

$email = lc $email;
Ditto.

You are correct Tad, I am new to programming in general. I'm trying my best
to better understand the basics. Here is the final code I use to remove the
duplicate entries, and it does do its job:

It looks good. I will now proceed to criticise it, but don't let this
detract from the fact that it is good.
#!/usr/bin/perl

use CGI;
use CGI::Carp qw(fatalsToBrowser);

Is this a CGI script? It doesn't look like one.
use strict;
use warnings;

Generally best to put these two ASAP. That way you'll even get their
protection in your other use statements. The only things I like to see
above these two are comments and, in the case of a module, a package
directive.
open (INPUT, "<data.txt") or die "Can't open file: $!";
my %entries;
while ( my $data = <INPUT> ) {
chomp $data;
my ($name, $email) = split(/\,/, $data);

No need to backslash the comma in a regex. I'm not as paranoid about
leaning toothpick syndrome as Tad but I wouldn't bother here.
$name =~ s/(\w+)/\u\L$1/g;

OK, nothing whatever to do with Perl, but this is bad. There are a lot
of names (like mine) that have non-trivial capitalization. You risk
offending and alienating many people. This has been oft discussed here.
There is no solution, as sometimes there can be two distinct names that
differ only in capitalization.
$email =~ tr/A-Z/a-z/;
$entries{$email} = $name;
}
close(INPUT);

Your code looks nice but your use of indentation between the open/close
is rather unconventional.
foreach my $adr ( sort keys %entries ) {
print "$entries{$adr},$adr\n";
}

exit;

It is more conventional just to let perl fall off the end of your script
and exit() implicitly.
That said, I'm not entirely certain what part of the code IS actually
detecting and removing the duplicate entries. I have a hunch that this is
taking place in the foreach loop.

No - it is the line

$entries{$email} = $name;

If you encounter a second record in the input with an e-mail address
that's been encountered before the above line will replace the old entry
in %entries with a new one, thus forgetting all but the last entry with
a given e-mail.
.. if (e-mail address removed) was found on 5 lines, it keeps the last occurrence...
Yep.

Tad, thank you for this. I would like to ask one final question on this
matter ... right now, when the script is run, it prints to screen all remaining
hash entries without any duplicates. Under that I would like it to show
which entries got removed. I assume to do this, I would need to modify the
script to push any matched duplicates into a secondary array or hash and
then print that last.

Yes that would work.

my @duplicates;    # declare this before the while loop

if ( defined $entries{$email} ) {
    push @duplicates => $data;
} else {
    $entries{$email} = $name;
}

Note - this now preserves the first instance of each address and puts
the rest into @duplicates.
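
The report you asked about then falls out almost for free, e.g. (untested):

print "\nRemoved duplicates:\n";
print "$_\n" for @duplicates;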
P.S. You're going to laugh at this, but until recently I have never used the
command 'use strict'. To be honest, I'm not 100% certain what exactly this
does, or how it is benefiting me or the script. All I know for certain is
that without adding "my" to variable definitions, the script doesn't
work/run.

Yes, that is probably the most noticeable of the three effects. Without
'use strict' perl will treat the first mention of an undeclared variable
as an implicit declaration of a package-scoped variable (well, kinda).
This can be a great convenience in 1-line scripts but is generally a
liability in scripts longer than about 10 lines.
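
A tiny illustration of that implicit declaration (hypothetical names, no
'use strict'):

$count = 1;         # silently creates the package variable $main::count
print $count + 1;   # prints 2; a misspelling here would just be a new, undef variable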
Most articles I have read online highly recommend using this
command but don't go into great detail why. I ask you this because I wish to
better my understanding of Perl and to ensure I write proper scripts in the
future.

I would argue (and indeed have argued with giants) that it is best to
see 'use strict' as disabling three fairly obscure features and that
understanding of these features is something that should not concern
people too early in their learning of Perl.

http://groups-beta.google.com/group/comp.lang.perl.misc/msg/89f307d6b9e83c65
 

Tad McClellan

[snip stuff already answered in other followups]

P.S. You're going to laugh at this


No I'm not.

I used to program without it myself (because it did not yet exist :)

but until recently I have never used the
command 'use strict'.
^^^^^^^

It is more properly called a "pragma", which is fancy college-talk
for a "compiler directive".

To be honest, I'm not 100% certain what exactly this
does,


You can read its docs with

perldoc strict

or how it is benefiting me or the script.


It finds common mistakes.

"strict vars" in particular, finds typos, which is a very common
mistake made by all of us.
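
For example:

use strict;
my $total = 0;
$tota1 += 5;    # typo ('1' for 'l'): a compile-time error under strict,
                # instead of a silently-created second variable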

All I know for certain is
that without adding "my" to variable definitions, the script doesn't
work/run.


When you put "use strict" in your program, you are making a
promise to perl.

I promise to declare all of my variables before I use them
(or I will use their fully qualified package name).

If you break your promise, perl will refuse to run your program.

Most articles I have read online highly recommend using this
command but don't go into great detail why.


It finds bugs in microseconds rather than in a bazillion-microsecond
debugging session.

I ask you this because I wish to
better my understanding of Perl and to ensure I write proper scripts in the
future.


strict and warnings will save you time by finding common
mistakes *for you*.
 

RedGrittyBrick

Randy said:
I have a text file that stores names and email addresses. [...] Over time, this
database has grown very big, and there are several duplicate email addresses
in the data. [...] I am seeking assistance on how I could write a script that
would scan each line, separate the names field from the email address field,
then scan and remove duplicates.

[snip code and the rest of the original post]

Rather than removing duplicates, I'd not insert them:

#!perl
use strict;
use warnings;

my %seen;    # must be declared under strict
open my $fh, '<', 'data.txt'
    or die "unable to open data.txt because $!";
while (<$fh>) {
    chomp;
    my ($name, $address) = split(/\,/, $_, 2);
    # postfix ++ returns the old count: false (0) the first time an
    # address is seen, true ever after
    print "$name,$address\n" unless $seen{$address}++;
}
close $fh;

Untested.
 
