How to test if a string is already in an array

Marc Eggenberger

Hi there.

I haven't done Perl coding for some years now and have forgotten a lot, so bear
with me ...

I open a text file, read it line by line and do some splitting and
substr to get an email address out of each line. Now the text file has some
10k lines and a lot of dublicate mail addresses. I only need each
email address once .. a bit like a select distinct emailaddress would be
in SQL.

My thought now was to create an array and test if the address is already
in the array and, if not, push it into the array. I don't need to have
the position of the address in the array ... so I thought of something
like

if(! exists $address_array[$address]=
{
push(@address_array,$address);
}

this of course does not work ...

How would I achieve my goal?

Thanks for any help

Marc
 
John Bokma

my %address_hash;


and in your loop:

$address_hash{ $address } = 1;

(no need for the exists test; assigning the same key twice is harmless)


keys %address_hash gives the unique addresses.

BTW: put use strict; use warnings; at the top of your script.
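
For completeness, here is a minimal, self-contained sketch of the idea above. The helper name unique_addresses and the sample addresses are made up for illustration; only the hash-as-a-set technique comes from this post.

```perl
use strict;
use warnings;

# Return the distinct strings from a list, using hash keys as a set.
# unique_addresses is a hypothetical helper name, not from the thread.
sub unique_addresses {
    my @addresses = @_;
    my %seen;
    $seen{$_} = 1 for @addresses;   # storing the same key twice is harmless
    return sort keys %seen;         # keys come back in no particular order
}

# Made-up sample data, for illustration only.
my @sample = ('a@domain.ch', 'b@domain.ch', 'a@domain.ch');
print "$_\n" for unique_addresses(@sample);
```

The sort is optional; it is only there because a hash does not remember insertion order.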
 
Maxim

if(! exists $address_array[$address])
{
push(@address_array,$address);
}

Checking the whole array for the string every time would yield O(n^2) complexity.
The easiest way (I guess) is the following, which is O(n) on average since each hash lookup or insert takes expected constant time:

my %address_hash;

if( ! exists $address_hash{$address} )
{
$address_hash{$address} = 1;
}

my @address_array = keys %address_hash;

Hope this helps
 
Eric Bohlman

Marc Eggenberger wrote:
I havent done perl coding for some years now and forgot a lot so bear
with me ...

I open a text file, read it line by line and do some splitting and
substr to get an emailadress of the line. Now the textfile has some
10k lines and a lot of dublicate mail addresses. I only need each
emailaddress once .. a bit like a select distinct emailadress would be
in SQL.

When you start saying words like "duplicate" or "distinct," you should be
immediately thinking "hash."

my %addresses;
....
$addresses{$address}=1;
....
foreach my $address (keys %addresses) {
#do something with the address
}
 
marc.eggenberger

ok ... I changed it .. but when I run my new script it prints addresses
more than once .. why is that?

Here's my code:

#!/usr/bin/perl
use strict;
use warnings;

my $textfile = 'empfaenger.txt';

open(EMPFAENGER, $textfile) || die("Could not open file $textfile");

my @raw_data = <EMPFAENGER>;
my %ad_hash;

foreach my $line (@raw_data)
{
my @fields = split(/ /, $line);
my @fields2 = split(/=/, $fields[6]);
my $address = $fields2[1];
$address = substr($address,1,length($address) - 3);

if(index($address,"domain.ch") > 0)
{
$ad_hash{$address} = 1;
}

foreach my $key(keys(%ad_hash))
{
print $key . "\n";
}
}
close(EMPFAENGER);
 
John Bokma

marc.eggenberger wrote:
ok ... I changed it .. but when I run my new script it prints adresses
more than once .. why is that?

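Because you print the keys inside the loop, so the whole hash is printed again after every line you process. Move the printing loop below the closing brace of the foreach.
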
Here's my code:

#!/usr/bin/perl
use strict;
use warnings;

my $textfile = 'empfaenger.txt';

open(EMPFAENGER, $textfile) || die("Could not open file $textfile");

open my $fh, $textfile or die "Can't open '$textfile': $!";

$! holds the reason the open failed; if you don't print it, you get a
fairly meaningless error message.

my @raw_data = <EMPFAENGER>;

If you do this, you can close the file now rather than after the loop. Or read it line by line instead:

my %ad_hash;

while ( my $line = <$fh> ) {

my %ad_hash;

foreach my $line (@raw_data)
{
my @fields = split(/ /, $line);
my @fields2 = split(/=/, $fields[6]);
my $address = $fields2[1];
$address = substr($address,1,length($address) - 3);

This could probably be done in a shorter way :-D

if(index($address,"domain.ch") > 0)
{
$ad_hash{$address} = 1;
}

$ad_hash{ $address } = 1 if index( $address, "domain.ch") > 0;
}

or:

index( $address, "domain.ch" ) > 0 and $ad_hash{ $address } = 1;
}

or:

index( $address, "domain.ch" ) > 0 or next;
$ad_hash{ $address } = 1;
}

then close:

close $fh or die "Can't close '$textfile': $!";

foreach my $key(keys(%ad_hash))
{
print $key . "\n";
}

You can write the print as:

print "$key\n";

A shorter way to print them all:

print "$_\n" for keys %ad_hash;

or

print map { "$_\n" } keys %ad_hash;
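
Putting these points together, a corrected version might look like the sketch below. It keeps the original split/substr parsing but wraps it in a sub (collect_addresses is a hypothetical name), reads line by line, and prints only after the loop. Since the real layout of empfaenger.txt is not shown, the sample lines are made up and are read from an in-memory filehandle; the three-argument open is my choice here, not something from the original script.

```perl
use strict;
use warnings;

# Collect the distinct addresses containing $domain from an open filehandle.
# collect_addresses is a hypothetical name; the parsing steps follow the
# original script (7th space-separated field, part after '=', trim 1 + 2 chars).
sub collect_addresses {
    my ($fh, $domain) = @_;
    my %ad_hash;
    while ( my $line = <$fh> ) {
        my @fields = split / /, $line;
        next unless defined $fields[6];          # skip short lines
        my @fields2 = split /=/, $fields[6];
        next unless defined $fields2[1];         # skip fields without '='
        my $address = substr( $fields2[1], 1, length( $fields2[1] ) - 3 );
        $ad_hash{$address} = 1 if index( $address, $domain ) > 0;
    }
    return sort keys %ad_hash;
}

# Hypothetical sample input, for illustration only.
my $sample = <<'END';
f0 f1 f2 f3 f4 f5 to=<anna@domain.ch>, rest
f0 f1 f2 f3 f4 f5 to=<bert@domain.ch>, rest
f0 f1 f2 f3 f4 f5 to=<anna@domain.ch>, rest
END

open my $fh, '<', \$sample or die "Can't open in-memory sample: $!";
my @addresses = collect_addresses( $fh, 'domain.ch' );
close $fh or die "Can't close in-memory sample: $!";

# Print once, after the loop has finished.
print "$_\n" for @addresses;
```

With a real file, replace the in-memory open with an open on $textfile; the rest stays the same.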
 
Tad McClellan

Marc Eggenberger said:
I havent done perl coding for some years now and forgot a lot so bear
with me ...


You are still expected to check the Perl FAQ *before* posting.

a lot of dublicate mail addresses. I only need each
emailaddress once


The answer is easy to find once you spell the search term correctly:

perldoc -q duplicate

How can I remove duplicate elements from a list or array?
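
One idiom commonly shown for this question filters the list with grep and a %seen hash, which also keeps the first-seen order. The sketch below is an illustration of that idiom, not a quote of the FAQ text; uniq_in_order is a hypothetical helper name.

```perl
use strict;
use warnings;

# Remove duplicates while keeping the order of first appearance.
# uniq_in_order is a hypothetical helper name.
sub uniq_in_order {
    my %seen;
    # $seen{$_}++ is false (0) the first time a value appears, true afterwards,
    # so grep lets each value through exactly once.
    return grep { !$seen{$_}++ } @_;
}

my @unique = uniq_in_order(qw(b a b c a));
print "@unique\n";   # prints "b a c"
```

Newer versions of the core List::Util module also export a uniq function that does the same job, if your Perl is recent enough to have it.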
 
Jürgen Exner

Marc Eggenberger wrote:
[...]
substr to get an emailadress of the line. Now the textfile has some
10k lines and a lot of dublicate mail addresses. I only need each
emailaddress once ..

See the very last sentence in "perldoc -q duplicate"

jue
 
Fabian Pilkowski

* Tad McClellan said:
What would index() return if

$address = 'domain.ch';

??

Sure, index() returns the position where the string 'domain.ch' starts,
or -1 if it is not found. So when the address is 'domain.ch' it returns 0. But
be aware that 'domain.ch' is not a valid mail address, so I see no
need to add it to a hash containing mail addresses. Hence, I suggest
testing whether index() returns *two or more*. That is the minimum a mail
address must have in front of the domain part: at least one character
for the local part, followed by »@«.

Sure, mail addresses like '(e-mail address removed)' will bypass this
*filter*, but index() is not made for complicated things ;-)
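
To make the return values concrete, here is a small sketch. The helper name looks_like_domain_address and the candidate strings are made up for illustration; the only real call is the built-in index().

```perl
use strict;
use warnings;

# Accept a string only if 'domain.ch' starts at offset 2 or later, leaving
# room for at least one local-part character plus '@'.
# looks_like_domain_address is a hypothetical helper name.
sub looks_like_domain_address {
    my ($address) = @_;
    return index( $address, 'domain.ch' ) >= 2 ? 1 : 0;
}

# Made-up candidates covering offsets 0, 1, 2 and "not found".
for my $candidate ( 'domain.ch', '@domain.ch', 'a@domain.ch', 'a@example.org' ) {
    printf "%-14s index=%2d accepted=%d\n",
        $candidate,
        index( $candidate, 'domain.ch' ),
        looks_like_domain_address($candidate);
}
```

index() yields 0, 1, 2 and -1 for these four strings, so only 'a@domain.ch' passes. A test of > 0 would also let '@domain.ch' through.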

regards,
fabian
 
Fabian Pilkowski

* Tad McClellan said:
So the test above should be >= rather than >

Or

if ( index($address,'domain.ch') >= 2 )

as I mentioned in my previous posting.

regards,
fabian
 
