Remove duplicate lines from array - Yes I checked before posting

phillyfan

I have a .csv file that I have pulled into an array. I have searched for a
way to remove duplicate lines from the array. I have used a couple of
different coding techniques, but because they use the hash-key-value
technique, I end up removing lines I need. Here is a sample of my file.
The fields are Classcode, start time, end time, building number, days
of week, class title, professor id, and professor name. They are
comma-delimited in the .csv file.

ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
ACCT2102TS1 1040 1150 172 222 MWF Accounting
II 901063085 Arnold Schneider
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn

If I use:

#! /perl/bin/perl
use strict;
use warnings;
$| = 1;


my @bannerfile = ();
open(INTO, 'data-banner.csv') or die "Can't open data-banner.csv for reading: $!\n";
chomp(@bannerfile = <INTO>);
close(INTO) or die "Can't close data-banner.csv: $!\n";

my %seen = ();
my $item;


my @uniq = @bannerfile;
@uniq = do { my %seen; grep !$seen{$_}++, @uniq };

or

foreach $item (@bannerfile) {
    push(@uniq, $item) unless exists $seen{$item};
}

What happens, as I am sure you already know, is that when the same classcode
is found, the line is removed regardless of whether the information after it is
different. My goal is to strip off the duplicate records that exist in the
file. Example:
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
shows up twice, so keep just one instance of that record, while still being able
to keep
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
because it is a different record.
Hopefully I have made sense in what I am trying to achieve. Thank you
for your help and tutelage.
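
(For reference: a minimal, illustrative sketch of the distinction being described,
not the original script. The three sample lines are copied from the posting above,
with the wrapped record rejoined onto one line. Keying the hash on the whole line
only drops the exact duplicate; keying it on the classcode alone would also drop
the second, different ACCT2101TS1 section.)

#!/usr/bin/perl
use strict;
use warnings;

# Three records from the sample above (the wrapped one rejoined onto one line).
my @records = (
    'ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely',
    'ACCT2101TS1 920 1030 172 222 MWF Accounting I 901063085 Arnold Schneider',
    'ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely',
);

# Keyed on the whole line: only the exact duplicate is dropped.
my %seen_line;
my @by_line = grep { !$seen_line{$_}++ } @records;
print scalar @by_line, "\n";    # 2 -- both ACCT2101TS1 sections survive

# Keyed on the classcode (the first field) only: distinct sections collapse too.
my %seen_code;
my @by_code = grep { !$seen_code{ (split ' ')[0] }++ } @records;
print scalar @by_code, "\n";    # 1 -- the Arnold Schneider section is lost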
 

John Bokma

phillyfan said:
What happens, as I am sure you already know, is that when the same classcode
is found, the line is removed regardless of whether the information after it is
different. My goal is to strip off the duplicate records that exist

Did you really test your code? Since it's *line* based, i.e.

A12 foo
A12 bar

will be seen as two *different* lines, not as duplicates of each other.

#!/usr/bin/perl

use strict;
use warnings;

my $filename = 'data-banner.csv';

open my $fh, $filename
    or die "Can't open '$filename' for reading: $!";

my %check;
my @lines;

while ( my $line = <$fh> ) {

    exists $check{ $line } and next;

    $check{ $line } = 1;
    push @lines, $line;    # keep original order
}

close $fh or die "Can't close '$filename' after reading: $!";

print @lines;

(untested)
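
Since the file is described as comma-delimited, one hedged variation on the same
idea (an untested sketch, assuming the CPAN module Text::CSV is installed) is to
key the %check hash on the parsed, trimmed fields rather than on the raw line, so
that records differing only in quoting or stray whitespace still count as
duplicates:

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $filename = 'data-banner.csv';

my $csv = Text::CSV->new( { binary => 1 } )
    or die "Can't create Text::CSV object: " . Text::CSV->error_diag;

open my $fh, '<', $filename
    or die "Can't open '$filename' for reading: $!";

my %check;
while ( my $fields = $csv->getline($fh) ) {
    # Trim surrounding whitespace so 'Accounting I ' and 'Accounting I' match.
    my @trimmed = map { my $f = $_; $f =~ s/^\s+|\s+$//g; $f } @$fields;
    my $key = join "\0", @trimmed;    # "\0" is unlikely to occur inside a field

    next if $check{$key}++;
    print join( ',', @trimmed ), "\n";    # first occurrence, original order
}

close $fh or die "Can't close '$filename' after reading: $!";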
 

phillyfan

Yes, I did check the code, but I did not do a thorough check of my results;
all three variations of the code worked. A sort helped me see the error of my
ways. I thank you for waking me up.
 

axel

phillyfan said:
I have a .csv file that I have pulled into an array. I have searched for a
way to remove duplicate lines from the array. I have used a couple of
different coding techniques, but because they use the hash-key-value
technique, I end up removing lines I need. Here is a sample of my file.
The fields are Classcode, start time, end time, building number, days
of week, class title, professor id, and professor name. They are
comma-delimited in the .csv file.

ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
ACCT2102TS1 1040 1150 172 222 MWF Accounting
II 901063085 Arnold Schneider
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
If I use:
#! /perl/bin/perl
use strict;
use warnings;
$| = 1;

my @bannerfile = ();
open(INTO, 'data-banner.csv') or die "Can't open data-banner.csv for reading: $!\n";
chomp(@bannerfile = <INTO>);
close(INTO) or die "Can't close data-banner.csv: $!\n";

It would be better to read in the data line by line, for scalability.
my %seen = ();
my $item;

$item should only be introduced when it is actually needed.
or

foreach $item (@bannerfile) {
    push(@uniq, $item) unless exists $seen{$item};
}

It will never be 'seen'... as you never mark it that way.

foreach my $item (@bannerfile) {
    push(@uniq, $item) unless exists $seen{$item};
    print "Yes\n" if $seen{$item};     # Diagnostic so you can see what happened
    print "No\n"  if ! $seen{$item};   # Remove these after testing
    $seen{$item} = 1;
}
What happens, as I am sure you already know, is that when the same classcode
is found, the line is removed regardless of whether the information after it is
different. My goal is to strip off the duplicate records that exist

No, that is not what happened at all.

Axel
 

William James

phillyfan said:
I have a .csv file that I have pulled into an array. I have searched for a
way to remove duplicate lines from the array. I have used a couple of
different coding techniques, but because they use the hash-key-value
technique, I end up removing lines I need. Here is a sample of my file.
The fields are Classcode, start time, end time, building number, days
of week, class title, professor id, and professor name. They are
comma-delimited in the .csv file.

ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
ACCT2102TS1 1040 1150 172 222 MWF Accounting
II 901063085 Arnold Schneider
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn

In Ruby:

array = DATA.read.split("\n")
puts array.size
puts array.uniq.size
puts array.uniq

__END__
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
ACCT2102TS1 1040 1150 172 222 MWF Accounting
II 901063085 Arnold Schneider
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn

Output:

15
11
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
ACCT2102TS1 1040 1150 172 222 MWF Accounting
II 901063085 Arnold Schneider
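
For comparison, Ruby's Array#uniq keeps the first occurrence of each line and
preserves order, which is the same behaviour as the hash-based Perl idiom shown
earlier in the thread:

my %seen;
my @uniq = grep { !$seen{$_}++ } @lines;    # Perl equivalent of array.uniq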
 
