Remove duplicate lines from array - Yes I checked before posting

phillyfan

I have a .csv file that I have pulled into an array. I have searched for a
way to remove duplicate lines from the array. I have used a couple of
different coding techniques, but because they use the hash-key-value
technique, I end up removing lines I need. Here is a sample of my file.
The fields are Classcode, start time, end time, building number, days
of week, class title, professor id, and professor name. They are
comma-delimited in the .csv file.

ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
ACCT2102TS1 1040 1150 172 222 MWF Accounting
II 901063085 Arnold Schneider
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn

If I use:

#! /perl/bin/perl
use strict;
use warnings;
$| = 1;


my @bannerfile = ();
open(INTO, 'data-banner.csv') or die "Can't open data-banner.csv for reading: $!\n";
chomp(@bannerfile = <INTO>);
close(INTO) or die "Can't close data-banner.csv: $!\n";

my %seen = ();
my $item;


my @uniq = @bannerfile;
@uniq = do { my %seen; grep !$seen{$_}++, @uniq };

or

foreach $item (@bannerfile) {
    push(@uniq, $item) unless exists $seen{$item};
}

What happens, as I am sure you already know, is that when the same classcode
is found, the line is removed regardless of whether the information after it is
different. My goal is to strip off the duplicate records that exist in the
file. Example:
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
shows up twice, so keep just one instance of that record, while still being able
to keep
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
because it is a different record.
Hopefully I have made sense in what I am trying to achieve. Thank you
for your help and tutelage.
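
(For reference: a minimal, illustrative sketch of the distinction being described,
not the original script. The three sample lines are copied from the posting above,
with the wrapped record rejoined onto one line. Keying the hash on the whole line
only drops the exact duplicate; keying it on the classcode alone would also drop
the second, different ACCT2101TS1 section.)

#!/usr/bin/perl
use strict;
use warnings;

# Three records from the sample above (the wrapped one rejoined onto one line).
my @records = (
    'ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely',
    'ACCT2101TS1 920 1030 172 222 MWF Accounting I 901063085 Arnold Schneider',
    'ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely',
);

# Keyed on the whole line: only the exact duplicate is dropped.
my %seen_line;
my @by_line = grep { !$seen_line{$_}++ } @records;
print scalar @by_line, "\n";    # 2 -- both ACCT2101TS1 sections survive

# Keyed on the classcode (the first field) only: distinct sections collapse too.
my %seen_code;
my @by_code = grep { !$seen_code{ (split ' ')[0] }++ } @records;
print scalar @by_code, "\n";    # 1 -- the Arnold Schneider section is lost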
 

John Bokma

phillyfan said:
What happens, as I am sure you already know, is that when the same classcode
is found, the line is removed regardless of whether the information after it is
different. My goal is to strip off the duplicate records that exist

Did you really test your code? Since it's *line* based, i.e.

A12 foo
A12 bar

will be seen as two *different* lines, not as duplicates of each other.

#!/usr/bin/perl

use strict;
use warnings;

my $filename = 'data-banner.csv';

open my $fh, $filename
    or die "Can't open '$filename' for reading: $!";

my %check;
my @lines;

while ( my $line = <$fh> ) {

    exists $check{ $line } and next;

    $check{ $line } = 1;
    push @lines, $line;    # keep original order
}

close $fh or die "Can't close '$filename' after reading: $!";

print @lines;

(untested)
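
Since the file is described as comma-delimited, one hedged variation on the same
idea (an untested sketch, assuming the CPAN module Text::CSV is installed) is to
key the %check hash on the parsed, trimmed fields rather than on the raw line, so
that records differing only in quoting or stray whitespace still count as
duplicates:

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $filename = 'data-banner.csv';

my $csv = Text::CSV->new( { binary => 1 } )
    or die "Can't create Text::CSV object: " . Text::CSV->error_diag;

open my $fh, '<', $filename
    or die "Can't open '$filename' for reading: $!";

my %check;
while ( my $fields = $csv->getline($fh) ) {
    # Trim surrounding whitespace so 'Accounting I ' and 'Accounting I' match.
    my @trimmed = map { my $f = $_; $f =~ s/^\s+|\s+$//g; $f } @$fields;
    my $key = join "\0", @trimmed;    # "\0" is unlikely to occur inside a field

    next if $check{$key}++;
    print join( ',', @trimmed ), "\n";    # first occurrence, original order
}

close $fh or die "Can't close '$filename' after reading: $!";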
 

phillyfan

Yes, I did check the code, but I did not do a thorough check of my results;
all three variations of the code worked. A sort helped me see the error of my
ways. I thank you for waking me up.
 

axel

phillyfan said:
I have a .csv file that I have pulled into an array. I have searched for a
way to remove duplicate lines from the array. I have used a couple of
different coding techniques, but because they use the hash-key-value
technique, I end up removing lines I need. Here is a sample of my file.
The fields are Classcode, start time, end time, building number, days
of week, class title, professor id, and professor name. They are
comma-delimited in the .csv file.

ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
ACCT2102TS1 1040 1150 172 222 MWF Accounting
II 901063085 Arnold Schneider
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
If I use:
#! /perl/bin/perl
use strict;
use warnings;
$| = 1;

my @bannerfile = ();
open(INTO, 'data-banner.csv') or die "Can't open data-banner.csv for reading: $!\n";
chomp(@bannerfile = <INTO>);
close(INTO) or die "Can't close data-banner.csv: $!\n";

It would be better to read in the data line by line, for scalability.
my %seen = ();
my $item;

$item should only be introduced when it is actually needed.
or

foreach $item (@bannerfile) {
    push(@uniq, $item) unless exists $seen{$item};
}

It will never be 'seen'... as you never mark it that way.

foreach my $item (@bannerfile) {
    push(@uniq, $item) unless exists $seen{$item};
    print "Yes\n" if $seen{$item};     # Diagnostic so you can see what happened
    print "No\n"  if ! $seen{$item};   # Remove these after testing
    $seen{$item} = 1;
}
What happens, as I am sure you already know, is that when the same classcode
is found, the line is removed regardless of whether the information after it is
different. My goal is to strip off the duplicate records that exist

No, that is not what happened at all.

Axel
 

William James

phillyfan said:
I have a .csv file that I have pulled into an array. I have searched for a
way to remove duplicate lines from the array. I have used a couple of
different coding techniques, but because they use the hash-key-value
technique, I end up removing lines I need. Here is a sample of my file.
The fields are Classcode, start time, end time, building number, days
of week, class title, professor id, and professor name. They are
comma-delimited in the .csv file.

ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
ACCT2102TS1 1040 1150 172 222 MWF Accounting
II 901063085 Arnold Schneider
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn

In Ruby:

array = DATA.read.split("\n")
puts array.size
puts array.uniq.size
puts array.uniq

__END__
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
ACCT2102TS1 1040 1150 172 222 MWF Accounting
II 901063085 Arnold Schneider
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn

Output:

15
11
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
ACCT2102TS1 1040 1150 172 222 MWF Accounting
II 901063085 Arnold Schneider
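
For comparison, Ruby's Array#uniq keeps the first occurrence of each line and
preserves order, which is the same behaviour as the hash-based Perl idiom shown
earlier in the thread:

my %seen;
my @uniq = grep { !$seen{$_}++ } @lines;    # Perl equivalent of array.uniq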
 
