Remove duplicate lines from array - Yes I checked before posting

Discussion in 'Perl Misc' started by phillyfan, Sep 9, 2005.

  1. phillyfan

    phillyfan Guest

    I have a .csv file I have pulled into an array. I have searched for a
    way to remove duplicate lines from the array. I have used a couple of
    different coding techniques, but because they use the hash-key
    technique I end up removing lines I need. Here is a sample of my file:
    the fields are Classcode, start time, end time, building number, days
    of week, class title, professor id, and professor name. They are comma
    delimited in the .csv file.

    ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
    ACCT2101TS1 920 1030 172 222 MWF Accounting
    I 901063085 Arnold Schneider
    ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
    ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
    ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
    ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
    ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
    ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
    ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
    ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
    ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
    ACCT2102TS1 1040 1150 172 222 MWF Accounting
    II 901063085 Arnold Schneider
    ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn

    If I use:

    #! /perl/bin/perl
    use strict;
    use warnings;
    $| = 1;


    my @bannerfile = ();
    open(INTO, 'data-banner.csv')
        or die "Can't open data-banner.csv for reading: $!\n";
    chomp(@bannerfile = <INTO>);
    close(INTO) or die "Can't close data-banner.csv: $!\n";

    my %seen = ();
    my $item;


    my @uniq = @bannerfile;
    @uniq = do { my %seen; grep !$seen{$_}++, @uniq };

    or

    foreach $item (@bannerfile) {
        push(@uniq, $item) unless exists $seen{$item};
    }

    What happens, as I am sure you already know, is that because the same
    classcode is found, the line is removed regardless of whether the
    information after it is different. My goal is to strip the duplicate
    records that exist from the file. Example:
    ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
    shows up twice just keep one instance of this record and also be able
    to keep
    ACCT2101TS1 920 1030 172 222 MWF Accounting
    I 901063085 Arnold Schneider
    because it is a different record.
    I hope I have made sense of what I am trying to achieve. Thank you
    for your help and tutelage.
     
    phillyfan, Sep 9, 2005
    #1

  2. John Bokma

    John Bokma Guest

    "phillyfan" <> wrote:

    > What happens, as I am sure you already know, is that because the same
    > classcode is found, the line is removed regardless of whether the
    > information after it is different. My goal is to strip the duplicate


    Did you really test your code? Your check is *line* based, i.e.

    A12 foo
    A12 bar

    are two different lines, so both will be kept.

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $filename = 'data-banner.csv';

    open my $fh, '<', $filename
        or die "Can't open '$filename' for reading: $!";

    my %check;
    my @lines;

    while ( my $line = <$fh> ) {

    exists $check{ $line } and next;

    $check{ $line } = 1;
    push @lines, $line; # keep original order
    }

    close $fh or die "Can't close '$filename' after reading: $!";

    print @lines;

    (untested)


    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
     
    John Bokma, Sep 9, 2005
    #2

  3. phillyfan

    phillyfan Guest

    Yes, I did check the code, but I did not do a thorough check of my
    results; all three variations of the code appeared to work. A sort
    helped me see the error of my ways. I thank you for waking me up.
     
    phillyfan, Sep 9, 2005
    #3
  4. Axel

    Guest

    phillyfan <> wrote:
    > I have a .csv file I have pulled into an array. I have searched for a
    > way to remove duplicate lines from the array. I have used a couple of
    > different coding techniques, but because they use the hash-key
    > technique I end up removing lines I need. Here is a sample of my file:
    > the fields are Classcode, start time, end time, building number, days
    > of week, class title, professor id, and professor name. They are comma
    > delimited in the .csv file.
    >
    > ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
    > ACCT2101TS1 920 1030 172 222 MWF Accounting
    > I 901063085 Arnold Schneider
    > ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
    > ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
    > ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
    > ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
    > ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
    > ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
    > ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
    > ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
    > ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
    > ACCT2102TS1 1040 1150 172 222 MWF Accounting
    > II 901063085 Arnold Schneider
    > ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn


    > If I use:


    > #! /perl/bin/perl
    > use strict;
    > use warnings;
    > $| = 1;



    > my @bannerfile = ();
    > open(INTO, 'data-banner.csv') or die "Can't open data-banner.csv for
    > reading: $!\n";
    > chomp(@bannerfile = <INTO>);
    > close(INTO) or die "Can't close data-banner.csv: $!\n";


    It would be better to read in the data line by line, for scalability.

    > my %seen = ();
    > my $item;


    $item should only be introduced when it is actually needed.

    > [snip]


    > or
    >
    > foreach $item(@bannerfile){
    > push(@uniq, $item) unless exists $seen{$item};}


    It will never be 'seen'... as you never mark it that way.

    foreach my $item (@bannerfile) {
        push(@uniq, $item) unless exists $seen{$item};
        print "Yes\n" if $seen{$item};     # Diagnostic so you can see what happened
        print "No\n"  if ! $seen{$item};   # Remove these after testing
        $seen{$item} = 1;
    }
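    A self-contained sketch of that corrected loop, with a few made-up
    sample records inlined so it can be run on its own:

    ```perl
    use strict;
    use warnings;

    # Two distinct records plus an exact duplicate of the first.
    my @bannerfile = (
        'A12 foo',
        'A12 bar',
        'A12 foo',
    );

    my ( %seen, @uniq );
    foreach my $item (@bannerfile) {
        push @uniq, $item unless exists $seen{$item};
        $seen{$item} = 1;    # mark the whole line, so later exact copies are skipped
    }

    print "$_\n" for @uniq;    # prints "A12 foo" then "A12 bar"
    ```

    Only exact duplicate lines are dropped; 'A12 foo' and 'A12 bar' both
    survive because the whole line, not just the first field, is the hash
    key.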

    > What happens, as I am sure you already know, is that because the same
    > classcode is found, the line is removed regardless of whether the
    > information after it is different. My goal is to strip the duplicate


    No, that is not what happened at all.

    Axel
     
    , Sep 9, 2005
    #4
  5. Joe Smith

    Joe Smith Guest

    phillyfan wrote:
    > my %seen = ();
    > push(@uniq, $item) unless exists $seen{$item};


    push(@uniq, $item) unless $seen{$item}++;
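    Joe's one-liner works because post-increment returns the *old* count:
    0 (false) the first time a line is seen, true on every repeat. A
    minimal runnable sketch with made-up sample lines:

    ```perl
    use strict;
    use warnings;

    my @lines = ( 'alpha', 'beta', 'alpha', 'alpha', 'gamma' );

    my ( %seen, @uniq );
    for my $item (@lines) {
        # $seen{$item}++ yields the previous count: 0 only on first sight
        push @uniq, $item unless $seen{$item}++;
    }

    print scalar(@uniq), "\n";    # prints 3
    ```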

    -Joe
     
    Joe Smith, Sep 10, 2005
    #5
  6. William James

    William James Guest

    phillyfan wrote:
    > I have a .csv file I have pulled into an array. I have searched for a
    > way to remove duplicate lines from the array. I have used a couple of
    > different coding techniques, but because they use the hash-key
    > technique I end up removing lines I need. Here is a sample of my file:
    > the fields are Classcode, start time, end time, building number, days
    > of week, class title, professor id, and professor name. They are comma
    > delimited in the .csv file.
    >
    > ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
    > ACCT2101TS1 920 1030 172 222 MWF Accounting
    > I 901063085 Arnold Schneider
    > ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
    > ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
    > ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
    > ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
    > ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
    > ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
    > ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
    > ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
    > ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
    > ACCT2102TS1 1040 1150 172 222 MWF Accounting
    > II 901063085 Arnold Schneider
    > ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn


    In Ruby:

    array = DATA.read.split("\n")
    puts array.size
    puts array.uniq.size
    puts array.uniq

    __END__
    ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
    ACCT2101TS1 920 1030 172 222 MWF Accounting
    I 901063085 Arnold Schneider
    ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
    ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
    ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
    ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
    ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
    ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
    ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
    ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
    ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
    ACCT2102TS1 1040 1150 172 222 MWF Accounting
    II 901063085 Arnold Schneider
    ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn

    Output:

    15
    11
    ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
    ACCT2101TS1 920 1030 172 222 MWF Accounting
    I 901063085 Arnold Schneider
    ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
    ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
    ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
    ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
    ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
    ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
    ACCT2102TS1 1040 1150 172 222 MWF Accounting
    II 901063085 Arnold Schneider
     
    William James, Sep 11, 2005
    #6
