Help: Duplicate and Unique Lines Problem

Discussion in 'Perl Misc' started by Amy Lee, Sep 29, 2008.

  1. Amy Lee

    Hello,

    Does Perl have functions like the UNIX commands sort and uniq that can
    output duplicate lines and unique lines?

    Here's my code. When I run it, it outputs many lines, but I just want
    to keep each duplicated line once, along with the unique lines.

    while (<>)
    {
        if (/^>/)                              # header lines start with '>'
        {
            s/>//g;                            # strip the '>' markers
            if (/\w+\s\w+\s(.*)\smiR.*\s\w+/)  # capture the species field
            {
                print "$1\n";
            }
        }
    }

    The output is like this:

    .......
    Homo sapiens
    Homo sapiens
    Homo sapiens
    Homo sapiens
    Homo sapiens
    Homo sapiens
    Homo sapiens
    Caenorhabditis elegans
    Caenorhabditis elegans
    Caenorhabditis elegans
    Caenorhabditis elegans
    Mus musculus
    Mus musculus
    Mus musculus
    Mus musculus
    Mus musculus
    Mus musculus
    Mus musculus
    Arabidopsis thaliana
    .........

    And my intended output is like this:
    ........
    Homo sapiens
    Caenorhabditis elegans
    Mus musculus
    Arabidopsis thaliana
    ........

    Thank you very much~

    Best Regards,

    Amy Lee
    Amy Lee, Sep 29, 2008
    #1

  2. Peter Makholm

    Amy Lee <> writes:

    > Does Perl have functions like the UNIX commands sort and uniq that can
    > output duplicate lines and unique lines?


    There is a uniq function in the List::MoreUtils module; otherwise the
    standard way is to use the printed strings as keys in a hash to mark
    which lines are already printed.
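
    A minimal sketch of the hash idiom described above, assuming the lines
    arrive on STDIN:

    use strict;
    use warnings;

    my %seen;                               # keys are lines already printed
    while (my $line = <STDIN>) {
        print $line unless $seen{$line}++;  # ++ runs after the test, so the
                                            # first occurrence still prints
    }

    With List::MoreUtils the same effect is:

    use List::MoreUtils qw(uniq);
    print uniq <STDIN>;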

    //Makholm
    Peter Makholm, Sep 29, 2008
    #2

  3. Amy Lee

    On Mon, 29 Sep 2008 14:17:16 +0100, bugbear wrote:

    > Amy Lee wrote:
    >> Hello,
    >>
    >> Does Perl have functions like the UNIX commands sort and uniq that can
    >> output duplicate lines and unique lines?
    >>
    >> Here's my code. When I run it, it outputs many lines, but I just want
    >> to keep each duplicated line once, along with the unique lines.
    >>
    >> while (<>)
    >> {
    >> if (/^\>.*/)
    >> {
    >> s/\>//g;
    >> if (/\w+\s\w+\s(.*)\smiR.*\s\w+/g)
    >> {
    >> print "$1\n";
    >> }
    >> }
    >> }

    >
    > If you're running on *NIX, just pipe your script to sort/uniq and you're done.
    >
    > BugBear

    Thank you. But I hope to make it more convenient, so that I can put the
    code into another Perl script.

    Regards,

    Amy Lee
    Amy Lee, Sep 29, 2008
    #3
  4. Amy Lee

    On Mon, 29 Sep 2008 15:28:51 +0200, Peter Makholm wrote:

    > Amy Lee <> writes:
    >
    >> Does Perl have functions like the UNIX commands sort and uniq that can
    >> output duplicate lines and unique lines?

    >
    > There is a uniq function in the List::MoreUtils module; otherwise the
    > standard way is to use the printed strings as keys in a hash to mark
    > which lines are already printed.
    >
    > //Makholm

    Hello,

    I used the List::MoreUtils module to process it, but it still fails and
    outputs just the last line; here's my code.

    use List::MoreUtils qw(any all none notall true false firstidx first_index
                           lastidx last_index insert_after insert_after_string
                           apply after after_incl before before_incl indexes
                           firstval first_value lastval last_value each_array
                           each_arrayref pairwise natatime mesh zip uniq minmax);

    $file = $ARGV[0];
    open FILE, '<', "$file";
    while (<FILE>)
    {
        @raw_list = split /\n/, $_;
    }
    @list = uniq @raw_list;
    foreach $single (@list)
    {
        print "$single\n";
    }

    Thank you very much.

    Regards,

    Amy
    Amy Lee, Sep 29, 2008
    #4
  5. Bart Lateur

    Amy Lee wrote:

    > Does Perl have functions like the UNIX commands sort and uniq that can
    > output duplicate lines and unique lines?

    Perl has a built-in sort, and unique can be implemented with a few lines
    of code. They're even in the official FAQ:

    perlfaq4: How can I remove duplicate elements from a list or array?

    http://perldoc.perl.org/perlfaq4.html#How-can-I-remove-duplicate-elements-from-a-list-or-array?

    Bart Lateur, Sep 29, 2008
    #5
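
    The FAQ entry above boils down to a hash inside grep; a minimal sketch
    (sample data invented for illustration):

    use strict;
    use warnings;

    my @list = ('Homo sapiens', 'Homo sapiens', 'Mus musculus');

    my %seen;
    my @unique = grep { !$seen{$_}++ } @list;  # keep first occurrences only

    print "$_\n" for @unique;                  # Homo sapiens, Mus musculus
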
  6. Amy Lee

    On Mon, 29 Sep 2008 16:54:15 +0200, Bart Lateur wrote:

    > Amy Lee wrote:
    >
    >>Does Perl have functions like the UNIX commands sort and uniq that can
    >>output duplicate lines and unique lines?

    >
    > Perl has a built-in sort, and unique can be implemented with a few lines
    > of code. They're even in the official FAQ:
    >
    > perlfaq4: How can I remove duplicate elements from a list or
    > array?
    >
    > http://perldoc.perl.org/perlfaq4.html#How-can-I-remove-duplicate-elements-from-a-list-or-array?

    Thanks, but my problem seems a little strange, because I don't know
    whether the uniq function can handle a list such as @list. When I use
    uniq to process it, I see just the last line of the file.

    Amy
    Amy Lee, Sep 29, 2008
    #6
  7. Amy Lee

    On Mon, 29 Sep 2008 16:54:15 +0200, Bart Lateur wrote:

    > Amy Lee wrote:
    >
    >>Does Perl have functions like the UNIX commands sort and uniq that can
    >>output duplicate lines and unique lines?

    >
    > Perl has a built-in sort, and unique can be implemented with a few lines
    > of code. They're even in the official FAQ:
    >
    > perlfaq4: How can I remove duplicate elements from a list or
    > array?
    >
    > http://perldoc.perl.org/perlfaq4.html#How-can-I-remove-duplicate-elements-from-a-list-or-array?

    Here's the code:

    open FILE, '<', "$file";
    while (<FILE>)
    {
        @raw_list = split /\n/, $_;
        @list = uniq (@raw_list);
        print "@list\n";
    }

    It seems that uniq does nothing! I don't know the reason.

    Amy
    Amy Lee, Sep 29, 2008
    #7
  8. Ben Morrow

    Quoth Amy Lee <>:
    > On Mon, 29 Sep 2008 15:28:51 +0200, Peter Makholm wrote:
    >
    > > Amy Lee <> writes:
    > >
    > >> Does Perl have functions like the UNIX commands sort and uniq that
    > >> can output duplicate lines and unique lines?

    > >
    > > There is a uniq function in the List::MoreUtils module; otherwise the
    > > standard way is to use the printed strings as keys in a hash to mark
    > > which lines are already printed.

    >
    > I used the List::MoreUtils module to process it, but it still fails
    > and outputs just the last line; here's my code.
    >
    > use List::MoreUtils qw(any all none notall true false firstidx first_index
    > lastidx last_index insert_after insert_after_string
    > apply after after_incl before before_incl indexes
    > firstval first_value lastval last_value each_array
    > each_arrayref pairwise natatime mesh zip uniq
    > minmax);


    Don't import more than you need.

    use List::MoreUtils qw(uniq);

    > $file = $ARGV[0];


    Your script should start with

    use warnings;
    use strict;

    which will mean you need 'my' on all your variables:

    my $file = $ARGV[0];

    > open FILE, '<', "$file";


    Use lexical filehandles.
    Always check the return value of open.
    Don't quote things when you don't need to.

    open my $FILE, '<', $file
        or die "can't read '$file': $!";

    > while (<FILE>)
    > {
    > @raw_list = split /\n/, $_;


    while (<FILE>) reads the file one line at a time. You then split that
    line on /\n/ (which won't do anything except remove the trailing
    newline, since it's just a single line) and replace the contents of
    @raw_list with the result. This means @raw_list never has more than one
    element (the last line read).

    Since you want to keep all the lines, either push them onto the array:

    while (<$FILE>) {
        chomp;               # remove the newline
        push @raw_list, $_;
    }

    or, better, use <> in list context, which returns all the lines:

    my @raw_list = <$FILE>;
    chomp @raw_list; # remove all the newlines at once

    > }
    > @list = uniq @raw_list;
    > foreach $single (@list)
    > {
    > print "$single\n";
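
    Putting these corrections together, the whole script might read as the
    sketch below (the file name still comes from the first argument):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::MoreUtils qw(uniq);

    my $file = $ARGV[0];
    open my $FILE, '<', $file
        or die "can't read '$file': $!";

    my @raw_list = <$FILE>;   # all lines at once
    chomp @raw_list;          # strip every newline

    for my $single (uniq @raw_list) {
        print "$single\n";
    }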


    Ben

    --
    Outside of a dog, a book is a man's best friend.
    Inside of a dog, it's too dark to read.
    Groucho Marx
    Ben Morrow, Sep 29, 2008
    #8
  9. Amy Lee

    On Mon, 29 Sep 2008 16:29:26 +0100, Ben Morrow wrote:

    >
    > Quoth Amy Lee <>:
    >> On Mon, 29 Sep 2008 15:28:51 +0200, Peter Makholm wrote:
    >>
    >> > Amy Lee <> writes:
    >> >
    >> >> Does Perl have functions like the UNIX commands sort and uniq that
    >> >> can output duplicate lines and unique lines?
    >> >
    >> > There is a uniq function in the List::MoreUtils module; otherwise the
    >> > standard way is to use the printed strings as keys in a hash to mark
    >> > which lines are already printed.

    >>
    >> I used the List::MoreUtils module to process it, but it still fails
    >> and outputs just the last line; here's my code.
    >>
    >> use List::MoreUtils qw(any all none notall true false firstidx first_index
    >> lastidx last_index insert_after insert_after_string
    >> apply after after_incl before before_incl indexes
    >> firstval first_value lastval last_value each_array
    >> each_arrayref pairwise natatime mesh zip uniq
    >> minmax);

    >
    > Don't import more than you need.
    >
    > use List::MoreUtils qw(uniq);
    >
    >> $file = $ARGV[0];

    >
    > Your script should start with
    >
    > use warnings;
    > use strict;
    >
    > which will mean you need 'my' on all your variables
    >
    > my $file = $ARGV[0];
    >
    >> open FILE, '<', "$file";

    >
    > Use lexical filehandles.
    > Always check the return value of open.
    > Don't quote things when you don't need to.
    >
    > open my $FILE, '<', $file
    > or die "can't read '$file': $!";
    >
    >> while (<FILE>)
    >> {
    >> @raw_list = split /\n/, $_;

    >
    > while (<FILE>) reads the file one line at a time. You then split that
    > line on /\n/ (which won't do anything except remove the trailing
    > newline, since it's just a single line) and replace the contents of
    > @raw_list with the result. This means @raw_list never has more than one
    > element (the last line read).
    >
    > Since you want to keep all the lines, either push them onto the array:
    >
    > while (<$FILE>) {
    > chomp; # remove the newline
    > push @raw_list, $_;
    > }
    >
    > or, better, use <> in list context, which returns all the lines:
    >
    > my @raw_list = <$FILE>;
    > chomp @raw_list; # remove all the newlines at once
    >
    >> }
    >> @list = uniq @raw_list;
    >> foreach $single (@list)
    >> {
    >> print "$single\n";

    >
    > Ben

    Thank you very much. I have solved it using your method.

    Best Regards,

    Amy
    Amy Lee, Sep 29, 2008
    #9
  10. RedGrittyBrick

    Amy Lee wrote:
    > Hello,
    >
    > Does Perl have functions like the UNIX commands sort and uniq that can
    > output duplicate lines and unique lines?
    >
    > Here's my code. When I run it, it outputs many lines, but I just want
    > to keep each duplicated line once, along with the unique lines.
    >


    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen;
    for (sort <DATA>) {
        chomp;
        if (/(\w+\s+\w+\s+)/) {
            print "$1\n" unless $seen{$1}++;
        }
    }


    __END__
    Homo sapiens E
    Homo sapiens D
    Arabidopsis thaliana S
    Homo sapiens G
    Mus musculus P
    Mus musculus Q
    Mus musculus R
    Homo sapiens F
    Caenorhabditis elegans H
    Caenorhabditis elegans I
    Homo sapiens A
    Homo sapiens B
    Homo sapiens C
    Caenorhabditis elegans J
    Mus musculus L
    Mus musculus O
    Mus musculus M
    Mus musculus N
    Caenorhabditis elegans K

    --
    RGB
    RedGrittyBrick, Sep 29, 2008
    #10
  11. RedGrittyBrick

    RedGrittyBrick wrote:
    >
    > Amy Lee wrote:
    >> Hello,
    >>
    >> Does Perl have functions like the UNIX commands sort and uniq that can
    >> output duplicate lines and unique lines?
    >>
    >> Here's my code. When I run it, it outputs many lines, but I just want
    >> to keep each duplicated line once, along with the unique lines.
    >>

    >
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    >
    > my %seen;
    > for(sort <DATA>) {
    > chomp;
    > if (/(\w+\s+\w+\s+)/) {
    > print "$1\n" unless $seen{$1}++;
    > }
    > }
    >


    P.S. For large amounts of data I'd prefer

    #!/usr/bin/perl
    use strict;
    use warnings;
    my %seen;
    my @uniq;
    for (<DATA>) {
        chomp;
        if (/(\w+\s+\w+\s+)/) {
            push @uniq, "$1\n" unless $seen{$1}++;
        }
    }
    print sort @uniq;


    __END__
    Homo sapiens E
    Homo sapiens D
    Arabidopsis thaliana S
    Homo sapiens G
    Mus musculus P
    Mus musculus Q
    Mus musculus R
    Homo sapiens F
    Caenorhabditis elegans H
    Caenorhabditis elegans I
    Homo sapiens A
    Homo sapiens B
    Homo sapiens C
    Caenorhabditis elegans J
    Mus musculus L
    Mus musculus O
    Mus musculus M
    Mus musculus N
    Caenorhabditis elegans K


    --
    RGB
    RedGrittyBrick, Sep 29, 2008
    #11
  12. RedGrittyBrick

    RedGrittyBrick wrote:
    >
    > RedGrittyBrick wrote:
    >>
    >> Amy Lee wrote:
    >>> Hello,
    >>>
    >>> Does Perl have functions like the UNIX commands sort and uniq that
    >>> can output duplicate lines and unique lines?
    >>>
    >>> Here's my code. When I run it, it outputs many lines, but I just want
    >>> to keep each duplicated line once, along with the unique lines.
    >>>

    >>
    >> #!/usr/bin/perl
    >> use strict;
    >> use warnings;
    >>
    >> my %seen;
    >> for(sort <DATA>) {
    >> chomp;
    >> if (/(\w+\s+\w+\s+)/) {
    >> print "$1\n" unless $seen{$1}++;
    >> }
    >> }
    >>

    >
    > P.S. For large amounts of data I'd prefer
    >
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    > my %seen;
    > my @uniq;
    > for(<DATA>) {
    > chomp;
    > if (/(\w+\s+\w+\s+)/) {
    > push @uniq, "$1\n" unless $seen{$1}++;
    > }
    > }
    > print sort @uniq;



    #!/usr/bin/perl
    use strict;
    use warnings;
    my %seen;
    for (<DATA>) {
        chomp;
        if (/(\w+\s+\w+\s+)/) {
            $seen{"$1\n"}++;
        }
    }
    print sort keys %seen;


    This is a deep hole I've dug myself into :)


    --
    RGB
    RedGrittyBrick, Sep 29, 2008
    #12
  13. Amy Lee

    On Mon, 29 Sep 2008 16:59:28 +0100, RedGrittyBrick wrote:

    >
    > RedGrittyBrick wrote:
    >>
    >> Amy Lee wrote:
    >>> Hello,
    >>>
    >>> Does Perl have functions like the UNIX commands sort and uniq that
    >>> can output duplicate lines and unique lines?
    >>>
    >>> Here's my code. When I run it, it outputs many lines, but I just want
    >>> to keep each duplicated line once, along with the unique lines.
    >>>

    >>
    >> #!/usr/bin/perl
    >> use strict;
    >> use warnings;
    >>
    >> my %seen;
    >> for(sort <DATA>) {
    >> chomp;
    >> if (/(\w+\s+\w+\s+)/) {
    >> print "$1\n" unless $seen{$1}++;
    >> }
    >> }
    >>

    >
    > P.S. For large amounts of data I'd prefer
    >
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    > my %seen;
    > my @uniq;
    > for(<DATA>) {
    > chomp;
    > if (/(\w+\s+\w+\s+)/) {
    > push @uniq, "$1\n" unless $seen{$1}++;
    > }
    > }
    > print sort @uniq;
    >
    >
    > __END__
    > Homo sapiens E
    > Homo sapiens D
    > Arabidopsis thaliana S
    > Homo sapiens G
    > Mus musculus P
    > Mus musculus Q
    > Mus musculus R
    > Homo sapiens F
    > Caenorhabditis elegans H
    > Caenorhabditis elegans I
    > Homo sapiens A
    > Homo sapiens B
    > Homo sapiens C
    > Caenorhabditis elegans J
    > Mus musculus L
    > Mus musculus O
    > Mus musculus M
    > Mus musculus N
    > Caenorhabditis elegans K

    Thank you very much!

    Regards,

    Amy
    Amy Lee, Sep 29, 2008
    #13
  14. Jürgen Exner

    Amy Lee <> wrote:
    >Does Perl have functions like the UNIX command sort


    What does 'perldoc -f sort' tell you?

    >and uniq that can output
    >duplicate lines and unique lines?


    Did you check the FAQ? Please see 'perldoc -q duplicate'.

    jue
    Jürgen Exner, Sep 29, 2008
    #14
  15. Bart Lateur

    Amy Lee wrote:

    >Here's the code:
    >
    >open FILE, '<', "$file";
    >while (<FILE>)
    >{
    > @raw_list = split /\n/, $_;
    > @list = uniq (@raw_list);
    > print "@list\n";
    >}
    >It seems that uniq does nothing! I don't know the reason.


    You need to slurp the whole file before working on the data. As written,
    you're checking, for every line, whether it duplicates anything in a
    one-element list, which is impossible.

    Using one of the tricks from the FAQ, one can do this:

    open FILE, '<', "$file";
    my %seen;
    while (<FILE>)
    {
        print unless $seen{$_}++;
    }

    The hash %seen remembers the lines seen in the past, too. That's why
    it works across lines.

    --
    Bart.
    Bart Lateur, Sep 29, 2008
    #15
  16. Tim Greer

    Amy Lee wrote:

    > On Mon, 29 Sep 2008 15:28:51 +0200, Peter Makholm wrote:
    >
    >> Amy Lee <> writes:
    >>
    >>> Does Perl have functions like the UNIX commands sort and uniq that
    >>> can output duplicate lines and unique lines?

    >>
    >> There is a uniq function in the List::MoreUtils module; otherwise the
    >> standard way is to use the printed strings as keys in a hash to mark
    >> which lines are already printed.
    >>
    >> //Makholm

    > Hello,
    >
    > I used the List::MoreUtils module to process it, but it still fails
    > and outputs just the last line; here's my code.
    >
    > use List::MoreUtils qw(any all none notall true false firstidx
    > first_index
    > lastidx last_index insert_after
    > insert_after_string apply after after_incl
    > before before_incl indexes firstval
    > first_value lastval last_value each_array
    > each_arrayref pairwise natatime mesh zip
    > uniq minmax);
    >
    > $file = $ARGV[0];
    > open FILE, '<', "$file";
    > while (<FILE>)
    > {
    > @raw_list = split /\n/, $_;
    > }
    > @list = uniq @raw_list;
    > foreach $single (@list)
    > {
    > print "$single\n";
    > }
    >
    > Thank you very much.
    >
    > Regards,
    >
    > Amy


    Why read it into an array, just to break it down again? Per line, use
    hashes and check to see if it's been 'seen' yet.
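
    A minimal sketch of that line-by-line approach, using the diamond
    operator so the file names come straight from the command line:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen;
    while (<>) {                   # reads each file named in @ARGV
        print unless $seen{$_}++;  # print only a line's first occurrence
    }
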
    --
    Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
    Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
    and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
    Industry's most experienced staff! -- Web Hosting With Muscle!
    Tim Greer, Sep 29, 2008
    #16
  17. Martien Verbruggen

    On Mon, 29 Sep 2008 16:59:28 +0100, RedGrittyBrick <> wrote:
    >
    > RedGrittyBrick wrote:


    > P.S. For large amounts of data I'd prefer
    >
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    > my %seen;
    > my @uniq;
    > for(<DATA>) {
    > chomp;
    > if (/(\w+\s+\w+\s+)/) {
    > push @uniq, "$1\n" unless $seen{$1}++;
    > }
    > }
    > print sort @uniq;


    Wouldn't it be better to use while(<DATA>){} (or one of the equivalent
    forms listed in perlop), as for() builds a list? Or is this no longer
    the case?
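
    The while-based form would look roughly like this, handling one line at
    a time instead of building the whole list first:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen;
    my @uniq;
    while (my $line = <DATA>) {    # one line at a time; no full list
        chomp $line;
        if ($line =~ /(\w+\s+\w+\s+)/) {
            push @uniq, "$1\n" unless $seen{$1}++;
        }
    }
    print sort @uniq;

    __END__
    Homo sapiens E
    Homo sapiens D
    Mus musculus P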

    Martien

    PS. I couldn't find anything in the delta documents, since 5.6, about
    foreach having changed this behaviour, but then, there's a lot of
    documentation, and I could easily have missed something.
    --
    Martien Verbruggen | Quick! Hire a teenager while they still know
                       | everything.
    Martien Verbruggen, Sep 29, 2008
    #17
