Help: Duplicate and Unique Lines Problem

Amy Lee

Hello,

Does Perl have functions like the UNIX commands sort and uniq that can output
duplicate lines and unique lines?

Here's my code. When I run it, it outputs many lines, but I want each
duplicated line saved only once, along with the unique lines.

while (<>)
{
    if (/^\>.*/)
    {
        s/\>//g;
        if (/\w+\s\w+\s(.*)\smiR.*\s\w+/g)
        {
            print "$1\n";
        }
    }
}

The output is like this:

.......
Homo sapiens
Homo sapiens
Homo sapiens
Homo sapiens
Homo sapiens
Homo sapiens
Homo sapiens
Caenorhabditis elegans
Caenorhabditis elegans
Caenorhabditis elegans
Caenorhabditis elegans
Mus musculus
Mus musculus
Mus musculus
Mus musculus
Mus musculus
Mus musculus
Mus musculus
Arabidopsis thaliana
.........

My intended output is:
........
Homo sapiens
Caenorhabditis elegans
Mus musculus
Arabidopsis thaliana
........

Thank you very much~

Best Regards,

Amy Lee
 
Peter Makholm

Amy Lee said:
Does Perl have functions like the UNIX commands sort and uniq that can output
duplicate lines and unique lines?

There is a uniq function in the List::MoreUtils module; otherwise, the
standard way is to use the printed strings as keys in a hash to mark
which lines have already been printed.

//Makholm
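Makholm's hash-of-seen-keys idiom can be sketched as follows (a minimal illustration, not from the thread; the sample lines in the DATA section are made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Print each line only the first time it is seen, using the
# line itself as a hash key. $seen{$line}++ is false (0) on
# the first encounter and true afterwards.
my %seen;
while (my $line = <DATA>) {
    print $line unless $seen{$line}++;
}

__DATA__
Homo sapiens
Homo sapiens
Mus musculus
Homo sapiens
Mus musculus
```

Unlike piping through sort | uniq, this preserves the order in which each distinct line first appears.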
 
Amy Lee

If you're running on *NIX, just pipe your script to sort/uniq and you're done.

BugBear
Thank you. But I'd like to make it more convenient so I can put the code into
another Perl script.

Regards,

Amy Lee
 
Amy Lee

There is a uniq function in the List::MoreUtils module; otherwise, the
standard way is to use the printed strings as keys in a hash to mark
which lines have already been printed.

//Makholm
Hello,

I used the List::MoreUtils module to process the file, but it still fails and
outputs just the last line. Here's my code.

use List::MoreUtils qw(any all none notall true false firstidx first_index
                       lastidx last_index insert_after insert_after_string
                       apply after after_incl before before_incl indexes
                       firstval first_value lastval last_value each_array
                       each_arrayref pairwise natatime mesh zip uniq minmax);

$file = $ARGV[0];
open FILE, '<', "$file";
while (<FILE>)
{
    @raw_list = split /\n/, $_;
}
@list = uniq @raw_list;
foreach $single (@list)
{
    print "$single\n";
}

Thank you very much.

Regards,

Amy
 
Amy Lee

Perl has a built-in sort, and unique can be implemented with a few lines
of code. They're even in the official FAQ:

perlfaq4: How can I remove duplicate elements from a list or
array?

http://perldoc.perl.org/perlfaq4.html#How-can-I-remove-duplicate-elements-from-a-list-or-array?
Here's the code:

open FILE, '<', "$file";
while (<FILE>)
{
    @raw_list = split /\n/, $_;
    @list = uniq (@raw_list);
    print "@list\n";
}
It seems that uniq does nothing! I don't know why.

Amy
 
Ben Morrow

Quoth Amy Lee:
I used the List::MoreUtils module to process the file, but it still fails and
outputs just the last line. Here's my code.

use List::MoreUtils qw(any all none notall true false firstidx first_index
lastidx last_index insert_after insert_after_string
apply after after_incl before before_incl indexes
firstval first_value lastval last_value each_array
each_arrayref pairwise natatime mesh zip uniq
minmax);

Don't import more than you need.

use List::MoreUtils qw(uniq);
$file = $ARGV[0];

Your script should start with

use warnings;
use strict;

which will mean you need 'my' on all your variables.

my $file = $ARGV[0];
open FILE, '<', "$file";

Use lexical filehandles.
Always check the return value of open.
Don't quote things when you don't need to.

open my $FILE, '<', $file
    or die "can't read '$file': $!";
while (<FILE>)
{
    @raw_list = split /\n/, $_;

while (<FILE>) reads the file one line at a time. You then split that
line on /\n/ (which won't do anything except remove the trailing
newline, since it's just a single line) and replace the contents of
@raw_list with the result. This means @raw_list never has more than one
element (the last line read).

Since you want to keep all the lines, either push them onto the array:

while (<$FILE>) {
    chomp;               # remove the newline
    push @raw_list, $_;
}

or, better, use <$FILE> in list context, which returns all the lines:

my @raw_list = <$FILE>;

}
@list = uniq @raw_list;
foreach $single (@list)
{
    print "$single\n";

Ben
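Putting Ben's points together, a corrected version of Amy's script might look like this (a sketch, assuming the goal is simply to print each distinct line of the input file once):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);   # import only what is needed

my $file = $ARGV[0];
open my $FILE, '<', $file
    or die "can't read '$file': $!";

# Reading the filehandle in list context returns all remaining
# lines at once; chomp then strips the trailing newlines.
chomp(my @raw_list = <$FILE>);

for my $single (uniq @raw_list) {
    print "$single\n";
}
```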
 
Amy Lee

[snip Ben Morrow's reply]
Thank you very much. I have solved it with your method.

Best Regards,

Amy
 
RedGrittyBrick

Amy said:
Hello,

Does Perl have functions like the UNIX commands sort and uniq that can output
duplicate lines and unique lines?

Here's my code. When I run it, it outputs many lines, but I want each
duplicated line saved only once, along with the unique lines.

#!/usr/bin/perl
use strict;
use warnings;

my %seen;
for (sort <DATA>) {
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        print "$1\n" unless $seen{$1}++;
    }
}


__END__
Homo sapiens E
Homo sapiens D
Arabidopsis thaliana S
Homo sapiens G
Mus musculus P
Mus musculus Q
Mus musculus R
Homo sapiens F
Caenorhabditis elegans H
Caenorhabditis elegans I
Homo sapiens A
Homo sapiens B
Homo sapiens C
Caenorhabditis elegans J
Mus musculus L
Mus musculus O
Mus musculus M
Mus musculus N
Caenorhabditis elegans K
 
RedGrittyBrick

RedGrittyBrick said:
#!/usr/bin/perl
use strict;
use warnings;

my %seen;
for (sort <DATA>) {
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        print "$1\n" unless $seen{$1}++;
    }
}

P.S. For large amounts of data I'd prefer

#!/usr/bin/perl
use strict;
use warnings;
my %seen;
my @uniq;
for (<DATA>) {
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        push @uniq, "$1\n" unless $seen{$1}++;
    }
}
print sort @uniq;


__END__
Homo sapiens E
Homo sapiens D
Arabidopsis thaliana S
Homo sapiens G
Mus musculus P
Mus musculus Q
Mus musculus R
Homo sapiens F
Caenorhabditis elegans H
Caenorhabditis elegans I
Homo sapiens A
Homo sapiens B
Homo sapiens C
Caenorhabditis elegans J
Mus musculus L
Mus musculus O
Mus musculus M
Mus musculus N
Caenorhabditis elegans K
 
RedGrittyBrick

RedGrittyBrick said:
P.S. For large amounts of data I'd prefer

#!/usr/bin/perl
use strict;
use warnings;
my %seen;
my @uniq;
for (<DATA>) {
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        push @uniq, "$1\n" unless $seen{$1}++;
    }
}
print sort @uniq;


#!/usr/bin/perl
use strict;
use warnings;
my %seen;
for (<DATA>) {
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        $seen{"$1\n"}++;
    }
}
print sort keys %seen;


This is a deep hole I've dug myself into :)
 
Amy Lee

[snip RedGrittyBrick's script]
Thank you very much!

Regards,

Amy
 
Bart Lateur

Amy said:
Here's the code:

open FILE, '<', "$file";
while (<FILE>)
{
    @raw_list = split /\n/, $_;
    @list = uniq (@raw_list);
    print "@list\n";
}
It seems that uniq does nothing! I don't know why.

You need to slurp the whole file before working on the data. Right now, for
each line, you're checking whether that same line already appears in a list
of one, which is impossible.

Using one of the tricks from the FAQ, one can do this:

open FILE, '<', "$file";
my %seen;
while (<FILE>)
{
    print unless $seen{$_}++;
}

The hash %seen remembers the lines printed in the past, too. That's why
it works across lines.
 
Tim Greer

Amy said:
There is a uniq function in the List::MoreUtils module; otherwise, the
standard way is to use the printed strings as keys in a hash to mark
which lines have already been printed.

//Makholm
Hello,

I used the List::MoreUtils module to process the file, but it still fails
and outputs just the last line. Here's my code.

use List::MoreUtils qw(any all none notall true false firstidx first_index
                       lastidx last_index insert_after insert_after_string
                       apply after after_incl before before_incl indexes
                       firstval first_value lastval last_value each_array
                       each_arrayref pairwise natatime mesh zip uniq minmax);

$file = $ARGV[0];
open FILE, '<', "$file";
while (<FILE>)
{
    @raw_list = split /\n/, $_;
}
@list = uniq @raw_list;
foreach $single (@list)
{
    print "$single\n";
}

Thank you very much.

Regards,

Amy

Why read it into an array, just to break it down again? Per line, use
hashes and check to see if it's been 'seen' yet.
 
Martien Verbruggen

RedGrittyBrick wrote:
P.S. For large amounts of data I'd prefer

#!/usr/bin/perl
use strict;
use warnings;
my %seen;
my @uniq;
for (<DATA>) {
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        push @uniq, "$1\n" unless $seen{$1}++;
    }
}
print sort @uniq;

Wouldn't it be better to use while (<DATA>) {} (or one of the equivalent
forms listed in perlop), since for() builds a list? Or is this no longer
the case?

Martien

PS. I couldn't find anything in the delta documents, since 5.6, about
foreach having changed this behaviour, but then, there's a lot of
documentation, and I could easily have missed something.
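For what it's worth, a streaming variant of RedGrittyBrick's last script using while might look like this (an illustration, not from the thread; the DATA lines are made-up samples). while (<DATA>) reads one line per iteration, whereas for (<DATA>) first reads the whole file into a list:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;
while (<DATA>) {    # one line at a time; no full list in memory
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        $seen{"$1\n"}++;
    }
}
print sort keys %seen;

__DATA__
Homo sapiens E
Mus musculus P
Homo sapiens F
```

The %seen hash still grows with the number of distinct keys, but the input itself is never held in memory all at once.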
 
