Help: Duplicate and Unique Lines Problem

Amy Lee

Hello,

Does Perl have functions like the UNIX commands sort and uniq that can output
duplicate lines and unique lines?

Here's my code. When I run it, it outputs many lines, but I want each
duplicated line saved only once, along with the unique lines.

while (<>)
{
    if (/^\>.*/)
    {
        s/\>//g;
        if (/\w+\s\w+\s(.*)\smiR.*\s\w+/g)
        {
            print "$1\n";
        }
    }
}

The output is like this:

.......
Homo sapiens
Homo sapiens
Homo sapiens
Homo sapiens
Homo sapiens
Homo sapiens
Homo sapiens
Caenorhabditis elegans
Caenorhabditis elegans
Caenorhabditis elegans
Caenorhabditis elegans
Mus musculus
Mus musculus
Mus musculus
Mus musculus
Mus musculus
Mus musculus
Mus musculus
Arabidopsis thaliana
.........

My intended output is:
........
Homo sapiens
Caenorhabditis elegans
Mus musculus
Arabidopsis thaliana
........

Thank you very much~

Best Regards,

Amy Lee
 
Peter Makholm

Amy Lee said:
Does Perl have functions like the UNIX commands sort and uniq that can output
duplicate lines and unique lines?

There is a uniq function in the List::MoreUtils module; otherwise, the
standard way is to use the printed strings as keys in a hash to mark
which lines have already been printed.

//Makholm
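Makholm's hash-of-seen-keys idiom can be sketched as follows (a minimal illustration, not from the thread; the sample lines in the DATA section are made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Print each line only the first time it is seen, using the
# line itself as a hash key. $seen{$line}++ is false (0) on
# the first encounter and true afterwards.
my %seen;
while (my $line = <DATA>) {
    print $line unless $seen{$line}++;
}

__DATA__
Homo sapiens
Homo sapiens
Mus musculus
Homo sapiens
Mus musculus
```

Unlike piping through sort | uniq, this preserves the order in which each distinct line first appears.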
 
Amy Lee

If you're running on *NIX, just pipe your script to sort/uniq and you're done.

BugBear
Thank you. But I'd like to make it more convenient so I can put the code into
another Perl script.

Regards,

Amy Lee
 
Amy Lee

There is a uniq function in the List::MoreUtils module; otherwise, the
standard way is to use the printed strings as keys in a hash to mark
which lines have already been printed.

//Makholm
Hello,

I used the List::MoreUtils module to process the file, but it still fails and
outputs just the last line. Here's my code.

use List::MoreUtils qw(any all none notall true false firstidx first_index
                       lastidx last_index insert_after insert_after_string
                       apply after after_incl before before_incl indexes
                       firstval first_value lastval last_value each_array
                       each_arrayref pairwise natatime mesh zip uniq minmax);

$file = $ARGV[0];
open FILE, '<', "$file";
while (<FILE>)
{
    @raw_list = split /\n/, $_;
}
@list = uniq @raw_list;
foreach $single (@list)
{
    print "$single\n";
}

Thank you very much.

Regards,

Amy
 
Amy Lee

Perl has a built-in sort, and unique can be implemented with a few lines
of code. They're even in the official FAQ:

perlfaq4: How can I remove duplicate elements from a list or
array?

http://perldoc.perl.org/perlfaq4.html#How-can-I-remove-duplicate-elements-from-a-list-or-array?
Here's the code:

open FILE, '<', "$file";
while (<FILE>)
{
    @raw_list = split /\n/, $_;
    @list = uniq (@raw_list);
    print "@list\n";
}
It seems that uniq does nothing! I don't know why.

Amy
 
Ben Morrow

Quoth Amy Lee:
I used the List::MoreUtils module to process the file, but it still fails and
outputs just the last line. Here's my code.

use List::MoreUtils qw(any all none notall true false firstidx first_index
lastidx last_index insert_after insert_after_string
apply after after_incl before before_incl indexes
firstval first_value lastval last_value each_array
each_arrayref pairwise natatime mesh zip uniq
minmax);

Don't import more than you need.

use List::MoreUtils qw(uniq);
$file = $ARGV[0];

Your script should start with

use warnings;
use strict;

which will mean you need 'my' on all your variables.

my $file = $ARGV[0];
open FILE, '<', "$file";

Use lexical filehandles.
Always check the return value of open.
Don't quote things when you don't need to.

open my $FILE, '<', $file
    or die "can't read '$file': $!";
while (<FILE>)
{
    @raw_list = split /\n/, $_;

while (<FILE>) reads the file one line at a time. You then split that
line on /\n/ (which won't do anything except remove the trailing
newline, since it's just a single line) and replace the contents of
@raw_list with the result. This means @raw_list never has more than one
element (the last line read).

Since you want to keep all the lines, either push them onto the array:

while (<$FILE>) {
    chomp;               # remove the newline
    push @raw_list, $_;
}

or, better, use <$FILE> in list context, which returns all the lines:

my @raw_list = <$FILE>;

}
@list = uniq @raw_list;
foreach $single (@list)
{
    print "$single\n";

Ben
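Putting Ben's points together, a corrected version of Amy's script might look like this (a sketch, assuming the goal is simply to print each distinct line of the input file once):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);   # import only what is needed

my $file = $ARGV[0];
open my $FILE, '<', $file
    or die "can't read '$file': $!";

# Reading the filehandle in list context returns all remaining
# lines at once; chomp then strips the trailing newlines.
chomp(my @raw_list = <$FILE>);

for my $single (uniq @raw_list) {
    print "$single\n";
}
```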
 
Amy Lee

[snip Ben Morrow's reply]
Thank you very much. I have solved it with your method.

Best Regards,

Amy
 
RedGrittyBrick

Amy said:
Hello,

Does Perl have functions like the UNIX commands sort and uniq that can output
duplicate lines and unique lines?

Here's my code. When I run it, it outputs many lines, but I want each
duplicated line saved only once, along with the unique lines.

#!/usr/bin/perl
use strict;
use warnings;

my %seen;
for (sort <DATA>) {
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        print "$1\n" unless $seen{$1}++;
    }
}


__END__
Homo sapiens E
Homo sapiens D
Arabidopsis thaliana S
Homo sapiens G
Mus musculus P
Mus musculus Q
Mus musculus R
Homo sapiens F
Caenorhabditis elegans H
Caenorhabditis elegans I
Homo sapiens A
Homo sapiens B
Homo sapiens C
Caenorhabditis elegans J
Mus musculus L
Mus musculus O
Mus musculus M
Mus musculus N
Caenorhabditis elegans K
 
RedGrittyBrick

RedGrittyBrick said:
#!/usr/bin/perl
use strict;
use warnings;

my %seen;
for (sort <DATA>) {
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        print "$1\n" unless $seen{$1}++;
    }
}

P.S. For large amounts of data I'd prefer

#!/usr/bin/perl
use strict;
use warnings;
my %seen;
my @uniq;
for (<DATA>) {
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        push @uniq, "$1\n" unless $seen{$1}++;
    }
}
print sort @uniq;


__END__
Homo sapiens E
Homo sapiens D
Arabidopsis thaliana S
Homo sapiens G
Mus musculus P
Mus musculus Q
Mus musculus R
Homo sapiens F
Caenorhabditis elegans H
Caenorhabditis elegans I
Homo sapiens A
Homo sapiens B
Homo sapiens C
Caenorhabditis elegans J
Mus musculus L
Mus musculus O
Mus musculus M
Mus musculus N
Caenorhabditis elegans K
 
RedGrittyBrick

RedGrittyBrick said:
P.S. For large amounts of data I'd prefer

#!/usr/bin/perl
use strict;
use warnings;
my %seen;
my @uniq;
for (<DATA>) {
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        push @uniq, "$1\n" unless $seen{$1}++;
    }
}
print sort @uniq;


#!/usr/bin/perl
use strict;
use warnings;
my %seen;
for (<DATA>) {
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        $seen{"$1\n"}++;
    }
}
print sort keys %seen;


This is a deep hole I've dug myself into :)
 
Amy Lee

[snip RedGrittyBrick's script]
Thank you very much!

Regards,

Amy
 
Bart Lateur

Amy said:
Here's the code:

open FILE, '<', "$file";
while (<FILE>)
{
    @raw_list = split /\n/, $_;
    @list = uniq (@raw_list);
    print "@list\n";
}
It seems that uniq does nothing! I don't know why.

You need to slurp the whole file before working on the data. Right now, for
each line, you're checking whether that same line already appears in a list
of one, which is impossible.

Using one of the tricks from the FAQ, one can do this:

open FILE, '<', "$file";
my %seen;
while (<FILE>)
{
    print unless $seen{$_}++;
}

The hash %seen remembers the lines printed in the past, too. That's why
it works across lines.
 
Tim Greer

Amy said:
There is a uniq function in the List::MoreUtils module; otherwise, the
standard way is to use the printed strings as keys in a hash to mark
which lines have already been printed.

//Makholm
Hello,

I used the List::MoreUtils module to process the file, but it still fails
and outputs just the last line. Here's my code.

use List::MoreUtils qw(any all none notall true false firstidx first_index
                       lastidx last_index insert_after insert_after_string
                       apply after after_incl before before_incl indexes
                       firstval first_value lastval last_value each_array
                       each_arrayref pairwise natatime mesh zip uniq minmax);

$file = $ARGV[0];
open FILE, '<', "$file";
while (<FILE>)
{
    @raw_list = split /\n/, $_;
}
@list = uniq @raw_list;
foreach $single (@list)
{
    print "$single\n";
}

Thank you very much.

Regards,

Amy

Why read it into an array, just to break it down again? Per line, use
hashes and check to see if it's been 'seen' yet.
 
Martien Verbruggen

RedGrittyBrick wrote:
P.S. For large amounts of data I'd prefer

#!/usr/bin/perl
use strict;
use warnings;
my %seen;
my @uniq;
for (<DATA>) {
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        push @uniq, "$1\n" unless $seen{$1}++;
    }
}
print sort @uniq;

Wouldn't it be better to use while (<DATA>) {} (or one of the equivalent
forms listed in perlop), since for() builds a list? Or is this no longer
the case?

Martien

PS. I couldn't find anything in the delta documents, since 5.6, about
foreach having changed this behaviour, but then, there's a lot of
documentation, and I could easily have missed something.
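For what it's worth, a streaming variant of RedGrittyBrick's last script using while might look like this (an illustration, not from the thread; the DATA lines are made-up samples). while (<DATA>) reads one line per iteration, whereas for (<DATA>) first reads the whole file into a list:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;
while (<DATA>) {    # one line at a time; no full list in memory
    chomp;
    if (/(\w+\s+\w+\s+)/) {
        $seen{"$1\n"}++;
    }
}
print sort keys %seen;

__DATA__
Homo sapiens E
Mus musculus P
Homo sapiens F
```

The %seen hash still grows with the number of distinct keys, but the input itself is never held in memory all at once.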
 
