Perl script to mimic uniq

Martin Foster

Hi.

I would like to be able to mimic the unix tool 'uniq' within a Perl script.

I have a file with entries that look like this

4 10 21 37 58 83 111 145 184 226
4 12 24 42 64 92 124 162 204 252
4 11 23 44 67 95 134 168 215 271
..
..
..

There are many number sequences. I would like to analyze the file to find out how often each
sequence occurs throughout the file.

I've begun writing a script:

#!/usr/bin/perl
# Perl script to find most common CS
use strict;

my @line;
my $infile = "/home/martin/DATABASE/large.txt";
open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
my @array = <INFILE>;
my $no_lines = $#array;
print "There are ", $no_lines+1, " lines in the large array\n";

my (@table);
foreach my $array (@array) {
push(@table, [split(/\s/, $array) ]);
}

my $no_cells = $#{$table[$no_lines]};

for (my $k =0; $k<=$no_lines; $k++) {
print "[$k] occurs ";
my $match=0;
my $matched=0;
for (my $h =0; $h<=$no_lines; $h++) {
for (my $j =3; $j<=12; $j++ ) {
if ($table[$k][$j] == $table[$h][$j]){
$match++;
}
}
if ($match==10) {
$matched++;
}
}
print "$matched times\n";
} # end of large loop

Does anyone know a better, quicker method of doing this?

Many thanks in advance for any suggestions.
 
nobull

I would like to be able to mimic the unix tool 'uniq' within a Perl script.

There are Perl implementations of the Unix tools "out there". (Doing a
web search to find them is left as an exercise for the reader.)

I have a file with entries that look like this

4 10 21 37 58 83 111 145 184 226
4 12 24 42 64 92 124 162 204 252
4 11 23 44 67 95 134 168 215 271
.
.
.

There are many number sequences. I would like to analyze the file to find out how often each
sequence occurs throughout the file.

That is not what Unix uniq does. 'uniq' compares adjacent lines.
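
To make the difference concrete, here is a minimal sketch contrasting the two behaviours on unsorted input. The sub names collapse_adjacent and count_all and the sample data are made up for illustration; this is not how uniq itself is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: behave like 'uniq', i.e. drop a line only when it
# repeats the line immediately before it.
sub collapse_adjacent {
    my @out;
    for my $line (@_) {
        push @out, $line unless @out && $out[-1] eq $line;
    }
    return @out;
}

# Hypothetical helper: count every distinct line, adjacent or not.
sub count_all {
    my %count;
    $count{$_}++ for @_;
    return %count;
}

my @lines = ( "a", "b", "a", "a" );

my @collapsed = collapse_adjacent(@lines);    # the two runs of "a" stay separate
my %count     = count_all(@lines);            # every "a" is counted together

print "uniq-style: @collapsed\n";
print "counted:    $_ x $count{$_}\n" for sort keys %count;
```

On unsorted input the uniq-style pass still reports "a" twice, which is why uniq is normally fed sorted data.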

Always reduce your problems to their simplest form. The fact that the
lines of the file happen to be sequences of numbers is not part of
your problem's simplest form.

I shall assume that you really want to count the number of times each
distinct line appears in a file.

The canonical Perl one-liner to do this is:

perl -ne '$c{$_}++; END { print "$c{$_} $_" for keys %c }'

Or as a script:

#!/usr/bin/perl
use strict;
use warnings;

my %count;

$count{$_}++ while <>;

print "$count{$_} $_" for keys %count;
__END__

I've begun writing a script:

Good. We don't like helping people who don't show what they've tried.
As a reward I'll give you some general Perl programming tips!

#!/usr/bin/perl
# Perl script to find most common CS

That comment does not describe what the script does.
Wrong comments are worse than no comments.
use strict;

Get as much help as you can, use warnings too!
my @line;

You never use this variable.
my $infile = "/home/martin/DATABASE/large.txt";
open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
my @array = <INFILE>;
my $no_lines = $#array;

Variable names should reflect what's in the variable.

There's no point having a variable that's just a copy of $#array
since you can always just use $#array.
print "There are ", $no_lines+1, " lines in the large array\n";

It would be more idiomatic to use scalar(@array) rather than $#array+1.
my (@table);
foreach my $array (@array) {
push(@table, [split(/\s/, $array) ]);
}

For really simple for/push loops like this consider using map:

my @table = map { [ split ] } @array;
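
A small sketch of how the two forms compare, using made-up sample lines. Note one difference that is easy to miss: the original loop splits on every single whitespace character, while a bare split (as in the map version) splits on runs of whitespace, so the two disagree when a line contains a double space.

```perl
use strict;
use warnings;

my @array = ( "4 10 21\n", "4  12 24\n" );    # second line has a double space

# The explicit loop from the original script.
my @table_loop;
foreach my $line (@array) {
    push @table_loop, [ split /\s/, $line ];
}

# The map version: bare split works on $_ and splits on runs of whitespace.
my @table_map = map { [ split ] } @array;

print scalar( @{ $table_loop[1] } ), " fields with split /\\s/\n";
print scalar( @{ $table_map[1] } ),  " fields with bare split\n";
```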
my $no_cells = $#{$table[$no_lines]};

Variable names should reflect what's in the variable.

Anyhow you never use that variable.
for (my $k =0; $k<=$no_lines; $k++) {

Don't use C-style for in Perl unless you need to.

for my $k ( 0 .. $no_lines ) {
print "[$k] occurs ";

Hang on, $k is the line number (minus one) not the content of the
line.
I suspect there's more to your original problem than you are telling
us.
my $match=0;
my $matched=0;
for (my $h =0; $h<=$no_lines; $h++) {
for (my $j =3; $j<=12; $j++ ) {

Where did those 3 and 12 come from? I suspect there's more to your
original problem than you are telling us.
if ($table[$k][$j] == $table[$h][$j]){
$match++;
}
}
if ($match==10) {
$matched++;
}

Rather than counting matches and checking you have 10 it would be
better to count mismatches and check you have 0. That way if the 12
ever had to become 13 you wouldn't have to change 10 to 11.
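
As a rough sketch of the mismatch-counting idea: the helper name same_columns and the sample rows are invented for illustration, and the column range is passed in rather than hard-coded so no second magic number is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: true if two rows agree in every column from
# $first to $last (inclusive).
sub same_columns {
    my ( $row_a, $row_b, $first, $last ) = @_;
    my $mismatches = 0;
    for my $j ( $first .. $last ) {
        $mismatches++ if $row_a->[$j] != $row_b->[$j];
    }
    return $mismatches == 0;
}

my @r1 = ( 810, 4, 10, 21 );
my @r2 = ( 879, 4, 10, 21 );
my @r3 = ( 811, 4, 12, 24 );

print "r1 and r2 match\n"  if same_columns( \@r1, \@r2, 1, 3 );
print "r1 and r3 differ\n" unless same_columns( \@r1, \@r3, 1, 3 );
```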
}
print "$matched times\n";
} # end of large loop

Does anyone know a better, quicker method of doing this?

Doing what? You've moved the goal-posts several times.
Many thanks in advance for any suggestions.

I suggest that you get clear in your mind what you are asking before
you ask it.

I also suggest you post to newsgroups that still exist (this one
doesn't, see FAQ). Your post will then be seen by many more people.
 
Martin Foster

There are Perl implementations of the Unix tools "out there". (Doing a
web search to find them is left as an exercise for the reader.)


That is not what Unix uniq does. 'uniq' compares adjacent lines.

I know, I can sort lines to be adjacent and then use uniq.

Always reduce your problems to their simplest form. The fact that the
lines of the file happen to be sequences of numbers is not part of
your problem's simplest form.

I shall assume that you really want to count the number of times each
distinct line appears in a file.

The canonical Perl one-liner to do this is:

perl -ne '$c{$_}++; END { print "$c{$_} $_" for keys %c }'

Or as a script:

#!/usr/bin/perl
use strict;
use warnings;

my %count;

$count{$_}++ while <>;

print "$count{$_} $_" for keys %count;
__END__

This is amazing. I don't understand how it works, but it's very
powerful.
Can I use this script to compare the n columns of a file, not the entire
line?

I've begun writing a script:

Good. We don't like helping people who don't show what they've tried.
As a reward I'll give you some general Perl programming tips!
#!/usr/bin/perl
# Perl script to find most common CS

That comment does not describe what the script does.
Wrong comments are worse than no comments.
use strict;

Get as much help as you can, use warnings too!
my @line;

You never use this variable.
my $infile = "/home/martin/DATABASE/large.txt";
open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
my @array = <INFILE>;
my $no_lines = $#array;

Variable names should reflect what's in the variable.

There's no point having a variable that's just a copy of $#array
since you can always just use $#array.
print "There are ", $no_lines+1, " lines in the large array\n";

It would be more idiomatic to use scalar(@array) rather than $#array+1.
my (@table);
foreach my $array (@array) {
push(@table, [split(/\s/, $array) ]);
}

For really simple for/push loops like this consider using map:

my @table = map { [ split ] } @array;

Ok. Thanks, I've not used map before, just beginning to learn.
my $no_cells = $#{$table[$no_lines]};

Variable names should reflect what's in the variable.

Anyhow you never use that variable.
for (my $k =0; $k<=$no_lines; $k++) {

Don't use C-style for in Perl unless you need to.

for my $k ( 0 .. $no_lines ) {
print "[$k] occurs ";

Hang on, $k is the line number (minus one) not the content of the
line.
I suspect there's more to your original problem than you are telling
us.
my $match=0;
my $matched=0;
for (my $h =0; $h<=$no_lines; $h++) {
for (my $j =3; $j<=12; $j++ ) {

Where did those 3 and 12 come from? I suspect there's more to your
original problem than you are telling us.

I've got an identifier for each line at the beginning, for example

1666237 4 10 23 16 and so on. The identifier is an id to link to
something else and so on. I just want to compare the 10 columns with
the numbers.
if ($table[$k][$j] == $table[$h][$j]){
$match++;
}
}
if ($match==10) {
$matched++;
}

Rather than counting matches and checking you have 10 it would be
better to count mismatches and check you have 0. That way if the 12
ever had to become 13 you wouldn't have to change 10 to 11.
}
print "$matched times\n";
} # end of large loop

Does anyone know a better, quicker method of doing this?

Doing what? You've moved the goal-posts several times.
Many thanks in advance for any suggestions.

I suggest that you get clear in your mind what you are asking before
you ask it.

I also suggest you post to newsgroups that still exist (this one
doesn't, see FAQ). Your post will then be seen by many more people.

BTW where is the FAQ, which says this newsgroup no longer exists?
 
Jürgen Exner

Martin said:
I would like to be able to mimic the unix tool 'uniq' within a Perl
script.

Unfortunately the FAQ entry is worded the opposite way:
perldoc -q duplicate:
"How can I remove duplicate elements from a list or array?"

jue
 
nobull

This is amazing, I don't understand how it works but it's very
powerful.

If you look in the newsgroup that replaced this one when this one was
deleted, you'll find every couple of months someone posts a script
substantially like the one above and says "I found this - how does it
work?".

You could look at one of those threads.

I believe it is also an example that is used in most Perl tutorials.
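
For readers who want the short version, here is the same counting idiom written out long-hand with comments. The three sample lines are made up; the logic is intended to match the short script quoted above.

```perl
use strict;
use warnings;

my @lines = ( "4 10 21\n", "4 12 24\n", "4 10 21\n" );

my %count;
for my $line (@lines) {
    # The line itself is the hash key. The first time a line is seen its
    # entry springs into existence as undef, and ++ turns that into 1.
    $count{$line}++;
}

# Each distinct line is now a key, and its value is how often it occurred.
for my $line ( sort keys %count ) {
    print "$count{$line} $line";
}
```

The sort is only there to make the output order repeatable; a plain "keys %count" returns the keys in no particular order.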

Can I use this script to compare the n columns of a file, not the entire
line?

No you can't use this _script_. But you can use the technique.

Rather than keying %count on the whole line you can use some sort of
string manipulation to extract just part of the line to consider. The
most normal way to manipulate strings in Perl is the m// and s///
operators.
I've got an identifier for each line at the beginning, for example

1666237 4 10 23 16 and so on. The identifier is an id to link to
something else and so on. I just want to compare the 10 columns with
the numbers.

Well if, for example, we say the first 3 whitespace-delimited columns
are the identifier you could remove them thus:
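
A minimal sketch of that idea, not necessarily the code originally posted: assuming the identifier really is the first three whitespace-delimited columns (the sample lines below are made up), strip them with s/// and key the hash on what remains.

```perl
use strict;
use warnings;

# Made-up sample data: three ID columns followed by the sequence.
my @lines = (
    "x1 y1 z1 4 10 21\n",
    "x2 y2 z2 4 12 24\n",
    "x3 y3 z3 4 10 21\n",
);

my %count;
for my $line (@lines) {
    # Copy the line, then remove the three leading columns from the copy.
    ( my $key = $line ) =~ s/^(?:\S+\s+){3}//
        or die "Line has fewer than 3 leading columns: $line";
    $count{$key}++;
}

print "$count{$_} $_" for sort keys %count;
```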

BTW where is the FAQ, which says this newsgroup no longer exists?

The Perl FAQ is part of the standard Perl documentation that can be
found on any computer on which Perl has been installed and also on
various Perl-related web sites.
 
Martin Foster

Thanks for your help.

My script now looks like this:


#!/usr/bin/perl
# Perl script to find most common CS
use strict;
use warnings;

my $infile = "/home/martin/DATABASE/large.txt";
open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
my %count;

do {
$_ =~ s/^(\S+\s+){2}//;
$count{$_}++
} while <INFILE>;

print "$count{$_} $_" for keys %count;
__END__

So I'm feeding the file into the %count array by removing the first two
columns with the identifier information and then counting the keys.
How can I still keep the identifier part of the line linked to the array?
This is the part which I'm really interested in.
I can't keep the identifier in the %count array, since this would screw up
the "for keys" part.

I checked perldoc -q and found how to remove duplicates but I don't think
I can rewrite this to do what I want.

The "for keys" method is brilliant but I'm losing the identifier.

So I'm back to my original script which looks like this.

#!/usr/bin/perl
# Perl script to find most common CS
use strict;
use warnings;


my $infile = "/home/martin/DATABASE/large.txt";
open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
my @array = <INFILE>;
print "There are ", $#array+1, " lines in the large array\n";

my (@table);
foreach my $array (@array) {
push(@table, [split(/\s/, $array) ]);
}

for (my $k =0; $k<=$#array; $k++) {
print "$table[$k][1] $table[$k][2] occurs ";
my $matched=0;
for (my $h =0; $h<=$#array; $h++) {
my $match=0;
for (my $j =2; $j<=11; $j++ ) {
if ($table[$k][$j] == $table[$h][$j]){
$match++;
}
}
if ($match==10) {
$matched++;
}
}
print "$matched times\n";
} # end of large loop


But this sad looking script is not very smart and very slow, I don't want to
run over each line. I would like the script to search the file,
identify a sequence as unique. If there are duplicate sequences
in that file then print out how many and do not revisit that line
if it has been counted as a duplicate.


my data file looks like this, a small section only.


810 141-2_1_2 4 10 21 37 58 83 111 145 184 226
811 141-2_1_6 4 12 24 42 64 92 124 162 204 252
812 141-2_1_7 4 11 23 44 67 95 134 168 215 271
879 141_1_2 4 10 21 37 58 83 111 145 184 226
880 141_1_6 4 12 24 42 64 92 124 162 204 252
881 141_1_7 4 11 23 44 67 95 134 168 215 271
882 152_1_15 4 12 26 44 72 104 138 178 228 282
883 152_1_23 4 10 21 40 65 96 134 180 230 286
884 152_1_24 4 10 21 40 65 96 134 180 230 286
885 152_1_3 4 12 22 40 66 102 128 168 218 268

Again many thanks for your help. I still don't get why you say
this newsgroup has been deleted. What is the url for the replacement
newsgroup?
 
nobull

(e-mail address removed) (Martin Foster) spits TOFU in my face:
Thanks for your help.

Please, if you want to thank me, learn to quote properly. TOFU ((new)
Text Over, Full-quote Under) is considered very rude.
My script now looks like this:


#!/usr/bin/perl
# Perl script to find most common CS
use strict;
use warnings;

my $infile = "/home/martin/DATABASE/large.txt";
open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
my %count;

do {
$_ =~ s/^(\S+\s+){2}//;
$count{$_}++
} while <INFILE>;

Please see perldoc perlsyn for how "do { BLOCK } while EXPR" is
different from "while (EXPR) { BLOCK }". In this case you want the
latter.
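
A small sketch of the difference, with shift on an array standing in for reading from INFILE (that substitution, and the variable names, are assumptions made so the example runs without a data file):

```perl
use strict;
use warnings;

my @input = ( "a\n", "b\n" );

# do { BLOCK } while EXPR: the block runs once BEFORE the first read.
my @queue = @input;
my ( $line, @seen_do );
do {
    push @seen_do, defined $line ? $line : "(nothing read yet)\n";
} while ( defined( $line = shift @queue ) );

# while (EXPR) { BLOCK }: the read happens first, so every pass has data.
@queue = @input;
my @seen_while;
while ( defined( my $item = shift @queue ) ) {
    push @seen_while, $item;
}

print "do-while saw ", scalar(@seen_do), " items, starting with $seen_do[0]";
print "while saw    ", scalar(@seen_while), " items\n";
```

In the posted script that extra first pass means one bogus entry gets counted before any line has been read.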

Saying "$_ =~" i.e. "don't use $_, use $_ instead" is considered
somewhat affected. Either use $_ (and don't mention it) or use
something else instead.

You are assuming the s/// always succeeds. Whenever you assume
something like this will always succeed you should decorate it with
"or die". This acts as a comment saying "I'm assuming this always
succeeds". It also causes the program to crash out rather than carry on
and do something weird if your assumption was wrong.
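
To illustrate, a short sketch with a hypothetical helper strip_id and made-up input. The eval is there only so the example can show the failure without actually stopping.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the two leading ID columns or die trying.
sub strip_id {
    my ($line) = @_;
    $line =~ s/^(?:\S+\s+){2}//
        or die "Unexpected line format: $line";
    return $line;
}

print strip_id("810 141-2_1_2 4 10 21\n");

# A malformed line (only one column) now stops the program instead of being
# silently counted; eval is used here only to show the message.
my $ok = eval { strip_id("garbage\n"); 1 };
print "died: $@" unless $ok;
```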
So I'm feeding the file into the %count array by removing the first two
columns with the identifier information and then counting the keys.
How can I still keep the identifier part of the line linked to the array?
Since this is the part which I'm really interested in.

Ah, well you never mentioned that before. It helps to know what you
want.
I can't keep the identifier in
the %count array, since this would screw up the "for keys" part.

You can't keep it in the keys of %count, but you can keep it in the
values.

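while (<INFILE>) {
# The outer parentheses capture both ID columns; a bare (\S+\s+){2}
# would leave only the second column in $1.
s/^((?:\S+\s+){2})// or die;
push @{$count{$_}}, $1;
}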

I checked perldoc -q and found how to remove duplicates but I don't think
I can rewrite this to do what I want.

Don't worry I'm sure your programming skill will improve. You appear
smart but inexperienced. You do, however, seem to have an unfortunate
streak of defeatism.
The "for keys" method is brilliant but I'm losing the identifier.

So I'm back to my original script which looks like this.

Why? I showed you many ways to improve it independent of changing the
algorithm.
#!/usr/bin/perl
# Perl script to find most common CS

I still don't get how this comment relates to what your program does
nor what you say you want it to do.
use strict;
use warnings;


my $infile = "/home/martin/DATABASE/large.txt";
open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
my @array = <INFILE>;
print "There are ", $#array+1, " lines in the large array\n";

my (@table);
foreach my $array (@array) {
push(@table, [split(/\s/, $array) ]);
}

for (my $k =0; $k<=$#array; $k++) {
print "$table[$k][1] $table[$k][2] occurs ";
my $matched=0;
for (my $h =0; $h<=$#array; $h++) {
my $match=0;
for (my $j =2; $j<=11; $j++ ) {
if ($table[$k][$j] == $table[$h][$j]){
$match++;
}
}
if ($match==10) {
$matched++;
}
}
print "$matched times\n";
} # end of large loop


But this sad looking script is not very smart and very slow, I don't want to
run over each line. I would like the script to search the file,
identify a sequence as unique. If there are duplicate sequences
in that file then print out how many and do not revisit that line
if it has been counted as a duplicate.

It's not clear what you are saying.

Are you saying you want the first ID (only) and the number of
occurrences of each distinct sequence?

while (<INFILE>) {
# The outer parentheses capture both ID columns; a bare (\S+\s+){2}
# would leave only the second column in $1.
s/^((?:\S+\s+){2})// or die;
push @{$count{$_}}, $1;
}

for ( values %count ) {
print "$_->[0]occurs ",scalar(@$_)," times\n";
}
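
Put together as a self-contained sketch, using a few rows from the sample data posted earlier in the thread in place of INFILE (an assumption made so it runs without the data file):

```perl
use strict;
use warnings;

# Rows taken from the sample data posted earlier in the thread.
my @lines = (
    "810 141-2_1_2 4 10 21 37 58 83 111 145 184 226\n",
    "811 141-2_1_6 4 12 24 42 64 92 124 162 204 252\n",
    "879 141_1_2 4 10 21 37 58 83 111 145 184 226\n",
    "883 152_1_23 4 10 21 40 65 96 134 180 230 286\n",
    "884 152_1_24 4 10 21 40 65 96 134 180 230 286\n",
);

my %count;    # sequence => [ ID prefixes of the lines carrying it ]
for my $line (@lines) {
    # Strip the two ID columns from a copy; the outer capture keeps both.
    ( my $cs = $line ) =~ s/^((?:\S+\s+){2})//
        or die "Unexpected line format: $line";
    push @{ $count{$cs} }, $1;
}

for my $cs ( sort keys %count ) {
    my $ids = $count{$cs};
    print "$ids->[0]occurs ", scalar(@$ids), " times\n";
}
```

The captured ID keeps its trailing space, which is why there is no space before "occurs" in the print.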
I still don't get why you say this newsgroup has been deleted.

I say it because it is true, and because it will help people who
didn't know this to reach a larger audience.
What is the url for the replacement newsgroup?

What part of the answer to the Perl FAQ: "What are the Perl newsgroups
on Usenet?" are you having trouble understanding?
 
Martin Foster

(e-mail address removed) (Martin Foster) spits TOFU in my face:


Please, if you want to thank me, learn to quote properly. TOFU ((new)
Text Over, Full-quote Under) is considered very rude.
I see.
Please see perldoc perlsyn for how "do { BLOCK } while EXPR" is
different from "while (EXPR) { BLOCK }". In this case you want the
latter.

Saying "$_ =~" i.e. "don't use $_, use $_ instead" is considered
somewhat affected. Either use $_ (and don't mention it) or use
something else instead.

You are assuming the s/// always succeeds. Whenever you assume
something like this will always succeed you should decorate it with
"or die". This acts as a comment saying "I'm assuming this always
succeeds". It also causes the program to crash out rather than carry on
and do something weird if your assumption was wrong.

This is a good tip. I'll use this from now on.
Ah, well you never mentioned that before. It helps to know what you
want.


You can't keep it in the keys of %count, but you can keep it in the
values.

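while (<INFILE>) {
# The outer parentheses capture both ID columns; a bare (\S+\s+){2}
# would leave only the second column in $1.
s/^((?:\S+\s+){2})// or die;
push @{$count{$_}}, $1;
}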



Don't worry I'm sure your programming skill will improve. You appear
smart but inexperienced. You do, however, seem to have an unfortunate
streak of defeatism.


Why? I showed you many ways to improve it independent of changing the
algorithm.


I still don't get how this comment relates to what your program does
nor what you say you want it to do.
The data list is a sequence of numbers, which are called coordination
sequences, CS for short. My program tries to find the most common CS
in the data file.
use strict;
use warnings;


my $infile = "/home/martin/DATABASE/large.txt";
open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
my @array = <INFILE>;
print "There are ", $#array+1, " lines in the large array\n";

my (@table);
foreach my $array (@array) {
push(@table, [split(/\s/, $array) ]);
}

for (my $k =0; $k<=$#array; $k++) {
print "$table[$k][1] $table[$k][2] occurs ";
my $matched=0;
for (my $h =0; $h<=$#array; $h++) {
my $match=0;
for (my $j =2; $j<=11; $j++ ) {
if ($table[$k][$j] == $table[$h][$j]){
$match++;
}
}
if ($match==10) {
$matched++;
}
}
print "$matched times\n";
} # end of large loop


But this sad looking script is not very smart and very slow, I don't want to
run over each line. I would like the script to search the file,
identify a sequence as unique. If there are duplicate sequences
in that file then print out how many and do not revisit that line
if it has been counted as a duplicate.

It's not clear what you are saying.

There is a list of number sequences. Each list is labelled uniquely
by an identifier. I want to sort through the list, so I start at the
1st row and then my code loops through the list checking the
sequences. If it finds a match, then that row does not need to be
revisited again later in the loop, since it has been identified as a
match to the 1st row. I guess I need to keep
an index of some sort while looping the list. Then when I start at
the 2nd row, I only loop over the sequences which are indexed as 'not
yet matched'.
I hope this makes more sense.

Are you saying you want the first ID (only) and the number of
occurrences of each distinct sequence?

Yes. This is very helpful. '$_->[0]' looks like a pointer. So your piece
of code maps the $1 column of the original line as a pointer to the
values of the %count array. Then the "values" of %count are the unique
"keys" of that array and "scalar" is counting the number of lines that
are the same. Is that right?
I'm trying to understand what your code does, since I want to use it.
Perl is great, but it is so difficult to read if you don't have a clue.

while (<INFILE>) {
# The outer parentheses capture both ID columns; a bare (\S+\s+){2}
# would leave only the second column in $1.
s/^((?:\S+\s+){2})// or die;
push @{$count{$_}}, $1;
}

for ( values %count ) {
print "$_->[0]occurs ",scalar(@$_)," times\n";
}
I still don't get why you say this newsgroup has been deleted.

I say it because it is true, and because it will help people who
didn't know this to reach a larger audience.
What is the url for the replacement newsgroup?

What part of the answer to the Perl FAQ: "What are the Perl newsgroups
on Usenet?" are you having trouble understanding?
 
nobull

The data list is a sequence of numbers, which are called coordination
sequences, CS for short. My program tries to find the most common CS
in the data file.

I still don't see anything in your program that relates to finding the
most common CS. It looks to me like your program is printing out the
number of occurrences of each CS.

There is a list of number sequences. Each list is labelled uniquely
by an identifier. I want to sort through the list, so I start at the
1st row and then my code loops through the list checking the
sequences. If it finds a match, then that row does not need to be
revisited again later in the loop, since it has been identified as a
match to the 1st row. I guess I need to keep
an index of some sort while looping the list. Then when I start at
the 2nd row, I only loop over the sequences which are indexed as 'not
yet matched'.

I think you are mixing up your definition of the problem you are
trying to solve with the implementation of a partial solution.
I hope this makes more sense.

Not much.
Yes. This is very helpful.

Right. So what you want is one output line for each distinct CS,
in no particular order. You don't want to find the CS that appears
most often.

If you wanted the output sorted in order of frequency you would have
to put a sort in there somewhere.
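
As a rough sketch of where such a sort could go: the hash name %ids_by_cs and its contents are made up here, standing in for whatever the read loop built.

```perl
use strict;
use warnings;

# Made-up data: sequence => [ IDs of the lines carrying it ]
my %ids_by_cs = (
    "4 10 21" => [ "810", "879", "900" ],
    "4 12 24" => [ "811" ],
    "4 11 23" => [ "812", "881" ],
);

# Most frequent sequence first; ties are broken by the sequence text so
# the order is repeatable.
my @by_frequency = sort {
    scalar( @{ $ids_by_cs{$b} } ) <=> scalar( @{ $ids_by_cs{$a} } )
        or $a cmp $b
} keys %ids_by_cs;

printf "%3d  %s\n", scalar @{ $ids_by_cs{$_} }, $_ for @by_frequency;
print "Most common CS: $by_frequency[0]\n";
```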
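while (<INFILE>) {
# The outer parentheses capture both ID columns; a bare (\S+\s+){2}
# would leave only the second column in $1.
s/^((?:\S+\s+){2})// or die;
push @{$count{$_}}, $1;
}

for ( values %count ) {
print "$_->[0]occurs ",scalar(@$_)," times\n";
}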
'$_->[0]' looks like a pointer.

This is no accident. The values of %count are references (pointers)
to arrays of IDs.
So your piece of code, maps the $1 column of the original
line as a pointer to the values of the %count array.

$1 in Perl is not like it is in awk.

In Perl $1 is whatever was captured by the first () capture in the
most recent regex in the current scope.

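So in this case $1 is meant to be the first two columns (and the following
whitespace) of the original line. One caveat: a quantified group such as
(\S+\s+){2} keeps only its last repetition, so the repetition has to sit
inside an outer capture, ((?:\S+\s+){2}), for $1 to hold both columns. I
believe, from what you've said previously, that this is some sort of ID
(identifier) and is not part of the CS.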
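
A quick way to see what a capture group holds is to try it on one line. This sketch uses a shortened row from the sample data; the variable names are made up.

```perl
use strict;
use warnings;

my $line = "810 141-2_1_2 4 10 21\n";

# A quantified capture group keeps only its last repetition.
my ($last_only) = $line =~ /^(\S+\s+){2}/;

# Wrapping the repetition in an outer capture keeps everything it matched.
my ($both) = $line =~ /^((?:\S+\s+){2})/;

print "quantified group: '$last_only'\n";
print "outer capture:    '$both'\n";
```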

Actually you probably should throw away the whitespace between the ID
and the CS.

s/^(\S+\s+\S+)\s+// or die;

Also if you want to improve readability you could avoid $_ and $1 and
also rename %count to something more appropriate to its new role:

my ( $id, $cs ) = /^(\S+\s+\S+)\s+(.*)/ or die;
push @{$ids_by_cs{$cs}}, $id;
Then the "values" of
%count are the unique "keys" of that array and "scalar" is counting
the number of lines that are the same. Is that right?

There is nothing for "that array" to refer to in the previous
sentence.

The values of the hash %count (or %ids_by_cs) are (a list of) pointers
to arrays. Each array contains the series of IDs that correspond to a
single CS. The keys of the hash are the distinct CSs themselves.
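
To make the shape of that structure visible, a small sketch that builds it by hand from made-up entries and walks it:

```perl
use strict;
use warnings;

my %ids_by_cs;
push @{ $ids_by_cs{"4 10 21"} }, "810 141-2_1_2";    # autovivifies the array
push @{ $ids_by_cs{"4 12 24"} }, "811 141-2_1_6";
push @{ $ids_by_cs{"4 10 21"} }, "879 141_1_2";

for my $cs ( sort keys %ids_by_cs ) {
    my $ids = $ids_by_cs{$cs};    # reference to the array of IDs
    print "$cs\n";
    print "    seen ", scalar(@$ids), " time(s): ", join( ", ", @$ids ), "\n";
}
```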

As to the uniqueness of the IDs, nothing in the program either
ensures or cares that the IDs in the input data are
unique.
"scalar" is counting the number of lines that are the same.

scalar is counting the number of elements in the array of IDs that
correspond to a single CS. So, yes, in effect this counts the number
of lines that were the same.
Perl is great, but it so difficult to read if you don't have a clue.

Oh, you noticed that, did you? :)
 
Aaron Sherman

Hi.

I would like to be able to mimic the unix tool 'uniq' within a Perl script.

I think you were not asking for uniq per se, so much as "uniq -c"
specifically.

Here's a simple stab.

Note that, like "uniq -c", this requires the data to be sorted.
Sorting lines in the file is left as an exercise for the reader.

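my ( $prev, $n ) = ( undef, 0 );
while (<>) {
    # A different line ends the current run: report it and start over.
    if ( defined($prev) && $_ ne $prev ) {
        printf "%7d %s", $n, $prev;
        $n = 0;
    }
} continue {
    $prev = $_;
    $n++;
}
# Report the final run.
printf "%7d %s", $n, $prev if defined $prev;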

If you actually want to do both the sorting and the unique line
counting at the same time, you need to keep every distinct line in memory
(possibly quite expensive, and this is why uniq doesn't do that). Try
this code in that case:

my %lines;
while (<>) {
    $lines{$_}++;
}
foreach my $line ( sort keys %lines ) {
    printf "%7d %s", $lines{$line}, $line;
}


All of this is typed in from my head, so make sure to check my syntax,
etc before using.
 
