Perl script to mimic uniq

Discussion in 'Perl' started by Martin Foster, Jan 30, 2004.

  1. Hi.

    I would like to be able to mimic the unix tool 'uniq' within a Perl script.

    I have a file with entries that look like this

    4 10 21 37 58 83 111 145 184 226
    4 12 24 42 64 92 124 162 204 252
    4 11 23 44 67 95 134 168 215 271
    ..
    ..
    ..

    The file contains many number sequences, and I would like to find out how
    often each sequence occurs throughout the file.

    I've begun writing a script:

    #!/usr/bin/perl
    # Perl script to find most common CS
    use strict;

    my @line;
    my $infile = "/home/martin/DATABASE/large.txt";
    open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
    my @array = <INFILE>;
    my $no_lines = $#array;
    print "There are ", $no_lines+1, " lines in the large array\n";

    my (@table);
    foreach my $array (@array) {
    push(@table, [split(/\s/, $array) ]);
    }

    my $no_cells = $#{$table[$no_lines]};

    for (my $k =0; $k<=$no_lines; $k++) {
    print "[$k] occurs ";
    my $match=0;
    my $matched=0;
    for (my $h =0; $h<=$no_lines; $h++) {
    for (my $j =3; $j<=12; $j++ ) {
    if ($table[$k][$j] == $table[$h][$j]){
    $match++;
    }
    }
    if ($match==10) {
    $matched++;
    }
    }
    print "$matched times\n";
    } # end of large loop

    Does anyone know a better, quicker method of doing this?

    Many thanks in advance for any suggestions.
    Martin Foster, Jan 30, 2004
    #1

  2. Martin Foster

    Guest

    (Martin Foster) wrote in message news:<>...
    > I would like to be able to mimic the unix tool 'uniq' within a Perl script.


    There are Perl implementations of the Unix tools "out there". (Doing a
    web search to find them is left as an exercise for the reader).

    > I have a file with entries that look like this
    >
    > 4 10 21 37 58 83 111 145 184 226
    > 4 12 24 42 64 92 124 162 204 252
    > 4 11 23 44 67 95 134 168 215 271
    > .
    > .
    > .
    >
    > Many number sequences, I would like to analyze the file to tell me how often a
    > sequence occurs throughout the file.


    That is not what Unix uniq does. 'uniq' compares adjacent lines.

    Always reduce your problems to their simplest form. The fact that the
    lines of the file happen to be sequences of numbers is not part of
    your problem's simplest form.

    I shall assume that you really want to count the number of times each
    distinct line appears in a file.

    The canonical Perl one-liner to do this is:

    perl -ne '$c{$_}++; END { print "$c{$_} $_" for keys %c }'

    Or as a script:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %count;

    $count{$_}++ while <>;

    print "$count{$_} $_" for keys %count;
    __END__
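
    To see what ends up in the hash, here is a minimal sketch with a few
    lines inlined instead of read from a file. The helper name count_lines
    and the repeated first line are assumptions made for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_lines is a hypothetical helper name, not part of the script above.
# It returns a hash reference mapping each distinct line to its count.
sub count_lines {
    my @lines = @_;
    my %count;
    $count{$_}++ for @lines;    # a missing key starts as undef, ++ makes it 1
    return \%count;
}

# Sample lines from the question, with the first one repeated on purpose.
my @lines = (
    "4 10 21 37 58 83 111 145 184 226\n",
    "4 12 24 42 64 92 124 162 204 252\n",
    "4 10 21 37 58 83 111 145 184 226\n",
);

my $count = count_lines(@lines);

# keys returns the lines in no particular order, so sort for stable output.
print "$count->{$_} $_" for sort keys %$count;
```

    The whole line, newline included, is the hash key, which is why the
    print needs no "\n" of its own.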


    > I've began writing a script:


    Good. We don't like helping people who don't show what they've tried.
    As a reward I'll give you some general Perl programming tips!

    > #!/usr/bin/perl
    > # Perl script to find most common CS


    That comment does not describe what the script does.
    Wrong comments are worse than no comments.

    > use strict;


    Get as much help as you can, use warnings too!
    >
    > my @line;


    You never use this variable.

    > my $infile = "/home/martin/DATABASE/large.txt";
    > open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
    > my @array = <INFILE>;
    > my $no_lines = $#array;


    Variable names should reflect what's in the variable.

    There's no point having a variable that's just a copy of $#array
    since you can always just use $#array.

    > print "There are ", $no_lines+1, " lines in the large array\n";


    It would be more idiomatic to use scalar(@array) rather than $#array+1.

    > my (@table);
    > foreach my $array (@array) {
    > push(@table, [split(/\s/, $array) ]);
    > }


    For really simple for/push loops like this consider using map:

    my @table = map { [ split ] } @array;
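
    One difference worth knowing, sketched here under the assumption that
    a line may contain a run of several spaces (the second sample line
    below is made up to show it): bare split is not quite the same as
    split(/\s/, ...).

```perl
use strict;
use warnings;

# The second line has a double space after the first column (made up).
my @array = ( "810 141-2_1_2 4 10\n", "811  141-2_1_6 4 12\n" );

# Bare split means split(' ', $_): it skips leading whitespace and
# treats any run of whitespace as a single separator.
my @table = map { [ split ] } @array;

# split(/\s/, ...) splits on every single whitespace character instead,
# so the double space produces an empty field.
my @naive = map { [ split /\s/, $_ ] } @array;

print scalar( @{ $table[1] } ), " fields with bare split\n";
print scalar( @{ $naive[1] } ), " fields with split on /\\s/\n";
```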

    > my $no_cells = $#{$table[$no_lines]};


    Variable names should reflect what's in the variable.

    Anyhow you never use that variable.

    >
    > for (my $k =0; $k<=$no_lines; $k++) {


    Don't use C-style for in Perl unless you need to.

    for my $k ( 0 .. $no_lines ) {

    > print "[$k] occurs ";


    Hang on, $k is the line number (minus one) not the content of the
    line.
    I suspect there's more to your original problem than you are telling
    us.

    > my $match=0;
    > my $matched=0;
    > for (my $h =0; $h<=$no_lines; $h++) {
    > for (my $j =3; $j<=12; $j++ ) {


    Where did those 3 and 12 come from? I suspect there's more to your
    original problem than you are telling us.

    > if ($table[$k][$j] == $table[$h][$j]){
    > $match++;
    > }
    > }
    > if ($match==10) {
    > $matched++;
    > }


    Rather than counting matches and checking you have 10 it would be
    better to count mismatches and check you have 0. That way if the 12
    ever had to become 13 you wouldn't have to change 10 to 11.
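
    A minimal sketch of that idea, assuming each row is an array reference
    of numbers. The helper name same_columns and the sample rows are
    illustrative, not taken from the original script.

```perl
use strict;
use warnings;

# same_columns is a hypothetical helper: true if rows $x and $y (array
# references) agree in every column from index $first to index $last.
sub same_columns {
    my ( $x, $y, $first, $last ) = @_;
    for my $j ( $first .. $last ) {
        return 0 if $x->[$j] != $y->[$j];    # the first mismatch settles it
    }
    return 1;
}

my @row_a = ( 4, 10, 21, 37 );
my @row_b = ( 4, 10, 21, 37 );
my @row_c = ( 4, 12, 24, 42 );

print same_columns( \@row_a, \@row_b, 0, $#row_a ) ? "same\n" : "different\n";
print same_columns( \@row_a, \@row_c, 0, $#row_a ) ? "same\n" : "different\n";
```

    No magic 10 appears anywhere: changing the column range changes nothing
    else, and the loop stops at the first difference.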

    > }
    > print "$matched times\n";
    > } # end of large loop
    >
    > Does anyone know a better, quicker method of doing this?


    Doing what? You've moved the goal-posts several times.

    > Many thanks in advance for any suggestions.


    I suggest that you get clear in your mind what you are asking before
    you ask it.

    I also suggest you post to newsgroups that still exist (this one
    doesn't, see FAQ). Your post will then be seen by many more people.
    , Jan 30, 2004
    #2

  3. wrote in message news:<>...
    > (Martin Foster) wrote in message news:<>...
    > > I would like to be able to mimic the unix tool 'uniq' within a Perl script.

    >
    > There are Perl implementations of the Unix tools "out there". (Doing
    > web search to find them is left as an exercise for the reader).
    >
    > > I have a file with entries that look like this
    > >
    > > 4 10 21 37 58 83 111 145 184 226
    > > 4 12 24 42 64 92 124 162 204 252
    > > 4 11 23 44 67 95 134 168 215 271
    > > .
    > > .
    > > .
    > >
    > > Many number sequences, I would like to analyze the file to tell me how often a
    > > sequence occurs throughout the file.

    >
    > That is not what Unix uniq does. 'uniq' compares adjacent lines.


    I know, I can sort lines to be adjacent and then use uniq.

    >
    > Always reduce your problems to their simplest form. The fact that the
    > lines of the file happen to be sequences of numbers is not part of
    > your problem's simplest form.
    >
    > I shall assume that you really want to count the number of times each
    > distinct line appears in a file.
    >
    > The canonical Perl one-liner to do this is:
    >
    > perl -ne '$c{$_}++; END { print "$c{$_} $_" for keys %c }'
    >
    > Or as a script:
    >
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    >
    > my %count;
    >
    > $count{$_}++ while <>;
    >
    > print "$count{$_} $_" for keys %count;
    > __END__
    >

    This is amazing. I don't understand how it works, but it's very
    powerful.
    Can I use this script to compare only the n columns of a file, not the
    entire line?

    >
    > > I've began writing a script:

    >
    > Good. We don't like helping people who don't show what they've tried.
    > As a reward I'll give you some general Perl programming tips!
    >
    > > #!/usr/bin/perl
    > > # Perl script to find most common CS

    >
    > That comment does not describe what the script does.
    > Wrong comments are worse than no comments.
    >
    > > use strict;

    >
    > Get as much help as you can, use warnings too!
    > >
    > > my @line;

    >
    > You never use this variable.
    >
    > > my $infile = "/home/martin/DATABASE/large.txt";
    > > open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
    > > my @array = <INFILE>;
    > > my $no_lines = $#array;

    >
    > Variable names should reflect what's in the variable.
    >
    > There's no point having a variable that's just a copy of $#array
    > since you can always just use $#array.
    >
    > > print "There are ", $no_lines+1, " lines in the large array\n";

    >
    > It would be more idiomatic to use scalar(@array) rather than $#array+1.
    >
    > > my (@table);
    > > foreach my $array (@array) {
    > > push(@table, [split(/\s/, $array) ]);
    > > }

    >
    > For really simple for/push loops like this consider using map:
    >
    > my @table = map { [ split ] } @array;


    Ok. Thanks, I've not used map before, just beginning to learn.

    >
    > > my $no_cells = $#{$table[$no_lines]};

    >
    > Variable names should reflect what's in the variable.
    >
    > Anyhow you never use that variable.
    >
    > >
    > > for (my $k =0; $k<=$no_lines; $k++) {

    >
    > Don't use C-style for in Perl unless you need to.
    >
    > for my $k ( 0 .. $no_lines ) {
    >
    > > print "[$k] occurs ";

    >
    > Hang on, $k is the line number (minus one) not the content of the
    > line.
    > I suspect there's more to your original problem than you are telling
    > us.
    >
    > > my $match=0;
    > > my $matched=0;
    > > for (my $h =0; $h<=$no_lines; $h++) {
    > > for (my $j =3; $j<=12; $j++ ) {

    >
    > Where did those 3 and 12 come from. I suspect there's more to your
    > original problem than you are telling us.


    I've got an identifier at the beginning of each line, for example

    1666237 4 10 23 16 and so on. The identifier is an id that links to
    something else. I just want to compare the 10 columns with
    the numbers.

    >
    > > if ($table[$k][$j] == $table[$h][$j]){
    > > $match++;
    > > }
    > > }
    > > if ($match==10) {
    > > $matched++;
    > > }

    >
    > Rather than counting matches and checking you have 10 it would be
    > better to count mismatches and check you have 0. That way if the 12
    > ever had to become 13 you wouldn't have to change 10 to 11.
    >
    > > }

    > print "$matched times\n";
    > > } # end of large loop
    > >
    > > Does anyone know a better, quicker method of doing this?

    >
    > Doing what? You've moved the goal-posts several times.
    >
    > > Many thanks in advance for any suggestions.

    >
    > I suggest that you get clear in your mind what you are asking before
    > you ask it.
    >
    > I also suggest you post to newsgroups that still exist (this one
    > doesn't, see FAQ). Your post will then be seen by many more people.

    BTW where is the FAQ, which says this newsgroup no longer exists?
    Martin Foster, Jan 31, 2004
    #3
  4. Martin Foster wrote:
    > I would like to be able to mimic the unix tool 'uniq' within a Perl
    > script.


    Unfortunately the FAQ entry is worded the opposite way:
    perldoc -q duplicate:
    "How can I remove duplicate elements from a list or array?"

    jue
    Jürgen Exner, Jan 31, 2004
    #4
  5. Martin Foster

    Guest

    (Martin Foster) wrote:

    > wrote:
    >
    > > I shall assume that you really want to count the number of times each
    > > distinct line appears in a file.


    > > perl -ne '$c{$_}++; END { print "$c{$_} $_" for keys %c }'


    > > Or as a script:


    > > $count{$_}++ while <>;


    > This is amazing, I don't understand how it works but it's very
    > powerful.


    If you look in the newsgroup that replaced this one when this one was
    deleted, you'll find every couple of months someone posts a script
    substantially like the one above and says "I found this - how does it
    work?".

    You could look at one of those threads.

    I believe it is also an example that is used in most Perl tutorials.

    > Can I use this script to compare only the n columns of a file, not the
    > entire line?


    No you can't use this _script_. But you can use the technique.

    Rather than keying %count on the whole line you can use some sort of
    string manipulation to extract just part of the line to consider. The
    most normal way to manipulate strings in Perl is the m// and s///
    operators.

    > I've got a identifier for each line at the beginning, for example
    >
    > 1666237 4 10 23 16 and so. The identifier is an id to link to
    > something else and so on. I just want to compare the 10 columns with
    > the numbers.


    Well if, for example, we say the first 3 whitespace-delimited columns
    are the identifier you could remove them thus:

    s/^(\S+\s+){3}// and $count{$_}++ while <>;
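
    The same technique as a self-contained sketch. The helper name
    count_after_id, the number of ID columns (one here) and the second and
    third IDs are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# count_after_id is a hypothetical helper. $skip is the number of leading
# whitespace-delimited columns to drop before counting what is left.
sub count_after_id {
    my ( $skip, @lines ) = @_;
    my %count;
    for my $line (@lines) {
        # Work on a copy; ignore lines the pattern does not match.
        ( my $rest = $line ) =~ s/^(?:\S+\s+){$skip}// or next;
        $count{$rest}++;
    }
    return \%count;
}

# First ID is from the thread; the other two IDs are made up.
my @lines = (
    "1666237 4 10 23 16\n",
    "1666238 4 10 23 16\n",
    "1666239 4 12 24 18\n",
);

my $count = count_after_id( 1, @lines );
print "$count->{$_} $_" for sort keys %$count;
```

    The (?:...) group is non-capturing since nothing is done with the
    removed columns here.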

    > > I also suggest you post to newsgroups that still exist (this one
    > > doesn't, see FAQ). Your post will then be seen by many more people.


    > BTW where is the FAQ, which says this newsgroup no longer exists?


    The Perl FAQ is part of the standard Perl documentation that can be
    found on any computer on which Perl has been installed and also on
    various Perl-related web sites.
    , Jan 31, 2004
    #5
  6. Thanks for your help.

    My script now looks like this:


    #!/usr/bin/perl
    # Perl script to find most common CS
    use strict;
    use warnings;

    my $infile = "/home/martin/DATABASE/large.txt";
    open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
    my %count;

    do {
    $_ =~ s/^(\S+\s+){2}//;
    $count{$_}++
    } while <INFILE>;

    print "$count{$_} $_" for keys %count;
    __END__

    So I'm feeding the file into the %count array by removing the first two
    columns with the identifier information and then counting the keys.
    How can I still keep the identifier part of the line linked to the array?
    Since this is the part which I'm really interested in.
    I can't keep the identifier in
    the %count array, since this would screw up the "for keys" part.

    I checked perldoc -q and found how to remove duplicates but I don't think
    I can rewrite this to do what I want.

    The "for keys" method is brilliant but I'm losing the identifier.

    So I'm back to my original script which looks like this.

    #!/usr/bin/perl
    # Perl script to find most common CS
    use strict;
    use warnings;


    my $infile = "/home/martin/DATABASE/large.txt";
    open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
    my @array = <INFILE>;
    print "There are ", $#array+1, " lines in the large array\n";

    my (@table);
    foreach my $array (@array) {
    push(@table, [split(/\s/, $array) ]);
    }

    for (my $k =0; $k<=$#array; $k++) {
    print "$table[$k][1] $table[$k][2] occurs ";
    my $matched=0;
    for (my $h =0; $h<=$no_lines; $h++) {
    my $match=0;
    for (my $j =2; $j<=11; $j++ ) {
    if ($table[$k][$j] == $table[$h][$j]){
    $match++;
    }
    }
    if ($match==10) {
    $matched++;
    }
    }
    print "$matched times\n";
    } # end of large loop


    But this sad looking script is not very smart and is very slow. I don't
    want to run over each line. I would like the script to search the file and
    identify a sequence as unique. If there are duplicate sequences
    in that file then print out how many and do not revisit that line
    if it has been counted as a duplicate.


    my data file looks like this, a small section only.


    810 141-2_1_2 4 10 21 37 58 83 111 145 184 226
    811 141-2_1_6 4 12 24 42 64 92 124 162 204 252
    812 141-2_1_7 4 11 23 44 67 95 134 168 215 271
    879 141_1_2 4 10 21 37 58 83 111 145 184 226
    880 141_1_6 4 12 24 42 64 92 124 162 204 252
    881 141_1_7 4 11 23 44 67 95 134 168 215 271
    882 152_1_15 4 12 26 44 72 104 138 178 228 282
    883 152_1_23 4 10 21 40 65 96 134 180 230 286
    884 152_1_24 4 10 21 40 65 96 134 180 230 286
    885 152_1_3 4 12 22 40 66 102 128 168 218 268

    Again many thanks for your help. I still don't get why you say
    this newsgroup has been deleted. What is the url for the replacement
    newsgroup?


    wrote in message news:<>...
    > (Martin Foster) wrote:
    >
    > > wrote:
    > >
    > > > I shall assume that you really want to count the number of times each
    > > > distinct line appears in a file.

    >
    > > > perl -ne '$c{$_}++; END { print "$c{$_} $_" for keys %c }'

    >
    > > > Or as a script:

    >
    > > > $count{$_}++ while <>;

    >
    > > This is amazing, I don't understand how it works but it's very
    > > powerful.

    >
    > If you look in the newsgroup that replaced this one when this one was
    > deleted, you'll find every couple of months someone posts a script
    > substantially like the one above and says "I found this - how does it
    > work?".
    >
    > You could look at one of those threads.
    >
    > I believe it is also an example that is used in most Perl tutorials.
    >
    > > Can I use this script to compare only the n columns of a file, not the
    > > entire line?

    >
    > No you can't use this _script_. But you can use the technique.
    >
    > Rather than keying %count on the whole line you can use some sort of
    > string manipulation to extract just part of the line to consider. The
    > most normal way to manipulate strings in Perl is the m// and s///
    > operators.
    >
    > > I've got a identifier for each line at the beginning, for example
    > >
    > > 1666237 4 10 23 16 and so. The identifier is an id to link to
    > > something else and so on. I just want to compare the 10 columns with
    > > the numbers.

    >
    > Well if, for example, we say the first 3 whitespace-delimited columns
    > are the identifier you could remove them thus:
    >
    > s/^(\S+\s+){3}// and $count{$_}++ while <>;
    >
    > > > I also suggest you post to newsgroups that still exist (this one
    > > > doesn't, see FAQ). Your post will then be seen by many more people.

    >
    > > BTW where is the FAQ, which says this newsgroup no longer exists?

    >
    > The Perl FAQ is part of the standard Perl documentation that can be
    > found on any computer on which Perl has been installed and also on
    > various Perl-related web sites.
    Martin Foster, Feb 2, 2004
    #6
  7. Martin Foster

    Guest

    (Martin Foster) spits TOFU in my face:

    > Thanks for your help.


    Please, if you want to thank me, learn to quote properly. TOFU ((new)
    Text Over, Full-quote Under) is considered very rude.

    > My script now looks like this:
    >
    >
    > #!/usr/bin/perl
    > # Perl script to find most common CS
    > use strict;
    > use warnings;
    >
    > my $infile = "/home/martin/DATABASE/large.txt";
    > open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
    > my %count;
    >
    > do {
    > $_ =~ s/^(\S+\s+){2}//;
    > $count{$_}++
    > } while <INFILE>;


    Please see perldoc perlsyn for how "do { BLOCK } while EXPR" is
    different from "while (EXPR) { BLOCK }". In this case you want the
    latter.
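
    A minimal sketch of the difference, using an empty list so the effect
    is easy to see. The variable names are made up for the illustration.
    In the script above the same thing means the block runs once before
    any line has been read.

```perl
use strict;
use warnings;

my @queue;    # empty on purpose

# while (EXPR) { BLOCK }: the condition is tested first, so the body
# never runs for an empty list.
my $while_runs = 0;
my @copy       = @queue;
while ( defined( my $item = shift @copy ) ) {
    $while_runs++;
}

# do { BLOCK } while EXPR: the block runs once before the condition is
# tested for the first time.
my $do_runs = 0;
@copy = @queue;
my $item;
do {
    $do_runs++;
} while ( defined( $item = shift @copy ) );

print "while body ran $while_runs times, do block ran $do_runs times\n";
```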

    Saying "$_ =~" i.e. "don't use $_, use $_ instead" is considered
    somewhat affected. Either use $_ (and don't mention it) or use
    something else instead.

    You are assuming the s/// always succeeds. Whenever you assume
    something like this will always succeed you should decorate it with
    "or die". This acts as a comment saying "I'm assuming this always
    succeeds". It also causes the program to crash out rather than carry on
    and do something weird if your assumption was wrong.
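
    As a sketch of that habit, assuming two ID columns as in the script
    above (the helper name strip_id and the malformed line are made up):

```perl
use strict;
use warnings;

# strip_id is a hypothetical helper: remove the first two columns, or die
# with a message that names the offending line.
sub strip_id {
    my ($line) = @_;
    $line =~ s/^(?:\S+\s+){2}//
        or die "Line does not start with two ID columns: $line";
    return $line;
}

print strip_id("810 141-2_1_2 4 10 21\n");

# eval traps the die so the sketch can show the failure case too.
my $ok = eval { strip_id("malformed\n"); 1 };
print "rejected: $@" unless $ok;
```

    s/// returns the number of substitutions made, so a line that does not
    match yields a false value and triggers the die.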

    > So I'm feeding the file into the %count array by removing the first two
    > columns with the identifier information and then counting the keys.
    > How can I still keep the identifier part of the line linked to the array?
    > Since this is the part which I'm really interested in.


    Ah, well you never mentioned that before. It helps to know what you
    want.

    > I can't keep the identifier in
    > the %count array, since this would screw up the "for keys" part.


    You can't keep it in the keys of %count, but you can keep it in the
    values.

    while (<INFILE>) {
    s/^(\S+\s+){2}// or die;
    push @{$count{$_}}, $1;
    };


    > I checked perldoc -q and found how to remove duplicates but I don't think
    > I can rewrite this to do what I want.


    Don't worry I'm sure your programming skill will improve. You appear
    smart but inexperienced. You do, however, seem to have an unfortunate
    streak of defeatism.

    > The "for keys" method is brilliant but I'm losing the identifier.
    >
    > So I'm back to my original script which looks like this.


    Why? I showed you many ways to improve it independent of changing the
    algorithm.

    > #!/usr/bin/perl
    > # Perl script to find most common CS


    I still don't get how this comment relates to what your program does
    nor what you say you want it to do.

    > use strict;
    > use warnings;
    >
    >
    > my $infile = "/home/martin/DATABASE/large.txt";
    > open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
    > my @array = <INFILE>;
    > print "There are ", $#array+1, " lines in the large array\n";
    >
    > my (@table);
    > foreach my $array (@array) {
    > push(@table, [split(/\s/, $array) ]);
    > }
    >
    > for (my $k =0; $k<=$#array; $k++) {
    > print "$table[$k][1] $table[$k][2] occurs ";
    > my $matched=0;
    > for (my $h =0; $h<=$no_lines; $h++) {
    > my $match=0;
    > for (my $j =2; $j<=11; $j++ ) {
    > if ($table[$k][$j] == $table[$h][$j]){
    > $match++;
    > }
    > }
    > if ($match==10) {
    > $matched++;
    > }
    > }
    > print "$matched times\n";
    > } # end of large loop
    >
    >
    > But this sad looking script is not very smart and very slow, I don't want to
    > run over each line. I would like the script to search the file,
    > identify a sequence as unique. If there are duplicate sequences
    > in that file then print out how many and do not revisit that line
    > if it has been counted as a duplicate.


    It's not clear what you are saying.

    Are you saying you want the first ID (only) and the number of
    occurrences of each distinct sequence?

    while (<INFILE>) {
    s/^(\S+\s+){2}// or die;
    push @{$count{$_}}, $1;
    };

    for ( values %count ) {
    print "$_->[0]occurs ",scalar(@$_)," times\n";
    }
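
    The two loops above, wrapped so they can be tried on a few lines of the
    posted data without a file. The helper name first_id_and_count is an
    assumption; the regex is the one above, unchanged.

```perl
use strict;
use warnings;

# first_id_and_count is a hypothetical helper wrapping the two loops above.
# It returns one "ID occurs N times" string per distinct sequence.
sub first_id_and_count {
    my @lines = @_;
    my %count;
    for (@lines) {
        my $line = $_;    # copy so the caller's data is left untouched
        $line =~ s/^(\S+\s+){2}// or die "Unexpected line: $_";
        push @{ $count{$line} }, $1;    # $1: what the group's last repetition matched
    }
    return map { "$_->[0]occurs " . scalar(@$_) . " times" } values %count;
}

# Three lines from the data file shown in the question.
my @lines = (
    "810 141-2_1_2 4 10 21 37 58 83 111 145 184 226\n",
    "811 141-2_1_6 4 12 24 42 64 92 124 162 204 252\n",
    "879 141_1_2 4 10 21 37 58 83 111 145 184 226\n",
);

my @report = first_id_and_count(@lines);
print "$_\n" for sort { $a cmp $b } @report;
```

    The result is put in an array before sorting because "sort NAME(...)"
    would treat NAME as the comparison routine.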

    > I still don't get why you say this newsgroup has been deleted.


    I say it because it is true, and because it will help people who
    didn't know this to reach a larger audience.

    > What is the url for the replacement newsgroup?


    What part of the answer to the Perl FAQ: "What are the Perl newsgroups
    on Usenet?" are you having trouble understanding?
    , Feb 3, 2004
    #7
  8. wrote in message news:<>...
    > (Martin Foster) spits TOFU in my face:
    >
    > > Thanks for your help.

    >
    > Please, if you want to thank me, learn to quote properly. TOFU ((new)
    > Text Over, Full-quote Under) is considered very rude.
    >

    I see.

    > > My script now looks like this:
    > >
    > >
    > > #!/usr/bin/perl
    > > # Perl script to find most common CS
    > > use strict;
    > > use warnings;
    > >
    > > my $infile = "/home/martin/DATABASE/large.txt";
    > > open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
    > > my %count;
    > >
    > > do {
    > > $_ =~ s/^(\S+\s+){2}//;
    > > $count{$_}++
    > > } while <INFILE>;

    >
    > Please see perldoc perlsyn for how "do { BLOCK } while EXPR" is
    > different from "while (EXPR) { BLOCK }". In this case you want the
    > latter.
    >
    > Saying "$_ =~" i.e. "don't use $_, use $_ instead" is considered
    > somewhat affected. Either use $_ (and don't mention it) or use
    > something else instead.
    >
    > You are assuming the s/// always succeeds. Whenever you assume
    > something like this will always succeed you should decorate it with
    > "or die". This acts as a comment saying "I'm assuming this always
    > succeeds". It also causes the program to crash out rather than carry on
    > and do something weird if your assumption was wrong.
    >

    This is a good tip. I'll use this from now on.
    > > So I'm feeding the file into the %count array by removing the first two
    > > columns with the identifier information and then counting the keys.
    > > How can I still keep the identifier part of the line linked to the array?
    > > Since this is the part which I'm really interested in.

    >
    > Ah, well you never mentioned that before. It helps to know what you
    > want.
    >
    > > I can't keep the identifier in
    > > the %count array, since this would screw up the "for keys" part.

    >
    > You can't keep it in the keys of %count, but you can keep it in the
    > values.
    >
    > while (<INFILE>) {
    > s/^(\S+\s+){2}// or die;
    > push @{$count{$_}}, $1;
    > };
    >
    >
    > > I checked perldoc -q and found how to remove duplicates but I don't think
    > > I can rewrite this to do what I want.

    >
    > Don't worry I'm sure your programming skill will improve. You appear
    > smart but inexperienced. You do, however, seem to have an unfortunate
    > streak of defeatism.
    >
    > > The "for keys" method is brilliant but I'm losing the identifier.
    > >
    > > So I'm back to my original script which looks like this.

    >
    > Why? I showed you many ways to improve it independent of changing the
    > algorithm.
    >
    > > #!/usr/bin/perl
    > > # Perl script to find most common CS

    >
    > I still don't get how this comment relates to what your program does
    > nor what you say you want it to do.

    The data list is a sequence of numbers, which are called coordination
    sequences, CS for short. My program tries to find the most common CS
    in the data file.
    >
    > > use strict;
    > > use warnings;
    > >
    > >
    > > my $infile = "/home/martin/DATABASE/large.txt";
    > > open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
    > > my @array = <INFILE>;
    > > print "There are ", $#array+1, " lines in the large array\n";
    > >
    > > my (@table);
    > > foreach my $array (@array) {
    > > push(@table, [split(/\s/, $array) ]);
    > > }
    > >
    > > for (my $k =0; $k<=$#array; $k++) {
    > > print "$table[$k][1] $table[$k][2] occurs ";
    > > my $matched=0;
    > > for (my $h =0; $h<=$no_lines; $h++) {
    > > my $match=0;
    > > for (my $j =2; $j<=11; $j++ ) {
    > > if ($table[$k][$j] == $table[$h][$j]){
    > > $match++;
    > > }
    > > }
    > > if ($match==10) {
    > > $matched++;
    > > }
    > > }

    > print "$matched times\n";
    > > } # end of large loop
    > >
    > >
    > > But this sad looking script is not very smart and very slow, I don't want to
    > > run over each line. I would like the script to search the file,
    > > identify a sequence as unique. If there are duplicate sequences
    > > in that file then print out how many and do not revisit that line
    > > if it has been counted as a duplicate.

    >
    > It's not clear what you are saying.
    >

    There is a list of number sequences. Each list is labelled uniquely
    by an identifier. I want to sort through the list, so I start at the
    1st row and then my code loops through the list checking the
    sequences. If it finds a match, then that row does not need to be
    revisited again later in the loop, since it has been identified as a
    match to the 1st row. I guess I need to keep an index of some sort
    while looping the list. Then when I start at the 2nd row, I only loop
    over the sequences which are indexed as 'not yet matched'.
    I hope this makes more sense.


    > Are you saying you want the first ID (only) and the number of
    > occurrences of each distinct sequence?

    Yes. This is very helpful. '$_->[0]' looks like a pointer. So your
    piece of code maps the $1 column of the original line as a pointer to
    the values of the %count array. Then the "values" of %count are the
    unique "keys" of that array and "scalar" is counting the number of
    lines that are the same. Is that right?
    I'm trying to understand what your code does, since I want to use it.
    Perl is great, but it's so difficult to read if you don't have a clue.

    >
    > while (<INFILE>) {
    > s/^(\S+\s+){2}// or die;
    > push @{$count{$_}}, $1;
    > };
    >
    > for ( values %count ) {
    > print "$_->[0]occurs ",scalar(@$_)," times\n";
    > }
    >
    > > I still don't get why you say this newsgroup has been deleted.

    >
    > I say it because it is true, and because it will help people who
    > didn't know this to reach a larger audience.
    >
    > > What is the url for the replacement newsgroup?

    >
    > What part of the answer to the Perl FAQ: "What are the Perl newsgroups
    > on Usenet?" are you having trouble understanding?
    Martin Foster, Feb 5, 2004
    #8
  9. Martin Foster

    Guest

    (Martin Foster) wrote in message news:<>...
    > wrote in message news:<>...
    > > (Martin Foster) spits TOFU in my face:
    > >
    > > > # Perl script to find most common CS

    > >
    > > I still don't get how this comment relates to what your program does
    > > nor what you say you want it to do.

    >
    > The data list is a sequence of numbers, which are called coordination
    > sequences, CS for short. My program tries to find the most common CS
    > in the data file.


    I still don't see anything in your program that relates to finding the
    most common CS. It looks to me like your program is printing out the
    number of occurrences of each CS.

    > > > I would like the script to search the file,
    > > > identify a sequence as unique. If there are duplicate sequences
    > > > in that file then print out how many and do not revisit that line
    > > > if it has been counted as a duplicate.

    > >
    > > It's not clear what you are saying.

    >
    > There is a list of number sequences. Each list is labelled uniquely
    > by an identifier. I want to sort through the list, so I starting at the
    > 1st row and then my code loops through the list checking the
    > sequences. If it finds a match, then that row does not need to be
    > revisited again later in the loop, since it has been identified as a
    > match to the 1st row. I guess I need to keep
    > an index of some sort while looping the list. Then when I start at
    > the 2nd row, I only loop over the sequences which are indexed as 'not
    > yet matched'.


    I think you are mixing up your definition of the problem you are
    trying to solve with the implementation of a partial solution.

    > I hope this makes more sense.


    Not much.

    > > Are you saying you want the first ID (only) and the number of
    > > occurrences of each distinct sequence?

    >
    > Yes. This is very helpful.


    Right. So what you want is one output line for each distinct CS,
    in no particular order. You don't want to find the CS that appears
    most often.

    If you wanted the output sorted in order of frequency you would have
    to put a sort in there somewhere.
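
    One possible place for that sort, as a sketch. It assumes a hash that
    maps each CS to a plain count; the helper name by_frequency and the
    counts below are illustrative, not results from the real data file.

```perl
use strict;
use warnings;

# by_frequency is a hypothetical helper. It takes a hash reference of
# sequence => count and returns the sequences, most frequent first.
# Ties are broken by string order so the result is repeatable.
sub by_frequency {
    my ($count) = @_;
    return sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
}

# Illustrative counts only.
my %count = (
    "4 10 21 37 58 83 111 145 184 226"  => 2,
    "4 12 26 44 72 104 138 178 228 282" => 1,
    "4 10 21 40 65 96 134 180 230 286"  => 3,
);

print "$count{$_} $_\n" for by_frequency( \%count );
```

    With a hash of ID arrays instead of counts, the comparison would use
    the array lengths, for example scalar @{ $h->{$b} }.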

    > >
    > > while (<INFILE>) {
    > > s/^(\S+\s+){2}// or die;
    > > push @{$count{$_}}, $1;
    > > };
    > >
    > > for ( values %count ) {
    > > print "$_->[0]occurs ",scalar(@$_)," times\n";
    > > }


    > '$_->[0]' looks like a pointer.


    This is no accident. The values of %count are references (pointers)
    to arrays of IDs.

    > So your piece of code, maps the $1 column of the original
    > line as a pointer to the values of the %count array.


    $1 in Perl is not like it is in awk.

    In Perl $1 is whatever was captured by the first () capture in the
    most recent regex in the current scope.

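    So in this case $1 comes from the start of the original line. I
    believe, from what you've said previously, that this is some sort of
    ID (identifier) and is not part of the CS. Note that because the
    group is repeated with {2}, $1 holds only what the last repetition
    matched: the second column and the whitespace that follows it.

    Actually you probably should capture both ID columns in one group and
    throw away the whitespace between the ID and the CS.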

    s/^(\S+\s+\S+)\s+// or die;
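
    A quick check of what each pattern captures, on one line of the posted
    data. The variable names are made up for the sketch.

```perl
use strict;
use warnings;

my $line = "810 141-2_1_2 4 10 21 37 58 83 111 145 184 226";

# A repeated group keeps only what its last repetition matched.
my ($repeated) = $line =~ /^(\S+\s+){2}/;

# Spelling both columns out inside one group captures them together and
# leaves the separating whitespace outside the capture.
my ($both) = $line =~ /^(\S+\s+\S+)\s+/;

print "repeated group: [$repeated]\n";
print "single group:   [$both]\n";
```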

    Also if you want to improve readability you could avoid $_ and $1 and
    also rename %count to something more appropriate to its new role:

    my ( $id, $cs ) = /^(\S+\s+\S+)\s+(.*)/ or die;
    push @{$ids_by_cs{$cs}}, $id;

    > Then the "values" of
    > %count are the unique "keys" of that array and "scalar" is counting
    > the number of lines that are the same. Is that right?


    There is nothing for "that array" to refer to in the previous
    sentence.

    The values of the hash %count (or %ids_by_cs) are (a list of) pointers
    to arrays. Each array contains the series of IDs that correspond to a
    single CS. The keys of the hash are the distinct CSs themselves.
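
    That structure can be sketched end to end as follows. The helper name
    group_ids_by_cs is an assumption; the three lines come from the posted
    data file.

```perl
use strict;
use warnings;

# group_ids_by_cs is a hypothetical helper built on the pattern above.
# It returns a hash reference: CS string => reference to an array of IDs.
sub group_ids_by_cs {
    my @lines = @_;
    my %ids_by_cs;
    for my $line (@lines) {
        my ( $id, $cs ) = $line =~ /^(\S+\s+\S+)\s+(.*)/
            or die "Unexpected line: $line";
        push @{ $ids_by_cs{$cs} }, $id;    # the array springs into being on first use
    }
    return \%ids_by_cs;
}

my @lines = (
    "810 141-2_1_2 4 10 21 37 58 83 111 145 184 226\n",
    "811 141-2_1_6 4 12 24 42 64 92 124 162 204 252\n",
    "879 141_1_2 4 10 21 37 58 83 111 145 184 226\n",
);

my $ids_by_cs = group_ids_by_cs(@lines);
for my $cs ( sort keys %$ids_by_cs ) {
    my @ids = @{ $ids_by_cs->{$cs} };
    print scalar(@ids), " x $cs: ", join( ", ", @ids ), "\n";
}
```

    Since "." does not match a newline, the CS keys here carry no trailing
    newline, unlike the keys produced by the s/// version.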

    As to the uniqueness of the IDs, there is nothing in the program that
    either ensures or cares that the IDs in the input data are unique.

    > "scalar" is counting the number of lines that are the same.


    scalar is counting the number of elements in the array of IDs that
    correspond to a single CS. So, yes, in effect this counts the number
    of lines that were the same.

    > Perl is great, but it so difficult to read if you don't have a clue.


    Oh, you noticed that, did you? :)
    , Feb 5, 2004
    #9
  10. (Martin Foster) wrote in message news:<>...
    > Hi.
    >
    > I would like to be able to mimic the unix tool 'uniq' within a Perl script.


    I think you were not asking for uniq per se, so much as "uniq -c"
    specifically.

    Here's a simple stab.

    Note that, like "uniq -c", this requires the data to be sorted.
    Sorting lines in the file is left as an exercise for the reader.

    while(<>) {
    if (defined($prev) && $_ ne $prev) {
    printf "%7d %s", $n, $prev;
    $n = 0;
    }
    } continue {
    $prev = $_;
    $n++;
    }
    printf "%7d %s", $n, $prev if defined $prev;

    If you actually want to do both the sorting and the unique line
    counting at the same time, you need to keep everything in memory
    (possibly quite expensive, and this is why uniq doesn't do that). Try
    this code in that case:

    while(<>) {
    $lines{$_}++;
    }
    foreach $line (sort keys %lines) {
    printf "%7d %s", $lines{$line}, $line;
    }


    All of this is typed in from my head, so make sure to check my syntax,
    etc before using.
    Aaron Sherman, Feb 5, 2004
    #10