One-liner to remove duplicate records

Ninja Li

Hi,

I have a file with the following sample data delimited by "|" with
duplicate records:

20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

The first three fields "date_1", "date_2" and "name" are unique
identifiers of a record.

Is there a simple way, like a one-liner, to remove the duplicates, such as
the "John Smith" record?

Thanks in advance.

Nick Li
 
sln

Ninja Li said:
Hi,

I have a file with the following sample data delimited by "|" with
duplicate records:

20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

The first three fields "date_1", "date_2" and "name" are unique
identifiers of a record.

Is there a simple way, like a one-liner, to remove the duplicates, such as
the "John Smith" record?

Thanks in advance.

Nick Li

I could think of a way, but it takes 2 lines, sorry.
-sln
 
John Bokma

Ninja Li said:
Hi,

I have a file with the following sample data delimited by "|" with
duplicate records:

20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

The first three fields "date_1", "date_2" and "name" are unique
identifiers of a record.

Is there a simple way, like a one-liner, to remove the duplicates, such as
the "John Smith" record?

Yes.

But have you tried to write a multi-line Perl program first? Moving from
a working Perl program to a one-liner might be easier than starting
straight with the one-liner.

Also read up on what the various options of perl do.
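For what it's worth, a working multi-line version might look something like
this (only a sketch, assuming "|" as the delimiter and the first three fields
as the key, with the input file given on the command line):

#!/usr/bin/perl
use strict;
use warnings;

my %seen;                               # keys we have already printed
while ( my $line = <> ) {
    next if $line =~ /^\s*$/;           # skip blank lines
    my @fields = split /\|/, $line;
    my $key = join '|', @fields[0..2];  # date_1, date_2, name
    print $line unless $seen{$key}++;   # print only the first occurrence
}

Save it as, say, dedup.pl (a made-up name), run it as perl dedup.pl file.txt,
and then shrink it step by step into a one-liner.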
 
sln

I could think of a way, but it takes 2 lines, sorry.

Wait, this might work.

c:\temp>perl -a -F"\|" -n -e "/^$/ and next or !exists $hash{$key = join '',@F[0..2]} and ++$hash{$key} and print" file.txt
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

c:\temp>

-sln
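
One small point about that key: join '' glues the three fields together with
nothing in between, so in principle two different field combinations could
produce the same key. Joining with a separator avoids that. A shorter variant
along the same lines (an untested sketch, same assumptions about the input)
would be:

perl -a -F"\|" -ne "print unless /^$/ or $seen{join '|', @F[0..2]}++" file.txt

Here $seen{...}++ is false the first time a key is seen and true afterwards,
so only the first record for each date_1/date_2/name combination is printed.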
 
Dr.Ruud

Ninja said:
I have a file with the following sample data delimited by "|" with
duplicate records:

20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

The first three fields "date_1", "date_2" and "name" are unique
identifiers of a record.

Is there a simple way, like a one-liner, to remove the duplicates, such as
the "John Smith" record?

If the data is as strict as presented, you can use

sort -u <input

or

sort <input | uniq

or simply use the whole line as a hash key:

perl -wne'$_{$_}++ or print' <input

(the first underscore, i.e. the hash name %_, is not really necessary; any
hash name will do)
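
If the trailing numeric fields could ever differ between otherwise-duplicate
records, the same hash trick can key on just the first three fields instead
of the whole line (a sketch under that assumption, using %seen rather than %_
to show the hash name really is arbitrary):

perl -wne'print unless $seen{ join "|", (split /\|/)[0..2] }++' <input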
 
Jürgen Exner

Ninja Li said:
I have a file with the following sample data delimited by "|" with
duplicate records:

20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

The first three fields "date_1", "date_2" and "name" are unique
identifiers of a record.

Is there a simple way, like a one-liner, to remove the duplicates, such as
the "John Smith" record?

Your duplicate records are already adjacent, so a simple call to 'uniq' will do the job:
http://en.wikipedia.org/wiki/Uniq

jue
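
One thing to keep in mind: uniq only collapses duplicate lines that sit next
to each other, which happens to be exactly the situation in the sample data.
A rough Perl equivalent of that adjacent-only behaviour (illustrative sketch,
keyed on whole lines) is:

perl -ne'print unless defined $last and $_ eq $last; $last = $_' file.txt

The hash-based one-liners elsewhere in this thread also catch duplicates that
are separated by other records.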
 
sln

Ninja Li said:
Hi,

I have a file with the following sample data delimited by "|" with
duplicate records:

20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

The first three fields "date_1", "date_2" and "name" are unique
identifiers of a record.

Is there a simple way, like a one-liner, to remove the duplicates, such as
the "John Smith" record?

Thanks in advance.

Nick Li

Another way:

perl -anF"\|" -e "tr/|// > 1 and ++$seen{qq<@F[0..2]>} > 1 and next or print" file.txt

-sln
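
The tr/|// > 1 test is what sets this one apart from the earlier version:
lines with at most one pipe (blank lines, or anything that is not a record)
are passed through untouched instead of being dropped. Written out longhand,
my reading of it (a rough equivalent, not sln's exact code) is:

while (<>) {
    if ( tr/|// > 1 ) {                     # a record: it contains pipes
        my @F = split /\|/;
        next if ++$seen{"@F[0..2]"} > 1;    # this key was already printed
    }
    print;                                  # first occurrences and non-records
}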
 
