One liner to remove duplicate records

Discussion in 'Perl Misc' started by Ninja Li, Apr 30, 2010.

  1. Ninja Li

    Ninja Li Guest

    Hi,

    I have a file with the following sample data delimited by "|" with
    duplicate records:

    20100430|20100429|John Smith|-0.07|-0.08|
    20100430|20100429|John Smith|-0.07|-0.08|
    20100430|20100429|Ashley Cole|1.09|1.08|
    20100430|20100429|Bill Thompson|0.76|0.78|
    20100429|20100428|Time Apache|2.10|2.24|

    The first three fields "date_1", "date_2" and "name" are unique
    identifiers of a record.

    Is there a simple way, like a one liner to remove the duplicates such
    as with "John Smith"?

    Thanks in advance.

    Nick Li
     
    Ninja Li, Apr 30, 2010
    #1

  2. Ninja Li

    Guest

    On Fri, 30 Apr 2010 08:55:12 -0700 (PDT), Ninja Li <> wrote:

    >Hi,
    >
    >I have a file with the following sample data delimited by "|" with
    >duplicate records:
    >
    >20100430|20100429|John Smith|-0.07|-0.08|
    >20100430|20100429|John Smith|-0.07|-0.08|
    >20100430|20100429|Ashley Cole|1.09|1.08|
    >20100430|20100429|Bill Thompson|0.76|0.78|
    >20100429|20100428|Time Apache|2.10|2.24|
    >
    >The first three fields "date_1", "date_2" and "name" are unique
    >identifiers of a record.
    >
    >Is there a simple way, like a one liner to remove the duplicates such
    >as with "John Smith"?
    >
    >Thanks in advance.
    >
    >Nick Li


    I could think of a way, but it takes 2 lines, sorry.
    -sln
     
    , Apr 30, 2010
    #2

  3. Ninja Li

    John Bokma Guest

    Ninja Li <> writes:

    > Hi,
    >
    > I have a file with the following sample data delimited by "|" with
    > duplicate records:
    >
    > 20100430|20100429|John Smith|-0.07|-0.08|
    > 20100430|20100429|John Smith|-0.07|-0.08|
    > 20100430|20100429|Ashley Cole|1.09|1.08|
    > 20100430|20100429|Bill Thompson|0.76|0.78|
    > 20100429|20100428|Time Apache|2.10|2.24|
    >
    > The first three fields "date_1", "date_2" and "name" are unique
    > identifiers of a record.
    >
    > Is there a simple way, like a one liner to remove the duplicates such
    > as with "John Smith"?


    Yes.

    But have you tried to write a multi-line Perl program first? Moving from
    a working Perl program to a one-liner might be easier than starting
    straight with the one-liner.

    Also read up on what the various options of perl do.

    --
    John Bokma j3b

    Hacking & Hiking in Mexico - http://johnbokma.com/
    http://castleamber.com/ - Perl & Python Development
     
    John Bokma, Apr 30, 2010
    #3
  4. Ninja Li

    Guest

    On Fri, 30 Apr 2010 09:06:52 -0700, wrote:

    >On Fri, 30 Apr 2010 08:55:12 -0700 (PDT), Ninja Li <> wrote:
    >
    >>Hi,
    >>
    >>I have a file with the following sample data delimited by "|" with
    >>duplicate records:
    >>
    >>20100430|20100429|John Smith|-0.07|-0.08|
    >>20100430|20100429|John Smith|-0.07|-0.08|
    >>20100430|20100429|Ashley Cole|1.09|1.08|
    >>20100430|20100429|Bill Thompson|0.76|0.78|
    >>20100429|20100428|Time Apache|2.10|2.24|
    >>
    >>The first three fields "date_1", "date_2" and "name" are unique
    >>identifiers of a record.
    >>
    >>Is there a simple way, like a one liner to remove the duplicates such
    >>as with "John Smith"?
    >>
    >>Thanks in advance.
    >>
    >>Nick Li

    >
    >I could think of a way, but it takes 2 lines, sorry.


    Wait, this might work.

    c:\temp>perl -a -F"\|" -n -e "/^$/ and next or !exists $hash{$key = join '',@F[0..2]} and ++$hash{$key} and print" file.txt
    20100430|20100429|John Smith|-0.07|-0.08|
    20100430|20100429|Ashley Cole|1.09|1.08|
    20100430|20100429|Bill Thompson|0.76|0.78|
    20100429|20100428|Time Apache|2.10|2.24|

    c:\temp>

    -sln
     
    , Apr 30, 2010
    #4
  5. Ninja Li

    Dr.Ruud Guest

    Ninja Li wrote:

    > I have a file with the following sample data delimited by "|" with
    > duplicate records:
    >
    > 20100430|20100429|John Smith|-0.07|-0.08|
    > 20100430|20100429|John Smith|-0.07|-0.08|
    > 20100430|20100429|Ashley Cole|1.09|1.08|
    > 20100430|20100429|Bill Thompson|0.76|0.78|
    > 20100429|20100428|Time Apache|2.10|2.24|
    >
    > The first three fields "date_1", "date_2" and "name" are unique
    > identifiers of a record.
    >
    > Is there a simple way, like a one liner to remove the duplicates such
    > as with "John Smith"?


    If the data is as strict as presented, you can use

    sort -u <input

    sort <input |uniq

    or simply use the whole line as a hash key:

    perl -wne'$_{$_}++ or print' <input

    (the first underscore is not really necessary)

    --
    Ruud
     
    Dr.Ruud, Apr 30, 2010
    #5
  6. Ninja Li <> wrote:
    >I have a file with the following sample data delimited by "|" with
    >duplicate records:
    >
    >20100430|20100429|John Smith|-0.07|-0.08|
    >20100430|20100429|John Smith|-0.07|-0.08|
    >20100430|20100429|Ashley Cole|1.09|1.08|
    >20100430|20100429|Bill Thompson|0.76|0.78|
    >20100429|20100428|Time Apache|2.10|2.24|
    >
    >The first three fields "date_1", "date_2" and "name" are unique
    >identifiers of a record.
    >
    >Is there a simple way, like a one liner to remove the duplicates such
    >as with "John Smith"?


    Your duplicate lines are already adjacent (the data is sorted), so a simple call to 'uniq' will do the job:
    http://en.wikipedia.org/wiki/Uniq

    jue
     
    Jürgen Exner, May 1, 2010
    #6
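One caveat worth noting about this approach: uniq only collapses adjacent duplicate lines, which happens to be enough for the sample data. A quick sketch (data.txt is a placeholder name):

```shell
# uniq collapses *adjacent* duplicate lines only; that suffices here
# because the John Smith duplicates sit next to each other.
# data.txt is a placeholder name.
printf '%s\n' \
  '20100430|20100429|John Smith|-0.07|-0.08|' \
  '20100430|20100429|John Smith|-0.07|-0.08|' \
  '20100430|20100429|Ashley Cole|1.09|1.08|' \
  '20100429|20100428|Time Apache|2.10|2.24|' > data.txt

uniq data.txt
# prints three lines: John Smith appears once
```

If the duplicates were scattered, you would sort first: `sort data.txt | uniq` (or just `sort -u data.txt`), at the cost of losing the original order.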
  7. Ninja Li

    Guest

    On Fri, 30 Apr 2010 08:55:12 -0700 (PDT), Ninja Li <> wrote:

    >Hi,
    >
    >I have a file with the following sample data delimited by "|" with
    >duplicate records:
    >
    >20100430|20100429|John Smith|-0.07|-0.08|
    >20100430|20100429|John Smith|-0.07|-0.08|
    >20100430|20100429|Ashley Cole|1.09|1.08|
    >20100430|20100429|Bill Thompson|0.76|0.78|
    >20100429|20100428|Time Apache|2.10|2.24|
    >
    >The first three fields "date_1", "date_2" and "name" are unique
    >identifiers of a record.
    >
    >Is there a simple way, like a one liner to remove the duplicates such
    >as with "John Smith"?
    >
    >Thanks in advance.
    >
    >Nick Li


    Another way:

    perl -anF"\|" -e "tr/|// > 1 and ++$seen{qq<@F[0..2]>} > 1 and next or print" file.txt

    -sln
     
    , May 4, 2010
    #7