extracting duplicates from CSV file by specific fields

VP

Hi,
I have a csv file:

'aaa.111', 'T100', 'pn123', 'sn111'
'aaa.111', 'T200', 'pn123', 'sn222'
'bbb.333', 'T300', 'pn123', 'sn333'
'ccc.444', 'T400', 'pn123', 'sn444'
'ddd', 'T500', 'pn123', 'sn555'
'eee.666', 'T600', 'pn123', 'sn444'
'fff.777', 'T700', 'pn123', 'sn777'

How can I extract the duplicates, checking each row by field 1 and field 4?

I should get something like this:

'aaa.111', 'T100', 'pn123', 'sn111'
'bbb.333', 'T300', 'pn123', 'sn333'
'ccc.444', 'T400', 'pn123', 'sn444'
'ddd', 'T500', 'pn123', 'sn555'
'fff.777', 'T700', 'pn123', 'sn777'

and

'aaa.111', 'T200', 'pn123', 'sn222'
'eee.666', 'T600', 'pn123', 'sn444'

Any help will be extremely appreciated.
 
MRAB

Use the csv module, and when you're reading build a set of the values
you've already seen in field 1 and a set of the values you've already
seen in field 4 so you can check whether you've seen a row before.
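The suggestion above could be sketched roughly as follows. This is only a minimal illustration, not necessarily what MRAB had in mind: the helper name `split_duplicates` is made up for the example, the sample rows are read from an in-memory string instead of a real file, and the `quotechar="'"` and `skipinitialspace=True` settings are assumptions chosen to match the single-quoted, comma-plus-space layout shown in the question.

```python
import csv
import io

# Sample rows copied from the question; an in-memory buffer stands in for the file.
SAMPLE = """\
'aaa.111', 'T100', 'pn123', 'sn111'
'aaa.111', 'T200', 'pn123', 'sn222'
'bbb.333', 'T300', 'pn123', 'sn333'
'ccc.444', 'T400', 'pn123', 'sn444'
'ddd', 'T500', 'pn123', 'sn555'
'eee.666', 'T600', 'pn123', 'sn444'
'fff.777', 'T700', 'pn123', 'sn777'
"""


def split_duplicates(lines, first=0, second=3):
    """Hypothetical helper: return (uniques, duplicates) keyed on two columns."""
    seen_first = set()   # values already seen in the first key column
    seen_second = set()  # values already seen in the second key column
    uniques, duplicates = [], []

    # Assumed dialect settings for the single-quoted sample data.
    reader = csv.reader(lines, quotechar="'", skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if row[first] in seen_first or row[second] in seen_second:
            duplicates.append(row)
        else:
            seen_first.add(row[first])
            seen_second.add(row[second])
            uniques.append(row)
    return uniques, duplicates


uniques, duplicates = split_duplicates(io.StringIO(SAMPLE))
for row in duplicates:
    print(row)
```

Note that a row counts as a duplicate if either key column repeats, and the values of a duplicate row are not added to the sets, so only the first-seen rows define what counts as "already seen". For a real file, passing an object opened with `open(path, newline="")` in place of the `io.StringIO` buffer should work the same way.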
 
VP

Thanks guys!
Tested, and it seems to be working.

CSV file:
---------
"a.a","sn-01"
"b.b","sn-02"
"c.c","sn-03"
"d.d","sn-04"
"e.e","sn-05"
"f.f","sn-06"
"g.g","sn-07"
"h.h","sn-08"
"i.i","sn-09"
"a.a","sn-10"
"k.k","sn-02"
"i.i","sn-09"


Source:
---------
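#!/usr/bin/env python3
import csv

unqs = []  # rows whose field 0 and field 1 values are both new
dups = []  # rows repeating an already-seen value in either field

seen_in_field0 = set()
seen_in_field1 = set()

print("\nOriginals:\n")

# newline="" lets the csv module handle line endings itself
with open("myfile.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)

        if row[0] in seen_in_field0 or row[1] in seen_in_field1:
            dups.append(row)
        else:
            # remember both key values so later rows can be checked against them
            seen_in_field0.add(row[0])
            seen_in_field1.add(row[1])
            unqs.append(row)

print("\nUniques:\n")

for row in unqs:
    print(row)

print("\nDuplicates:\n")

for row in dups:
    print(row)

print("\n")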



Result:
---------

Originals:

['a.a', 'sn-01']
['b.b', 'sn-02']
['c.c', 'sn-03']
['d.d', 'sn-04']
['e.e', 'sn-05']
['f.f', 'sn-06']
['g.g', 'sn-07']
['h.h', 'sn-08']
['i.i', 'sn-09']
['a.a', 'sn-10']
['k.k', 'sn-02']
['i.i', 'sn-09']

Uniques:

['a.a', 'sn-01']
['b.b', 'sn-02']
['c.c', 'sn-03']
['d.d', 'sn-04']
['e.e', 'sn-05']
['f.f', 'sn-06']
['g.g', 'sn-07']
['h.h', 'sn-08']
['i.i', 'sn-09']

Duplicates:

['a.a', 'sn-10']
['k.k', 'sn-02']
['i.i', 'sn-09']
 
