Delete duplicate rows in a text file, except rows containing "{" or "}"

Joon Ki Choi

Hello Pythonistas,

I have a very large text file with contents like:

@INBOOK{Ackermann1999-b,
author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann},
year = {1980},
timestamp = {1995-12-02}
}

I want to delete the duplicate rows, except for rows that contain a brace, { or }.
The result should look like:

@INBOOK{Ackermann1999-b,
author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann},
year = {1980},
timestamp = {1995-12-02}
}

I came across this Python script:

lines_seen = set() # holds lines already seen
outfile = open("literatur_clean.txt", "w")
for line in open("literatur_dupl.txt", "r"):
    if line not in lines_seen: # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()

But it also deletes the repeated lines with a closing brace } and the lines with the same author data.
Therefore I need a condition for the braces.

Could someone point me to how to add this condition?

Thanks in advance,
Joon
 
Mark Lawrence

Joon said:
lines_seen = set() # holds lines already seen
outfile = open("literatur_clean.txt", "w")

Slight aside, you could use this so there's no need to explicitly close
the input file.

with open("literatur_dupl.txt", "r") as infile:
    for line in infile:
        if line not in lines_seen: # not a duplicate
            outfile.write(line)
            lines_seen.add(line)

For the brace condition, something like:

if "{" in line or "}" in line or line not in lines_seen:
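
Put together, the loop might look like the sketch below. The function name dedupe_lines is made up for illustration and is not part of the original script; it takes any iterable of lines, so it can be tried on a small list before running it on the real file.

```python
def dedupe_lines(lines):
    """Yield lines, dropping repeats unless the line contains a brace.

    dedupe_lines is a hypothetical helper name, not from the original script.
    """
    lines_seen = set()  # holds lines already seen
    for line in lines:
        # Lines with a brace are always kept; other lines only the first time.
        if "{" in line or "}" in line or line not in lines_seen:
            yield line
            lines_seen.add(line)


if __name__ == "__main__":
    sample = ["a = {1},\n", "x\n", "x\n", "}\n", "}\n"]
    print(list(dedupe_lines(sample)))
```

With open files this could be used as outfile.writelines(dedupe_lines(infile)), which still reads the input one line at a time.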
 
Peter Otten

Not what you asked for, but here is something that is quick-and-dirty, too,
but tries a bit harder:

import re

def unique(match):
    names = match.group()[1:-1].split(",")
    parts = set(" ".join(author.split()) for author in names)
    return "{%s}" % ", ".join(parts)

if __name__ == "__main__":
    with open("literatur_dupl.txt") as f:
        data = f.read()
    data = re.compile("{[^{}]*}", re.DOTALL).sub(unique, data)

    with open("literatur_clean.txt", "w") as f:
        f.write(data)

I'm assuming that "very large" means that the file contents still
comfortably fit into your computer's memory...
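
A set has no defined order, and splitting on commas cuts "Ackermann, K.-F." into two pieces, so the names can come out shuffled. If the order matters, a variant along these lines may be closer to what is wanted. It is only a sketch: it assumes the author field looks like author = {A and B and ...} with no nested braces, as in the sample entry, and the names AUTHOR_FIELD, unique_authors and clean are made up for illustration.

```python
import re

# Assumption: author = {A and B and ...} with no nested braces,
# as in the sample entry. [^{}]* also matches newlines.
AUTHOR_FIELD = re.compile(r"(author\s*=\s*)\{([^{}]*)\}")


def unique_authors(match):
    # Normalise whitespace so names wrapped across lines compare equal.
    text = " ".join(match.group(2).split())
    names = text.split(" and ")
    # dict.fromkeys keeps the first occurrence of each name, in order
    # (dicts preserve insertion order in Python 3.7+).
    return "%s{%s}" % (match.group(1), " and ".join(dict.fromkeys(names)))


def clean(data):
    # Only author fields are rewritten; everything else is left alone.
    return AUTHOR_FIELD.sub(unique_authors, data)


if __name__ == "__main__":
    entry = "author = {Ackermann, K.-F. and Ackermann,\nK.-F. and Ackermann},\nyear = {1980},"
    print(clean(entry))
```

Like the version above, this works on the whole file contents at once, so the same memory caveat applies.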
 
Dave Angel

Which is it? Do you want to match your output, or match your
description? Your description would result in:

@INBOOK{Ackermann1999-b,
author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann},
year = {1980},
timestamp = {1995-12-02}
}

(That's done by eyeball, so I may have missed some minor differences.)
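
The eyeball result can be checked with a few lines. This is only a sketch: the sample entry is rebuilt by hand from the question, and the variable names are chosen for illustration.

```python
# The three line patterns that repeat inside the author field.
a = "K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F."
b = "and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and"
c = "Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,"

header = "@INBOOK{Ackermann1999-b,"
first = "author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,"
last = "Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann},"
year = "year = {1980},"
stamp = "timestamp = {1995-12-02}"

# The 14 lines of the sample entry, in their original order.
entry = [header, first, a, b, c, a, b, c, a, b, last, year, stamp, "}"]

seen = set()
kept = []
for line in entry:
    # The rule as described: keep brace lines, drop other repeated lines.
    if "{" in line or "}" in line or line not in seen:
        kept.append(line)
    seen.add(line)

print(len(entry), "->", len(kept))  # prints "14 -> 9"
print("\n".join(kept))
```

The nine lines kept are the header, the opening author line, one copy each of the three repeating lines, the closing author line, year, timestamp and the final brace.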
 
Joon Ki Choi

lines_seen = set() # holds lines already seen
outfile = open("literatur_clean.txt", "w")
for line in open("literatur_dupl.txt", "r"):
    if ('{' in line or '}' in line) or line not in lines_seen:
        outfile.write(line)
        lines_seen.add(line)
outfile.close()
 
