Delete duplicate rows in textfile - except it contains a "{" or "}"

Discussion in 'Python' started by Joon Ki Choi, Oct 10, 2012.

  1. Joon Ki Choi

    Joon Ki Choi Guest

    Hello Pythonistas,

    I have a very large text file with contents like:

    @INBOOK{Ackermann1999-b,
    author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
    K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
    and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
    Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
    K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
    and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
    Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
    K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
    and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
    Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann},
    year = {1980},
    timestamp = {1995-12-02}
    }

    I want to delete the duplicate rows, except for rows containing the brackets { or }.
    The result should look like:

    @INBOOK{Ackermann1999-b,
    author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
    Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann},
    year = {1980},
    timestamp = {1995-12-02}
    }

    I came across this Python script:

    lines_seen = set()  # holds lines already seen
    outfile = open("literatur_clean.txt", "w")
    for line in open("literatur_dupl.txt", "r"):
        if line not in lines_seen:  # not a duplicate
            outfile.write(line)
            lines_seen.add(line)
    outfile.close()

    But it also deletes the lines with a closing bracket } and the lines with the same author data.
    That is why I need a condition for the brackets.

    Could someone show me how to add this condition?

    Thanks in advance,
    Joon
    Joon Ki Choi, Oct 10, 2012
    #1

  2. Re: Delete duplicate rows in textfile - except it contains a "{"or "}"

    On 10/10/2012 09:51, Joon Ki Choi wrote:
    > lines_seen = set() # holds lines already seen
    > outfile = open("literatur_clean.txt", "w")


    Slight aside, you could use this so there's no need to explicitly close
    the file.

    with open("literatur_dupl.txt", "r") as infile:

    > for line in infile:
    >     if line not in lines_seen: # not a duplicate
    >         outfile.write(line)
    >         lines_seen.add(line)


    Something like:-

    if "{" in line or "}" in line or line not in lines_seen:

    > outfile.close()


    --
    Cheers.

    Mark Lawrence.
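Putting the two suggestions above together, a minimal sketch might look like the following. The helper names `keep_unique_lines` and `clean_file` are made up for illustration; the file names are the ones from the original script. Lines containing a bracket are always written, and any other line is written only the first time it appears.

```python
def keep_unique_lines(lines):
    """Yield lines, skipping repeats unless the line contains '{' or '}'."""
    lines_seen = set()  # bracket-free lines already written
    for line in lines:
        if "{" in line or "}" in line:
            yield line  # bracket lines are always kept
        elif line not in lines_seen:
            lines_seen.add(line)
            yield line


def clean_file(src="literatur_dupl.txt", dst="literatur_clean.txt"):
    # Both files are closed automatically when the with block ends.
    with open(src, "r") as infile, open(dst, "w") as outfile:
        outfile.writelines(keep_unique_lines(infile))
```

Because the input is read line by line, only the set of distinct bracket-free lines has to fit in memory, not the whole file.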
    Mark Lawrence, Oct 10, 2012
    #2

  3. Joon Ki Choi

    Peter Otten Guest

    Re: Delete duplicate rows in textfile - except it contains a "{" or"}"


    Not what you asked for, but here is something that is quick-and-dirty, too,
    but tries a bit harder:

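    import re

    def unique(match):
        # Strip the surrounding braces, then split the field on commas.
        names = match.group()[1:-1].split(",")
        # Normalize whitespace in each part; the set drops repeats (order is not kept).
        parts = set(" ".join(author.split()) for author in names)
        return "{%s}" % ", ".join(parts)

    if __name__ == "__main__":
        with open("literatur_dupl.txt") as f:
            data = f.read()
        # Rewrite every innermost {...} group, including ones that span lines.
        data = re.compile("{[^{}]*}", re.DOTALL).sub(unique, data)

        with open("literatur_clean.txt", "w") as f:
            f.write(data)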

    I'm assuming that "very large" means that the file contents still
    comfortably fit into your computer's memory...
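The set in `unique` does not keep the original order of the names, and splitting on commas separates surnames from initials. As a rough alternative sketch, assuming the authors inside a field are separated by the word "and" (as in the sample entry), the names can be de-duplicated while keeping the first occurrence of each. `unique_authors` is a hypothetical helper name, not part of the code above, and it would mis-split a name that itself contains a standalone "and".

```python
import re


def unique_authors(field):
    """Return the author field with repeated names removed, order preserved."""
    # Collapse line breaks and runs of spaces so wrapped names compare equal.
    flat = " ".join(field.split())
    seen = []
    for name in re.split(r"\s+and\s+", flat):
        if name and name not in seen:
            seen.append(name)
    return " and ".join(seen)


field = "Ackermann, K.-F. and Ackermann,\nK.-F. and Ackermann, K.-F.\nand Ackermann"
print(unique_authors(field))  # Ackermann, K.-F. and Ackermann
```

It could be called from a `re.sub` callback in the same way as `unique`, on the text between the braces of an author field.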
    Peter Otten, Oct 10, 2012
    #3
  4. Joon Ki Choi

    Dave Angel Guest

    Re: Delete duplicate rows in textfile - except it contains a "{"or "}"


    Which is it? Do you want to match your output, or match your
    description? Your description would result in:

    @INBOOK{Ackermann1999-b,
    author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
    K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
    and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
    Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
    Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann},
    year = {1980},
    timestamp = {1995-12-02}
    }

    (that's doing it by eyeball, so I may have missed some minor differences)



    --

    DaveA
    Dave Angel, Oct 10, 2012
    #4
  5. Joon Ki Choi

    Joon Ki Choi Guest

    Re: Delete duplicate rows in textfile - except it contains a "{" or"}"

    lines_seen = set()  # holds lines already seen
    outfile = open("literatur_clean.txt", "w")
    for line in open("literatur_dupl.txt", "r"):
        if ('{' in line or '}' in line) or line not in lines_seen:
            outfile.write(line)
            lines_seen.add(line)
    outfile.close()
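One caveat with comparing raw lines: two rows that differ only in trailing spaces or line endings count as different. A possible variant, assuming such differences should be ignored, compares a stripped copy of each line but keeps the original line unchanged in the output. `dedupe_ignoring_whitespace` is an illustrative name only.

```python
def dedupe_ignoring_whitespace(lines):
    """Return the kept lines; repeats are judged on the stripped text."""
    lines_seen = set()  # stripped text of lines already kept
    kept = []
    for line in lines:
        key = line.strip()  # ignore leading/trailing blanks and the newline
        if "{" in line or "}" in line or key not in lines_seen:
            kept.append(line)
            lines_seen.add(key)
    return kept


rows = ["a = {1},\n", "x y\n", "x y  \n", "x y\r\n", "}\n", "}\n"]
print(dedupe_ignoring_whitespace(rows))
```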
    Joon Ki Choi, Oct 10, 2012
    #5