using python to parse md5sum list

Discussion in 'Python' started by Ben Rf, Mar 6, 2005.

  1. Ben Rf

    Ben Rf Guest

    Hi

    I'm new to programming and I'd like to write a program that will parse
    a list produced by md5summer and give me a report in a text file on
    which md5 sums appear more than once and where they are located.

    The end goal is to have a way of finding duplicate files that are
    scattered across a LAN of 4 Windows computers.

    I've dabbled with different languages over the years and I think
    Python is a good language for this, but I've had a lot of trouble
    sifting through manuals and tutorials finding out which commands I need
    and their syntax.

    Can someone please help me?

    Thanks.

    Ben
    Ben Rf, Mar 6, 2005
    #1

  2. James Stroud

    James Stroud Guest

    Among many other things:

    First, you might want to look at os.path.walk()
    Second, look at the string data type.

    Third, get the Python essential reference.

    Also, Programming Python (O'Reilly) actually has a lot in it about stuff like
    this. It's a tedious read, but in the end it will help a lot with administrative
    stuff like you are doing here.
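    (Editor's note: os.path.walk() was the API at the time; it was later removed
    in Python 3, where os.walk() is the surviving equivalent. As a minimal sketch
    of walking a tree and computing checksums on modern Python; file_md5 and
    checksum_tree are illustrative names, not standard functions:)

    ```python
    import hashlib
    import os

    def file_md5(path, chunk_size=65536):
        """Return the hex MD5 of a file, read in chunks to keep memory flat."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def checksum_tree(root):
        """Yield (md5, path) for every file under root, like md5summer would list."""
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                yield file_md5(path), path
    ```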

    So, with the understanding that you will look at these references, I will
    foolishly save you a little time...

    If you are using md5sum, you can grab the md5 and the filename like so:

    myfile = open(filename)
    md5sums = []
    for aline in myfile.readlines():
        md5sums.append(aline[:-1].split(" ", 1))
    myfile.close()

    The md5 sum will be in element 0 of each list in the md5sums list, and
    the path to the file will be in element 1.
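    (Editor's note: for the original goal, a text-file report of sums that appear
    more than once, the parsed pairs can be grouped with a dict. A minimal modern-
    Python sketch, assuming the two-element lists produced above; find_duplicates
    and write_report are illustrative names:)

    ```python
    def find_duplicates(md5sums):
        """Group (md5, path) pairs and keep only sums that appear more than once."""
        groups = {}
        for md5, path in md5sums:
            groups.setdefault(md5, []).append(path)
        return {md5: paths for md5, paths in groups.items() if len(paths) > 1}

    def write_report(md5sums, outfile):
        """Write one line per duplicated checksum: the sum, then its locations."""
        with open(outfile, "w") as out:
            for md5, paths in sorted(find_duplicates(md5sums).items()):
                out.write(md5 + "\t" + "\t".join(paths) + "\n")
    ```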


    James



    --
    James Stroud, Ph.D.
    UCLA-DOE Institute for Genomics and Proteomics
    Box 951570
    Los Angeles, CA 90095
    James Stroud, Mar 6, 2005
    #2

  3. Ben Rf wrote:

    > I'm new to programming and i'd like to write a program that will parse
    > a list produced by md5summer and give me a report in a text file on
    > which md5 sums appear more than once and where they are located.


    This should do the trick:

    """
    import fileinput

    md5s = {}
    for line in fileinput.input():
    md5, filename = line.rstrip().split()
    md5s.setdefault(md5, []).append(filename)

    for md5, filenames in md5s.iteritems():
    if len(filenames) > 1:
    print "\t".join(filenames)
    """

    Put this in md5dups.py and you can then use
    md5dups.py [FILE]... to find duplicates in any of the files you
    specify. They'll then be printed out as a tab-delimited list.

    Key things you might want to look up to understand this:

    * the dict datatype
    * dict.setdefault()
    * dict.iteritems()
    * the fileinput module
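    (Editor's note: dict.iteritems() and the print statement are Python 2 idioms.
    A Python 3 rendering of the same approach might look like this; duplicate_groups
    and main are illustrative names, and a maxsplit of 1 is used so filenames
    containing spaces survive the split:)

    ```python
    import fileinput

    def duplicate_groups(lines):
        """Map each checksum to the filenames that share it; return the groups
        whose checksum appears more than once."""
        md5s = {}
        for line in lines:
            # maxsplit=1 keeps filenames containing spaces intact
            md5, filename = line.rstrip().split(None, 1)
            md5s.setdefault(md5, []).append(filename)
        return [names for names in md5s.values() if len(names) > 1]

    def main():
        # same command line as above: md5dups.py [FILE]...
        for filenames in duplicate_groups(fileinput.input()):
            print("\t".join(filenames))
    ```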
    --
    Michael Hoffman
    Michael Hoffman, Mar 6, 2005
    #3
  4. In <>, James Stroud
    wrote:

    > If you are using md5sum, tou can grab the md5 and the filename like such:
    >
    > myfile = open(filename)
    > md5sums = []
    > for aline in myfile.readlines():
    > md5sums.append(aline[:-1].split(" ",1))


    md5sums.append(aline[:-1].split(None, 1))

    That works too if md5sum opened the files in binary mode, which is the
    default on Windows. The filename is then prefixed with a '*', leaving
    just one space between checksum and filename.

    > myfile.close()
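    (Editor's note: to cope with both text-mode and binary-mode md5sum lines, the
    leading '*' can be stripped after the split. A small sketch; parse_md5_line is
    an illustrative helper, not part of md5sum:)

    ```python
    def parse_md5_line(line):
        """Split an md5sum output line into (checksum, filename), handling the
        '*' that marks files read in binary mode."""
        md5, name = line.rstrip("\n").split(None, 1)
        if name.startswith("*"):
            name = name[1:]
        return md5, name
    ```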


    Ciao,
    Marc 'BlackJack' Rintsch
    Marc 'BlackJack' Rintsch, Mar 6, 2005
    #4
  5. On 5 Mar 2005 19:54:34 -0800, rumours say that (Ben Rf)
    might have written:

    [snip]

    >The end goal is to have a way of finding duplicate files that are
    >scattered across a LAN of 4 Windows computers.


    Just in case you want to go directly to that goal, check this:

    http://groups-beta.google.com/group/comp.lang.python/messages/048e292ec9adb82d

    It doesn't read a file at all, unless there is a need to do that. For example,
    if you have ten small files and one large one, the large one will not be read
    (since no other files would be found with the same size).

    In your case, you can use the find_duplicate_files function with arguments like:
    r"\\COMPUTER1\SHARE1", r"\\COMPUTER2\SHARE2" etc
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
    Christos TZOTZIOY Georgiou, Mar 7, 2005
    #5
