fdups: calling for beta testers

Discussion in 'Python' started by Patrick Useldinger, Feb 25, 2005.

  1. Hi all,

    I am looking for beta-testers for fdups.

    fdups is a program to detect duplicate files on locally mounted
    filesystems. Files are considered equal if their content is identical,
    regardless of their filename. Also, fdups ignores symbolic links and is
    able to detect and ignore hardlinks, where available.
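    fdups' own inode bookkeeping isn't shown in this thread; as a rough
    sketch of how symlink and hardlink filtering can work on POSIX systems
    (the function name and structure here are mine, not fdups'):

```python
import os

def should_skip(path, seen_inodes):
    """Skip symbolic links outright, and skip any path whose inode we
    have already queued (i.e. a hardlink to a file we will compare
    anyway). seen_inodes is a set shared across the whole walk."""
    if os.path.islink(path):
        return True
    st = os.stat(path)
    key = (st.st_dev, st.st_ino)  # inode numbers are only unique per device
    if key in seen_inodes:
        return True
    seen_inodes.add(key)
    return False
```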

    In contrast to similar programs, fdups does not rely on MD5 sums or
    other hash functions to detect potentially identical files. Instead, it
    does a direct blockwise comparison and stops reading as soon as
    possible, thus keeping file reads to a minimum.
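    The stop-early idea can be sketched in a few lines of modern Python.
    This illustrates the principle only; fdups' actual code compares whole
    groups of same-sized files at once rather than pairs:

```python
import os

def blockwise_equal(path_a, path_b, blocksize=8192):
    """Compare two files block by block, stopping at the first difference."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # different sizes can never be duplicates
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(blocksize)
            block_b = fb.read(blocksize)
            if block_a != block_b:
                return False  # stop reading as soon as the files diverge
            if not block_a:   # both files exhausted: identical content
                return True
```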

    fdups has been developed on Linux but should run on all platforms that
    support Python.

    fdups' homepage is at http://www.homepages.lu/pu/fdups.html, where
    you'll also find a link to download the tar.

    I am primarily interested in feedback on whether it produces correct
    results. But as I haven't been programming in Python for a year or so,
    I'd also be interested in comments on the code if you happen to look at
    it in detail.

    Your help is much appreciated.

    -pu
     
    Patrick Useldinger, Feb 25, 2005
    #1

  2. Patrick Useldinger

    John Machin Guest

    Patrick Useldinger wrote:
    >
    > fdups' homepage is at http://www.homepages.lu/pu/fdups.html, where
    > you'll also find a link to download the tar.
    >


    """fdups has no installation program. Just change into a temporary
    directory, and type "tar xfj fdups.tar.bz". You should also chown the
    files according to your needs, and then copy the executables to your
    PATH."""

    (1) It's actually .bz2, not .bz (2) Why annoy people with the
    not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
    Typing that on Windows command line doesn't produce a useful result (4)
    Haven't you heard of distutils?

    (5) if files[subgroup[j]]['flag'] and files[subgroup]['buffer'] ==
    files[subgroup[j]]['buffer']:

    That's not the most readable code I've ever seen.
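    For what it's worth, the same test reads better with the entries
    pulled out into named locals. I'm assuming the bare `files[subgroup]`
    in the quote was meant to be `files[subgroup[i]]`; the other names are
    the original's:

```python
def still_matches(files, subgroup, i, j):
    """Candidate j is still live, and its current buffer equals
    candidate i's buffer -- a readable spelling of the nested-index test."""
    entry_i = files[subgroup[i]]
    entry_j = files[subgroup[j]]
    return entry_j['flag'] and entry_i['buffer'] == entry_j['buffer']
```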

    (6) You are keeping open handles for all files of a given size -- have
    you actually considered the possibility of an exception like this:
    IOError: [Errno 24] Too many open files: 'foo509'

    Once upon a time, max 20 open files was considered as generous as 640KB
    of memory. Looks like Bill thinks 512 (open files, that is) is about
    right these days.

    (7)

    ! def compare(self):
    !     """ compare all files of the same size - outer loop """
    !     sizes=self.compfiles.keys()
    !     sizes.sort()
    !     for size in sizes:
    !         self.comparefiles(size,self.compfiles[size])

    Why sort? What's wrong with just two lines:

    ! for size, file_list in self.compfiles.iteritems():
    !     self.comparefiles(size, file_list)

    (8) global
    MIN_FILESIZE,MAX_ONEBUFFER,MAX_ALLBUFFERS,BLOCKSIZE,INODES

    That doesn't sit very well with the 'everything must be in a class'
    religion seemingly espoused by the following:

    ! class fDups:
    !     """ encapsulates the whole logic """

    (9) Any good reason why the "executables" don't have ".py" extensions
    on their names?

    All in all, a very poor "out-of-the-box" experience. Bear in mind that
    very few Windows users would have even heard of bzip2, let alone have a
    bzip2.exe on their machine. They wouldn't even be able to *open* the
    box.
    And what is "chown" -- any relation of Perl's "chomp"?
     
    John Machin, Feb 26, 2005
    #2

  3. John Machin wrote:

    > (1) It's actually .bz2, not .bz (2) Why annoy people with the
    > not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
    > Typing that on Windows command line doesn't produce a useful result (4)
    > Haven't you heard of distutils?


    (1) Typo, thanks for pointing it out.
    (2)(3) In the Linux world, it is really popular. I suppose you are a
    Windows user, and I haven't given that much thought. The point was not
    to save space, just to use the "standard" format. What would it be for
    Windows - zip?
    (4) Never used them, but that's a very valid point. I will look into it.

    > (6) You are keeping open handles for all files of a given size -- have
    > you actually considered the possibility of an exception like this:
    > IOError: [Errno 24] Too many open files: 'foo509'


    (6) Not much I can do about this. In the beginning, all files of equal
    size are potentially identical. I first need to read a chunk of each,
    and if I want to avoid opening & closing files all the time, I need them
    open together.
    What would you suggest?

    > Once upon a time, max 20 open files was considered as generous as 640KB
    > of memory. Looks like Bill thinks 512 (open files, that is) is about
    > right these days.


    Bill also thinks it is normal that half of Service Pack 2 lingers twice
    on a hard disk. Not sure whether he's my hero ;-)

    > (7)
    > Why sort? What's wrong with just two lines:
    >
    > ! for size, file_list in self.compfiles.iteritems():
    > ! self.comparefiles(size, file_list)


    (7) I wanted the output to be sorted by file size, instead of being
    random. It's psychological, but if you're chasing dups, you'd want to
    start with the largest ones first. If you have more than a screenful of
    info, it's the last lines which are the most interesting. And it will
    produce the same info in the same order if you run it twice on the same
    folders.
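    In today's Python (where dict.keys() returns a view and iteritems is
    gone), the size-ordered loop collapses to almost nothing; a sketch of
    the same idea:

```python
def groups_by_size(compfiles):
    """Yield (size, files) pairs smallest size first, so output order is
    deterministic and the largest duplicate groups end up on screen last."""
    for size in sorted(compfiles):
        yield size, compfiles[size]
```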

    > (8) global
    > MIN_FILESIZE,MAX_ONEBUFFER,MAX_ALLBUFFERS,BLOCKSIZE,INODES
    >
    > That doesn't sit very well with the 'everything must be in a class'
    > religion seemingly espoused by the following:


    (8) Agreed. I'll think about that.

    > (9) Any good reason why the "executables" don't have ".py" extensions
    > on their names?


    (9) Because I am lazy and Linux doesn't care. I suppose Windows does?

    > All in all, a very poor "out-of-the-box" experience. Bear in mind that
    > very few Windows users would have even heard of bzip2, let alone have a
    > bzip2.exe on their machine. They wouldn't even be able to *open* the
    > box.


    As I said, I did not give Windows users much thought. I will improve this.

    > And what is "chown" -- any relation of Perl's "chomp"?


    chown is a Unix command to change the owner or the group of a file. It
    has to do with controlling access to the file. It is not relevant on
    Windows. No relation to Perl's chomp.

    Thank you very much for your feedback. Did you actually run it on your
    Windows box?

    -pu
     
    Patrick Useldinger, Feb 26, 2005
    #3
  4. Patrick Useldinger

    Peter Hansen Guest

    Patrick Useldinger wrote:
    >> (9) Any good reason why the "executables" don't have ".py" extensions
    >> on their names?

    >
    > (9) Because I am lazy and Linux doesn't care. I suppose Windows does?


    Unfortunately, yes. Windows has nothing like the "x" permission
    bit, so you have to have an actual extension on the filename and
    Windows (XP anyway) will check it against the list of extensions
    in the PATHEXT environment variable to determine if it should be
    treated like an executable.

    Otherwise you must type "python" and the full filename.

    -Peter
     
    Peter Hansen, Feb 26, 2005
    #4
  5. Patrick Useldinger

    Serge Orlov Guest

    Peter Hansen wrote:
    > Patrick Useldinger wrote:
    >>> (9) Any good reason why the "executables" don't have ".py"
    >>> extensions on their names?

    >>
    >> (9) Because I am lazy and Linux doesn't care. I suppose Windows does?

    >
    > Unfortunately, yes. Windows has nothing like the "x" permission
    > bit, so you have to have an actual extension on the filename and
    > Windows (XP anyway) will check it against the list of extensions
    > in the PATHEXT environment variable to determine if it should be
    > treated like an executable.
    >
    > Otherwise you must type "python" and the full filename.


    Or use exemaker, which IMHO is the best way to handle this
    problem.

    Serge.
     
    Serge Orlov, Feb 26, 2005
    #5
  6. Patrick Useldinger

    John Machin Guest

    Patrick Useldinger wrote:
    > John Machin wrote:
    >
    > > (1) It's actually .bz2, not .bz (2) Why annoy people with the
    > > not-widely-known bzip2 format just to save a few % of a 12KB file??
    > > (3) Typing that on Windows command line doesn't produce a useful
    > > result (4) Haven't you heard of distutils?
    >
    > (1) Typo, thanks for pointing it out
    > (2)(3) In the Linux world, it is really popular. I suppose you are a
    > Windows user, and I haven't given that much thought. The point was not
    > to save space, just to use the "standard" format. What would it be for
    > Windows - zip?


    Yes. Moreover, "WinZip", the most popular archive-handler, doesn't grok
    bzip2.

    > > (6) You are keeping open handles for all files of a given size --
    > > have you actually considered the possibility of an exception like
    > > this: IOError: [Errno 24] Too many open files: 'foo509'
    >
    > (6) Not much I can do about this. In the beginning, all files of equal
    > size are potentially identical. I first need to read a chunk of each,
    > and if I want to avoid opening & closing files all the time, I need
    > them open together.
    > What would you suggest?


    Test, like I did, to see how many open handles you can get away with. I
    was not joking, 20 was the max on MS-DOS at one stage and I vaguely
    recall: (a) some low limits on various flavours of *x (b) the "ulimit"
    command can be used to vary the per-process limit but (c) there is a
    system-wide limit also.

    You should consider a fall-back method to be used in this case and in
    the case of too many files for your 1Mb (default) buffer pool. BTW 1Mb
    seems tiny; desktop PCs come with 512MB standard these days, and Bill
    does leave a bit more than 1MB available for applications.
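    One way to build such a fall-back (my sketch, not anything in fdups:
    the function name and the MD5 pre-filter are assumptions) is to
    pre-partition an oversized group by a digest of each file's first
    block, opening only one file at a time, and then run the normal
    blockwise comparison inside the much smaller partitions:

```python
import hashlib

def partition_by_first_block(paths, blocksize=8192):
    """Group same-size candidates by an MD5 of their first block, holding
    at most one file handle open at any moment -- a cheap pre-filter
    before the full block-by-block comparison."""
    buckets = {}
    for path in paths:
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read(blocksize)).hexdigest()
        buckets.setdefault(digest, []).append(path)
    # only buckets with two or more members can still contain duplicates
    return [group for group in buckets.values() if len(group) > 1]
```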

    > > And what is "chown" -- any relation of Perl's "chomp"?
    >
    > chown is a Unix command to change the owner or the group of a file. It
    > has to do with controlling access to the file. It is not relevant on
    > Windows. No relation to Perl's chomp.


    The question was rhetorical. Your irony detector must be on the fritz.
    :)

    > Did you actually run it on your
    > Windows box?


    Yes, with trepidation, after carefully reading the source. It detected
    some highly plausible duplicates, which I haven't verified yet.

    Cheers,
    John
     
    John Machin, Feb 26, 2005
    #6
  7. John Machin wrote:

    > Yes. Moreover, "WinZip", the most popular archive-handler, doesn't grok
    > bzip2.


    I've added a zip file. It was made on Linux with the zip command-line
    tool; the man pages say it's compatible with the Windows zip tools. I
    have also added .py extensions to the 2 programs. I did not, however,
    use distutils, because I'm not sure it is really suited to module-less
    scripts.
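    For the record, an era-appropriate setup.py for a couple of plain
    scripts would only be a few lines. This is a hypothetical sketch: the
    script names and version number are my guesses, and distutils itself
    was removed from the standard library in Python 3.12:

```python
# hypothetical setup.py for fdups -- names and version are assumptions
from distutils.core import setup  # the standard packaging tool circa 2005

setup(
    name="fdups",
    version="0.14",
    scripts=["fdups.py", "fdups-check.py"],  # copied onto the user's PATH
)
```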

    > You should consider a fall-back method to be used in this case and in
    > the case of too many files for your 1Mb (default) buffer pool. BTW 1Mb
    > seems tiny; desktop PCs come with 512MB standard these days, and Bill
    > does leave a bit more than 1MB available for applications.


    I've added it to the TODO list.

    > The question was rhetorical. Your irony detector must be on the fritz.
    > :)


    I always find it hard to detect irony in mail from people I do not know...

    >>Did you actually run it on your
    >>Windows box?

    >
    >
    > Yes, with trepidation, after carefully reading the source. It detected
    > some highly plausible duplicates, which I haven't verified yet.


    I would have been reluctant too. But I've tested it intensively, and
    there is not a single statement in it that actually alters the file
    system.

    Thanks for your feedback!

    -pu
     
    Patrick Useldinger, Feb 26, 2005
    #7
  8. Serge Orlov wrote:

    > Or use exemaker, which IMHO is the best way to handle this
    > problem.


    Looks good, but I do not use Windows.

    -pu
     
    Patrick Useldinger, Feb 26, 2005
    #8
  9. Patrick Useldinger

    John Machin Guest

    On Sat, 26 Feb 2005 23:53:10 +0100, Patrick Useldinger
    <> wrote:

    > I've tested it intensively


    "Famous Last Words" :)

    >Thanks for your feedback!


    Here's some more:

    (1) Manic s/w producing lots of files all the same size: the Borland
    C[++] compiler produces a debug symbol file (.tds) that's always
    384KB; I have 144 of these on my HD, rarely more than 1 in the same
    directory.

    Here's a snippet from a duplicate detection run:

    DUP|393216|2|\devel\delimited\build\lib.win32-1.5\delimited.tds|\devel\delimited\build\lib.win32-2.1\delimited.tds
    DUP|393216|2|\devel\delimited\build\lib.win32-2.3\delimited.tds|\devel\delimited\build\lib.win32-2.4\delimited.tds

    (2) There appears to be a flaw in your logic such that it will find
    duplicates only if they are in the *SAME* directory and only when
    there are no other directories with two or more files of the same
    size. The above duplicates were detected only when I made the
    following changes to your script:


    --- fdups	Sat Feb 26 06:41:36 2005
    +++ fdups_jm.py	Sun Feb 27 12:18:04 2005
    @@ -29,13 +29,14 @@
             self.count = self.totalsize = self.inodecount = self.slinkcount = 0
             self.gain = self.bytescompared = self.bytesread = self.inodecount = 0
             for toplevel in args:
    -            os.path.walk(toplevel, self.buildList, None)
    +            os.path.walk(toplevel, self.updateDict, None)
             if self.count > 0:
                 self.compare()

    -    def buildList(self,arg,dirpath,namelist):
    -        """ build a dictionnary of files to be analysed, indexed by length """
    -        files = {}
    +    def updateDict(self,arg,dirpath,namelist):
    +        """ update a dictionary of files to be analysed, indexed by length """
    +        # files = {}
    +        files = self.compfiles
             for filepath in namelist:
                 fullpath = os.path.join(dirpath,filepath)
                 if os.path.isfile(fullpath):
    @@ -51,20 +52,23 @@
                     if size >= MIN_FILESIZE:
                         self.count += 1
                         self.totalsize += size
    +                    # is above totalling in the wrong place?
                         if size not in files:
                             files[size]=[fullpath]
                         else:
                             files[size].append(fullpath)
    -        for size in files:
    -            if len(files[size]) != 1:
    -                self.compfiles[size]=files[size]
    +        # for size in files:
    +        #     if len(files[size]) != 1:
    +        #         self.compfiles[size]=files[size]

         def compare(self):
             """ compare all files of the same size - outer loop """
             sizes=self.compfiles.keys()
             sizes.sort()
             for size in sizes:
    -            self.comparefiles(size,self.compfiles[size])
    +            list_of_filenames = self.compfiles[size]
    +            if len(list_of_filenames) > 1:
    +                self.comparefiles(size, list_of_filenames)

         def comparefiles(self,size,filelist):
             """ compare all files of the same size - inner loop """

    (3) Your fdups-check gadget doesn't work on Windows; the commands
    module works only on Unix but is supplied with Python on all
    platforms. The results might just confuse a newbie:

    (1, "'{' is not recognized as an internal or external
    command,\noperable program or batch file.")

    Why not use the Python filecmp module?
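    For reference, a minimal verification pass over a reported duplicate
    group using filecmp might look like this (my sketch; `verify_group` is
    not part of fdups):

```python
import filecmp

def verify_group(paths):
    """Confirm every file in a reported duplicate group matches the first
    one byte for byte; shallow=False forces a real content comparison
    instead of an os.stat-only check."""
    first = paths[0]
    return all(filecmp.cmp(first, other, shallow=False) for other in paths[1:])
```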

    Cheers,
    John
     
    John Machin, Feb 27, 2005
    #9
  10. John Machin wrote:

    >>I've tested it intensively

    > "Famous Last Words" :)


    ;-)

    > (1) Manic s/w producing lots of files all the same size: the Borland
    > C[++] compiler produces a debug symbol file (.tds) that's always
    > 384KB; I have 144 of these on my HD, rarely more than 1 in the same
    > directory.


    Not sure what you want me to do about it. I've decreased the minimum
    block size once more, to accommodate more files of the same length
    without increasing the total amount of memory used.

    > (2) There appears to be a flaw in your logic such that it will find
    > duplicates only if they are in the *SAME* directory and only when
    > there are no other directories with two or more files of the same
    > size.


    Ooops...
    A really stupid mistake on my side. Corrected.

    > (3) Your fdups-check gadget doesn't work on Windows; the commands
    > module works only on Unix but is supplied with Python on all
    > platforms. The results might just confuse a newbie:
    > Why not use the Python filecmp module?


    Done. It's also faster AND it works better. Thanks for the suggestion.

    Please fetch the new version from http://www.homepages.lu/pu/fdups.html.

    -pu
     
    Patrick Useldinger, Feb 27, 2005
    #10