fdups: calling for beta testers

Patrick Useldinger

Hi all,

I am looking for beta-testers for fdups.

fdups is a program to detect duplicate files on locally mounted
filesystems. Files are considered equal if their content is identical,
regardless of their filename. It also ignores symbolic links and, where
the platform supports them, detects and ignores hard links.
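
For the curious: hard-link detection boils down to remembering
(device, inode) pairs. A simplified sketch of the idea (illustration
only, not the shipped code; relies on Unix stat semantics):

import os

def is_hardlink_dup(path, inodes):
    """ True if this (device, inode) pair was seen before, i.e.
        the path is a hard link to a file already visited """
    st = os.stat(path)
    key = (st.st_dev, st.st_ino)
    if key in inodes:
        return True
    inodes.add(key)
    return False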

In contrast to similar programs, fdups does not rely on md5 sums or
other hash functions to detect potentially identical files. Instead, it
does a direct blockwise comparison and stops reading as soon as a
difference is found, thus keeping file reads to a minimum.
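
To make the idea concrete, here is a stripped-down sketch for just two
files (not the actual fdups code; BLOCKSIZE stands in for the
configurable block size, and real groups can hold more than two files):

BLOCKSIZE = 8192  # placeholder; the real value is configurable

def same_content(path1, path2):
    """ blockwise comparison that stops at the first difference """
    f1, f2 = open(path1, 'rb'), open(path2, 'rb')
    try:
        while True:
            b1, b2 = f1.read(BLOCKSIZE), f2.read(BLOCKSIZE)
            if b1 != b2:
                return False  # first differing block: stop reading
            if not b1:
                return True   # both exhausted without a difference
    finally:
        f1.close()
        f2.close()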

fdups has been developed on Linux but should run on all platforms that
support Python.

fdups' homepage is at http://www.homepages.lu/pu/fdups.html, where
you'll also find a link to download the tar.

I am primarily interested in feedback on whether it produces correct
results. But as I haven't been programming in Python for a year or so,
I'd also be interested in comments on the code if you happen to look at
it in detail.

Your help is much appreciated.

-pu
 
John Machin

Patrick said:
fdups' homepage is at http://www.homepages.lu/pu/fdups.html, where
you'll also find a link to download the tar.

"""fdups has no installation program. Just change into a temporary
directory, and type "tar xfj fdups.tar.bz". You should also chown the
files according to your needs, and then copy the executables to your
PATH."""

(1) It's actually .bz2, not .bz. (2) Why annoy people with the
not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
Typing that on a Windows command line doesn't produce a useful result.
(4) Haven't you heard of distutils?

(5)

! if files[subgroup[j]]['flag'] and files[subgroup]['buffer'] == files[subgroup[j]]['buffer']:

That's not the most readable code I've ever seen.

(6) You are keeping open handles for all files of a given size -- have
you actually considered the possibility of an exception like this:
IOError: [Errno 24] Too many open files: 'foo509'

Once upon a time, max 20 open files was considered as generous as 640KB
of memory. Looks like Bill thinks 512 (open files, that is) is about
right these days.

(7)

! def compare(self):
!     """ compare all files of the same size - outer loop """
!     sizes=self.compfiles.keys()
!     sizes.sort()
!     for size in sizes:
!         self.comparefiles(size,self.compfiles[size])

Why sort? What's wrong with just two lines:

! for size, file_list in self.compfiles.iteritems():
!     self.comparefiles(size, file_list)

(8)

! global MIN_FILESIZE,MAX_ONEBUFFER,MAX_ALLBUFFERS,BLOCKSIZE,INODES

That doesn't sit very well with the 'everything must be in a class'
religion seemingly espoused by the following:

! class fDups:
!     """ encapsulates the whole logic """

(9) Any good reason why the "executables" don't have ".py" extensions
on their names?

All in all, a very poor "out-of-the-box" experience. Bear in mind that
very few Windows users would have even heard of bzip2, let alone have a
bzip2.exe on their machine. They wouldn't even be able to *open* the
box.
And what is "chown" -- any relation of Perl's "chomp"?
 
Patrick Useldinger

John said:
(1) It's actually .bz2, not .bz. (2) Why annoy people with the
not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
Typing that on a Windows command line doesn't produce a useful result.
(4) Haven't you heard of distutils?

(1) Typo, thanks for pointing it out.
(2)(3) In the Linux world, bzip2 is really popular. I suppose you are a
Windows user, and I haven't given that much thought. The point was not
to save space, just to use the "standard" format. What would it be for
Windows - zip?
(4) Never used it, but that's a very valid point. I will look into it.
(6) You are keeping open handles for all files of a given size -- have
you actually considered the possibility of an exception like this:
IOError: [Errno 24] Too many open files: 'foo509'

(6) Not much I can do about this. In the beginning, all files of equal
size are potentially identical. I first need to read a chunk of each,
and if I want to avoid opening & closing files all the time, I need them
open together.
What would you suggest?
Once upon a time, max 20 open files was considered as generous as 640KB
of memory. Looks like Bill thinks 512 (open files, that is) is about
right these days.

Bill also thinks it is normal that half of Service Pack 2 lingers twice
on a hard disk. Not sure whether he's my hero ;-)
(7)
Why sort? What's wrong with just two lines:

! for size, file_list in self.compfiles.iteritems():
!     self.comparefiles(size, file_list)

(7) I wanted the output to be sorted by file size, instead of being
random. It's psychological, but if you're chasing dups, you'd want to
start with the largest ones first. If you have more than a screenful of
info, it's the last lines which are the most interesting. And it will
produce the same info in the same order if you run it twice on the same
folders.
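
That said, the sorted output still fits in two lines using sorted(),
which exists since Python 2.4 (a sketch, assuming compfiles is a plain
dict keyed by size):

! for size in sorted(self.compfiles):
!     self.comparefiles(size, self.compfiles[size])
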
(8)

! global MIN_FILESIZE,MAX_ONEBUFFER,MAX_ALLBUFFERS,BLOCKSIZE,INODES

That doesn't sit very well with the 'everything must be in a class'
religion seemingly espoused by the following:

(8) Agreed. I'll think about that.
(9) Any good reason why the "executables" don't have ".py" extensions
on their names?

(9) Because I am lazy and Linux doesn't care. I suppose Windows does?
All in all, a very poor "out-of-the-box" experience. Bear in mind that
very few Windows users would have even heard of bzip2, let alone have a
bzip2.exe on their machine. They wouldn't even be able to *open* the
box.

As I said, I did not give Windows users much thought. I will improve this.
And what is "chown" -- any relation of Perl's "chomp"?

chown is a Unix command to change the owner or the group of a file. It
has to do with controlling access to the file. It is not relevant on
Windows. No relation to Perl's chomp.
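
For what it's worth, Python exposes it directly; a sketch with made-up
user and group names:

import os, pwd, grp

# Unix only: give 'fdups' to user 'alice', group 'users'
os.chown('fdups', pwd.getpwnam('alice').pw_uid, grp.getgrnam('users').gr_gid)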

Thank you very much for your feedback. Did you actually run it on your
Windows box?

-pu
 
Peter Hansen

Patrick said:
(9) Because I am lazy and Linux doesn't care. I suppose Windows does?

Unfortunately, yes. Windows has nothing like the "x" permission
bit, so you have to have an actual extension on the filename and
Windows (XP anyway) will check it against the list of extensions
in the PATHEXT environment variable to determine if it should be
treated like an executable.

Otherwise you must type "python" and the full filename.
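
You can check what's registered from Python itself; a trivial sketch
(assuming a stock Windows XP environment):

import os

# PATHEXT lists the extensions cmd.exe treats as executable;
# .PY shows up only if the Python installer (or you) added it.
exts = os.environ.get('PATHEXT', '').upper().split(';')
print('.PY' in exts)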

-Peter
 
Serge Orlov

Peter said:
Unfortunately, yes. Windows has nothing like the "x" permission
bit, so you have to have an actual extension on the filename and
Windows (XP anyway) will check it against the list of extensions
in the PATHEXT environment variable to determine if it should be
treated like an executable.

Otherwise you must type "python" and the full filename.

Or use exemaker, which IMHO is the best way to handle this
problem.

Serge.
 
John Machin

Patrick said:
(1) Typo, thanks for pointing it out
(2)(3) In the Linux world, it is really popular. I suppose you are a
Windows user, and I haven't given that much thought. The point was not
to save space, just to use the "standard" format. What would it be for
Windows - zip?

Yes. Moreover, "WinZip", the most popular archive-handler, doesn't grok
bzip2.
(6) You are keeping open handles for all files of a given size -- have
you actually considered the possibility of an exception like this:
IOError: [Errno 24] Too many open files: 'foo509'

(6) Not much I can do about this. In the beginning, all files of equal
size are potentially identical. I first need to read a chunk of each,
and if I want to avoid opening & closing files all the time, I need them
open together.
What would you suggest?

Test, like I did, to see how many open handles you can get away with. I
was not joking: 20 was the max on MS-DOS at one stage, and I vaguely
recall (a) some low limits on various flavours of *x, (b) that the
"ulimit" command can be used to vary the per-process limit, but (c)
that there is a system-wide limit also.
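
Here is the sort of throwaway probe I mean (not part of fdups;
'scratch.tmp' is just a disposable name):

handles = []
try:
    while True:
        # keep opening until the OS refuses; the length of the list
        # at that point is the effective per-process limit
        handles.append(open('scratch.tmp', 'w'))
except IOError:
    print('gave up after %d open files' % len(handles))
for h in handles:
    h.close()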

You should consider a fall-back method to be used in this case and in
the case of too many files for your 1MB (default) buffer pool. BTW 1MB
seems tiny; desktop PCs come with 512MB standard these days, and Bill
does leave a bit more than 1MB available for applications.
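
One possible shape for such a fall-back (a sketch only, and note it
trades away your no-hashing design goal for oversized groups): bucket
the group by digest so only one file is open at a time, then compare
within buckets as usual.

import md5

def bucket_by_digest(paths, blocksize=8192):
    """ group paths by MD5 digest, one open file at a time """
    buckets = {}
    for path in paths:
        h = md5.new()
        f = open(path, 'rb')
        while True:
            block = f.read(blocksize)
            if not block:
                break
            h.update(block)
        f.close()
        buckets.setdefault(h.digest(), []).append(path)
    return buckets.values()
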
chown is a Unix command to change the owner or the group of a file. It
has to do with controlling access to the file. It is not relevant on
Windows. No relation to Perl's chomp.

The question was rhetorical. Your irony detector must be on the fritz.
:)
Did you actually run it on your
Windows box?

Yes, with trepidation, after carefully reading the source. It detected
some highly plausible duplicates, which I haven't verified yet.

Cheers,
John
 
Patrick Useldinger

John said:
Yes. Moreover, "WinZip", the most popular archive-handler, doesn't grok
bzip2.

I've added a zip file. It was made in Linux with the zip command-line
tool; the man page says it's compatible with the Windows zip tools. I
have also added .py extensions to the 2 programs. I did however not use
distutils, because I'm not sure it is really suited to module-less scripts.
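
Looking at the docs, though, distutils does seem to handle plain
scripts via the scripts argument; a minimal setup.py sketch (version
number and script names are assumed here):

from distutils.core import setup

setup(name='fdups',
      version='0.1',
      scripts=['fdups.py', 'fdups-check.py'])
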
You should consider a fall-back method to be used in this case and in
the case of too many files for your 1MB (default) buffer pool. BTW 1MB
seems tiny; desktop PCs come with 512MB standard these days, and Bill
does leave a bit more than 1MB available for applications.

I've added it to the TODO list.
The question was rhetorical. Your irony detector must be on the fritz.
:)

I always find it hard to detect irony in mail from people I do not know...
Yes, with trepidation, after carefully reading the source. It detected
some highly plausible duplicates, which I haven't verified yet.

I would have been reluctant too. But I've tested it intensively, and
there is not a single statement in it that actually alters the file system.

Thanks for your feedback!

-pu
 
John Machin

Patrick said:
I've tested it intensively

"Famous Last Words" :)
Thanks for your feedback!

Here's some more:

(1) Manic s/w producing lots of files all the same size: the Borland
C[++] compiler produces a debug symbol file (.tds) that's always
384KB; I have 144 of these on my HD, rarely more than 1 in the same
directory.

Here's a snippet from a duplicate detection run:

DUP|393216|2|\devel\delimited\build\lib.win32-1.5\delimited.tds|\devel\delimited\build\lib.win32-2.1\delimited.tds
DUP|393216|2|\devel\delimited\build\lib.win32-2.3\delimited.tds|\devel\delimited\build\lib.win32-2.4\delimited.tds

(2) There appears to be a flaw in your logic such that it will find
duplicates only if they are in the *SAME* directory and only when
there are no other directories with two or more files of the same
size. The above duplicates were detected only when I made the
following changes to your script:


--- fdups       Sat Feb 26 06:41:36 2005
+++ fdups_jm.py Sun Feb 27 12:18:04 2005
@@ -29,13 +29,14 @@
         self.count = self.totalsize = self.inodecount = self.slinkcount = 0
         self.gain = self.bytescompared = self.bytesread = self.inodecount = 0
         for toplevel in args:
-            os.path.walk(toplevel, self.buildList, None)
+            os.path.walk(toplevel, self.updateDict, None)
         if self.count > 0:
             self.compare()

-    def buildList(self,arg,dirpath,namelist):
-        """ build a dictionnary of files to be analysed, indexed by length """
-        files = {}
+    def updateDict(self,arg,dirpath,namelist):
+        """ update a dictionary of files to be analysed, indexed by length """
+        # files = {}
+        files = self.compfiles
         for filepath in namelist:
             fullpath = os.path.join(dirpath,filepath)
             if os.path.isfile(fullpath):
@@ -51,20 +52,23 @@
                     if size >= MIN_FILESIZE:
                         self.count += 1
                         self.totalsize += size
+                        # is above totalling in the wrong place?
                         if size not in files:
                             files[size]=[fullpath]
                         else:
                             files[size].append(fullpath)
-        for size in files:
-            if len(files[size]) != 1:
-                self.compfiles[size]=files[size]
+        # for size in files:
+        #     if len(files[size]) != 1:
+        #         self.compfiles[size]=files[size]

     def compare(self):
         """ compare all files of the same size - outer loop """
         sizes=self.compfiles.keys()
         sizes.sort()
         for size in sizes:
-            self.comparefiles(size,self.compfiles[size])
+            list_of_filenames = self.compfiles[size]
+            if len(list_of_filenames) > 1:
+                self.comparefiles(size, list_of_filenames)

     def comparefiles(self,size,filelist):
         """ compare all files of the same size - inner loop """


(3) Your fdups-check gadget doesn't work on Windows; the commands
module works only on Unix but is supplied with Python on all
platforms. The results might just confuse a newbie:

(1, "'{' is not recognized as an internal or external command,\noperable program or batch file.")

Why not use the Python filecmp module?
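
For example (a sketch with hypothetical filenames; shallow=False forces
a byte-by-byte comparison instead of trusting os.stat() alone):

import filecmp

# True only if the two files really have identical contents
identical = filecmp.cmp('file_a', 'file_b', shallow=False)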

Cheers,
John
 
Patrick Useldinger

John said:
I've tested it intensively
"Famous Last Words" :)
;-)

(1) Manic s/w producing lots of files all the same size: the Borland
C[++] compiler produces a debug symbol file (.tds) that's always
384KB; I have 144 of these on my HD, rarely more than 1 in the same
directory.

Not sure what you want me to do about it. I've decreased the minimum
block size once more, to accommodate more files of the same length
without increasing the total amount of memory used.
(2) There appears to be a flaw in your logic such that it will find
duplicates only if they are in the *SAME* directory and only when
there are no other directories with two or more files of the same
size.

Ooops...
A really stupid mistake on my side. Corrected.
(3) Your fdups-check gadget doesn't work on Windows; the commands
module works only on Unix but is supplied with Python on all
platforms. The results might just confuse a newbie:
Why not use the Python filecmp module?

Done. It's also faster AND it works better. Thanks for the suggestion.

Please fetch the new version from http://www.homepages.lu/pu/fdups.html.

-pu
 
