How to get the size of a file?

User

Anyone have ideas which os command could be used to get the size of a
file without actually opening it? My intention is to write a script
that identifies duplicate files with different names. I have no
trouble getting the names of all the files in the directory using the
os.listdir() command, but that doesn't return the file size. In order
to be identical, files must be the same size, so I want to use file
size as the first criterion, then, if they are the same size, actually
open them up and compare the contents.

I have written such a script in the past, but had to resort to
something like:

os.system('dir *.* >> trash.txt')

The next step was then to open up 'trash.txt' and piece together the
information I need to compare file sizes. The problems with this
approach are that it is very platform dependent (it worked on WIN 95,
but I don't know what else it will work on) and that it is subject to
the 8.3 filename limitations that apply within this environment. That
is the reason I'm looking for some other command to obtain file size
before the files are ever opened.
 
David M. Cooke

You're looking for os.stat. It returns an object whose attributes have
info about the file; its st_size attribute is the size in bytes, e.g.
2821 (or whatever).

More info in the docs for the os module, of course.
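
For what it's worth, here is a minimal sketch of the size lookup, assuming a current Python 3. The helper name file_size and the throwaway demo file are made up for illustration; os.stat and os.path.getsize are the real calls, and neither needs the script to open the file.

```python
import os
import tempfile

def file_size(path):
    """Return the size of path in bytes, taken from the file's metadata."""
    return os.stat(path).st_size

# Demo on a throwaway file (hypothetical content, only for illustration).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2821)
    demo_path = tmp.name

size_via_stat = file_size(demo_path)
size_via_getsize = os.path.getsize(demo_path)  # convenience wrapper, same number
print(size_via_stat, size_via_getsize)

os.remove(demo_path)  # clean up the demo file
```

os.path.getsize is usually the shorter spelling when the size is all you want; os.stat is handy when you also need the modification time or the file type from the same call.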
 
Bengt Richter

This should list duplicate files in the specified directory:
You can hack to suit. Not very tested. Just what you see ;-)
------------------------------------------------
# get_dupes.py
import os, md5
def get_dupes(thedir):
    finfo = {}
    for f in os.listdir(thedir):
        if os.path.isfile(f):
            finfo.setdefault(os.path.getsize(f), []).append(f)

    result = []
    for size, flist in finfo.items():
        if len(flist)>1:
            dupes = {}
            for name in flist:
                dupes.setdefault(md5.new(open(name, 'rb').read()).hexdigest(),[]).append(name)
            for digest, names in dupes.items():
                if len(names)>1: result.append((size, digest, names))
    return result

if __name__ == '__main__':
    import sys
    try:
        dupes = get_dupes(sys.argv[1])
        if dupes:
            print
            print '%8s %32s %s' % ('size','md5 digest','files with the given size, digest')
            print '%8s %32s %s' % ('----','-'*32 ,'---------------------------------')
            for duped in dupes:
                print '%8s %32s %s' % duped
        else:
            print 'No duplicate files in %r' % sys.argv[1]
    except:
        raise SystemExit, 'Usage: python get_dupes.py directory'
-------------------------------------------

(I was surprised at the amount of duplicated stuff ;-)

[23:23] C:\pywk\clp>python get_dupes.py .

    size                       md5 digest files with the given size, digest
    ---- -------------------------------- ---------------------------------
       0 d41d8cd98f00b204e9800998ecf8427e ['z3', 'zero_len.py']
     111 ea70a0f814917ef8861bebc085e5e7d0 ['MyConsts.py', 'MyConsts.py~']
     163 f8e4add20e45bb253bd46963f25a7057 ['ramb.txt', 'rambxx.txt']
    4096 d96633a4b58522ce5787ef80a18e9c7b ['yyy2', 'yyy3']
     786 05956208d5185259b47362afcf1812fd ['startmore.py', 'startmore.py~']
     851 3845f161fa93cbb9119c16fc43e7b62a ['quadratic.py', 'quadratic.py~']
    1536 72f5c05b7ea8dd6059bf59f50b22df33 ['virtest.txt', '~DF30EC.tmp']
    1028 fbedc511f9556a8a1dc2ecfa3d859621 ['PaulMoore.py', 'PaulMoore.py~']
    1515 568f9732866a9de698732616ae4f9c3b ['loopbreak.py', 'loopbreak.py~']
    1662 f54414637ed420fe61b78eeba59737b7 ['for_grodrigues.py', 'for_grodrigues.r1.py']
    1702 23fa57926e7fcf2487943acb10db7e2a ['bitfield.py', 'bitfield.py~', 'packbits.py']
    3765 e69bf6b018ba305cc3e190378f93e421 ['pythonHi.gif', 'showgif.gif']
    5874 bae87bbed53c1e6908bb5c37db9c4292 ['testyenc.py', 'testyenc.py~']
    3990 4a5096efaf136f901603a2e1be850eb3 ['pns.py', 'pns.r1.py']

Regards,
Bengt Richter
 
Bengt Richter

This should list duplicate files in the specified directory:
You can hack to suit. Not very tested. Just what you see ;-)
[... version which only worked for current working directory...]
Phooey. Hopefully better:

----------------------------------------------------------------------------
# get_dupes.py
import os, md5
def get_dupes(thedir):
    # group the file names in thedir by file size
    finfo = {}
    for f in os.listdir(thedir):
        p = os.path.join(thedir, f)
        if os.path.isfile(p):
            finfo.setdefault(os.path.getsize(p), []).append(f)

    # only files that share a size get read and hashed
    result = []
    for size, flist in finfo.items():
        if len(flist)>1:
            dupes = {}
            for name in flist:
                dupes.setdefault(md5.new(open(os.path.join(thedir, name), 'rb'
                    ).read()).hexdigest(),[]).append(name)
            for digest, names in dupes.items():
                if len(names)>1: result.append((size, digest, names))
    return result

if __name__ == '__main__':
    import sys
    try:
        dupes = get_dupes(sys.argv[1])
        if dupes:
            print
            print '%8s %32s %s' % ('size','md5 digest','files with the given size, digest')
            print '%8s %32s %s' % ('----','-'*32 ,'---------------------------------')
            for duped in dupes:
                print '%8s %32s %s' % duped
        else:
            print 'No duplicate files in %r' % sys.argv[1]
    except:
        raise SystemExit, 'Usage: python get_dupes.py directory'
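
The script above uses the old md5 module and Python 2 print statements. As a rough sketch only, and assuming Python 3, the same size-first, digest-second idea might look like the following. hashlib.md5 stands in for md5.new, the helper file_digest and its chunk_size parameter are additions for illustration (not part of the original script), and the name lists are sorted so the result is predictable.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=65536):
    """Hash a file in chunks so a large file is never read into memory at once."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def get_dupes(thedir):
    """Return a list of (size, digest, names) for files with identical contents."""
    # Pass 1: group file names by size, using metadata only.
    by_size = defaultdict(list)
    for name in os.listdir(thedir):
        path = os.path.join(thedir, name)
        if os.path.isfile(path):
            by_size[os.path.getsize(path)].append(name)

    # Pass 2: hash only the files that share a size with another file.
    result = []
    for size, names in by_size.items():
        if len(names) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for name in names:
            by_digest[file_digest(os.path.join(thedir, name))].append(name)
        for digest, same in by_digest.items():
            if len(same) > 1:
                result.append((size, digest, sorted(same)))
    return result
```

Calling get_dupes('.') returns the same kind of (size, digest, names) tuples that the original prints, so the reporting loop can stay much as it is, with print written as a function call.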
 
Dennis Lee Bieber

Watch out for line wraps... This one didn't actually check the
contents, only logged candidates (time for me to use it again too)... As
you can see, it is ancient, and no doubt could be improved with some of
the newer Python modules...

#
# DupCheck.py -- Scans a directory and all subdirectories
#                for duplicate file names, reporting conflicts
# March 22 1998 dl bieber <[email protected]>
#

import os
import sys
import string
from stat import *

Files = {}

def Scan_Dir(cd):
    global Files, logfile

    cur_files = os.listdir(cd)
    cur_files.sort()
    for f in cur_files:
        fib = os.stat("%s\\%s" % (cd, f))
        if S_ISDIR(fib[ST_MODE]):
            Scan_Dir("%s\\%s" % (cd, f))
        elif S_ISREG(fib[ST_MODE]):
            if Files.has_key(string.lower(f)):
                (aSize, aDir) = Files[string.lower(f)]
                if fib[ST_SIZE] == aSize:
                    logfile.write(
                        "***** Possible Duplicate File: %s\n" % (f))
                    logfile.write(
                        " %s\t%s\n" % (fib[ST_SIZE], cd))
                    logfile.write(
                        " %s\t%s\n\n" % (Files[string.lower(f)]))
            else:
                Files[string.lower(f)] = (fib[ST_SIZE], cd)
        else:
            logfile.write(
                "***** SKIPPED Not File or Dir: %s\n\n" % (f))


if __name__ == "__main__":
    Cur_Dir = raw_input("Root Directory -> ")
    Log_To = raw_input("Log File -> ")

    if Log_To:
        logfile = open(Log_To, "w")
    else:
        logfile = sys.stdout

    Scan_Dir(Cur_Dir)

    if Log_To:
        logfile.close()
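
As for improving it with newer modules, one possible sketch (assuming Python 3) uses os.walk for the recursion and filecmp.cmp for the content check that the script above leaves out. The function name find_same_name_dupes and the returned list of path pairs are choices made for this illustration. Unlike the original, it remembers every path seen for a name, not only the first.

```python
import filecmp
import os

def find_same_name_dupes(root):
    """Walk root recursively; pair up files sharing a name, size, and contents."""
    seen = {}    # lowercased file name -> list of full paths seen so far
    dupes = []   # (earlier path, later path) pairs with identical contents
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort gives a stable subdirectory order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip anything that is not a regular file
            key = name.lower()
            for earlier in seen.get(key, []):
                # Cheap size test first, byte-by-byte comparison only if it passes.
                if (os.path.getsize(earlier) == os.path.getsize(path)
                        and filecmp.cmp(earlier, path, shallow=False)):
                    dupes.append((earlier, path))
                    break
            seen.setdefault(key, []).append(path)
    return dupes
```

Passing shallow=False makes filecmp.cmp compare the file contents instead of trusting the os.stat signature alone, and os.path.join keeps the path handling from depending on the Windows backslash.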

