This inquiry may turn out to be about the suitability of SHA-1 (160-bit
digest) for file identification, about the sha module in
Python ... or about some error in my script. Any insight is appreciated
in advance.
I am trying to reduce duplicate files in storage at home - I have a
large number of files (e.g. MP3s) which have been stored on disk multiple
times under different names or on different paths. The applications that
use them will search down from the top path and find the files - so
I do not need to worry about keeping track of paths.
All seemed to be working until I examined my log files and found that
files with the same SHA digest had different sizes according to
os.stat(fpath).st_size. This is on Windows XP.
- Am I expecting too much of SHA-1?
- Is it that the os.stat data on Windows cannot be trusted?
- Or perhaps there is a silly error in my code I should have seen?
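For what it is worth, here is the sanity check I intend to run next on
two of the "duplicates" - a rough, untested sketch, to see whether the
disagreement is between the hash, os.stat, or the way I read the file:

# sanity_check.py -- rough sketch, untested
import os, sha

def check(pth):
    text_bytes = open(pth, 'r').read()    # text mode, as in my script
    bin_bytes  = open(pth, 'rb').read()   # binary mode for comparison
    print pth
    print "  st_size:           ", os.stat(pth).st_size
    print "  len (text mode):   ", len(text_bytes)
    print "  len (binary mode): ", len(bin_bytes)
    print "  sha (text mode):   ", sha.new(text_bytes).hexdigest()
    print "  sha (binary mode): ", sha.new(bin_bytes).hexdigest()

check(r"F:\music\mp3s\01125.mp3")
check(r"F:\music\mp3s\0791.mp3")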
Thanks
- Eric
- - - - - - - - - - - - - - - - - -
Log file extract:
Dup: no   Path: F:\music\mp3s\01125.mp3  Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 01125.mp3  Size: 63006
Dup: YES  Path: F:\music\mp3s\0791.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0791.mp3   Size: 50068
Dup: YES  Path: F:\music\mp3s\12136.mp3  Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 12136.mp3  Size: 51827
Dup: YES  Path: F:\music\mp3s\11137.mp3  Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 11137.mp3  Size: 56417
Dup: YES  Path: F:\music\mp3s\0991.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0991.mp3   Size: 59043
Dup: YES  Path: F:\music\mp3s\0591.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0591.mp3   Size: 59162
Dup: YES  Path: F:\music\mp3s\10140.mp3  Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 10140.mp3  Size: 59545
Dup: YES  Path: F:\music\mp3s\0491.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0491.mp3   Size: 63101
Dup: YES  Path: F:\music\mp3s\0392.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0392.mp3   Size: 63252
Dup: YES  Path: F:\music\mp3s\0891.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0891.mp3   Size: 65808
Dup: YES  Path: F:\music\mp3s\0691.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0691.mp3   Size: 67050
Dup: YES  Path: F:\music\mp3s\0294.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0294.mp3   Size: 67710
Code:
# Dedup_inplace.py
# vers .02
# Python 2.4.1
# Build a dictionary mapping hash -> path
# On a 2nd occurrence of the same hash, delete that path
testpath=r"F:\music\mp3s"
logpath=r"C:\testlog6.txt"

import os, sha

def hashit(pth):
    """Takes a file path and returns a SHA hash of its string"""
    fs=open(pth,'r').read()
    sh=sha.new(fs).hexdigest()
    return sh

def logData(d={}, logfile="c://filename999.txt", separator="\n"):
    """Takes a dictionary of values and writes them to the provided
    file path"""
    logstring=separator.join([str(key)+": "+d[key] for key in d.keys()])+"\n"
    f=open(logfile,'a')
    f.write(logstring)
    f.close()
    return

def walker(topPath):
    fDict={}
    logDict={}
    limit=1000
    freed_space=0
    for root, dirs, files in os.walk(topPath):
        for name in files:
            fpath=os.path.join(root,name)
            fsize=os.stat(fpath).st_size
            fkey=hashit(fpath)
            logDict["Name"]=name
            logDict["Path"]=fpath
            logDict["Hash"]=fkey
            logDict["Size"]=str(fsize)
            if fkey not in fDict.keys():
                fDict[fkey]=fpath
                logDict["Dup"]="no"
            else:
                #os.remove(fpath)  # uncomment only when script proven
                logDict["Dup"]="YES"
                freed_space+=fsize
            logData(logDict, logpath, "\t")
            items=len(fDict.keys())
            print "Dict entry: ",items,
            print "Cum freed space: ",freed_space
            if items > limit:
                break
        if items > limit:
            break

def emptyNests(topPath):
    """Walks downward from the given path and deletes any empty
    directories"""
    for root, dirs, files in os.walk(topPath):
        for d in dirs:
            dpath=os.path.join(root,d)
            if len(os.listdir(dpath))==0:
                print "deleting: ", dpath
                os.rmdir(dpath)

walker(testpath)
emptyNests(testpath)
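One refinement I have been considering, separate from the question
above: since true duplicates must have equal sizes, I could group files
by st_size first and only hash within groups of two or more. A rough,
untested sketch of that pre-filter:

# size_first.py -- sketch of a size-first pre-filter (untested)
import os

def size_groups(topPath):
    """Map st_size -> list of paths; only multi-member groups need hashing."""
    groups={}
    for root, dirs, files in os.walk(topPath):
        for name in files:
            fpath=os.path.join(root,name)
            groups.setdefault(os.stat(fpath).st_size, []).append(fpath)
    # keep only sizes that occur more than once
    return dict([(sz,ps) for sz,ps in groups.items() if len(ps)>1])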