Python SHA-1 as a method for unique file identification? [help!]

EP

This inquiry may either turn out to be about the suitability of the
SHA-1 (160 bit digest) for file identification, the sha function in
Python ... or about some error in my script. Any insight appreciated
in advance.

I am trying to reduce duplicate files in storage at home - I have a
large number of files (e.g. MP3s) which have been stored on disk
multiple times under different names or on different paths. The
applications that use them will search down from the top path and find
the files - so I do not need to worry about keeping track of paths.

All seemed to be working until I examined my log files and found that
files with the same SHA digest had different sizes according to
os.stat(fpath).st_size. This is on Windows XP.

- Am I expecting too much of SHA-1?
- Is it that the os.stat data on Windows cannot be trusted?
- Or perhaps there is a silly error in my code I should have seen?

Thanks

- Eric

- - - - - - - - - - - - - - - - - -

Log file extract:

Dup: no   Path: F:\music\mp3s\01125.mp3  Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 01125.mp3  Size: 63006
Dup: YES  Path: F:\music\mp3s\0791.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0791.mp3   Size: 50068
Dup: YES  Path: F:\music\mp3s\12136.mp3  Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 12136.mp3  Size: 51827
Dup: YES  Path: F:\music\mp3s\11137.mp3  Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 11137.mp3  Size: 56417
Dup: YES  Path: F:\music\mp3s\0991.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0991.mp3   Size: 59043
Dup: YES  Path: F:\music\mp3s\0591.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0591.mp3   Size: 59162
Dup: YES  Path: F:\music\mp3s\10140.mp3  Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 10140.mp3  Size: 59545
Dup: YES  Path: F:\music\mp3s\0491.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0491.mp3   Size: 63101
Dup: YES  Path: F:\music\mp3s\0392.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0392.mp3   Size: 63252
Dup: YES  Path: F:\music\mp3s\0891.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0891.mp3   Size: 65808
Dup: YES  Path: F:\music\mp3s\0691.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0691.mp3   Size: 67050
Dup: YES  Path: F:\music\mp3s\0294.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0294.mp3   Size: 67710


Code:

# Dedup_inplace.py
# vers .02
# Python 2.4.1

# Create a dictionary consisting of hash:path
# Look for 2nd same hash and delete path

testpath=r"F:\music\mp3s"
logpath=r"C:\testlog6.txt"

import os, sha

def hashit(pth):
    """Takes a file path and returns a SHA hash of its string"""
    fs=open(pth,'r').read()
    sh=sha.new(fs).hexdigest()
    return sh

def logData(d={}, logfile="c://filename999.txt", separator="\n"):
    """Takes a dictionary of values and writes them to the provided
    file path"""
    logstring=separator.join([str(key)+": "+d[key] for key in d.keys()])+"\n"
    f=open(logfile,'a')
    f.write(logstring)
    f.close()
    return

def walker(topPath):
    fDict={}
    logDict={}
    limit=1000
    freed_space=0
    for root, dirs, files in os.walk(topPath):
        for name in files:
            fpath=os.path.join(root,name)
            fsize=os.stat(fpath).st_size
            fkey=hashit(fpath)
            logDict["Name"]=name
            logDict["Path"]=fpath
            logDict["Hash"]=fkey
            logDict["Size"]=str(fsize)
            if fkey not in fDict.keys():
                fDict[fkey]=fpath
                logDict["Dup"]="no"
            else:
                #os.remove(fpath) --uncomment only when script proven
                logDict["Dup"]="YES"
                freed_space+=fsize
            logData(logDict, logpath, "\t")
            items=len(fDict.keys())
            print "Dict entry: ",items,
            print "Cum freed space: ",freed_space
            if items > limit:
                break
        if items > limit:
            break

def emptyNests(topPath):
    """Walks downward from the given path and deletes any empty
    directories"""
    for root, dirs, files in os.walk(topPath):
        for d in dirs:
            dpath=os.path.join(root,d)
            if len(os.listdir(dpath))==0:
                print "deleting: ", dpath
                os.rmdir(dpath)

walker(testpath)
emptyNests(testpath)
 
Justin Ezequiel

EP said:
This inquiry may either turn out to be about the suitability of the
SHA-1 (160 bit digest) for file identification, the sha function in
Python ... or about some error in my script.

This is on Windows XP.

def hashit(pth):
    fs=open(pth,'r').read()
    sh=sha.new(fs).hexdigest()
    return sh

Cannot comment on the suitability of SHA for your use case, but
shouldn't you be opening the file in binary mode?

fs=open(pth,'rb').read()
 
Tim Peters

EP said:
This inquiry may either turn out to be about the suitability of the
SHA-1 (160 bit digest) for file identification, the sha function in
Python ... or about some error in my script

It's your script. Always open binary files in binary mode. It's a
disaster on Windows if you don't (if you open a file in text mode on
Windows, the OS pretends that EOF occurs at the first instance of byte
chr(26) -- this is an ancient Windows behavior that made an odd kind
of sense in the mists of history, and has persisted in worship of
Backward Compatibility despite that the original reason for it went
away _long_ ago).
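
A tiny illustration of the effect (Windows-only; the file name is made
up):

# Demo of the text-mode phantom EOF on Windows.
data = 'abc' + chr(26) + 'def'
f = open('demo.bin', 'wb')
f.write(data)
f.close()
print len(open('demo.bin', 'r').read())   # 3 on Windows: read stops at chr(26)
print len(open('demo.bin', 'rb').read())  # 7 everywhere: binary mode sees all bytes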

....
I am trying to reduce duplicate files in storage at home - I have a
large number files (e.g. MP3s) which have been stored on disk multiple
times under different names or on different paths.
....

All seemed to be working until I examined my log files and found that
files with the same SHA digest had different sizes according to
os.stat(fpath).st_size. This is on Windows XP.

- Am I expecting too much of SHA-1?

No. SHA-1 should work well for this.
- Is it that the os.stat data on Windows cannot be trusted?

It can be trusted to the extent that anything on Windows can be trusted ;-)

....
def hashit(pth):
    """Takes a file path and returns a SHA hash of its string"""
    fs=open(pth,'r').read()

Make that 'rb' instead of 'r', and your problem will go away.
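
With that one-character change, the original function becomes:

def hashit(pth):
    """Takes a file path and returns a SHA hash of its bytes"""
    fs = open(pth, 'rb').read()   # binary mode: no phantom EOF at chr(26)
    sh = sha.new(fs).hexdigest()
    return sh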

Do note that there are faster ways to go about this. For example, if
a particular file has a unique size among the files you're looking at,
there's no reason to even open that file (let alone compute its hash).
Group the files by size, and within a size group you can find
duplicates by, e.g., first comparing their initial 16 bytes, then the
next 32, then the next 64 ... etc. If duplicates are uncommon, this
can be a huge savings. For example, here's output from an old
dup-finding Python program of mine run over a directory tree which
happens to contain no duplicates:

Files
-----
Total                10,718
Unique               10,718
Duplicate                 0
# w/ unique size     10,053
# w/ unique prefix      665

Bytes
-----
Total         1,401,668,015
Unique        1,401,668,015
Duplicate                 0
Read                 76,688
Excess               76,688

The last two lines mean it actually read a total of only about 77,000
file bytes on its way to proving there were no duplicates among about
11,000 files spanning about 1.4 gigabytes.

No hashes are computed by that program. I'd use a hash if instead I
were going to save a persistent (across runs) set of known hashes, so
that answering "here's a new file -- same as one I already have?"
could be done by computing its hash. While the program above
generally reads "amazingly" few file bytes, it hammers the OS file
directory services, and it would go faster to compute a hash of a new
file than to run the whole analysis again.
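
A minimal sketch of the size-then-growing-prefix strategy described
above (this is not Tim's program; the function name and chunk sizes
are illustrative):

import os

def find_dup_groups(paths):
    """Group files by size, then split each group by successively
    longer prefixes; groups that survive to end-of-file are
    byte-identical duplicates."""
    by_size = {}
    for p in paths:
        by_size.setdefault(os.path.getsize(p), []).append(p)
    dup_groups = []
    for size, group in by_size.items():
        if len(group) < 2:
            continue            # unique size: never even open the file
        chunks = [group]
        offset, n = 0, 16       # compare 16 bytes, then 32, then 64 ...
        while offset < size and chunks:
            survivors = []
            for chunk in chunks:
                seen = {}
                for p in chunk:
                    f = open(p, 'rb')   # binary mode, of course
                    f.seek(offset)
                    seen.setdefault(f.read(n), []).append(p)
                    f.close()
                survivors.extend([g for g in seen.values() if len(g) > 1])
            chunks = survivors
            offset += n
            n *= 2
        dup_groups.extend(chunks)   # matched across the entire file
    return dup_groups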
 
Mike Orr

Tim said:
EP said:
This inquiry may either turn out to be about the suitability of the
SHA-1 (160 bit digest) for file identification, the sha function in
Python ... or about some error in my script

It's your script. Always open binary files in binary mode. It's a
disaster on Windows if you don't (if you open a file in text mode on
Windows, the OS pretends that EOF occurs at the first instance of byte
chr(26) -- this is an ancient Windows behavior that made an odd kind
of sense in the mists of history, and has persisted in worship of
Backward Compatibility despite that the original reason for it went
away _long_ ago).

On a semi-related note, I have a database on Linux that imports from a
Macintosh CSV file. The 'csv' module says to always open files in
binary mode, but this didn't work in my case: I had to open it as 'rU'
(text with universal newlines) or 'csv' misparsed it. I'd like the
program to be portable to Windows and Mac. Is there a way around this?
Will I really burn in hell for using 'rU'?

What was the odd bit of sense? I know you end console input by typing
ctrl-Z, but I thought it was just like Unix ctrl-D which ends the input
but doesn't actually insert that character.
 
John Machin

Tim said:
EP said:
This inquiry may either turn out to be about the suitability of the
SHA-1 (160 bit digest) for file identification, the sha function in
Python ... or about some error in my script

It's your script. Always open binary files in binary mode. It's a
disaster on Windows if you don't (if you open a file in text mode on
Windows, the OS pretends that EOF occurs at the first instance of byte
chr(26) -- this is an ancient Windows behavior that made an odd kind
of sense in the mists of history, and has persisted in worship of
Backward Compatibility despite that the original reason for it went
away _long_ ago).

On a semi-related note, I have a database on Linux that imports from a
Macintosh CSV file. The 'csv' module says to always open files in
binary mode, but this didn't work in my case: I had to open it as 'rU'
(text with universal newlines) or 'csv' misparsed it. I'd like the
program to be portable to Windows and Mac. Is there a way around this?
Will I really burn in hell for using 'rU'?

Yes, you will burn in hell for using any old kludge that gets results
(by accident) instead of reading the manual to find a principled solution:

"""
lineterminator
The string used to terminate lines in the CSV file. It defaults to '\r\n'.
"""

In the case of a Mac CSV file, '\r' is probably required.

You will burn in hell for asking questions w/o supplying sufficient
information, like (a) repr(first few lines of your Mac CSV file) (b)
what was the result from the csv module ("didn't work" doesn't cut it).
What was the odd bit of sense? I know you end console input by typing
ctrl-Z, but I thought it was just like Unix ctrl-D which ends the input
but doesn't actually insert that character.

Pace timbot, the "ancient Windows behavior" was inherited via MS-DOS
from CP/M. Sectors on disk were 128 bytes. File sizes were recorded as
numbers of sectors, not numbers of bytes. The convention was that the
end of a text file was indicated by ^Z.

You are correct, modern software shouldn't and usually doesn't
gratuitously write ^Z to files, but there is some software out there
that still does, hence the preservation of the convention on reading.

More importantly for CSV files, the data may contain *embedded* CRs and
LFs that the users had in their spreadsheet file. Reading that with "r"
or "rU" will certainly result in "didn't work".

HTH,
John
 
Andrew McNamara

John said:
Yes, you will burn in hell for using any old kludge that gets results
(by accident) instead of reading the manual to find a principled solution:

"""
lineterminator
The string used to terminate lines in the CSV file. It defaults to '\r\n'.
"""

In the case of a Mac CSV file, '\r' is probably required.

Unfortunately, the documentation is misleading in this case:
"lineterminator" is only used for output.

The documentation specifies that the file should be opened in binary
mode, because the CSV parser has its own idea of "universal newlines".
The complicating factor is that newlines can appear quoted inside a
field: under universal newlines, these "quoted newlines" would be
damaged (because the newline translation is unaware of the quoting
conventions).

If your data file contains no quoted newlines (they're rare, but if you
need them, you need them), then opening the file in "universal newline"
mode should be harmless (and in this case, is the right thing to do).
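
So for a Mac-produced file with no quoted newlines, something like
this should be safe (the file name is illustrative):

import csv

# 'rU' translates the classic-Mac \r line endings; harmless as long
# as no field contains a quoted newline.
f = open('export_from_mac.csv', 'rU')
for row in csv.reader(f):
    print row
f.close()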
 
