safe way to calc md5 on very large files

rtilley

I'm calculating md5 checksums on very large files (2 GB). This is a safe
way to do so, right? Also... is the file closed when the block exits?
I'm using 'rb' as this is used on Windows and Linux computers.

md5 = Digest::MD5.new()
File.open(file, 'rb').each {|line| md5.update(line)}
 
ara.t.howard

Steve said:
Close.. try this..

require 'md5'
File.open(filename,'rb') { |f| MD5.hexdigest(f.read) }

And yes, the file is closed with the block form of open.

--Steve

i think the OP has the right approach - note that an 'f.read' will consume
2GB. but the OP's code

harp:~ > cat a.rb
require 'digest/md5'
md5 = Digest::MD5.new() and open(ARGV.shift, 'rb').each{|line| md5 << line}
p md5.hexdigest

will not.

regards.

-a
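
A minimal sketch of the "is the file closed?" part of the question, assuming a throwaway Tempfile stands in for the large file (the variable names are illustrative, not from the thread). It compares the block form of File.open with the chained File.open(...).each form:

```ruby
require 'tempfile'

# Hypothetical small file standing in for the 2 GB one.
tmp = Tempfile.new('close-demo')
tmp.write("line one\nline two\n")
tmp.close

# Block form: File.open closes the handle when the block returns.
captured = nil
File.open(tmp.path, 'rb') { |f| captured = f; f.each { |line| line } }
block_form_closed = captured.closed?

# Chained form: each iterates, but nothing closes the handle afterwards.
handle = File.open(tmp.path, 'rb')
handle.each { |line| line }
chained_form_closed = handle.closed?
handle.close # must be closed by hand (or left to the garbage collector)

tmp.unlink
```

So the chained form does bound memory per line, but it leaves closing the file to the caller.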
 
Andrew Johnson

ara.t.howard said:
i think the OP has the right approach - note that an 'f.read' will consume
2GB. but the OP's code

harp:~ > cat a.rb
require 'digest/md5'
md5 = Digest::MD5.new() and open(ARGV.shift, 'rb').each{|line| md5 << line}
p md5.hexdigest

will not.


In my reading of the OP, both the block-open and iteration are actually
desired:

md5 = Digest::MD5.new
File.open(file,'rb') do |ios|
  ios.each {|line| md5 << line }
end

cheers,
andrew
 
Bill Kelly

rtilley said:
I'm calculating md5 checksums on very large files (2 GB). This is a safe
way to do so, right? Also... is the file closed when the block exits?
I'm using 'rb' as this is used on Windows and Linux computers.

md5 = Digest::MD5.new()
File.open(file, 'rb').each {|line| md5.update(line)}

Hi - does the file really contain text lines? Or is it a file
full of binary data? If it's a binary file, there may be no
guarantee the whole thing isn't one very long "line". In that
case I'd recommend reading it in chunks.

Untested:

md5 = Digest::MD5.new()
File.open(file, 'rb') do |io|
  while (buf = io.read(4096)) && buf.length > 0
    md5.update(buf)
  end
end


Regards,

Bill
 
Robert Klemme

Andrew Johnson said:
In my reading of the OP, both the block-open and iteration are
actually desired:

md5 = Digest::MD5.new
File.open(file,'rb') do |ios|
  ios.each {|line| md5 << line }
end

IMHO it's a bad idea to use line-oriented reading on a binary file because
"lines" can be arbitrarily long (i.e. the whole file in the worst case). Using
IO#read with a fixed chunk size is much better.

Kind regards

robert
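
A rough sketch of the point above, assuming a hypothetical 1 MiB binary file that happens to contain no newline byte. It records how large each piece is when the file is read by line versus with IO#read in 4096-byte chunks:

```ruby
require 'tempfile'

# Hypothetical binary payload: 1 MiB with no "\n" anywhere in it.
data = "\x00\x01\x02\x03".b * 262_144
tmp = Tempfile.new('binary-demo')
tmp.binmode
tmp.write(data)
tmp.close

# Line-oriented reading: the whole file comes back as a single "line".
line_sizes = []
File.open(tmp.path, 'rb') do |io|
  io.each { |line| line_sizes << line.bytesize }
end

# Chunked reading: no piece is ever larger than the requested 4096 bytes.
chunk_sizes = []
File.open(tmp.path, 'rb') do |io|
  while (buf = io.read(4096))
    chunk_sizes << buf.bytesize
  end
end

tmp.unlink
```

With real data the line sizes depend on where newline bytes happen to fall, so the memory use of the line-based version is not predictable.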
 
Robert Klemme

Bill Kelly said:
Hi - does the file really contain text lines? Or is it a file
full of binary data? If it's a binary file, there may be no
guarantee the whole thing isn't one very long "line". In that
case I'd recommend reading it in chunks.

Untested:

md5 = Digest::MD5.new()
File.open(file, 'rb') do |io|
  while (buf = io.read(4096)) && buf.length > 0
    md5.update(buf)
  end
end

io.read(4096) will return nil at EOF, so your test for positive length is
basically redundant. Also, for reasons of error checking I'd place the digest
creation inside the block because then the digest is never created if the file
cannot be opened:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  while (buf = io.read(4096))
    dig.update(buf)
  end
  dig
end

If you want to increase efficiency, you can do this, which reuses one buffer
instead of creating a new string for every chunk:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  while io.read(4096, buf)
    dig.update(buf)
  end
  dig
end

Here's another nice variant:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  dig.update(buf) while io.read(4096, buf)
  dig
end

Kind regards

robert
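
A small self-check for the buffer-reusing variant, assuming a hypothetical 10,000-byte Tempfile in place of the real file. It is only a sketch: it computes the digest chunk by chunk, computes it again over the whole content in one call, and probes what IO#read returns at EOF.

```ruby
require 'digest/md5'
require 'tempfile'

# Hypothetical test data: 10,000 pseudo-random bytes (seeded, so repeatable).
data = Random.new(42).bytes(10_000)
tmp = Tempfile.new('md5-demo')
tmp.binmode
tmp.write(data)
tmp.close

# Chunked digest with one reusable buffer, as in the variant above.
chunked = File.open(tmp.path, 'rb') do |io|
  dig = Digest::MD5.new
  buf = String.new # explicitly mutable buffer; io.read refills it in place
  dig.update(buf) while io.read(4096, buf)
  dig
end

# Reference value: digest of the whole content held in memory at once.
whole = Digest::MD5.hexdigest(data)

# A read with a length argument returns nil once the file is exhausted.
at_eof = File.open(tmp.path, 'rb') { |io| io.read; io.read(4096) }

tmp.unlink
```

The chunk size does not affect the result; 4096 is simply the value used in this thread.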
 
rtilley

Robert said:
io.read(4096) will return nil at EOF, so your test for positive length is
basically redundant. Also, for reasons of error checking I'd place the
digest creation inside the block because then the digest is never
created if the file cannot be opened:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  while (buf = io.read(4096))
    dig.update(buf)
  end
  dig
end

Thank you Robert, Billy and others! Your suggestions have helped me to
solve the problem.
 
Tanaka Akira

Robert Klemme said:
md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  while io.read(4096, buf)
    dig.update(buf)
  end
  dig
end

Why do we have no such method in the digest library?

I think it is useful enough to have in the library.
 
Erik Veenstra

Tanaka Akira said:
Why do we have no such method in the digest library?

I extended the MD5 class with a class method to build an MD5
object directly from the contents of a given file.

Use it like this:

md5 = MD5.file("foo.bar")

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

require "md5"

class MD5
  def self.file(file)
    File.open(file, "rb") do |f|
      res = self.new
      while (data = f.read(4096))
        res << data
      end
      res
    end
  end
end

----------------------------------------------------------------
 
rtilley

Erik said:
I extended the MD5 class with a class method to build an MD5
object directly from the contents of a given file.

Should this be done to sha1, sha2, etc?
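
For what it's worth, current Ruby versions ship a file class method on the Digest classes themselves, so no extension is needed there for MD5, SHA1 or SHA2. A minimal sketch, assuming a reasonably recent Ruby and a hypothetical small Tempfile in place of a real large file:

```ruby
require 'digest/md5'
require 'digest/sha1'
require 'digest/sha2'
require 'tempfile'

# Hypothetical sample file; in practice this would be the large file's path.
content = "hello world\n"
tmp = Tempfile.new('digest-file-demo')
tmp.binmode
tmp.write(content)
tmp.close

# Each call reads the file and returns a digest object for its contents.
md5_hex    = Digest::MD5.file(tmp.path).hexdigest
sha1_hex   = Digest::SHA1.file(tmp.path).hexdigest
sha256_hex = Digest::SHA256.file(tmp.path).hexdigest

tmp.unlink
```

The same call shape works for each algorithm, so switching away from MD5 is a one-word change.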
 
