IO#sysread on Windows

Discussion in 'Ruby' started by pihentagy, Jun 14, 2006.

  1. pihentagy

    pihentagy Guest

    Hi!

    I tried to write a file dupe finder. For this to work, I created an
    improved File::Stat, like this:

    require 'digest/sha1'

    class File::StatWithSha < File::Stat
      attr_reader :filename, :read

      def initialize(fn)
        @filename = File.expand_path(fn)
        @read = 0
        super(fn)
      end

      def sha1sum
        return @sha1sum if @sha1sum ||= nil
        warn "Calculating sha1sum for #@filename"
        chunk = nil
        fs = 0
        d = Digest::SHA1.new
        File.open(filename) {|f|
          begin
            while chunk = f.sysread(1048576)
              fs += chunk.length
              d.update(chunk)
            end
          rescue EOFError
            # sysread signals end of file by raising EOFError
            warn "\nResult is #{d} #{fs} <=> #{self.size}"
            return @sha1sum = d
          rescue => e
            warn "Holy shit! #{e}"
          end
        }
        warn "Oh my god!"
        exit
      end

      def inspect; @filename; end
    end


    When under windows, it fails with both ruby1.8.2 and ruby1.8.4

    irb(main):006:0> fws.sha1sum
    Calculating sha1sum for F:/private/prg/ruby/g2.rb
    Chunk is 2113

    Result is c75de1a39ce389e7e198c97345ffad52b074e5e9 2113 <=> 2210
    => c75de1a39ce389e7e198c97345ffad52b074e5e9

    Under linux it works fine.
    Anyway, how should I calculate the sha1sum of a BIG file, just using
    ruby?
    pihentagy, Jun 14, 2006
    #1

  2. Tim Hunter

    Tim Hunter Guest

    pihentagy wrote:
    > When under windows, it fails with both ruby1.8.2 and ruby1.8.4
    > Under linux it works fine.



    Probably you should open the files with "rb" instead of letting it
    default to "r".
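A minimal sketch of that suggestion, assuming a hypothetical helper name `sha1_of_file` (not from the original post). Opening with "rb" turns off the \r\n to \n translation that Windows applies in text mode, so the digest sees the same bytes that File::Stat#size counts. The gap between 2113 and 2210 in the output above would be consistent with one \r dropped per line, though that is a guess.

```ruby
require "digest/sha1"

# Hypothetical helper: SHA-1 of a file, read in fixed-size chunks so that
# a big file never has to fit in memory.
def sha1_of_file(path, chunk_size = 1024 * 1024)
  digest = Digest::SHA1.new
  # "rb" = read, binary: no newline translation on Windows.
  File.open(path, "rb") do |f|
    # IO#read(n) returns nil at end of file, which ends the loop.
    while (chunk = f.read(chunk_size))
      digest.update(chunk)
    end
  end
  digest.hexdigest
end
```

On Linux the "b" flag makes no difference to the bytes read, which would explain why the original code only misbehaved on Windows.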

    > Anyway, how should I calculate the sha1sum of a BIG file, just using
    > ruby?
    >


    For finding dups, I wonder if it's useful to compare checksums unless
    you've already computed them in advance. I notice that Ruby's own
    FileUtils.install checks filea == fileb by simply comparing the files
    until it finds a difference or gets to EOF.
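For reference, the comparison described here is also exposed directly as FileUtils.compare_file. A small sketch, where the wrapper name `same_content?` and the size shortcut are illustrative assumptions rather than anything from the thread:

```ruby
require "fileutils"

# Hypothetical wrapper around the stdlib call FileUtils.compare_file.
def same_content?(path_a, path_b)
  # Cheap check first: files of different sizes cannot be identical.
  return false unless File.size(path_a) == File.size(path_b)
  # Compares the two files' contents and returns true if they are identical.
  FileUtils.compare_file(path_a, path_b)
end
```

This needs no checksum at all, which is attractive when only two candidates have to be compared.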
    Tim Hunter, Jun 14, 2006
    #2

  3. Robert Klemme

    Tim Hunter wrote:
    > For finding dups, I wonder if it's useful to compare checksums unless
    > you've already computed them in advance. I notice that Ruby's own
    > FileUtils.install checks filea == fileb by simply comparing the files
    > until it finds a difference or gets to EOF.


    It depends. If you want to find duplicates in a set of files then using
    the digest as hash key can make finding duplicates much faster. OTOH if
    you can detect candidates by looking at other attributes (size,
    mtime...) then the additional overhead for the checksum calculation
    might slow things down. It depends - as always. :)

    Btw, I don't see a reason to use sysread in this scenario. read will do.
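To illustrate the difference, here is a sketch with two hypothetical byte-counting helpers (the names are made up for this example). IO#read(n) reports end of file by returning nil, while IO#sysread(n) raises EOFError, which is why the original code needed a rescue clause.

```ruby
# Counts bytes using the buffered IO#read; nil marks end of file.
def count_with_read(path, chunk_size = 4096)
  total = 0
  File.open(path, "rb") do |f|
    while (chunk = f.read(chunk_size))
      total += chunk.bytesize
    end
  end
  total
end

# Counts bytes using the unbuffered IO#sysread; EOF arrives as an exception.
def count_with_sysread(path, chunk_size = 4096)
  total = 0
  File.open(path, "rb") do |f|
    begin
      loop { total += f.sysread(chunk_size).bytesize }
    rescue EOFError
      # sysread never returns nil, so this is the normal way out of the loop.
    end
  end
  total
end
```

Both helpers should report the same count for a file opened in binary mode; the read version is simply shorter and has no exception handling in its normal path.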

    Kind regards

    robert
    Robert Klemme, Jun 14, 2006
    #3
  4. pihentagy

    pihentagy Guest

    Tim Hunter wrote:
    > Probably you should open the files with "rb" instead of letting it
    > default to "r".

    Holy s**t! Since I tried it on text files and it failed, I didn't see
    why the mode would matter.
    Ah, that damned \r\n to \n translation, I guess.

    > For finding dups, I wonder if it's useful to compare checksums unless
    > you've already computed them in advance. I notice that Ruby's own
    > FileUtils.install checks filea == fileb by simply comparing the files
    > until it finds a difference or gets to EOF.

    Well, first I'd like to partition the files by size, and only then
    compare them.
    If more than 2 files have the same size, it's better to calculate the
    sha1sum once for each file involved. And if you'd like to be on the
    safe side, you can compare the contents of the files that have the
    same sha1sum.
    You can also improve on this by caching the sha1sums (say, in a file
    in every directory).
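The plan above could be sketched roughly as follows. The method name `find_duplicates` is hypothetical, caching is left out, and Digest::SHA1.file is assumed to be available, which holds for current Ruby versions but not necessarily for the 1.8.2 and 1.8.4 releases mentioned in this thread.

```ruby
require "digest/sha1"
require "fileutils"

# Hypothetical sketch: returns an array of groups, each group being an
# array of paths whose contents are byte-for-byte identical.
def find_duplicates(paths)
  groups = []
  # Step 1: partition by size; a file with a unique size has no duplicate.
  paths.group_by { |p| File.size(p) }.each_value do |same_size|
    next if same_size.size < 2
    # Step 2: hash each candidate once (Digest::SHA1.file reads in binary mode).
    same_size.group_by { |p| Digest::SHA1.file(p).hexdigest }.each_value do |same_sha|
      next if same_sha.size < 2
      # Step 3: to be on the safe side, confirm by comparing contents.
      first, *rest = same_sha
      confirmed = [first] + rest.select { |p| FileUtils.compare_file(first, p) }
      groups << confirmed if confirmed.size > 1
    end
  end
  groups
end
```

Using the digest as the grouping key means each file is read once for hashing, however many files share its size.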
    pihentagy, Jun 14, 2006
    #4
