64-bit hashing function

Discussion in 'Java' started by Roedy Green, Apr 20, 2014.

  1. Roedy Green

    Roedy Green Guest

    What is your favourite 64-bit hashing function/digest? This is not for
    cryptography. Ideally there is Java source available and it is fast.

    I will be using it to collapse filenames down. There is no malice or
    privacy to worry about.

    I would like similar filenames to generate different results.
    Roedy Green, Apr 20, 2014

  2. Roedy Green

    Arne Vajhøj Guest

    String hashCode gives 32 bit values.

    MessageDigest.getInstance("MD5") gives 128 bit.

    64 bit is a funky size. Usually overkill for a hash table and
    too small for cryptographic usage.

    But if you want it, then Google turns up a number of candidates.
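
    For instance, one home-grown way to get 64 bits from the standard
    library is to truncate an MD5 digest to its first 8 bytes. A minimal
    sketch, assuming non-cryptographic use is all that is needed:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class Hash64
        {
        /** 64-bit hash: the first 8 of MD5's 16 digest bytes, as a big-endian long. */
        public static long hash64( String s ) throws NoSuchAlgorithmException
            {
            final MessageDigest md5 = MessageDigest.getInstance( "MD5" );
            final byte[] digest = md5.digest( s.getBytes( StandardCharsets.UTF_8 ) );
            return ByteBuffer.wrap( digest ).getLong();
            }
        }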


    Arne Vajhøj, Apr 21, 2014

  3. Roedy Green

    Roedy Green Guest

    I have found a promising candidate called CityHash.

    CityHash produces 64-bit hashes, among other sizes. It is fast on
    64-bit processors, and implementations are available for Java and for
    C. The C versions use special hardware instructions available only on
    some processors.
    Roedy Green, Apr 21, 2014
  4. If you could live with very rare collisions, a 32-bit hash will do as
    well. String.hashCode will be fine in this case. It will usually not
    return the same values for similar strings.
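
    A quick illustration: similar names nearly always hash differently,
    though collisions do exist ("Ab" and "BC" are a classic colliding pair):

    public class HashDemo
        {
        public static void main( String[] args )
            {
            System.out.println( "index1.html".hashCode() ); // differs from the next
            System.out.println( "index2.html".hashCode() );
            System.out.println( "Ab".hashCode() ); // 2113
            System.out.println( "BC".hashCode() ); // 2113, a collision
            }
        }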

    If you can't deal with collisions at all, even a longer hash might not
    hit the target. Hash codes always /may/ create collisions, although
    they are very rare for reasonable hashing algorithms.

    If you absolutely want 64 bits and have no well-known good 64-bit
    hash, use two /different/ 32-bit hash codes instead, e.g.
    String.hashCode and CRC32.
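
    A minimal sketch of that combination, assuming the CRC is fed the
    string's UTF-8 bytes:

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class TwoHash64
        {
        /** pack two different 32-bit hashes of s into one 64-bit value. */
        public static long hash64( String s )
            {
            final CRC32 crc = new CRC32();
            crc.update( s.getBytes( StandardCharsets.UTF_8 ) );
            final long high = s.hashCode();                // String.hashCode half
            final long low = crc.getValue() & 0xffffffffL; // CRC32 half
            return ( high << 32 ) | low;
            }
        }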

    Marcel Müller, Apr 21, 2014
  5. Roedy Green

    Roedy Green Guest

    I thought I might collapse a filename to a 64 bit hash, then look it
    up in a traditional HashMap. I just wanted to keep the total size of
    the HashMap down.

    In essence, the problem is that I want to look up by filename
    (suitably truncated) to get a Long timestamp: when the file was last
    successfully processed without errors. If the current date of the file
    matches that, I can bypass processing the file. If the current date of
    the file is later than that, I have to reprocess it and update the
    entry.
    This is basically what I talked the HTML Validator people into
    implementing to drastically speed up batch validation. Now I am
    preparing to do it to my own HTMLMacros code. (They did not use a
    HashMap, but a 0-length matching indicator file with the
    last-successfully processed date, and let the OS do the lookup.)

    Let us say I write the code presuming collisions are impossible. What
    if one does happen? Then two files will share the same timestamp. It
    will contain the later of the two's timestamp. So if either file needs
    reprocessing, the scheme will think both do, hardly a catastrophe
    since I process everything all the time now.

    There must be a formula to estimate the number of collisions given the
    bit-size of the hash and the number of items you hash, presuming the
    hash is perfectly random. If not, I suppose I could write a one-shot
    to find out experimentally whether 64 bits are needed.

    I studied probability at UBC, so I am quite capable of deriving the
    formula if it is not easily available.
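
    In the meantime, here is a sketch of such a one-shot, computed
    analytically rather than experimentally. The closed form
    n - m * ( 1 - ( 1 - 1/m )^n ) for n items in m bins assumes a
    perfectly random hash:

    public class ExpectedCollisions
        {
        /** expected colliding inserts hashing n random items into m bins. */
        static double expected( double n, double m )
            {
            // the expm1/log1p form avoids precision loss when m is huge.
            final double occupiedBins = m * -Math.expm1( n * Math.log1p( -1 / m ) );
            return n - occupiedBins;
            }

        public static void main( String[] args )
            {
            final double n = 20000;
            System.out.println( expected( n, Math.pow( 2, 16 ) ) ); // ~2763
            System.out.println( expected( n, Math.pow( 2, 32 ) ) ); // ~0.047
            // for 64 bits the closed form exhausts double precision;
            // the approximation n^2/(2m) gives about 1.1e-11.
            System.out.println( n * n / ( 2 * Math.pow( 2, 64 ) ) );
            }
        }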
    Roedy Green, Apr 21, 2014
  6. Roedy Green

    Roedy Green Guest

    If the math is right,

    For a 16-bit hashing, 20,000 filenames, you could expect 50,796
    collisions.

    For a 32-bit hashing, 20,000 filenames, you could expect 4 collisions.
    For a 64-bit hashing, 20,000 filenames, you could expect 2 collisions.

    This is quite a surprise. 64-bit buys you almost nothing. That
    greatly simplifies my project.

    I will post the program I wrote within the hour.
    Roedy Green, Apr 21, 2014
  7. The hash map takes the same storage if you hash file names directly.
    There is no map needed for that purpose at all.

    Simply take a single time stamp at program start that is not in the
    future and remember it for the next run. On the next run you need to
    process any files modified at or after this time stamp. There is no
    benefit in knowing exactly which time stamp a particular file had when
    it got processed.

    If you want to be sure that no file is processed twice, then you need
    to process only the files whose modification time stamps fall in the
    interval between the previous run's stamp and this run's stamp.

    But be aware of accuracy issues with file time stamps. Depending on
    the file system you might not get what you expect. I.e. if you
    processed up to 2013-04-21T10:20:01.152Z, the file system may write
    the next file at 2013-04-21T10:20:00Z (e.g. FAT16) or
    2013-04-21T10:20:01Z (e.g. ext2/3). Although the file is written
    /after/ your program ran, its time stamp is in the past from your
    point of view, most likely because of rounding issues. But even
    without rounding you might not see a change immediately during the
    directory scan. In fact you end up with a race condition. All this
    particularly applies to remote file systems, which might behave
    entirely differently than you expect on a particular platform.

    A common workaround is to introduce a defined latency, i.e. take a
    time stamp, let's say 30 seconds in the past, at program start. This
    is used as the last-run stamp on the next run.
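
    A sketch in Java, assuming millisecond clock resolution is good
    enough:

    import java.util.concurrent.TimeUnit;

    public class RunStamp
        {
        /** time stamp for this run, pushed 30 seconds into the past. */
        public static long stampForNextRun()
            {
            final long latency = TimeUnit.SECONDS.toMillis( 30 );
            return System.currentTimeMillis() - latency;
            }
        }

    Persist the returned stamp at program start; on the next run,
    reprocess every file whose lastModified() is at or after the
    previously persisted stamp.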

    Like any other make utility, this process gets messed up when the time
    stamps of the files are manipulated in an unexpected way, e.g. moving
    an old file into another location or replacing it with an old version
    of the same file. Especially on Windows, copying a file does not
    update the last write time stamp. You should use the maximum of the
    last write time stamp and the creation time stamp to work around this.
    Unfortunately the latter is not very accessible from Java before
    version 7.

    I still can't see any reasonable improvement from using hash codes
    instead of file names. If you need to process a file you need the file
    name anyway.

    Yes. Look for the Birthday Paradox at Wikipedia.

    The birthday is your hash code, and only 23 people are required to get
    a 50/50 chance of at least one collision. For a 32-bit int you need
    about 77,000 objects. All of this assumes an optimal hash function,
    which neither the birthday nor String.hashCode is, so the real values
    are somewhat lower.
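
    (The 50/50 threshold comes from solving 1 - exp( -n^2 / (2 m) ) = 1/2
    for n, giving n ~ sqrt( 2 m ln 2 ) ~ 1.18 sqrt( m ): about 23 for
    m = 365 birthdays, and about 77,000 for m = 2^32.)
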
    But you mentioned that a few collisions do not matter to you at all.
    There is neither a benefit from 64 bits in your case nor from using
    hash codes instead of file names.

    Marcel Müller, Apr 21, 2014
  8. Roedy Green

    Roedy Green Guest

    If I used a HashMap of filename -> timestamp,
    I would have 20,000 x ( 15*2 + 8 ) = 760,000 bytes.

    If I use a HashMap of digest -> timestamp,
    I would have 20,000 x ( 4 + 2 ) = 120,000 bytes

    The difference is even more pronounced when you consider the overhead
    of wrapping each digest/timestamp in an object.

    I only need to look at one filename at a time. I always have the
    current filename. I don't need the HashMap to give me a filename.

    It sounds conceptually complicated, but it would really only add a few
    lines of code.
    Roedy Green, Apr 21, 2014
  9. Roedy Green

    Roedy Green Guest

    The "next run" is not quite as simple as might think


    1. I often do runs on just one file or a small set of files.
    2. The run can abort.
    3. I can abort the run.
    4. There are some files I leave completely alone, and only rarely
    include them.
    5. If a file had errors, even if it were processed, I have to do it
    again to bring it to my attention, even if I have done nothing to
    correct the problems. Sometimes errors might be caused by something
    outside the file, say an include.

    One idea I am playing with is just to delete the cache file every time
    I compile, and not worry about trying to tell the cache software when
    the software last changed.
    Roedy Green, Apr 21, 2014
  10. Roedy Green

    Roedy Green Guest

    6. one more complication. If you process a file, and the result is the
    same as before, the timestamp must be left as is to avoid reloading
    the file to the website.
    Roedy Green, Apr 21, 2014
  11. You are not really talking about 1 MB of memory nowadays, are you?

    Btw., where did you get 15*2 from? File names take about 32 bytes plus
    twice the string length, depending on the platform (32/64 bit, JVM
    etc.). I also don't see why this second map would need no boxing in
    Java. And your timestamp shrank to 2 bytes.

    So you basically need a (persistent) decision what file names in which
    revisions are already processed. (Any version control software has
    similar requirements.)

    I would not recommend using your digest here. How would you ever be
    able to clean up the hash map? You don't know the hash codes of the
    vanished file names, so you can't know what to remove. And since your
    operation is not a LUW (logical unit of work) with respect to all
    files in a folder (as mentioned in your other post), you cannot simply
    remove all untouched keys during a run.

    Furthermore, the result of filename -> timestamp is far more
    human-readable than hash keys. This will simplify testing and
    serviceability. If you care about the MB of storage in the file
    system, then zip the content of your persistent information file (it
    will likely shrink by some factor of 2+x) or use a smarter file format
    that does not repeat the same strings over and over, e.g. grouping
    files under one folder header:
    Folder1
      File1 timestamp
      File2 timestamp
    This will save approx. another factor 2 with many files in deep
    folders.
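
    A sketch of the zipped variant using the JDK's GZIPOutputStream; the
    file name, separator, and layout here are only illustrative:

    import java.io.BufferedWriter;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import java.util.zip.GZIPOutputStream;

    public class ZippedCache
        {
        /** write filename -> timestamp pairs as gzipped tab-separated lines. */
        public static void save( Map<String, Long> cache, String fileName ) throws IOException
            {
            final BufferedWriter w = new BufferedWriter( new OutputStreamWriter(
                    new GZIPOutputStream( new FileOutputStream( fileName ) ),
                    StandardCharsets.UTF_8 ) );
            try
                {
                for ( Map.Entry<String, Long> e : cache.entrySet() )
                    {
                    w.write( e.getKey() + "\t" + e.getValue() );
                    w.newLine();
                    }
                }
            finally
                {
                w.close(); // also finishes the gzip stream.
                }
            }
        }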

    Marcel Müller, Apr 21, 2014
  12. Roedy Green

    Roedy Green Guest

    Here is my first cut at the code. I have not run it yet.

    It shows what I had in mind.

    /*
     * [Lazy.java]
     *
     * Summary: Lets us avoid the work of expanding macros if they were done successfully earlier.
     *
     * Copyright: (c) 2012-2014 Roedy Green, Canadian Mind Products.
     *
     * Licence: This software may be copied and used freely for any purpose but military.
     *          http://mindprod.com/contact/nonmil.html
     *
     * Requires: JDK 1.7+
     *
     * Created with: JetBrains IntelliJ IDEA IDE
     *
     * Version History:
     *  1.0 2014-04-21 initial version.
     */
    package com.mindprod.htmlmacros.support;

    import com.mindprod.common17.ST;

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.EOFException;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.util.HashMap;
    import java.util.Map.Entry;
    import java.util.concurrent.TimeUnit;
    import java.util.zip.Adler32;

    import static com.mindprod.htmlmacros.macro.Global.configuration;
    import static java.lang.System.err;

    /**
     * Lets us avoid the work of expanding macros if they were done
     * successfully earlier.
     *
     * @author Roedy Green, Canadian Mind Products
     * @version 1.0 2014-04-21 initial version
     * @since 2014-04-21
     */
    public class Lazy
        {
        /** digester used to hash filenames down to 32 bits. */
        private static final Adler32 digester = new Adler32();

        /** encoding for UTF-8. */
        private static final Charset UTF8Charset = Charset.forName( "UTF-8" );

        /** where the lazy cache entries are kept. */
        private static final String CACHE_FILE_NAME = "cache.bin";

        /** how many entries we are expecting in the lookup file. */
        private static final int size = 20000;

        /** root directory of the local websites. */
        private static final File root = new File( configuration.getLocalWebrootWithSlashes() );

        /** allow 5 seconds of slop in matching dates. */
        private static final long slop = TimeUnit.SECONDS.toMillis( 5 );

        /** lookup filename hash-32 to timestamp. */
        private static HashMap<Integer, Long> lookup;

        static
            {
            // load the cache now, and arrange for save to be run at shutdown.
            load();
            Runtime.getRuntime().addShutdownHook( new Thread()
                {
                public void run()
                    {
                    save();
                    }
                } );
            }

        // -------------------------- PUBLIC STATIC METHODS --------------------------

        /**
         * Has this file already been processed, and is it unchanged since that time?
         *
         * @param file file we are processing.
         *
         * @return true if the file has already been successfully processed
         *         and not modified since.
         */
        public static boolean isAlreadyDone( File file )
            {
            final int hash = calcHash( file );
            final Long timestampL = lookup.get( hash );
            // if no entry, it was not registered as done.
            if ( timestampL == null )
                {
                return false;
                }
            // if all is well, the last modified date should not have changed
            // since we recorded the file as successfully processed.
            if ( file.lastModified() > timestampL + slop )
                {
                // the file has been modified since we last processed it.
                // we will have to reprocess it. This cache entry is useless.
                // We might as well get rid of it to save some space.
                lookup.remove( hash );
                return false;
                }
            // it has not been touched since we last successfully processed it.
            // the cache entry is fine as is.
            return true;
            }

        /**
         * Mark the status of this file.
         *
         * @param file   file we are processing.
         * @param status true=file successfully processed, false=file was
         *               not successfully processed.
         */
        public static void markStatus( File file, boolean status )
            {
            final int hash = calcHash( file );
            if ( status )
                {
                // GOOD. If there already is an entry we must modify it.
                // If there is no entry, we must create one.
                // The easiest way to accomplish both is just to create a
                // new entry for that key.
                safePut( hash, file.lastModified() );
                }
            else
                {
                // BAD. Processing this file failed.
                // Get rid of any entry. There might not be one; remove does not mind.
                lookup.remove( hash );
                }
            }

        // -------------------------- PRIVATE STATIC METHODS --------------------------

        /**
         * load the previous contents of the lookup cache from CACHE_FILE_NAME.
         * It is a binary file of pairs hash-32, timestamp-64.
         */
        private static void load()
            {
            // load up the HashMap we use to track when files were last
            // successfully processed.
            final File cacheFile = new File( root, CACHE_FILE_NAME );
            // allow some padding to avoid rehashing as the map fills.
            lookup = new HashMap<>( size + size / 4 );
            // if no cache file, carry on with the empty lookup.
            // if the cache file exists, load pairs from it.
            if ( cacheFile.exists() && cacheFile.canRead() )
                {
                DataInputStream dis = null;
                try
                    {
                    // O P E N
                    final FileInputStream fis = new FileInputStream( cacheFile );
                    final BufferedInputStream bis = new BufferedInputStream( fis, 512 * 1024 );
                    dis = new DataInputStream( bis );
                    try
                        {
                        while ( true )
                            {
                            // R E A D pairs hash-32, timestamp-64
                            final int hash = dis.readInt();
                            final long timestamp = dis.readLong(); // was readInt in the first cut, a bug.
                            safePut( hash, timestamp );
                            } // end loop
                        } // end inner try
                    catch ( EOFException e )
                        {
                        // normal end of file. nothing to do.
                        }
                    // C L O S E
                    if ( dis != null )
                        {
                        dis.close();
                        }
                    } // end outer try
                catch ( IOException e )
                    {
                    err.println( ">>> Warning. Unable to read cache.bin file" );
                    err.println( ">>> clearing the cache.bin file" );
                    lookup.clear();
                    }
                } // end if
            } // end load

        /**
         * save the contents of the lookup cache into CACHE_FILE_NAME.
         * It is a binary file of pairs hash-32, timestamp-64.
         */
        private static void save()
            {
            final File cacheFile = new File( root, CACHE_FILE_NAME );
            // canWrite returns false for a file that does not exist yet,
            // so also allow creating the cache file fresh.
            if ( !cacheFile.exists() || cacheFile.canWrite() )
                {
                try
                    {
                    // O P E N
                    final FileOutputStream fos = new FileOutputStream( cacheFile, false /* append */ );
                    final BufferedOutputStream bos = new BufferedOutputStream( fos, 65536 /* 64K bytes */ );
                    final DataOutputStream dos = new DataOutputStream( bos );
                    for ( Entry<Integer, Long> entry : lookup.entrySet() )
                        {
                        // W R I T E pairs hash-32, timestamp-64
                        dos.writeInt( entry.getKey() );   // we write int and long,
                        dos.writeLong( entry.getValue() ); // not Integer and Long.
                        }
                    // C L O S E
                    dos.close();
                    }
                catch ( IOException e )
                    {
                    err.println( ">>> Warning. Unable to write cache.bin file " + e.getMessage() );
                    }
                } // end if
            else
                {
                err.println( ">>> Warning. Unable to write cache.bin file" );
                }
            } // end save

        /**
         * Put this hash : timestamp pair into the lookup.
         * However, if there is an existing entry, do not replace it if it
         * is older than this one. Collisions should be very rare, but when
         * they happen, keep the conservative older date of the pair. If one
         * of the pair needs to be reprocessed, both will be.
         *
         * @param hash      hash-32 of the file name
         * @param timestamp timestamp of file when it was last successfully processed.
         */
        private static void safePut( final int hash, final long timestamp )
            {
            final Long prevTimeStampL = lookup.put( hash, timestamp );
            if ( prevTimeStampL != null && prevTimeStampL < timestamp )
                {
                // put the older stamp back. This should happen very rarely.
                lookup.put( hash, prevTimeStampL );
                }
            }

        /**
         * calculate a hash-32 of the name of the file, not its contents.
         *
         * @param file file to be processed
         *
         * @return 32-bit Adler hash of the webroot-relative filename.
         */
        private static int calcHash( final File file )
            {
            // prune filename E:/mindprod/jgloss/jdk.html down to jgloss/jdk.
            // Tools is presumed to live in this package.
            final String chopped = ST.chopTrailingString(
                    Tools.webrootRelativeNameWithSlashes( file ), ".html" );
            final byte[] theTextToDigestAsBytes = chopped.getBytes( UTF8Charset );
            digester.reset(); // reset, else Adler32 accumulates state across calls, a bug in the first cut.
            digester.update( theTextToDigestAsBytes );
            return ( int ) digester.getValue();
            }
        } // end class Lazy
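
    Intended use would be something like this sketch, where process() is a
    hypothetical stand-in for the real macro expansion:

    import java.io.File;

    public class LazyDemo
        {
        /** hypothetical stand-in for the real per-file processing. */
        static boolean process( File file )
            {
            return true; // pretend processing always succeeds.
            }

        public static void main( String[] args )
            {
            final File[] filesToProcess = new File( "E:/mindprod" ).listFiles();
            if ( filesToProcess == null )
                {
                return;
                }
            for ( File file : filesToProcess )
                {
                if ( Lazy.isAlreadyDone( file ) )
                    {
                    continue; // unchanged since last successful run. skip it.
                    }
                Lazy.markStatus( file, process( file ) );
                }
            }
        }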
    Roedy Green, Apr 21, 2014
  13. Roedy Green

    markspace Guest

    Yeow, this is kinda terrible. My main thoughts are:

    1. Way too much static initialization. Just referencing the class in
    any fashion will cause the entire cache to be read into memory.

    2. Several of your static constants are ripe for a Look Up Service
    and/or Injection. Cf. Martin Fowler's discussion of these design
    patterns.


    3. You should leave yourself some wiggle room to move from flat files
    to a simple database in the future. Since it's your code you can just
    gut the thing and re-write it, but it doesn't look easy.
    Numbering package names == hard to remember. Use a better name for
    what's common in this package other than "17".
    You said it, not me, ;-)
    Shutdown hooks are pretty gross, imo. Is there a better way?
    Way too many static methods here. What happened to instantiating a
    class like normal folks?
    Same comment.
    There are a lot of side effects in this method, including removing stale
    entries. Consider refactoring this into smaller methods.
    I'm really curious as to what, exactly, you are caching here, and why.
    What's the point of keeping filenames in a hash map?

    I also think you're conflating a basic operation, like caching, with
    other operations that are specific to your app domain, like "prune
    filename E:/mindprod/jgloss/jdk.html down to jgloss/jdk". Filename
    manipulation belongs in a higher level of the code; the cache should
    just do caching.

    Cf. other cache designs, like the one built into Android.

    markspace, Apr 21, 2014
  14. Roedy Green


    I don't think this is correct. It's pretty clear intuitively that the
    probability of there being any collision is of the order of Nf^2/Nb,
    where Nb is the number of bins (e.g., 2^64) and Nf is the number of
    files (assuming Nf << Nb).

    Consider: the first file has 0 chance of being a collision. There's a
    1/Nb chance that the second file is a collision, something very close
    to 2/Nb that the third file is a collision, and so on, up to a ~Nf/Nb
    chance that the Nf'th file is a collision. So the sum of these is

    Sum(1,Nf)/Nb ~ Nf^2 / (2 Nb)

    where Sum(1,Nf) is defined as the sum of all integers from 1 to Nf.

    So that would indicate that the expectation value for collisions in a
    64-bit hash is ~2^(-17) <<< 1. See
    http://blog.shay.co/hash-collision-probability/ for a more rigorous
    treatment.

    Tom McGlynn, Apr 21, 2014
  15. Roedy Green


    Actually it's 20,000^2/2^65, which is more like 2^30/2^65 = 2^-35 ~
    3x10^-11, give or take a power of two or so. I switched ints and
    shorts for a second when writing the above.
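
    (Check: 20,000^2 = 4x10^8 and 2^65 ~ 3.7x10^19, so the expectation is
    ~1.1x10^-11, consistent within the stated power of two.)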

    Tom McGlynn, Apr 21, 2014
  16. Roedy Green

    Roedy Green Guest

    I was just counting the raw data bytes, not the boxing overhead.
    Roedy Green, Apr 22, 2014
  17. Roedy Green

    Roedy Green Guest

    I think my collision handling is a bit naive. I need to think about it
    a bit more. If I can get the number of collisions down to once every
    1000 years, I will live with it, since any software change flushes the
    cache.
    Roedy Green, Apr 22, 2014
  18. Roedy Green

    Roedy Green Guest

    It refers to common code that needs JDK 1.7 or higher to run.

    I used several of them (11, 13, 15), but now I have dropped the older
    ones.
    Roedy Green, Apr 22, 2014
  19. Roedy Green

    Roedy Green Guest

    If you have methods like load and save, you can make the client call
    them, but it is just one more thing to go wrong. What are the
    alternatives?
    Roedy Green, Apr 22, 2014
  20. Roedy Green

    Roedy Green Guest

    Because I am not prepared at this early stage to think about multiple
    instances. Before I released it, I would have to do that.

    A released version would need a callback for compacting the filename.
    It would need the name of the cache file as a parameter.

    I tend to think in concrete terms, then when all is working, gradually
    add the generalisations.
    Roedy Green, Apr 22, 2014
