64-bit hashing function

Discussion in 'Java' started by Roedy Green, Apr 20, 2014.

  1. Roedy Green

    Roedy Green Guest

    What is your favourite 64-bit hashing function/digest? This is not for
    cryptography. Ideally there is Java source available and it is fast.

    I will be using it to collapse filenames down. There is no malice or
    privacy to worry about.

    I would like similar filenames to generate different results.
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    "Don't worry about people stealing an idea; if it's original, you'll
    have to shove it down their throats."
    ~ Howard Aiken (born: 1900-03-08 died: 1973-03-14 at age: 73)
     
    Roedy Green, Apr 20, 2014
    #1

  2. Arne Vajhøj

    Arne Vajhøj Guest

    On 4/20/2014 6:15 PM, Roedy Green wrote:
    > What is your favourite 64-bit hashing function/digest? This is not for
    > cryptography. Ideally there is Java source available and it is fast.
    >
    > I will be using it to collapse filenames down. There is no malice or
    > privacy to worry about.
    >
    > I would like similar filenames to generate different results.


    String hashCode gives 32 bit values.

    MessageDigest.getInstance("MD5") gives 128 bit.

    64 bit is a funky size. Usually overkill for a hash table and
    too small for cryptographic usage.

    But if you want it then Google finds stuff like:

    https://github.com/vkandy/jenkins-hash-java
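    A 64-bit value can also be derived from a wider digest by truncation.
    A minimal sketch of that idea (the class and method names here are
    invented for illustration) folds the first 8 bytes of an MD5 digest
    into a long:

    ```java
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class Md5To64 {
        /** Fold the first 8 bytes of the 128-bit MD5 digest into a 64-bit long. */
        static long hash64(String s) {
            try {
                byte[] d = MessageDigest.getInstance("MD5")
                                        .digest(s.getBytes(StandardCharsets.UTF_8));
                long h = 0;
                for (int i = 0; i < 8; i++) {
                    h = (h << 8) | (d[i] & 0xFFL);   // append one digest byte
                }
                return h;
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError("MD5 is available on every JVM", e);
            }
        }

        public static void main(String[] args) {
            // similar filenames give unrelated 64-bit values
            System.out.printf("%016x%n", hash64("jgloss/jdk"));
            System.out.printf("%016x%n", hash64("jgloss/jdj"));
        }
    }
    ```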

    Arne
     
    Arne Vajhøj, Apr 21, 2014
    #2

  3. Roedy Green

    Roedy Green Guest

    On Sun, 20 Apr 2014 15:15:07 -0700, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >What is your favourite 64-bit hashing function/digest? This is not for
    >cryptography. Ideally there is Java source available and it is fast.
    >
    >I will be using it to collapse filenames down. There is no malice or
    >privacy to worry about.



    I have found a promising candidate called CityHash.

    CityHash 64-bit and others. Fast on 64 bit processors, available for

    https://code.google.com/p/guava-libraries/source/browse/guava/src/com/
    google/common/hash/Hashing.java?name=refs/remotes/gcode-clone/cityhash

    for Java and

    https://code.google.com/p/cityhash/

    for C. C versions use special hardware instructions available only on
    some processors.
     
    Roedy Green, Apr 21, 2014
    #3
  4. On 21.04.14 00.15, Roedy Green wrote:
    > What is your favourite 64-bit hashing function/digest? This is not for
    > cryptography. Ideally there is Java source available and it is fast.
    >
    > I will be using it to collapse filenames down. There is no malice or
    > privacy to worry about.
    >
    > I would like similar filenames to generate different results.


    If you can live with very rare collisions, a 32 bit hash will do as
    well. String.hashCode() will be fine in this case. It will usually not
    return the same values for similar strings.

    If you can't deal with collisions, even a longer hash might not hit the
    target. Hash codes always /may/ create collisions, although they are
    very rare with reasonable hashing algorithms.

    If you absolutely want to have 64 bits and you have no well-known good
    64 bit hash, combine two /different/ 32 bit hash codes instead, e.g.
    String.hashCode() and CRC32.
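    Marcel's two-hash suggestion might be sketched like this (the class
    and method names are invented; String.hashCode() fills the high 32
    bits, CRC32 of the UTF-8 bytes fills the low 32):

    ```java
    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class Combined64 {
        /** String.hashCode() in the high 32 bits, CRC32 in the low 32 bits. */
        static long hash64(String s) {
            CRC32 crc = new CRC32();
            crc.update(s.getBytes(StandardCharsets.UTF_8));
            // getValue() already returns an unsigned value in 0..2^32-1
            return ((long) s.hashCode() << 32) | crc.getValue();
        }

        public static void main(String[] args) {
            System.out.printf("%016x%n", hash64("jgloss/jdk"));
        }
    }
    ```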


    Marcel
     
    Marcel Müller, Apr 21, 2014
    #4
  5. Roedy Green

    Roedy Green Guest

    On Mon, 21 Apr 2014 09:19:52 +0200, Marcel Müller
    <> wrote, quoted or indirectly quoted
    someone who said :

    >If you can live with very rare collisions, a 32 bit hash will do as
    >well. String.hashCode() will be fine in this case. It will usually not
    >return the same values for similar strings.


    I thought I might collapse a filename to a 64 bit hash, then look it
    up in a traditional HashMap. I just wanted to keep the total size of
    the HashMap down.

    In essence, the problem is I want to look up by filename (suitably
    truncated) to get a Long timestamp, when the file was last
    successfully processed without errors. If the current date of the file
    matches that, I can bypass processing the file. If current date of the
    file, is later than that, I have to reprocess it, and update the
    timestamp.

    This is basically what I talked the HTML Validator people into
    implementing to drastically speed up batch validation. Now I am
    preparing to do it to my own HTMLMacros code. (They did not use a
    HashMap, but a 0-length matching indicator file with the
    last-successfully processed date, and let the OS do the lookup.)

    Let us say I write the code presuming collisions are impossible. What
    if one does happen? Then two files will share the same timestamp. It
    will contain the later of the two's timestamp. So if either file needs
    reprocessing, the scheme will think both do, hardly a catastrophe
    since I process everything all the time now.

    There must be a formula to estimate the number of collisions given the
    bit-size of the hash and the number of items you hash, presuming it is
    perfectly random. If not, I suppose I could write a one-shot to find
    out experimentally if 64 bits are needed.

    I studied probability at UBC, so I am quite capable of deriving the
    formula if it is not easily available.
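    The standard birthday-problem approximation puts the expected number
    of colliding pairs at n(n-1)/(2m) for n items hashed into m = 2^bits
    slots. A one-shot along those lines could look like this (class and
    method names invented for illustration):

    ```java
    public class CollisionEstimate {
        /** Expected colliding pairs: n(n-1)/2 pairs, each with chance 1/2^bits. */
        static double expectedCollisions(long n, int bits) {
            return (double) n * (n - 1) / 2.0 / Math.pow(2.0, bits);
        }

        public static void main(String[] args) {
            for (int bits : new int[] { 16, 32, 64 }) {
                System.out.printf("%2d-bit hash, 20,000 names: %.3g expected collisions%n",
                        bits, expectedCollisions(20000, bits));
            }
            // roughly 3.05e3 for 16 bits, 4.66e-2 for 32 bits, 1.08e-11 for 64 bits
        }
    }
    ```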
     
    Roedy Green, Apr 21, 2014
    #5
  6. Roedy Green

    Roedy Green Guest

    On Mon, 21 Apr 2014 00:51:05 -0700, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >I studied probability at UBC, so I am quite capable of deriving the
    >formula if it is not easily available.


    If the math is right,

    For a 16-bit hashing, 20,000 filenames, you could expect 50,796
    collisions.

    For a 32-bit hashing, 20,000 filenames, you could expect 4 collisions.
    For a 64-bit hashing, 20,000 filenames, you could expect 2 collisions.

    This is quite a surprise. 64-bit buys you almost nothing. That
    greatly simplifies my project.

    I will post the program I wrote at
    http://mindprod.com/jgloss/digest.html
    within the hour.
     
    Roedy Green, Apr 21, 2014
    #6
  7. On 21.04.14 09.51, Roedy Green wrote:
    > On Mon, 21 Apr 2014 09:19:52 +0200, Marcel Müller
    > <> wrote, quoted or indirectly quoted
    > someone who said :
    >
    >> If you can live with very rare collisions, a 32 bit hash will do as
    >> well. String.hashCode() will be fine in this case. It will usually not
    >> return the same values for similar strings.

    >
    > I thought I might collapse a filename to a 64 bit hash, then look it
    > up in a traditional HashMap. I just wanted to keep the total size of
    > the HashMap down.


    ? - The hash map takes the same storage if you hash file names directly.

    > In essence, the problem is I want to look up by filename (suitably
    > truncated) to get a Long timestamp, when the file was last
    > successfully processed without errors. If the current date of the file
    > matches that, I can bypass processing the file. If current date of the
    > file, is later than that, I have to reprocess it, and update the
    > timestamp.


    There is no map needed for that purpose at all.

    Simply take a single time stamp at program start that is not in the
    future and remember it for the next run. On the next run you need to
    process any files modified not before this time stamp. There is no
    benefit when you know exactly which time stamp of a particular file got
    processed.

    If you want to be sure that no file is processed twice then you need to
    process only the modification time stamp interval
    [last_timestamp,current_timestamp).

    But be aware of accuracy issues with file time stamps. Depending on the
    file system you might not get what you expect. I.e. if you processed
    up to 2013-04-21T10:20:01.152Z the file system may write the next file
    at 2013-04-21T10:20:00Z (e.g. FAT16) or 2013-04-21T10:20:01Z (e.g.
    ext2/3). Although the file is written /after/ your program ran, the
    time stamp is in the past from your point of view, most likely because
    of rounding issues. But even without rounding you might not see a
    change immediately during the directory scan. In fact you end up with
    a race condition.
    All this applies especially to remote file systems, which might behave
    entirely differently than you expect on a particular platform.

    A common workaround is to introduce a defined latency, i.e. at program
    start take a time stamp, say, 30 seconds in the past. This is used as
    the last-run time stamp the next time.

    Like any other make utility, this process gets messed up when the time
    stamps of the files are manipulated in an unexpected way, e.g. moving
    an old file into another location or replacing it with an old version
    of the same file. Especially on Windows, copying a file does not
    update the last-write time stamp. You should use the maximum of the
    last-write time stamp and the creation time stamp to get around this.
    Unfortunately the latter is not easily accessible in Java before
    version 7.
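    With Java 7's NIO, both time stamps are available through
    BasicFileAttributes. A sketch of taking the later of the two (the
    class and method names are invented for illustration):

    ```java
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.attribute.BasicFileAttributes;

    public class NewestStamp {
        /** Java 7+: the later of creation time and last-modified time, in millis. */
        static long effectiveTimestamp(Path p) throws IOException {
            BasicFileAttributes a = Files.readAttributes(p, BasicFileAttributes.class);
            return Math.max(a.creationTime().toMillis(),
                            a.lastModifiedTime().toMillis());
        }

        public static void main(String[] args) throws IOException {
            Path tmp = Files.createTempFile("stamp", ".txt");
            System.out.println(effectiveTimestamp(tmp));
            Files.delete(tmp);
        }
    }
    ```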


    > Let us say I write the code presuming collisions are impossible. What
    > if one does happen? Then two files will share the same timestamp. It
    > will contain the later of the two's timestamp. So if either file needs
    > reprocessing, the scheme will think both do, hardly a catastrophe
    > since I process everything all the time now.


    I still can't see any reasonable improvement of using hash codes instead
    of file names. If you need to process a file you need the file name anyway.


    > There must be a formula to estimate the number of collisions given the
    > bit-size of the hash and the number of items you hash presuming it is
    > perfectly random.


    Yes. Look for the Birthday Paradox at Wikipedia.

    The birthday is your hash code: only 23 people are required to get a
    50/50 chance of at least one collision. For a 32 bit int you need
    about 77,000 objects. All of this assumes an optimal hash function,
    which neither birthdays nor String.hashCode() are. So the real
    thresholds are somewhat lower.
    But you mentioned that a few collisions do not bother you at all.
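    The 77,000 figure follows from the usual approximation n = sqrt(2m ln 2)
    for a 50/50 chance of at least one collision among m possible hash
    values. A quick check (class and method names invented for
    illustration):

    ```java
    public class BirthdayBound {
        /** Items needed for a ~50% chance of at least one collision among 2^bits values. */
        static long fiftyPercentThreshold(int bits) {
            return Math.round(Math.sqrt(2.0 * Math.pow(2.0, bits) * Math.log(2.0)));
        }

        public static void main(String[] args) {
            System.out.println(fiftyPercentThreshold(32)); // about 77,000
            System.out.println(fiftyPercentThreshold(64)); // about 5 billion
        }
    }
    ```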

    > If not, I suppose I could write a one-shot to find
    > out experimentally if 64-bits are needed.


    There is neither a benefit from 64 bit in your case nor from using hash
    codes instead of file names.


    Marcel
     
    Marcel Müller, Apr 21, 2014
    #7
  8. Roedy Green

    Roedy Green Guest

    On Mon, 21 Apr 2014 12:51:32 +0200, Marcel Müller
    <> wrote, quoted or indirectly quoted
    someone who said :

    >I still can't see any reasonable improvement of using hash codes instead
    >of file names. If you need to process a file you need the file name anyway.


    If I used a HashMap of filename -> timestamp,
    I would have 20,000 x ( 15*2 + 8 ) = 760,000 bytes.

    If I use a HashMap of digest -> timestamp,
    I would have 20,000 x ( 4 + 2 ) = 120,000 bytes

    The difference is even more pronounced when you consider the overhead
    of wrapping each digest/timestamp in an object.

    I only need to look at one filename at a time. I always have the
    current filename. I don't need the HashMap to give me a filename.

    It sounds conceptually complicated, but it would really only add a few
    lines of code.

     
    Roedy Green, Apr 21, 2014
    #8
  9. Roedy Green

    Roedy Green Guest

    On Mon, 21 Apr 2014 12:51:32 +0200, Marcel Müller
    <> wrote, quoted or indirectly quoted
    someone who said :

    >Simply take a single time stamp at program start that is not in the
    >future and remember it for the next run


    The "next run" is not quite as simple as one might think.

    Complications:

    1. I often do runs on just one file or a small set of files.
    2. The run can abort.
    3. I can abort the run.
    4. There are some files I leave completely alone, and only rarely
    include them.
    5. If a file had errors, even if it were processed, I have to do it
    again to bring it to my attention, even if I have done nothing to
    correct the problems. Sometimes errors might be caused by something
    outside the file, say an include.

    One idea I am playing with is just to delete the cache file every time
    I compile, and not worry about trying to tell the cache software when
    the software last changed.
     
    Roedy Green, Apr 21, 2014
    #9
  10. Roedy Green

    Roedy Green Guest

    On Mon, 21 Apr 2014 06:38:52 -0700, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >5. if a file had errors, even if it were processed, I have to do it


    6. one more complication. If you process a file, and the result is the
    same as before, the timestamp must be left as is to avoid reloading
    the file to the website.
     
    Roedy Green, Apr 21, 2014
    #10
  11. On 21.04.14 15.31, Roedy Green wrote:
    > On Mon, 21 Apr 2014 12:51:32 +0200, Marcel Müller
    > <> wrote, quoted or indirectly quoted
    > someone who said :
    >
    >> I still can't see any reasonable improvement of using hash codes instead
    >> of file names. If you need to process a file you need the file name anyway.

    >
    > If I used a HashMap of filename -> timestamp,
    > I would have 20,000 x ( 15*2 + 8 ) = 760,000 bytes.


    You are not really talking about 1 MB of memory nowadays, are you?

    By the way, where did you get 15*2 from? File names take about 32
    bytes plus twice the string length, depending on the platform (32/64
    bit, JVM, etc.).

    > If I use a HashMap of digest -> timestamp,
    > I would have 20,000 x ( 4 + 2 ) = 120,000 bytes
    >
    > The difference is even more pronounced when you consider the overhead
    > of wrapping each digest/timestamp in an object.


    I don't see why this second map would need no boxing in Java. Also,
    your timestamps shrank to 2 bytes.

    > I only need look at one filename at time. I always have the current
    > filename. I don't need the HashMap to give me a filename.


    OK.

    So you basically need a (persistent) record of which file names in
    which revisions are already processed. (Any version control software
    has similar requirements.)

    I would not recommend using your digest here. How would you ever be
    able to clean up the hash map? You don't know the hash codes of the
    vanished file names, so you can't know what to remove. And since your
    operation is not a LUW (logical unit of work) with respect to all
    files in a folder (as mentioned in your other post), you cannot simply
    remove all untouched keys during a run.

    Furthermore, filename -> timestamp is far more human readable than
    hash keys. This will simplify testing and serviceability.
    If you care about the MB of storage in the file system, then zip the
    content of your persistent information file (it will likely shrink by
    a factor of 2 or more) or use a smarter file format that does not
    repeat the same strings over and over. E.g.
    Folder1
    File1 timestamp
    File2 timestamp
    Folder2
    ...
    This will save approximately another factor of 2 with many files in
    deep folders.
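    A writer for that folder-grouped format might be sketched like this
    (class and method names are invented; paths are assumed to use
    forward slashes as in the thread):

    ```java
    import java.util.Map;
    import java.util.TreeMap;

    public class GroupedCacheFormat {
        /** Render filename -> timestamp pairs so each folder name is written only once. */
        static String render(Map<String, Long> stamps) {
            final Map<String, Map<String, Long>> byFolder = new TreeMap<>();
            for (Map.Entry<String, Long> e : stamps.entrySet()) {
                final int slash = e.getKey().lastIndexOf('/');
                final String folder = slash < 0 ? "" : e.getKey().substring(0, slash);
                final String file = e.getKey().substring(slash + 1);
                Map<String, Long> files = byFolder.get(folder);
                if (files == null) {
                    files = new TreeMap<>();
                    byFolder.put(folder, files);
                }
                files.put(file, e.getValue());
            }
            final StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, Map<String, Long>> f : byFolder.entrySet()) {
                sb.append(f.getKey()).append('\n');
                for (Map.Entry<String, Long> e : f.getValue().entrySet()) {
                    sb.append("  ").append(e.getKey()).append(' ')
                      .append(e.getValue()).append('\n');
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            final Map<String, Long> m = new TreeMap<>();
            m.put("jgloss/jdk", 1398033900000L);
            m.put("jgloss/adler", 1398033901000L);
            System.out.print(render(m));
        }
    }
    ```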


    Marcel
     
    Marcel Müller, Apr 21, 2014
    #11
  12. Roedy Green

    Roedy Green Guest

    On Mon, 21 Apr 2014 12:51:32 +0200, Marcel Müller
    <> wrote, quoted or indirectly quoted
    someone who said :

    >There is no map needed for that purpose at all


    Here is my first cut at the code. I have not run it yet.

    It shows what I had in mind.

    /*
     * [Lazy.java]
     *
     * Summary: Lets us avoid the work of expanding macros if they were done successfully earlier.
     *
     * Copyright: (c) 2012-2014 Roedy Green, Canadian Mind Products, http://mindprod.com
     *
     * Licence: This software may be copied and used freely for any purpose but military.
     * http://mindprod.com/contact/nonmil.html
     *
     * Requires: JDK 1.7+
     *
     * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
     *
     * Version History:
     * 1.0 2014-04-21 initial version.
     */
    package com.mindprod.htmlmacros.support;

    import com.mindprod.common17.ST;

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.EOFException;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.util.HashMap;
    import java.util.Map.Entry;
    import java.util.concurrent.TimeUnit;
    import java.util.zip.Adler32;

    import static com.mindprod.htmlmacros.macro.Global.configuration;
    import static java.lang.System.err;

    /**
     * Lets us avoid the work of expanding macros if they were done successfully earlier.
     *
     * @author Roedy Green, Canadian Mind Products
     * @version 1.0 2014-04-21 initial version
     * @since 2014-04-21
     */
    public class Lazy
    {
        static final Adler32 digester = new Adler32();

        /**
         * encoding for UTF-8
         */
        private static final Charset UTF8Charset = Charset.forName( "UTF-8" );

        /**
         * where the lazy cache is kept.
         */
        private static final String CACHE_FILE_NAME = "embellishment/cache.bin";

        /**
         * how many entries we are expecting in the lookup file.
         */
        private static final int size = 20000;

        /**
         * root directory of the local websites
         */
        private static final File root = new File( configuration.getLocalWebrootWithSlashes() );

        /**
         * allow 5 seconds of slop in matching dates
         */
        private static final long slop = TimeUnit.SECONDS.toMillis( 5 );

        /**
         * look up filename hash-32 to timestamp
         */
        private static HashMap<Integer, Long> lookup;

        static
        {
            load();
        }

        static
        {
            // arrange for save to be run at shutdown.
            Runtime.getRuntime().addShutdownHook( new Thread()
            {
                public void run()
                {
                    save();
                }
            } );
        }

        /**
         * load the previous contents of the lookup cache from embellishment/cache.bin.
         * It is a binary file of pairs hash-32, timestamp-64.
         */
        private static void load()
        {
            final File cacheFile = new File( root, CACHE_FILE_NAME );
            // allow some padding to avoid collisions
            lookup = new HashMap<>( size + size / 4 );
            // if there is no cache file, carry on with an empty lookup.
            // if the cache file exists, load pairs from it.
            if ( cacheFile.exists() && cacheFile.canRead() )
            {
                DataInputStream dis = null;
                try
                {
                    try
                    {
                        // O P E N
                        final FileInputStream fis = new FileInputStream( cacheFile );
                        final BufferedInputStream bis = new BufferedInputStream( fis, 512 * 1024 );
                        dis = new DataInputStream( bis );
                        while ( true )
                        {
                            // R E A D pairs hash-32, timestamp-64
                            final int hash = dis.readInt();
                            // readLong, not readInt: save writes the timestamp as a long.
                            final long timestamp = dis.readLong();
                            safePut( hash, timestamp );
                        } // end loop
                    } // end inner try
                    catch ( EOFException e )
                    {
                        // normal end of file; nothing to do
                    }
                    finally
                    {
                        // C L O S E
                        if ( dis != null )
                        {
                            dis.close();
                        }
                    }
                } // end outer try
                catch ( IOException e )
                {
                    err.println( ">>> Warning. Unable to read cache.bin file" );
                }
            } // end if
            else
            {
                err.println( ">>> clearing the cache.bin file" );
            }
        } // end load

        /**
         * save the contents of the lookup cache into embellishment/cache.bin.
         * It is a binary file of pairs hash-32, timestamp-64.
         */
        private static void save()
        {
            final File cacheFile = new File( root, CACHE_FILE_NAME );
            // canWrite alone would return false for a not-yet-existing cache file.
            if ( !cacheFile.exists() || cacheFile.canWrite() )
            {
                try
                {
                    // O P E N
                    final FileOutputStream fos = new FileOutputStream( cacheFile, false /* append */ );
                    final BufferedOutputStream bos = new BufferedOutputStream( fos, 65536 /* 64K bytes */ );
                    final DataOutputStream dos = new DataOutputStream( bos );
                    for ( Entry<Integer, Long> entry : lookup.entrySet() )
                    {
                        // W R I T E
                        final int hash = entry.getKey();
                        final long timestamp = entry.getValue();
                        dos.writeInt( hash ); // we write int and long, not Integer and Long.
                        dos.writeLong( timestamp );
                    }
                    // C L O S E
                    dos.close();
                }
                catch ( IOException e )
                {
                    err.println( ">>> Warning. Unable to write cache.bin file " + e.getMessage() );
                }
            } // end if
            else
            {
                err.println( ">>> Warning. Unable to write cache.bin file" );
            }
        } // end save

        // -------------------------- PUBLIC STATIC METHODS --------------------------

        /**
         * Has this file already been processed, and is it unchanged since that time?
         *
         * @param file file we are processing.
         *
         * @return true if the file has already been successfully processed.
         */
        public static boolean isAlreadyDone( File file )
        {
            final int hash = calcHash( file );
            final Long timestampL = lookup.get( hash );
            // if no entry, it was not registered as done.
            if ( timestampL == null )
            {
                return false;
            }
            // if all is well, the last-modified date should not have changed
            // since we recorded the file as successfully processed.
            if ( file.lastModified() > timestampL + slop )
            {
                // the file has been modified since we last processed it.
                // we will have to reprocess it.
                // This cache entry is useless. We might as well get rid of it to save some space.
                lookup.remove( hash );
                return false;
            }
            else
            {
                // it has not been touched since we last successfully processed it.
                // the cache entry is fine as is.
                return true;
            }
        }

        /**
         * Mark the status of this file.
         *
         * @param file   file we are processing.
         * @param status true=file successfully processed, false=file was not successfully processed.
         */
        public static void markStatus( File file, boolean status )
        {
            final int hash = calcHash( file );
            if ( status )
            {
                // GOOD
                // If there already is an entry we must modify it.
                // If there is no entry, we must create one.
                // The easiest way to accomplish this is just to create a new entry for that key.
                // Collisions should be very rare, but when they happen keep the
                // conservative older date of the pair.
                // If one of the pair needs to be reprocessed, both will be.
                final long timestamp = file.lastModified();
                safePut( hash, timestamp );
            }
            else
            {
                // BAD
                // processing this file failed.
                // get rid of any entry. There might not be one. remove does not mind.
                lookup.remove( hash );
            }
        }

        /**
         * Put this hash : timestamp pair into the lookup.
         * However, if there is an existing entry, do not replace it if it is older than this one.
         *
         * @param hash      hash-32 of the file name
         * @param timestamp timestamp of file when it was last successfully processed.
         */
        private static void safePut( final int hash, final long timestamp )
        {
            final Long prevTimeStampL = lookup.put( hash, timestamp );
            // collisions should be very rare, but when they happen keep the
            // conservative older date of the pair.
            // if one of the pair needs to be reprocessed, both will be.
            if ( prevTimeStampL != null && prevTimeStampL < timestamp )
            {
                // put the older timestamp back. This should happen very rarely.
                lookup.put( hash, prevTimeStampL );
            }
        }

        /**
         * Calculate a hash-32 of the name of the file, not its contents.
         *
         * @param file file to be processed
         *
         * @return 32-bit Adler hash.
         */
        private static int calcHash( final File file )
        {
            // prune filename E:/mindprod/jgloss/jdk.html down to jgloss/jdk
            final String chopped = ST.chopTrailingString(
                    Tools.webrootRelativeNameWithSlashes( file ), ".html" );
            final byte[] theTextToDigestAsBytes = chopped.getBytes( UTF8Charset );
            digester.reset();
            digester.update( theTextToDigestAsBytes );
            return ( int ) digester.getValue();
        }
    }
     
    Roedy Green, Apr 21, 2014
    #12
  13. Roedy Green

    markspace Guest

    On 4/21/2014 10:09 AM, Roedy Green wrote:
    >
    > here is my first cut at the code. I have not run it yet


    Yeow, this is kinda terrible. My main thoughts are:

    1. Way too much static initialization. Just referencing the class in
    any fashion will cause the entire cache to be read into memory.

    2. Several of your static constants are ripe for a Look Up Service
    and/or Injection. Cf Martin Fowler's discussion of these design patterns.

    <http://martinfowler.com/articles/injection.html>

    3. You should leave yourself some wiggle room to move from flat files to
    a simple database in the future. Since it's your code you can just gut
    the thing and re-write it but it doesn't look easy.

    > import com.mindprod.common17.ST;


    Numbering package names == hard to remember. Use a better name for
    what's common in this package other than "17".

    > public class Lazy


    You said it, not me, ;-)

    > {
    > static
    > {
    > load();
    > }


    Evil!!!

    >
    > static
    > {
    >
    > // arrange for save to be run at shutdown.
    > Runtime.getRuntime().addShutdownHook( new Thread()


    Shutdown hooks are pretty gross, imo. Is there a better way?

    > /**
    > * look up filename hash-32 to timestamp
    > */
    > private static HashMap<Integer, Long> lookup;


    ???

    > private static void load()
    > {


    Way too many static methods here. What happened to instantiating a
    class like normal folks?

    > private static void save()


    Same comment.

    > public static boolean isAlreadyDone( File file )
    > {


    There are a lot of side effects in this method, including removing stale
    entries. Consider refactoring this into smaller methods.

    >
    > // if all is well, the last modified date should not have changed
    > // since we recorded the file as successfully processed.
    > if ( file.lastModified() > timestampL + slop )
    > {
    >     // the file has been modified since we last processed it.
    >     // we will have to reprocess it.
    >     // This cache entry is useless. We might as well get rid of it
    >     // to save some space.


    I'm really curious as to what, exactly, you are caching here, and why.
    What's the point of keeping filenames in a hash map?

    I also think you're conflating a basic operation, like caching, with
    other operations that are specific to your app domain, like "prune
    filename E:/mindprod/jgloss/jdk.html down to jgloss/jdk". Filename
    manipulation belongs in a higher level of the code; the cache should
    just do caching.

    Cf. other cache designs, like the one built into Android:

    <http://developer.android.com/reference/android/util/LruCache.html>
     
    markspace, Apr 21, 2014
    #13
  14. Tom McGlynn

    Tom McGlynn Guest

    On Monday, April 21, 2014 5:05:18 AM UTC-4, Roedy Green wrote:
    > On Mon, 21 Apr 2014 00:51:05 -0700, Roedy Green
    >
    > <> wrote, quoted or indirectly quoted
    > someone who said :
    >
    > >I studied probability at UBC, so I am quite capable of deriving the
    > >formula if it is not easily available.

    >
    > If the math is right,
    > For a 16-bit hashing, 20,000 filenames, you could expect 50,796
    >
    > collisions.
    >
    > For a 32-bit hashing, 20,000 filenames, you could expect 4 collisions.
    > For a 64-bit hashing, 20,000 filenames, you could expect 2 collisions.
    >
    >

    I don't think this is correct. It's pretty clear intuitively that the probability of there being any collisions is something of the order of Nf^2/Nb,
    where Nb is the number of bins (e.g., 2^64) and Nf is the number of files.
    (Assuming Nf << Nb.)

    Consider: The first file has 0 chance of being a collision. There's 1/Nb chance that the second file is a collision, something very close to 2/Nb that the third file is a collision, and ... ~Nf/Nb chance that the Nf'th file is a collision. So the sum of these is

    Sum (1,Nf)/Nb ~ Nf^2 / (2 Nb).

    Here Sum(1,Nf) is defined as the sum of all integers from 1 to Nf.

    So that would indicate that the expectation value for collisions in a 64 bit hash is ~2^(-17) <<< 1. See http://blog.shay.co/hash-collision-probability/ to do this more rigorously.


    Regards,
    Tom McGlynn



    >
    > This is quite a surprise. 64-bit buys you almost nothing. That
    >
    > greatly simplifies my project.
    >
    >
    >
    > I will post the program I wrote at
    >
    > http://mindprod.com/jgloss/digest.html
    >
    > within the hour.
    >
    > --
    >
    > Roedy Green Canadian Mind Products http://mindprod.com
    >
     
    , Apr 21, 2014
    #14
  15. Tom McGlynn

    Tom McGlynn Guest

    On Monday, April 21, 2014 2:49:20 PM UTC-4, wrote:
    ....
    > ... a 64 bit hash is ~2^(-17) <<< 1.

    Actually it's 20,000^2/2^65 which is more like 2^30/2^65 = 2^-35 ~ 3x10^-11 give or take a power of two or so. Switched ints and shorts for a second writing the above.

    Tom
     
    , Apr 21, 2014
    #15
  16. Roedy Green

    Roedy Green Guest

    On Mon, 21 Apr 2014 16:40:14 +0200, Marcel Müller
    <> wrote, quoted or indirectly quoted
    someone who said :

    >I don't see why this second map needs no boxing in Java. Also your
    >timestamp shrink to 2 bytes.


    I was just counting the raw data bytes, not the boxing overhead.
     
    Roedy Green, Apr 22, 2014
    #16
  17. Roedy Green

    Roedy Green Guest

    On Mon, 21 Apr 2014 10:09:04 -0700, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >here is my first cut at the code. I have not run it yet


    I think my collision handling is a bit naive. I need to think about it
    a bit more. If I can get the expected collision rate down to once every
    1000 years, I will live with it, since any software change flushes the
    cache.
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    "Don't worry about people stealing an idea; if it's original, you'll
    have to shove it down their throats."
    ~ Howard Aiken (born: 1900-03-08 died: 1973-03-14 at age: 73)
     
    Roedy Green, Apr 22, 2014
    #17
  18. Roedy Green

    Roedy Green Guest

    On Mon, 21 Apr 2014 11:46:06 -0700, markspace
    <> wrote, quoted or indirectly quoted someone
    who said :

    >Numbering package names == hard to remember. Use a better name for
    >what's common in this package other than "17".


    It refers to common code that needs JDK 1.7 or higher to run.

    I used several of them (11, 13, 15), but now I have dropped the older
    ones.
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    "Don't worry about people stealing an idea; if it's original, you'll
    have to shove it down their throats."
    ~ Howard Aiken (born: 1900-03-08 died: 1973-03-14 at age: 73)
     
    Roedy Green, Apr 22, 2014
    #18
  19. Roedy Green

    Roedy Green Guest

    On Mon, 21 Apr 2014 11:46:06 -0700, markspace
    <> wrote, quoted or indirectly quoted someone
    who said :

    >1. Way too much static initialization. Just referencing the class in
    >any fashion will cause the entire cache to be read into memory.


    If you have methods like load and save, you can make the client call
    them, but that is just one more thing to go wrong. What are the
    alternatives?
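    One standard alternative is the initialization-on-demand holder idiom, which defers loading until the cache is actually used rather than when the class is merely referenced. A minimal sketch; the class and method names here are hypothetical, not from the posted program:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the initialization-on-demand holder idiom: the cache is loaded
// only when getInstance() is first called, not on class reference.
public class FilenameCache {

    private final Map<String, Long> cache = new HashMap<>();

    private FilenameCache() {
        // a real implementation would read the cache file here (load)
    }

    private static class Holder {
        // the JVM initializes Holder, and thus INSTANCE, on first access
        static final FilenameCache INSTANCE = new FilenameCache();
    }

    public static FilenameCache getInstance() {
        return Holder.INSTANCE;
    }

    public Long lookup(String filename) {
        return cache.get(filename);
    }
}
```

    The JVM guarantees thread-safe, lazy initialization of the nested holder class, so no explicit synchronization or load/save calls are needed on the client side.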
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    "Don't worry about people stealing an idea; if it's original, you'll
    have to shove it down their throats."
    ~ Howard Aiken (born: 1900-03-08 died: 1973-03-14 at age: 73)
     
    Roedy Green, Apr 22, 2014
    #19
  20. Roedy Green

    Roedy Green Guest

    On Mon, 21 Apr 2014 11:46:06 -0700, markspace
    <> wrote, quoted or indirectly quoted someone
    who said :

    >Way too many static methods here. What happened to instantiating a
    >class like normal folks?


    Because I am not prepared at this early stage to think about multiple
    instances. Before I released it, I would have to do that.

    A released version would need a callback for compacting the filename.
    It would need the name of the cache file as a parameter.

    I tend to think in concrete terms, then when all is working, gradually
    add the generalisations.
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    "Don't worry about people stealing an idea; if it's original, you'll
    have to shove it down their throats."
    ~ Howard Aiken (born: 1900-03-08 died: 1973-03-14 at age: 73)
     
    Roedy Green, Apr 22, 2014
    #20