64-bit hashing function

Roedy Green

What is your favourite 64-bit hashing function/digest? This is not for
cryptography. Ideally there is Java source available and it is fast.

I will be using it to collapse filenames down. There is no malice or
privacy to worry about.

I would like similar filenames to generate different results.
 
Arne Vajhøj

What is your favourite 64-bit hashing function/digest? This is not for
cryptography. Ideally there is Java source available and it is fast.

I will be using it to collapse filenames down. There is no malice or
privacy to worry about.

I would like similar filenames to generate different results.

String hashCode gives 32 bit values.

MessageDigest.getInstance("MD5") gives 128 bit.

64 bit is a funky size. Usually overkill for a hash table and
too small for cryptographic usage.

But if you want it then Google finds stuff like:

https://github.com/vkandy/jenkins-hash-java
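Another zero-dependency route, since this is explicitly not for cryptography, is to truncate the 128-bit MD5 digest to its first 8 bytes. A quick sketch (class name is mine):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Trunc64
{
    /** Take the first 8 bytes of an MD5 digest as a 64-bit hash. */
    static long hash64( String s )
    {
        try
        {
            MessageDigest md = MessageDigest.getInstance( "MD5" );
            byte[] d = md.digest( s.getBytes( StandardCharsets.UTF_8 ) );
            long h = 0;
            for ( int i = 0; i < 8; i++ )
            {
                // accumulate big-endian: first digest byte ends up in the high bits
                h = ( h << 8 ) | ( d[ i ] & 0xFFL );
            }
            return h;
        }
        catch ( NoSuchAlgorithmException e )
        {
            throw new AssertionError( e ); // every JVM is required to provide MD5
        }
    }

    public static void main( String[] args )
    {
        System.out.println( Long.toHexString( hash64( "jgloss/jdk" ) ) );
    }
}
```

Slower than a purpose-built 64-bit hash, but well mixed, and similar filenames land far apart.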

Arne
 
Roedy Green

What is your favourite 64-bit hashing function/digest? This is not for
cryptography. Ideally there is Java source available and it is fast.

I will be using it to collapse filenames down. There is no malice or
privacy to worry about.


I have found a promising candidate called CityHash. It offers 64-bit and other sizes and is fast on 64-bit processors. A Java port is available at

https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/Hashing.java?name=refs/remotes/gcode-clone/cityhash

and the original at

https://code.google.com/p/cityhash/

for C++. The C++ versions use special hardware instructions available only on some processors.
 
Marcel Müller

What is your favourite 64-bit hashing function/digest? This is not for
cryptography. Ideally there is Java source available and it is fast.

I will be using it to collapse filenames down. There is no malice or
privacy to worry about.

I would like similar filenames to generate different results.

If you can live with very rare collisions, a 32-bit hash will do as well. String.hashCode will be fine in this case. It will usually not return the same values for similar strings.

If you can't deal with collisions, even a longer hash might not hit the target. Hash codes always /may/ create collisions, although they are very rare for reasonable hashing algorithms.

If you absolutely want to have 64 bits and you have no well-known good 64-bit hash, use two /different/ 32-bit hash codes instead, e.g. String.hashCode and CRC32.
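Marcel's two-hash idea can be sketched with the JDK alone. Here String.hashCode fills the high 32 bits and CRC32 the low 32 (class and method names are mine):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class TwoHash64
{
    /** Combine String.hashCode (high 32 bits) with CRC32 (low 32 bits) into one long. */
    static long hash64( String s )
    {
        CRC32 crc = new CRC32();
        crc.update( s.getBytes( StandardCharsets.UTF_8 ) );
        // getValue() already returns the CRC in the low 32 bits of a long
        return ( ( long ) s.hashCode() << 32 ) | crc.getValue();
    }

    public static void main( String[] args )
    {
        System.out.println( Long.toHexString( hash64( "jgloss/jdk" ) ) );
        System.out.println( Long.toHexString( hash64( "jgloss/jdl" ) ) );
    }
}
```

The two component hashes use unrelated mixing, so a pair of strings must collide in both at once, which is what makes the combination plausible as a 64-bit substitute.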


Marcel
 
Roedy Green

If you can live with very rare collisions, a 32-bit hash will do as well. String.hashCode will be fine in this case. It will usually not return the same values for similar strings.

I thought I might collapse a filename to a 64 bit hash, then look it
up in a traditional HashMap. I just wanted to keep the total size of
the HashMap down.

In essence, the problem is I want to look up by filename (suitably truncated) to get a Long timestamp: when the file was last successfully processed without errors. If the current date of the file matches that, I can bypass processing the file. If the current date of the file is later than that, I have to reprocess it and update the timestamp.

This is basically what I talked the HTML Validator people into
implementing to drastically speed up batch validation. Now I am
preparing to do it to my own HTMLMacros code. (They did not use a
HashMap, but a 0-length matching indicator file with the
last-successfully processed date, and let the OS do the lookup.)

Let us say I write the code presuming collisions are impossible. What
if one does happen? Then two files will share the same timestamp. It
will contain the later of the two's timestamp. So if either file needs
reprocessing, the scheme will think both do, hardly a catastrophe
since I process everything all the time now.

There must be a formula to estimate the number of collisions given the bit-size of the hash and the number of items you hash, presuming it is perfectly random. If not, I suppose I could write a one-shot to find out experimentally if 64 bits are needed.

I studied probability at UBC, so I am quite capable of deriving the
formula if it is not easily available.
 
Roedy Green

I studied probability at UBC, so I am quite capable of deriving the
formula if it is not easily available.

If the math is right,

For a 16-bit hashing, 20,000 filenames, you could expect 50,796
collisions.

For a 32-bit hashing, 20,000 filenames, you could expect 4 collisions.
For a 64-bit hashing, 20,000 filenames, you could expect 2 collisions.

This is quite a surprise. 64-bit buys you almost nothing. That
greatly simplifies my project.

I will post the program I wrote at
http://mindprod.com/jgloss/digest.html
within the hour.
 
Marcel Müller

I thought I might collapse a filename to a 64 bit hash, then look it
up in a traditional HashMap. I just wanted to keep the total size of
the HashMap down.

? - The hash map takes the same storage if you hash file names directly.
In essence, the problem is I want to look up by filename (suitably
truncated) to get a Long timestamp, when the file was last
successfully processed without errors. If the current date of the file
matches that, I can bypass processing the file. If current date of the
file, is later than that, I have to reprocess it, and update the
timestamp.

There is no map needed for that purpose at all.

Simply take a single time stamp at program start that is not in the future and remember it for the next run. On the next run you need to process any files modified not before this time stamp. There is no benefit in knowing exactly which time stamp of each particular file got processed.

If you want to be sure that no file is processed twice then you need to
process only the modification time stamp interval
[last_timestamp,current_timestamp).

But be aware of accuracy issues with file time stamps. Depending on the file system you might not get what you expect. I.e. if you processed up to 2013-04-21T10:20:01.152Z, the file system may write the next file at 2013-04-21T10:20:00Z (e.g. FAT16) or 2013-04-21T10:20:01Z (e.g. ext2/3). Although the file is written /after/ your program ran, the time stamp is in the past from your point of view, most likely because of rounding issues. But even without rounding you might not see a change immediately during the directory scan. In fact you end up with a race condition.
All this especially applies to remote file systems, which might behave entirely differently than you expect on a particular platform.

A common workaround is to introduce a defined latency, i.e. take a time stamp, let's say, 30 seconds in the past at program start. This is used as the last-run stamp the next time.

Like any other make utility, this process gets messed up when the time stamps of the files are manipulated in an unexpected way, i.e. moving an old file into another location or replacing it with an old version of the same file. Especially on Windows, copying a file does not update the last write time stamp. You should use the maximum of the last write time stamp and the creation time stamp to get around this. Unfortunately the latter is not very well accessible in Java before version 7.
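A minimal sketch of that single-timestamp scheme, assuming the half-open interval plus the 30-second latency described above (class and method names are mine; persisting the stamp is left as a comment):

```java
import java.util.concurrent.TimeUnit;

public class LastRunScheme
{
    static final long LATENCY = TimeUnit.SECONDS.toMillis( 30 );

    /** half-open interval test: process files with modification time in [lastRun, currentRun) */
    static boolean inInterval( long modified, long lastRun, long currentRun )
    {
        return modified >= lastRun && modified < currentRun;
    }

    public static void main( String[] args )
    {
        // take the run stamp deliberately 30 s in the past to dodge
        // file-system rounding and in-flight writes.
        long currentRun = System.currentTimeMillis() - LATENCY;
        long lastRun = 0L; // in real use: loaded from wherever the previous run saved it
        // ... scan the tree, process each file f where
        //     inInterval( f.lastModified(), lastRun, currentRun ),
        // then persist currentRun as the next run's lastRun.
    }
}
```

The half-open interval guarantees no file is processed twice across runs, which is exactly the property the map-based design buys at much higher cost.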

Let us say I write the code presuming collisions are impossible. What
if one does happen? Then two files will share the same timestamp. It
will contain the later of the two's timestamp. So if either file needs
reprocessing, the scheme will think both do, hardly a catastrophe
since I process everything all the time now.

I still can't see any reasonable improvement from using hash codes instead of file names. If you need to process a file you need the file name anyway.

There must be a formula to estimate the number of collisions given the
bit-size of the hash and the number of items you hash presuming it is
perfectly random.

Yes. Look for the Birthday Paradox at Wikipedia.

The birthday is your hash code, and only 23 people are required to get a 50/50 chance of at least one collision. For a 32-bit int you need about 77,000 objects. All of this assumes an optimal hash function, which neither the birthday nor String.hashCode is, so the real values are somewhat lower.
But you mentioned that a few collisions do not bother you at all.
If not, I suppose I could write a one-shot to find
out experimentally if 64-bits are needed.

There is neither a benefit from 64 bit in your case nor from using hash
codes instead of file names.


Marcel
 
Roedy Green

I still can't see any reasonable improvement from using hash codes instead of file names. If you need to process a file you need the file name anyway.

If I used a HashMap of filename -> timestamp,
I would have 20,000 x ( 15*2 + 8 ) = 767,436 bytes.

If I use a HashMap of digest -> timestamp,
I would have 20,000 x ( 4 + 2 ) = 120,000 bytes

The difference is even more pronounced when you consider the overhead
of wrapping each digest/timestamp in an object.

I only need to look at one filename at a time. I always have the current filename. I don't need the HashMap to give me a filename.

It sounds conceptually complicated, but it would really only add a few lines of code.
 
Roedy Green

Simply take a single time stamp at program start that is not in the
future and remember it for the next run

The "next run" is not quite as simple as one might think.

Complications:

1. I often do runs on just one file or a small set of files.
2. The run can abort.
3. I can abort the run.
4. There are some files I leave completely alone, and only rarely include them.
5. If a file had errors, even if it were processed, I have to do it again to bring it to my attention, even if I have done nothing to correct the problems. Sometimes errors might be caused by something outside the file, say an include.

One idea I am playing with is just to delete the cache file every time
I compile, and not worry about trying to tell the cache software when
the software last changed.
 
Roedy Green

5. if a file had errors, even if it were processed, I have to do it

6. One more complication: if you process a file and the result is the same as before, the timestamp must be left as is, to avoid reloading the file to the website.
 
Marcel Müller

If I used a HashMap of filename -> timestamp,
I would have 20,000 x ( 15*2 + 8 ) = 767,436 bytes.

You are not really talking about 1 MB of memory nowadays, are you?

Btw. Where did you get 15*2 from? File names take about 32 bytes + twice
the string length, depending on the platform (32/64 bit, JVM etc.).
If I use a HashMap of digest -> timestamp,
I would have 20,000 x ( 4 + 2 ) = 120,000 bytes

The difference is even more pronounced when you consider the overhead
of wrapping each digest/timestamp in an object.

I don't see why this second map would need no boxing in Java. Also, your timestamp shrank to 2 bytes.
I only need look at one filename at time. I always have the current
filename. I don't need the HashMap to give me a filename.

OK.

So you basically need a (persistent) decision what file names in which
revisions are already processed. (Any version control software has
similar requirements.)

I would not recommend using your digest here. How would you ever be able to clean up the hash map? You don't know the hash codes of the vanished file names, so you can't know what to remove. And since your operation is not a LUW (logical unit of work) with respect to all files in a folder (as mentioned in your other post), you cannot simply remove all untouched keys during a run.

Furthermore, the result of filename -> timestamp is far more human-readable than hash keys. This will simplify testing and serviceability.
If you care about the MB of storage in the file system, then zip the content of your persistent information file (it will likely shrink by a factor of 2 or more) or use a smarter file format that does not repeat the same strings over and over. E.g.

Folder1
  File1 timestamp
  File2 timestamp
Folder2
...

This will save approximately another factor of 2 with many files in deep folders.
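A sketch of a writer for such a grouped format, assuming plain text with one folder line followed by indented "name timestamp" lines (all names are mine):

```java
import java.io.File;
import java.util.Map;
import java.util.TreeMap;

public class GroupedCacheFormat
{
    /** render filename -> timestamp entries with each folder string written only once */
    static String format( Map<String, Long> stampsByPath )
    {
        // group by parent folder; TreeMaps keep the output stable and diff-friendly
        TreeMap<String, TreeMap<String, Long>> byFolder = new TreeMap<>();
        for ( Map.Entry<String, Long> e : stampsByPath.entrySet() )
        {
            File f = new File( e.getKey() );
            String folder = f.getParent() == null ? "" : f.getParent();
            TreeMap<String, Long> names = byFolder.get( folder );
            if ( names == null )
            {
                names = new TreeMap<>();
                byFolder.put( folder, names );
            }
            names.put( f.getName(), e.getValue() );
        }
        StringBuilder sb = new StringBuilder();
        for ( Map.Entry<String, TreeMap<String, Long>> fe : byFolder.entrySet() )
        {
            sb.append( fe.getKey() ).append( '\n' );
            for ( Map.Entry<String, Long> ne : fe.getValue().entrySet() )
            {
                sb.append( '\t' ).append( ne.getKey() ).append( ' ' )
                  .append( ne.getValue() ).append( '\n' );
            }
        }
        return sb.toString();
    }
}
```

Deep folder paths are emitted once per folder rather than once per file, which is where the extra factor of 2 comes from.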


Marcel
 
Roedy Green

There is no map needed for that purpose at all

Here is my first cut at the code. I have not run it yet.

It shows what I had in mind.

/*
 * [Lazy.java]
 *
 * Summary: Lets us avoid the work of expanding macros if they were done successfully earlier.
 *
 * Copyright: (c) 2012-2014 Roedy Green, Canadian Mind Products, http://mindprod.com
 *
 * Licence: This software may be copied and used freely for any purpose but military.
 *          http://mindprod.com/contact/nonmil.html
 *
 * Requires: JDK 1.7+
 *
 * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
 *
 * Version History:
 *  1.0 2014-04-21 initial version.
 */
package com.mindprod.htmlmacros.support;

import com.mindprod.common17.ST;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map.Entry;
import java.util.concurrent.TimeUnit;
import java.util.zip.Adler32;

import static com.mindprod.htmlmacros.macro.Global.configuration;
import static java.lang.System.err;

/**
 * Lets us avoid the work of expanding macros if they were done successfully earlier.
 *
 * @author Roedy Green, Canadian Mind Products
 * @version 1.0 2014-04-21 initial version
 * @since 2014-04-21
 */
public class Lazy
{
    // -------------------------- PUBLIC STATIC METHODS --------------------------

    static final Adler32 digester = new Adler32();

    /**
     * encoding for UTF-8
     */
    private static final Charset UTF8Charset = Charset.forName( "UTF-8" );

    /**
     * where the lazy cache is kept.
     */
    private static final String CACHE_FILE_NAME = "embellishment/cache.bin";

    /**
     * how many entries we are expecting in the lookup file.
     */
    private static final int size = 20000;

    /**
     * root directory of the local websites
     */
    private static final File root = new File( configuration.getLocalWebrootWithSlashes() );

    /**
     * allow 5 seconds of slop in matching dates
     */
    private static final long slop = TimeUnit.SECONDS.toMillis( 5 );

    static
    {
        load();
    }

    static
    {
        // arrange for save to be run at shutdown.
        Runtime.getRuntime().addShutdownHook( new Thread()
        {
            public void run()
            {
                save();
            }
        } );
    }

    /**
     * look up filename hash-32 to timestamp
     */
    private static HashMap<Integer, Long> lookup;

    /**
     * load the previous contents of the lookup cache from embellishments/cache.bin.
     * It is a binary file of pairs hash-32, timestamp-64.
     */
    private static void load()
    {
        // load up the HashMap we use to track when files were last successfully processed.
        final File cacheFile = new File( root, CACHE_FILE_NAME );
        // allow some padding to avoid collisions
        lookup = new HashMap<>( size + size / 4 );
        // if no cache file, carry on with an empty lookup.
        // if the cache file exists, load pairs from it.
        if ( cacheFile.exists() && cacheFile.canRead() )
        {
            DataInputStream dis = null;
            try
            {
                try
                {
                    // O P E N
                    final FileInputStream fis = new FileInputStream( cacheFile );
                    final BufferedInputStream bis = new BufferedInputStream( fis, 512 * 1024 );
                    dis = new DataInputStream( bis );

                    while ( true )
                    {
                        // R E A D pairs hash-32, timestamp-64
                        int hash = dis.readInt();
                        long timestamp = dis.readLong(); // readLong, not readInt: save() writes a long.
                        safePut( hash, timestamp );
                    } // end loop
                } // end inner try
                catch ( EOFException e )
                {
                    // nothing to do
                }
                finally
                {
                    // C L O S E
                    if ( dis != null )
                    {
                        dis.close();
                    }
                }
            } // end outer try
            catch ( IOException e )
            {
                err.println( ">>> Warning. Unable to read cache.bin file" );
            }
        } // end if
        else
        {
            err.println( ">>> clearing the cache.bin file" );
        }
    } // end load

    /**
     * save the contents of the lookup cache into embellishments/cache.bin.
     * It is a binary file of pairs hash-32, timestamp-64.
     */
    private static void save()
    {
        final File cacheFile = new File( root, CACHE_FILE_NAME );

        if ( cacheFile.canWrite() )
        {
            try
            {
                // O P E N
                final FileOutputStream fos = new FileOutputStream( cacheFile, false /* append */ );
                final BufferedOutputStream bos = new BufferedOutputStream( fos, 65536 /* 64K bytes */ );
                final DataOutputStream dos = new DataOutputStream( bos );
                for ( Entry<Integer, Long> entry : lookup.entrySet() )
                {
                    // W R I T E
                    int hash = entry.getKey();
                    long timestamp = entry.getValue();
                    dos.writeInt( hash ); // we write int and long, not Integer and Long.
                    dos.writeLong( timestamp );
                }

                // C L O S E
                dos.close();
            } // end try
            catch ( IOException e )
            {
                err.println( ">>> Warning. Unable to write cache.bin file " + e.getMessage() );
            }
        } // end if
        else
        {
            err.println( ">>> Warning. Unable to write cache.bin file" );
        }
    } // end save

    /**
     * has this file already been processed, and is it unchanged since that time?
     *
     * @param file file we are processing.
     *
     * @return true if the file has already been successfully processed.
     */
    public static boolean isAlreadyDone( File file )
    {
        // prune filename down to jgloss/jdk
        int hash = calcHash( file );

        Long timestampL = lookup.get( hash );
        // if no entry, it was not registered as done.
        if ( timestampL == null )
        {
            return false;
        }

        // if all is well, the last modified date should not have changed
        // since we recorded the file as successfully processed.
        if ( file.lastModified() > timestampL + slop )
        {
            // the file has been modified since we last processed it.
            // we will have to reprocess it.
            // This cache entry is useless. We might as well get rid of it to save some space.
            lookup.remove( hash );
            return false;
        }
        else
        {
            // it has not been touched since we last successfully processed it.
            // the cache entry is fine as is.
            return true;
        }
    }

    /**
     * Mark the status of this file.
     *
     * @param file   file we are processing.
     * @param status true=file successfully processed, false=file was not successfully processed.
     */
    public static void markStatus( File file, boolean status )
    {
        int hash = calcHash( file );

        if ( status )
        {
            // GOOD
            // If there already is an entry we must modify it.
            // If there is no entry, we must create one.
            // The easiest way to accomplish this is just to create a new entry for that key.
            // Collisions should be very rare, but when they happen keep the conservative older date of the pair.
            // If one of the pair needs to be reprocessed, both will be.
            long timestamp = file.lastModified();
            safePut( hash, timestamp );
        }
        else
        {
            // BAD
            // processing this file failed.
            // get rid of any entry. There might not be one. remove does not mind.
            lookup.remove( hash );
        }
    }

    /**
     * put this hash : timestamp into the lookup.
     * However, if there is an existing entry, do not replace it if it is older than this one.
     *
     * @param hash      hash-32 of the file name
     * @param timestamp timestamp of file when it was last successfully processed.
     */
    private static void safePut( final int hash, final long timestamp )
    {
        Long prevTimeStampL = lookup.put( hash, timestamp );

        // collisions should be very rare, but when they happen keep the conservative older date of the pair.
        // if one of the pair needs to be reprocessed, both will be.
        if ( prevTimeStampL != null && prevTimeStampL < timestamp )
        {
            // put it back. This should happen very rarely.
            lookup.put( hash, prevTimeStampL );
        }
    }

    /**
     * calculate a hash-32 of the name of the file, not its contents.
     *
     * @param file file to be processed
     *
     * @return 32-bit Adler hash.
     */
    private static int calcHash( final File file )
    {
        // prune filename E:/mindprod/jgloss/jdk.html down to jgloss/jdk
        String chopped = ST.chopTrailingString( Tools.webrootRelativeNameWithSlashes( file ), ".html" );
        final byte[] theTextToDigestAsBytes = chopped.getBytes( UTF8Charset );
        digester.reset();
        digester.update( theTextToDigestAsBytes );
        return ( int ) digester.getValue();
    }
}
 
markspace

here is my first cut at the code. I have not run it yet

Yeow, this is kinda terrible. My main thoughts are:

1. Way too much static initialization. Just referencing the class in
any fashion will cause the entire cache to be read into memory.

2. Several of your static constants are ripe for a Look Up Service
and/or Injection. Cf. Martin Fowler's discussion of these design patterns.

<http://martinfowler.com/articles/injection.html>

3. You should leave yourself some wiggle room to move from flat files to
a simple database in the future. Since it's your code you can just gut
the thing and re-write it but it doesn't look easy.
import com.mindprod.common17.ST;

Numbering package names == hard to remember. Use a better name for
what's common in this package other than "17".
public class Lazy

You said it, not me, ;-)
{
static
{
load();
}
Evil!!!


static
{

// arrange for save to be run at shutdown.
Runtime.getRuntime().addShutdownHook( new Thread()

Shutdown hooks are pretty gross, imo. Is there a better way?
/**
* look up filename hash-32 to timestamp
*/
private static HashMap<Integer, Long> lookup;
???

private static void load()
{

Way too many static methods here. What happened to instantiating a
class like normal folks?
private static void save()

Same comment.
public static boolean isAlreadyDone( File file )
{

There are a lot of side effects in this method, including removing stale
entries. Consider refactoring this into smaller methods.
// if all is well, the last modified date should not have changed
// since we recorded the file as successfully processed.
if ( file.lastModified() > timestampL + slop )
{

// the file has been modified since we last processed it.
// we will have to reprocess it.
// This cache entry is useless. We might as well get rid of it to save some space.

I'm really curious as to what, exactly, you are caching here, and why.
What's the point of keeping filenames in a hash map?

I also think you're conflating a basic operation, like caching, with other operations that are specific to your app domain, like "prune filename E:/mindprod/jgloss/jdk.html down to jgloss/jdk". Filename manipulation belongs in a higher level of the code; the cache should just do caching.

Cf. other cache designs, like the one built into Android:

<http://developer.android.com/reference/android/util/LruCache.html>
 
taqmcg

On Mon, 21 Apr 2014 00:51:05 -0700, Roedy Green wrote:



If the math is right,

For a 16-bit hashing, 20,000 filenames, you could expect 50,796 collisions.
For a 32-bit hashing, 20,000 filenames, you could expect 4 collisions.
For a 64-bit hashing, 20,000 filenames, you could expect 2 collisions.
I don't think this is correct. It's pretty clear intuitively that the probability of there being any collisions is something of the order of Nf^2/Nb, where Nb is the number of bins (e.g., 2^64) and Nf is the number of files (assuming Nf << Nb).

Consider: the first file has 0 chance of being a collision. There's a 1/Nb chance that the second file is a collision, something very close to 2/Nb that the third file is a collision, and so on up to a ~Nf/Nb chance that the Nf'th file is a collision. So the sum of these is

Sum(1,Nf)/Nb ~ Nf^2 / (2 Nb)

where Sum(1,Nf) is defined as the sum of all integers from 1 to Nf.

So that would indicate that the expectation value for collisions in a 64-bit hash is ~2^(-17) <<< 1. See http://blog.shay.co/hash-collision-probability/ to do this more rigorously.


Regards,
Tom McGlynn
 
taqmcg

... a 64 bit hash is ~2^(-17) <<< 1.

Actually it's 20,000^2/2^65, which is more like 2^30/2^65 = 2^-35 ~ 3x10^-11, give or take a power of two or so. I switched ints and shorts for a second when writing the above.

Tom
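Tom's approximation is easy to evaluate directly. Expected colliding pairs are roughly n(n-1)/(2*Nb) for n items in Nb bins; a quick check of all three sizes from the earlier table (class name is mine):

```java
public class CollisionEstimate
{
    /** expected number of colliding pairs when hashing n items into 2^bits bins */
    static double expectedCollisions( double n, int bits )
    {
        return n * ( n - 1 ) / ( 2 * Math.pow( 2, bits ) );
    }

    public static void main( String[] args )
    {
        System.out.printf( "16-bit: %.1f%n", expectedCollisions( 20000, 16 ) ); // ~3052
        System.out.printf( "32-bit: %.4f%n", expectedCollisions( 20000, 32 ) ); // ~0.047
        System.out.printf( "64-bit: %.2e%n", expectedCollisions( 20000, 64 ) ); // ~1.1e-11
    }
}
```

These agree with the corrected estimate above rather than with the earlier 50,796 / 4 / 2 figures: at 20,000 filenames even a 32-bit hash collides far less than once per run, and a 64-bit hash essentially never.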
 
Roedy Green

I don't see why this second map needs no boxing in Java. Also your
timestamp shrink to 2 bytes.

I was just counting the raw data bytes, not the boxing overhead.
 
Roedy Green

here is my first cut at the code. I have not run it yet

I think my collision handling is a bit naive. I need to think about it a bit more. If I can get the number of collisions down to once every 1000 years, I will live with it, since any software change flushes the cache.
 
Roedy Green

Numbering package names == hard to remember. Use a better name for
what's common in this package other than "17".

It refers to common code that needs JDK 1.7 or higher to run.

I used several of them (11, 13, 15), but now I have dropped the older ones.
 
Roedy Green

1. Way too much static initialization. Just referencing the class in
any fashion will cause the entire cache to be read into memory.

If you have methods like load and save, you can make the client call them, but that is just one more thing to go wrong. What are the alternatives?
 
Roedy Green

Way too many static methods here. What happened to instantiating a
class like normal folks?

Because I am not prepared at this early stage to think about multiple
instances. Before I released it, I would have to do that.

A released version would need a callback for compacting the filename.
It would need the name of the cache file as a parameter.

I tend to think in concrete terms, then when all is working, gradually
add the generalisations.
 
