Why is my Java code THAT slow compared to C++?

Kevin

C++: 94 seconds.
my Java: 227 seconds.
Times include everything (CPU, I/O, etc.).

Task: read in data line by line (required) and find the word frequency
counts in a large file (1.2 GB).
A C++ program (I do not have its source) takes 94 seconds in total, using
a constant 600 KB of memory as reported by the Windows Task Manager.

My code (I commented out some lines and ran the program several times to
get these statistics):

1) Scan the data file line by line and convert each line to a custom
"raw string" class: 60 seconds total. I use an InputStream and read
32 KB into a byte[] each time (the data file is plain ASCII), and
implement my own "NextLine()" function over that byte[]. In the code
below, this is measured by commenting out the lines from Line A to Line C.

2) On top of 1), splitting each line into words adds another 40 seconds
(100 seconds total). In the code below, this is measured by commenting
out the lines from Line B to Line C.

3) On top of 2), adding a hashtable frequency count (a HashMap gives a
similar result) adds another 127 seconds, so the total is 227 seconds.

Reading the 1.2 GB plain text file line by line in 60 seconds seems not
that bad.

Splitting each line (a byte[]) into words, each a user-defined
"raw string" object, in another 40 seconds is not that bad either.

However, hash-counting those words takes another 127 seconds,
hmm...

So:
Issue number 1: counting the words using the hashtable takes far too
long. What is the problem?
Issue number 2: the split() function that splits a byte[] does not seem
good enough.

I list the code below. Suggestions greatly appreciated. :)

Have a nice weekend!

///================================================================
Several major classes are:

My own raw string, which is basically a byte[] array.

public class KDRawString
{
    private byte[] Data = null;

    public int hashCode()
    {
        int c = 1;
        int v = 31;
        for (int i = 0; i < Data.length; i++)
        {
            c = c + Data[i] * v;
            v = v * 31;
        }
        return c;
    }

    public boolean equals(Object o)
    {
        if (o instanceof KDRawString)
        {
            KDRawString K = (KDRawString) o;
            if (K.Data.length != this.Data.length)
                return false;
            for (int i = 0; i < this.Data.length; i++)
            {
                if (this.Data[i] != K.Data[i])
                    return false;
            }
            return true;
        }
        else
            return false;
    }

    public KDRawString(final byte[] data)
    {
        Data = Utility.Duplicate(data);
        // Utility.Duplicate is just like System.arraycopy.
    }

    public KDRawString(byte[] data, final boolean ReUse)
    {
        if (ReUse) // sometimes we do not need to create a new array, if we can use the input one.
            Data = data;
        else
            Data = Utility.Duplicate(data);
        // Utility.Duplicate is just like System.arraycopy.
    }

    //////////
    public KDRawString[] split(final char c)
    {
        // Basically just call the static split, and wrap each byte[] in a KDRawString.
        byte[][] b = split(Data, c);
        KDRawString[] R = new KDRawString[b.length];
        for (int i = 0; i < b.length; i++)
        {
            R[i] = new KDRawString(b[i], true);
            // We can reuse the byte[] here since it is newly created and
            // not used anywhere else. This saved some time in my test run.
        }
        return R;
    }

    //
    public static byte[][] split(final byte[] data, final char c)
    {
        // Split by char c.
        // First, decide how many segments we need.
        int count = 0;
        for (int i = 0; i < data.length; i++)
            if (data[i] == c)
                count++;
        //
        byte[][] result = new byte[count + 1][];
        // Then split by char c.
        int resultIndex = 0;
        int prePos = 0;
        for (int i = 0; i < data.length; i++)
        {
            if (data[i] == c)
            {
                byte[] b = Utility.Duplicate(data, prePos, i - prePos);
                prePos = i + 1;
                result[resultIndex] = b;
                resultIndex++;
            }
        }
        byte[] b = Utility.Duplicate(data, prePos, data.length - prePos);
        result[resultIndex] = b;
        //
        return result;
    }
    ////////
}
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Another class is KDInt, which is just a mutable int used as a counter (so
I do not need to create many Integer objects in the counting process).
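It is essentially the following (a minimal sketch; GetValue() is assumed,
since only the constructor and IncreaseValue() appear in the main loop
below):

public class KDInt
{
    // Sketch of KDInt: a mutable int wrapper used as a map value, so
    // the count can be bumped in place without creating new objects.
    private int value;

    public KDInt(final int initial)
    {
        value = initial;
    }

    public void IncreaseValue()
    {
        value++;
    }

    public int GetValue() // assumed accessor, not shown in the post
    {
        return value;
    }
}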
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
For the main function, it is:

Hashtable ht = new Hashtable(10000); // total number of distinct words is about 4000.
// Using a HashMap saves 1-2 seconds, as I have tested. So not a major problem.
KDASCIITextFileReader kr = new KDASCIITextFileReader(fileName);
// This class reads a plain text file line by line as byte[].
byte[] b = kr.NextLine_AsBytes();
int lineCount = 0;
while (b != null)
{
    lineCount++;
    //
    KDRawString ks = new KDRawString(b);
    //< ------------------------------------------------- Line A
    KDRawString[] K = ks.split('\t');
    //< ------------------------------------------------- Line B
    for (int i = 1; i < K.length; i++)
    {
        KDInt c = (KDInt) ht.get(K[i]);
        if (c == null)
        {
            c = new KDInt(0);
            ht.put(K[i], c);
        }
        c.IncreaseValue();
    }
    //< ------------------------------------------------- Line C
    b = kr.NextLine_AsBytes();
}

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// KDASCIITextFileReader is a class that reads a text file line by line into byte[].
public class KDASCIITextFileReader
{
    private boolean hasError = false;
    private InputStream is = null;
    private byte[] DataBuffer = null;
    private int DataBufferReadCount = -1;  // bytes currently in DataBuffer.
    private int DataBufferPointer = -1;    // read position within DataBuffer.
    private int initialBufferSize = 32768; // 32 KB.

    public byte[] NextLine_AsBytes()
    {
        byte[] b = NextLine_PossibleBlank_AsBytes();
        while ((b != null) && (b.length == 0))
            b = NextLine_PossibleBlank_AsBytes();
        //
        return b;
    }

    public byte[] NextLine_PossibleBlank_AsBytes()
    {
        // Break at both \r and \n.
        // UNIX is \n, Windows is \r\n, classic Mac OS is \r.
        //
        if (is == null)
            return null;
        //
        byte[] R = null; // bytes to return.
        while (true)
        {
            int seekPos = DataBufferPointer;
            while (seekPos < DataBufferReadCount)
            {
                if ((DataBuffer[seekPos] == '\r') || (DataBuffer[seekPos] == '\n'))
                {
                    byte[] b = Utility.Append(R, DataBuffer, DataBufferPointer,
                                              seekPos - DataBufferPointer);
                    DataBufferPointer = seekPos + 1;
                    return b;
                }
                seekPos++;
            }
            // Buffer exhausted: append the current tail to R, then refill.
            R = Utility.Append(R, DataBuffer, DataBufferPointer,
                               DataBufferReadCount - DataBufferPointer);
            fillBuffer();
            if (DataBufferReadCount < 0)
            {
                // End of file.
                close();
                return R;
            }
        }
    }

    public KDASCIITextFileReader(final String FileName)
    {
        initFile(FileName);
    }

    ////////////////////
    private void initFile(final String FileName)
    {
        hasError = false;
        try
        {
            File f = new File(FileName);
            FileInputStream fis = new FileInputStream(f);
            //
            is = fis;
            DataBuffer = new byte[initialBufferSize];
            //
            fillBuffer();
        }
        catch (Exception e)
        {
            e.printStackTrace();
            setError(e.toString());
        }
    }

    ////////////////////
    private void fillBuffer()
    {
        try
        {
            DataBufferReadCount = is.read(DataBuffer);
            DataBufferPointer = 0;
        }
        catch (Exception e)
        {
            e.printStackTrace();
            setError(e.toString());
        }
    }

    public void close()
    {
        try
        {
            if (is != null)
                is.close();
            is = null;
            DataBufferReadCount = -1;
            DataBufferPointer = -1;
        }
        catch (Exception e)
        {
            e.printStackTrace();
            setError(e.toString());
        }
    }
}
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// End of post.
 
Richard Wheeldon

Kevin said:
C++: 94 seconds.
my Java: 227 seconds.
Time including all (CPU, IO, etc).

A few points:

1. Java is slow. Get used to it. You can save so much in
development time compared to C that the performance
difference pales by comparison.

2. Get a profiler. http://mindprod.com/jgloss/profiler.html
This will answer your specific issues here.

3. By convention, Java class names start with capital letters - LikeThis;
method names start with small ones - likeThis().

4. Class and method comments are best done as Javadoc comments.

Richard
 
Jeffrey Schwab

Richard said:
A few points:

1. Java is slow. Get used to it.

That depends on the application. For something as small as the OP's
example (which you snipped), Java's startup time is significant. For a
long-running, I/O bound, multi-tiered application, I think the
performance difference would dwindle.
You can save so much in
development time compared to C that the performance
difference pales by comparison.

Who said anything about C? The OP is using C++. These are closely
related but very different languages.

One of my pet peeves is the way people still compare languages to C to
show how much better they are for high-level development. C is not the
standard for developer productivity, and has not been for quite some
time. On the other hand, try integrating Java into a 1-MB PC BIOS
written mostly in assembly language.
2. Get a profiler. http://mindprod.com/jgloss/profiler.html
This will answer your specific issues here.

3. By convention Java class names start with Capital letters - LikeThis
methods start with small ones - likeThis().

4. Class and methods comments are best done as Javadoc comments.

Agree with all three of these points. :)
 
tom fredriksen

Hi, here are some comments I have:

You cannot time an operation just by commenting out a step and then
rerunning the program. A lot of the time taken comes from other/external
effects that are not relevant to the step. What you have to do is read
the current millisecond counter at the start of the step you want to
test - System.currentTimeMillis() should do it - then read the counter
again at the end of the step. Calculate the time difference and print it
out; then you have the real time of the step.
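For example (a minimal sketch; doStep() is just a stand-in for the code
being measured):

public class StepTimer
{
    public static void main(String[] args)
    {
        // Read the clock before and after the step under test.
        long start = System.currentTimeMillis();
        doStep();
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("step took " + elapsed + " ms");
    }

    private static void doStep()
    {
        // Stand-in: replace with the real step (read, split, count, ...).
        for (int i = 0; i < 1000000; i++) { }
    }
}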

Also beware of using operating system tools for a quick reading of a
program's memory usage etc.; the interpretation of the numbers is usually
wrong. On Linux, for example, the memory number given by the ps command
represents the total for the program including all shared libraries.
That is not the number you are looking for; you want the size of *your*
program's allocated memory, not necessarily all the memory used by all
the libraries as well.

As suggested by others, use a profiling tool before you start doing any
optimisations. Time and again, it has been shown that people spend time
optimising the wrong part of the program.

Further, if you think the hash map is the major problem, use a different
map library than the one provided in the JDK. I can suggest several:
fastutil, trove, javolution (up to 5 times faster than the JDK version).

Here is a link to a list of third-party library implementations (open
source) that you might want to use.

http://www.manageability.org/blog/stuff/open-source-java-data-structures-and-algorithms/view

PS.
I would also recommend making your code a little more readable for
others. It does not contain comments (Javadoc style), the usual Java
style is not followed, and I found the code itself a bit difficult to
read. Perhaps that is because there are no empty lines between methods
and blocks; I generally use two empty lines for that.

yours sincerely

tom
[Kevin's original post quoted in full - snipped.]
 
Chris Uppal

tom said:
You cannot time an operation just by commenting out a step and then
rerunning the program. A lot of the time taken comes from other/external
effects that are not relevant to the step.

and Jeffrey said:
[...] For something as small as the OP's example (which you snipped),
Java's startup time is significant. For a long-running, I/O bound,
multi-tiered application, I think the performance difference would
dwindle.

Your points are valid in general, but you both seem to have missed the
/numbers/ from Kevin's post:
C++: 94 seconds.
my Java: 227 seconds.

When we are talking about numbers like that -- specifically the /very large/
numbers (for both versions of the program) -- then the JVM startup time is
essentially irrelevant, and the "noise" in the figures from OS effects will be
negligible.

And given that the C++ version took 94 seconds, we must assume that it had to
process a great deal of data (like Kevin's other examples). In which case it
would be almost completely disk-bound -- I/O bandwidth would dominate the
execution time. Under the circumstances, there is no excuse for a Java program
not to be similarly limited by I/O time. Fortunately we don't need to scrabble
around for an explanation for the embarrassing observation that it is /not/ so
limited, since Kevin has already (partially) explained it ;-)

-- chris
 
Kevin

####################### problem solved #####################
Thanks all for the replies. I posted another topic explaining it; I will
summarize it here.

It is not my Java code's problem.
I found the detailed guide for the C++ program.
It turned out that the data file is in a format that favors that C++
code: the hash value of each word is already encoded in the file (which
I previously thought was just part of the data).
So the C++ code does not need to do the hash step, only the count step.
After I took that special data format into account in my code, my Java
code runs as fast as the C++ code.
Thank you all for the replies and advice! I really learned a lot from
this group. Thank you.

Below, I briefly state how I do it, in case someone finds it useful (a
condensed sketch follows the list).
For fast file I/O (plain ASCII in my case):
1) read in the file using an InputStream, each time reading 32 KB of
data into a byte[] buffer.
2) write your own "readLine()" method, which scans the byte[] buffer
and returns a new byte[] as a line.
3) write your own "split(char c)" method, which breaks one byte[] into
many byte[]s.
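A condensed, self-contained sketch of that pattern (class and method
names here are illustrative; this is not the original code):

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Minimal illustration of steps 1-3: buffered byte reads, a hand-rolled
// readLine(), and a byte[] split.
public class ByteLineReader
{
    private final InputStream in;
    private final byte[] buf = new byte[32 * 1024]; // step 1: 32 KB reads
    private int len = 0;
    private int pos = 0;

    public ByteLineReader(final String fileName) throws IOException
    {
        in = new FileInputStream(fileName);
    }

    // Step 2: scan the buffer and return the next line as a new byte[],
    // or null at end of file. Treats both \r and \n as line breaks.
    public byte[] readLine() throws IOException
    {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean sawAny = false;
        while (true)
        {
            if (pos >= len)
            {
                len = in.read(buf); // refill the buffer
                pos = 0;
                if (len < 0)
                    return sawAny ? line.toByteArray() : null;
            }
            sawAny = true;
            int start = pos;
            while (pos < len && buf[pos] != '\r' && buf[pos] != '\n')
                pos++;
            line.write(buf, start, pos - start);
            if (pos < len) // hit a line break
            {
                pos++; // skip the break character
                return line.toByteArray();
            }
        }
    }

    // Step 3: break one byte[] into many byte[]s at separator c.
    public static List<byte[]> split(final byte[] data, final byte c)
    {
        List<byte[]> parts = new ArrayList<byte[]>();
        int prev = 0;
        for (int i = 0; i < data.length; i++)
        {
            if (data[i] == c)
            {
                byte[] part = new byte[i - prev];
                System.arraycopy(data, prev, part, 0, i - prev);
                parts.add(part);
                prev = i + 1;
            }
        }
        byte[] tail = new byte[data.length - prev];
        System.arraycopy(data, prev, tail, 0, tail.length);
        parts.add(tail);
        return parts;
    }
}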

####################### problem solved #####################
 
tom fredriksen

Chris said:
When we are talking about numbers like that -- specifically the /very large/
numbers (for both versions of the program) -- then the JVM startup time is
essentially irrelevant, and the "noise" in the figures from OS effects will be
negligible.

And given that the C++ version took 94 seconds, we must assume that it had to
process a great deal of data (like Kevin's other examples). In which case it
would be almost completely disk-bound -- I/O bandwidth would dominate the
execution time.

This is exactly my point. In addition, if the test is performed several
times and the numbers are simply added together, then the I/O time is
counted as many times as there are runs, which makes the result far
larger than it should be and therefore horrendously wrong.

Measuring the time to read the entire file is one thing; measuring that
plus other steps as an accumulation of each other is a completely
different matter and would produce the wrong result.
What I said earlier still stands if one wants correct results from
performance measurements.

/tom
 
Roedy Green

Richard said:
1. Java is slow. Get used to it. You can save so much in
development time compared to C that the performance
difference pales by comparison.

People who code in Java as if it were C write slow code. People who
insist on using and talking about 10-year-old JVMs experience slow
code. People who write code with their eyes glued shut to what is
quick and what is not will write slow code, at least until they take
the time to optimise it. Try the latest JVMs and AOT compilers. Learn
to do I/O in big chunks, or a file at a time, efficiently. You will be
surprised how snappy Java can be.

See http://mindprod.com/jgloss/jdk.html
http://mindprod.com/jgloss/aot.html
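For instance, a minimal sketch of reading in big chunks (the 64 KB
buffer size is an arbitrary illustration):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Read a file in 64 KB chunks instead of byte by byte or through an
// unbuffered Reader; one read() call per chunk keeps the I/O cost low.
public class ChunkedRead
{
    public static long countBytes(final String fileName) throws IOException
    {
        InputStream in = new FileInputStream(fileName);
        try
        {
            byte[] buf = new byte[64 * 1024];
            long total = 0;
            int n;
            while ((n = in.read(buf)) > 0)
                total += n;
            return total;
        }
        finally
        {
            in.close();
        }
    }
}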
 
Roedy Green

Chris Uppal said:
And given that the C++ version took 94 seconds,

In most of these comparisons, a single person wrote both programs.
Usually they are expert in one language and inept in the other. You
must compare implementations by competent people, not a hack port from
one language to another.
 
Thomas Hawtin

Kevin said:
It is not my Java code's problem.

I disagree.
I found the detailed guide for the C++ program.
It turned out that the data file is in a format that favors that C++
code: the hash value of each word is already encoded in the file (which
I previously thought was just part of the data).

My initial reaction to reading that was: "Oh my god, you don't want to
do that."

This really should be an I/O bound problem. Adding extra data is not
the way to make it faster.
So the C++ code does not need to do the hash step, only the count step.

That is not necessarily why the C++ code is faster.
After I took that special data format into account in my code, my Java
code runs as fast as the C++ code.

You are changing more than you think you are.

I didn't look through your code first time around because it had been
made unnecessarily difficult to read.

Looking at it again my first thought was that you were repeatedly
calculating the hash code every time. When you changed the program to
read the hash code from the file, you presumably stopped doing that.
Even so, the method should not be executed frequently.

Looking at the hash algorithm, it doesn't look conventional. Effectively
what you have is:

Sum_i s[i] * ((2^5 - 1) ^ (i+1))

The binary representations of powers of one-less-than-a-power-of-two are
not as "random" as you might think, particularly in the last few binary
digits. For a hash map/table this may result in a poor distribution of
entries and hence very poor performance.

The exact impact will depend upon the implementation of the hash map.
Using HashMap on Sun's 1.4.1 may well turn out to be much slower than in
1.4.0 or 1.4.2.
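For contrast, the conventional accumulating form (a sketch; this is the
scheme java.lang.String uses, not the OP's posted code):

// Conventional 31-based polynomial hash, the same scheme as
// java.lang.String: h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1].
// Each byte is folded into the running accumulator, so every input
// byte affects all bits of the result, including the low-order ones.
public class BetterHash
{
    public static int hash(final byte[] data)
    {
        int h = 0;
        for (int i = 0; i < data.length; i++)
            h = 31 * h + data[i];
        return h;
    }
}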

The importance of reading the hash code from the file is that it was
calculated with a better algorithm. The weak hash appears to have turned
what should be an I/O bound problem into a CPU bound one.

Tom Hawtin
 
Luc The Perverse

Thomas Hawtin said:
That is not necessarily why the C code is faster.


My initial gut reaction, and his solution, both seem to hint at the idea
that he was using an inefficient algorithm.

I have coded a word frequency calculator in C++ before, using O(n^2)
sorting algorithms, and it was VERY slow.

So I disagree - he changed one thing and now the two programs run in
about the same amount of time, as opposed to the Java taking more than
double. To me that seems pretty conclusive.
 
Thomas Hawtin

Luc said:
My initial gut reaction, and his solution both seem to hint at the idea that
he was using an inefficient algorithm.

It is vastly more important that the hash code produce a good answer
than that it be ultra quick.

With two multiplies, the hash code computation wasn't as cheap as it
might have been. Sun's JVM seems poor at replacing constant multiplies
with shifts and adds/subtracts, although that matters less with faster
multipliers. If you replace v = v*31; with v = (v<<5) - v; and perhaps
expand two steps together (v = (v<<10) - 2*(v<<5) + v;), you can see
why the low order bits are not going to be fairly distributed.
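One way to see the regularity concretely (a small sketch):

// Powers of 31 modulo 32 alternate between 31 and 1 (31 == -1 mod 32),
// so the low five bits of sum(s[i] * 31^(i+1)) reduce to a simple
// alternating sum of the bytes - far less mixing than the accumulating
// form h = 31*h + s[i]. Int overflow does not affect results mod 32.
public class LowBits
{
    public static void main(String[] args)
    {
        int v = 31;
        for (int k = 1; k <= 8; k++)
        {
            System.out.println("31^" + k + " mod 32 = " + (v & 31));
            v *= 31;
        }
    }
}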
I have coded a word frequency calculator in C++ before, using O(n^2)
sorting algorithms, and it was VERY slow.

Doing n operations on a hash map should take O(n) time. Screw your hash
code up and it becomes O(n^2), and your caches burn for an extra
constant-factor slowdown.
So I disagree - he changed one thing and now the two programs run in
about the same amount of time, as opposed to the Java taking more than
double. To me that seems pretty conclusive.

How the hash code calculation was performed changed (almost certainly
to a slower technique). Checking the implementation of HashMap, it only
calculates the hash code once when inserting (or retrieving) an entry.
There is no way that the time taken to calculate the hash code is going
to make any impact on performance.

The original poster didn't know about the embedded hash code at the time
of writing the original code. It seems unlikely that the file format
happened to use the same hash algorithm, so that was a separate change.
Given the likelihood of collisions with the first algorithm, that is a
very good candidate for causing the increased performance.

Tom Hawtin
 
Luc The Perverse

Thomas Hawtin said:
[...]
The original poster didn't know about the embedded hash code at the time
of writing the original code. It seems unlikely that the file format
happened to use the same hash algorithm, so that was a separate change.
Given the likelihood of collisions with the first algorithm, that is a
very good candidate for causing the increased performance.

I see your point - I must have misunderstood the first time round.

And I agree - I don't think there is any reason to hash at all. Just be
careful that you don't make redundant copies of strings and you should
be fine.

I used a red-black tree, but any balanced tree should be OK. I searched
for each word: if I found it, I incremented its count; if I didn't find
it, I inserted a new entry.
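In Java, the JDK's TreeMap is a red-black tree, so that approach might
look like this (a sketch; the int[1] stands in for a mutable counter
like the OP's KDInt, and the sample data is illustrative):

import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Balanced-tree word counting: no hashing at all; keys are compared
// byte by byte, and lookups/inserts are O(log n) per word.
public class TreeCount
{
    private static final Comparator<byte[]> BYTES = new Comparator<byte[]>()
    {
        public int compare(byte[] a, byte[] b)
        {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++)
                if (a[i] != b[i])
                    return a[i] - b[i];
            return a.length - b.length;
        }
    };

    public static void main(String[] args)
    {
        Map<byte[], int[]> counts = new TreeMap<byte[], int[]>(BYTES);
        byte[][] words = { "to".getBytes(), "be".getBytes(), "to".getBytes() };
        for (byte[] w : words)
        {
            int[] c = counts.get(w);
            if (c == null)
                counts.put(w, c = new int[1]); // not found: insert a new entry
            c[0]++;                            // found (or just inserted): bump the count
        }
        for (Map.Entry<byte[], int[]> e : counts.entrySet())
            System.out.println(new String(e.getKey()) + " " + e.getValue()[0]);
    }
}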
 
Chris Uppal

Thomas said:
The original poster didn't know about the embedded hash code at the time
of writing the original code. It seems unlikely that the file format
happened to get the same hash value. So that was a separate change.
Given the likelihood of collisions with the first algorithm, that is a
very good candidate for causing the increased performance.

There's also the marginal effect that the OP was probably interpreting
the hash-code strings as words and adding them to the table too. Not
only would that double the number of words in the table, the hash
strings themselves would have a much reduced character distribution
(all decimal digits, or some such), which might further exacerbate any
problem caused by weak hashing.

-- chris
 
