Slow insertion to hashes?

Jindroush

Hi,

I have to load a file. I 'index' the records in two hashes. At the
beginning I do:
my $fl = -s $in;
my $recs = int( $fl / 85 );
keys( %$hr1 ) = $recs;
keys( %$hr2 ) = $recs if defined $hr2;

then for each record:
$$hr1{ lc $fname } = \%H;

if( defined $hr2 )
{
    $$hr2{ $H{ hdr } . $H{ ent } } = \%H;
}

My data is stored in %H. This works well up to about 70,000 records (file1
has about 91,000 records; each record is about 85 chars, split into 6
strings). Then it is _DEAD SLOW_. I load three such files into 3 (6)
different hashes. The two other files are also _DEAD SLOW_. I have a quite
fast computer and s**tloads of free RAM. Running: v5.6.1 built for
MSWin32-x86-multi-thread (ActiveState) on WinXP (no way to change the perl
version/OS).

Why is it so slow? How can I make it faster?

Regards,
Jindroush
 
A. Sinan Unur

Jindroush said:
I have to load a file. I 'index' the records in two hashes. At the
beginning I do:
my $fl = -s $in;
my $recs = int( $fl / 85 );
keys( %$hr1 ) = $recs;
keys( %$hr2 ) = $recs if defined $hr2;

then for each record:
$$hr1{ lc $fname } = \%H;

if( defined $hr2 )
{
    $$hr2{ $H{ hdr } . $H{ ent } } = \%H;
}

Please post a short but complete script that still exhibits the
problem. As it is, there are too many unknowns in what you have
posted.

The posting guidelines for this group will show you how to put
together such a script.
My data is stored in %H. This works well up to about 70,000 records
(file1 has about 91,000 records; each record is about 85 chars, split
into 6 strings). Then it is _DEAD SLOW_. I load three such files into 3
(6) different hashes. The two other files are also _DEAD SLOW_. I have
a quite fast computer and s**tloads of free RAM. Running: v5.6.1 built
for MSWin32-x86-multi-thread (ActiveState) on WinXP (no way to change
the perl version/OS).

Why is it so slow?

You might be iterating over files many times. You might be slurping
large amounts of data. There might be a memory leak. Or, the keys
you are using might be generating too many collisions
(see <URL:http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/index.html>).

Sinan
 
xhoster

Jindroush said:
Hi,

I have to load a file. I 'index' the records in two hashes. At the
beginning I do:
my $fl = -s $in;
my $recs = int( $fl / 85 );
keys( %$hr1 ) = $recs;
keys( %$hr2 ) = $recs if defined $hr2;

I've never found this type of preallocation to be worthwhile.
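
For what it's worth, one way to measure whether pre-sizing buys anything on a
given perl is a quick Benchmark comparison; this is only a rough sketch with
made-up hex-style keys, not your real data:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # 100_000 made-up keys, vaguely shaped like the hex fields in the records
    my @keys = map { sprintf "%08X", int rand 0xFFFFFFFF } 1 .. 100_000;

    cmpthese( 10, {
        plain => sub {
            my %h;
            $h{$_} = 1 for @keys;
        },
        presized => sub {
            my %h;
            keys(%h) = scalar @keys;   # pre-extend the bucket array
            $h{$_} = 1 for @keys;
        },
    });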

You don't seem to be using strict. That is quite likely the problem.
then for each record:
$$hr1{ lc $fname } = \%H;

Where did %H come from? If I could see that, perhaps I could help you.
if( defined $hr2 )
{
    $$hr2{ $H{ hdr } . $H{ ent } } = \%H;
}

My data is stored in %H. This works well up to about 70,000 records (file1
has about 91,000 records; each record is about 85 chars, split into 6
strings). Then it is _DEAD SLOW_.

It sounds like you are swapping.
I load three such files into 3 (6)
different hashes. The two other files are also _DEAD SLOW_.

It sounds like you are swapping.
I have a quite
fast computer and s**tloads of free RAM.

I have no idea what you consider to be a s**tload. If you want real
answers, post real information.
Running: v5.6.1 built for
MSWin32-x86-multi-thread (ActiveState) on WinXP (no way to change the perl
version/OS).

Why is it so slow?

Probably because you are swapping. Why are you swapping? I don't know;
you haven't given us enough information to know. Or you could be having
hash key collisions. Periodically print your hashes in a scalar context
to see what the bucket usage is.
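
On perls of that vintage, a hash evaluated in scalar context reports something
like "used/allocated" buckets once it has entries, so a self-contained check
might look like this (the numbers in the comment are only an example):

    use strict;
    use warnings;

    my %DB;
    $DB{$_} = 1 for 1 .. 91_000;

    # prints e.g. "58231/131072"; a very small first number relative to the
    # key count would suggest many keys piling into a few buckets
    print scalar(%DB), "\n";
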
How can I make it faster?

You could post code that is actually runnable, so we can try it out and
see if we can see what the problem is. In other words, you could follow
the posting guidelines.

Xho
 
Jindroush

Jindroush said:
Hi,

I have to load a file. I 'index' the records in two hashes. At the
beginning I do:
This works well up to about 70,000 records (file1
has about 91,000 records; each record is about 85 chars, split into 6
strings). Then it is _DEAD SLOW_. I load three such files into 3 (6)
different hashes. The two other files are also _DEAD SLOW_. I have a quite
fast computer and s**tloads of free RAM. Running: v5.6.1 built for
MSWin32-x86-multi-thread (ActiveState) on WinXP (no way to change the perl
version/OS).

Why is it so slow? How can I make it faster?

According to the messages, I'm posting a more 'proper' problem report.

The code is below. The input file is about 91k records, 8.8 MB. I have 1 GB
of RAM, and the script takes under 50 MB while running. So no swapping
(100%). I tried commenting out various parts of the script, and it seems
that the whole problem is the size of each %H: the smaller it is, the faster
the script runs. How do I solve that (easily)?


use strict;

my %DB;

my $in = "file1";

my $fl = -s $in;
my $recs = int( $fl / 85 );
keys( %DB ) = $recs;

open IN, $in or die;

print "Loading '$in'...\n";
print( "preallocated $recs records...\n" );
my $cnt = 0;
my $st = time;
while( my $line = <IN> )
{
    chomp $line;

    my( $v1, $v2, $v3, $v4, $v5, $fname ) = split( / /, $line );

    my %H;
    $H{ v1 } = $v1;
    $H{ v2 } = $v2;
    $H{ v3 } = $v3;
    $H{ v4 } = $v4;
    $H{ v5 } = $v5;
    $H{ fname } = $fname; # must be here

    $DB{ lc $fname } = \%H;

    printf( "%4d $cnt\n", time - $st ) unless ( $cnt % 1000 );
    $cnt++;
}
close IN;
print "\n";
 
robic0

Jindroush said:
According to the messages, I'm posting a more 'proper' problem report.

The code is below. The input file is about 91k records, 8.8 MB. I have 1 GB
of RAM, and the script takes under 50 MB while running. So no swapping
(100%). I tried commenting out various parts of the script, and it seems
that the whole problem is the size of each %H: the smaller it is, the faster
the script runs. How do I solve that (easily)?


use strict;

my %DB;

my $in = "file1";

my $fl = -s $in;
my $recs = int( $fl / 85 );
keys( %DB ) = $recs;
(This rounds the number of buckets up to the next power of two, although I
don't know how you can check it.)
open IN, $in or die;

print "Loading '$in'...\n";
print( "preallocated $recs records...\n" );
my $cnt = 0;
my $st = time;
while( my $line = <IN> )
{
    chomp $line;

    my( $v1, $v2, $v3, $v4, $v5, $fname ) = split( / /, $line );

    my %H;

why incur the overhead of a named hash?

    $H{ v1 } = $v1;
    $H{ v2 } = $v2;
    $H{ v3 } = $v3;
    $H{ v4 } = $v4;
    $H{ v5 } = $v5;
    $H{ fname } = $fname; # must be here

    $DB{ lc $fname } = \%H;

Have you tried this?

if( !exists $DB{ lc($fname) } )
{
    $DB{ lc($fname) } = {
        'v1'    => $v1,
        'v2'    => $v2,
        'v3'    => $v3,
        'v4'    => $v4,
        'v5'    => $v5,
        'fname' => $fname,
    };
} else {
    print "$fname already exists\n";
}
 
xhoster

Jindroush said:
According to the messages, I'm posting a more 'proper' problem report.

The code is below. The input file is about 91k records, 8.8 MB. I have 1 GB
of RAM, and the script takes under 50 MB while running. So no swapping
(100%). I tried commenting out various parts of the script, and it seems
that the whole problem is the size of each %H: the smaller it is, the faster
the script runs. How do I solve that (easily)?

use strict;

my %DB;

my $in = "file1";

## Because I couldn't run your program without file1, I added this
## code. I used 1_000_000 because hey, let's push the limits.
{
    open my $fh, ">$in" or die $!;
    foreach (1..1_000_000) {
        print $fh join(" ", map substr(rand(),0,15), 1..6), "\n";
    }
    close $fh or die $!;
}
my $fl = -s $in;
my $recs = int( $fl / 85 );
keys( %DB ) = $recs;

open IN, $in or die;

print "Loading '$in'...\n";
print( "preallocated $recs records...\n" );
my $cnt = 0;
my $st = time;
while( my $line = <IN> )
{
    chomp $line;

    my( $v1, $v2, $v3, $v4, $v5, $fname ) = split( / /, $line );

    my %H;
    $H{ v1 } = $v1;
    $H{ v2 } = $v2;
    $H{ v3 } = $v3;
    $H{ v4 } = $v4;
    $H{ v5 } = $v5;
    $H{ fname } = $fname; # must be here

You could replace this with:

    my %H;
    @H{qw(v1 v2 v3 v4 v5 fname)} = split( / /, $line );

I think it looks better, but I doubt it will make things faster.
    $DB{ lc $fname } = \%H;

    printf( "%4d $cnt\n", time - $st ) unless ( $cnt % 1000 );
    $cnt++;
}
close IN;
print "\n";

I don't see anything obviously wrong with your code. The problem is
probably with your data (hash key collisions) or with your system (reading
the file itself is very slow once you get beyond the part that is cached
from before, or some kind of VM problem, or something else).

Generating the file the way I did above, I saw absolutely no slowing at all
all the way up to a size of 1_000_000. What do you see if you use my
random data rather than your actual data?

Xho
 
Jindroush

## Because I couldn't run your program without file1, I added this
## code. I used 1_000_000 because hey, let's push the limits.
{
    open my $fh, ">$in" or die $!;
    foreach (1..1_000_000) {
        print $fh join(" ", map substr(rand(),0,15), 1..6), "\n";
    }
    close $fh or die $!;
}
I don't see anything obviously wrong with your code. The problem is
probably with your data (hash key collisions) or with your system (reading
the file itself is very slow once you get beyond the part that is cached
from before, or some kind of VM problem, or something else).

Generating the file the way I did above, I saw absolutely no slowing at all
all the way up to a size of 1_000_000. What do you see if you use my
random data rather than your actual data?

It is super fast, just a few seconds (I tried just 100_000).

The file format is:

00000000 00000000 00000000 00000000 00000000 filename

The first five strings are hexadecimal numbers printed with %08X. The
filename is unique; there can't be duplicates.

I've tried to replace the key of the hash (from 'lc $fname' to '$cnt')
with no significant speed change - that speaks against hash collisions, IMO.

So, I did a few tests:
a) hash key 'lc $fname', only one 8byte string assignment to %H hash -
36 seconds
b) two assignments - 57 seconds
c) three assignments - 52 seconds
d) 4 assignments - 50 seconds
e) 5 assignments - 90 seconds
f) 6 assignments (including cca 40 bytes long fn) - 120 seconds
g) same as f, but with '$cnt' hashkey - 106 seconds
h) same as f, but 5 values replaced with substr( rand(),0,15) - 47 seconds
i) same as f, but 5 values replaced with random 32bit values sprintfed
as %08X (so 8 byte strings). - 120 seconds
j) same as f, but 5 values converted to numbers by 'hex'. - 51 seconds.

I initially thought that it was just the size of the inserted hash - but
that's obviously not true. The 15-byte strings from h) are definitely
longer than the 8-byte strings from i), yet it takes less time to build the
structure?

Regards,

Jindroush
 
robic0

Jindroush said:
It is super fast, just a few seconds (I tried just 100_000).

The file format is:

00000000 00000000 00000000 00000000 00000000 filename

The first five strings are hexadecimal numbers printed with %08X. The
filename is unique; there can't be duplicates.

I've tried to replace the key of the hash (from 'lc $fname' to '$cnt')
with no significant speed change - that speaks against hash collisions, IMO.

So, I did a few tests:
a) hash key 'lc $fname', only one 8byte string assignment to %H hash -
36 seconds
b) two assignments - 57 seconds
c) three assignments - 52 seconds
d) 4 assignments - 50 seconds
e) 5 assignments - 90 seconds
f) 6 assignments (including cca 40 bytes long fn) - 120 seconds
g) same as f, but with '$cnt' hashkey - 106 seconds
h) same as f, but 5 values replaced with substr( rand(),0,15) - 47 seconds
i) same as f, but 5 values replaced with random 32bit values sprintfed
as %08X (so 8 byte strings). - 120 seconds
j) same as f, but 5 values converted to numbers by 'hex'. - 51 seconds.

I initially thought that it was just the size of the inserted hash - but
that's obviously not true. The 15-byte strings from h) are definitely
longer than the 8-byte strings from i), yet it takes less time to build the
structure?

Regards,

Jindroush

Have you tried an anonymous hash instead of %H?
Also, according to the perldocs, they try to randomize the hash key
insertion order for security reasons. When you're dealing with that many
keys, well ...
 
Tassilo v. Parseval

Also sprach (e-mail address removed):
Jindroush wrote:

Have you tried an anonymous hash instead of %H?
Also, according to the perldocs, they try to randomize the hash key
insertion order for security reasons. When you're dealing with that many
keys, well ...

Hash key insertion is certainly not randomized per se. If it were, it would
be impossible to do hash-key look-ups. What more recent perls do is use a
random seed that is determined when the interpreter is started. That seed is
used as the initial value of the hash:

#define PERL_HASH(hash,str,len) \
    STMT_START { \
        register const char *s_PeRlHaSh_tmp = str; \
        register const unsigned char *s_PeRlHaSh = (const unsigned char *)s_PeRlHaSh_tmp; \
        register I32 i_PeRlHaSh = len; \
        register U32 hash_PeRlHaSh = PERL_HASH_SEED; \
        while (i_PeRlHaSh--) { \
            hash_PeRlHaSh += *s_PeRlHaSh++; \
            hash_PeRlHaSh += (hash_PeRlHaSh << 10); \
            hash_PeRlHaSh ^= (hash_PeRlHaSh >> 6); \
        } \
        hash_PeRlHaSh += (hash_PeRlHaSh << 3); \
        hash_PeRlHaSh ^= (hash_PeRlHaSh >> 11); \
        (hash) = (hash_PeRlHaSh + (hash_PeRlHaSh << 15)); \
    } STMT_END

Older versions of perl simply used 0 where it now says PERL_HASH_SEED.
This has absolutely no effect on the time it takes to insert and find keys.
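
Purely as an illustration (this is not from the post above), the same
one-at-a-time mixing can be rewritten in Perl, with the seed passed in
explicitly (0 on the older perls, PERL_HASH_SEED on newer ones):

    use strict;
    use warnings;

    # Illustrative Perl rendition of the PERL_HASH macro; the & 0xFFFFFFFF
    # masking keeps the arithmetic within 32 bits, as the C macro's U32 does.
    sub perl_hash {
        my ($str, $seed) = @_;
        my $h = defined $seed ? $seed : 0;
        for my $byte (unpack 'C*', $str) {
            $h = ($h + $byte) & 0xFFFFFFFF;
            $h = ($h + (($h << 10) & 0xFFFFFFFF)) & 0xFFFFFFFF;
            $h ^= $h >> 6;
        }
        $h = ($h + (($h << 3) & 0xFFFFFFFF)) & 0xFFFFFFFF;
        $h ^= $h >> 11;
        return ($h + (($h << 15) & 0xFFFFFFFF)) & 0xFFFFFFFF;
    }

    print perl_hash("file000001.dat"), "\n";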

Apart from that, the OP is using a perl from the 5.6 branch where hash
randomization wasn't yet in place.

Tassilo
 
Jindroush

Have you tried an anonymous hash instead of %H?

Yeah, I've tried; it's insignificantly faster.

I did measure it:

A)
CPU: 3.2GHz Intel with HT
OS: WinXP
Perl: 5.6.1 Active State
Time: 120 seconds

B)
CPU: 1.6GHz Intel Centrino (notebook)
OS: WinXP
Perl: 5.6.1 Active State
Time: 50 seconds

C)
CPU: AMD Athlon 2400+
OS: WinXP home
Perl: 5.8.7 ActiveState
Time: 2 seconds

D)
CPU: - (virtualized on my 3.2GHz)
OS: RH9
Perl: 5.8.1
Time: 3 seconds

I don't understand the results at all. Why is the 2x slower notebook 2x
faster? Why the extreme difference between 5.6.x and 5.8.x?

Jindroush
 
Tim Shoppa

Why the extreme difference between 5.6.x and 5.8.x?

The hashing algorithm used in Perl has been tweaked over the years as
"bad example" hash collisions have turned up in real keys.

How does 5.6.x do when presented with a fundamentally different set of
90000+ keys?
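
A hypothetical way to run that experiment (the key length and count below
just mirror what Jindroush described; the generator itself is made up) is to
build the same-sized hash with keys that look nothing like the real
filenames and see whether 5.6.1 still slows down:

    use strict;
    use warnings;

    my @chars = ('a' .. 'z', '0' .. '9');
    my %DB;
    for my $i (1 .. 91_000) {
        # random 40-character alphanumeric key, unlike the real filenames
        my $key = join '', map { $chars[ int rand @chars ] } 1 .. 40;
        $DB{$key} = { v1 => $i };
    }
    print scalar(%DB), "\n";   # bucket usage ("used/allocated") on old perls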

Tim.
 
Jindroush

Tim said:
The hashing algorithm used in Perl has been tweaked over the years as
"bad example" hash collisions have turned up in real keys.

How does 5.6.x do when presented with a fundamentally different set of
90000+ keys?

If I insert only empty hashes, it's 1 second.
If I insert the hash with the filename record, it takes 60 seconds.

I've tried replacing the filename (unique) with a counter (unique); there's
almost no difference.

Sorry, but I don't believe it's just the key collision problem.

Jindroush
 
robic0

If I insert only empty hashes, it's 1 second.
If I insert the hash with the filename record, it takes 60 seconds.

I've tried replacing the filename (unique) with a counter (unique); there's
almost no difference.

Sorry, but I don't believe it's just the key collision problem.

Jindroush

Then it's in the %H creation. You can't evaluate the whole thing if you
don't break it down and try something different. Don't use complex
programming constructs. Isolate just the creation and storing of the
anonymous hash references: push them onto a plain array first and time
that instead. Don't mix this creation with an I/O stream. When you have
timed that, start a new timer on just copying those references into the
values of the %DB hash. Do this simply, without complex key generation.

Something like this:

# start timer1 ..
for ($i = 0; $i < 900000; $i++)
{
    my $h = {'a' => 'asdf', 'b' => 'asdf', 'c' => 'asdf',
             'd' => 'asdf', 'e' => 'asdf', 'f' => 'asdf'};
    push(@tmp, $h);
    ## later use this instead
    ## ------------------------------------------------
    ## my $h = {'a' => 'asdf', 'a' => 'asdf', 'a' => 'asdf',
    ##          'a' => 'asdf', 'a' => 'asdf', 'a' => 'asdf'};
    ## push(@tmp, $h);
}
# end timer1 ..

# start timer2 ..
for ($i = 0; $i < 900000; $i++)
{
    $DB{$i} = $tmp[$i];
}
# end timer 2 ..

What kind of results do you get with these?
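
If it helps, the timer placeholders above could be filled in with
Time::HiRes (from CPAN, if the perl at hand doesn't bundle it), roughly like
this; the 900_000 count and the 'asdf' values are just the placeholders from
the snippet above:

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my (@tmp, %DB);

    my $t1 = time;
    for my $i (0 .. 899_999) {
        push @tmp, { a => 'asdf', b => 'asdf', c => 'asdf',
                     d => 'asdf', e => 'asdf', f => 'asdf' };
    }
    printf "building refs : %.2f s\n", time - $t1;

    my $t2 = time;
    for my $i (0 .. 899_999) {
        $DB{$i} = $tmp[$i];
    }
    printf "storing in %%DB: %.2f s\n", time - $t2;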
 
xhoster

Jindroush said:
It is super fast, just a few seconds (I tried just 100_000).

The file format is:

00000000 00000000 00000000 00000000 00000000 filename

The first five strings are hexadecimal numbers printed with %08X. The
filename is unique; there can't be duplicates.

I've tried to replace the key of the hash (from 'lc $fname' to '$cnt')
with no significant speed change - that speaks against hash collisions,
IMO.

So, I did a few tests:

If I understand this right, all the tests below are back to using your
data file again, right? And all are >10 times slower than the run above
using my on-the-fly generated file?

If that is the case, can you either put your data file on the web somewhere
so that we can reproduce your results below, or write a short segment (like
mine above) which creates a file from random data formatted like yours,
such that it reproduces your results below?
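
(For example, a generator along these lines would produce data in the format
described earlier - five %08X fields plus a unique filename; the path
pattern here is invented, so adjust it to resemble the real names:)

    use strict;
    use warnings;

    my $in = "file1";
    open my $fh, ">", $in or die "can't write $in: $!";
    for my $i (1 .. 91_000) {
        my @hex = map { sprintf "%08X", int rand 0xFFFFFFFF } 1 .. 5;
        # roughly 40-character unique filename, as in the description
        printf $fh "%s some/made/up/directory/file%06d.dat\n", join(' ', @hex), $i;
    }
    close $fh or die $!;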

a) hash key 'lc $fname', only one 8byte string assignment to %H hash -
36 seconds
b) two assignments - 57 seconds
c) three assignments - 52 seconds
d) 4 assignments - 50 seconds
e) 5 assignments - 90 seconds
f) 6 assignments (including cca 40 bytes long fn) - 120 seconds
g) same as f, but with '$cnt' hashkey - 106 seconds
h) same as f, but 5 values replaced with substr( rand(), 0, 15 ) - 47 seconds
i) same as f, but 5 values replaced with random 32bit values sprintfed as %08X (so 8 byte strings) - 120 seconds
j) same as f, but 5 values converted to numbers by 'hex' - 51 seconds.

Xho
 
