Fastest way to build a hash

  • Thread starter Stefan Fischerländer
  • Start date
S

Stefan Fischerländer

I've got the following situation:

My data is stored in a file; 5 bytes form one record. The bytes 0 to 3
are a long integer, which is regarded as the key, wheras byte 4 is the
value. I have to read this data structure into a hash.

What I'm doing at the moment is:
open(IN,"<file");
while(read(IN, $buf, 5))
{
$hash{unpack("L", substr($buf,0,4))} = substr($buf,4,1);
}
close(IN);

This takes about 5 seconds to read a file with about 100.000 records,
which is to long for my needs. (Celeron 800, IDE 7200rpm)

I figured out that the bottleneck isn't the I/O, because this version
takes as long as the script above:
open(IN,"<file")
$bread = read(IN, $buf, 1000000) / 5;
close(IN);
for($i=0;$i<$bread;$i++)
{
$hash{unpack("L", substr($buf,$i*5,4))} = substr($buf,$i*5+4,1);
}

Does anyone have any ideas how to build this hash faster? It would
also be possible to change the format of the data file. I already did
experiments with store and retrieve from the Storable module and with
a permanent hash with DB_File. In both cases the data file grew, while
the script was even slower.

Stefan
 
E

Eric J. Roode

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

(e-mail address removed) (Stefan Fischerländer) wrote in
I've got the following situation:

My data is stored in a file; 5 bytes form one record. The bytes 0 to 3
are a long integer, which is regarded as the key, wheras byte 4 is the
value. I have to read this data structure into a hash.

What I'm doing at the moment is:
open(IN,"<file");
while(read(IN, $buf, 5))
{
$hash{unpack("L", substr($buf,0,4))} = substr($buf,4,1);
}
close(IN);

This takes about 5 seconds to read a file with about 100.000 records,
which is to long for my needs. (Celeron 800, IDE 7200rpm)

Untested, but my first thought is:

$/ = undef;
$raw = <IN>; # slurp whole file
%hash = $raw =~ /(....)(.)/gs; # split into 4, 1 byte pairs

(Note: untested)

- --
Eric
$_ = reverse sort qw p ekca lre Js reh ts
p, $/.r, map $_.$", qw e p h tona e; print

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

iQA/AwUBPwy1mWPeouIeTNHoEQKsSgCg2l2gpuew2T2wuY79MCaIyDdMLsgAn1Ke
/fZKEVdWPNi7oJOpwLhmQNII
=zYBt
-----END PGP SIGNATURE-----
 
E

Eric J. Roode

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Eric J. Roode said:
(e-mail address removed) (Stefan Fischerländer) wrote in


Untested, but my first thought is:

$/ = undef;
$raw = <IN>; # slurp whole file
%hash = $raw =~ /(....)(.)/gs; # split into 4, 1 byte pairs

On second thought, splitting the whole file up into pairs and then
assigning it to a hash is a waste, I think. Perhaps:

$/ = undef;
$raw = <IN>;
$hash{$1} = $2 while $raw =~ /(....)(.)/gs;

(again, untested).

- --
Eric
$_ = reverse sort qw p ekca lre Js reh ts
p, $/.r, map $_.$", qw e p h tona e; print

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

iQA/AwUBPwzB6GPeouIeTNHoEQIU4ACeMHYALFq/lwuPglJ+0hfPmx7wcKIAn3Sf
vV6dH8uXJfGEGtYnoy+/PXKN
=TsTT
-----END PGP SIGNATURE-----
 
M

Martien Verbruggen

I've got the following situation:

My data is stored in a file; 5 bytes form one record. The bytes 0 to 3
are a long integer, which is regarded as the key, wheras byte 4 is the
value. I have to read this data structure into a hash.

What I'm doing at the moment is:
open(IN,"<file");
while(read(IN, $buf, 5))
{
$hash{unpack("L", substr($buf,0,4))} = substr($buf,4,1);
}
close(IN);

[OP also states that IO is not the bottleneck]

Maybe this is faster (all code untested, mainly because I didn't feel
like creating a test data set):

my ($key, $val) = unpack "La1", $buf;
$hash{$key} = $val;

or maybe

%hash = %hash, unpack "La1", $buf;

Alternatively, you could read in more records at once:

my $n_buf = 10; # Adjust to needs
my $pack_template = "La1" x $n_buf;

while(IN, $buf, 5 * $n_buf)
{
%vals = unpack $pack_template, $buf;
%hash{keys %vals} = values %vals;
}

Or in the loop body, something like:

%hash = %hash, unpack $pack_template, $buf;

If you do this, you might need to make sure that your boundary
conditions (the last group of things read from the file) are ok.

You could of course read the whole file in memory in one go. get the
file size, divide by 5, and use the result as the value for $n_buf up
there. No need to loop, or incrementally change %hash then:

open(IN, "file") or die $!;
my $n_bytes = -s IN;
my $pack_template = "La1" x ($n_bytes/5);
read(IN, $buf, $n_bytes); # do error checking
%hash = unpack $pack_template, $buf;

Or maybe something like: (again, this is untested)

%hash = do {
local ($/, *IN);
open(IN, "file") or die $!;
my $n_bytes = -s IN;
my $pack_template = "La1" x ($n_bytes/5);
unpack $pack_template, <IN>;
};

Martien
 
S

Stefan Fischerländer

Eric J. Roode said:
On second thought, splitting the whole file up into pairs and then
assigning it to a hash is a waste, I think. Perhaps:

$/ = undef;
$raw = <IN>;
$hash{$1} = $2 while $raw =~ /(....)(.)/gs;

Both programs work, but unfortunately they need exactly as long as my
initial program needs.

Thank you for your ideas anyway.
Stefan
 
V

Vlad Tepes

Stefan Fischerländer said:
I've got the following situation:

My data is stored in a file; 5 bytes form one record. The bytes 0 to 3
are a long integer, which is regarded as the key, wheras byte 4 is the
value. I have to read this data structure into a hash.

What I'm doing at the moment is:
open(IN,"<file");
while(read(IN, $buf, 5))
{
$hash{unpack("L", substr($buf,0,4))} = substr($buf,4,1);
}
close(IN);

This takes about 5 seconds to read a file with about 100.000 records,
which is to long for my needs. (Celeron 800, IDE 7200rpm)

( snip )
Does anyone have any ideas how to build this hash faster? It would
also be possible to change the format of the data file. I already did
experiments with store and retrieve from the Storable module and with
a permanent hash with DB_File. In both cases the data file grew, while
the script was even slower.

You should store the data in a way that requires minimal processing to
build the hash. I tried using keys and values separated with newlines,
and built a similar hash in a little over 1 second with:

open my $fh, "hash.dat" or die;
undef $/;
my %hash = split /\n/, <$fh>;
 
S

Stefan Fischerländer

Martien Verbruggen said:
my ($key, $val) = unpack "La1", $buf;
$hash{$key} = $val;

As fast/slow as my initial code.

%hash = %hash, unpack "La1", $buf;

Much slower!

open(IN, "file") or die $!;
my $n_bytes = -s IN;
my $pack_template = "La1" x ($n_bytes/5);
read(IN, $buf, $n_bytes); # do error checking
%hash = unpack $pack_template, $buf;

This one is faster. With an additional
keys(%hash) = (-s IN)/5;
runtime decreases from 3.2 to 1.9 seconds.

%hash = do {
local ($/, *IN);
open(IN, "file") or die $!;
my $n_bytes = -s IN;
my $pack_template = "La1" x ($n_bytes/5);
unpack $pack_template, <IN>;
};

A little bit slower than the code above: 2.4 seconds.


Thank you for your various ideas!
Stefan
 
S

Stefan Fischerländer

This is what made the program faster, not the pre-allocation of memory
for the hash.

I did something wrong again: If you want to pre-allocate memory, you
should use the same hash in the keys statement as you are using later
in your loop ...

keys(%hash) = (-s IN) / 5;
while(read IN, $buf, 4)
{
$hash{$buf} = getc IN;
}

This needs just 2.0 seconds - thank you very much.
 
S

Stefan Fischerländer

Eric J. Roode said:
Okay, this is the fastest I've come up with yet:
open IN, $input_file or die "Can't read $input_file: $!\n";
$/ = \5; # set input record length to 5 bytes
my %hash;
$hash{$_} = chop while <>;

Took some time to figure out that the last line must read:
$hash{$_} = chop while <IN>;

This solutions is also rather quick, takes 2.0 seconds.


I now have three different solutions (this is the Perl way *g*) that
are nearly equally fast and about 60% faster than my initial code.

Thank you all very much!

Stefan
 
E

Eric J. Roode

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

(e-mail address removed) (Stefan Fischerländer) wrote in
Took some time to figure out that the last line must read:
$hash{$_} = chop while <IN>;

Oops, yeah, sorry about that.

- --
Eric
$_ = reverse sort qw p ekca lre Js reh ts
p, $/.r, map $_.$", qw e p h tona e; print

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

iQA/AwUBPw3d0WPeouIeTNHoEQIA6ACbB9ewoSHwXns6RFyQd/m6RA8syDUAoJMd
2sQ3cjurQs+EXDekC6NJFowz
=RzlF
-----END PGP SIGNATURE-----
 
S

Stefan Fischerländer

Vlad Tepes said:
You should store the data in a way that requires minimal processing to
build the hash. I tried using keys and values separated with newlines,
and built a similar hash in a little over 1 second with:
open my $fh, "hash.dat" or die;
undef $/;
my %hash = split /\n/, <$fh>;

On my (slow) system it takes about 2.1 seconds and is nearly as fast
as the other solutions. Now I can choose from four different, very
fast programs.

Thank you all for your ideas and code snippets!

Stefan
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,770
Messages
2,569,586
Members
45,088
Latest member
JeremyMedl

Latest Threads

Top