Fastest way to build a hash

Stefan Fischerländer · Jul 10, 2003

I've got the following situation:

My data is stored in a file; 5 bytes form one record. The bytes 0 to 3
are a long integer, which is regarded as the key, wheras byte 4 is the
value. I have to read this data structure into a hash.

What I'm doing at the moment is:
open(IN,"<file");
while(read(IN, $buf, 5))
{
$hash{unpack("L", substr($buf,0,4))} = substr($buf,4,1);
}
close(IN);

This takes about 5 seconds to read a file with about 100.000 records,
which is to long for my needs. (Celeron 800, IDE 7200rpm)

I figured out that the bottleneck isn't the I/O, because this version
takes as long as the script above:
open(IN,"<file")
$bread = read(IN, $buf, 1000000) / 5;
close(IN);
for($i=0;$i<$bread;$i++)
{
$hash{unpack("L", substr($buf,$i*5,4))} = substr($buf,$i*5+4,1);
}

Does anyone have any ideas how to build this hash faster? It would
also be possible to change the format of the data file. I already did
experiments with store and retrieve from the Storable module and with
a permanent hash with DB_File. In both cases the data file grew, while
the script was even slower.

Stefan

Eric J. Roode · Jul 10, 2003

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

(e-mail address removed) (Stefan Fischerländer) wrote in

I've got the following situation:

My data is stored in a file; 5 bytes form one record. The bytes 0 to 3
are a long integer, which is regarded as the key, wheras byte 4 is the
value. I have to read this data structure into a hash.

What I'm doing at the moment is:
open(IN,"<file");
while(read(IN, $buf, 5))
{
$hash{unpack("L", substr($buf,0,4))} = substr($buf,4,1);
}
close(IN);

This takes about 5 seconds to read a file with about 100.000 records,
which is to long for my needs. (Celeron 800, IDE 7200rpm)

Untested, but my first thought is:

$/ = undef;
$raw = <IN>; # slurp whole file
%hash = $raw =~ /(....)(.)/gs; # split into 4, 1 byte pairs

(Note: untested)

- --
Eric
$_ = reverse sort qw p ekca lre Js reh ts
p, $/.r, map $_.$", qw e p h tona e; print

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

iQA/AwUBPwy1mWPeouIeTNHoEQKsSgCg2l2gpuew2T2wuY79MCaIyDdMLsgAn1Ke
/fZKEVdWPNi7oJOpwLhmQNII
=zYBt
-----END PGP SIGNATURE-----

Eric J. Roode · Jul 10, 2003

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Eric J. Roode said:
(e-mail address removed) (Stefan Fischerländer) wrote in

Untested, but my first thought is:

$/ = undef;
$raw = <IN>; # slurp whole file
%hash = $raw =~ /(....)(.)/gs; # split into 4, 1 byte pairs

On second thought, splitting the whole file up into pairs and then
assigning it to a hash is a waste, I think. Perhaps:

$/ = undef;
$raw = <IN>;
$hash{$1} = $2 while $raw =~ /(....)(.)/gs;

(again, untested).

- --
Eric
$_ = reverse sort qw p ekca lre Js reh ts
p, $/.r, map $_.$", qw e p h tona e; print

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

iQA/AwUBPwzB6GPeouIeTNHoEQIU4ACeMHYALFq/lwuPglJ+0hfPmx7wcKIAn3Sf
vV6dH8uXJfGEGtYnoy+/PXKN
=TsTT
-----END PGP SIGNATURE-----

Martien Verbruggen · Jul 10, 2003

I've got the following situation:

My data is stored in a file; 5 bytes form one record. The bytes 0 to 3
are a long integer, which is regarded as the key, wheras byte 4 is the
value. I have to read this data structure into a hash.

What I'm doing at the moment is:
open(IN,"<file");
while(read(IN, $buf, 5))
{
$hash{unpack("L", substr($buf,0,4))} = substr($buf,4,1);
}
close(IN);

[OP also states that IO is not the bottleneck]

Maybe this is faster (all code untested, mainly because I didn't feel
like creating a test data set):

my ($key, $val) = unpack "La1", $buf;
$hash{$key} = $val;

or maybe

%hash = %hash, unpack "La1", $buf;

Alternatively, you could read in more records at once:

my $n_buf = 10; # Adjust to needs
my $pack_template = "La1" x $n_buf;

while(IN, $buf, 5 * $n_buf)
{
%vals = unpack $pack_template, $buf;
%hash{keys %vals} = values %vals;
}

Or in the loop body, something like:

%hash = %hash, unpack $pack_template, $buf;

If you do this, you might need to make sure that your boundary
conditions (the last group of things read from the file) are ok.

You could of course read the whole file in memory in one go. get the
file size, divide by 5, and use the result as the value for $n_buf up
there. No need to loop, or incrementally change %hash then:

open(IN, "file") or die $!;
my $n_bytes = -s IN;
my $pack_template = "La1" x ($n_bytes/5);
read(IN, $buf, $n_bytes); # do error checking
%hash = unpack $pack_template, $buf;

Or maybe something like: (again, this is untested)

%hash = do {
local ($/, *IN);
open(IN, "file") or die $!;
my $n_bytes = -s IN;
my $pack_template = "La1" x ($n_bytes/5);
unpack $pack_template, <IN>;
};

Martien

Stefan Fischerländer · Jul 10, 2003

Eric J. Roode said:
On second thought, splitting the whole file up into pairs and then
assigning it to a hash is a waste, I think. Perhaps:

$/ = undef;
$raw = <IN>;
$hash{$1} = $2 while $raw =~ /(....)(.)/gs;

Both programs work, but unfortunately they need exactly as long as my
initial program needs.

Thank you for your ideas anyway.
Stefan

Vlad Tepes · Jul 10, 2003

Stefan Fischerländer said:
I've got the following situation:

My data is stored in a file; 5 bytes form one record. The bytes 0 to 3
are a long integer, which is regarded as the key, wheras byte 4 is the
value. I have to read this data structure into a hash.

What I'm doing at the moment is:
open(IN,"<file");
while(read(IN, $buf, 5))
{
$hash{unpack("L", substr($buf,0,4))} = substr($buf,4,1);
}
close(IN);

This takes about 5 seconds to read a file with about 100.000 records,
which is to long for my needs. (Celeron 800, IDE 7200rpm)

( snip )

Does anyone have any ideas how to build this hash faster? It would
also be possible to change the format of the data file. I already did
experiments with store and retrieve from the Storable module and with
a permanent hash with DB_File. In both cases the data file grew, while
the script was even slower.

You should store the data in a way that requires minimal processing to
build the hash. I tried using keys and values separated with newlines,
and built a similar hash in a little over 1 second with:

open my $fh, "hash.dat" or die;
undef $/;
my %hash = split /\n/, <$fh>;

Stefan Fischerländer · Jul 10, 2003

Martien Verbruggen said:
my ($key, $val) = unpack "La1", $buf;
$hash{$key} = $val;

As fast/slow as my initial code.

%hash = %hash, unpack "La1", $buf;

Much slower!

open(IN, "file") or die $!;
my $n_bytes = -s IN;
my $pack_template = "La1" x ($n_bytes/5);
read(IN, $buf, $n_bytes); # do error checking
%hash = unpack $pack_template, $buf;

This one is faster. With an additional
keys(%hash) = (-s IN)/5;
runtime decreases from 3.2 to 1.9 seconds.

%hash = do {
local ($/, *IN);
open(IN, "file") or die $!;
my $n_bytes = -s IN;
my $pack_template = "La1" x ($n_bytes/5);
unpack $pack_template, <IN>;
};

A little bit slower than the code above: 2.4 seconds.

Thank you for your various ideas!
Stefan

Stefan Fischerländer · Jul 10, 2003

This is what made the program faster, not the pre-allocation of memory
for the hash.

I did something wrong again: If you want to pre-allocate memory, you
should use the same hash in the keys statement as you are using later
in your loop ...

keys(%hash) = (-s IN) / 5;
while(read IN, $buf, 4)
{
$hash{$buf} = getc IN;
}

This needs just 2.0 seconds - thank you very much.

Stefan Fischerländer · Jul 10, 2003

Eric J. Roode said:
Okay, this is the fastest I've come up with yet:
open IN, $input_file or die "Can't read $input_file: $!\n";
$/ = \5; # set input record length to 5 bytes
my %hash;
$hash{$_} = chop while <>;

Took some time to figure out that the last line must read:
$hash{$_} = chop while <IN>;

This solutions is also rather quick, takes 2.0 seconds.

I now have three different solutions (this is the Perl way *g*) that
are nearly equally fast and about 60% faster than my initial code.

Thank you all very much!

Stefan

Eric J. Roode · Jul 10, 2003

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

(e-mail address removed) (Stefan Fischerländer) wrote in

Took some time to figure out that the last line must read:
$hash{$_} = chop while <IN>;

Oops, yeah, sorry about that.

- --
Eric
$_ = reverse sort qw p ekca lre Js reh ts
p, $/.r, map $_.$", qw e p h tona e; print

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

iQA/AwUBPw3d0WPeouIeTNHoEQIA6ACbB9ewoSHwXns6RFyQd/m6RA8syDUAoJMd
2sQ3cjurQs+EXDekC6NJFowz
=RzlF
-----END PGP SIGNATURE-----

Stefan Fischerländer · Jul 11, 2003

Vlad Tepes said:
You should store the data in a way that requires minimal processing to
build the hash. I tried using keys and values separated with newlines,
and built a similar hash in a little over 1 second with:
open my $fh, "hash.dat" or die;
undef $/;
my %hash = split /\n/, <$fh>;

On my (slow) system it takes about 2.1 seconds and is nearly as fast
as the other solutions. Now I can choose from four different, very
fast programs.

Thank you all for your ideas and code snippets!

Stefan

Trying to build a SARIMAX model to forecast the S&P500 trend	0	Nov 5, 2023
Is there a way to pass this state from component to the fetch?	1	Apr 24, 2023
Fastest way to send file to browser	6	Mar 11, 2008
laziest / fastest way to match last characters of a string	5	Sep 11, 2008
fastest way to monitor a dir	4	Nov 7, 2008
passing a reference to a hash to another page	2	May 4, 2012
Read a hash	5	Apr 28, 2011
PHP RSS Feed Aggregator changing to todays date everytime feed is aggregated	1	Jan 11, 2022

Fastest way to build a hash

Stefan Fischerländer

Eric J. Roode

Eric J. Roode

Martien Verbruggen

Stefan Fischerländer

Vlad Tepes

Stefan Fischerländer

Stefan Fischerländer

Stefan Fischerländer

Eric J. Roode

Stefan Fischerländer

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads