Which is faster - hash or array lookup

R

Rasmus Villemoes

Hi

In a program I'm writing, I will need to access the 53130 different
5-element subsets of 0..24 many many times. So I thought I would
generate each of these subsets once and for all, using a canonical
enumeration scheme, and save the result in a global hash or array
(Concretely, I have generated a file containing lines of the form
"0:0,1,2,3,4", "1:0,1,2,3,5" etc., which I then read and parse at the
start of the program, that is, I generate an anonymous array
[0,1,2,3,4], and then I store a reference to this in $combs{0} or
$combs[0].)

Now the question is, should I store these in an array or a hash,
if I want to access a random entry as fast as possible?
 
L

Leon Timmermans

Now the question is, should I store these in an array or a hash, if I
want to access a random entry as fast as possible?

Arrays are the simpler datastructure and thus faster. The question is a
bit odd though: you should just use what makes most sense (a hash indexed
by hashes usually doesn't make sense).

Regards,

Leon Timmermans
 
R

Rasmus Villemoes

bugbear said:
Are you sure there isn't a better algorithm for your (unstated)
purpose?

Well, that depends. I am suite sure that perl is not The Right Tool
for my overall purpose, but what I am trying to do probably does
involve accessing each of these combinations by number (or some other
key). I have decided to use perl for this pet project to help me learn
some more perl.

The project concerns a little game which is played on a 5x5
board. Each player has 5 pieces, and additionally there is a common
piece used by both players. Only one piece can occupy a square at any
time, so the positions of either player's pieces at any time is
exactly a 5-subset of 0..24. If I use a canonical numbering of these
5-subsets, I can describe the state of the game using 5 bytes (2*2
bytes to describe numbers in the range 0..53129, and 1 byte to
indicate the position of the common piece along with information on
whose turn it is). For any given state, a move from that state can be
described by 3 bytes (encoding the new positions of the player's
pieces and the common piece). But in order to compute which moves are
possible from a given state, I need the positions of the 11 pieces on
the board. It may not be smart to store anonymous arrays containing
the 5-subsets. Another possibility is to use a bitstring with exactly
5 1's. That would probably save a lot of memory, and may be faster to
use (I will often need to ask "in state $x, is there a piece on square
number $n". I guess vec() can answer that question if I use
bitstrings, whereas traversing an array looking for $n is slow).

Of course, I could skip the huge constant table, but then I would
either have to describe each state using more data, or compute the
occupied/empty squares each time.
 
X

xhoster

Rasmus Villemoes said:
Hi

In a program I'm writing, I will need to access the 53130 different
5-element subsets of 0..24 many many times. So I thought I would
generate each of these subsets once and for all, using a canonical
enumeration scheme, and save the result in a global hash or array
(Concretely, I have generated a file containing lines of the form
"0:0,1,2,3,4", "1:0,1,2,3,5" etc., which I then read and parse at the
start of the program, that is, I generate an anonymous array
[0,1,2,3,4], and then I store a reference to this in $combs{0} or
$combs[0].)

Now the question is, should I store these in an array or a hash,
if I want to access a random entry as fast as possible?

Assuming you want to access them randomly by index, rather than by
canonical representation, then the array will be faster (and cleaner). But
if what you do between accesses is anything but trivial, that will probably
dominate your run time.


$ time perl -le 'srand(0); my %x; @x{0..50_000}=(0..50_000);
foreach (1..1e7)
{ print $x{int rand scalar keys %x} }' > foo2
14.364u 0.144s 0:15.22 95.2% 0+0k 0+0io 0pf+0w

$ time perl -le 'srand(0); my @x=0..50_000; foreach (1..1e7)
{ print $x[rand @x] }' > foo1
10.960u 0.204s 0:11.43 97.6% 0+0k 0+0io 0pf+0w

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
S

sln

On 26 Jan 2009 21:06:33 GMT, (e-mail address removed) wrote:

{snip}
Assuming you want to access them randomly by index, rather than by
canonical representation, then the array will be faster (and cleaner). But
if what you do between accesses is anything but trivial, that will probably
dominate your run time.

{snip}

Alot of fab Techno Twins verbage. Unfortunately, it doesen't clearly state
any meaning, if one was offered in the first place.

sln
 
X

xhoster

Rasmus Villemoes said:
Well, that depends. I am suite sure that perl is not The Right Tool
for my overall purpose, but what I am trying to do probably does
involve accessing each of these combinations by number (or some other
key). I have decided to use perl for this pet project to help me learn
some more perl.

Well in that case, what I'd do (and have done) is implement them in several
different ways, run them, and compare the timings. It will teach you a lot
of Perl, especially those ineffable things hard to put into documentation
or USENET posts.
The project concerns a little game which is played on a 5x5
board. Each player has 5 pieces, and additionally there is a common
piece used by both players. Only one piece can occupy a square at any
time, so the positions of either player's pieces at any time is
exactly a 5-subset of 0..24. If I use a canonical numbering of these
5-subsets, I can describe the state of the game using 5 bytes (2*2
bytes to describe numbers in the range 0..53129, and 1 byte to
indicate the position of the common piece along with information on
whose turn it is).

Your scheme allows impossible boards, because some combinations of the
two players combinations (or one of them and the common piece) will be
incompatible. Couldn't you canonicalize at entire-board level, rather than
at each player individually?

For any given state, a move from that state can be
described by 3 bytes (encoding the new positions of the player's
pieces and the common piece).

Why would you want to scrimp on space here? Are you planning on storing
all possible moves from all possible boards in memory simultaneously? I'd
probably just plan on recomputing possible moves on the fly, rather than
storing, and so wouldn't care about storage space for this.
But in order to compute which moves are
possible from a given state, I need the positions of the 11 pieces on
the board. It may not be smart to store anonymous arrays containing
the 5-subsets. Another possibility is to use a bitstring with exactly
5 1's. That would probably save a lot of memory, and may be faster to
use (I will often need to ask "in state $x, is there a piece on square
number $n". I guess vec() can answer that question if I use
bitstrings, whereas traversing an array looking for $n is slow).

You can use a sparse array, checking defined $x[$n] rather than
grep $_==$n, @x, just like you can a sparse vector. Although I'd usually
use a hash for that purpose instead of an array. But if I'm understanding
this right, this choice is at a different level than the one you originally
asked.
Of course, I could skip the huge constant table, but then I would
either have to describe each state using more data, or compute the
occupied/empty squares each time.

The CPU still has to compute it each time, it is just a matter of how you
go about getting it to do so. It seems like you will have to go in both
directions rapidly, canonical->board->different board->canonical. Unless
you have something especially clever in mind, I'd think it would be less
work to go from human-readable-string-representation of board to
array/hash-representation of the board, then from some abstruse canonical
number to an array/hash representation. (And back)

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
E

Eric Pozharski

On 2009-01-26 said:
Now the question is, should I store these in an array or a hash,
if I want to access a random entry as fast as possible?

(damn it, it doesn't fit in terminal)

perl -we '
use Benchmark qw|cmpthese timethese|;
my(@ar, $hs);
$ar10[$_] = $_ for 0 .. 10;
$hs10{$_} = $_ for 0 .. 10;
$ar100[$_] = $_ for 0 .. 100;
$hs100{$_} = $_ for 0 .. 100;
$ar1000[$_] = $_ for 0 .. 1000;
$hs1000{$_} = $_ for 0 .. 1000;
$ar10000[$_] = $_ for 0 .. 10000;
$hs10000{$_} = $_ for 0 .. 10000;
my $t = timethese -10, {
hash10 => sub { my $x = $hs10{int rand 11} },
array10 => sub { my $x = $ar10[int rand 11] },
hash100 => sub { my $x = $hs100{int rand 101} },
array100 => sub { my $x = $ar100[int rand 101] },
hash1000 => sub { my $x = $hs1000{int rand 1001} },
array1000 => sub { my $x = $ar1000[int rand 1001] },
hash10000 => sub { my $x = $hs10000{int rand 10001} },
array10000 => sub { my $x = $ar10000[int rand 10001] },
};
cmpthese $t;
print qq|$_ | for map scalar %$_, \%hs10, \%hs100, \%hs1000, \%hs10000;
print "\n";
'
Benchmark: running array10, array100, array1000, array10000, hash10,
hash100, hash1000, hash10000 for at least 10 CPU seconds...
array10: 14 wallclock secs (10.69 usr + 0.19 sys = 10.88 CPU) @
359942.56/s (n=3916175)
array100: 14 wallclock secs ( 9.95 usr + 0.05 sys = 10.00 CPU) @
381987.30/s (n=3819873)
array1000: 14 wallclock secs (10.20 usr + 0.05 sys = 10.25 CPU) @
366950.15/s (n=3761239)
array10000: 14 wallclock secs (10.37 usr + 0.03 sys = 10.40 CPU) @
344956.92/s (n=3587552)
hash10: 13 wallclock secs ( 9.98 usr + 0.04 sys = 10.02 CPU) @
284447.80/s (n=2850167)
hash100: 14 wallclock secs (10.16 usr + 0.04 sys = 10.20 CPU) @
281098.04/s (n=2867200)
hash1000: 13 wallclock secs (10.25 usr + 0.04 sys = 10.29 CPU) @
268208.65/s (n=2759867)
hash10000: 14 wallclock secs (10.08 usr + 0.05 sys = 10.13 CPU) @
208555.18/s (n=2112664)
Rate hash10000 hash1000 hash100 hash10 array10000 array10 array1000 array100
hash10000 208555/s -- -22% -26% -27% -40% -42% -43% -45%
hash1000 268209/s 29% -- -5% -6% -22% -25% -27% -30%
hash100 281098/s 35% 5% -- -1% -19% -22% -23% -26%
hash10 284448/s 36% 6% 1% -- -18% -21% -22% -26%
array10000 344957/s 65% 29% 23% 21% -- -4% -6% -10%
array10 359943/s 73% 34% 28% 27% 4% -- -2% -6%
array1000 366950/s 76% 37% 31% 29% 6% 2% -- -4%
array100 381987/s 83% 42% 36% 34% 11% 6% 4% --
10/16 74/128 630/1024 7391/16384

Please note, that systime varies a lot for me (YMMV). That's 5.10.0.
While for 5.8.8 (even on P200 64M (that's router)) systime is C<0.00>
*always*. Go figure.

However, even I<array10000> is 20 .. 40 percent faster that I<hash10>.
 
R

Rasmus Villemoes

Your scheme allows impossible boards, because some combinations of the
two players combinations (or one of them and the common piece) will be
incompatible. Couldn't you canonicalize at entire-board level, rather than
at each player individually?

Of course, there are only C(25,5) * C(20,5) * 15 * 2 different
possible states of the game (C(n,k) = binomial n,k), but this number
is greater then 2**32, so in order to distinguish all states one needs
at least five bytes. My scheme uses five bytes, and has the advantage
that I do not have to recompute the other player's positions after
each move, so I think that is rather efficient.

I have thought about doing a little sanity-check once-in-a-while;
or'ing bitstrings together representing the positions of the pieces
must always result in a bitstring of Hamming weight 11.
Why would you want to scrimp on space here? Are you planning on storing
all possible moves from all possible boards in memory simultaneously?

Not all, but a lot. Most of the possible configurations of the board
will never be reached in a real game, so I plan on making the program
play against itself starting from the starting position, and marking
winning/losing positions as it goes. In the next game, if a player
reaches a position where he can make a move to a losing position for
the other player, the current position can be marked as winning for
him (along with the "proof" in the form of the winning move). This
way, I hope only to build a small fraction of the graph.
I'd probably just plan on recomputing possible moves on the fly,
rather than storing, and so wouldn't care about storage space for
this.

I do not plan on making this program run only once, so I will store
whatever I have found out on disk so that I can resume. Also, this has
the advantage that I can delete moves which are known to lead to a
winning position for the other player; if a (known) position then at some
point has no possible moves, it is effectively the same as if the
player has lost at that time.
But in order to compute which moves are possible from a given
state, I need the positions of the 11 pieces on the board.

You can use a sparse array, checking defined $x[$n] rather than
grep $_==$n, @x, just like you can a sparse vector. Although I'd usually
use a hash for that purpose instead of an array.

Thanks. I'll keep that in mind.
The CPU still has to compute it each time, it is just a matter of how you
go about getting it to do so. It seems like you will have to go in both
directions rapidly, canonical->board->different board->canonical.

I have made some thoughts about how to make that last step
efficiently, but I haven't come to any decision yet. I will probably
end up implementing the whole thing in some way, and then find out
that it should be done in a completely different fashion.
Unless you have something especially clever in mind, I'd think it
would be less work to go from human-readable-string-representation
of board to array/hash-representation of the board, then from some
abstruse canonical number to an array/hash representation. (And
back)

I don't think the canonical numbers are so bad. If at any point I want
the program to produce something human readable, the conversion from
canonical numbers to a pretty-printed board is not (much) slower than
printing to a terminal, if at all.


Thanks a lot for your input.
 
X

Xho Jingleheimerschmidt

Rasmus said:
Of course, there are only C(25,5) * C(20,5) * 15 * 2 different
possible states of the game (C(n,k) = binomial n,k), but this number
is greater then 2**32, so in order to distinguish all states one
needs at least five bytes.

You probably don't need to multiply by 2, as game
(12345, 34567, 12, 1's move) is the same as game
(34567, 12345, 12, 2's move)

Also, if the rules of your games are invariant to flipping or rotating
the board, you can reduce the number of boards to 6/25ths by forcing,
for example, the common piece to be in one of 6 squares, for example the
upper right triangle of the upper left quadrant. That would put the
number below 2**32. Shaving one byte off the 5-byte representation
probably isn't all that important, but shaving the number of
representations to be memorized might be.
Not all, but a lot. Most of the possible configurations of the board
will never be reached in a real game, so I plan on making the program
play against itself starting from the starting position, and marking
winning/losing positions as it goes. In the next game, if a player
reaches a position where he can make a move to a losing position for
the other player, the current position can be marked as winning for
him (along with the "proof" in the form of the winning move). This
way, I hope only to build a small fraction of the graph.

I don't think this will work the way you want it to. If the computer
plays itself, rather than a strategic human, and the computer has no
strategy other than brute force memorization of previous plays, then the
computer will probably end up sampling all game-space which is
obtainable via a sequence of legal moves, not just the much smaller
game-space which is likely to be reached through playing against a
strategic human. Given that, I'd probably just start with the have-won
boards and work backwards, making a exhaustive record of all boards and
what a winning move, if any, there is from that board.
I do not plan on making this program run only once, so I will store
whatever I have found out on disk so that I can resume.

Ah, I see. So you have basically a 5-byte hash key (current board) with
a 3-byte value (winning move), from which you compute the next 5-byte
key (next board). I was thinking of just storing the 5 byte next board
as the value, and computing the move from the two boards. Your way is
definitely better than that. If you come up with a perfect hashing
algorithm where there cannot be collisions, you don't even need to store
the board bytes as you know what you searched for, so it takes only 3
"move" bytes per possible board. Of course at this point we are no
longer talking about Perl's hash tables, but some custom implementation.

Also, this
has the advantage that I can delete moves which are known to lead to
a winning position for the other player; if a (known) position then
at some point has no possible moves, it is effectively the same as if
the player has lost at that time.

This can lead to unintuitive, but interesting, play. Once the computer
realizes that it has "inevitably" lost (to a perfect opponent), what
does it do? Just make completely random moves? Try to come up with a
"bluffing" recovery strategy that is most likely to result in a real
(imperfect) opponent failing to realize the existence of the winning
end-game?

The possibilities are endless and fascinating. Much more interesting
than the relative speed of hash lookups versus array lookups. ;)

Xho
 
R

Rasmus Villemoes

Xho Jingleheimerschmidt said:
You probably don't need to multiply by 2, as game
(12345, 34567, 12, 1's move) is the same as game
(34567, 12345, 12, 2's move)

No, not necessarily...
Also, if the rules of your games are invariant to flipping or
rotating the board,

because the game is not invariant under rotations or flipping. In
fact, one goal of the game is to make the common piece end on the
opponent's back row (the five squares closest to him). So if the
squares occupied by (34567,12345) (in either order) would allow the
common piece to be moved to 1's back row, 2 would be winning, whereas
if it was 1's turn, he would have a chance to avoid losing.
Shaving one byte off the 5-byte representation probably isn't all
that important, but shaving the number of representations to be
memorized might be.

There is one symmetry that might be exploited; the game is mirror
symmetric around the axis joining (1,3)--(5,3). So one might at any
time assume that the common piece is in one of the 9 squares spanned
by (2,1),(2,3),(4,3),(4,1) (no need to memorize positions where the
common piece is at either back row, because those are game over
positions).
I don't think this will work the way you want it to. If the computer
plays itself, rather than a strategic human, and the computer has no
strategy other than brute force memorization of previous plays, then the
computer will probably end up sampling all game-space which is
obtainable via a sequence of legal moves, not just the much smaller
game-space which is likely to be reached through playing against a
strategic human.

From my experience of playing this game in real life, I have
discovered some heuristics that I plan to implement. It is almost
always a good idea to make a move so that the opponent has as few
places to move the common piece, so that would be one way to avoid
completely random play. There are a few others. If I restart the game
often enough (say, the first 10 million games are declared a draw if
no winner is found within 10 moves), many of the positions one would
need to go through to reach the unrealistic ones will have been marked
as winning, and so will be avoided in any subsequent games.
Given that, I'd probably just start with the have-won boards and
work backwards, making a exhaustive record of all boards and what a
winning move, if any, there is from that board.

I have thought about doing it that way, but the number of have-won
boards is rather huge (by the above, at least any board where the
common piece is at one of the 10 back row squares, which makes 2e7;
there are other ways to win, also). The problem of determining if
there is a winning strategy (nothing in the rules prevents the players
from entering a cycle which can only be broken if one of the players
resign) is "just" a question of coloring a huge graph by some
recursive algorithm.
This can lead to unintuitive, but interesting, play. Once the
computer realizes that it has "inevitably" lost (to a perfect
opponent), what
does it do? Just make completely random moves? Try to come up with a
"bluffing" recovery strategy that is most likely to result in a real
(imperfect) opponent failing to realize the existence of the winning
end-game?

We are quite OT. If you like I'd gladly tell you the exact rules of
the game, but then I think we should continue by private email or find
a more suitable group.
 
S

sln

Hi

In a program I'm writing, I will need to access the 53130 different
5-element subsets of 0..24 many many times. So I thought I would
generate each of these subsets once and for all, using a canonical
enumeration scheme, and save the result in a global hash or array
(Concretely, I have generated a file containing lines of the form
"0:0,1,2,3,4", "1:0,1,2,3,5" etc., which I then read and parse at the
start of the program, that is, I generate an anonymous array
[0,1,2,3,4], and then I store a reference to this in $combs{0} or
$combs[0].)

Now the question is, should I store these in an array or a hash,
if I want to access a random entry as fast as possible?

Which is faster? The hash bin method or traversing arrays?

I can tell you positively the array is faster when you have 1 element.

Maybe a hybrid, have you thought about that?

sln
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

Forum statistics

Threads
473,776
Messages
2,569,603
Members
45,189
Latest member
CryptoTaxSoftware

Latest Threads

Top