Efficiency of using long integers to hold bitmaps

Jeff Melvaine

I note that I can write expressions like "1 << 100" and the result is stored
as a long integer, which means it is stored as an integer of arbitrary
length. I may need to use a large number of these, and am interested to
know whether the storage efficiency of long integers is in danger of
breaking my code if I use too many. Would I do better to write a class that
defines bitwise operations on arrays of integers, each integer being assumed
to contain at most 32 bits? I cannot find information in the Python manuals
for 2.4.1 that would allow me to resolve this question; perhaps the
intention is that programmers should not rely on implementation details.

Thanks in advance,

Jeff
 
Raymond Hettinger

[Jeff Melvaine]
I note that I can write expressions like "1 << 100" and the result is stored
as a long integer, which means it is stored as an integer of arbitrary
length. I may need to use a large number of these, and am interested to
know whether the storage efficiency of long integers is in danger of
breaking my code if I use too many. Would I do better to write a class that
defines bitwise operations on arrays of integers, each integer being assumed
to contain at most 32 bits?


Both array() objects and long integers are equally space efficient.
In contrast, a list of integers takes up a lot of space (the list is
stored as an array of pointers to individual integer objects).
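
(A rough way to see this on a newer Python than the 2.4.1 in question: sys.getsizeof only appeared in 2.6, and the exact byte counts below are implementation details that vary between builds.)

import array
import sys

NBITS = 100000

as_long = 1 << NBITS                                      # one arbitrary-precision integer
as_array = array.array('I', [0]) * ((NBITS + 31) // 32)   # 32-bit unsigned words
as_list = [0] * NBITS                                     # one (pointer to an) int per bit

print("long/int bitmap : %d bytes" % sys.getsizeof(as_long))
print("array('I')      : %d bytes" % sys.getsizeof(as_array))
print("list of ints    : %d bytes, pointers only" % sys.getsizeof(as_list))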

Raymond
 
Bengt Richter

I note that I can write expressions like "1 << 100" and the result is stored
as a long integer, which means it is stored as an integer of arbitrary
length. I may need to use a large number of these, and am interested to
know whether the storage efficiency of long integers is in danger of
breaking my code if I use too many. Would I do better to write a class that
defines bitwise operations on arrays of integers, each integer being assumed
to contain at most 32 bits? I cannot find information in the Python manuals
for 2.4.1 that would allow me to resolve this question; perhaps the
intention is that programmers should not rely on implementation details.

Thanks in advance,
Sounds like a possible^H^H^H^H^H^H^H^Hprobable premature optimization worry ;-)

What is a "large number of these" going to amount to? How many, tops?
And how many bits in each? How many operations between them? (Since integers
are immutable, operations mean allocation of space for new ones for results
and disposing of unused garbage ones (probably back to a special fast pool for
integers and longs)). Are you interested in a speed/memory tradeoff?

If your bit vectors are extremely large and sparse (have only a few bits "on"),
you might consider sets (import sets and help(sets)) of the bit numbers as representations.
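
(A minimal sketch of that idea using the builtin set type, which does the same job as the old sets module: a sparse bitmap becomes the set of indices of the bits that are on, and the bitwise operators become set operations.)

a = set([3, 70, 500])          # bits 3, 70 and 500 set
b = set([70, 9999])

print(a & b)                   # bitwise AND  -> set intersection
print(a | b)                   # bitwise OR   -> set union
print(a - b)                   # a AND NOT b  -> set difference

def to_bitmap(bits):
    # Convert the sparse representation back to a long integer bitmap.
    n = 0
    for i in bits:
        n |= 1 << i
    return n

print(hex(to_bitmap(a & b)))   # only bit 70 survives the AND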

BTW, I wonder if anyone has written an ordered bit set class in C yet. I was tempted ;-)

How much memory do you have? Buying more can be a pretty cheap way of solving space worries
if you are getting paid for your time.

You should be able to subclass int or long as a way of writing your program in terms of
your own bit vector class. Then you can change your mind and change the underlying representation
without changing the code that uses the api. Compared to plain longs it will cost you some speed
to keep converting results to your own type though.
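
(A minimal sketch of that approach, in the Python 2 syntax of the thread; in Python 3 you would subclass int instead. The class name and methods are made up for illustration only.)

class BitVec(long):
    # Keep results in the subclass by wrapping the operators we care about.
    def __and__(self, other):
        return BitVec(long.__and__(self, other))
    def __or__(self, other):
        return BitVec(long.__or__(self, other))
    def __xor__(self, other):
        return BitVec(long.__xor__(self, other))
    def __lshift__(self, n):
        return BitVec(long.__lshift__(self, n))
    def test(self, i):
        # True if bit i is set.
        return bool((self >> i) & 1)
    def set(self, i):
        # Longs are immutable, so "setting" a bit returns a new BitVec.
        return self | (BitVec(1) << i)

a = BitVec(0).set(3).set(70)
b = BitVec(1) << 70
print type(a & b), a.test(70), b.test(3)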

Bottom line, whether your code will "break" due to storage problems or be fast enough
will depend on numbers you didn't provide ;-)

Regards,
Bengt Richter
 
Martin v. Löwis

Bengt said:
Sounds like a possible^H^H^H^H^H^H^H^Hprobable premature optimization worry ;-)

Right. OTOH, I notice a frequent game of Katze und Maus (cat and mouse?)
in questions around Python implementation details. The OP doesn't
provide details of his application, but instead asks a general question
"how is foo implemented in Python, I'm worried it might be bar?". Then,
instead of saying "yes, it is bar" or "no, it isn't bar", the response
is "we can't answer your real question because you didn't state it
well".

In this case, I really liked Raymond's answer (arrays are as efficient
as long ints, which are both more efficient than lists of integers). It
is close enough to the truth for the OP (*), and was stated with much
less text than this message (or the one I'm responding to).

I really should go to bed now,

Martin

(*) even closer to the truth is the observation that arrays are slightly
more space-efficient, because they can store 32 bits in 4 bytes, whereas
long ints only store 30 bits in 4 bytes. Time efficiency is more
difficult to compare, because it depends on the individual
implementation.
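
(This can be checked directly on 2.7 or 3.x, where the interpreter exposes its internal digit size; the attribute does not exist in the 2.4 line.)

import sys

# CPython stores longs in internal "digits" that carry fewer payload bits
# than the bytes they occupy, which is where the 30-bits-per-4-bytes figure
# comes from.
info = getattr(sys, 'long_info', None) or sys.int_info
print("%d payload bits per %d-byte digit" % (info.bits_per_digit, info.sizeof_digit))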
 
Terry Reedy

"Martin v. Löwis" said:
Right. OTOH, I notice a frequent game of Katze und Maus (cat and mouse?)

Yes, apparently with the same idiomatic meaning, as you describe the game
here perfectly.

TJR
 
jepler

You'll find Python longs unsuitable if you *change* the bitmaps: all
numeric types are immutable, so you'll copy the whole bitmap each time you
perform an operation like "set bit".

numarray has a 'bit' typecode, though I'm not sure how such an array is
actually stored; from a quick look, it appears to be stored one bit per
byte: '\x01\x01\x01\x01\x01\x01\x01\x01\x01'

A Python object layer around array.array('B') would not necessarily be
fast, but it would at least avoid the copying problem, and be memory
efficient.
import array

# Python 2 code (tuple parameters, print statements, and integer '/').
class BitArray2D:
    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols
        # Pack the bits 8 to a byte.
        nbytes = (rows * cols + 7) / 8
        self._data = array.array('B', [0]) * nbytes

    def __getitem__(self, (r, c)):
        # TODO: check r, c in bounds
        idx = r + c * self.rows
        byte = idx / 8
        bit = 1 << (idx % 8)
        return bool(self._data[byte] & bit)

    def __setitem__(self, (r, c), v):
        # TODO: check r, c in bounds
        idx = r + c * self.rows
        byte = idx / 8
        bit = 1 << (idx % 8)
        if v:
            self._data[byte] |= bit
        else:
            self._data[byte] &= ~bit

b = BitArray2D(10, 10)
print b._data
for x in range(10):
    b[x, x] = b[9 - x, x] = 1
print b._data
print
for x in range(10):
    for y in range(10):
        print " *"[b[x, y]],
    print

Jeff

 
Bengt Richter

Yes, apparently with the same idiomatic meaning, as you describe the game
here perfectly.
I think you are right about some game happening, but I'm not sure it's cat and mouse.
To me, an incomplete question feels like an invitation to play "20 questions" regarding
what the real problem is. So I get a little annoyed, and often just bypass the post.
If I answer, the residual annoyance sometimes leads me to withhold my best guess, and
complain instead. Hence probably the cat and mouse impression. Not very big of me, but
OTOH I think a bit of a nudge towards better questions is not a bad thing. OTO3H, maybe
I should just silently pass up 20-questions invitations and not pollute this pleasant space
with perfumes of annoyance (since however diffuse, they are apparently pungent enough
for some to notice ;-)

Regards,
Bengt Richter
 
Steven D'Aprano

I think you are right about some game happening, but I'm not sure it's cat and mouse.
To me, an incomplete question feels like an invitation to play "20 questions" regarding
what the real problem is. So I get a little annoyed, and often just bypass the post.
If I answer, the residual annoyance sometimes leads me to withhold my best guess, and
complain instead. Hence probably the cat and mouse impression. Not very big of me, but
OTOH I think a bit of a nudge towards better questions is not a bad thing. OTO3H, maybe
I should just silently pass up 20-questions invitations and not pollute this pleasant space
with perfumes of annoyance (since however diffuse, they are apparently pungent enough
for some to notice ;-)

If it helps, I find the entertainment value of the gentle nudging is the
only thing that makes the smell of stupid questions bearable.

It isn't true that there is no such thing as a stupid question. There are
intelligent questions that are asked in a rude and stupid way.

Expecting people to read your mind and understand what your question is
about is rude. Expecting people to decipher poorly written, badly spelt,
incomprehensible ramblings is rude. (Allowance should be made for those
whose native language is not English, and for those with good excuses for bad
spelling, e.g. a broken keyboard or actual dyslexia.)

Some allowance for the occasional brain-fart or typo should be made, but
communication requires two parties. You wouldn't expect a mail server to
accept any random malformed packets and try to make sense of them, and you
shouldn't expect others to put in more work understanding your post than
you put in writing it.

As they say, be strict when sending and tolerant when receiving. I'm all
for that. But when people insist on sending broken packets, I see nothing
rude about bouncing those packets back with an error message or a creative
misunderstanding.
 
Jeff Melvaine

Raymond,

Thanks for your answers, which even covered the question that I didn't ask
but should have.

<code> "A Python list is not an array()\n" * 100 </code> :)

Jeff

 
Jeff Melvaine

Bengt,

Thanks for your informative reply, further comments interleaved.

Bengt Richter said:
Sounds like a possible^H^H^H^H^H^H^H^Hprobable premature optimization worry ;-)
I'm writing a Sudoku solver of generic order. The object is not to make it
half a millisecond faster than the guy next door's solver, but I'd like it
to be able to do a bit more than just the daily newspaper puzzle, e.g.
search for uniquely solvable puzzles with minimal numbers of clues.
NP-completeness will put a lid on things sooner or later, but I'd like to
get as far as possible before that happens.

Why in Python? All my recent professional experience is in writing Ada,
which is not my idea of a rapid prototyping language (or a rapid deliverable
item handover language either, for that matter :).

Why do I want it to be efficient during debugging, rather than after fine
tuning? I take your point, but in a sense the ability to handle large
problems is part of the proof of concept.
What is a "large number of these" going to amount to? How many, tops?
And how many bits in each? How many operations between them? (Since integers
are immutable, operations mean allocation of space for new ones for results
and disposing of unused garbage ones (probably back to a special fast pool for
integers and longs)). Are you interested in a speed/memory tradeoff?
The algorithms that interest me most do not involve cyclic data structures,
so I am trusting in built-in reference counts to avoid memory leaks. At the
moment I'm expecting to use bitmaps of constant size (81 for order 3, or 256
for order 4) for the most numerous data items, so fragmentation should not
be excessive.

Space was my first thought, but I also expect that the parallelism of
bitwise operations will be reasonably time-efficient. I would hope to be
able to do operations more quickly on a bitmap of n bits than on a list or
array of n integer variables with values constrained to 0 or 1. However the
prospect of writing "a & b" and getting multiword functionality that could
prove the concepts was rather appealing too; almost executable pseudocode.
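
(A small illustration of that parallelism for an order-3 board of 81 cells; the mask names here are made up for the example.)

row0 = (1 << 9) - 1                  # bits for the cells of the top row
box0 = 0
for r in range(3):
    for c in range(3):
        box0 |= 1 << (r * 9 + c)     # bits for the top-left 3x3 box

# One operator works on all 81 bits at once:
overlap = box0 & row0
print(hex(overlap))                  # 0x7: cells 0, 1 and 2 are in both

# The same thing with a list of 0/1 flags needs an elementwise pass:
row0_list = [(row0 >> i) & 1 for i in range(81)]
box0_list = [(box0 >> i) & 1 for i in range(81)]
overlap_list = [x & y for x, y in zip(row0_list, box0_list)]
print(overlap_list.count(1))         # 3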

The effectiveness of the algorithms will determine how much time and space I
use, but for NP-complete problems the ceiling is always too low, and one is
constantly learning new ways to duck.
If your bit vectors are extremely large and sparse (have only a few bits "on"),
you might consider sets (import sets and help(sets)) of the bit numbers as representations.

BTW, I wonder if anyone has written an ordered bit set class in C yet. I was tempted ;-)
For symbolic Boolean algebra on systems of 729 or 4096 variables, sparse is
the way to go, but I would want ordered sets too. I've already implemented
a Boolean algebra system using Python lists (oops, thank you again Raymond)
of 32-bit bitmaps, and the calculations it does are not dazzlingly fast. My
question about using long integers had a different approach in mind.
How much memory do you have? Buying more can be a pretty cheap way of solving space worries
if you are getting paid for your time.
512 MB. The only time I've hit the limit so far was when I got distracted
enough to leave out the escape condition in a small recursive function.
Death was surprisingly rapid.
You should be able to subclass int or long as a way of writing your program in terms of
your own bit vector class. Then you can change your mind and change the underlying representation
without changing the code that uses the api. Compared to plain longs it will cost you some speed
to keep converting results to your own type though.
I like to encapsulate, but I hadn't thought of that one. Yes, it's an
approach to getting the best of both worlds; instant performance or
flexibility. The thought of executable pseudocode that I translate into
something better if necessary is not too bad for now.
Bottom line, whether your code will "break" due to storage problems or be fast enough
will depend on numbers you didn't provide ;-)
Re netiquette (with thanks to other posters who made this thread so
stimulating), I don't demand that respondents to my posted questions should
do all my thinking for me at a fine level of detail. But thank you Bengt,
for taking the trouble to point me in helpful directions. It is certainly
worth the OP's trouble to facilitate this for respondents.

Jeff
 
Bengt Richter

Bengt,

Thanks for your informative reply, further comments interleaved.

Can't reply fully now, but just had the thought that maybe some ideas
from 8-queens solvers might be useful or interesting. There is an old thread at

http://groups-beta.google.com/group/comp.lang.python/browse_frm/thread/f88f301b7578705a

that explores various ways of solving it, and uses various representations of the board,
including integer bit maps at the end, which turned out fastest IIRC. I'm sure it can still
be improved upon, and I'm not sure it will be worth your while to dig into it, unless you
think the problem fun, but there it is.

Regards,
Bengt Richter
 
Jack Diederich

Bengt,

Thanks for your informative reply, further comments interleaved.


I'm writing a Sudoku solver of generic order. The object is not to make it
half a millisecond faster than the guy next door's solver, but I'd like it
to be able to do a bit more than just the daily newspaper puzzle, e.g.
search for uniquely solvable puzzles with minimal numbers of clues.
NP-completeness will put a lid on things sooner or later, but I'd like to
get as far as possible before that happens.

I would recommend making a hand-written C extension that does the heavy
lifting - but only in the tiny corner you need it to. I've done this for
the ICFP[1] competition the last few years. It is a time-limited competition,
so the priorities are getting a pure-Python program up and running to
understand the problem, then making the slow parts go really fast so you can
run as many full games as possible and try out as many new strategies as
possible.

Typically this means just the "game board" representation is done in C.
You'll want the heuristics to be done in Python in order to try out variations
easily. For Sudoku the board implementation will likely have C functions
for copy(), valid() (raise ValueError), and something to return a list
of obviously legal values for a coordinate. Passing coord tuples in and
lists & dictionaries out has worked well for me (easy to use in the Python
part of the program).
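
(A hypothetical pure-Python stand-in for that kind of board object, just to make the shape of the API concrete; the method names follow the description above, and only this class would later be rewritten in C or Pyrex.)

class Board(object):
    # Order-3 Sudoku: a flat list of 81 cells, 0 meaning "empty".
    def __init__(self, cells=None):
        if cells is None:
            cells = [0] * 81
        self.cells = list(cells)

    def copy(self):
        # Whole-board copy; one of the hot spots worth pushing into C.
        return Board(self.cells)

    def valid(self):
        # Raise ValueError if any row, column or box repeats a digit.
        for group in self._groups():
            seen = set()
            for i in group:
                v = self.cells[i]
                if v:
                    if v in seen:
                        raise ValueError("duplicate %d" % v)
                    seen.add(v)

    def candidates(self, r, c):
        # Digits not already used in the cell's row, column or box.
        idx = r * 9 + c
        used = set()
        for group in self._groups():
            if idx in group:
                used.update(self.cells[i] for i in group)
        return [d for d in range(1, 10) if d not in used]

    def _groups(self):
        # Index lists for the 9 rows, 9 columns and 9 boxes.
        rows = [[r * 9 + c for c in range(9)] for r in range(9)]
        cols = [[r * 9 + c for r in range(9)] for c in range(9)]
        boxes = [[(br * 3 + r) * 9 + (bc * 3 + c)
                  for r in range(3) for c in range(3)]
                 for br in range(3) for bc in range(3)]
        return rows + cols + boxes

b = Board()
b.cells[0] = 5
b.valid()                   # no exception: a single 5 is legal
print(b.candidates(0, 1))   # 5 is excluded for the neighbouring cell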

I keep C modules out of production code unless absolutely necessary, but
I have no qualms about using them in toy/hobby problems, especially because
the C code stays at a manageable few hundred lines for toy problems.
If you aren't much of a C guy, check out Pyrex[2]. In my darker days I did
C++ for a living, so I much prefer writing modules by hand; Python makes it
easy to do, and it is faster, less buggy, and easier to debug.

-jackdied

[1] http://performancedrivers.com/icfp2002/
http://performancedrivers.com/icfp2004/
(other years I botched it badly enough I didn't make a webpage)
http://performancedrivers.com/icfp2002/icfp_module.c
http://performancedrivers.com/icfp2002/icfpBoard.c

[2] http://nz.cosc.canterbury.ac.nz/~greg/python/Pyrex/
 
Dennis Lee Bieber

I'm writing a Sudoku solver of generic order. The object is not to make it
half a millisecond faster than the guy next door's solver, but I'd like it

And here I thought I was insane for creating one just for the
newspaper puzzle... I need to clean up the code I hacked this
afternoon... And maybe some day consider putting a GUI on it...

 
Dennis Lee Bieber

And here I thought I was insane for creating one just for the
newspaper puzzle... I need to clean up the code I hacked this
afternoon... And maybe some day consider putting a GUI on it...

And just for clarity, I am NOT implying anyone else is insane --
'twas merely the surprise at seeing the general concept here on the very
day I'd done my hack-job...

 
