how space-efficient is bit_vector? bitset?

Markus Dehmann

I am planning to have tens of thousands of bit vectors (or bitsets).
Each of them has about 20,000 bits, but it's sparse: In a given
vector, only 20-30 bits will be set. What data structure would be best
for space (and probably also time) efficiency? Is std::bitset better
than std::bit_vector (vector<bool>)? Can I trust that the used space
for each of the bitvectors will be in the order of 20-30 bits (i.e.
the expected number of bits set), or rather 20,000 bits (the constant
number of bits that each vector has)? Or are there other, better
classes to use? Sparse Matrix classes?

Thanks!
Markus
 
Gianni Mariani

Markus said:
I am planning to have tens of thousands of bit vectors (or bitsets).
Each of them has about 20,000 bits, but it's sparse: In a given
vector, only 20-30 bits will be set. What data structure would be best
for space (and probably also time) efficiency? Is std::bitset better
than std::bit_vector (vector<bool>)? Can I trust that the used space
for each of the bitvectors will be in the order of 20-30 bits (i.e.
the expected number of bits set), or rather 20,000 bits (the constant
number of bits that each vector has)? Or are there other, better
classes to use? Sparse Matrix classes?

You could use map. I have attached the Austria C++ "rangemap" that can
be used as a sparse bitmap.

What kind of operations do you need to have on your bitmap ?

//
// The Austria library is copyright (c) Gianni Mariani 2004.
//
// Grant Of License. Grants to LICENSEE the non-exclusive right to use the Austria
// library subject to the terms of the LGPL.
//
// A copy of the license is available in this directory or one may be found at this URL:
// http://www.gnu.org/copyleft/lesser.txt
//
/**
 * at_rangemap.h
 *
 */

#ifndef x_at_rangemap_h_x
#define x_at_rangemap_h_x 1

#include "at_exports.h"
#include "at_os.h"
#include "at_assert.h"

#include <map>

// Austria namespace
namespace at
{


// ======== TypeRange =================================================
/**
 * TypeRange describes the range of a particular type.
 *
 */

template <typename w_RangeType>
class TypeRange
{
  public:

    // range type
    typedef w_RangeType t_RangeType;


    // ======== Adjacent ==============================================
    /**
     * Adjacent returns true if the two parameters are "one apart".
     *
     * @param i_lesser is the lesser of the two values
     * @param i_greater is the greater of the two
     * @return true if no other elements exist between i_lesser and i_greater
     */

    static bool Adjacent(
        const t_RangeType & i_lesser,
        const t_RangeType & i_greater
    ) {

        t_RangeType l_greater_less( i_greater );

        -- l_greater_less; // go to the earlier element

        // deal with wrapping
        if ( i_greater < l_greater_less )
        {
            return false;
        }

        return !( i_lesser < l_greater_less );
    }
};


// ======== RangeMap ==================================================
/**
 * RangeMap is a template that defines ranges.
 *
 */

template <typename w_RangeType, typename w_RangeTraits=TypeRange<w_RangeType> >
class RangeMap
{
  public:

    // range type
    typedef w_RangeType t_RangeType;
    typedef w_RangeTraits t_RangeTraits;

    // index on the end of the range
    typedef std::map< t_RangeType, t_RangeType > t_Map;
    typedef typename t_Map::iterator t_Iterator;


    // ======== AddRange ==============================================
    /**
     * Add a segment to the range.
     *
     * @param i_begin The beginning of the range (inclusive)
     * @param i_end The end of the range (inclusive)
     * @return nothing
     */

    void AddRange( const t_RangeType & i_begin, const t_RangeType & i_end )
    {
        const bool l_less_than( i_end < i_begin );

        const t_RangeType & l_begin = ! l_less_than ? i_begin : i_end;
        const t_RangeType & l_end = l_less_than ? i_begin : i_end;

        // deal with an empty map here
        if ( m_map.empty() )
        {
            // shorthand adding the first element into the map
            m_map[ l_end ] = l_begin;
            return;
        }

        // see if there is a segment to merge - find the element that
        // precedes l_begin

        t_Iterator l_begin_bound = m_map.lower_bound( l_begin );

        if ( l_begin_bound == m_map.end() )
        {
            // l_begin is after the last element

            -- l_begin_bound;

            if ( t_RangeTraits::Adjacent( l_begin_bound->first, l_begin ) )
            {
                // yes, they are mergeable
                t_RangeType l_temp = l_begin_bound->second;
                m_map.erase( l_begin_bound );
                m_map[ l_end ] = l_temp;

                return;
            }

            // not mergeable - add the segment at the end

            m_map[ l_end ] = l_begin;
            return;
        }

        // if the end of the segment being inserted is not beyond this one
        if ( ( l_end < l_begin_bound->second ) && ! t_RangeTraits::Adjacent( l_end, l_begin_bound->second ) )
        {
            // NOT mergeable with subsequent segments

            if ( l_begin_bound == m_map.begin() )
            {
                // There is no previous segment

                m_map[ l_end ] = l_begin;
                return;
            }

            // The segment being inserted can't be merged at the end

            // see if it can be merged with the previous one

            t_Iterator l_previous = l_begin_bound;
            -- l_previous;

            AT_Assert( l_previous->first < l_begin );

            if ( ! t_RangeTraits::Adjacent( l_previous->first, l_begin ) )
            {
                // not overlapping with previous and not mergeable

                m_map[ l_end ] = l_begin;
                return;
            }
            else
            {
                // we are mergeable with the previous element

                t_RangeType l_temp = l_previous->second;
                m_map.erase( l_previous );
                m_map[ l_end ] = l_temp;
                return;
            }

        }

        if ( l_begin_bound == m_map.begin() )
        {
            if ( l_end < l_begin_bound->first )
            {
                if ( l_end < l_begin_bound->second )
                {
                    if ( t_RangeTraits::Adjacent( l_end, l_begin_bound->second ) )
                    {
                        l_begin_bound->second = l_begin;
                        return;
                    }
                    else
                    {
                        m_map[ l_end ] = l_begin;
                        return;
                    }
                }
                else
                {
                    if ( l_begin < l_begin_bound->second )
                    {
                        l_begin_bound->second = l_begin;
                    }
                    return;
                }
            }
            else
            {

                t_RangeType l_new_begin = l_begin;

                if ( l_begin_bound->second < l_begin )
                {
                    l_new_begin = l_begin_bound->second;
                }

                // Check to see what segment is close to the end
                t_Iterator l_end_bound = m_map.lower_bound( l_end );

                if ( l_end_bound == m_map.end() )
                {
                    // erase all the segments from l_previous to the end and
                    // replace with one

                    m_map.erase( l_begin_bound, l_end_bound );

                    m_map[ l_end ] = l_new_begin;
                    return;
                }

                if ( l_end < l_end_bound->second && ! t_RangeTraits::Adjacent( l_end, l_end_bound->second ) )
                {
                    m_map.erase( l_begin_bound, l_end_bound );
                    m_map[ l_end ] = l_new_begin;
                    return;
                }

                // merge with the current end

                m_map.erase( l_begin_bound, l_end_bound ); // erase segments in between
                l_end_bound->second = l_new_begin;
                return;
            }
        }

        if ( l_begin_bound == m_map.begin() )
        {
            // no previous ranges

            // see if we can merge with the current range

        }

        // find the previous iterator
        t_Iterator l_previous = l_begin_bound;
        -- l_previous;

        t_RangeType l_new_begin = l_begin;

        if ( t_RangeTraits::Adjacent( l_previous->first, l_begin ) )
        {
            l_new_begin = l_previous->second;
        }
        else
        {
            ++ l_previous;

            if ( l_previous->second < l_new_begin )
            {
                l_new_begin = l_previous->second;
            }
        }

        t_RangeType l_new_end = l_end;


        // Check to see what segment is close to the end
        t_Iterator l_end_bound = m_map.lower_bound( l_end );

        if ( l_end_bound == m_map.end() )
        {
            // erase all the segments from l_previous to the end and
            // replace with one

            m_map.erase( l_previous, l_end_bound );

            m_map[ l_end ] = l_new_begin;
            return;
        }

        if ( l_end < l_end_bound->second && ! t_RangeTraits::Adjacent( l_end, l_end_bound->second ) )
        {
            m_map.erase( l_previous, l_end_bound );
            m_map[ l_end ] = l_new_begin;
            return;
        }

        // merge with the current end

        m_map.erase( l_previous, l_end_bound ); // erase segments in between
        l_end_bound->second = l_new_begin;

        return;
    }


    // ======== SubtractRange =========================================
    /**
     * SubtractRange removes the range. (opposite of Add)
     *
     * @param i_begin Beginning of range to subtract
     * @param i_end End of range to subtract
     * @return nothing
     */

    void SubtractRange( const t_RangeType & i_begin, const t_RangeType & i_end )
    {
        const bool l_less_than( i_end < i_begin );

        const t_RangeType & l_begin = ! l_less_than ? i_begin : i_end;
        const t_RangeType & l_end = l_less_than ? i_begin : i_end;

        // deal with an empty map here
        if ( m_map.empty() )
        {
            // Nothing to remove
            return;
        }

        // See if we find a segment covering l_begin

        t_Iterator l_begin_bound = m_map.lower_bound( l_begin );

        if ( l_begin_bound == m_map.end() )
        {
            // this does not cover any segments
            return;
        }

        if ( l_begin_bound->second < l_begin )
        {
            // this segment is broken up

            t_RangeType l_newend = l_begin;

            -- l_newend;

            m_map[ l_newend ] = l_begin_bound->second;

            l_begin_bound->second = l_begin;
        }

        t_Iterator l_end_bound = m_map.lower_bound( l_end );

        if ( l_end_bound == m_map.end() )
        {
            // erase all the segments from the beginning to end
            m_map.erase( l_begin_bound, l_end_bound );
            return;
        }

        if ( !( l_end < l_end_bound->first ) )
        {
            // the segment end must equal the given end

            ++ l_end_bound;

            m_map.erase( l_begin_bound, l_end_bound );
            return;
        }

        // need to break up the final segment

        m_map.erase( l_begin_bound, l_end_bound );

        if ( !( l_end < l_end_bound->second ) )
        {
            t_RangeType l_newbegin = l_end;

            ++ l_newbegin;

            l_end_bound->second = l_newbegin;
        }

        return;
    }

    // ======== IsSet =================================================
    /**
     * Checks to see if the position is set.
     *
     * @param i_pos
     * @return True if the position is set
     */

    bool IsSet( const t_RangeType & i_pos )
    {
        t_Iterator l_bound = m_map.lower_bound( i_pos );

        if ( l_bound == m_map.end() )
        {
            // this does not cover any segments
            return false;
        }

        return !( i_pos < l_bound->second );
    }

    t_Map m_map;

};


} // namespace at

#endif // x_at_rangemap_h_x
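
Something along these lines (an untested sketch; it assumes the other Austria headers pulled in by at_rangemap.h are also on your include path) shows how the rangemap can be driven as a sparse bitmap:

#include "at_rangemap.h"

#include <iostream>

int main()
{
    at::RangeMap<int> l_bits;

    l_bits.AddRange( 5, 5 );           // set a single bit
    l_bits.AddRange( 100, 120 );       // set a run of bits
    l_bits.SubtractRange( 110, 110 );  // clear one bit again

    std::cout << l_bits.IsSet( 5 ) << "\n";    // 1
    std::cout << l_bits.IsSet( 110 ) << "\n";  // 0
    std::cout << l_bits.IsSet( 115 ) << "\n";  // 1

    return 0;
}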
 
Markus Dehmann

Gianni Mariani said:
You could use map. I have attached the Austria C++ "rangemap" that can
be used as a sparse bitmap.

Thanks! What are the differences compared to std::map? Could I also
just use map<bool>?

Gianni Mariani said:
What kind of operations do you need to have on your bitmap ?

I will have to compute unions of the vectors/sets, and I need a fast
way to look up which bits are set in a given vector. I have one vector
of doubles (also 20,000 elements), and I frequently have to multiply
the bit vectors with that vector of doubles and then sum over all the
elements.

Thanks!
Markus

[at_rangemap.h]
 
Guest

I am planning to have tens of thousands of bit vectors (or bitsets).
Each of them has about 20,000 bits, but it's sparse: In a given
vector, only 20-30 bits will be set. What data structure would be best
for space (and probably also time) efficiency? Is std::bitset better
than std::bit_vector (vector<bool>)?

A bitset is probably better than a vector<bool>, but I do not think
that you should use any of them.

Markus said:
Can I trust that the used space
for each of the bitvectors will be in the order of 20-30 bits (i.e.
the expected number of bits set), or rather 20,000 bits (the constant
number of bits that each vector has)?

The bitset will be at least as big as the maximum number of bits (20,000
bits ~ 2.5 KB).
Or are there other, better classes to use? Sparse Matrix classes?

You should just store the indexes of the bits that are set. If we do some
quick calculations we see that with a bitset you will need (for 10,000
sets of 20,000 bits each) 10,000 * 20,000 bits ~ 25 MB. Using vector<int>
instead to store the indexes of the set bits would require (given 20 set
bits in each set and 4 bytes per int) 10,000 * 20 * 4 bytes ~ 800 KB. Using
shorts you might be able to halve that. Depending on how you plan to
use the sets you might want to use sets, deques, or some other
container; a vector together with the *_heap() algorithms might also be
an idea.

Computing the unions when storing only the indexes can be done quite
easily by hand or, if you store them in sorted order, by using
set_union(). To find which bits are set, just iterate over the
container. To get the sum from the vector of doubles something like
this can be used:

#include <cstddef>
#include <vector>

double sum(const std::vector<int>& bits, const std::vector<double>& val)
{
    double sum = 0;
    for (std::size_t i = 0; i < bits.size(); ++i)
        sum += val[bits[i]];   // bits[i] is the index of a set bit
    return sum;
}

This can easily be adapted to other containers by using iterators.
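
For the unions, if you keep the index vectors sorted, a sketch along the
same lines (untested) would be:

#include <algorithm>
#include <iterator>
#include <vector>

// Union of two sparse sets stored as sorted vectors of set-bit indexes.
std::vector<int> bits_union(const std::vector<int>& a, const std::vector<int>& b)
{
    std::vector<int> result;
    result.reserve(a.size() + b.size());
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(result));
    return result;
}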
 
Gianni Mariani

Markus said:
Thanks! What are the differences compared to std::map? Could I also
just use map<bool>?


I will have to compute unions of the vectors/sets, and I need a fast
way to look up which bits are set in a given vector. I have one vector
of doubles (also 20,000 elements), and I frequently have to multiply
the bit vectors with that vector of doubles and then sum over all the
elements.

rangemap is a map of "ranges". It's like a bitmap in that you're
storing just the ranges of set bits. If you want to "multiply", then
the quickest thing is to go through the ranges in the rangemap and add
all the elements that are indexed by the ranges.
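
As a rough, untested sketch, the "multiply and sum" could walk the public
m_map member of the RangeMap I posted (each entry maps the inclusive end of
a range of set bits to its inclusive beginning):

#include "at_rangemap.h"

#include <vector>

double sum_over_ranges( at::RangeMap<int> & i_bits, const std::vector<double> & i_val )
{
    double l_sum = 0.0;

    for ( at::RangeMap<int>::t_Iterator l_it = i_bits.m_map.begin();
          l_it != i_bits.m_map.end();
          ++ l_it )
    {
        // add every element indexed by this range of set bits
        for ( int l_i = l_it->second; l_i <= l_it->first; ++ l_i )
        {
            l_sum += i_val[ l_i ];
        }
    }

    return l_sum;
}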
 
Markus Dehmann

The bitset will be at least as big as the maximum number of bits (20,000
bits ~ 2.5 KB).

That's quite shocking, together with your statement that vector<bool>
(I suppose you mean the bit_vector version) would be even worse. I
thought that some space optimization would be done?
You should just store the indexes of the bits that are set. If we do some
quick calculations we see that with a bitset you will need (for 10,000
sets of 20,000 bits each) 10,000 * 20,000 bits ~ 25 MB. Using vector<int>
instead to store the indexes of the set bits would require (given 20 set
bits in each set and 4 bytes per int) 10,000 * 20 * 4 bytes ~ 800 KB. Using
shorts you might be able to halve that. Depending on how you plan to
use the sets you might want to use sets, deques, or some other
container; a vector together with the *_heap() algorithms might also be
an idea.

Storing just the bits that are set makes sense. Actually I was
thinking that bit_vector would pretty much do that for me as a space
optimization.

Another good container to use might be google-sparsehash
(http://code.google.com/p/google-sparsehash). It is space-efficient,
and a hash would make it easier to compute the unions of two of them.
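
Something like this, going from memory of the sparsehash docs (untested, so
treat the header name and interface as assumptions and check the project page):

#include <google/sparse_hash_set>

int main()
{
    google::sparse_hash_set<int> bits;
    bits.set_deleted_key(-1);   // required before erase() can be used

    bits.insert(42);
    bits.insert(19999);

    bool is_set = bits.find(42) != bits.end();
    bits.erase(42);

    return is_set ? 0 : 1;
}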
 
Guest

That's quite shocking, together with your statement that vector<bool>
(I suppose you mean the bit_vector version) would be even worse. I
thought that some space optimization would be done?

My problem with vector<bool> is not so much how much space it uses
(which should be roughly equal to bitset) but rather the fact that it
does not behave in the same way as a normal vector.

Markus said:
Storing just the bits that are set makes sense. Actually I was
thinking that bit_vector would pretty much do that for me as a space
optimization.

I have not studied the standard in great detail, so such an optimisation
might be allowed, but the most likely implementation is to allocate as
many bits (rounded up to a whole byte) as necessary. The problem with
such an optimisation would be if all bits are set, in which case you
would require more memory than if you just stored the bits directly. I am
not sure if the complexity requirements allow for a scheme that switches
between storage formats as needed; even if they do, the extra complexity
is not something you want unless it is really necessary.
 
James Kanze

Guest said (replying to Markus Dehmann's post of 2007-10-26 07:32):
A bitset is probably better than a vector<bool>, but I do not
think that you should use any of them.

I tend to agree, but not for reasons of space.
The bitset will be at least as big as the maximum number of
bits (20,000 bits ~ 2.5 KB).
You should just store the indexes of the bits that are set. If we do some
quick calculations we see that with a bitset you will need (for 10,000
sets of 20,000 bits each) 10,000 * 20,000 bits ~ 25 MB. Using vector<int>
instead to store the indexes of the set bits would require (given 20 set
bits in each set and 4 bytes per int) 10,000 * 20 * 4 bytes ~ 800 KB. Using
shorts you might be able to halve that.

Using short probably wouldn't change anything. Set is a node-based
container; each element in the set is in its own node, a node which
also contains a couple of pointers (at least three, I think, for
std::set). On most modern machines, pointers are either 4 or 8 bytes,
and require a similar alignment. So you have to add 4 or 8 times 3 (or
more) bytes to each element (2 or 4 bytes), then round up if necessary
to ensure alignment. And maybe add a bit more for the overhead of the
allocator. On a 32 bit machine, you end up with at least 16 bytes, and
on a 64 bit machine, at least 32 bytes, per element.

Depending on how the data is used, it might be best to maintain it in
a vector of short or int, using lower_bound, etc., to find the
insertion point in order to keep the vector sorted. If there aren't
too many elements in each set, even an insertion at the beginning will
not be too expensive; probably less expensive than the allocation of a
new node when inserting into a set. And in this case, the extra
overhead will be a couple of pointers for the entire set, rather than
for each element.
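
For example (my sketch, untested), keeping the indexes in a sorted vector:

#include <algorithm>
#include <vector>

// The set of set bits is a sorted vector<int> of indexes.
void set_bit(std::vector<int>& bits, int index)
{
    std::vector<int>::iterator pos =
        std::lower_bound(bits.begin(), bits.end(), index);
    if (pos == bits.end() || *pos != index)   // don't insert duplicates
        bits.insert(pos, index);
}

bool test_bit(const std::vector<int>& bits, int index)
{
    return std::binary_search(bits.begin(), bits.end(), index);
}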

If you know the maximum number of elements, and it's fairly
small, you could also use boost::array. Or if it's usually
under some very small value, but with a few outliers, you could
use something like the small string optimization frequently used
for strings.

Whatever you start with (I'd probably start with set, or maybe
vector), you should wrap it in an application specific class, so
that you can easily change it at will.
Depending on how you plan to
use the sets you might want to use sets, deques, or some other
container; a vector together with the *_heap() algorithms might also be
an idea.
Computing the unions when storing only the indexes can be done quite
easily by hand or, if you store them in sorted order, by using
set_union(). To find which bits are set, just iterate over the
container. To get the sum from the vector of doubles something like
this can be used:

There is a set of functions for treating sorted sequences (a set, but
also a sorted vector, etc.) as a set, which means that a lot of his
work is already done for him.
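
For instance (sketch), with sorted index vectors the standard sorted-range
algorithms apply directly:

#include <algorithm>
#include <vector>

// True if every set bit of 'subset' is also set in 'bits'
// (both stored as sorted vectors of indexes).
bool contains_all(const std::vector<int>& bits, const std::vector<int>& subset)
{
    return std::includes(bits.begin(), bits.end(),
                         subset.begin(), subset.end());
}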
 
James Kanze

That's quite shocking, together with your statement that vector<bool>
(I suppose you mean the bit_vector version) would be even worse. I
thought that some space optimization would be done?

Why? The most frequent use is probably fairly small sets, for
which space optimization is counterproductive.
 
