The problem with size_t

J

jacob navia

I am implementing the bitstring type in the container library, and
obviously I store the number of bits in a size_t...

Problem is, in 64 bit versions, size_t grows to 8 bytes, what
is an absolute overkill for a number that in most cases will
fit in 16 bits, or, at most 32.

And this happens in all containers. I do not see most applications
use containers with more than 4G elements... In a 64 bit system
size_t is just too much waste.

Now I see the problems Malcom was pointing at when he ranted at size_t.

What would be the alternatives?

uin32_t?

That one looks better. Any problem with that?
 
N

Nick Keighley

I am implementing the bitstring type in the container library, and
obviously I store the number of bits in a size_t...
obviously...

Problem is, in 64 bit versions, size_t grows to 8 bytes, what
is an absolute overkill for a number that in most cases will
fit in 16 bits, or, at most 32.

And this happens in all containers. I do not see most applications
use containers with more than 4G elements...

if you don't want containers with more than 4G elements why are you
using a
64-bit system?
In a 64 bit system size_t is just too much waste.

Now I see the problems Malcom was pointing at when he ranted at size_t.

I always thought his rant was a bit odd
What would be the alternatives?

uin32_t?

That one looks better. Any problem with that?

it won't hold more than 4G elements!
 
N

Noob

jacob said:
[...] size_t grows to 8 bytes, what is an absolute overkill [...]

<OT grammar nit>
In this context, one would write "which" not "what".
(English native speakers: is my statement correct?)
I've seen several francophone posters make this mistake.
Is this grammatical rule incorrectly taught in school, perhaps?
</OT>
 
N

Nick Keighley

jacob navia wrote:
[...] size_t grows to 8 bytes, what is an absolute overkill [...]

<OT grammar nit>
In this context, one would write "which" not "what".
(English native speakers: is my statement correct?)
yes

I've seen several francophone posters make this mistake.
Is this grammatical rule incorrectly taught in school, perhaps?
</OT>
 
D

Dik T. Winter

> jacob navia wrote:
>
> > [...] size_t grows to 8 bytes, what is an absolute overkill [...]
>
> <OT grammar nit>
> In this context, one would write "which" not "what".
> (English native speakers: is my statement correct?)
> I've seen several francophone posters make this mistake.
> Is this grammatical rule incorrectly taught in school, perhaps?
> </OT>

In French (as in Dutch) which translates in this position to a word that
is also a translation of what. So when you use the translated word in
your native language you have to think about the difference in meaning
when translating into English.

Similar to "teach" and "learn" that can be (and frequently will be)
translated to the same word in Dutch.
 
B

Ben Bacarisse

jacob navia said:
I am implementing the bitstring type in the container library, and
obviously I store the number of bits in a size_t...

Problem is, in 64 bit versions, size_t grows to 8 bytes, what
is an absolute overkill for a number that in most cases will
fit in 16 bits, or, at most 32.

I can't see why bitstrings
 
J

Joachim Schmitz

Dik said:
jacob said:
[...] size_t grows to 8 bytes, what is an absolute overkill [...]

<OT grammar nit>
In this context, one would write "which" not "what".
(English native speakers: is my statement correct?)
I've seen several francophone posters make this mistake.
Is this grammatical rule incorrectly taught in school, perhaps?
</OT>

In French (as in Dutch) which translates in this position to a word
that is also a translation of what. So when you use the translated
word in your native language you have to think about the difference
in meaning when translating into English.

Same in German.
Similar to "teach" and "learn" that can be (and frequently will be)
translated to the same word in Dutch.

Happens in German too, but is utterly wrong. "teach" is "lehren", "lern" is
"lernen".
Quite a few Germans don't care and use "lernen" in either case.

Bye, Jojo
 
B

Ben Bacarisse

[I cancelled a previous reply sent unfinished (hardly begun!) but I
image it will get though. Please ignore it.]

jacob navia said:
I am implementing the bitstring type in the container library, and
obviously I store the number of bits in a size_t...

Problem is, in 64 bit versions, size_t grows to 8 bytes, what
is an absolute overkill for a number that in most cases will
fit in 16 bits, or, at most 32.

And this happens in all containers. I do not see most applications
use containers with more than 4G elements... In a 64 bit system
size_t is just too much waste.

If the implementation has decided that size_t should be 64 bits, then
I'd want containers that can be that large. I don't see that as a
problem but I don't use ever use low-memory systems with 64-bit
size_t. It soulds like you are worrying about a corner case.
Now I see the problems Malcom was pointing at when he ranted at
size_t.

I though he objected to not using int and wanted int, on all systems,
to be 64 bits? That would have a much higher memory impact the not
using size_t.

<snip>
 
D

Dik T. Winter

> Dik T. Winter wrote: ....
>
> Happens in German too, but is utterly wrong. "teach" is "lehren", "lern" is
> "lernen".

But in Dutch it is correct. "leren" has two meanings, transitive and
intransitive, which makes a sentence like:
De man leert pianospelen
ambiguous (the man teaches/learns playing piano).
 
J

Joachim Schmitz

Dik said:
But in Dutch it is correct. "leren" has two meanings, transitive and
intransitive, which makes a sentence like:
De man leert pianospelen
ambiguous (the man teaches/learns playing piano).

OK, then I now know where that mistake stems from (living so 60km from the
Dutch border :cool:

Bye, Jojo
 
J

jacob navia

Ben Bacarisse a écrit :
If the implementation has decided that size_t should be 64 bits, then
I'd want containers that can be that large. I don't see that as a
problem but I don't use ever use low-memory systems with 64-bit
size_t. It soulds like you are worrying about a corner case.

Maybe, but I am not sure about it.
If you want to use bitstrings (as opposed to
a container of unsigned char with 1 and 0 in each position)
it means you do care about memory usage.

Besides this problem is pervasive in 64 bit systems.
All programmers start thinking like that, "we have a lot of
memory", etc) and the end result is that the programs bloat
so much that the benefits of having more memory are completely
nullified by bloated software.

Pointers are now 64 bit, when in fact we could surely use
32 bit pointers for most applications. The code size increases
because of longer 64 bit instructions, etc.
 
K

Keith Thompson

jacob navia said:
I am implementing the bitstring type in the container library, and
obviously I store the number of bits in a size_t...

Problem is, in 64 bit versions, size_t grows to 8 bytes, what
is an absolute overkill for a number that in most cases will
fit in 16 bits, or, at most 32.

And this happens in all containers. I do not see most applications
use containers with more than 4G elements... In a 64 bit system
size_t is just too much waste.

My advice: Don't impose arbitrary limits in general-purpose code
unless there's a *very* good reason to do so.

How do you know that nobody will ever want a bitstring with more than
2**32 elements? What if I'm working with a large dataset (say, 10
billion elements) and I want a bitstring with one bit per element?
No, I can't do that because the person who designed this container
library 10 years ago was certain I'd never want to.

We're talking about 4 bytes of overhead per bitstring. That could
be a problem if you happen to have a huge number of small bitstrings,
but IMHO that's not too high a price to pay for generality.
Now I see the problems Malcom was pointing at when he ranted at size_t.

What would be the alternatives?

uin32_t?

(Typo: uint32_t)
That one looks better. Any problem with that?

Yes: It's not size_t, and it would make your bitstring container
gratuitously inconsistent with your other containers.

Just use size_t; that's what it's for.

My only real concern is that, since you can potentially have
objects up to SIZE_MAX bytes, and a bitstring presumably stores
CHAR_BIT bits per byte, so size_t potentially isn't big enough.
That's not likely to be a problem any time soon for 64-bit systems,
but it could for 32-bit where where, assuming CHAR_BIT==8, you
couldn't have a bitstring of more than 2**29 bytes (1/8 of the
addressing range). But I'd still advocate using size_t here rather
than some potentially larger type.
 
S

Seebs

How do you know that nobody will ever want a bitstring with more than
2**32 elements? What if I'm working with a large dataset (say, 10
billion elements) and I want a bitstring with one bit per element?
No, I can't do that because the person who designed this container
library 10 years ago was certain I'd never want to.

Let's go for the obvious case:

I have plenty of machines with over 1GB of memory.

I would like to implement a system which has an index of IP addresses,
and as a first pass, before I do a huge binary search, I just want to do
a linear check as to whether an address is known to me.

The bitstring for IPv4 space has precisely 2**32 elements.

If you can't think of a use for a string of 2**32 bits, or possibly 2**32
sets of three or four bits...
My only real concern is that, since you can potentially have
objects up to SIZE_MAX bytes, and a bitstring presumably stores
CHAR_BIT bits per byte, so size_t potentially isn't big enough.
That's not likely to be a problem any time soon for 64-bit systems,
but it could for 32-bit where where, assuming CHAR_BIT==8, you
couldn't have a bitstring of more than 2**29 bytes (1/8 of the
addressing range). But I'd still advocate using size_t here rather
than some potentially larger type.

Yes.

And there's a real issue here, because, while we currently don't have
machines that could handle a 2**128 bit bitstring, we have an obvious
application for one... (More realistically, though, I actually know
someone doing some initial data structures work on questions to do with
mapping IPv6, and he is talking about data sets that could, in the
reasonably forseeable future, exceed 2^32 items.)

-s
 
K

Keith Thompson

jacob navia said:
Ben Bacarisse a écrit :


Maybe, but I am not sure about it.
If you want to use bitstrings (as opposed to
a container of unsigned char with 1 and 0 in each position)
it means you do care about memory usage.

Perhaps I care about memory usage in the sense that an unpacked
bitstring of the length I need won't fit in my computer's memory.

If you really think it's important to be able to use a 32-bit, or
even 16-bit, length for small bitstrings, I suppose you could vary
the representation depending on the length. For example, start with
a 16-bit length field. For values from 0 to 0xFFFD, the bitstring
data immediately follows the length. 0xFFFE indicates that the
actual length is in the following 4 bytes, followed by the data.
0xFFFF indicates that the acutal length is in the following 8 bytes,
followed by the data. You could even start with a single byte.

I don't necessarily think this is a good idea (every reference
would have to check for special cases), but it would save space for
small bitstrings while allowing huge bitstrings to be supported.
And as long as the details are hidden from the user, the complexity
is isolated to the library code that implements it. (The assumption
of 8-bit bytes is also isolated to the library code.)

I still recommend just storing the length as a size_t, but this is
an alternative.
 
F

Flash Gordon

jacob said:
Ben Bacarisse a écrit :


Maybe, but I am not sure about it.
If you want to use bitstrings (as opposed to
a container of unsigned char with 1 and 0 in each position)
it means you do care about memory usage.

The only time I have used things like that I was dealing with large
enough structures that the overhead of a 64 bit size_t would be dwarfed
by the size of the structure.
Besides this problem is pervasive in 64 bit systems.
All programmers start thinking like that, "we have a lot of
memory", etc) and the end result is that the programs bloat
so much that the benefits of having more memory are completely
nullified by bloated software.

Pointers are now 64 bit, when in fact we could surely use
32 bit pointers for most applications. The code size increases
because of longer 64 bit instructions, etc.

You don't have to make size_t 64 bit if you don't want to. You could
make it 32 bits and limit the size of objects appropriately. I don't
know enough about the 64 bit processors to know if you can use 32 bit
pointers. You could, of course, provide compiler options to control this
if you wanted, just like the old compiler options for small/medium/large
memory models etc.
 
K

Keith Thompson

Scott Fluhrer said:
True. It strikes me as overkill in this case (if we have a system with
multiple Gigabytes of memory, is it that serious of a problem if size_t
wastes an additional 4 bytes), but it is certainly possible.

A 64-bit system doesn't necessarily have gigabytes of memory.

Historically, issues like this become important when the sizes (of
physical memory, of virtual memory, of the largest supported object)
approach some kind of boundary. We ran into it some time ago when
sizes around 64 KiB (addressible with 16 bits) were common. We're
seeing it again now with sizes around 4 GiB.
 
R

robertwessel2

I am implementing the bitstring type in the container library, and
obviously I store the number of bits in a size_t...

Problem is, in 64 bit versions, size_t grows to 8 bytes, what
is an absolute overkill for a number that in most cases will
fit in 16 bits, or, at most 32.

And this happens in all containers. I do not see most applications
use containers with more than 4G elements... In a 64 bit system
size_t is just too much waste.

Now I see the problems Malcom was pointing at when he ranted at size_t.

What would be the alternatives?

uin32_t?

That one looks better. Any problem with that?


It would also be worth considering that the application may be diverse
enough to justify two types - a large and small bitstring. The small
bitstring bit be limited to a short's worth of bits, and internally be
implemented with the struct hack for the bit storage. The large
bitstring might use some sort of tree or other layer of indirection to
make expansion reasonable as the bitstring gets really large (realloc
()’ing a growing 20KB bitstring is reasonable, but I’d have issues
growing a 1GB bit string that way).

Another option is that you change representations as the size grows
past a certain point. If you do that, could you not just alter the
vtable pointer to go "native" for the changed implementation? This
would require that enough space be allocated for the largest version
of the fixed part of the container (or allow a "small" base part to be
specified, which would not allow the container to grow past a certain
point).
 
S

Stephen Sprunk

jacob said:
I am implementing the bitstring type in the container library, and
obviously I store the number of bits in a size_t...

Problem is, in 64 bit versions, size_t grows to 8 bytes, what
is an absolute overkill for a number that in most cases will
fit in 16 bits, or, at most 32.

And this happens in all containers. I do not see most applications
use containers with more than 4G elements... In a 64 bit system
size_t is just too much waste.

If the user has a 64-bit system, it is extremely unlikely that saving
four bytes of memory will matter.

Also, on many implementations, 64-bit integers are just as fast as (or
faster than) 32-bit integers, and this is likely to become more common,
not less.

Finally, if I had a dollar for every time some numbskull programmer said
to himself "nobody will _ever_ send me more than N bytes of data" and
then was proven wrong a few years later, I'd be a rich man.

It's best to use size_t for sizes of things, even if that seems like
overkill at times.

(BTW, isn't the human genome around fourteen billion bits? That seems
like the kind of thing I'd store in a bitstring...)

S
 
J

jacob navia

(e-mail address removed) a écrit :
It would also be worth considering that the application may be diverse
enough to justify two types - a large and small bitstring. The small
bitstring bit be limited to a short's worth of bits, and internally be
implemented with the struct hack for the bit storage. The large
bitstring might use some sort of tree or other layer of indirection to
make expansion reasonable as the bitstring gets really large (realloc
()’ing a growing 20KB bitstring is reasonable, but I’d have issues
growing a 1GB bit string that way).

Yes, this is a problem with the heap manager, for instance. I have
a schema of having a vector of pointers to blocks each block containing
N elements. That gives 4G (2^32) elements using a couple
of 16 bit indexes.

Most containers will be happy with several thousand elements, what
would fit in two unsigned char indexes, etc.

There are several growth strategies, and it is difficult to generalize.
We have a linear growth (20K as you proposed above) or maybe exponential
growth: we double the amount allocated at each step.

Any strategy based on exponential growth can be dangerous in memory
usage. To use with care...

If we want to keep a record of the number of resizing being done,
that takes place too, etc.

A possible solution is a dynamic growth strategy that starts small
assuming most containers will have a few hundred elements at most,
and then dynamically change the vtable for bigger containers as
soon as some threshold is passed.

This can be complicated if the user has replaced some of the
functions in the vtable: care should be taken to copy the functions
that the user replaced...
Another option is that you change representations as the size grows
past a certain point. If you do that, could you not just alter the
vtable pointer to go "native" for the changed implementation? This
would require that enough space be allocated for the largest version
of the fixed part of the container (or allow a "small" base part to be
specified, which would not allow the container to grow past a certain
point).

I did not fully understand the above paragraph. Maybe you can expand?
what would it mean to "go native" ?

Thanks for your ideas.
 
K

Keith Thompson

Stephen Sprunk said:
(BTW, isn't the human genome around fourteen billion bits? That seems
like the kind of thing I'd store in a bitstring...)

Wikipedia says it's just over 3 billion DNA base pairs, at 2 bits each
for over 6 billion bits.

You'd be able to store that in less than 1 gigabyte, but if you want
to store any additional information things are going to get tight.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,065
Latest member
OrderGreenAcreCBD

Latest Threads

Top