Properly aligned dynamic buffer

Spoon

Hello everyone,

As far as I understand, if I request a uint8_t buffer,
it could be allocated anywhere.

uint8_t *buf = new uint8_t[1328]

By anywhere, I mean e.g. it could start at an odd address.

Therefore it might be incorrect to access 32 bits at a time:

*(uint32_t *)(buf+4*i) is probably illegal.

If I want a uint8_t buffer that is aligned on, say, 32 bits,
can I do the following:

uint8_t *buf = ( uint8_t * )( new uint32_t[1328/4] );

and have the guarantee that I can always access
*(uint32_t *)(buf+4*i) without any problem?

What if I want to align to 256 bits?
Do I have to create a bogus 256-bit structure?
(Since there is no native 256-bit integral type.)

Regards.
 
dizone

Spoon said:
Hello everyone,
Hello

As far as I understand, if I request a uint8_t buffer,
it could be allocated anywhere.

uint8_t *buf = new uint8_t[1328]

By anywhere, I mean e.g. it could start at an odd address.

(*very* unlikely, see below)
Therefore it might be incorrect to access 32 bits at a time:

*(uint32_t *)(buf+4*i) is probably illegal.

If I want a uint8_t buffer that is aligned on, say, 32 bits,
can I do the following:

uint8_t *buf = ( uint8_t * )( new uint32_t[1328/4] );

and have the guarantee that I can always access
*(uint32_t *)(buf+4*i) without any problem?

Well, it depends on what degree of portability you want to reach. First of
all, with 4*i above you automatically assume that
std::numeric_limits<uint8_t>::digits == 8 (i.e. that a byte has 8 bits,
which may not be the case on all platforms; granted, nobody guarantees that
a std::numeric_limits<uint8_t> exists, but if your platform provides
uint8_t it usually also comes with a numeric_limits specialization for it).

To answer your question: yes, an array allocated as uint32_t can be
accessed at offsets that are multiples of sizeof(uint32_t) bytes (I say
sizeof(uint32_t) rather than 4 because, as noted, the two may not always
be the same).
What if I want to align to 256 bits?
Do I have to create a bogus 256-bit structure?
(Since there is no native 256-bit integral type.)

Not all platforms allow "words" (a term defined for a specific platform)
to be accessed at arbitrary memory addresses; on those platforms, accesses
must be aligned accordingly. However, they all (AFAIK) have addresses that
are aligned for ANY type of access. C/C++ memory allocators generally rely
on this (e.g. malloc() needs to return memory aligned for any type,
while C++'s new is usually slightly freer to perform some alignment
optimizations).

So, to get "dynamic memory" suitable for any alignment, so far you may:
1. use std::malloc(); please note that the alignment guarantee holds ONLY
at the address returned and at multiples of sizeof() of the object
accessed from that base address (like an array), not at some random
location within the returned memory buffer; but then again, why not use
"new" for the type if you are going to store a single type (or an array
of it) in that buffer
2. perform some trick to find properly aligned memory for the type
needed; boost::alignment_of<> helps here, and it has an interesting
implementation

The only time I couldn't go with 1 and needed something like 2
was when I had to write my own memory allocator over a given contiguous
memory area (obtained with POSIX shared memory calls). To be more
portable, I used boost::alignment_of<> to determine properly
aligned addresses for whatever data structures the allocator's users
required.

Hope this helps.
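
The malloc() guarantee described in option 1 can be checked at run time. The following is a minimal sketch, assuming a C++11 compiler and a platform that provides std::uintptr_t; is_aligned and malloc_is_max_aligned are hypothetical helper names, not part of any library. Note that the guarantee only covers fundamental types (alignof(std::max_align_t)), so it does not by itself promise 256-bit alignment.

```cpp
#include <cassert>   // assert
#include <cstddef>   // std::max_align_t, std::size_t
#include <cstdint>   // std::uintptr_t
#include <cstdlib>   // std::malloc, std::free

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: allocates n bytes with std::malloc and reports
// whether the returned address is aligned for the most strictly aligned
// fundamental type on this platform.
bool malloc_is_max_aligned(std::size_t n) {
    void* p = std::malloc(n);
    if (p == nullptr) {
        return false;  // allocation failed, nothing to check
    }
    const bool ok = is_aligned(p, alignof(std::max_align_t));
    std::free(p);
    return ok;
}
```

In standard C++11 and later, alignof(T) reports the same kind of value that boost::alignment_of<T>::value does.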
 
dizone

Well depends on what degree of portability you want to reach. First of
all, above with 4*i you automatically assume that
std::numeric_limits<uint8_t>::digits == 8 (ie that a byte has 8 bits
which may not be the case on all platforms, sure also nobody says there
is a std::numeric_limits<uint8_t> but usually if your platform provides
uint8_t it also comes with a numeric_limits specialization for it).

Whoops, I meant "you assume that std::numeric_limits<char>::digits == 8"
(char because it's the type for which the standard says that its
sizeof() is always 1, i.e. a byte).
 
Jim Langston

Spoon said:
Hello everyone,

As far as I understand, if I request a uint8_t buffer,
it could be allocated anywhere.

uint8_t *buf = new uint8_t[1328]

By anywhere, I mean e.g. it could start at an odd address.

I don't believe so. I have read that malloc (which new generally uses)
returns an address that can be used for any type (char, int, etc.), so it
should always be aligned properly. dizone said the same thing.
Therefore it might be incorrect to access 32 bits at a time:

*(uint32_t *)(buf+4*i) is probably illegal.

If I want a uint8_t buffer that is aligned on, say, 32 bits,
can I do the following:

uint8_t *buf = ( uint8_t * )( new uint32_t[1328/4] );

and have the guarantee that I can always access
*(uint32_t *)(buf+4*i) without any problem?

What if I want to align to 256 bits?
Do I have to create a bogus 256-bit structure?
(Since there is no native 256-bit integral type.)

Regards.
 
Spoon

dizone said:
Spoon said:
As far as I understand, if I request a uint8_t buffer,
it could be allocated anywhere.

uint8_t *buf = new uint8_t[1328]

By anywhere, I mean e.g. it could start at an odd address.

(*very* unlikely, see below)
Therefore it might be incorrect to access 32 bits at a time:

*(uint32_t *)(buf+4*i) is probably illegal.

If I want a uint8_t buffer that is aligned on, say, 32 bits,
can I do the following:

uint8_t *buf = ( uint8_t * )( new uint32_t[1328/4] );

and have the guarantee that I can always access
*(uint32_t *)(buf+4*i) without any problem?

Well depends on what degree of portability you want to reach. First of
all, above with 4*i you automatically assume that
std::numeric_limits<uint8_t>::digits == 8 (ie that a byte has 8 bits
which may not be the case on all platforms,

??

http://www.opengroup.org/onlinepubs/000095399/basedefs/stdint.h.html

I only assume that my platform defines uint8_t, which denotes an
unsigned integer type with a width of exactly 8 bits.

AFAIU, (uint32_t *)(buf+4*i) and ((uint32_t *)buf)+i are strictly
equivalent. I don't assume anything, do I?
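
The equivalence claimed above can be spelled out in code. This is a minimal sketch, assuming the buffer was allocated as uint32_t (as in the new uint32_t[1328/4] idea) so that every address involved is suitably aligned; same_address and equivalent_for_all_indices are hypothetical names. It only compares addresses and does not settle whether a given dereference is allowed.

```cpp
#include <cstddef>   // std::size_t
#include <cstdint>   // std::uint8_t, std::uint32_t

// If both exact-width types exist, a 32-bit word spans exactly four
// 8-bit bytes, which is what the "4*i" byte offset relies on.
static_assert(sizeof(std::uint32_t) == 4 * sizeof(std::uint8_t),
              "4*i byte offsets assume 4 bytes per 32-bit word");

// Compares the two ways of reaching the i-th 32-bit word.
// `words` must point to an array with more than i elements.
bool same_address(std::uint32_t* words, std::size_t i) {
    std::uint8_t* buf = reinterpret_cast<std::uint8_t*>(words);
    // Byte offset first, then cast: (uint32_t *)(buf + 4*i)
    const void* via_bytes = reinterpret_cast<std::uint32_t*>(buf + 4 * i);
    // Cast first, then word offset: ((uint32_t *)buf) + i
    const void* via_words = reinterpret_cast<std::uint32_t*>(buf) + i;
    return via_bytes == via_words;
}

// Checks every index of a 1328-byte buffer allocated as 32-bit words.
bool equivalent_for_all_indices() {
    const std::size_t count = 1328 / 4;
    std::uint32_t words[count] = {};
    for (std::size_t i = 0; i < count; ++i) {
        if (!same_address(words, i)) {
            return false;
        }
    }
    return true;
}
```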
 
dizone

Spoon said:
??

http://www.opengroup.org/onlinepubs/000095399/basedefs/stdint.h.html

I only assume that my platform defines uint8_t, which denotes an
unsigned integer type with a width of exactly 8 bits.

Yes, that was a stupid thing for me to say; I corrected myself in
another message. What I meant to say is that you assume that
std::numeric_limits<char>::digits == 8.

Spoon said:
AFAIU, (uint32_t *)(buf+4*i) and ((uint32_t *)buf)+i are strictly
equivalent. I don't assume anything, do I?

I'm not so sure. How can you be sure that on every platform
sizeof(uint8_t) * 4 == sizeof(uint32_t)?

As I was trying to say in my earlier message, some platforms have a
different number of bits per byte, so it's natural to assume that on
those platforms word alignment falls on a multiple of their byte size
(let's assume 12 bits) rather than on a multiple of 8 bits. To support
fixed-width types (like uint8_t), such a platform would probably add
padding, so that sizeof(uint8_t) would be 1 (i.e. one 12-bit byte can
hold an 8-bit integer value) but sizeof(uint32_t) might be 3 (i.e. 3
bytes, because 3 * 12 bits = 36 bits, which can hold a 32-bit integer
value).
 
Spoon

dizone said:
While not all platforms allow one to access "words" (a term defined for
a specific platform) at any memory address and must be aligned
accordingly, they all (AFAIK) have addresses which are aligned for ANY
type of access. This feature is generally used by C/C++ memory
allocators (ex. malloc() needs to return such aligned for all memory
usually while C++'s new is slightly more free to perform some alignment
optimizations).

Let me say a bit more about what I want to do.

Consider two N-bit buffers A and B. (Typically N is ~10000)

I want to compute C = A XOR B.
That is, for each bit i, C[i] = A[i] XOR B[i]
As you can see, this problem is embarrassingly parallel.

The original naive solution was:

Q=N/8
uint8_t A[Q]; uint8_t B[Q]; uint8_t C[Q];
for (int i=0; i < Q; ++i) C[i] = A[i] ^ B[i];

The next step was to work with words (32 bits on my platform).

Q=N/32
uint32_t A[Q]; uint32_t B[Q]; uint32_t C[Q];
for (int i=0; i < Q; ++i) C[i] = A[i] ^ B[i];

Profiling indicated that I spend most of my time in this function, so I
figured I'd turn to platform-specific optimizations. My platform
provides 128-bit multimedia registers. But unaligned access incurs a
penalty. Thus, I want to guarantee that all 3 buffers are 128-bit
aligned, in order to write something like:

Q=N/128
uint32_t A[Q]; uint32_t B[Q]; uint32_t C[Q];
for (int i=0; i < Q; ++i) C[i] = A[i] ^ B[i];

(~16 times faster than my original implementation.)

How do I convince new to give me a 128-bit aligned buffer?

#include <cstdio>
struct foo { long long x,y; };
int main()
{
for (int i=0; i < 3; ++i) printf("%p\n", ( void * )( new foo ));
}

0x804a008
0x804a020
0x804a038

Are you saying I need to request "more" and "fix" the pointer?
So in order to get "dynamic memory" for any alignment so far you may:
1. use std::malloc(), please note that the alignment guarantee is ONLY
at the address returned and at multiples of sizeof() of the object
accessed from that base address (like an array) and not some random
location within the returned memory buffer; but then again why not use
"new" for the type if you are going to store a single type (or array
of) at that buffer
2. perform some trick to spot properly aligned memory for the type
needed; this is used in boost::alignment_of<> which has an interesting
implementation

My only need when I couldn't go with 1 and needed something like in 2
was when I had to do my own memory allocator over a given contigous
memory area (got with POSIX shared memory calls). In order to be more
portable I used the boost::alignment_of<> stuff to determine properly
aligned addresses for whatever data structures the allocator users
required.

Regards.
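
One way to get 128-bit aligned buffers without fixing pointers by hand is to wrap four 32-bit words in a 16-byte-aligned struct. The sketch below assumes a C++17 compiler, where new and std::allocator honor alignas on over-aligned types; Block128, xor_blocks, and xor_demo are hypothetical names, not from the thread. The loop is plain portable C++; whether the compiler turns it into 128-bit register operations is not guaranteed.

```cpp
#include <cstddef>   // std::size_t
#include <cstdint>   // std::uint32_t, std::uintptr_t
#include <vector>    // std::vector

// Hypothetical 128-bit block: four 32-bit words, aligned on 16 bytes.
struct alignas(16) Block128 {
    std::uint32_t w[4];
};

static_assert(sizeof(Block128) == 16, "one block is 128 bits");

// Computes c[i] = a[i] XOR b[i] for q blocks of 128 bits each.
void xor_blocks(const Block128* a, const Block128* b, Block128* c,
                std::size_t q) {
    for (std::size_t i = 0; i < q; ++i) {
        for (int k = 0; k < 4; ++k) {
            c[i].w[k] = a[i].w[k] ^ b[i].w[k];
        }
    }
}

// Builds three 1328-byte buffers, XORs two of them into the third, and
// checks both the result and the 16-byte alignment of the output buffer.
bool xor_demo() {
    const std::size_t q = 1328 / sizeof(Block128);  // 83 blocks
    std::vector<Block128> a(q), b(q), c(q);
    for (std::size_t i = 0; i < q; ++i) {
        for (int k = 0; k < 4; ++k) {
            a[i].w[k] = static_cast<std::uint32_t>(4 * i + k);
            b[i].w[k] = 0xFFFFFFFFu;
        }
    }
    xor_blocks(a.data(), b.data(), c.data(), q);

    bool ok = reinterpret_cast<std::uintptr_t>(c.data()) % 16 == 0;
    for (std::size_t i = 0; i < q; ++i) {
        for (int k = 0; k < 4; ++k) {
            ok = ok && c[i].w[k] == (a[i].w[k] ^ 0xFFFFFFFFu);
        }
    }
    return ok;
}
```

The same pattern with alignas(32) and eight words would give a 256-bit aligned block, again assuming C++17 aligned allocation.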
 
Spoon

dizone said:
Yes that was a stupid thing for me to say, I corrected myself in
another message, however, I meant to say you assume that
std::numeric_limits<char>::digits == 8.


I'm not so sure. How can you be sure that on every platform
sizeof(uint8_t) * 4 == sizeof(uint32_t) ?

As I was trying to say in my earlier message some platforms have a
different number of bits for a byte and thus it's natural to assume
that on those platforms alignment for a word won't happen in a multiple
of 8 bits but in a multiple of their byte (which lets assume it has 12
bits). In that case that platform in order to support fixed bitlen
types (like uint8_t) it will probably do some padding as such on that
platform sizeof(uint8_t) will be 1 (ie one 12bit byte can support a
8bit integer value) but sizeof(uint32_t) might be 3 (ie 3 bytes,
because 3 * 12bit = 36bit and can support a 32bit integer value).

Perhaps I misread the standard, but it seems to me that uint8_t is only
defined on platforms where there exists a native unsigned integer type
with a width of *exactly* 8 bits.

Thus, on your hypothetical platform with 12-bit chars, uint8_t would not
be defined, as far as I understand.
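
This reading can be encoded as compile-time checks. The sketch below assumes a C++11 <cstdint>, which defines the UINT8_MAX macro only when uint8_t itself exists; has_exact_8bit_type and bits_per_byte are hypothetical names. CHAR_BIT from <climits> is the direct way to ask how many bits a byte has.

```cpp
#include <climits>   // CHAR_BIT
#include <cstdint>   // std::uint8_t, std::uint32_t, UINT8_MAX

#ifdef UINT8_MAX
// uint8_t has exactly 8 bits and no padding, and no object is smaller
// than one byte, so its existence implies 8-bit bytes.
static_assert(CHAR_BIT == 8, "uint8_t implies 8-bit bytes");
static_assert(sizeof(std::uint8_t) == 1, "uint8_t occupies one byte");
#endif

// Hypothetical helper: reports whether this platform defines uint8_t.
constexpr bool has_exact_8bit_type() {
#ifdef UINT8_MAX
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: number of bits in a byte on this platform.
constexpr int bits_per_byte() {
    return CHAR_BIT;
}
```

On a platform with 12-bit bytes, the #ifdef branch would simply be skipped, because uint8_t would not be defined there.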
 
dizone

Spoon said:
dizone said:
While not all platforms allow one to access "words" (a term defined for
a specific platform) at any memory address and must be aligned
accordingly, they all (AFAIK) have addresses which are aligned for ANY
type of access. This feature is generally used by C/C++ memory
allocators (ex. malloc() needs to return such aligned for all memory
usually while C++'s new is slightly more free to perform some alignment
optimizations).

Let me say a bit more about what I want to do.

Consider two N-bit buffers A and B. (Typically N is ~10000)

I want to compute C = A XOR B.
That is, for each bit i, C[i] = A[i] XOR B[i]
As you can see, this problem is embarrassingly parallel.

The original naive solution was:

Q=N/8
uint8_t A[Q]; uint8_t B[Q]; uint8_t C[Q];
for (int i=0; i < Q; ++i) C[i] = A[i] ^ B[i];


Slow and unportable (it doesn't catch all the bits, as I explained for
platforms with 12-bit bytes).
The next step was to work with words (32 bits on my platform).

Q=N/32
uint32_t A[Q]; uint32_t B[Q]; uint32_t C[Q];
for (int i=0; i < Q; ++i) C[i] = A[i] ^ B[i];


Much faster but still unportable :)
Profiling indicated that I spend most of my time in this function, so I
figured I'd turn to platform-specific optimizations. My platform
provides 128-bit multimedia registers. But unaligned access incurs a
penalty. Thus, I want to guarantee that all 3 buffers are 128-bit
aligned, in order to write something like:

Q=N/128
uint32_t A[Q]; uint32_t B[Q]; uint32_t C[Q];
for (int i=0; i < Q; ++i) C[i] = A[i] ^ B[i];

(~16 times faster than my original implementation.)

How do I convince new to give me a 128-bit aligned buffer?


By asking it to allocate something that is 128 bits wide, I guess, although
I'm not so sure about that (because of the problem of a byte not always
being 8 bits).

However, in your situation, what I would do is just use the
"natural" platform integer type, which is what "int" is. This is
actually why C/C++ never specify fixed bit sizes for integers: they want
to let the coder just use "int", which maps to the natural platform
integer type (on 32-bit platforms it has 32 bits; on 64-bit platforms it
has 32 or 64, depending on the platform).

So in your situation I would just use new int[some size], loop over
it, and XOR it. This should be fast ("int" is the natural platform
integer, normally the one the platform works fastest with) and portable
(on 12-bit byte platforms "int" would probably be some multiple of 12
bits in size, so there are no issues with padding bits when XORing the
buffer).

I would also test with long and benchmark it against int, in case it
provides better speed (long is still portable).

However, if you want to use special CPU instructions (that work with
128-bit integers), you can get memory aligned for anything (including
128-bit access) with std::malloc(). Thus, use std::malloc() to get
the memory and the special CPU instructions to loop over it and XOR it.
#include <cstdio>
struct foo { long long x,y; };
int main()
{
for (int i=0; i < 3; ++i) printf("%p\n", ( void * )( new foo ));
}

0x804a008
0x804a020
0x804a038

Are you saying I need to request "more" and "fix" the pointer?

That is another solution too. You can request more and fix the pointer,
using boost::alignment_of<> to find the required alignment, but in your
particular case I don't think it's needed: just test an XOR loop over
int/long buffers from normal "new", and if that's not fast enough, use
std::malloc() to get memory aligned for anything and run the special
128-bit CPU instructions over it.
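
The "request more and fix the pointer" approach can be sketched as follows. This is a minimal illustration, not a production allocator: AlignedBuffer, align_up, allocate_aligned, and release are hypothetical names, and the code assumes that converting a pointer to std::uintptr_t and back preserves the address, which holds on mainstream platforms but is implementation-defined. It is useful when the required alignment exceeds what malloc() promises, since the standard only guarantees alignment for fundamental types.

```cpp
#include <cassert>   // assert
#include <cstddef>   // std::size_t
#include <cstdint>   // std::uintptr_t
#include <cstdlib>   // std::malloc, std::free

// Rounds p up to the next multiple of `alignment` (a power of two).
void* align_up(void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment - 1);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<void*>((addr + mask) & ~mask);
}

// Hypothetical pair of pointers: `raw` is what malloc returned and must be
// the one passed to free; `data` is the aligned address inside that block.
struct AlignedBuffer {
    void* raw;
    void* data;
};

// Requests `alignment - 1` extra bytes so that an aligned address with
// `size` usable bytes is guaranteed to exist inside the allocation.
AlignedBuffer allocate_aligned(std::size_t size, std::size_t alignment) {
    void* raw = std::malloc(size + alignment - 1);
    AlignedBuffer buf = { raw, raw ? align_up(raw, alignment) : nullptr };
    return buf;
}

// Frees the original allocation and clears both pointers.
void release(AlignedBuffer& buf) {
    std::free(buf.raw);
    buf.raw = nullptr;
    buf.data = nullptr;
}
```

For example, allocate_aligned(1328, 32) would yield a 1328-byte region aligned on 256 bits, with no need for a bogus 256-bit structure.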
 
dizone

Spoon said:
Perhaps I misread the standard, but it seems to me that uint8_t is only
defined on platforms where there exists a native unsigned integer type
with a width of *exactly* 8 bits.

Thus, on your hypothetical platform with 12-bit chars, uint8_t would not
be defined, as far as I understand.

I see, sorry about that. I limited myself to ISO C++, which has no
uint8_t, so I assumed some things about it :)

Indeed, if that's the text, then my example is of no use (however, I
really think I read about REAL platforms with 12-bit bytes, so not
only in theory).
 
