writing uint16_t in a buffer


Alessio Sangalli

Hi, I am building up a buffer in memory to be sent over the network as a
data structure.

In the following example, I will omit all the hton* calls for simplicity.

Imagine I have a char buffer[16] and I have to fill it up with a number
of small, 16bit values.

What I did is to define a macro:
#define CPY16(d,s) *(uint16_t*)&d=s

and then use it as follows:
uint16_t a=0xfaaf;
uint16_t b=0xc33c;

CPY16(buffer[12], a);
CPY16(buffer[2], b);

I did this because profiling revealed that the version above is almost
10 times faster than memcpy().

Where's the catch? Is this implementation/compiler/architecture dependent?

bye, thank you
Alessio
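
For reference, a minimal compilable version of the above might look like the
following sketch (hton* calls still omitted, as in the post; the extra
parentheses around d and s are just a defensive addition, not part of the
original macro):

#include <stdint.h>

#define CPY16(d,s) (*(uint16_t *)&(d) = (s))

int main(void)
{
    char buffer[16] = {0};
    uint16_t a = 0xfaaf;
    uint16_t b = 0xc33c;

    /* each store assumes buffer[12] / buffer[2] are suitably aligned
       for a uint16_t access -- that assumption is what the thread is about */
    CPY16(buffer[12], a);
    CPY16(buffer[2],  b);

    (void)buffer;   /* buffer would be handed to the send routine here */
    return 0;
}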
 

Bartc

Alessio Sangalli said:
Hi, I am building up a buffer in memory to be sent over the network as a
data structure.

In the following example, I will omit all the hton* calls for simplicity.

Imagine I have a char buffer[16] and I have to fill it up with a number
of small, 16bit values.

What I did is to define a macro:
#define CPY16(d,s) *(uint16_t*)&d=s

and then use it as follows:
uint16_t a=0xfaaf;
uint16_t b=0xc33c;

CPY16(buffer[12], a);
CPY16(buffer[2], b);

I did this because profiling revealed that the version above is almost
10 times faster than memcpy().

Are the uint16_t's always written to an even index? In that case why not
just use an array of 8 uint16_t's?

Or, if the values written will not overlap each other, perhaps use a struct.

Where's the catch? Is this implementation/compiler/architecture dependent?

There might be issues with alignment (and possibly byte-order).
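
A sketch of the array-of-uint16_t idea, assuming every value really does land
at an even offset; note that the byte order inside each element is still the
host's, so the hton*/encoding step is still needed before sending:

#include <stdint.h>
#include <string.h>

int main(void)
{
    /* keep the data as uint16_t from the start, so alignment is guaranteed */
    uint16_t words[8] = {0};

    words[6] = 0xfaaf;   /* the slot that was buffer[12..13] */
    words[1] = 0xc33c;   /* the slot that was buffer[2..3]   */

    /* one copy at the end if a plain char buffer is still required */
    char buffer[16];
    memcpy(buffer, words, sizeof buffer);

    (void)buffer;
    return 0;
}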
 

Willem

Alessio Sangalli wrote:
) Hi, I am building up a buffer in memory to be sent over the network as a
) data structure.
)
) In the following example, I will omit all the hton* calls for simplicity.
)
) Imagine I have a char buffer[16] and I have to fill it up with a number
) of small, 16bit values.
)
) What I did is to define a macro:
) #define CPY16(d,s) *(uint16_t*)&d=s
)
) and then use it as follows:
) uint16_t a=0xfaaf;
) uint16_t b=0xc33c;
)
) CPY16(buffer[12], a);
) CPY16(buffer[2], b);
)
) I did this because profiling revealed that the version above is almost
) 10 times faster than memcpy().

Are you using memcpy() once on the entire buffer, or are you
calling it for 2 bytes ?

How much faster is it than:

buffer[12] = 0xfa;
buffer[13] = 0xaf;
buffer[2] = 0xc3;
buffer[3] = 0x3c;


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 

litchie

Alessio Sangalli wrote:

) Hi, I am building up a buffer in memory to be sent over the network as a
) data structure.
)
) In the following example, I will omit all the hton* calls for simplicity.
)
) Imagine I have a char buffer[16] and I have to fill it up with a number
) of small, 16bit values.
)
) What I did is to define a macro:
) #define CPY16(d,s) *(uint16_t*)&d=s
)
) and then use it as follows:
) uint16_t a=0xfaaf;
) uint16_t b=0xc33c;
)
) CPY16(buffer[12], a);
) CPY16(buffer[2], b);
)
) I did this because profiling revealed that the version above is almost
) 10 times faster than memcpy().

Are you using memcpy() once on the entire buffer, or are you
calling it for 2 bytes ?

How much faster is it than:

buffer[12] = 0xfa;
buffer[13] = 0xaf;
buffer[2] = 0xc3;
buffer[3] = 0x3c;

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
            made in the above text. For all I know I might be
            drugged or something..
            No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
Alessio's solution takes 2 instructions to complete, with a small risk of
unaligned-access exceptions; yours takes 4 instructions but can run safely
on any platform.

Using memcpy() takes even more (function call + loop control + ...).

However, this kind of raw conversion is not recommended unless your
program will only run on machines of the same endianness.
 

Thad Smith

Alessio said:
Hi, I am building up a buffer in memory to be sent over the network as a
data structure.

In the following example, I will omit all the hton* calls for simplicity.

Imagine I have a char buffer[16] and I have to fill it up with a number
of small, 16bit values.

What I did is to define a macro:
#define CPY16(d,s) *(uint16_t*)&d=s

and then use it as follows:
uint16_t a=0xfaaf;
uint16_t b=0xc33c;

CPY16(buffer[12], a);
CPY16(buffer[2], b);

Do you want to encode the data as most significant byte first (big
endian) or least significant byte first (little endian)? This should
always be specified if you are transmitting or storing the data external
to the program.

I don't have a compiler handy to check, but you can use something like

#define INTTOBE16(d,s) ((&(d))[0]=((s)>>8)&0xff, (&(d))[1]=(s)&0xff)
#define INTTOLE16(d,s) ((&(d))[0]=(s)&0xff, (&(d))[1]=((s)>>8)&0xff)

Since both d and s are used twice in the macros, the arguments, when the
macro is used, should be free of side effects, such as incrementing a
value or calling a function.
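
A quick (untested, as noted above) sketch of how the big-endian macro would be
used with the values from the original post:

#include <stdint.h>
#include <stdio.h>

#define INTTOBE16(d,s) ((&(d))[0]=((s)>>8)&0xff, (&(d))[1]=(s)&0xff)

int main(void)
{
    char buffer[16] = {0};
    uint16_t a = 0xfaaf;

    INTTOBE16(buffer[12], a);   /* buffer[12] = 0xfa, buffer[13] = 0xaf */

    printf("%02x %02x\n",
           (unsigned char)buffer[12], (unsigned char)buffer[13]);
    return 0;
}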
 

Flash Gordon

litchie wrote, On 03/12/08 04:34:
Alessio Sangalli wrote:

) Hi, I am building up a buffer in memory to be sent over the network as a
) data structure.
)
) In the following example, I will omit all the hton* calls for simplicity.
)
) Imagine I have a char buffer[16] and I have to fill it up with a number
) of small, 16bit values.
)
) What I did is to define a macro:
) #define CPY16(d,s) *(uint16_t*)&d=s
)
) and then use it as follows:
) uint16_t a=0xfaaf;
) uint16_t b=0xc33c;
)
) CPY16(buffer[12], a);
) CPY16(buffer[2], b);
)
) I did this because profiling revealed that the version above is almost
) 10 times faster than memcpy().

Are you using memcpy() once on the entire buffer, or are you
calling it for 2 bytes ?

How much faster is it than:

buffer[12] = 0xfa;
buffer[13] = 0xaf;
buffer[2] = 0xc3;
buffer[3] = 0x3c;

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Please don't quote people's signatures, the bit typically after the "-- "
marker, although in this case it started with "SaSW, Willem".
Alessio's solution takes 2 instructions to complete, with a small risk of
unaligned-access exceptions,

There are platforms on which you won't get an exception, but the access can
be significantly slower if the data is incorrectly aligned. I believe there
are also real current systems where it will trap.
yours takes 4 instructions but can run safely
on any platform.

How many instructions it takes depends, in part, on whether you use the
compiler's optimiser.
Using memcpy() takes even more.. (function call + loop control + ...)

In that case, switch on the optimiser. Decent compilers will inline the
memcpy and then optimise the inlined code. If you are not using the
optimiser, then you should try that before even looking at how the code is
written!
However, this kind of raw conversion is not recommended unless your
program will only run on machines of the same endianness.

The OP mentioned that, for simplicity, the calls to the functions dealing
with endianness had been omitted.
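
A sketch of the memcpy-per-field approach being described; put16 is just an
illustrative helper name, and whether the call really collapses to a single
store depends on the compiler and the optimisation level, so it is worth
checking the generated code on the target:

#include <stdint.h>
#include <string.h>

/* with optimisation on, most compilers inline this memcpy into a single
   (alignment-safe) store rather than a real call with a loop */
static void put16(char *dst, uint16_t value)
{
    memcpy(dst, &value, sizeof value);
}

int main(void)
{
    char buffer[16];

    put16(&buffer[12], 0xfaaf);   /* host byte order; apply htons() first for the wire */
    put16(&buffer[2],  0xc33c);

    (void)buffer;
    return 0;
}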
 

litchie

Flash said:
litchie wrote, On 03/12/08 04:34: [...]
Alessio's solution takes 2 instructions to complete, with a small risk of
unaligned-access exceptions,
There are platforms on which you won't get an exception but it can be
significantly slowed if the data is incorrectly aligned. I believe there
are also real current systems where it will trap.

[...]

Worse, on many ARM processor types (widely used in portable devices these
days) I'm pretty sure unaligned word accesses simply get the "wrong"
bytes. So you won't even break into a debugger.

The ARM architecture spec says that an unaligned word access must raise an exception.
 

Chris Dollin

litchie said:
Flash said:
litchie wrote, On 03/12/08 04:34: [...]
Alessio's solution takes 2 instructions to complete, with a small risk of
unaligned-access exceptions,
There are platforms on which you won't get an exception but it can be
significantly slowed if the data is incorrectly aligned. I believe there
are also real current systems where it will trap.

[...]

Worse, on many ARM processor types (widely used in portable devices these
days) I'm pretty sure unaligned word accesses simply get the "wrong"
bytes. So you won't even break into a debugger.

The ARM architecture spec says that an unaligned word access must raise an exception.

The ARM in my RISC PC at home doesn't conform to that spec, then.
(It's an old StrongARM, so not really relevant to what you'd encounter
in recent embedded ARMs -- but if you did try the misaligned copy trick,
you /would/ just get the "wrong" answer, if I'm remembering correctly.)
 
