Name to method?

superheathen

Hi

I'm reading from a database that stores information as an integer
representing a char array of ints, it is created in the following
way:

unsigned char examplearray[] = {4, 3, 2, 1};
unsigned int exampleint = *(unsigned int *)examplearray;

and back again using:

unsigned char examplearray2[4];
*(unsigned int*)examplearray2 = exampleint;

It works, but I have no clue how it works. Does this technique have a
name so I can look into it?
 
Walter Roberson

It is sometimes called "type punning".
 
ts.death.angel

I'm still stuck.

Maybe someone can explain the math in the (de)conversion?

There's not really much math involved, you're just swapping around
pointers. Here's a rewrite that may be easier to understand:
unsigned char examplearray[] = {4, 3, 2, 1};
unsigned int *pointerint = examplearray; /* pointerint points to
examplearray (which is 32-bits; the size of an int) */
unsigned int exampleint = *pointerint; /* sets new int exampleint to
what pointerint points to */
 
superheathen

There's got to be a little; in the example I provided the integer
isn't a pointer or array. (Though I tried using the example you
provided and got:
warning: initialization from incompatible pointer type.) Using some
casting, it somehow takes the array and converts it to 16909060 (how
does it get this number?) and then using 16909060 is able to
reconstruct the array. To me it'd make more sense if it did use
pointers instead of the casting.

Sorry if I'm too dense to get what you're getting at.
 
superheathen

Also, if it were some sort of memory location, wouldn't it be
subject to change on each compile, rendering it unable to read the
database?
 
Walter Roberson

On Feb 11, 1:44 am, (e-mail address removed) wrote:
There's not really much math involved, you're just swapping around
pointers.


There is some elementary math involved.

The original code had,
unsigned char examplearray[] = {4, 3, 2, 1};
unsigned int exampleint = *(unsigned int *)examplearray;

This presumes that 'unsigned int' is the same size as 4 unsigned
char, which can also be expressed as sizeof(unsigned int) == 4.

An unsigned char is always at least 8 bits, so unsigned int in
this code is presumed to be at least 32 bits wide. This is a
non-portable assumption: 'unsigned long' should be used
instead of 'unsigned int', as unsigned long is guaranteed to
be at least 32 bits, but unsigned int might be as small as 16 bits.

It is possible in C to have a 32 bit unsigned int or unsigned long
and yet for sizeof(int) to not be 4: for example it is legal in
C for unsigned char itself to be 32 bits and sizeof(unsigned int) == 1.
Real systems with such characteristics exist -- and the code would
completely break on them.
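
Those size assumptions can be checked rather than presumed. A minimal sketch, assuming a hosted C99 compiler; the function names are made up for illustration and are not part of the original code:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Storage bits occupied by an unsigned int: bytes times bits per byte. */
size_t uint_storage_bits(void)
{
    return sizeof(unsigned int) * CHAR_BIT;
}

/* Same for unsigned long, which C guarantees to hold at least 32 bits. */
size_t ulong_storage_bits(void)
{
    return sizeof(unsigned long) * CHAR_BIT;
}

/* Nonzero only when the cast's assumption holds on this platform:
   an unsigned int spans exactly four 8-bit chars. */
int uint_matches_four_octets(void)
{
    return CHAR_BIT == 8 && sizeof(unsigned int) == 4;
}

/* Hypothetical guard to call once before relying on the cast. */
void require_four_octet_uint(void)
{
    assert(uint_matches_four_octets());
}
```

On a platform where unsigned char is 32 bits wide, uint_matches_four_octets() would return 0 and the guard would stop the program instead of letting it read past the intended data.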

When unsigned char examplearray[] = {4, 3, 2, 1}; then C guarantees
that the 4, 3, 2, 1 will be stored in memory in increasing address
order. If I use | to mark the end of bytes in increasing memory order,
examplearray would end up holding |4|3|2|1| in that order.

When (unsigned int *)examplearray is done (note I removed the
leading * from the expression), the resulting pointer will be
a pointer to unsigned int, and it will point to the beginning of
that memory area, |4|3|2|1| . The * in front of the pointer expression,
*(unsigned int *)examplearray "dereferences" that pointer, so
unsigned int exampleint will be an unsigned int loaded from memory
that was initialized to |4|3|2|1| .

Now this is the part that starts getting complicated: the -numeric-
significance of each byte of the |4|3|2|1| for the purposes of
unsigned int, is not necessarily going to be in the same order
as the bytes are written in memory.

On some systems ("big endian systems") the numeric order -would- be in
exactly that order, and the numeric value of the unsigned int would be
(4 << (3*CHARBIT)) + (3 << (2*CHARBIT)) + (2 << (1*CHARBIT)) + (1 << (0*CHARBIT))
(parenthesized because + binds tighter than << in C),
where CHARBIT is the number of bits in a char (typically 8 but could
be more; the standard macro is CHAR_BIT in <limits.h>). Using a non-C
notation for a moment where ** represents exponentiation, this would be
4 * (2**CHARBIT)**3 + 3 * (2**CHARBIT)**2 + 2 * (2**CHARBIT)**1 + 1 * (2**CHARBIT)**0
which is exactly parallel to traditional decimal (base 10) notation
in which the base 10 number 4321 means
4 * 10**3 + 3 * 10**2 + 2 * 10**1 + 1 * 10**0
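
The big-endian reading can be computed without any pointer cast. A small sketch, assuming 8-bit bytes; value_big_endian is a name invented here for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Value of four bytes when the first byte in memory is the most
   significant one (the "big endian" reading). Assumes 8-bit bytes. */
unsigned long value_big_endian(const unsigned char bytes[4])
{
    unsigned long value = 0;
    size_t i;

    assert(bytes != NULL);
    for (i = 0; i < 4; i++)
        value = (value << 8) | bytes[i];   /* same as value * 256 + byte */
    return value;
}
```

For the bytes |4|3|2|1| this returns 67305985, which is 0x04030201 in hexadecimal, where the digit pairs 04 03 02 01 make the 4321 pattern visible.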

However, there are other systems in which the numeric order of the
|4|3|2|1| bytes would be loaded from memory completely differently.
On a "little endian" system the first byte in memory is the least
significant, so the value would be

(1 << (3*CHARBIT)) + (2 << (2*CHARBIT)) + (3 << (1*CHARBIT)) + (4 << (0*CHARBIT))

which is 16909060 when CHARBIT is 8. Two "mixed endian" variations
would be

(3 << (3*CHARBIT)) + (4 << (2*CHARBIT)) + (1 << (1*CHARBIT)) + (2 << (0*CHARBIT))
and
(2 << (3*CHARBIT)) + (1 << (2*CHARBIT)) + (4 << (1*CHARBIT)) + (3 << (0*CHARBIT))

which could be respectively written (in non-C notation) as

3 * (2**CHARBIT)**3 + 4 * (2**CHARBIT)**2 + 1 * (2**CHARBIT)**1 + 2 * (2**CHARBIT)**0
and
2 * (2**CHARBIT)**3 + 1 * (2**CHARBIT)**2 + 4 * (2**CHARBIT)**1 + 3 * (2**CHARBIT)**0

These have analogs in base 10 as if the byte stream |4|3|2|1| were
loaded from memory as the decimal numbers 1234, 3412 or 2143 respectively.
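
A sketch that loads the same four bytes under different significance orders may make this concrete. It assumes 8-bit bytes; the order tables and the name load_with_order are conventions invented for this example. The fully reversed table is the one that yields 16909060, the number asked about earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* order[i] is the power of 256 that the byte at memory offset i is
   multiplied by (an illustrative convention, assuming 8-bit bytes). */
unsigned long load_with_order(const unsigned char bytes[4], const int order[4])
{
    unsigned long value = 0;
    size_t i;

    assert(bytes != NULL && order != NULL);
    for (i = 0; i < 4; i++)
        value |= (unsigned long)bytes[i] << (8 * order[i]);
    return value;
}

/* Significance tables: how |4|3|2|1| reads under each ordering. */
const int ORDER_BIG[4]    = {3, 2, 1, 0};  /* reads like 4321 */
const int ORDER_LITTLE[4] = {0, 1, 2, 3};  /* reads like 1234 */
const int ORDER_3412[4]   = {2, 3, 0, 1};  /* reads like 3412 */
const int ORDER_2143[4]   = {1, 0, 3, 2};  /* reads like 2143 */
```

With bytes |4|3|2|1| the four tables give 67305985, 16909060, 50594050 and 33620995 respectively.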

These different ways of assigning relative numeric significance to
streams of bytes in memory are not wrong, they are just different,
and as long as the program is consistent about which order is used
there is no problem (except when talking to other systems that
use different orders.)
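
To find out which order the machine at hand uses, one option is to store a known value and look at its bytes. This is only a sketch: it assumes the optional type uint32_t exists and has no padding, and host_byte_order is a name made up here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 'L' for little endian, 'B' for big endian, '?' for any
   other layout. */
char host_byte_order(void)
{
    const uint32_t probe = 0x01020304u;     /* most significant byte is 1 */
    unsigned char bytes[sizeof probe];

    assert(sizeof probe == 4);              /* uint32_t made of 8-bit bytes */
    memcpy(bytes, &probe, sizeof probe);    /* copy out the in-memory layout */

    if (bytes[0] == 4 && bytes[1] == 3 && bytes[2] == 2 && bytes[3] == 1)
        return 'L';                         /* least significant byte first */
    if (bytes[0] == 1 && bytes[1] == 2 && bytes[2] == 3 && bytes[3] == 4)
        return 'B';                         /* most significant byte first */
    return '?';
}
```

A program that exchanges data with other systems could call this once at startup and refuse to run, or convert, when the answer is not the one the data format expects.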

Pentium-type processors tend to use one of the little-endian
orderings; some processors such as MIPS R4000/R10000/R12000 etc.
use "big-endian" orderings. If you work with more than 2 distinct
processor architectures, you will probably encounter different
"endian" orderings at some point.


Now, when the process is reversed and the character array is
populated with the unsigned long value, the processor will take
the numeric value it has in the processor, and will write a sequence
of bytes into memory. The order that it does that writing in
need not be "most significant byte first" (that is, it need not be
the byte that carries the highest numeric significance that gets written
first). It could be -- "big endian" systems write in that order
for example. But lots of other systems write in some other order
(perhaps for some attempt to maintain compatibility with
the original 8 bit processors in their family lines). Whatever
order the processor uses to write values to memory will be the
exact mirror of the order that it loads from memory with,
so if the numeric order that it picked up from loading |4|3|2|1|
from memory was
3 * (2**CHARBIT)**3 + 4 * (2**CHARBIT)**2 + 1 * (2**CHARBIT)**1 + 2 * (2**CHARBIT)**0
then whatever current value it has to deal with will be written
reflecting that value order, producing |4|3|2|1| in memory.
With this ordering, if the current value it had in the processor was
141 * (2**CHARBIT)**3 + 17 * (2**CHARBIT)**2 + 92 * (2**CHARBIT)**1 + 29 * (2**CHARBIT)**0
then to maintain consistency with the loads, the bytes it would
write into memory would be |17|141|29|92| .
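
The store direction can be sketched the same way. This assumes 8-bit bytes; the order table convention and both function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* order[i] is the power of 256 carried by the byte at memory offset i
   (an illustrative convention, assuming 8-bit bytes). */
void store_with_order(unsigned long value, const int order[4],
                      unsigned char bytes[4])
{
    size_t i;

    assert(order != NULL && bytes != NULL);
    for (i = 0; i < 4; i++)
        bytes[i] = (unsigned char)((value >> (8 * order[i])) & 0xFFu);
}

/* Builds the value 141*256^3 + 17*256^2 + 92*256 + 29 from the text. */
unsigned long example_value(void)
{
    return (141UL << 24) | (17UL << 16) | (92UL << 8) | 29UL;
}
```

Storing example_value() with the table {2, 3, 0, 1}, which is the 3412 ordering, fills the array with |17|141|29|92|.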

Now, no matter what order was used to determine numeric significance upon
load, the storage will undo the effect for the same value,
so no matter what order your processor uses internally, loading
|4|3|2|1| from memory into an unsigned long and storing it again
is going to result in |4|3|2|1| (assuming that sizeof(unsigned long) == 4).

So the matter is more complex than just "manipulating pointers",
but the mathematics involved ends up cancelling itself out if you
load and then store the same value. If you had, for example, added
1 to the unsigned long and then stored the result back into memory,
you might have ended up with |4|3|2|2| or with |4|3|3|1| or with
|4|4|2|1| or with |5|3|2|1| and the mathematics involved would
help describe that. And if you were working with CHARBIT 8
and you had (say) |4|255|255|1| and were to add 1 to the unsigned long
storage of that, you would need the mathematics shown above to understand
the results you might get.
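
The effect of adding 1 can be simulated for any ordering. A sketch under the same assumptions as before (8-bit bytes, a made-up order table where order[i] is the power of 256 carried by memory offset i):

```c
#include <assert.h>
#include <stddef.h>

/* Load four bytes as a number, add one, and store the result back,
   all under one significance table (illustrative, 8-bit bytes). */
void add_one_in_place(unsigned char bytes[4], const int order[4])
{
    unsigned long value = 0;
    size_t i;

    assert(bytes != NULL && order != NULL);
    for (i = 0; i < 4; i++)                       /* load */
        value |= (unsigned long)bytes[i] << (8 * order[i]);

    value += 1;

    for (i = 0; i < 4; i++)                       /* store */
        bytes[i] = (unsigned char)((value >> (8 * order[i])) & 0xFFu);
}
```

With the table {2, 3, 0, 1}, |4|255|255|1| becomes |4|255|0|2|: the least significant byte sits at offset 2, so it wraps to 0 and the carry lands in offset 3. With the little-endian table {0, 1, 2, 3} the same bytes become |5|255|255|1| and no carry happens at all.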

For any given number of bits per char, there are 24 different values
that |4|3|2|1| might get loaded as an unsigned long, depending upon
the processor. A few processors, such as the ARM, are able to use
different memory storage orderings depending on the state of a flag.
(The MIPS Rx000 processors can as well, but it is more typical to
hard-wire the order bit so that it is constant for any one MIPS
motherboard.)
 
superheathen

excellent, appreciated tons.
 
Jack Klein

It might happen to "work" for your expectation of "work" on the
particular platform where you are using it. The C standard makes no
such guarantee, because the behavior is undefined. On some platforms,
if "examplearray" does not have the proper alignment, trying to access
it as an unsigned int will generate a hardware trap.
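
One commonly used way to sidestep both the alignment trap and the undefined behavior is to copy the bytes with memcpy instead of dereferencing a cast pointer. A minimal sketch; the helper names are hypothetical and the buffer is assumed to hold at least sizeof(unsigned int) bytes:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helpers: move the object representation with memcpy
   instead of dereferencing a cast pointer. */
unsigned int uint_from_bytes(const unsigned char *bytes)
{
    unsigned int value;

    assert(bytes != NULL);
    memcpy(&value, bytes, sizeof value);   /* no alignment requirement */
    return value;
}

void uint_to_bytes(unsigned int value, unsigned char *bytes)
{
    assert(bytes != NULL);
    memcpy(bytes, &value, sizeof value);
}
```

This only removes the alignment and aliasing problems. The number that comes out still depends on the size and byte order of unsigned int on the machine running the code.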

What you have here is an example of poorly written code by a
programmer who isn't anywhere near as knowledgeable as he/she thinks.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.club.cc.cmu.edu/~ajo/docs/FAQ-acllc.html
 
Kenny McCormack

Jack Klein said:
What you have here is an example of poorly written code by a
programmer who isn't anywhere near as knowledgeable as he/she thinks.

Yeah, that guy, Linus Torvalds, a real idiot. Probably lost his first
(and only) programming job - probably homeless and out on the street by now.

Yeah, I hear he used to do a lot of that sort of thing - type punning,
and god knows what else.
 
