Portably extracting data from a bytestring

jacob navia · Oct 25, 2005

James said:
Let S be a pointer to a bytestring of length L. I would like to extract 4
bytes from S at the location p = S + d, with 0 < d < L - 4, and store them
into an unsigned int. I am looking for suggestions on how to do this

1) In portable ANSI C.
2) As efficiently as possible.
3) Taking full account of the potential data alignment and
endianness issues that this action must tackle.

#include <string.h>

typedef union {
    unsigned char c[sizeof(unsigned int)];
    unsigned int i;
} U;

unsigned int convert(char *S, int d)
{
    U u;
    memcpy(&u, S + d, sizeof(unsigned int));
    return u.i;
}

This assumes that at the given location an integer was stored.
The problem is that you did not define what "extract four bytes"
and "store them in an unsigned int" really mean.

If you do not care about alignment (as on the x86 architecture), you could write:

unsigned int convert(char *S, int d)
{
    U *u;
    u = (U *)(S + d);
    return u->i;
}

More efficient, but you could get an alignment trap.

Both assume that
1) An integer was previously stored at that location.
2) You read it back on the same machine architecture.

jacob
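
The two assumptions above can be illustrated with a round trip on a single machine. This is a minimal sketch, not code from the thread: the helper names `store_uint` and `load_uint` are hypothetical, and it relies only on `memcpy` from `<string.h>`. Because `memcpy` copies byte by byte, the offset does not have to be suitably aligned for an unsigned int.

```c
#include <string.h>

/* Hypothetical helper: copy the object representation of v
   into buf, starting at byte offset d. */
void store_uint(unsigned char *buf, size_t d, unsigned int v)
{
    memcpy(buf + d, &v, sizeof v);
}

/* Hypothetical helper: copy sizeof(unsigned int) bytes from buf + d
   into an unsigned int. buf + d need not be aligned. */
unsigned int load_uint(const unsigned char *buf, size_t d)
{
    unsigned int v;
    memcpy(&v, buf + d, sizeof v);
    return v;
}
```

Calling `store_uint(buf, 1, x)` followed by `load_uint(buf, 1)` gives back `x` on the same machine. The bytes in between are in the machine's native order, so this says nothing about data written on a different architecture.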

Ben Pfaff · Oct 25, 2005

jacob navia said:
James said:

Let S be a pointer to a bytestring of length L. I would like to extract 4
bytes from S at the location p = S + d, with 0 < d < L - 4, and store them
into an unsigned int. I am looking for suggestions on how to do this
1) In portable ANSI C. 2) As efficiently as possible.
3) Taking full account of the potential data alignment and
endianness issues that this action must tackle.

typedef union {
unsigned char c[sizeof(unsigned int)];
unsigned int i;
} U;

unsigned int convert(char *S,int d)
{
U u;
memcpy(&u,S+d,sizeof(unsigned int));
return u.i;
}

Why not just this:

#include <string.h>

unsigned int convert(char *s, int d)
{
    unsigned int i;
    memcpy(&i, s + d, sizeof i);
    return i;
}

Character access is allowed to any type; memcpy() does character access.

James S. Singleton · Oct 25, 2005

Let S be a pointer to a bytestring of length L. I would like to extract 4
bytes from S at the location p = S + d, with 0 < d < L - 4, and store them
into an unsigned int. I am looking for suggestions on how to do this

1) In portable ANSI C.
2) As efficiently as possible.
3) Taking full account of the potential data alignment and
endianness issues that this action must tackle.

Walter Roberson · Oct 25, 2005

typedef union {
unsigned char c[sizeof(unsigned int)];
unsigned int i;
} U;

unsigned int convert(char *S,int d)
{
U u;
memcpy(&u,S+d,sizeof(unsigned int));
return u.i;
}

I can't find the clause at the moment, but I'm relatively sure
that the behaviour is undefined to read a union member out of a
union unless it was the same one last written [except for cases
where you are retrieving from the same fundamental types
in union members with common prefixes.]

You are on safer ground if you cast the object pointer to char *.
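
A rough sketch of what the character-pointer approach might look like, with no union involved. The name `convert_bytewise` is hypothetical, and the sketch assumes that the bytes at `s + d` are the object representation of an unsigned int written on the same machine.

```c
#include <stddef.h>

/* Hypothetical name: fill an unsigned int byte by byte through an
   unsigned char pointer instead of reading a union member. */
unsigned int convert_bytewise(const char *s, size_t d)
{
    unsigned int result = 0;
    unsigned char *dst = (unsigned char *)&result;
    const unsigned char *src = (const unsigned char *)s + d;
    size_t k;

    /* Character access to the bytes of any object is permitted. */
    for (k = 0; k < sizeof result; k++)
        dst[k] = src[k];
    return result;
}
```

This does by hand what memcpy() does, so in practice it is mainly useful for showing why the cast is considered safe.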

Skarmander · Oct 25, 2005

James said:
Let S be a pointer to a bytestring of length L. I would like to extract 4
bytes from S at the location p = S + d, with 0 < d < L - 4, and store them
into an unsigned int. I am looking for suggestions on how to do this

1) In portable ANSI C.

Impossible if I take you literally, since an unsigned int isn't
guaranteed to be any bigger than 16 bits, and 4 bytes will be 32 bits
(since we're presumably talking about 8-bit bytes, not "C bytes" which
can be larger).

Make it an unsigned long instead. You could redescribe the problem as
"extracting sizeof(unsigned int) bytes" too, but this is something
different, and it may not be the problem at hand.

Alternatively, you could mean "in ANSI C that's portable save for the
assumption that an unsigned int is 32 bits". This will be acceptable for
the majority of existing platforms, as long as you keep in mind the
limits of portability here.

2) As efficiently as possible.

That's the trick, isn't it? The most efficient thing you can do is
obviously just interpreting those 4 bytes as an int through a union. But
that's not guaranteed to work (also see below).

3) Taking full account of the potential data alignment and
endianness issues that this action must tackle.

How can we take it into account if you don't describe what endianness
issues there are? What do the bytes in the string mean? Assuming the
four bytes are a contiguous sequence of bits making up the binary
representation of an integer, you'd still need to know in what order
they're stored before you can turn them into a machine integer.

Theoretically there are 24 separate orderings, but of course the only
ones that matter in practice are big-endian (call this B4 B3 B2 B1) and
little-endian (B1 B2 B3 B4), and maybe some mixed form for 16-bit
architectures (B3 B4 B1 B2 and B2 B1 B4 B3, perverse but not unheard
of). You do not need to know the endianness of the target architecture
to perform the conversion (though it may help for efficiency), but you
do need to know the endianness of the bytes in the string.
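
When the byte order in the string is known, the value can be assembled with shifts, which works the same way whatever the endianness of the target. The following is a sketch under stated assumptions: the function names are hypothetical, each byte is taken to hold an 8-bit value (an octet), and unsigned long is used because it is guaranteed to hold at least 32 bits.

```c
/* Hypothetical helpers. Each masks with 0xFF so that only the low
   8 bits of every byte contribute, even if CHAR_BIT > 8. */

/* Most significant octet first (B4 B3 B2 B1 in the notation above). */
unsigned long get_u32_be(const unsigned char *p)
{
    return ((unsigned long)(p[0] & 0xFFu) << 24)
         | ((unsigned long)(p[1] & 0xFFu) << 16)
         | ((unsigned long)(p[2] & 0xFFu) << 8)
         |  (unsigned long)(p[3] & 0xFFu);
}

/* Least significant octet first (B1 B2 B3 B4). */
unsigned long get_u32_le(const unsigned char *p)
{
    return  (unsigned long)(p[0] & 0xFFu)
         | ((unsigned long)(p[1] & 0xFFu) << 8)
         | ((unsigned long)(p[2] & 0xFFu) << 16)
         | ((unsigned long)(p[3] & 0xFFu) << 24);
}
```

For the bytes 0x12 0x34 0x56 0x78, `get_u32_be` yields 0x12345678 and `get_u32_le` yields 0x78563412. Since only single bytes are read, alignment is not an issue either.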

For practical approaches, see the "obvious" solutions already posted by
others. It's important to know what problems these solve, and if they
match the problem you described.

S.

Christopher Benson-Manica · Oct 25, 2005

Walter Roberson said:
I can't find the clause at the moment, but I'm relatively sure
that the behaviour is undefined to read a union member out of a
union unless it was the same one last written [except for cases
where you are retrieving from the same fundamental types
in union members with common prefixes.]

I believe the clause you are looking for is this one, from 3.3.2.3 of
the draft available at http://dev.unicals.com/papers/c89-draft.html:

"With one exception, if a member of a union object is accessed after a
value has been stored in a different member of the object, the
behavior is implementation-defined." [with the one exception being the
one you pointed out]

jacob navia · Oct 25, 2005

Ben said:
James said:

Let S be a pointer to a bytestring of length L. I would like to extract 4
bytes from S at the location p = S + d, with 0 < d < L - 4, and store them
into an unsigned int. I am looking for suggestions on how to do this
1) In portable ANSI C. 2) As efficiently as possible.
3) Taking full account of the potential data alignment and
endianness issues that this action must tackle.

typedef union {
unsigned char c[sizeof(unsigned int)];
unsigned int i;
} U;

unsigned int convert(char *S,int d)
{
U u;
memcpy(&u,S+d,sizeof(unsigned int));
return u.i;
}

Why not just this:

unsigned int convert(char *s, int d)
{
unsigned int i;
memcpy(&i, s + d, sizeof i);
return i;
}

Character access is allowed to any type; memcpy() does character access.

Well, Ben, you are right.

Much shorter, and essentially the same stuff.

jacob

Tim Rentsch · Oct 26, 2005

typedef union {
unsigned char c[sizeof(unsigned int)];
unsigned int i;
} U;

unsigned int convert(char *S,int d)
{
U u;
memcpy(&u,S+d,sizeof(unsigned int));
return u.i;
}

I can't find the clause at the moment, but I'm relatively sure
that the behaviour is undefined to read a union member out of a
union unless it was the same one last written [except for cases
where you are retrieving from the same fundamental types
in union members with common prefixes.]

My best understanding is that it's debatable whether accessing
a union member other than the last one written results in
undefined behavior or in implementation-defined behavior. An
entry in an (informative) annex lists it as implementation-defined.

Tim Rentsch · Oct 26, 2005

Christopher Benson-Manica said:
Walter Roberson said:

I can't find the clause at the moment, but I'm relatively sure
that the behaviour is undefined to read a union member out of a
union unless it was the same one last written [except for cases
where you are retrieving from the same fundamental types
in union members with common prefixes.]

I believe the clause you are looking for is this one, from 3.3.2.3 of
the draft available at http://dev.unicals.com/papers/c89-draft.html:

"With one exception, if a member of a union object is accessed after a
value has been stored in a different member of the object, the
behavior is implementation-defined." [with the one exception being the
one you pointed out]

This sentence has disappeared from the Standard by now. There is
however a similar statement in an informative annex.

Christopher Benson-Manica · Oct 26, 2005

Tim Rentsch said:
This sentence has disappeared from the Standard by now. There is
however a similar statement in an informative annex.

What, exactly, is the difference between "normative" and
"informative"? IIUC, "informative" is not strictly "standard" - does
that mean that there is no "normative" text specifying how
implementations should deal with union member access?

Ben Pfaff · Oct 26, 2005

Christopher Benson-Manica said:
What, exactly, is the difference between "normative" and
"informative"?

Normative text is part of the standard.
Informative text, like footnotes, examples, and some appendices,
is not part of the standard. It is for information only.

Tim Rentsch · Oct 26, 2005

Christopher Benson-Manica said:
What, exactly, is the difference between "normative" and
"informative"? IIUC, "informative" is not strictly "standard" - does
that mean that there is no "normative" text specifying how
implementations should deal with union member access?

Taken from ISO/IEC Directives part 3:

3.4

normative elements

those elements setting out the provisions to which it is
necessary to conform in order to be able to claim compliance
with the standard

A "normative element" must be observed in order to conform to the
standard in question. Any "informative" text is supposed to be
right (and presumably useful), but it does not by itself impose
requirements on whatever is being defined in the standard. Both
normative text and informative text are part of a standard, but
only normative text imposes requirements that must be observed.

There is normative text that gives requirements for accessing
union members, but that text is sprinkled through the rest of the
C Standard. So it isn't easy to tell if the logical consequences
of those requirements imply implementation defined behavior.

Tim Rentsch · Oct 26, 2005

Ben Pfaff said:
Normative text is part of the standard.
Informative text, like footnotes, examples, and some appendices,
is not part of the standard. It is for information only.

Not exactly. Both normative elements and informative elements
are part of a standard, but only normative elements give
provisions that must be observed in order to claim conformance
(to whatever it is that's being standardized).

(I admit it's a minor distinction; I thought some people
might appreciate the clarification.)

Christopher Benson-Manica · Oct 26, 2005

Tim Rentsch said:
(I admit it's a minor distinction; I thought some people
might appreciate the clarification.)

I did - thank you.

pete · Oct 27, 2005

Skarmander said:
Impossible if I take you literally, since an unsigned int isn't
guaranteed to be any bigger than 16 bits, and 4 bytes will be 32 bits
(since we're presumably talking about 8-bit bytes, not "C bytes" which
can be larger).

"portable ANSI C" means "C bytes"

You are correct in that it is impossible.

Code doesn't have to be portable to be useful.

Pretending that code is portable when it isn't, is wrong.

Skarmander · Oct 27, 2005

pete wrote:

Code doesn't have to be portable to be useful.

Pretending that code is portable when it isn't, is wrong.

"Portable modulo X" is a meaningful concept. That is, "portable across
all machines where an int is 32 bits" is meaningful. Whether it's
acceptable depends, and nobody could advertise it as "100% portable ISO
C", but it's not "pretending". Just as long as you don't call it
"portable" without qualification.

S.
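
One way to make a "portable modulo X" assumption explicit is to check it at compile time, so that the code refuses to build where the assumption fails. A minimal sketch using the `<limits.h>` macros follows. The particular assumptions checked (8-bit bytes, 32-bit unsigned int) are taken from the discussion above, and the error messages are only illustrative.

```c
#include <limits.h>

/* Refuse to compile where the stated assumptions do not hold. */
#if CHAR_BIT != 8
#error "this code assumes 8-bit bytes"
#endif

#if UINT_MAX != 0xFFFFFFFFUL
#error "this code assumes a 32-bit unsigned int"
#endif
```

With these checks at the top of the file, the qualification is enforced by the compiler and no longer lives only in the documentation.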

Jordan Abel · Oct 28, 2005

Skarmander wrote:

"Portable modulo X" is a meaningful concept. That is, "portable across
all machines where an int is 32 bits" is meaningful. Whether it's
acceptable depends, and nobody could advertise it as "100% portable ISO
C", but it's not "pretending". Just as long as you don't call it
"portable" without qualification.

And all that's needed in this case is to use long instead of int.

pete · Oct 28, 2005

Jordan said:
OK.

And all that's needed in this case is to use long instead of int.

Qualification is still required for that.

For this specification:

I would like to extract 4 bytes

and store them into an unsigned int.

1) In portable ANSI C.

sizeof(long) can be less than 4 if CHAR_BIT is greater than 8.
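
The distinction here is between the size in C bytes and the range of values. The standard requires ULONG_MAX to be at least 4294967295, so unsigned long always has at least 32 value bits, even on an implementation where sizeof(long) is less than 4. A small sketch that measures this (the helper name `ulong_value_bits` is hypothetical):

```c
#include <limits.h>

/* Hypothetical helper: number of value bits in unsigned long,
   found by counting the bits set in ULONG_MAX. */
int ulong_value_bits(void)
{
    unsigned long m = ULONG_MAX;
    int n = 0;

    while (m != 0) {
        n++;
        m >>= 1;
    }
    return n;
}
```

The result is at least 32 on every conforming implementation, while `sizeof(unsigned long)` depends on CHAR_BIT. So an unsigned long can always hold 32 bits of data, but "4 bytes" only means 32 bits where CHAR_BIT is 8.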
