Portably extracting data from a bytestring

jacob navia

James said:
Let S be a pointer to a bytestring of length L. I would like to extract 4
bytes from S at the location p = S + d, with 0 < d < L - 4, and store them
into an unsigned int. I am looking for suggestions on how to do this

1) In portable ANSI C.
2) As efficiently as possible.
3) Taking full account of the potential data alignment and
endianness issues that this action must tackle.

#include <string.h>

typedef union {
    unsigned char c[sizeof(unsigned int)];
    unsigned int i;
} U;

/* Copy sizeof(unsigned int) bytes starting at S+d into the union. */
unsigned int convert(char *S, int d)
{
    U u;
    memcpy(&u, S + d, sizeof(unsigned int));
    return u.i;
}

This assumes that an unsigned int was stored at the given location.
The problem is that you did not define what "extract four bytes"
and "store them in an unsigned int" really mean.

If you do not care about alignment (for example, on x86), you could write:

unsigned int convert(char *S, int d)
{
    U *u;
    u = (U *)(S + d);   /* S+d may not be suitably aligned for U */
    return u->i;
}

This is more efficient, but you could get an alignment trap.

Both versions assume that
1) An unsigned int was previously stored at that location.
2) It is read back on the same machine architecture.
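
A minimal sketch of the round trip these assumptions describe: the value is written with memcpy() on the same machine and read back from an arbitrary, possibly unaligned, offset. The helper names store_uint and load_uint are hypothetical, chosen only for illustration.

```c
#include <string.h>

/* Hypothetical helpers: copy the host representation of an unsigned int
   into and out of a byte buffer. buf + d needs no particular alignment. */
void store_uint(unsigned char *buf, size_t d, unsigned int v)
{
    memcpy(buf + d, &v, sizeof v);
}

unsigned int load_uint(const unsigned char *buf, size_t d)
{
    unsigned int v;
    memcpy(&v, buf + d, sizeof v);
    return v;
}
```

The bytes in the buffer are in host order, so the buffer is not meaningful on a machine with a different byte order or a different size of unsigned int.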

jacob
 
Ben Pfaff

jacob navia said:
James said:
Let S be a pointer to a bytestring of length L. I would like to extract 4
bytes from S at the location p = S + d, with 0 < d < L - 4, and store them
into an unsigned int. I am looking for suggestions on how to do this
1) In portable ANSI C.
2) As efficiently as possible.
3) Taking full account of the potential data alignment and
endianness issues that this action must tackle.

typedef union {
    unsigned char c[sizeof(unsigned int)];
    unsigned int i;
} U;

unsigned int convert(char *S, int d)
{
    U u;
    memcpy(&u, S + d, sizeof(unsigned int));
    return u.i;
}

Why not just this:

#include <string.h>

unsigned int convert(char *s, int d)
{
    unsigned int i;
    memcpy(&i, s + d, sizeof i);   /* byte-wise copy; s + d need not be aligned */
    return i;
}

Character access is allowed to any type; memcpy() does character access.
 
James S. Singleton

Let S be a pointer to a bytestring of length L. I would like to extract 4
bytes from S at the location p = S + d, with 0 < d < L - 4, and store them
into an unsigned int. I am looking for suggestions on how to do this

1) In portable ANSI C.
2) As efficiently as possible.
3) Taking full account of the potential data alignment and
endianness issues that this action must tackle.
 
Walter Roberson

typedef union {
    unsigned char c[sizeof(unsigned int)];
    unsigned int i;
} U;

unsigned int convert(char *S, int d)
{
    U u;
    memcpy(&u, S + d, sizeof(unsigned int));
    return u.i;
}

I can't find the clause at the moment, but I'm relatively sure
that the behaviour is undefined to read a union member out of a
union unless it was the same one last written [except for cases
where you are retrieving from the same fundamental types
in union members with common prefixes.]

You are on safer ground casting the object pointer to char *.
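
One way to read that suggestion, as a sketch: fill the unsigned int through an unsigned char pointer instead of going through a union. The name copy_bytes_to_uint is hypothetical, and the result is still in host byte order, exactly as with memcpy().

```c
#include <stddef.h>

/* Hypothetical helper: build an unsigned int by writing its object
   representation one byte at a time through an unsigned char pointer.
   Access through unsigned char * is permitted for any object type. */
unsigned int copy_bytes_to_uint(const char *s, size_t d)
{
    unsigned int v;
    unsigned char *dst = (unsigned char *)&v;
    const unsigned char *src = (const unsigned char *)(s + d);
    size_t k;

    for (k = 0; k < sizeof v; k++)
        dst[k] = src[k];
    return v;
}
```

This does by hand what memcpy() does, so it shares the same assumption: the bytes at s + d must be a valid representation of an unsigned int on the current machine.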
 
Skarmander

James said:
Let S be a pointer to a bytestring of length L. I would like to extract 4
bytes from S at the location p = S + d, with 0 < d < L - 4, and store them
into an unsigned int. I am looking for suggestions on how to do this

1) In portable ANSI C.

Impossible if I take you literally, since an unsigned int isn't
guaranteed to be any bigger than 16 bits, and 4 bytes will be 32 bits
(since we're presumably talking about 8-bit bytes, not "C bytes" which
can be larger).

Make it an unsigned long instead. You could redescribe the problem as
"extracting sizeof(unsigned int) bytes" too, but this is something
different, and it may not be the problem at hand.

Alternatively, you could mean "in ANSI C that's portable save for the
assumption that an unsigned int is 32 bits". This will be acceptable for
the majority of existing platforms, as long as you keep in mind the
limits of portability here.
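
If that assumption is acceptable, it can at least be made explicit, so that the code refuses to compile where it does not hold. A sketch using only <limits.h>; the function uint_holds_32_bits is hypothetical and exists only to make the check observable at run time.

```c
#include <limits.h>

/* Refuse to build where the stated assumptions do not hold. */
#if CHAR_BIT != 8
#error "this code assumes 8-bit bytes"
#endif
#if UINT_MAX < 0xFFFFFFFFUL
#error "this code assumes unsigned int has at least 32 bits"
#endif

/* Hypothetical run-time counterpart of the preprocessor checks above. */
int uint_holds_32_bits(void)
{
    return CHAR_BIT == 8 && UINT_MAX >= 0xFFFFFFFFUL;
}
```

On a platform with a 16-bit unsigned int the translation unit stops at the #error line, which is the "portable modulo X" limit stated in the source itself.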
2) As efficiently as possible.

That's the trick, isn't it? The most efficient thing you can do is
obviously to interpret those 4 bytes as an int through a union. But
that's not guaranteed to work (also see below).
3) Taking full account of the potential data alignment and
endianness issues that this action must tackle.

How can we take it into account if you don't describe what endianness
issues there are? What do the bytes in the string mean? Assuming the
four bytes are a contiguous sequence of bits making up the binary
representation of an integer, you'd still need to know in what order
they're stored before you can turn them into a machine integer.

Theoretically there are 24 separate orderings, but of course the only
ones that matter in practice are big-endian (call this B4 B3 B2 B1) and
little-endian (B1 B2 B3 B4), and maybe some mixed form for 16-bit
architectures (B3 B4 B1 B2 and B2 B1 B4 B3, perverse but not unheard
of). You do not need to know the endianness of the target architecture
to perform the conversion (though it may help for efficiency), but you
do need to know the endianness of the bytes in the string.
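
A minimal sketch of that point, assuming the string holds 8-bit values and that its byte order is known to be either big-endian or little-endian. The names read_be32 and read_le32 are hypothetical. The result is an unsigned long, which is guaranteed to hold at least 32 bits.

```c
#include <stddef.h>

/* Hypothetical helpers: decode four 8-bit bytes starting at s[d].
   Only the byte order of the data matters; the host's own endianness
   and alignment rules never enter into it. */
unsigned long read_be32(const unsigned char *s, size_t d)
{
    return ((unsigned long)s[d]     << 24) |
           ((unsigned long)s[d + 1] << 16) |
           ((unsigned long)s[d + 2] <<  8) |
            (unsigned long)s[d + 3];
}

unsigned long read_le32(const unsigned char *s, size_t d)
{
    return  (unsigned long)s[d]            |
           ((unsigned long)s[d + 1] <<  8) |
           ((unsigned long)s[d + 2] << 16) |
           ((unsigned long)s[d + 3] << 24);
}
```

For the bytes 0x12 0x34 0x56 0x78, read_be32 yields 0x12345678 and read_le32 yields 0x78563412 on any conforming implementation. Whether this is as fast as a memcpy() depends on the compiler.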

For practical approaches, see the "obvious" solutions already posted by
others. It's important to know what problems these solve, and whether
they match the problem you described.

S.
 
Christopher Benson-Manica

Walter Roberson said:
I can't find the clause at the moment, but I'm relatively sure
that the behaviour is undefined to read a union member out of a
union unless it was the same one last written [except for cases
where you are retrieving from the same fundamental types
in union members with common prefixes.]

I believe the clause you are looking for is this one, from 3.3.2.3 of
the draft available at http://dev.unicals.com/papers/c89-draft.html:

"With one exception, if a member of a union object is accessed after a
value has been stored in a different member of the object, the
behavior is implementation-defined." [with the one exception being the
one you pointed out]
 
jacob navia

Ben said:
James said:
Let S be a pointer to a bytestring of length L. I would like to extract 4
bytes from S at the location p = S + d, with 0 < d < L - 4, and store them
into an unsigned int. I am looking for suggestions on how to do this
1) In portable ANSI C.
2) As efficiently as possible.
3) Taking full account of the potential data alignment and
endianness issues that this action must tackle.

typedef union {
    unsigned char c[sizeof(unsigned int)];
    unsigned int i;
} U;

unsigned int convert(char *S, int d)
{
    U u;
    memcpy(&u, S + d, sizeof(unsigned int));
    return u.i;
}

Why not just this:

unsigned int convert(char *s, int d)
{
    unsigned int i;
    memcpy(&i, s + d, sizeof i);
    return i;
}

Character access is allowed to any type; memcpy() does character access.

Well Ben, you are right :)

Much shorter, and essentially the same stuff.

jacob
 
Tim Rentsch

typedef union {
    unsigned char c[sizeof(unsigned int)];
    unsigned int i;
} U;

unsigned int convert(char *S, int d)
{
    U u;
    memcpy(&u, S + d, sizeof(unsigned int));
    return u.i;
}

I can't find the clause at the moment, but I'm relatively sure
that the behaviour is undefined to read a union member out of a
union unless it was the same one last written [except for cases
where you are retrieving from the same fundamental types
in union members with common prefixes.]

My best understanding is that it's debatable whether accessing
a union member other than the last one written results in
undefined behavior or in implementation-defined behavior. An
entry in an (informative) annex lists it as implementation-defined.
 
Tim Rentsch

Christopher Benson-Manica said:
Walter Roberson said:
I can't find the clause at the moment, but I'm relatively sure
that the behaviour is undefined to read a union member out of a
union unless it was the same one last written [except for cases
where you are retrieving from the same fundamental types
in union members with common prefixes.]

I believe the clause you are looking for is this one, from 3.3.2.3 of
the draft available at http://dev.unicals.com/papers/c89-draft.html:

"With one exception, if a member of a union object is accessed after a
value has been stored in a different member of the object, the
behavior is implementation-defined." [with the one exception being the
one you pointed out]

This sentence has disappeared from the Standard by now. There is
however a similar statement in an informative annex.
 
Christopher Benson-Manica

Tim Rentsch said:
This sentence has disappeared from the Standard by now. There is
however a similar statement in an informative annex.

What, exactly, is the difference between "normative" and
"informative"? IIUC, "informative" is not strictly "standard" - does
that mean that there is no "normative" text specifying how
implementations should deal with union member access?
 
Ben Pfaff

Christopher Benson-Manica said:
What, exactly, is the difference between "normative" and
"informative"?

Normative text is part of the standard.
Informative text, like footnotes, examples, and some appendices,
is not part of the standard. It is for information only.
 
Tim Rentsch

Christopher Benson-Manica said:
What, exactly, is the difference between "normative" and
"informative"? IIUC, "informative" is not strictly "standard" - does
that mean that there is no "normative" text specifying how
implementations should deal with union member access?

Taken from ISO/IEC Directives, Part 3:

3.4 normative elements
those elements setting out the provisions to which it is
necessary to conform in order to be able to claim compliance
with the standard

A "normative element" must be observed in order to conform to the
standard in question. Any "informative" text is supposed to be
right (and presumably useful), but it does not by itself impose
requirements on whatever is being defined in the standard. Both
normative text and informative text are part of a standard, but
only normative text imposes requirements that must be observed.

There is normative text that gives requirements for accessing
union members, but that text is sprinkled through the rest of the
C Standard. So it isn't easy to tell whether the logical consequences
of those requirements imply implementation-defined behavior.
 
Tim Rentsch

Ben Pfaff said:
Normative text is part of the standard.
Informative text, like footnotes, examples, and some appendices,
is not part of the standard. It is for information only.

Not exactly. Both normative elements and informative elements
are part of a standard, but only normative elements give
provisions that must be observed in order to claim conformance
(to whatever it is that's being standardized).

(I admit it's a minor distinction; I thought some people
might appreciate the clarification.)
 
pete

Skarmander said:
Impossible if I take you literally, since an unsigned int isn't
guaranteed to be any bigger than 16 bits, and 4 bytes will be 32 bits
(since we're presumably talking about 8-bit bytes, not "C bytes" which
can be larger).

"portable ANSI C" means "C bytes"

You are correct in that it is impossible.

Code doesn't have to be portable to be useful.

Pretending that code is portable when it isn't, is wrong.
 
Skarmander

pete wrote:
Code doesn't have to be portable to be useful.

Pretending that code is portable when it isn't, is wrong.

"Portable modulo X" is a meaningful concept. That is, "portable across
all machines where an int is 32 bits" is meaningful. Whether it's
acceptable depends, and nobody could advertise it as "100% portable ISO
C", but it's not "pretending". Just as long as you don't call it
"portable" without qualification.

S.
 
Jordan Abel

Skarmander wrote:
"Portable modulo X" is a meaningful concept. That is, "portable across
all machines where an int is 32 bits" is meaningful. Whether it's
acceptable depends, and nobody could advertise it as "100% portable ISO
C", but it's not "pretending". Just as long as you don't call it
"portable" without qualification.

And all that's needed in this case is to use long instead of int.
 
