Struct with unaligned fields

James Harris · Aug 23, 2013

Jorgen Grahn said:
This comes up often in this group.

I knew it was an old 'chestnut'! Hopefully the constraints I included made
it less boring than just asking how to align fields.

As far as I'm concerned, the standard approach for all binary I/O is
not to pretend the I/O buffers map to the in-memory representation of
some C data structure. Treat it as a buffer of N 8-bit octets instead
if that is what it is:

// assuming uint8_t is available, and that
// FAT uses little-endian encoding
const uint8_t * p = something();
p += 8;
unsigned bps = get16_le(p); p += 2;
unsigned spc = *p++;
unsigned blahblah = get16_le(p); p += 2;

You mention not wanting to do something "basic and slow" ... well, the
above is basic and /should/ be slow -- but (a) is that still true
today, with caches and stuff, and (b) does it really matter to your
application?

This is good. I had thought about extracting the fields individually to a
stuct but by mentioning each one explicitly. I hadn't thought about stepping
a pointer along them. As long as the get16_le operations were fast I don't
think your method would be slow per se. I might use that type of code in
some cases. Objections to the code, though:

* It effectively uses magic numbers for the field offsets. Is that good
practice?

* In some cases not all fields would need to be converted. Details follow.

As an example of the latter, and to correct the layout I had before which
had the odd and even alignments the wrong way round, here is the full struct
for the structure in question. I have simplified the field names as the
standard ones were just too ucky!

/*
* Define the FAT geometry table at the start of a volume
*
* The numbers preceding each field give the byte offsets of that field.
* Fields which cannot be aligned are split into arrays of octets.
*/

struct bootsect {
/* 0 */ ui8 jump[3],
/* 3 */ ui8 oem[8],
/* 11 */ ui8 secsize[2], /* Split due to alignment */
/* 13 */ ui8 sec_per_clus,
/* 14 */ ui16 sec_rsvd,
/* 16 */ ui8 fats,
/* 17 */ ui8 rootents[2], /* Split due to alignment */
/* 19 */ ui8 totsec16[2], /* Split due to alignment */
/* 21 */ ui8 medium_type,
/* 22 */ ui16 fatsize16,
/* 24 */ ui16 sec_per_trk,
/* 26 */ ui16 heads,
/* 28 */ ui32 hid_sec,
/* 32 */ ui32 totsec32,
/* 36 */ ui8 drv_num,
/* 37 */ ui8 rsvd1,
/* 38 */ ui8 bootsig,
/* 39 */ ui8 volid[4], /* Split due to alignment */
/* 43 */ ui8 vollab[11],
/* 54 */ ui8 fstype[8]
/* 63 */
};
typedef struct bootsect bootsect_t;

As you can see, in this case and as long as I have got the layout right,
there should be only four fields which would have to be split due to
alignment issues. The other fields should be accessible directly and so not
need to be converted first.

James

James Harris · Aug 23, 2013

Eric Sosman said:
On 8/22/2013 2:06 PM, James Harris wrote:
....

....

One standard approach is to use an array of unsigned char
for I/O, use a struct arranged by the local compiler for dealing
with the object internally, and to write a pair of "encode" and
"decode" functions. (Note that these functions can be highly
portable, and needn't require rewriting unless you encounter a
system that has no 8-bit integer type.)

OK.

I take it in this case that there is one encode and one decode function per
data type rather than one such pair per field?

The first approach can become awkward if the external data
format re-uses the same spans of bytes for different purposes
that are not easily detectable by the encoder and decoder. (For
example, maybe the format of an entry varies depending on something
in a different entry that's related to it.) In such cases another
approach is fairly standard: You keep the actual data bytes hidden
away as opaque data, and write accessor functions for the various
items within it. That is, you write encoders and decoders for each
datum instead of a single "wholesale" pair. Again, the accessors
can be written very portably.

You mean something like these?

ui16 secsize_get(bootsect_t *bs) {
return bs->secsize[1] << 8 | bs->secsize[0];
}

void secsize_set(bootsect_t *bs, ui16 value) {
bs->secsize[0] = value & 0xff;
bs->secsize[1] = (value >> 8) & 0xff;
}

You make two mentions of speed in connection with processing
these foreign-format items, but I think your concern is misplaced.
Where are these FAT entries coming from and going to, after all?
How many instructions can your CPU execute in the time required to
read one FAT entry? Maybe if they're pouring in from an SSD at
6Gbit/sec (who'd use FAT on such a beast?) you might see a delay,
but ... (Besides, at that rate you'd suck in 4GB in less than six
seconds; a moment's pause would be welcome anyhow.)

Partly agreed, and I see the same point was made by Bartc and Stephen
Sprunk. However, although disk speeds are vastly slower there are some
things to consider:

1. Once read from a disk into memory such a structure can be read and
written many times between CPU cache and CPU before being flushed back to
disk or dropped.

2. The structure could well come from a ramdisk.

3. A FAT header is just a particular example. There are other memory layouts
that would need to be treated in the same way. For instance, the BIOS Data
Area which is present on every PC at boot time is a design from the same era
and has some simlilarly badly aligned elements (not that I need to access
the BDA).

James

Eric Sosman · Aug 23, 2013

[...]
/*
* Define the FAT geometry table at the start of a volume
*
* The numbers preceding each field give the byte offsets of that field.
* Fields which cannot be aligned are split into arrays of octets.
*/

struct bootsect {
/* 0 */ ui8 jump[3],
/* 3 */ ui8 oem[8],
/* 11 */ ui8 secsize[2], /* Split due to alignment */
/* 13 */ ui8 sec_per_clus,
/* 14 */ ui16 sec_rsvd,
/* 16 */ ui8 fats,
/* 17 */ ui8 rootents[2], /* Split due to alignment */
/* 19 */ ui8 totsec16[2], /* Split due to alignment */
/* 21 */ ui8 medium_type,
/* 22 */ ui16 fatsize16,
/* 24 */ ui16 sec_per_trk,
/* 26 */ ui16 heads,
/* 28 */ ui32 hid_sec,
/* 32 */ ui32 totsec32,
/* 36 */ ui8 drv_num,
/* 37 */ ui8 rsvd1,
/* 38 */ ui8 bootsig,
/* 39 */ ui8 volid[4], /* Split due to alignment */
/* 43 */ ui8 vollab[11],
/* 54 */ ui8 fstype[8]
/* 63 */
};
typedef struct bootsect bootsect_t;

As you can see, in this case and as long as I have got the layout right,
there should be only four fields which would have to be split due to
alignment issues. The other fields should be accessible directly and so not
need to be converted first.

There remains a possibility that the struct could contain
superfluous padding, but that's mostly a theoretical concern.

My question is: What good is this struct, anyhow? Many of
its elements can't be manipulated directly, either because
they've been split into arrays, or because their endianness
might be wrong, or both. What does this struct buy you?

IMHO you'd be far better off with a "native" struct and
a pair of encoder/decoder functions. *Far* better.

Eric Sosman · Aug 23, 2013

OK.

I take it in this case that there is one encode and one decode function per
data type rather than one such pair per field?

Something like

void decodeFat(struct nativeFatStruct *out,
const unsigned char *in);
void encodeFat(unsigned char *out,
const struct nativeFatStruct *in);

.... and similarly for decodeThin()/encodeThin(), etc.

The first approach can become awkward if the external data
format re-uses the same spans of bytes for different purposes
that are not easily detectable by the encoder and decoder. (For
example, maybe the format of an entry varies depending on something
in a different entry that's related to it.) In such cases another
approach is fairly standard: You keep the actual data bytes hidden
away as opaque data, and write accessor functions for the various
items within it. That is, you write encoders and decoders for each
datum instead of a single "wholesale" pair. Again, the accessors
can be written very portably.

Click to expand...

You mean something like these?

ui16 secsize_get(bootsect_t *bs) {
return bs->secsize[1] << 8 | bs->secsize[0];
}

void secsize_set(bootsect_t *bs, ui16 value) {
bs->secsize[0] = value & 0xff;
bs->secsize[1] = (value >> 8) & 0xff;
}

Right. If the "secsize" field overlaps (or partially
overlaps) the "spasmcount" field, this approach gets the right
result with a "this_set" is followed by a "that_get".

Partly agreed, and I see the same point was made by Bartc and Stephen
Sprunk. However, although disk speeds are vastly slower there are some
things to consider:

1. Once read from a disk into memory such a structure can be read and
written many times between CPU cache and CPU before being flushed back to
disk or dropped.

How many times is "many times?" If you can count as high as
"many" with an unsigned short, don't worry.

Also, if you don't need the field-at-a-time approach you use
one count them one conversion on input, and one count them one on
output (but only if you write). The time spent on one count them
one decode and one count them one encode is zero count it zero,
for all practical purposes.

2. The structure could well come from a ramdisk.

... which was loaded from ... Well, from where?

3. A FAT header is just a particular example. There are other memory layouts
that would need to be treated in the same way. For instance, the BIOS Data
Area which is present on every PC at boot time is a design from the same era
and has some simlilarly badly aligned elements (not that I need to access
the BDA).

If you did, how many instances of the BDA would you need to
access? Could you count as high as "many" with a one-bit bit-field?

Seriously: You are worrying about the unimportant stuff.
Ever hear of something called the "KISS Principle?" It applies.

James Harris · Aug 23, 2013

Stephen Sprunk said:
....

Off the cuff and probably buggy, but shows the general idea:

#if (defined(ALLOW_UNALIGNED) && defined(LITTLE_ENDIAN))
# define get_uint16_le(p) (*(uint16_t *)p)
#else
# define get_uint16_le(p) (*(uint8_t *)p | *((uint8_t *)p+1) << 8)
#endif
Thanks.

....

Are you referring to partial register stalls? AIUI, that only happens
when two partial-register values are combined into one full-register
value, which doesn't happen in this case.

Yes, it can happen when reading from memory or an immediate, too. Taking the
case of x86_32, to read, say, 4 bytes which are stored in two halves

movzx ax, [ebx + 2]
shl eax, 16
mov ax, [ebx]

Effects always depend on the model of CPU but that kind of thing can is
likely to be significantly slowed by the 16-bit merge at the end.

movzx eax, [ebx + 2]
shl eax, 16
movzx ecx, [ebx]
or eax, ecx

That should be faster but uses an extra register.

In any case, I would expect

mov eax, [ebx]

to be faster than either because as well as leaving the hardware to realign
what it gets from cache it avoid the dependencies of the other solutions.

FWIW, I remember someone commenting that x86 will realign. That's true but
for completeness it should be noted that an OS can turn on alignment
checking so that misaligned accesses are trapped. I don't know of any OS
which does that but it is supported by hardware.

For instance, this idiom doesn't trigger a stall:

uint8_t a, b, c, d;
uint32_t x = (a << 24) | (b << 16) | (c << 8) | (d);

Each byte is zero-extended to full width, shifted, and then or'd with
another full register. The zero-extension itself does not create a
stall because it depends on only _one_ partial register, not two.

Right. Zero extension when loading a register is as fast as copying (as is
sign extension).

It may be; it's worth testing. Even if the shift form is faster, the
loss in readability may outweigh it. Memory (not to mention disk) is so
slow these days, and OOO mechanisms are so capable, that low-level
details like this are probably irrelevant anyway.

Anything that doesn't happing in a loop is irrelevant, and in a loop we
should be running from cache. If a tight loop is going to memory we have
bigger problems than partial register stalls.

Sorry, I couldn't work out what you meant about OOO mechanisms.

James

Nobody · Aug 23, 2013

I could do something basic and slow but what makes this an interesting
challenge is that: 1) many of the fields *will* be naturally aligned so
don't need any special access mechanism and 2) some of the machines the
code will run on will support unaligned accessed and some will not. It
would be good to use the same C source to run on any type of machine and
get the compiler to emit the faster code where it can.

Anyone already addressed the same sort of issue? I have some ideas but is
there a standard approach?

What is the expected number of times each record will be accessed between
reading it and discarding it?

If it's much more than one, you'll want to convert the records to an
efficient layout as they're read. If it's less than one, you're better off
storing the raw data and providing accessor functions/macros.

In the latter case, you can declare "packed" structs corresponding to the
record layout, and provide macros which just access the fields directly.
That will work provided that the platform supports packed structures and
little-endian storage.

For platforms which are little-endian but the compiler doesn't support
packed structures, you can use e.g. *(uint16_t*)((char*)ptr+offset) for
aligned fields and memcpy() for unaligned fields.

For other platforms, you need to provide functions or macros to decode the
bytes.

Les Cargill · Aug 23, 2013

James said:
The data may have come from a disk initially but a read in this context is
not a read from disk but from memory so no library call or ioctl() would be
needed.

So you mean a *memory* read, not a disk read? My bad, then.

Les Cargill · Aug 23, 2013

Stephen said:
That may not solve the unaligned access issue, though.

I am rather desperately trying to figure out why that's a problem.

Even if it's a problem, assign the unaligned struct member to an
aligned copy and pass the pointer to that. You can even be lazy and do
"crash directed development"

to decide where this is needed.

You're just using the struct as an interface to the disk layout,
not anything else. It's for assigning from and assigning to,
nothing else.

FAT originated on x86, so all multibyte fields are little-endian.

Right. Which means there's a context in which GNU is used where that's a
problem.

Stephen Sprunk · Aug 23, 2013

Stephen Sprunk said:
Stephen Sprunk said:

Are you referring to partial register stalls? AIUI, that only
happens when two partial-register values are combined into one
full-register value, which doesn't happen in this case.

Click to expand...

Yes, it can happen when reading from memory or an immediate, too.
Taking the case of x86_32, to read, say, 4 bytes which are stored in
two halves

movzx ax, [ebx + 2]
shl eax, 16
mov ax, [ebx]

Well, that assumes the two 16-bit loads are aligned, which may not be
true; let's keep it simple and stick with a 16-bit unaligned load:

mov al, [ebx] ; low byte
mov ah, [ebx+1] ; high byte
; use AX

AL and AH will be different (renamed) registers, which will definitely
cause a partial-register stall because they have to be combined into a
third (renamed) register to satisfy a dependency on AX.

Effects always depend on the model of CPU but that kind of thing
can is likely to be significantly slowed by the 16-bit merge at
the end.

Exactly. AFAIK, partial-register stalls apply to most/all x86 CPUs
starting with the P6 generation, and in roughly the same way. There may
be minor variations, but not enough to worry about.

movzx eax, [ebx + 2]
shl eax, 16
movzx ecx, [ebx]
or eax, ecx

That should be faster but uses an extra register.

Again, simplified to the 16-bit version:

movzx cx, [ebx+1] ; high byte
movzx ax, [ebx] ; low byte
shl cx, 8
or ax, cx

Yes, this uses another register, but it avoids a partial register stall
because partial-width temporaries never need to be combined to satisfy a
full-width dependency. That means, all else being equal, it should
execute faster than the above version.

This is also a direct translation of the C idiom, and that's not
coincidence: the P6's designers optimized for typical output of the
day's (rather dumb) compilers, and that has become a vicious cycle.

In any case, I would expect

mov eax, [ebx]

to be faster than either because as well as leaving the hardware to
realign what it gets from cache it avoid the dependencies of the
other solutions.

I'm not sure where an unaligned load is actually fixed up, or what
gyrations the hardware has to go through to do so. For all I know, the
decoder may spit out uops equivalent to the second example above, but
without the register pressure.

Anything that doesn't happing in a loop is irrelevant,

True enough. It is rarely worth optimizing _anything_ that isn't inside
a loop. Compiler loop optimizations have gotten so advanced that it's
debatable whether any meaningful difference can be made at the C level,
short of changing algorithms.

and in a loop we should be running from cache. If a tight loop is
going to memory we have bigger problems than partial register
stalls.

That should be true of most cases, at least after the first few
iterations--and those should disappear in the noise.

The canonical counter-example is a database server, where each loop
iteration will chase long chains of pointers throughout memory, with no
predictable (and therefore prefetchable) pattern. They're so
badly-behaved that not only is your next pointer not likely to be in
cache, even the _page table_ for that pointer isn't likely to be cached
(in the TLB) either. Don't even think about swapping. They can easily
spend 99%+ of cycles stalled, waiting for RAM to catch up.

Sorry, I couldn't work out what you meant about OOO mechanisms.

The entire purpose of OOO is to hide memory latency by effectively
hoisting loads (and, to a lesser extent, delaying stores) to run in
parallel with other operations. If you want to see where OOO fails,
which means your shiny new multi-GHz toy reverts to 1980s performance
levels, look for loads that _can't_ be hoisted--such as in the database
server example above. Luckily, such cases are extremely rare.

I'm not sure if current OOO implementations can hide the penalty for an
unaligned load. Probably.

S

James Harris · Aug 23, 2013

Eric Sosman said:
On 8/23/2013 8:03 AM, James Harris wrote:
....

How many times is "many times?" If you can count as high as
"many" with an unsigned short, don't worry.

Also, if you don't need the field-at-a-time approach you use
one count them one conversion on input, and one count them one on
output (but only if you write). The time spent on one count them
one decode and one count them one encode is zero count it zero,
for all practical purposes.

From this and other comments you have made I think you may be thinking of
these awkward fields as being external to the running system - i.e. read it
once and then forget about the original. That is true of some. But others
may form elements of the running system. At least if there's a chance of a
user program accessing any one of these fields we've been talking about I
cannot just maintain my own copy until the whole thing is written back to
disk. If I did, user programs would see inconsistent values.

I suppose I could maintain a separate copy and try to come up with some way
to detect whether something else is accessing the original. If it is then
switch into a different mode where I maintain both at the same time. But
that's much too complicated. It seems far simpler to maintain the original
structure all the time.

In fact I might be able to work with copies in some cases and I should
consider doing so. However, I'd rather start from knowing I have a good way
to deal with the most akward cases. Once those problems have been solved the
rest gets easier. If I were to make assumptions at an early stage that I
could align all fields it would then be harder later to work with those that
could not be copied and aligned.

... which was loaded from ... Well, from where?

Well, OK. That would probably come from disk.

If you did, how many instances of the BDA would you need to
access? Could you count as high as "many" with a one-bit bit-field?

As I mentioned, I don't need to access the BDA explicitly. It was just an
example of awkward structures from that era. Here's another: as86 object
files - which I do intend to access - have a header that seems to begin with
integers of these byte sizes: 4, 1, 4, 4, 2, 2, 4, 4. Yes, really! That
means that aside from the first field the following multibyte fields are
misaligned. It's madness but that's how things used to be done.

Seriously: You are worrying about the unimportant stuff.
Ever hear of something called the "KISS Principle?" It applies.

I'm all for keeping things simple but maintaining two copies is not simple.

On the point of starting off with a complex issue, as mentioned above it
surely makes sense to find an effective way to deal with the most complex
cases at the start of a process. Then the cases that come along later are
either already solved or are easier to deal with. I cannot see why anyone
would want to solve just the simpler cases then write a bunch of code and
only later find that there are some complex cases that need to be
shoe-horned in. IME that approach leads to kludges and code that becomes
hard - even scary! - to maintain.

James

James Harris · Aug 23, 2013

.... (snipped many points of agreement)

I'm not sure where an unaligned load is actually fixed up, or what
gyrations the hardware has to go through to do so. For all I know, the
decoder may spit out uops equivalent to the second example above, but
without the register pressure.

I'm not sure either but I believe that the caches maintain the natural
alignment. I also suspect that x86 machines, at least, are used to having to
realign data being read from the nearest cache.

....

That should be true of most cases, at least after the first few
iterations--and those should disappear in the noise.

I was thinking more of each iteration reading too much data for the nearer
caches to hold but there could be other causes such as you mention below.

The canonical counter-example is a database server, where each loop
iteration will chase long chains of pointers throughout memory, with no
predictable (and therefore prefetchable) pattern. They're so
badly-behaved that not only is your next pointer not likely to be in
cache, even the _page table_ for that pointer isn't likely to be cached
(in the TLB) either. Don't even think about swapping. They can easily
spend 99%+ of cycles stalled, waiting for RAM to catch up.

The entire purpose of OOO is to hide memory latency by effectively
hoisting loads (and, to a lesser extent, delaying stores) to run in
parallel with other operations. If you want to see where OOO fails,
which means your shiny new multi-GHz toy reverts to 1980s performance
levels, look for loads that _can't_ be hoisted--such as in the database
server example above. Luckily, such cases are extremely rare.

Haha - I saw OOO and for some reason thought you were talking about OO!

On the topic of working in harmony with the caching mechanisms you might be
interested in this case in point.

James

Jorgen Grahn · Aug 23, 2013

This is good. I had thought about extracting the fields individually to a
stuct but by mentioning each one explicitly. I hadn't thought about stepping
a pointer along them. As long as the get16_le operations were fast I don't
think your method would be slow per se.

I should have explained that get16_le() would be an inline function,
doing something like p[0] + 256 * p[1]. Not what you'd traditionally
do if you wanted fast code, but (a) it *always works* and (b) I
suspect modern CPUs with caches makes it hard to construct a case
where it makes a difference.

I might use that type of code in
some cases. Objections to the code, though:

* It effectively uses magic numbers for the field offsets. Is that good
practice?

Sure!

I used to care, but I find in practice it's not a problem. Instead of
encoding knowledge about the data format in a struct or a series of
#defines with obscure names, you encode it into a conversion function.
Or two, if you need to write it too.

If this code was spread and duplicated, then I would find it bad
practice. Concentrated to two functions -- acceptable.

* In some cases not all fields would need to be converted. Details follow.

That touches another reason I do it like that. When I work with
binary data it tends not to be 100% "struct-like". E.g. an IPv4
header with possible IP options. Also, I can rarely be sure I get
valid data: I must not run past the end of the input buffer etc.

I find it easier to deal with such things if I don't map too many
abstractions onto my parsing code. (The get16_le is bad enough: it's
not defined if I'm at the last octet of my message.)

As an example of the latter, and to correct the layout I had before which
had the odd and even alignments the wrong way round, here is the full struct
for the structure in question. I have simplified the field names as the
standard ones were just too ucky!
....
struct bootsect {
/* 0 */ ui8 jump[3], ....
};
typedef struct bootsect bootsect_t;

As you can see, in this case and as long as I have got the layout right,
there should be only four fields which would have to be split due to
alignment issues. The other fields should be accessible directly and so not
need to be converted first.

Assuming your input buffer is aligned, yes. I don't know -- it just
makes me nervous to deal with raw data as structs, and since I don't
have to do it I prefer not to.

IMHO It's one of those things which seem like a good idea, but aren't.

/Jorgen

Stephen Sprunk · Aug 23, 2013

I am rather desperately trying to figure out why that's a problem.

Some CPUs (e.g. most RISCs) don't allow unaligned accesses at all; if
you try it, the program crashes with a bus error.

Systems that merely have slow unaligned accesses aren't be worth
worrying about, unless you are doing it in a tight loop.

Even if it's a problem, assign the unaligned struct member to an
aligned copy and pass the pointer to that.

That won't work, for the same reason that direct access to the unaligned
member won't work. At some point, you need code that deals with the
unaligned accesses.

If you can afford an unpacked shadow copy, you could simply memcpy()
individual members between the two, but that assumes the original copy
won't change out from under you (is this a live disk?) _and_ there are
enough shadow copy accesses to make it worth the effort.

If you're doing an occasional access now and then to a live disk
structure, you're better off with accessor functions that realign the
data on the fly--one per field.

You can even be lazy and do "crash directed development" to
decide where this is needed.

Provided you have a system that actually does crash--and unit tests that
access every single field.

S

James Kuyper · Aug 23, 2013

On 08/23/2013 03:20 PM, James Harris wrote:
....

From this and other comments you have made I think you may be thinking of
these awkward fields as being external to the running system - i.e. read it
once and then forget about the original. That is true of some. But others
may form elements of the running system. At least if there's a chance of a
user program accessing any one of these fields we've been talking about I
cannot just maintain my own copy until the whole thing is written back to
disk. If I did, user programs would see inconsistent values.

Reading the FAT and updating it should always be hidden from the
ordinary user by system routines of some kind, and those routines must
be carefully designed to coordinate with each other to avoid the
potential problems you're referring to. If you're writing some of that
system code, you need to think a lot more carefully about the design
than you have currently seem to have done. If you're not writing such
code, you should not be able to modify the real FAT, all you should be
able to modify is a copy of it, so the problem you're worried about
shouldn't be possible.

glen herrmannsfeldt · Aug 23, 2013

Stephen Sprunk said:
On 23-Aug-13 09:31, James Harris wrote:
(snip)

Yes, it can happen when reading from memory or an immediate, too.
Taking the case of x86_32, to read, say, 4 bytes which are stored in
two halves
movzx ax, [ebx + 2]
shl eax, 16
mov ax, [ebx]

Click to expand...

Well, that assumes the two 16-bit loads are aligned, which may not be
true; let's keep it simple and stick with a 16-bit unaligned load:

mov al, [ebx] ; low byte
mov ah, [ebx+1] ; high byte
; use AX

AL and AH will be different (renamed) registers, which will definitely
cause a partial-register stall because they have to be combined into a
third (renamed) register to satisfy a dependency on AX.

Exactly. AFAIK, partial-register stalls apply to most/all x86 CPUs
starting with the P6 generation, and in roughly the same way.
There may be minor variations, but not enough to worry about.

I first remember hearing about this when the Pentium-Pro came out.

Intel was expecting people to transition to 32 bit systems,
including OS and applications. I used to run OS/2 2.x on mine,
but many were still running DOS, or maybe Win3.1.

As I remember, 16 bit software ran slower in the PPro than on
the Pentium. For the P2, as I remember, they fixed it so it wasn't
so much of a problem.

-- glen

BartC · Aug 23, 2013

2. The structure could well come from a ramdisk.

Even with ramdisk, there will be extra overheads accessing such data,
compared with normal memory, through having to use a file system, looking up
filenames, locating the sectors, and, usually, transferring the data of each
sector or cluster. In addition, the application might be doing its own
accesses to the final data.

So the overheads of unpacking a few bytes from the FAT should in most cases
be insignificant.

3. A FAT header is just a particular example. There are other memory
layouts that would need to be treated in the same way. For instance, the
BIOS Data Area which is present on every PC at boot time is a design from
the same era and has some simlilarly badly aligned elements (not that I
need to access the BDA).

That sounds like something else that doesn't need to be accessed that often.
But since it is specific to a PC (and hence x86) the misalignments can be
tolerated.

Edward A. Falk · Aug 23, 2013

A file allocation table is an on-disk structure that was designed in the
days before people took much care with field alignment. FAT file systems are
still prevalent today. I need to write some code to work with such
structures so have to deal with fields that will not be aligned.

To illustrate, the FAT header starts like this:

8-byte BS_OEMName
2-byte BPB_BytsPerSec
1-byte BPB_SecPerClus
2-byte BPB_RsvdSecCnt
...

As you can see, it is impossible to align the last of those fields,
BPB_RsvdSecCnt. The same is true of some of the fields that follow. (For
anyone who is interested the full list of fields can be seen in a document
called fatgen103. Copies are freely available on the internet.)

Frankly, in the interest of portability and robustness, I write my code
like this:

uint8_t ibuf[13];
char oemName[8];
uint16_t bytsPerSec;
uint8_t secPerClus;
uint16_t rsvdSecCnt;

read(fatfd, ibuf, sizeof(ibuf));
memcpy(oemName, ibuf, 8);
memcpy(&bytesPerSec, ibuf+8, 2);
secPerClus = ibuf[10];
memcpy(&rsvdSecCnt, ibuf+11, 2);

Error-checking has been omitted for clarity. There are other things
I would do with defined constants, etc. to make this code more
robust; I'm just keeping it simple and readable at the moment.

If the underlying architecture has a bulk move instruction equivalent
to memcpy, any decent compiler would substitute it for the memcpy
call.

If you *really* need to optimize the performance beyond this, then
you'll probably have to resort to non-portable patterns.

One more point: we haven't addressed the byte-ordering of the
underlying system relative to the data on disk, which I presume
is little-endian. So in fact, my code would *actually* look like:

static inline uint16_t
le2uint16(uint8_t *in)
{
return in[0] + (int[1]<<8);
}

...

memcpy(oemName, ibuf, 8);
bytesPerSec = le2uint16(ibuf+8);
secPerClus = ibuf[10];
rsvdSecCnt = le2uint16(ibuf+11);

Again, a good compiler would possibly optimize le2uint16() in a way
that accounts for the native byte order of the processor, and it's
ability to handle unaligned data.

Les Cargill · Aug 23, 2013

Stephen said:
Some CPUs (e.g. most RISCs) don't allow unaligned accesses at all; if
you try it, the program crashes with a bus error.

Perhaps I have lived a charmed life, then.
I hate to be that guy, but can I have a concrete example?

please remember that even primitive 'C' objects are an abstraction,
and that development tools can take some measure of pain to make
them more comfortable for us.

I have done this on ARM{7,9,10}, Coldfire, Intel from 16
through 64 bit, some 32 bit AVR or other and several MIPS ,
and it worked just fine.

I do not recall whether the compiler masked and shifted for you, but I
expect it did, if any of those had this difficulty.

I would be somewhat disappointed in a GNU implementation that allowed
that pragma but the code crashed because of it. It is part of the GNU
standard. I could see it #error-ing out if it could not be or was not
supported. But crashing would be very disappointing.

I have never seen a case where I could not develop a solution
that used bit fields and packed structs and demonstrate that to other
team members.

So while I accept that you're not alone in this claim, I'd
have to see an actual example to believe it. And because this is a
much more civilized thing than legions of shifts and masks littering
actual code ( even as macros)

, I'd start there anyway.

Systems that merely have slow unaligned accesses aren't be worth
worrying about, unless you are doing it in a tight loop.

Right.

That won't work, for the same reason that direct access to the unaligned
member won't work.

Then I would expect a compiler diagnostic from the "#pragma
pack(push,1)" And see below how I tested this.

At some point, you need code that deals with the
unaligned accesses.

I'd rather get that from the toolset if I can. And - surprise - it's
always worked out.

If you can afford an unpacked shadow copy, you could simply memcpy()
individual members between the two, but that assumes the original copy
won't change out from under you (is this a live disk?) _and_ there are
enough shadow copy accesses to make it worth the effort.

Right. You'd only do it when you needed to.

If you're doing an occasional access now and then to a live disk
structure, you're better off with accessor functions that realign the
data on the fly--one per field.

I can't disagree too much there. I've done that, too. If that were the
case, I would then lean on something akin to C++ or make it all table
driven.

I have also used scripting languages to generate test code and header
files to manage this.

Provided you have a system that actually does crash--and unit tests that
access every single field.

This is not difficult. memmove() a memory pattern to an instance of the
struct, then printf) every value in the struct.

I'd even capture the output, and diff that with the last
output. This was executed from the makefile. I don't trust
this, either - it's just worked out for me.

Ian Collins · Aug 24, 2013

Les said:
Perhaps I have lived a charmed life, then.
I hate to be that guy, but can I have a concrete example?

SPARC.

I had to get some 68K code that relied on packed structs to run on a Sun
box and it wasn't a trivial task. The Sun compiler did have a
misaligned option (which uses traps to call functions that use multiple
reads, shits and masks) , but using cased the code to run like a three
legged dog with a sore foot.

Les Cargill · Aug 24, 2013

Ian said:
SPARC.

Okay. I feel your pain, bro. It's over, man. It's over.

I had to get some 68K code that relied on packed structs to run on a Sun
box and it wasn't a trivial task. The Sun compiler did have a
misaligned option (which uses traps to call functions that use multiple
reads, shits and masks) , but using cased the code to run like a three
legged dog with a sore foot.

Now never let us speak of this again

unaligned pointer access	28	Sep 11, 2013
Unaligned access	15	May 5, 2010
Struct Member Variables Problem	0	Jun 21, 2023
Struct Member Variable Problems	1	Jun 21, 2023
Why struct not globally changed in function?	1	Aug 22, 2023
Initializing a malloc'ed struct whose fields are const with run-timevalues	59	Jan 22, 2013
Unaligned pointers question	7	Oct 22, 2005
Alignment in Struct	7	Feb 26, 2014

Struct with unaligned fields

James Harris

James Harris

Eric Sosman

Eric Sosman

James Harris

Nobody

Les Cargill

Les Cargill

Stephen Sprunk

James Harris

James Harris

Jorgen Grahn

Stephen Sprunk

James Kuyper

glen herrmannsfeldt

BartC

Edward A. Falk

Les Cargill

Ian Collins

Les Cargill

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads