Struct with unaligned fields

Discussion in 'C Programming' started by James Harris, Aug 22, 2013.

  1. James Harris

    James Harris Guest

    I knew it was an old 'chestnut'! Hopefully the constraints I included made
    it less boring than just asking how to align fields.
    This is good. I had thought about extracting the fields individually to a
    stuct but by mentioning each one explicitly. I hadn't thought about stepping
    a pointer along them. As long as the get16_le operations were fast I don't
    think your method would be slow per se. I might use that type of code in
    some cases. Objections to the code, though:

    * It effectively uses magic numbers for the field offsets. Is that good
    practice?

    * In some cases not all fields would need to be converted. Details follow.

    As an example of the latter, and to correct the layout I had before which
    had the odd and even alignments the wrong way round, here is the full struct
    for the structure in question. I have simplified the field names as the
    standard ones were just too ucky!

    /*
    * Define the FAT geometry table at the start of a volume
    *
    * The numbers preceding each field give the byte offsets of that field.
    * Fields which cannot be aligned are split into arrays of octets.
    */

    struct bootsect {
    /* 0 */ ui8 jump[3],
    /* 3 */ ui8 oem[8],
    /* 11 */ ui8 secsize[2], /* Split due to alignment */
    /* 13 */ ui8 sec_per_clus,
    /* 14 */ ui16 sec_rsvd,
    /* 16 */ ui8 fats,
    /* 17 */ ui8 rootents[2], /* Split due to alignment */
    /* 19 */ ui8 totsec16[2], /* Split due to alignment */
    /* 21 */ ui8 medium_type,
    /* 22 */ ui16 fatsize16,
    /* 24 */ ui16 sec_per_trk,
    /* 26 */ ui16 heads,
    /* 28 */ ui32 hid_sec,
    /* 32 */ ui32 totsec32,
    /* 36 */ ui8 drv_num,
    /* 37 */ ui8 rsvd1,
    /* 38 */ ui8 bootsig,
    /* 39 */ ui8 volid[4], /* Split due to alignment */
    /* 43 */ ui8 vollab[11],
    /* 54 */ ui8 fstype[8]
    /* 63 */
    };
    typedef struct bootsect bootsect_t;

    As you can see, in this case and as long as I have got the layout right,
    there should be only four fields which would have to be split due to
    alignment issues. The other fields should be accessible directly and so not
    need to be converted first.

    James
     
    James Harris, Aug 23, 2013
    #21
    1. Advertisements

  2. James Harris

    James Harris Guest

    OK.

    I take it in this case that there is one encode and one decode function per
    data type rather than one such pair per field?
    You mean something like these?

    ui16 secsize_get(bootsect_t *bs) {
    return bs->secsize[1] << 8 | bs->secsize[0];
    }

    void secsize_set(bootsect_t *bs, ui16 value) {
    bs->secsize[0] = value & 0xff;
    bs->secsize[1] = (value >> 8) & 0xff;
    }


    Partly agreed, and I see the same point was made by Bartc and Stephen
    Sprunk. However, although disk speeds are vastly slower there are some
    things to consider:

    1. Once read from a disk into memory such a structure can be read and
    written many times between CPU cache and CPU before being flushed back to
    disk or dropped.

    2. The structure could well come from a ramdisk.

    3. A FAT header is just a particular example. There are other memory layouts
    that would need to be treated in the same way. For instance, the BIOS Data
    Area which is present on every PC at boot time is a design from the same era
    and has some simlilarly badly aligned elements (not that I need to access
    the BDA).

    James
     
    James Harris, Aug 23, 2013
    #22
    1. Advertisements

  3. James Harris

    Eric Sosman Guest

    There remains a possibility that the struct could contain
    superfluous padding, but that's mostly a theoretical concern.

    My question is: What good is this struct, anyhow? Many of
    its elements can't be manipulated directly, either because
    they've been split into arrays, or because their endianness
    might be wrong, or both. What does this struct buy you?

    IMHO you'd be far better off with a "native" struct and
    a pair of encoder/decoder functions. *Far* better.
     
    Eric Sosman, Aug 23, 2013
    #23
  4. James Harris

    Eric Sosman Guest

    Something like

    void decodeFat(struct nativeFatStruct *out,
    const unsigned char *in);
    void encodeFat(unsigned char *out,
    const struct nativeFatStruct *in);

    .... and similarly for decodeThin()/encodeThin(), etc.
    Right. If the "secsize" field overlaps (or partially
    overlaps) the "spasmcount" field, this approach gets the right
    result with a "this_set" is followed by a "that_get".
    How many times is "many times?" If you can count as high as
    "many" with an unsigned short, don't worry.

    Also, if you don't need the field-at-a-time approach you use
    one count them one conversion on input, and one count them one on
    output (but only if you write). The time spent on one count them
    one decode and one count them one encode is zero count it zero,
    for all practical purposes.
    ... which was loaded from ... Well, from where?
    If you did, how many instances of the BDA would you need to
    access? Could you count as high as "many" with a one-bit bit-field?

    Seriously: You are worrying about the unimportant stuff.
    Ever hear of something called the "KISS Principle?" It applies.
     
    Eric Sosman, Aug 23, 2013
    #24
  5. James Harris

    James Harris Guest

    Yes, it can happen when reading from memory or an immediate, too. Taking the
    case of x86_32, to read, say, 4 bytes which are stored in two halves

    movzx ax, [ebx + 2]
    shl eax, 16
    mov ax, [ebx]

    Effects always depend on the model of CPU but that kind of thing can is
    likely to be significantly slowed by the 16-bit merge at the end.

    movzx eax, [ebx + 2]
    shl eax, 16
    movzx ecx, [ebx]
    or eax, ecx

    That should be faster but uses an extra register.

    In any case, I would expect

    mov eax, [ebx]

    to be faster than either because as well as leaving the hardware to realign
    what it gets from cache it avoid the dependencies of the other solutions.

    FWIW, I remember someone commenting that x86 will realign. That's true but
    for completeness it should be noted that an OS can turn on alignment
    checking so that misaligned accesses are trapped. I don't know of any OS
    which does that but it is supported by hardware.
    Right. Zero extension when loading a register is as fast as copying (as is
    sign extension).
    Anything that doesn't happing in a loop is irrelevant, and in a loop we
    should be running from cache. If a tight loop is going to memory we have
    bigger problems than partial register stalls.

    Sorry, I couldn't work out what you meant about OOO mechanisms.

    James
     
    James Harris, Aug 23, 2013
    #25
  6. James Harris

    Nobody Guest

    What is the expected number of times each record will be accessed between
    reading it and discarding it?

    If it's much more than one, you'll want to convert the records to an
    efficient layout as they're read. If it's less than one, you're better off
    storing the raw data and providing accessor functions/macros.

    In the latter case, you can declare "packed" structs corresponding to the
    record layout, and provide macros which just access the fields directly.
    That will work provided that the platform supports packed structures and
    little-endian storage.

    For platforms which are little-endian but the compiler doesn't support
    packed structures, you can use e.g. *(uint16_t*)((char*)ptr+offset) for
    aligned fields and memcpy() for unaligned fields.

    For other platforms, you need to provide functions or macros to decode the
    bytes.
     
    Nobody, Aug 23, 2013
    #26
  7. James Harris

    Les Cargill Guest

    So you mean a *memory* read, not a disk read? My bad, then.
     
    Les Cargill, Aug 23, 2013
    #27
  8. James Harris

    Les Cargill Guest

    I am rather desperately trying to figure out why that's a problem. :)
    Even if it's a problem, assign the unaligned struct member to an
    aligned copy and pass the pointer to that. You can even be lazy and do
    "crash directed development" :) to decide where this is needed.

    You're just using the struct as an interface to the disk layout,
    not anything else. It's for assigning from and assigning to,
    nothing else.
    Right. Which means there's a context in which GNU is used where that's a
    problem.
     
    Les Cargill, Aug 23, 2013
    #28
  9. Well, that assumes the two 16-bit loads are aligned, which may not be
    true; let's keep it simple and stick with a 16-bit unaligned load:

    mov al, [ebx] ; low byte
    mov ah, [ebx+1] ; high byte
    ; use AX

    AL and AH will be different (renamed) registers, which will definitely
    cause a partial-register stall because they have to be combined into a
    third (renamed) register to satisfy a dependency on AX.
    Exactly. AFAIK, partial-register stalls apply to most/all x86 CPUs
    starting with the P6 generation, and in roughly the same way. There may
    be minor variations, but not enough to worry about.
    Again, simplified to the 16-bit version:

    movzx cx, [ebx+1] ; high byte
    movzx ax, [ebx] ; low byte
    shl cx, 8
    or ax, cx

    Yes, this uses another register, but it avoids a partial register stall
    because partial-width temporaries never need to be combined to satisfy a
    full-width dependency. That means, all else being equal, it should
    execute faster than the above version.

    This is also a direct translation of the C idiom, and that's not
    coincidence: the P6's designers optimized for typical output of the
    day's (rather dumb) compilers, and that has become a vicious cycle.
    I'm not sure where an unaligned load is actually fixed up, or what
    gyrations the hardware has to go through to do so. For all I know, the
    decoder may spit out uops equivalent to the second example above, but
    without the register pressure.
    True enough. It is rarely worth optimizing _anything_ that isn't inside
    a loop. Compiler loop optimizations have gotten so advanced that it's
    debatable whether any meaningful difference can be made at the C level,
    short of changing algorithms.
    That should be true of most cases, at least after the first few
    iterations--and those should disappear in the noise.

    The canonical counter-example is a database server, where each loop
    iteration will chase long chains of pointers throughout memory, with no
    predictable (and therefore prefetchable) pattern. They're so
    badly-behaved that not only is your next pointer not likely to be in
    cache, even the _page table_ for that pointer isn't likely to be cached
    (in the TLB) either. Don't even think about swapping. They can easily
    spend 99%+ of cycles stalled, waiting for RAM to catch up.
    The entire purpose of OOO is to hide memory latency by effectively
    hoisting loads (and, to a lesser extent, delaying stores) to run in
    parallel with other operations. If you want to see where OOO fails,
    which means your shiny new multi-GHz toy reverts to 1980s performance
    levels, look for loads that _can't_ be hoisted--such as in the database
    server example above. Luckily, such cases are extremely rare.

    I'm not sure if current OOO implementations can hide the penalty for an
    unaligned load. Probably.

    S
     
    Stephen Sprunk, Aug 23, 2013
    #29
  10. James Harris

    James Harris Guest

    From this and other comments you have made I think you may be thinking of
    these awkward fields as being external to the running system - i.e. read it
    once and then forget about the original. That is true of some. But others
    may form elements of the running system. At least if there's a chance of a
    user program accessing any one of these fields we've been talking about I
    cannot just maintain my own copy until the whole thing is written back to
    disk. If I did, user programs would see inconsistent values.

    I suppose I could maintain a separate copy and try to come up with some way
    to detect whether something else is accessing the original. If it is then
    switch into a different mode where I maintain both at the same time. But
    that's much too complicated. It seems far simpler to maintain the original
    structure all the time.

    In fact I might be able to work with copies in some cases and I should
    consider doing so. However, I'd rather start from knowing I have a good way
    to deal with the most akward cases. Once those problems have been solved the
    rest gets easier. If I were to make assumptions at an early stage that I
    could align all fields it would then be harder later to work with those that
    could not be copied and aligned.
    Well, OK. That would probably come from disk.
    As I mentioned, I don't need to access the BDA explicitly. It was just an
    example of awkward structures from that era. Here's another: as86 object
    files - which I do intend to access - have a header that seems to begin with
    integers of these byte sizes: 4, 1, 4, 4, 2, 2, 4, 4. Yes, really! That
    means that aside from the first field the following multibyte fields are
    misaligned. It's madness but that's how things used to be done.
    I'm all for keeping things simple but maintaining two copies is not simple.

    On the point of starting off with a complex issue, as mentioned above it
    surely makes sense to find an effective way to deal with the most complex
    cases at the start of a process. Then the cases that come along later are
    either already solved or are easier to deal with. I cannot see why anyone
    would want to solve just the simpler cases then write a bunch of code and
    only later find that there are some complex cases that need to be
    shoe-horned in. IME that approach leads to kludges and code that becomes
    hard - even scary! - to maintain.

    James
     
    James Harris, Aug 23, 2013
    #30
  11. James Harris

    James Harris Guest

    .... (snipped many points of agreement)
    I'm not sure either but I believe that the caches maintain the natural
    alignment. I also suspect that x86 machines, at least, are used to having to
    realign data being read from the nearest cache.

    ....
    I was thinking more of each iteration reading too much data for the nearer
    caches to hold but there could be other causes such as you mention below.
    Haha - I saw OOO and for some reason thought you were talking about OO!

    On the topic of working in harmony with the caching mechanisms you might be
    interested in this case in point.



    James
     
    James Harris, Aug 23, 2013
    #31
  12. James Harris

    Jorgen Grahn Guest

    I should have explained that get16_le() would be an inline function,
    doing something like p[0] + 256 * p[1]. Not what you'd traditionally
    do if you wanted fast code, but (a) it *always works* and (b) I
    suspect modern CPUs with caches makes it hard to construct a case
    where it makes a difference.
    Sure! :)

    I used to care, but I find in practice it's not a problem. Instead of
    encoding knowledge about the data format in a struct or a series of
    #defines with obscure names, you encode it into a conversion function.
    Or two, if you need to write it too.

    If this code was spread and duplicated, then I would find it bad
    practice. Concentrated to two functions -- acceptable.
    That touches another reason I do it like that. When I work with
    binary data it tends not to be 100% "struct-like". E.g. an IPv4
    header with possible IP options. Also, I can rarely be sure I get
    valid data: I must not run past the end of the input buffer etc.

    I find it easier to deal with such things if I don't map too many
    abstractions onto my parsing code. (The get16_le is bad enough: it's
    not defined if I'm at the last octet of my message.)
    Assuming your input buffer is aligned, yes. I don't know -- it just
    makes me nervous to deal with raw data as structs, and since I don't
    have to do it I prefer not to.

    IMHO It's one of those things which seem like a good idea, but aren't.

    /Jorgen
     
    Jorgen Grahn, Aug 23, 2013
    #32
  13. Some CPUs (e.g. most RISCs) don't allow unaligned accesses at all; if
    you try it, the program crashes with a bus error.

    Systems that merely have slow unaligned accesses aren't be worth
    worrying about, unless you are doing it in a tight loop.
    That won't work, for the same reason that direct access to the unaligned
    member won't work. At some point, you need code that deals with the
    unaligned accesses.

    If you can afford an unpacked shadow copy, you could simply memcpy()
    individual members between the two, but that assumes the original copy
    won't change out from under you (is this a live disk?) _and_ there are
    enough shadow copy accesses to make it worth the effort.

    If you're doing an occasional access now and then to a live disk
    structure, you're better off with accessor functions that realign the
    data on the fly--one per field.
    Provided you have a system that actually does crash--and unit tests that
    access every single field.

    S
     
    Stephen Sprunk, Aug 23, 2013
    #33
  14. James Harris

    James Kuyper Guest

    On 08/23/2013 03:20 PM, James Harris wrote:
    ....
    Reading the FAT and updating it should always be hidden from the
    ordinary user by system routines of some kind, and those routines must
    be carefully designed to coordinate with each other to avoid the
    potential problems you're referring to. If you're writing some of that
    system code, you need to think a lot more carefully about the design
    than you have currently seem to have done. If you're not writing such
    code, you should not be able to modify the real FAT, all you should be
    able to modify is a copy of it, so the problem you're worried about
    shouldn't be possible.
     
    James Kuyper, Aug 23, 2013
    #34
  15. I first remember hearing about this when the Pentium-Pro came out.

    Intel was expecting people to transition to 32 bit systems,
    including OS and applications. I used to run OS/2 2.x on mine,
    but many were still running DOS, or maybe Win3.1.

    As I remember, 16 bit software ran slower in the PPro than on
    the Pentium. For the P2, as I remember, they fixed it so it wasn't
    so much of a problem.

    -- glen
     
    glen herrmannsfeldt, Aug 23, 2013
    #35
  16. James Harris

    BartC Guest

    Even with ramdisk, there will be extra overheads accessing such data,
    compared with normal memory, through having to use a file system, looking up
    filenames, locating the sectors, and, usually, transferring the data of each
    sector or cluster. In addition, the application might be doing its own
    accesses to the final data.

    So the overheads of unpacking a few bytes from the FAT should in most cases
    be insignificant.
    That sounds like something else that doesn't need to be accessed that often.
    But since it is specific to a PC (and hence x86) the misalignments can be
    tolerated.
     
    BartC, Aug 23, 2013
    #36
  17. Frankly, in the interest of portability and robustness, I write my code
    like this:

    uint8_t ibuf[13];
    char oemName[8];
    uint16_t bytsPerSec;
    uint8_t secPerClus;
    uint16_t rsvdSecCnt;

    read(fatfd, ibuf, sizeof(ibuf));
    memcpy(oemName, ibuf, 8);
    memcpy(&bytesPerSec, ibuf+8, 2);
    secPerClus = ibuf[10];
    memcpy(&rsvdSecCnt, ibuf+11, 2);

    Error-checking has been omitted for clarity. There are other things
    I would do with defined constants, etc. to make this code more
    robust; I'm just keeping it simple and readable at the moment.

    If the underlying architecture has a bulk move instruction equivalent
    to memcpy, any decent compiler would substitute it for the memcpy
    call.

    If you *really* need to optimize the performance beyond this, then
    you'll probably have to resort to non-portable patterns.



    One more point: we haven't addressed the byte-ordering of the
    underlying system relative to the data on disk, which I presume
    is little-endian. So in fact, my code would *actually* look like:

    static inline uint16_t
    le2uint16(uint8_t *in)
    {
    return in[0] + (int[1]<<8);
    }

    ...

    memcpy(oemName, ibuf, 8);
    bytesPerSec = le2uint16(ibuf+8);
    secPerClus = ibuf[10];
    rsvdSecCnt = le2uint16(ibuf+11);

    Again, a good compiler would possibly optimize le2uint16() in a way
    that accounts for the native byte order of the processor, and it's
    ability to handle unaligned data.
     
    Edward A. Falk, Aug 24, 2013
    #37
  18. James Harris

    Les Cargill Guest

    Perhaps I have lived a charmed life, then.
    I hate to be that guy, but can I have a concrete example?

    please remember that even primitive 'C' objects are an abstraction,
    and that development tools can take some measure of pain to make
    them more comfortable for us.

    I have done this on ARM{7,9,10}, Coldfire, Intel from 16
    through 64 bit, some 32 bit AVR or other and several MIPS ,
    and it worked just fine.

    I do not recall whether the compiler masked and shifted for you, but I
    expect it did, if any of those had this difficulty.

    I would be somewhat disappointed in a GNU implementation that allowed
    that pragma but the code crashed because of it. It is part of the GNU
    standard. I could see it #error-ing out if it could not be or was not
    supported. But crashing would be very disappointing.

    I have never seen a case where I could not develop a solution
    that used bit fields and packed structs and demonstrate that to other
    team members.

    So while I accept that you're not alone in this claim, I'd
    have to see an actual example to believe it. And because this is a
    much more civilized thing than legions of shifts and masks littering
    actual code ( even as macros) :) , I'd start there anyway.

    Then I would expect a compiler diagnostic from the "#pragma
    pack(push,1)" And see below how I tested this.
    I'd rather get that from the toolset if I can. And - surprise - it's
    always worked out.
    Right. You'd only do it when you needed to.
    I can't disagree too much there. I've done that, too. If that were the
    case, I would then lean on something akin to C++ or make it all table
    driven.

    I have also used scripting languages to generate test code and header
    files to manage this.
    This is not difficult. memmove() a memory pattern to an instance of the
    struct, then printf) every value in the struct.


    I'd even capture the output, and diff that with the last
    output. This was executed from the makefile. I don't trust
    this, either - it's just worked out for me.
     
    Les Cargill, Aug 24, 2013
    #38
  19. James Harris

    Ian Collins Guest

    SPARC.

    I had to get some 68K code that relied on packed structs to run on a Sun
    box and it wasn't a trivial task. The Sun compiler did have a
    misaligned option (which uses traps to call functions that use multiple
    reads, shits and masks) , but using cased the code to run like a three
    legged dog with a sore foot.
     
    Ian Collins, Aug 24, 2013
    #39
  20. James Harris

    Les Cargill Guest

    Okay. I feel your pain, bro. It's over, man. It's over.
    Now never let us speak of this again :)
     
    Les Cargill, Aug 24, 2013
    #40
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.