Re: Data alignment and endianness

Discussion in 'C Programming' started by Eric Sosman, Jan 23, 2011.

  1. Eric Sosman

    Eric Sosman Guest

    On 1/23/2011 10:05 AM, pozz wrote:
    > I need to parse a packet of data received on a serial port or stored in
    > a file. The bytes are stored in an array:
    > unsigned char data[PACKET_LENGTH];
    >
    > Now I'm wondering how I can write a "full portable" code to retrieve a
    > 32-bit unsigned integer in Network Byte Order (Big Endian), even at
    > not-aligned addresses.
    >
    > Suppose I have a 32-bit integer starting from data[1]. I think the most
    > portable way to retrieve this integer is:
    > ---
    > uint32_t i;
    > i = (data[1] << 24) | (data[2] << 16) | (data[3] << 8) | data[4];


    This is almost right, and will work on many machines. But on
    machines with 16-bit `int' (in fact, anything narrower than 32, or
    even a 32-bit `int' on machines that are picky about overflow) the
    promotion rules will trip you up. The `<<' operator will promote
    each `unsigned char' to an `int' (or `unsigned int' on some machines),
    but not to a `uint32_t', and then the shift may not work as desired.

    To ensure you're shifting with the proper width, then, you
    need something like

    i = (((uint32_t)data[1]) << 24)
    | (((uint32_t)data[2]) << 16)
    | (((uint32_t)data[3]) << 8)
    | (((uint32_t)data[4]) << 0);

    (You could lose the final shift, and even the final cast, but the
    pattern may be more readable if you stick with it all the way.
    You could also lose the outermost sets of parentheses, but I'd
    suggest keeping them in the interests of clarity.)
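
    The cast-before-shift pattern above can be wrapped in a small helper.
    This is a sketch (the function name get_be32 is mine, not from the
    thread):

```c
#include <stdint.h>

/* Extract a 32-bit big-endian value starting at an arbitrary byte
 * offset.  Each byte is cast to uint32_t *before* shifting, so the
 * shift width is well defined even where int is only 16 bits wide. */
static uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] <<  8)
         | ((uint32_t)p[3] <<  0);
}
```

    With the packet from the original post, the call would simply be
    i = get_be32(&data[1]);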

    > i = ntohl(i);


    No. The original calculation has already produced the value
    you want, so no further mucking around is necessary or desirable.

    By the way, you'll probably find ntohl() and its relatives
    already defined for you on systems that support networking. If
    you need them (you don't, here), prefer the system's versions to
    free-hand code.

    > [...]
    > What happens if I retrieve a 32-bit integer on a Big Endian CPU at an
    > aligned address, for example starting from data[0]? With the code above,
    > the CPU will construct the integer from the bytes, copying and
    > shifting them.
    > The code works well, but is more time consuming compared with a simple
    > cast:
    > ---
    > uint32_t i = *(uint32_t *)&data[0];


    For this variant, you *do* need ntohl() or something like it.

    Alignment is still a problem, even starting at `data[0]', because
    we have no reason (in what you've shown) to think that `data' itself
    is aligned in any particular way.
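
    One alignment-safe middle ground (my suggestion, not something
    proposed in the thread) is to memcpy the four bytes into a properly
    aligned uint32_t; many compilers turn the memcpy into a single load
    where the hardware allows it. Note the result is still in network
    order, so this variant does need ntohl():

```c
#include <stdint.h>
#include <string.h>

/* Copy four bytes into an aligned uint32_t.  memcpy imposes no
 * alignment requirement on p, so this is safe at any offset.  The
 * returned value is in whatever order the bytes were stored in. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;   /* for network-order data, apply ntohl() to this */
}
```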

    You say that the original "is more time consuming" than the
    second. How much more? What have your measurements shown? (You
    *have* measured the speed difference, have you not? Surely you
    would not be so rash as to state that there "is" a difference without
    evidence to support your assertion, would you? Hmmm?)

    > Can you suggest a better snippet code?


    Go back to the original, minus the ntohl(). Until and unless
    you have solid evidence that simple and portable code is a bottleneck,
    leave it portable and simple. "More computing sins are committed in
    the name of efficiency (without necessarily achieving it) than for any
    other single reason--including blind stupidity." - W.A. Wulf

    --
    Eric Sosman
     
    Eric Sosman, Jan 23, 2011
    #1

  2. Eric Sosman

    Eric Sosman Guest

    On 1/23/2011 6:06 PM, pozz wrote:
    > On 23/01/2011 17:06, Eric Sosman wrote:
    >> [...]
    >> You say that the original "is more time consuming" than the
    >> second. How much more? What have your measurements shown? (You
    >> *have* measured the speed difference, have you not? Surely you
    >> would not be so rash as to state that there "is" a difference without
    >> evidence to support your assertion, would you? Hmmm?)

    >
    > Oh, I haven't made any measurement. But I think a simple cast could
    > be much faster than copying, shifting and or-ing bytes into the
    > correct order. Isn't that true?


    What have your measurements shown?

    Compilers are clever, even devious, in what they'll do to
    optimize code, and what you see in the source may have only a
    sketchy resemblance to what the CPU winds up doing. Also, on
    many systems the CPU is hundreds of times faster than memory,
    so "optimization" is largely about scheduling memory references
    advantageously rather than worrying about how many CPU-internal
    manipulations are done on the data once it's fetched. (And if
    you think you can count the memory fetches by studying the source
    of the two proposed snippets, see the earlier remark about the
    deviousness of optimizers.)

    If you actually care about the relative speed, the only
    practical approach is to measure. Relying on operation counts
    (as seen in the source) is not entirely useless, but as near to
    it as makes no never-mind. MEASURE!
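
    A minimal way to act on that advice, sketched here under the
    assumption that clock() has usable resolution on the target (on a
    small embedded part you would substitute a hardware timer):

```c
#include <stdint.h>
#include <time.h>

static volatile uint32_t sink;  /* defeats dead-code elimination */

/* One of the two candidates under discussion: the portable shifts. */
static uint32_t by_shifts(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
}

/* Time `reps` calls of an extractor and return elapsed seconds. */
static double time_it(uint32_t (*f)(const unsigned char *),
                      const unsigned char *p, long reps)
{
    clock_t t0 = clock();
    for (long i = 0; i < reps; i++)
        sink = f(p);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

    Run each candidate through time_it with the same buffer and a large
    rep count, and compare the totals on the actual target, with the
    actual compiler flags.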

    > Keep in mind that I use low-end embedded processor with very limited
    > resources compared with x86 desktop processors.


    How can I "keep in mind" a datum that you now offer for the
    very first time? Or as Alice put it, "How can I have more when
    I haven't had any?" Besides, in your original post you asked
    about "most portable" solutions for "all the CPUs (Big and Little
    Endian)," not for one specific unnamed processor.

    --
    Eric Sosman
     
    Eric Sosman, Jan 24, 2011
    #2

  3. Eric Sosman

    Willem Guest

    pozz wrote:
    ) You are saying that code A could be faster than code B on a platform
    ) (CPU+compiler), but could be slower on another platform.
    ) It depends on the memory access time, compiler optimization and so on.
    )
    ) In my case, I work on embedded low-end microcontrollers (8- or 16-bit),
    ) with internal RAM and Flash memories.
    ) I made a test:
    ) const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
    ) i = *(int *)&data[1];
    ) is translated in:
    ) MOVW A, _data+1
    ) MOVW @RW0, A
    ) With this example I understand that this processor can access int
    ) variables even at non aligned addresses. And the assignment is very
    ) fast: just two move instructions.
    )
    ) While:
    ) const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
    ) i = ((int)data[1] << 8) + ((int)data[2] << 0);
    ) is translated as:
    ) MOV A, _data+1
    ) SWAP
    ) MOV A, _data+2
    ) ADDW A
    ) MOVW @RW0, A
    ) In this case I have 5 instructions, 3 of them are move. I think the
    ) second code is more portable, but more time consuming.

    Did you try this with full optimization ?
    A good compiler should have emitted the same instructions for both pieces
    of source code, assuming the first set of instructions is indeed faster
    in the case of misaligned access.

    ) This is the reason for my first question: how can I write code that is
    ) at the same time "full portable" (or almost portable) on every
    ) CPU/compiler and fast (for example, using the misaligned access feature
    ) as in my CPU)?

    Get a good optimizing compiler, and have it do the work for you.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, Jan 24, 2011
    #3
  4. Eric Sosman

    Bartc Guest

    "pozz" <> wrote in message
    news:ihkn21$6u6$...
    > On 24/01/2011 13:52, Eric Sosman wrote:


    > In my case, I work on embedded low-end microcontrollers (8- or 16-bit),
    > with internal RAM and Flash memories.
    > I made a test:
    > const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
    > i = *(int *)&data[1];
    > is translated in:
    > MOVW A, _data+1
    > MOVW @RW0, A
    > With this example I understand that this processor can access int
    > variables even at non aligned addresses. And the assignment is very fast:
    > just two move instructions.
    >
    > While:
    > const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
    > i = ((int)data[1] << 8) + ((int)data[2] << 0);
    > is translated as:
    > MOV A, _data+1
    > SWAP
    > MOV A, _data+2
    > ADDW A
    > MOVW @RW0, A
    > In this case I have 5 instructions, 3 of them are move. I think the second
    > code is more portable, but more time consuming.
    >
    > This is the reason for my first question: how can I write code that is at
    > the same time "full portable" (or almost portable) on every CPU/compiler
    > and fast (for example, using the misaligned access feature as in my CPU)?


    You want to use portable code for misaligned memory loads and stores, but
    don't want a speed penalty where the hardware can directly do misaligned
    access?

    (But presumably you are happy for there to be a penalty anyway where the
    hardware does not allow misaligned access? Some processors may not even be
    byte addressable: your char data[] could be stored one character per word.
    Or could be stored packed, so that quite a few instructions are needed to do
    even byte access to the data, let alone combine arbitrary bytes into a
    word.)

    In C I believe this sort of thing is handled with conditional code, hidden
    behind macros. It's not pretty, but means code can be (manually) tweaked
    according to the target processor.

    For example:

    #if ...
    #define GETINT(addr) (*(int*)(addr))
    #else
    #define GETINT(addr) (*(unsigned char*)(addr)<<8) + (*(unsigned char*)((addr)+1))
    #endif

    (made-up macros not tested). Then with conditionals, you compile one or the
    other depending on which is best. If misaligned loads are OK, use the first;
    if not, use the other. But there needs to be some flag or other which is set
    differently in each target.

    And they'd be used as i = GETINT(&data[1]);
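
    A compilable version of that sketch might look like this (the
    UNALIGNED_OK flag is made up for illustration, and the fast branch
    additionally assumes a big-endian CPU that tolerates misaligned
    loads):

```c
#include <stdint.h>

/* Conditional accessor: pick the raw-load branch only on targets
 * known (by the build flag) to allow it; otherwise fall back to the
 * portable byte shifts.  Casting through unsigned char avoids sign
 * extension of the high byte. */
#ifdef UNALIGNED_OK
#define GETINT(addr) (*(const int *)(addr))
#else
#define GETINT(addr) \
    (((int)*(const unsigned char *)(addr) << 8) | \
      (int)*(const unsigned char *)((addr) + 1))
#endif
```

    Usage is as described: i = GETINT(&data[1]); with the flag set per
    target in the build system.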

    --
    Bartc
     
    Bartc, Jan 24, 2011
    #4
  5. Eric Sosman

    Bartc Guest

    "Eric Sosman" <> wrote in message
    news:ihldu4$ou6$-september.org...

    > 3) ... but okay: Suppose that by expending hours and hours of
    > careful measurement and days and days of programming and weeks
    > and weeks of ongoing maintenance you actually *do* manage to
    > process the number in 30ns instead of 51, for a clear savings
    > of nearly forty percent, whee! Now, please calculate how
    > many numbers you need to process in order to break even on the
    > time you spent engineering the savings. For argument's sake,
    > suppose you somehow got through all the initial work in just
    > one forty-hour week, and divide that by 21ns per number to find
    > the payoff moment. (With these completely made-up numbers, you
    > will break even at 6.8 TRILLION input values. YMMV, of course.)


    You also have to multiply any savings by the number of people who are going
    to run the program, and by how many times they run it.

    These savings are typical of the improvements you might get through one
    minor optimisation in a compiler. Or a slightly different strategy for
    predicting branches in a processor.

    By themselves they are nothing. But there might be hundreds of such
    improvements. They might be invoked a billion times per program. And there
    might be a million copies of the program.

    At the very least, valuable experience can be gained as to how far a piece
    of software or hardware can be pushed.

    Presumably you are not suggesting that compiler writers and processor
    designers shouldn't bother with the stuff they do?

    That +2 clocks penalty for a misaligned load, on page 76 of that processor
    manual, is mentioned for a reason.

    --
    Bartc
     
    Bartc, Jan 25, 2011
    #5
  6. Eric Sosman

    Eric Sosman Guest

    On 1/24/2011 3:22 PM, pozz wrote:
    > On 24/01/2011 13:52, Eric Sosman wrote:
    >> [...]
    >> If you actually care about the relative speed, the only
    >> practical approach is to measure. Relying on operation counts
    >> (as seen in the source) is not entirely useless, but as near to
    >> it as makes no never-mind. MEASURE!

    >
    > You are saying that code A could be faster than code B on a platform
    > (CPU+compiler), but could be slower on another platform.


    Yes. If you believe A is always faster/slower and B is always
    slower/faster, regardless of platform, compiler version, compiler
    options, program context, and phase of the Moon, you are too young
    and inexperienced for the work you're engaged in.

    > [...]
    > In this case I have 5 instructions, 3 of them are move. I think the
    > second code is more portable, but more time consuming.


    "I think," you say, but I notice that you STILL don't say
    "I measure." What's the matter? Scared?

    > This is the reason for my first question: how can I write code that is
    > at the same time "full portable" (or almost portable) on every
    > CPU/compiler and fast (for example, using the misaligned access feature
    > as in my CPU)?
    >
    > After discussing with you, I think there isn't a solution for that
    > question.


    It is surpassingly unlikely that a straight-line source-code
    sequence A will be faster than B in all circumstances. Compilers
    are startlingly subtle, but are not magical. So, what's a poor
    programmer to do? There are several approaches, among them:

    1) Lard your code with #ifdef's for different platforms. This
    approach offers the best opportunities for speed, but the
    worst burden for maintenance: You wind up with N independent
    implementations, >=N times the effort to keep them current.
    There's also bit-rot to consider: `#ifdef FROBOZZ_MAGIC_C'
    may be fine today, but next month Frobozz comes out with a
    new and improved version that optimizes differently. If you
    don't have a continuing project to verify and re-verify all
    those hand-crafted optimizations, at least a few of them
    *will* turn into pessimizations. And if you *do* have such
    a project, all you can look forward to is adding a new
    `#ifdef FROBOZZ_MAGIC_C_VERSION_3_POINT_5_OR_GREATER' and
    increasing your maintenance burden from N to (N+1).

    2) Measure (that *awful* word!) the speeds of A and B on several
    platforms of interest, and ponder the magnitude of the spread.
    You're worried about extracting a 32-bit integer from a
    network packet, fine. Let's suppose that A takes 32-44 ns
    on your suite of platforms, while B takes 30-51. Well, is
    that difference significant in relation to the speed at which
    the integers arrive? If you're already sucking the data across
    a WiFi link shared by forty-seven other coffee drinkers, and
    shipping them through an entire TCP/IP stack with checksumming
    and window management and retransmissions and ACK's and all the
    rest, the difference between 32ns and 51ns is unlikely to matter.

    3) ... but okay: Suppose that by expending hours and hours of
    careful measurement and days and days of programming and weeks
    and weeks of ongoing maintenance you actually *do* manage to
    process the number in 30ns instead of 51, for a clear savings
    of nearly forty percent, whee! Now, please calculate how
    many numbers you need to process in order to break even on the
    time you spent engineering the savings. For argument's sake,
    suppose you somehow got through all the initial work in just
    one forty-hour week, and divide that by 21ns per number to find
    the payoff moment. (With these completely made-up numbers, you
    will break even at 6.8 TRILLION input values. YMMV, of course.)

    Okay, this analysis is far from complete. I've focused on payoff --
    time invested versus time saved -- and that's not the only "performance"
    measure. Still, it's a useful measure to keep in mind when assessing
    whether something is worth doing.

    Here's a (possibly misleading) analogy I've used before: With the
    price of gasoline on the rise, you may well desire to improve your
    car's fuel efficiency. Should you give the old jalopy a fresh coat
    of wax? YEA: It will make the car slipperier, reducing the wind
    resistance and allowing the engine to work less hard. NAY: It will
    make the car heavier, forcing the engine to drag a heavier load down
    the road. Oh, gosh, oh, gosh, whatever should I do?

    Answer: Stop fretting about wax and unhitch the boat trailer.

    ... and that is the LAST you will hear from me on this thread
    until and unless you offer some actual MEASUREMENTS! If you say
    "I think" or "It stands to reason" or "Everyone knows," I will
    personally come after you with a big book of six-place logarithms
    and bludgeon you so hard your mantissa will fall off.

    --
    Eric Sosman
     
    Eric Sosman, Jan 25, 2011
    #6
  7. Eric Sosman

    Eric Sosman Guest

    On 1/24/2011 7:04 PM, Bartc wrote:
    > "Eric Sosman"<> wrote in message
    > news:ihldu4$ou6$-september.org...
    >
    >> 3) ... but okay: Suppose that by expending hours and hours of
    >> careful measurement and days and days of programming and weeks
    >> and weeks of ongoing maintenance you actually *do* manage to
    >> process the number in 30ns instead of 51, for a clear savings
    >> of nearly forty percent, whee! Now, please calculate how
    >> many numbers you need to process in order to break even on the
    >> time you spent engineering the savings. For argument's sake,
    >> suppose you somehow got through all the initial work in just
    >> one forty-hour week, and divide that by 21ns per number to find
    >> the payoff moment. (With these completely made-up numbers, you
    >> will break even at 6.8 TRILLION input values. YMMV, of course.)

    >
    > You also have to multiply any savings by the number of people who are going
    > to run the program, and by how many times they run it.


    You still need 6.8 trillion (or whatever) to break even, spread
    it out however you may.

    > Presumably you are not suggesting that compiler writers and processor
    > designers shouldn't bother with the stuff they do?


    IMHO they bother with lots of stuff that's pointless, in pursuit
    of a better SPECKmock than the next guy. But even so: The CPU maven,
    the compiler wizard, even the library fabricator -- all these folks
    face the prospect that their stuff really *will* run trillions of
    times. The O.P. has offered *no* reason to think he's at that sort
    of level, not even within several orders of magnitude.

    Of course, this famous "6.8 trillion" is an entirely fictitious
    number. It's not to be taken literally, but as an illustration of
    the sort of calculation the O.P. *ought* to make before expending
    further effort on lightening his car by installing thinner windshields.
    But he doesn't seem to be quantitatively inclined; if anything, he's
    quantitatively disinclined. Such people should give up on programming
    and resign themselves to being computer science professors. ;-)

    --
    Eric Sosman
     
    Eric Sosman, Jan 25, 2011
    #7
  8. Eric Sosman

    Bartc Guest

    "Ben Bacarisse" <> wrote in message
    news:...
    > Eric Sosman <> writes:


    >> You still need 6.8 trillion (or whatever) to break even, spread
    >> it out however you may.

    >
    > I don't buy these peculiar sums! The hours I spend shaving a few ms off
    > a program run (note ms and not just ns) are real hours that I could
    > have spent doing something else -- adding a feature, fixing a bug, going
    > home on time. The few ms shaved off each run might not even be noticed
    > by the user. Even if a billion users gain these extra milliseconds a
    > hundred times a day, they don't sum to anything other than zero (or to
    > something very close to zero)!


    A billion times 1msec times 100 times a day comes to about 3 years'
    savings -- per day. One entire lifetime saved per month.

    > Of course there are time when it does matter, but it can't simply be
    > decided by adding up time spent and time gained.


    I just tried this:

    for (i=0; i<N; ++i) sum+=p[i];

    p is an array of integers. Misaligning it by 1 byte on an x86 makes this code
    20% slower. That may or may not have significance for the application. But
    if you are writing a library function (where N is a parameter) or are
    responsible for allocating p, then it's something that needs to be
    considered, and deserves some time spent on it.

    But if you know it will only ever save 20% (strictly, 16.7%) of 1% of the
    runtime, then perhaps it can wait (unless the idea of a misaligned array
    grates, then you might work on it anyway..)

    --
    Bartc
     
    Bartc, Jan 25, 2011
    #8
  9. Eric Sosman

    James Kuyper Guest

    On 01/24/2011 03:34 PM, Willem wrote:
    > pozz wrote:

    ....
    > ) This is the reason for my first question: how can I write code that is
    > ) at the same time "full portable" (or almost portable) on every
    > ) CPU/compiler and fast (for example, using the misaligned access feature
    > ) as in my CPU)?
    >
    > Get a good optimizing compiler, and have it do the work for you.


    The requirements "every CPU/compiler" and "good optimizing compiler" are
    incompatible.

    --
    James Kuyper
     
    James Kuyper, Jan 25, 2011
    #9
  10. Eric Sosman

    James Kuyper Guest

    On 01/24/2011 03:22 PM, pozz wrote:
    ....
    > This is the reason for my first question: how can I write code that is
    > at the same time "full portable" (or almost portable) on every
    > CPU/compiler and fast (for example, using the misaligned access feature
    > as in my CPU)?


    You can't say "every CPU/compiler" and also mandate "fast"; there are
    probably some compilers out there which are inherently incapable of
    generating "fast" code, by whatever standard you want to use for
    defining "fast". There's also pairs of compilers out there where the
    fastest code for one of the two compilers is different from the fastest
    code for the other. If it is essential to you to optimize at any level
    other than the language-independent level of the algorithm itself, you
    must inherently abandon any hope for portability.

    --
    James Kuyper
     
    James Kuyper, Jan 25, 2011
    #10
  11. Eric Sosman <> writes:

    > On 1/24/2011 7:04 PM, Bartc wrote:
    >> "Eric Sosman"<> wrote in message
    >> news:ihldu4$ou6$-september.org...
    >>
    >>> 3) ... but okay: Suppose that by expending hours and hours of
    >>> careful measurement and days and days of programming and weeks
    >>> and weeks of ongoing maintenance you actually *do* manage to
    >>> process the number in 30ns instead of 51, for a clear savings
    >>> of nearly forty percent, whee! Now, please calculate how
    >>> many numbers you need to process in order to break even on the
    >>> time you spent engineering the savings. For argument's sake,
    >>> suppose you somehow got through all the initial work in just
    >>> one forty-hour week, and divide that by 21ns per number to find
    >>> the payoff moment. (With these completely made-up numbers, you
    >>> will break even at 6.8 TRILLION input values. YMMV, of course.)

    >>
    >> You also have to multiply any savings by the number of people who are going
    >> to run the program, and by how many times they run it.

    >
    > You still need 6.8 trillion (or whatever) to break even, spread
    > it out however you may.


    I don't buy these peculiar sums! The hours I spend shaving a few ms off
    a program run (note ms and not just ns) are real hours that I could
    have spent doing something else -- adding a feature, fixing a bug, going
    home on time. The few ms shaved off each run might not even be noticed
    by the user. Even if a billion users gain these extra milliseconds a
    hundred times a day, they don't sum to anything other than zero (or to
    something very close to zero)!

    Of course there are time when it does matter, but it can't simply be
    decided by adding up time spent and time gained.

    <snip>
    --
    Ben.
     
    Ben Bacarisse, Jan 25, 2011
    #11
  12. pozz Wrote:
    > I made a test:
    > const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
    > i = *(int *)&data[1];
    > is translated in:
    > MOVW A, _data+1
    > MOVW @RW0, A
    > With this example I understand that this processor can access int
    > variables even at non aligned addresses.

    That is not shown by the compiler output.
    The compiler is not required to check (and account) for aligned memory
    access; that is your task as a programmer. The fact that the compiler emits
    instructions to perform unaligned data access does not mean this access will
    be performed correctly by the processor.
    > And the assignment is very
    > fast: just two move instructions.
    > While:
    > const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
    > i = ((int)data[1] << 8) + ((int)data[2] << 0);
    > is translated as:
    > MOV A, _data+1
    > SWAP
    > MOV A, _data+2
    > ADDW A
    > MOVW @RW0, A
    > In this case I have 5 instructions, 3 of them are move. I think the
    > second code is more portable, but more time consuming.

    There are indeed more assembly instructions, but especially on low-end
    processors, not all instructions take the same amount of time to execute. In
    the end, it may turn out to be faster to do more, but simpler, instructions.
    The second code is definitely much more portable, given that it is
    guaranteed to work in all cases, while the first code only works with a
    big-endian processor that supports unaligned loads.
    Incidentally, the processor documentation you posted in a follow-up
    indicates your target processor is little-endian, so for that processor the
    two samples produce different results.
    > This is the reason for my first question: how can I write code that is
    > at the same time "full portable" (or almost portable) on every
    > CPU/compiler and fast (for example, using the misaligned access feature
    > as in my CPU)?
    > After discussing with you, I think there isn't a solution for that
    > question.

    "fully portable" and "the fastest code possible for all platforms" are two
    goals that don't go together all that well. To squeeze the last
    clock-cycles, you always have to resort to hand-tuned code that is specific
    to the target processor/compiler combination or possibly even assembly code.
    > >> Keep in mind that I use low-end embedded processor with very limited
    > >> resources compared with x86 desktop processors.

    > >

    Just for reference, I have worked on embedded projects with similar low-end
    processors where we used the byte-shifting technique a lot (communication
    between internal components was done with messages. Messages were defined as
    big-endian encoded, the processor was little-endian, so every multi-byte
    value passed between components had to be encoded and decoded with byte
    shifts). The shifting to convert between little-endian and big-endian (and
    vice versa) has never caused any performance problems.
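
    The encode/decode helpers described there might be sketched like this
    (the names put_be16/get_be16 are mine, shown for a 16-bit field; the
    32-bit versions follow the same pattern):

```c
#include <stdint.h>

/* Write a 16-bit value in big-endian (network) order.  Byte shifts
 * give the same wire format regardless of host endianness. */
static void put_be16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v >> 8);
    p[1] = (unsigned char)(v & 0xFF);
}

/* Read a 16-bit big-endian value back from the message buffer. */
static uint16_t get_be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}
```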
    Bart v Ingen Schenau
     
    Bart van Ingen Schenau, Jan 25, 2011
    #12
  13. "Bartc" <> writes:

    > "Ben Bacarisse" <> wrote in message
    > news:...
    >> Eric Sosman <> writes:

    >
    >>> You still need 6.8 trillion (or whatever) to break even, spread
    >>> it out however you may.

    >>
    >> I don't buy these peculiar sums! The hours I spend shaving a few ms off
    >> a program run (note ms and not just ns) are real hours that I could
    >> have spent doing something else -- adding a feature, fixing a bug, going
    >> home on time. The few ms shaved off each run might not even be noticed
    >> by the user. Even if a billion users gain these extra milliseconds a
    >> hundred times a day, they don't sum to anything other than zero (or to
    >> something very close to zero)!

    >
    > A billion times 1msec times 100 times a day comes to about 3 years'
    > savings -- per day. One entire lifetime saved per month.


    You don't really believe that do you? If you do that's fine -- we can
    just agree to differ, but maybe you were just doing the sums to show how
    it adds up (if one were to choose to do so).

    >> Of course there are time when it does matter, but it can't simply be
    >> decided by adding up time spent and time gained.

    >
    > I just tried this:
    >
    > for (i=0; i<N; ++i) sum+=p[i];
    >
    > p is an array of integers. Misaligning it by 1 byte on an x86 makes this
    > code 20% slower.


    That's just one data point. I'm not disputing it, but I see different
    numbers; and just by quoting a number it can suggest a degree of
    validity that is often not warranted.

    > That may or may not have significance for the application. But
    > if you are writing a library function (where N is a parameter) or are
    > responsible for allocating p, then it's something that needs to be
    > considered, and deserves some time spent on it.
    >
    > But if you know it will only ever save 20% (strictly, 16.7%) of 1% of the
    > runtime, then perhaps it can wait (unless the idea of a misaligned array
    > grates, then you might work on it anyway..)


    I don't think we disagree about this.

    --
    Ben.
     
    Ben Bacarisse, Jan 25, 2011
    #13
  14. Eric Sosman

    Bartc Guest

    "Eric Sosman" <> wrote in message
    news:ihnu1r$833$-september.org...
    > On 1/25/2011 5:30 AM, Bartc wrote:
    >> [...]
    >> A billion times 1msec times 100 times a day comes to about 3 years'
    >> savings -- per day. One entire lifetime saved per month.

    >
    > ... and if you can't see the absurdity inherent in that
    > calculation, you ought to give up programming and go into politics.


    Looks like I need to go into politics then.

    BTW what *is* the absurdity? You mean saving 3 years, in only one day?
    That's spread over more than one person!

    (The figures were Ben's, who claimed they added to near zero. Changing the
    billion to a million, and the 1msec to 1 second, makes them more realistic
    (some products I use *do* have a response time of a second or more, and are
    manufactured in the hundreds of thousands), but the resulting saving is the
    same.)

    --
    Bartc
     
    Bartc, Jan 25, 2011
    #14
  15. Eric Sosman

    Seebs Guest

    On 2011-01-25, Ben Bacarisse <> wrote:
    > I don't buy these peculiar sums! The hours I spend shaving a few ms off
    > a program run (note ms and not just ns) are real hours that I could
    > have spent doing something else -- adding a feature, fixing a bug, going
    > home on time. The few ms shaved off each run might not even be noticed
    > by the user. Even if a billion users gain these extra milliseconds a
    > hundred times a day, they don't sum to anything other than zero (or to
    > something very close to zero)!


    In particular: If I get my command prompt back after running a command
    faster than the refresh cycle of my monitor, it really doesn't matter whether
    you can make it faster, from an interactive perspective. Now, if it's
    scriptable, milliseconds might matter, but even then, often not.

    Absolutely, by far, the most important thing is to measure performance
    first, and look closely at what you're doing.

    I once spent a day or two hand-tuning some vectorized code, eliminating
    branches (which were expensive), caching trig results, and so on. I got
    a 30% improvement in the performance of the code I was working on.

    Then I added a check for whether two adjacent values in something that
    would be expanded repeatedly (iterative fractals) were the same value, because
    if they were, they'd expand into N of the same value, then N^2, and so on.

    I got a couple percent difference in performance on small cases, well over
    a factor of two on the largest cases that had been possible in main memory
    before the change, and I could do things about 2x-3x as large in main memory,
    which resulted in about a 50x improvement on some larger cases.

    Five minutes spent doing the right thing is WAY more effective than a day
    or two spent doing the wrong thing.
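    The fractal code itself isn't shown here, but the idea is simple enough to
    sketch. This is a toy reconstruction, not Seebs's actual code: the expansion
    factor K, the names, and the per-element function are all invented. The point
    is the cheap equality check that skips recomputing an expensive expansion
    when a value equals its left neighbour:

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    #define K 4  /* each input value expands into K values (hypothetical factor) */

    /* Hypothetical per-element expansion -- stands in for the expensive
     * trig-laden computation in the real code. */
    static void expand_one(double v, double out[K])
    {
        for (int j = 0; j < K; j++)
            out[j] = v * (j + 1) + 0.5;
    }

    /* Expand n values into n*K values.  If a value equals its left
     * neighbour, its K outputs are identical too, so copy them instead
     * of recomputing -- the cheap check that beat a day of hand-tuning. */
    static void expand(const double *src, size_t n, double *dst)
    {
        for (size_t i = 0; i < n; i++) {
            if (i > 0 && src[i] == src[i - 1])
                memcpy(dst + i * K, dst + (i - 1) * K, K * sizeof *dst);
            else
                expand_one(src[i], dst + i * K);
        }
    }

    int main(void)
    {
        const double src[] = { 1.0, 1.0, 1.0, 2.0 };
        double dst[4 * K];

        expand(src, 4, dst);
        assert(dst[0] == dst[K]);   /* second run of 1.0 was copied, not recomputed */
        assert(dst[3 * K] == 2.5);  /* 2.0 * 1 + 0.5 */
        printf("ok\n");
        return 0;
    }
    ```

    Since runs of equal values grow as N, N^2, ... under repeated expansion,
    the copy path dominates on large inputs, which is where the 50x showed up.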

    -s
    --
    Copyright 2010, all wrongs reversed. Peter Seebach /
    http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
    http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
    I am not speaking for my employer, although they do rent some of my opinions.
     
    Seebs, Jan 26, 2011
    #15
  16. Eric Sosman

    Eric Sosman Guest

    On 1/25/2011 5:30 AM, Bartc wrote:
    > [...]
    > A billion times 1msec times 100 times a day comes to about 3 years'
    > savings -- per day. One entire lifetime saved per month.


    ... and if you can't see the absurdity inherent in that
    calculation, you ought to give up programming and go into politics.

    --
    Eric Sosman
    lid
     
    Eric Sosman, Jan 26, 2011
    #16
  17. Eric Sosman

    Eric Sosman Guest

    On 1/25/2011 9:06 AM, Bart van Ingen Schenau wrote:
    > pozz Wrote:
    >> I made a test:
    >> const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
    >> i = *(int *)&data[1];
    >> is translated in:
    >> MOVW A, _data+1
    >> MOVW @RW0, A
    >> With this example I understand that this processor can access int
    >> variables even at non aligned addresses.

    > That is not shown by the compiler output.
    > The compiler is not required to check (and account) for aligned memory
    > access, that is your task as a programmer. The fact that the compiler emits
    > instructions to perform unaligned data access does not mean this access will
    > be performed correctly by the processor.


    FWIW, I've observed four distinct behaviors for unaligned accesses
    (on actual machines, not DeathStation 9000's):

    0) No difference. What this really means is that the machine has
    no alignment requirements to begin with: An 8-bit Z80 can fetch
    a 16-bit "double length" integer from any address, odd or even.

    1) The hardware performs the access, but takes longer and may lose
    interesting properties like atomicity. Speed penalties run
    from 25-100% (5/4 to twice the aligned time), often far worse
    if the misalignment straddles a page boundary.

    2) The hardware traps. Sometimes the trap crashes the program,
    sometimes a software trap handler emulates it. The trap handler
    takes 100x to 2000x as much time as an aligned access, more if
    page boundaries are crossed.

    3) The hardware runs merrily at full speed, but silently accesses
    the wrong data. The machine on which I saw this behavior just
    ignored the "offending" low-order address bits: If you accessed
    a four-byte-aligned datum at binary address xx...x1010, the
    machine happily used xx...x1000 instead. Mysteries ensued ...

    This is just what I've personally run across: There are probably
    even odder things out there that I haven't yet encountered.
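    All four hazards go away if the bytes are fetched one at a time and
    reassembled, as suggested earlier in the thread. A minimal sketch (the
    helper name get_be32 is my own); each byte is widened to uint32_t before
    shifting, so the result is well defined even where int is 16 bits:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Read a 32-bit big-endian value starting at any (possibly
     * unaligned) byte address.  Byte accesses have no alignment
     * requirement, so none of the four behaviors above can bite. */
    static uint32_t get_be32(const unsigned char *p)
    {
        return ((uint32_t)p[0] << 24)
             | ((uint32_t)p[1] << 16)
             | ((uint32_t)p[2] <<  8)
             | ((uint32_t)p[3] <<  0);
    }

    int main(void)
    {
        const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };

        /* The unaligned read from data[1] that *(int *)&data[1]
         * attempted -- no cast, no trap, no byte-order dependence. */
        uint32_t i = get_be32(&data[1]);
        assert(i == UINT32_C(0x02030405));
        printf("0x%08lX\n", (unsigned long)i);
        return 0;
    }
    ```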

    --
    Eric Sosman
    lid
     
    Eric Sosman, Jan 26, 2011
    #17
  18. Ben Bacarisse

    Ben Bacarisse Guest

    "Bartc" <> writes:

    > "Eric Sosman" <> wrote in message
    > news:ihnu1r$833$-september.org...
    >> On 1/25/2011 5:30 AM, Bartc wrote:
    >>> [...]
    >>> A billion times 1msec times 100 times a day comes to about 3 years'
    >>> savings -- per day. One entire lifetime saved per month.

    >>
    >> ... and if you can't see the absurdity inherent in that
    >> calculation, you ought to give up programming and go into politics.

    >
    > Looks like I need to go into politics then.
    >
    > BTW what *is* the absurdity? You mean saving 3 years, in only one day?
    > That's spread over more than one person!
    >
    > (The figures were Ben's, who claimed they added to near zero. Changing the
    > billion to a million, and the 1msec to 1 second, makes them more realistic
    > (some products I use *do* have a response time of a second or more, and are
    > manufactured in the hundreds of thousands), but the resulting saving is the
    > same.)


    Let's say you have a spare hour in which to read. Do you care if it is
    from 12 to 1pm, or if it is made up of 3600 separate seconds scattered
    throughout the day? It's the same total time after all. I contend that
    time does not always add up in the obvious way. The fractions gained
    here and there can't always be used as if they were run together.

    --
    Ben.
     
    Ben Bacarisse, Jan 26, 2011
    #18
  19. Eric Sosman

    Eric Sosman Guest

    [OT] Re: Data alignment and endianess

    On 1/26/2011 4:43 AM, christian.bau wrote:
    > On Jan 25, 4:29 am, Eric Sosman<> wrote:
    >
    >> IMHO they bother with lots of stuff that's pointless, in pursuit
    >> of a better SPECKmock than the next guy. But even so: The CPU maven,
    >> the compiler wizard, even the library fabricator -- all these folks
    >> face the prospect that their stuff really *will* run trillions of
    >> times. The O.P. has offered *no* reason to think he's at that sort
    >> of level, not even within several orders of magnitude.

    >
    > There is a very simple argument: Whenever you handle data coming from
    > the network, or going to the network, there will be some network I/O
    > just before or just after the handling of the data. That network I/O
    > will dwarf whatever execution time the handling code has.


    Usually, maybe, but certainly not always. 10Gb/s network links
    are quite common, and 100Gb/s links are not uncommon in datacenter
    backbones. In a previous life (less than two years ago!) I was
    trying to help a customer whose CPU's were unable to keep up with
    the network's interrupt load; the solution was to put the interface
    in a non-interrupting mode and dedicate an entire core to polling
    the device!

    > Networking code doesn't get fast by doing "clever" and possibly non-
    > portable things at the lowest level. It gets fast by doing things
    > right at a higher level.


    A good deal of low-level cleverness gets expended on things like
    scatter-gather I/O to eliminate buffer copying.

    --
    Eric Sosman
    lid
     
    Eric Sosman, Jan 26, 2011
    #19
  20. James Kuyper

    James Kuyper Guest

    On 01/25/2011 11:24 PM, Ben Bacarisse wrote:
    > "Bartc"<> writes:
    >
    >> "Eric Sosman"<> wrote in message
    >> news:ihnu1r$833$-september.org...
    >>> On 1/25/2011 5:30 AM, Bartc wrote:
    >>>> [...]
    >>>> A billion times 1msec times 100 times a day comes to about 3 years'
    >>>> savings -- per day. One entire lifetime saved per month.
    >>>
    >>> ... and if you can't see the absurdity inherent in that
    >>> calculation, you ought to give up programming and go into politics.

    >>
    >> Looks like I need to go into politics then.
    >>
    >> BTW what *is* the absurdity? You mean saving 3 years, in only one day?
    >> That's spread over more than one person!
    >>
    >> (The figures were Ben's, who claimed they added to near zero. Changing the
    >> billion to a million, and the 1msec to 1 second, makes them more realistic
    >> (some products I use *do* have a response time of a second or more, and are
    >> manufactured in the hundreds of thousands), but the resulting saving is the
    >> same.)

    >
    > Let's say you have a spare hour in which to read. Do you care if it is
    > from 12 to 1pm, or if it is made up of 3600 separate seconds scattered
    > throughout the day? It's the same total time after all. I contend that
    > time does not always add up in the obvious way. The fractions gained
    > here and there can't always be used as if they were run together.


    Let's say that you have a program that takes 10 seconds to perform an
    operation that you have to wait for before you can do anything else
    useful. Thanks to some 20,000 separate decisions made by 5000 different
    hardware and software designers about how each individual hardware and
    software component of your system should be designed, that operation
    does NOT take a full minute to execute; those individual decisions
    saved you an average of only 2.5 milliseconds each. If each of those
    designers had taken the attitude that "no one cares about saving 2.5
    milliseconds", most of those decisions might have been made differently.
    Do you think that would have been the right thing for those designers to
    have done?
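
    The arithmetic behind those figures checks out; a quick sanity check
    (the numbers are Kuyper's, the variable names are mine):

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        /* 20,000 decisions, each saving an average of 2.5 ms. */
        double decisions    = 20000.0;
        double saved_each_s = 0.0025;

        double saved_s  = decisions * saved_each_s;  /* 50 seconds */
        double actual_s = 10.0;                      /* observed time */

        /* Without those savings the operation would take a full minute. */
        assert(saved_s == 50.0);
        assert(actual_s + saved_s == 60.0);
        printf("%.0f s without the savings, %.0f s with\n",
               actual_s + saved_s, actual_s);
        return 0;
    }
    ```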


    --
    James Kuyper
     
    James Kuyper, Jan 26, 2011
    #20