# Re: Data alignment and endianness

Discussion in 'C Programming' started by Eric Sosman, Jan 23, 2011.

1. ### Eric Sosman (Guest)

On 1/23/2011 10:05 AM, pozz wrote:
> I need to parse a packet of data received on a serial port or stored in
> a file. The bytes are stored in an array:
> unsigned char data[PACKET_LENGTH];
>
> Now I'm wondering how I can write "full portable" code to retrieve a
> 32-bit unsigned integer in Network Byte Order (Big Endian), even at
> unaligned addresses.
>
> Suppose I have a 32-bit integer starting from data[1]. I think the most
> portable way to retrieve this integer is:
> ---
> uint32_t i;
> i = (data[1] << 24) | (data[2] << 16) | (data[3] << 8) | data[4];

This is almost right, and will work on many machines. But on
machines with 16-bit `int' (in fact, anything narrower than 32, or
even a 32-bit `int' on machines that are picky about overflow) the
promotion rules will trip you up. The `<<' operator will promote
each `unsigned char' to an `int' (or `unsigned int' on some machines),
but not to a `uint32_t', and then the shift may not work as desired.

To ensure you're shifting with the proper width, then, you
need something like

i = (((uint32_t)data[1]) << 24)
  | (((uint32_t)data[2]) << 16)
  | (((uint32_t)data[3]) <<  8)
  | (((uint32_t)data[4]) <<  0);

(You could lose the final shift, and even the final cast, but the
pattern may be more readable if you stick with it all the way.
You could also lose the outermost sets of parentheses, but I'd
suggest keeping them in the interests of clarity.)
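Wrapped up as a small helper (the function name here is my own, not part of the original post), the pattern Eric describes looks like this:

```c
#include <stdint.h>

/* Extract a 32-bit big-endian (network byte order) value from a byte
 * buffer, independent of host endianness and alignment.  The casts to
 * uint32_t happen before the shifts, so the arithmetic is done at full
 * 32-bit width even on implementations where int is only 16 bits. */
static uint32_t be32_get(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] <<  8)
         | ((uint32_t)p[3] <<  0);
}
```

The original snippet then becomes `i = be32_get(&data[1]);`, with no endianness or alignment assumptions anywhere.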

> i = ntohl(i);

No. The original calculation has already produced the value
you want, so no further mucking around is necessary or desirable.

By the way, you'll probably find ntohl() and its relatives
already defined for you on systems that support networking. If
you need them (you don't, here), prefer the system's versions to
free-hand code.

> [...]
> What happens if I retrieve a 32-bit integer on a Big Endian CPU at an
> aligned address, for example starting from data[0]? With the code above,
> the CPU will construct the integer from the bytes, copying and
> shifting them.
> The code works well, but is more time consuming compared with a simple
> cast:
> ---
> uint32_t i = *(uint32_t *)&data[0];

For this variant, you *do* need ntohl() or something like it.

Alignment is still a problem, even starting at `data[0]', because
we have no reason (in what you've shown) to think that `data' itself
is aligned in any particular way.
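A common middle ground, not raised in the thread itself but a standard idiom, is to memcpy the bytes into a uint32_t and then fix the byte order; a good compiler will turn the memcpy into a single load where alignment permits. This sketch assumes a POSIX system for ntohl():

```c
#include <arpa/inet.h>   /* ntohl(); available on systems with networking */
#include <stdint.h>
#include <string.h>

/* Read 4 bytes from an arbitrary address without performing an
 * unaligned access at the C level.  memcpy sidesteps both the
 * alignment and the strict-aliasing problems of *(uint32_t *)&data[0];
 * ntohl() then supplies the byte-order fix the cast version needs. */
static uint32_t be32_read(const unsigned char *p)
{
    uint32_t raw;
    memcpy(&raw, p, sizeof raw);
    return ntohl(raw);
}
```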

You say that the original "is more time consuming" than the
second. How much more? What have your measurements shown? (You
*have* measured the speed difference, have you not? Surely you
would not be so rash as to state that there "is" a difference without
evidence to support your assertion, would you? Hmmm?)

> Can you suggest a better snippet code?

Go back to the original, minus the ntohl(). Until and unless
you have solid evidence that simple and portable code is a bottleneck,
leave it portable and simple. "More computing sins are committed in
the name of efficiency (without necessarily achieving it) than for any
other single reason--including blind stupidity." - W.A. Wulf

--
Eric Sosman

Eric Sosman, Jan 23, 2011

2. ### Eric Sosman (Guest)

On 1/23/2011 6:06 PM, pozz wrote:
> Il 23/01/2011 17:06, Eric Sosman ha scritto:
>> [...]
>> You say that the original "is more time consuming" than the
>> second. How much more? What have your measurements shown? (You
>> *have* measured the speed difference, have you not? Surely you
>> would not be so rash as to state that there "is" a difference without
>> evidence to support your assertion, would you? Hmmm?)

>
> Oh, I haven't made any measurement. But I think that a simple cast
> could be much faster than copying, shifting and or-ing bytes in the
> correct order. Isn't this true?

Compilers are clever, even devious, in what they'll do to
optimize code, and what you see in the source may have only a
sketchy resemblance to what the CPU winds up doing. Also, on
many systems the CPU is hundreds of times faster than memory,
so "optimization" is largely about scheduling memory references,
not about whatever manipulations are done on the data once it's
fetched. (And if
you think you can count the memory fetches by studying the source
of the two proposed snippets, see the earlier remark about the
deviousness of optimizers.)

If you actually care about the relative speed, the only
practical approach is to measure. Relying on operation counts
(as seen in the source) is not entirely useless, but as near to
it as makes no never-mind. MEASURE!

> Keep in mind that I use a low-end embedded processor with very limited
> resources compared with x86 desktop processors.

How can I "keep in mind" a datum that you now offer for the
very first time? Or as Alice put it, "How can I have more when
I haven't had any?" You asked about "most portable" solutions
for "all the CPUs (Big and Little Endian)," not for one specific
unnamed processor.

--
Eric Sosman

Eric Sosman, Jan 24, 2011

3. ### Willem (Guest)

pozz wrote:
) You are saying that code A could be faster than code B on a platform
) (CPU+compiler), but could be slower on another platform.
) It depends on the memory access time, compiler optimization and so on.
)
) In my case, I work on embedded low-end microcontrollers (8- or 16-bits),
) with internal RAM and Flash memories.
) const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
) i = *(int *)&data[1];
) is translated in:
) MOVW A, _data+1
) MOVW @RW0, A
) With this example I understand that this processor can access int
) variables even at non aligned addresses. And the assignment is very
) fast: just two move instructions.
)
) While:
) const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
) i = ((int)data[1] << 8) + ((int)data[2] << 0);
) is translated as:
) MOV A, _data+1
) SWAP
) MOV A, _data+2
) MOVW @RW0, A
) In this case I have 5 instructions, 3 of them are move. I think the
) second code is more portable, but more time consuming.

Did you try this with full optimization?
A good compiler should have emitted the same instructions for both pieces
of source code, assuming the first set of instructions is indeed faster
in the case of misaligned access.

) This is the reason for my first question: how can I write code that is
) at the same time "full portable" (or almost portable) on every
) CPU/compiler and fast (for example, using the misaligned access feature
) as in my CPU)?

Get a good optimizing compiler, and have it do the work for you.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Willem, Jan 24, 2011
4. ### Bartc (Guest)

"pozz" <> wrote in message
news:ihkn21\$6u6\$...
> Il 24/01/2011 13:52, Eric Sosman ha scritto:

> In my case, I work on embedded low-end microcontrollers (8- or 16-bits),
> with internal RAM and Flash memories.
> const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
> i = *(int *)&data[1];
> is translated in:
> MOVW A, _data+1
> MOVW @RW0, A
> With this example I understand that this processor can access int
> variables even at non aligned addresses. And the assignment is very fast:
> just two move instructions.
>
> While:
> const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
> i = ((int)data[1] << 8) + ((int)data[2] << 0);
> is translated as:
> MOV A, _data+1
> SWAP
> MOV A, _data+2
> MOVW @RW0, A
> In this case I have 5 instructions, 3 of them are move. I think the second
> code is more portable, but more time consuming.
>
> This is the reason for my first question: how can I write code that is at
> the same time "full portable" (or almost portable) on every CPU/compiler
> and fast (for example, using the misaligned access feature as in my CPU)?

You want to use portable code for misaligned memory loads and stores, but
don't want a speed penalty where the hardware can directly do misaligned
access?

(But presumably you are happy for there to be a penalty anyway where the
hardware does not allow misaligned access? Some processors may not even be
byte addressable: your char data[] could be stored one character per word.
Or could be stored packed, so that quite a few instructions are needed to do
even byte access to the data, let alone combine arbitrary bytes into a
word.)

In C I believe this sort of thing is handled with conditional code, hidden
behind macros. It's not pretty, but means code can be (manually) tweaked
according to the target processor.

For example:

#if ...
#else
#endif

(made-up macros not tested). Then with conditionals, you compile one or the
other depending on which is best. If misaligned loads are OK, use the first;
if not, use the other. But there needs to be some flag or other which is set
differently in each target.

And they'd be used as i = GETINT(&data[1]);
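Filled out, the conditional might look like the sketch below. Both macro names are invented for illustration, just as Bartc's were; HW_MISALIGNED_OK stands in for whatever flag the build system sets per target:

```c
#include <stdint.h>

/* HW_MISALIGNED_OK: hypothetical per-target flag, set by the build
 * system for targets where direct unaligned big-endian loads are both
 * legal and correct.  Everything else gets the portable byte-assembly. */
#ifdef HW_MISALIGNED_OK
#define GETINT(p)  (*(const uint32_t *)(p))
#else
#define GETINT(p)  (((uint32_t)((const unsigned char *)(p))[0] << 24) | \
                    ((uint32_t)((const unsigned char *)(p))[1] << 16) | \
                    ((uint32_t)((const unsigned char *)(p))[2] <<  8) | \
                     (uint32_t)((const unsigned char *)(p))[3])
#endif
```

Note that the fast path is only a win on big-endian targets that tolerate unaligned loads; on anything else it is silently wrong, which is exactly why the flag has to be set per target and re-verified when targets change.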

--
Bartc

Bartc, Jan 24, 2011
5. ### Bartc (Guest)

"Eric Sosman" <> wrote in message
news:ihldu4\$ou6\$-september.org...

> 3) ... but okay: Suppose that by expending hours and hours of
> careful measurement and days and days of programming and weeks
> and weeks of ongoing maintenance you actually *do* manage to
> process the number in 30ns instead of 51, for a clear savings
> of nearly forty percent, whee! Now, please calculate how
> many numbers you need to process in order to break even on the
> time you spent engineering the savings. For argument's sake,
> suppose you somehow got through all the initial work in just
> one forty-hour week, and divide that by 21ns per number to find
> the payoff moment. (With these completely made-up numbers, you
> will break even at 6.8 TRILLION input values. YMMV, of course.)

You also have to multiply any savings by the number of people who are going
to run the program, and by how many times they run it.

These savings are typical of the improvements you might get through one
minor optimisation in a compiler. Or a slightly different strategy for
predicting branches in a processor.

By themselves they are nothing. But there might be hundreds of such
improvements. They might be invoked a billion times per program. And there
might be a million copies of the program.

At the very least, valuable experience can be gained as to how far a piece
of software or hardware can be pushed.

Presumably you are not suggesting that compiler writers and processor
designers shouldn't bother with the stuff they do?

That +2 clocks penalty for a misaligned load, on page 76 of that processor
manual, is mentioned for a reason.

--
Bartc

Bartc, Jan 25, 2011
6. ### Eric Sosman (Guest)

On 1/24/2011 3:22 PM, pozz wrote:
> Il 24/01/2011 13:52, Eric Sosman ha scritto:
>> [...]
>> If you actually care about the relative speed, the only
>> practical approach is to measure. Relying on operation counts
>> (as seen in the source) is not entirely useless, but as near to
>> it as makes no never-mind. MEASURE!

>
> You are saying that code A could be faster than code B on a platform
> (CPU+compiler), but could be slower on another platform.

Yes. If you believe A is always faster/slower and B is always
slower/faster, regardless of platform, compiler version, compiler
options, program context, and phase of the Moon, you are too young
and inexperienced for the work you're engaged in.

> [...]
> In this case I have 5 instructions, 3 of them are move. I think the
> second code is more portable, but more time consuming.

"I think," you say, but I notice that you STILL don't say
"I measure." What's the matter? Scared?

> This is the reason for my first question: how can I write code that is
> at the same time "full portable" (or almost portable) on every
> CPU/compiler and fast (for example, using the misaligned access feature
> as in my CPU)?
>
> After discussing with you, I think there isn't a solution for that
> question.

It is surpassingly unlikely that a straight-line source-code
sequence A will be faster than B in all circumstances. Compilers
are startlingly subtle, but are not magical. So, what's a poor
programmer to do? There are several approaches, among them:

1) Lard your code with #ifdef's for different platforms. This
approach offers the best opportunities for speed, but the
worst burden for maintenance: You wind up with N independent
implementations, >=N times the effort to keep them current.
There's also bit-rot to consider: `#ifdef FROBOZZ_MAGIC_C'
may be fine today, but next month Frobozz comes out with a
new and improved version that optimizes differently. If you
don't have a continuing project to verify and re-verify all
those hand-crafted optimizations, at least a few of them
*will* turn into pessimizations. And if you *do* have such
a project, all you can look forward to is adding a new
`#ifdef FROBOZZ_MAGIC_C_VERSION_3_POINT_5_OR_GREATER' and
increasing your maintenance burden from N to (N+1).

2) Measure (that *awful* word!) the speeds of A and B on several
platforms of interest, and ponder the magnitude of the spread.
You're worried about extracting a 32-bit integer from a
network packet, fine. Let's suppose that A takes 32-44 ns
on your suite of platforms, while B takes 30-51. Well, is
that difference significant in relation to the speed at which
the integers arrive? If you're already sucking the data across
a WiFi link shared by forty-seven other coffee drinkers, and
shipping them through an entire TCP/IP stack with checksumming
and window management and retransmissions and ACK's and all the
rest, the difference between 32ns and 51ns is unlikely to matter.

3) ... but okay: Suppose that by expending hours and hours of
careful measurement and days and days of programming and weeks
and weeks of ongoing maintenance you actually *do* manage to
process the number in 30ns instead of 51, for a clear savings
of nearly forty percent, whee! Now, please calculate how
many numbers you need to process in order to break even on the
time you spent engineering the savings. For argument's sake,
suppose you somehow got through all the initial work in just
one forty-hour week, and divide that by 21ns per number to find
the payoff moment. (With these completely made-up numbers, you
will break even at 6.8 TRILLION input values. YMMV, of course.)

Okay, this analysis is far from complete. I've focused on payoff --
time invested versus time saved -- and that's not the only "performance"
measure. Still, it's a useful measure to keep in mind when assessing
whether something is worth doing.

Here's a (possibly misleading) analogy I've used before: With the
price of gasoline on the rise, you may well desire to improve your
car's fuel efficiency. Should you give the old jalopy a fresh coat
of wax? YEA: It will make the car slipperier, reducing the wind
resistance and allowing the engine to work less hard. NAY: It will
make the car heavier, forcing the engine to drag a heavier load down
the road. Oh, gosh, oh, gosh, whatever should I do?

... and that is the LAST you will hear from me on this thread
until and unless you offer some actual MEASUREMENTS! If you say
"I think" or "It stands to reason" or "Everyone knows," I will
personally come after you with a big book of six-place logarithms
and bludgeon you so hard your mantissa will fall off.

--
Eric Sosman

Eric Sosman, Jan 25, 2011
7. ### Eric Sosman (Guest)

On 1/24/2011 7:04 PM, Bartc wrote:
> "Eric Sosman"<> wrote in message
> news:ihldu4\$ou6\$-september.org...
>
>> 3) ... but okay: Suppose that by expending hours and hours of
>> careful measurement and days and days of programming and weeks
>> and weeks of ongoing maintenance you actually *do* manage to
>> process the number in 30ns instead of 51, for a clear savings
>> of nearly forty percent, whee! Now, please calculate how
>> many numbers you need to process in order to break even on the
>> time you spent engineering the savings. For argument's sake,
>> suppose you somehow got through all the initial work in just
>> one forty-hour week, and divide that by 21ns per number to find
>> the payoff moment. (With these completely made-up numbers, you
>> will break even at 6.8 TRILLION input values. YMMV, of course.)

>
> You also have to multiply any savings by the number of people who are going
> to run the program, and by how many times they run it.

You still need 6.8 trillion (or whatever) to break even, spread
it out however you may.

> Presumably you are not suggesting that compiler writers and processor
> designers shouldn't bother with the stuff they do?

IMHO they bother with lots of stuff that's pointless, in pursuit
of a better SPECKmock than the next guy. But even so: The CPU maven,
the compiler wizard, even the library fabricator -- all these folks
face the prospect that their stuff really *will* run trillions of
times. The O.P. has offered *no* reason to think he's at that sort
of level, not even within several orders of magnitude.

Of course, this famous "6.8 trillion" is an entirely fictitious
number. It's not to be taken literally, but as an illustration of
the sort of calculation the O.P. *ought* to make before expending
further effort on lightening his car by installing thinner windshields.
But he doesn't seem to be quantitatively inclined; if anything, he's
quantitatively disinclined. Such people should give up on programming
and resign themselves to being computer science professors. ;-)

--
Eric Sosman

Eric Sosman, Jan 25, 2011
8. ### Bartc (Guest)

"Ben Bacarisse" <> wrote in message
news:...
> Eric Sosman <> writes:

>> You still need 6.8 trillion (or whatever) to break even, spread
>> it out however you may.

>
> I don't buy these peculiar sums! The hours I spend shaving a few ms off
> a program run (note ms and not just ns) are real hours that I could
> have spent doing something else -- adding a feature, fixing a bug, going
> home on time. The few ms shaved off each run might not even be noticed
> by the user. Even if a billion users gain these extra milliseconds a
> hundred times a day, they don't sum to anything other than zero (or to
> something very close to zero)!

A billion times 1msec times 100 times a day comes to about 3 years'
savings -- per day. One entire lifetime saved per month.

> Of course there are times when it does matter, but it can't simply be
> decided by adding up time spent and time gained.

I just tried this:

for (i=0; i<N; ++i) sum += p[i];

p is an array of integers. Misaligning p by 1 byte on an x86 makes this code
20% slower. That may or may not have significance for the application. But
if you are writing a library function (where N is a parameter) or are
responsible for allocating p, then it's something that needs to be
considered, and deserves some time spent on it.

But if you know it will only ever save 20% (strictly, 16.7%) of 1% of the
runtime, then perhaps it can wait (unless the idea of a misaligned array
grates, then you might work on it anyway..)

--
Bartc

Bartc, Jan 25, 2011
9. ### James Kuyper (Guest)

On 01/24/2011 03:34 PM, Willem wrote:
> pozz wrote:

....
> ) This is the reason for my first question: how can I write code that is
> ) at the same time "full portable" (or almost portable) on every
> ) CPU/compiler and fast (for example, using the misaligned access feature
> ) as in my CPU)?
>
> Get a good optimizing compiler, and have it do the work for you.

The requirements "every CPU/compiler" and "good optimizing compiler" are
incompatible.

--
James Kuyper

James Kuyper, Jan 25, 2011
10. ### James Kuyper (Guest)

On 01/24/2011 03:22 PM, pozz wrote:
....
> This is the reason for my first question: how can I write code that is
> at the same time "full portable" (or almost portable) on every
> CPU/compiler and fast (for example, using the misaligned access feature
> as in my CPU)?

You can't say "every CPU/compiler" and also mandate "fast"; there are
probably some compilers out there which are inherently incapable of
generating "fast" code, by whatever standard you want to use for
defining "fast". There are also pairs of compilers out there where the
fastest code for one of the two compilers is different from the fastest
code for the other. If it is essential to you to optimize at any level
other than the language-independent level of the algorithm itself, you
must inherently abandon any hope for portability.

--
James Kuyper

James Kuyper, Jan 25, 2011
11. ### Ben Bacarisse (Guest)

Eric Sosman <> writes:

> On 1/24/2011 7:04 PM, Bartc wrote:
>> "Eric Sosman"<> wrote in message
>> news:ihldu4\$ou6\$-september.org...
>>
>>> 3) ... but okay: Suppose that by expending hours and hours of
>>> careful measurement and days and days of programming and weeks
>>> and weeks of ongoing maintenance you actually *do* manage to
>>> process the number in 30ns instead of 51, for a clear savings
>>> of nearly forty percent, whee! Now, please calculate how
>>> many numbers you need to process in order to break even on the
>>> time you spent engineering the savings. For argument's sake,
>>> suppose you somehow got through all the initial work in just
>>> one forty-hour week, and divide that by 21ns per number to find
>>> the payoff moment. (With these completely made-up numbers, you
>>> will break even at 6.8 TRILLION input values. YMMV, of course.)

>>
>> You also have to multiply any savings by the number of people who are going
>> to run the program, and by how many times they run it.

>
> You still need 6.8 trillion (or whatever) to break even, spread
> it out however you may.

I don't buy these peculiar sums! The hours I spend shaving a few ms off
a program run (note ms and not just ns) are real hours that I could
have spent doing something else -- adding a feature, fixing a bug, going
home on time. The few ms shaved off each run might not even be noticed
by the user. Even if a billion users gain these extra milliseconds a
hundred times a day, they don't sum to anything other than zero (or to
something very close to zero)!

Of course there are times when it does matter, but it can't simply be
decided by adding up time spent and time gained.

<snip>
--
Ben.

Ben Bacarisse, Jan 25, 2011
12. ### Bart van Ingen Schenau (Guest)

pozz Wrote:
> const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
> i = *(int *)&data[1];
> is translated in:
> MOVW A, _data+1
> MOVW @RW0, A
> With this example I understand that this processor can access int
> variables even at non aligned addresses.

That is not shown by the compiler output.
The compiler is not required to check (and account) for unaligned memory
access; that is your task as a programmer. The fact that the compiler emits
instructions to perform unaligned data access does not mean this access will
be performed correctly by the processor.
> And the assignment is very
> fast: just two move instructions.
> While:
> const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
> i = ((int)data[1] << 8) + ((int)data[2] << 0);
> is translated as:
> MOV A, _data+1
> SWAP
> MOV A, _data+2
> MOVW @RW0, A
> In this case I have 5 instructions, 3 of them are move. I think the
> second code is more portable, but more time consuming.

There are indeed more assembly instructions, but especially on low-end
processors, not all instructions take the same amount of time to execute. In
the end, it may turn out to be faster to do more, but simpler, instructions.
The second code is definitely much more portable, given that it is
guaranteed to work in all cases, while the first code only works with a
big-endian processor that supports unaligned loads.
Incidentally, the processor documentation you posted in a follow-up
indicates your target processor is little-endian, so for that processor the
two samples produce different results.
> This is the reason for my first question: how can I write code that is
> at the same time "full portable" (or almost portable) on every
> CPU/compiler and fast (for example, using the misaligned access feature
> as in my CPU)?
> After discussing with you, I think there isn't a solution for that
> question.

"fully portable" and "the fastest code possible for all platforms" are two
goals that don't go together all that well. To squeeze the last
clock-cycles, you always have to resort to hand-tuned code that is specific
to the target processor/compiler combination or possibly even assembly code.
> >> Keep in mind that I use a low-end embedded processor with very limited
> >> resources compared with x86 desktop processors.


Just for reference, I have worked on embedded projects with similar low-end
processors where we used the byte-shifting technique a lot (communication
between internal components was done with messages. Messages were defined as
big-endian encoded, the processor was little-endian, so every multi-byte
value passed between components had to be encoded and decoded with byte
shifts). The shifting to convert between little-endian and big-endian (and
vice versa) has never caused any performance problems.
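A sketch of the kind of encode/decode helpers such a message layer typically uses (the names are mine, not from the project described above):

```c
#include <stdint.h>

/* Store a 16-bit value into a buffer in big-endian (message) order,
 * regardless of the host CPU's endianness. */
static void be16_put(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v >> 8);
    p[1] = (unsigned char)(v & 0xFF);
}

/* Read a 16-bit big-endian value back out of the buffer. */
static uint16_t be16_get(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}
```

Every multi-byte field in a message goes through a pair like this at the boundary, and the rest of the code never sees a raw wire byte.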
Bart v Ingen Schenau

Bart van Ingen Schenau, Jan 25, 2011
13. ### Ben Bacarisse (Guest)

"Bartc" <> writes:

> "Ben Bacarisse" <> wrote in message
> news:...
>> Eric Sosman <> writes:

>
>>> You still need 6.8 trillion (or whatever) to break even, spread
>>> it out however you may.

>>
>> I don't buy these peculiar sums! The hours I spend shaving a few ms off
>> a program run (note ms and not just ns) are real hours that I could
>> have spent doing something else -- adding a feature, fixing a bug, going
>> home on time. The few ms shaved off each run might not even be noticed
>> by the user. Even if a billion users gain these extra milliseconds a
>> hundred times a day, they don't sum to anything other than zero (or to
>> something very close to zero)!

>
> A billion times 1msec times 100 times a day comes to about 3 years'
> savings -- per day. One entire lifetime saved per month.

You don't really believe that, do you? If you do that's fine -- we can
just agree to differ, but maybe you were just doing the sums to show how
it adds up (if one were to choose to do so).

>> Of course there are times when it does matter, but it can't simply be
>> decided by adding up time spent and time gained.

>
> I just tried this:
>
> for (i=0; i<N; ++i) sum += p[i];
>
> p is an array of integers. Misaligning p by 1 byte on an x86 makes this code
> 20% slower.

That's just one data point. I'm not disputing it, but I see different
numbers; and just by quoting a number it can suggest a degree of
validity that is often not warranted.

> That may or may not have significance for the application. But
> if you are writing a library function (where N is a parameter) or are
> responsible for allocating p, then it's something that needs to be
> considered, and deserves some time spent on it.
>
> But if you know it will only ever save 20% (strictly, 16.7%) of 1% of the
> runtime, then perhaps it can wait (unless the idea of a misaligned array
> grates, then you might work on it anyway..)

--
Ben.

Ben Bacarisse, Jan 25, 2011
14. ### Bartc (Guest)

"Eric Sosman" <> wrote in message
news:ihnu1r\$833\$-september.org...
> On 1/25/2011 5:30 AM, Bartc wrote:
>> [...]
>> A billion times 1msec times 100 times a day comes to about 3 years'
>> savings -- per day. One entire lifetime saved per month.

>
> ... and if you can't see the absurdity inherent in that
> calculation, you ought to give up programming and go into politics.

Looks like I need to go into politics then.

BTW what *is* the absurdity? You mean saving 3 years, in only one day?
That's spread over more than one person!

(The figures were Ben's, who claimed they added to near zero. Changing the
billion to a million, and the 1msec to 1 second, makes them more realistic
(some products I use *do* have a response time of a second or more, and are
manufactured in the hundreds of thousands), but the resulting saving is the
same.)

--
Bartc

Bartc, Jan 25, 2011
15. ### Seebs (Guest)

On 2011-01-25, Ben Bacarisse <> wrote:
> I don't buy these peculiar sums! The hours I spend shaving a few ms off
> a program run (note ms and not just ns) are real hours that I could
> have spent doing something else -- adding a feature, fixing a bug, going
> home on time. The few ms shaved off each run might not even be noticed
> by the user. Even if a billion users gain these extra milliseconds a
> hundred times a day, they don't sum to anything other than zero (or to
> something very close to zero)!

In particular: If I get my command prompt back after running a command
faster than the refresh cycle of my monitor, it really doesn't matter whether
you can make it faster, from an interactive perspective. Now, if it's
scriptable, milliseconds might matter, but even then, often not.

Absolutely, by far, the most important thing is to measure performance
first, and look closely at what you're doing.

I once spent a day or two hand-tuning some vectorized code, eliminating
branches (which were expensive), caching trig results, and so on. I got
a 30% improvement in the performance of the code I was working on.

Then I added a check for whether two adjacent values in something that
would be expanded repeatedly (iterative fractals) were the same value, because
if they were, they'd expand into N of the same value, then N^2, and so on.

I got a couple percent difference in performance on small cases, well over
a factor of two on the largest cases that had been possible in main memory
before the change, and I could do things about 2x-3x as large in main memory,
which resulted in about a 50x improvement on some larger cases.

Five minutes spent doing the right thing is WAY more effective than a day
or two spent doing the wrong thing.

-s
--
Copyright 2010, all wrongs reversed. Peter Seebach /
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
I am not speaking for my employer, although they do rent some of my opinions.

Seebs, Jan 26, 2011
16. ### Eric Sosman (Guest)

On 1/25/2011 5:30 AM, Bartc wrote:
> [...]
> A billion times 1msec times 100 times a day comes to about 3 years'
> savings -- per day. One entire lifetime saved per month.

... and if you can't see the absurdity inherent in that
calculation, you ought to give up programming and go into politics.

--
Eric Sosman

Eric Sosman, Jan 26, 2011
17. ### Eric Sosman (Guest)

On 1/25/2011 9:06 AM, Bart van Ingen Schenau wrote:
> pozz Wrote:
>> const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
>> i = *(int *)&data[1];
>> is translated in:
>> MOVW A, _data+1
>> MOVW @RW0, A
>> With this example I understand that this processor can access int
>> variables even at non aligned addresses.

> That is not shown by the compiler output.
> The compiler is not required to check (and account) for aligned memory
> access, that is your task as a programmer. The fact that the compiler emits
> instructions to perform unaligned data access does not mean this access will
> be performed correctly by the processor.

FWIW, I've observed four distinct behaviors for unaligned accesses
(on actual machines, not DeathStation 9000's):

0) No difference. What this really means is that the machine has
no alignment requirements to begin with: An 8-bit Z80 can fetch
a 16-bit "double length" integer from any address, odd or even.

1) The hardware performs the access, but takes longer and may lose
interesting properties like atomicity. Speed penalties run
from 25-100% (5/4 to twice the aligned time), often far worse
if the misalignment straddles a page boundary.

2) The hardware traps. Sometimes the trap crashes the program,
sometimes a software trap handler emulates it. The trap handler
takes 100x to 2000x as much time as an aligned access, more if
page boundaries are crossed.

3) The hardware runs merrily at full speed, but silently accesses
the wrong data. The machine on which I saw this behavior just
ignored the "offending" low-order address bits: If you accessed
a four-byte-aligned datum at binary address xx...x1010, the
machine happily used xx...x1000 instead. Mysteries ensued ...

This is just what I've personally run across: There are probably
even odder things out there that I haven't yet encountered.

--
Eric Sosman

Eric Sosman, Jan 26, 2011
18. ### Ben BacarisseGuest

"Bartc" <> writes:

> "Eric Sosman" <> wrote in message
> news:ihnu1r\$833\$-september.org...
>> On 1/25/2011 5:30 AM, Bartc wrote:
>>> [...]
>>> A billion times 1msec times 100 times a day comes to about 3 years'
>>> savings -- per day. One entire lifetime saved per month.

>>
>> ... and if you can't see the absurdity inherent in that
>> calculation, you ought to give up programming and go into politics.

>
> Looks like I need to go into politics then.
>
> BTW what *is* the absurdity? You mean saving 3 years, in only one day?
> That's spread over more than one person!
>
> (The figures were Ben's, who claimed they added to near zero. Changing the
> billion to a million, and the 1msec to 1 second, makes them more realistic
> (some products I use *do* have a response time of a second or more, and are
> manufactured in the hundreds of thousands), but the resulting saving is the
> same.)

Let's say you have a spare hour in which to read. Do you care if it is
from 12 to 1pm or if it is made up of 3600 separate seconds scattered
throughout the day? It's the same total time after all. I contend that
time does not always add up in the obvious way. The fractions gained
here and there can't always be used as if they were run together.

--
Ben.

Ben Bacarisse, Jan 26, 2011
19. ### Eric SosmanGuest

[OT] Re: Data alignment and endianess

On 1/26/2011 4:43 AM, christian.bau wrote:
> On Jan 25, 4:29 am, Eric Sosman<> wrote:
>
>> IMHO they bother with lots of stuff that's pointless, in pursuit
>> of a better SPECKmock than the next guy. But even so: The CPU maven,
>> the compiler wizard, even the library fabricator -- all these folks
>> face the prospect that their stuff really *will* run trillions of
>> times. The O.P. has offered *no* reason to think he's at that sort
>> of level, not even within several orders of magnitude.

>
> There is a very simple argument: Whenever you handle data coming from
> the network, or going to the network, there will be some network I/O
> just before or just after the handling of the data. That network I/O
> will dwarf whatever execution time the handling code has.

Usually, maybe, but certainly not always. 10Gb/s network links
are quite common, and 100Gb/s links are not uncommon in datacenter
backbones. In a previous life (less than two years ago!) I was
trying to help a customer whose CPUs were unable to keep up with
the network's interrupt load; the solution was to put the interface
in a non-interrupting mode and dedicate an entire core to polling
the device!

> Networking code doesn't get fast by doing "clever" and possibly non-
> portable things at the lowest level. It gets fast by doing things
> right at a higher level.

A good deal of low-level cleverness gets expended on things like
scatter-gather I/O to eliminate buffer copying.

--
Eric Sosman

Eric Sosman, Jan 26, 2011
20. ### James KuyperGuest

On 01/25/2011 11:24 PM, Ben Bacarisse wrote:
> "Bartc"<> writes:
>
>> "Eric Sosman"<> wrote in message
>> news:ihnu1r\$833\$-september.org...
>>> On 1/25/2011 5:30 AM, Bartc wrote:
>>>> [...]
>>>> A billion times 1msec times 100 times a day comes to about 3 years'
>>>> savings -- per day. One entire lifetime saved per month.
>>>
>>> ... and if you can't see the absurdity inherent in that
>>> calculation, you ought to give up programming and go into politics.

>>
>> Looks like I need to go into politics then.
>>
>> BTW what *is* the absurdity? You mean saving 3 years, in only one day?
>> That's spread over more than one person!
>>
>> (The figures were Ben's, who claimed they added to near zero. Changing the
>> billion to a million, and the 1msec to 1 second, makes them more realistic
>> (some products I use *do* have a response time of a second or more, and are
>> manufactured in the hundreds of thousands), but the resulting saving is the
>> same.)

>
> Let's say you have a spare hour in which to read. Do you care if it is
> from 12 to 1pm or if it is made up of 3600 separate seconds scattered
> throughout the day? It's the same total time after all. I contend that
> time does not always add up in the obvious way. The fractions gained
> here and there can't always be used as if they were run together.

Let's say that you have a program that takes 10 seconds to perform an
operation that you have to wait for before you can do anything else
useful. Thanks to some 20,000 separate decisions made by 5000 different
hardware and software designers about how each individual hardware and
software component of your system should be designed, that operation
does NOT take a full minute to execute; those individual decisions
saved you an average of only 2.5 milliseconds each. If each of those
designers had taken the attitude that "no one cares about saving 2.5
milliseconds", most of those decisions might have been made differently.
Do you think that would have been the right thing for those designers to
have done?

--
James Kuyper

James Kuyper, Jan 26, 2011