Bug/Gross InEfficiency in HeathField's fgetline program

  • Thread starter Antoninus Twink

Keith Thompson

I quite agree. It makes sense to use size_t in libraries for generic
string-handling functions and the like, but if you've got an array of
employees or chess pieces then int is the natural choice.

My current employer, the University of California, has more than 32767
employees (and my employee number requires at least 20 bits).

Granted you're not likely to be running software that deals with UC
employees on a 16-bit system (though I'm sure it was done in the
past), but you still need to be careful about your assumptions.

Perhaps if size_t had been called count_t, and described as a type
capable of holding counts of objects, there wouldn't be so many
complaints about it.
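
(A minimal C sketch of the assumption being warned about, using an
assumed employee count purely for illustration: on an implementation
where int is 16 bits, INT_MAX may be only 32767, so an int cannot hold
such a count at all.)

#include <limits.h>
#include <stdio.h>

/* Illustration only: check whether a count of this size fits in an int
   on the current implementation.  The figure 200000 is an assumption
   for the example, not UC's actual headcount. */
int main(void)
{
    long employees = 200000L;

    if (employees > INT_MAX)
        printf("int cannot hold this count here (INT_MAX = %d)\n", INT_MAX);
    else
        printf("int is wide enough on this implementation\n");
    return 0;
}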
 

Keith Thompson

Richard said:
In what way? 0 to N. Nothing to worry about. N is well within the limits
99.9999% of the time. And you need to be aware if it isn't in order to
specifically move to size_t or long or whatever.

100% is better than 99.9999%.

And probabilities like that tend not to apply to software. If you're
convinced that something will work 99.9999% of the time, some
malicious hacker could find a way to exploit the 0.0001% case.
Used a different name?

Too late.
 

Flash Gordon

Malcolm McLean wrote, On 28/10/07 22:13:
It *is* reading and writing a lot of data. For example, it was over 7
years ago that the 2GB file size limit became a problem and it is
working with a lot of files. Running a report can easily involve reading
a million records and there can be a significant number of reports being
run on data which is being changed during peak periods.
Even in the very worst case you have to be incredibly unlucky to slow
down the program by more than 50%. (I was going to say "can't", but
that's not strictly true - you can construct a scenario in which you
thrash the cache by doubling int size). That might sound like a lot to an
electrical engineer, but in software terms it really is on the boundary
of a worthwhile optimisation - if we're not doubling speed at least we
aren't really speeding things up.

Most customers would complain loudly about a 10% slow down.
However that represents the server at peak capacity.

Due to the usage the time when it is running at peak is *also* the time
when performance is most important. Outside peak times most users would
not notice a 50% slow down (a few users on each server would), but at
peak times almost all users would notice a 5% slow down.
But if it has got
hundreds of simultaneous users, how often is the system running at peak
capacity?

Every month end. An even bigger peak every year end. Other peaks at
project start-ups and close-downs.
Doubling int sizes would doubtless have some impact on performance. But
not the sort that is being suggested.

Well, the slow-down does not have to be much before it has major impact
in some areas. There are always people running close to the limit.

There is also the impact on storage requirements, and the time to
convert the data currently stored on disk (10s of GB of data for small
customers, 100s of GB or terabytes for larger customers, so just the
time reading and then writing would be significant).
Then I don't believe that most of
the calculations are of amounts of money, anyway.

Have you seen the code? No? Well, I have and there is not much indexing
since the data that could be stored in arrays is actually stored in the
database on disk. We have a lot of manipulation of quantities other than
money, but they are also quantities we have to keep an exact tally of
and so use integers or fixed-point numbers.
I think the system
spends most of its time copying data from one location to another.

Well, the larger the data, the longer it takes to copy, so
assuming you are correct (and if you count copying between memory and
disk you are for the application I am referring to) then you have just
given a reason why increasing the size of an int is a bad idea.
It could be farming out memory to disk, however. In which case doubling
memory usage would have a significant impact.

Yes, it would MASSIVELY increase the cost of the servers. I am not
talking about desktop machines here, I am talking about nice high spec
servers currently being used where customers at a minimum buy as much
RAM as they can without hitting RAM being a ridiculously large cost and
lots of disk in a RAID set up to maximise bandwidth.
IO is still largely latency
bound, but the number of writes would double.

Increasing the number of reads and writes by 10% would be unpopular
unless it brought some other massive benefit, and your proposed change
would not provide my customers with any benefit, only costs.
 

Dann Corbit

Flash Gordon said:
[...]
Increasing the number of reads and writes by 10% would be unpopular unless
it brought some other massive benefit, and your proposed change would not
provide my customers with any benefit, only costs.

I suspect that MonetDB may be helpful for your application.
Frequently, read-based problems benefit hugely from in-memory column
storage (e.g. 10-100x faster).
You would need at least one machine with large memory, though.

http://monetdb.cwi.nl/
 

Tor Rustad

Malcolm said:
In reality we would use a floating point rather than an integer type.

No, most financial institutions are not even allowed to use
floating-point for financial calculations. In fact, I have only seen
fixed-point calculations when dealing with money.
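
(A minimal C sketch of the fixed-point style described above; the type
name, the helper, and the round-half-up rule are assumptions for
illustration, not any institution's actual practice.)

#include <stdio.h>

/* Amounts are held as a whole number of cents in a wide integer, so
   addition and comparison are exact.  Rounding is done explicitly. */
typedef long long cents_t;

/* Apply a rate given in basis points (1/100 of a percent) to a
   non-negative amount, rounding half up. */
static cents_t apply_bp(cents_t amount, long basis_points)
{
    return (amount * basis_points + 5000) / 10000;  /* 10000 bp = 100% */
}

int main(void)
{
    cents_t price = 19999;              /* 199.99 */
    cents_t fee = apply_bp(price, 825); /* 8.25% of the price */

    printf("price %lld.%02lld, fee %lld.%02lld\n",
           price / 100, price % 100, fee / 100, fee % 100);
    return 0;
}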
 

Tor Rustad

Ben said:
Your software is disk I/O *bandwidth* bound? You must be reading
or writing massive amounts of storage, then. My guess would have
been that it was disk I/O *latency* bound, in which case doubling
the size of the data wouldn't make much difference.

By I/O latency, you mean disk seek time + rotational delay?

The first rule in a transaction environment is to partition the DB
across multiple disks and use mirrors with independent I/O paths. The
disk controller could be an important bottleneck here.
 

Tor Rustad

Dann said:
[...]
Increasing the number of reads and writes by 10% would be unpopular unless
it brought some other massive benefit, and your proposed change would not
provide my customers with any benefit, only costs.

I suspect that MonetDB may be helpful for your application.
Frequently, read based problems have a huge benefit from in-memory column
storage (e.g. 10-100x faster).
You would need at least one machine with large memory, though.

I rather suspect nobody would really want to replace their financial
DB2 or Nonstop DB systems. Changing the DB infrastructure on a
business-critical system could be very expensive.

Performance requirements might be important, but correctness, risk,
fault-tolerance, disaster-recovery, scalability, fail-over-time etc. are
typically more important, at least if Gordon is not talking about some
back-office system, but rather an online transaction environment, like
e.g. a stock-exchange or an EFTPOS host.

There are no simple answers to tuning such high-integrity systems, and
the usual lesson applies: first perform measurements to identify the
bottlenecks. If there is no benefit, don't do it.
 

user923005

Tor Rustad said:
I rather suspect nobody would really want to replace their financial
DB2 or Nonstop DB systems. Changing the DB infrastructure on a
business-critical system could be very expensive.
[...]

I am not suggesting a replacement. I am suggesting a tool for OLAP.
 

Dann Corbit

Tor Rustad said:
No, most financial institutions are not even allowed to use floating-point
for financial calculations. In fact, I have only seen fixed point
calculations, when dealing with money.

Even for time value of money calculations (e.g. compound interest)?
 

Larry__Weiss

Tor said:
No, most financial institutions are not even allowed to use
floating-point for financial calculations. In fact, I have only seen
fixed point calculations, when dealing with money.

Sometimes you have to apply an allocation to multiple vested partners
(by percentage assignment, for example). You use floating point in the
calculations, but you account for every penny nonetheless.
This happens in calculations for partnerships and also in combined
mineral leases and I'm sure in myriad other cases I'm not familiar with.

Many common tax return computations use floating point calculations (at
least for U.S. income taxes).

Sales taxes are usually done as a percentage of an amount.

- Larry
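
(A minimal C sketch of the kind of allocation Larry describes: raw
shares computed in floating point, but every penny accounted for. The
largest-remainder rule and the figures are assumptions for illustration;
real agreements may specify a different rule.)

#include <math.h>
#include <stdio.h>

/* Split total_cents among partners by percentage, using floating point
   for the raw shares but guaranteeing that the allocated cents sum
   exactly to the total. */
#define NPARTNERS 3

int main(void)
{
    long long total_cents = 100001;                /* 1000.01 */
    double pct[NPARTNERS] = { 0.50, 0.30, 0.20 };
    long long share[NPARTNERS];
    double remainder[NPARTNERS];
    long long allocated = 0;

    for (int i = 0; i < NPARTNERS; i++) {
        double raw = total_cents * pct[i];
        share[i] = (long long)floor(raw);          /* whole cents */
        remainder[i] = raw - floor(raw);
        allocated += share[i];
    }

    /* Hand the leftover pennies, one at a time, to the largest remainders. */
    while (allocated < total_cents) {
        int best = 0;
        for (int i = 1; i < NPARTNERS; i++)
            if (remainder[i] > remainder[best])
                best = i;
        share[best]++;
        remainder[best] = 0.0;
        allocated++;
    }

    for (int i = 0; i < NPARTNERS; i++)
        printf("partner %d: %lld.%02lld\n", i, share[i] / 100, share[i] % 100);
    return 0;
}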
 

Richard Heathfield

Dann Corbit said:
Even for time value of money calculations (e.g. compound interest)?

In the insurance industry (at least in the UK), double-precision floating
point is used for compound interest and related calculations. The rule of
thumb for correctness is "if the program agrees with the actuaries'
calculations to eleven decimal places, it is correct". And if it agrees to
ten decimal places over a sufficiently long period, it is likely to be
considered "sweet".

Many years ago, I had to check with an actuary (hi James!) about whether my
interpretation of a bit of actuarial tomfoolery in a specification was or
was not a correct reading in accordance with the intent of the writer. He
looked at it, and eventually said, "yeah, it's sweet". I had occasionally
heard this expression 'it's sweet' from other actuaries, and it had even
leaked over into the programming team, who used it with the sense 'it's
correct'. But since I had never heard it defined, I decided to ask James
what it meant.

'Sweet', he said, 'means it's not actually *right* as such, but nobody
apart from you and me will ever know.'
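
(A minimal C sketch of the acceptance rule described above, in double
precision and with made-up figures; the second computation merely stands
in for "the actuaries' calculation".)

#include <math.h>
#include <stdio.h>

/* Compound growth in double precision.  The "agrees to eleven decimal
   places" rule is expressed as a tolerance on the absolute difference;
   agreement to ten decimal places counts as "sweet". */
static double compound(double principal, double annual_rate, int years)
{
    return principal * pow(1.0 + annual_rate, years);
}

int main(void)
{
    double ours = compound(1000.0, 0.05, 25);

    /* Stand-in for the actuaries' figure: the same quantity computed
       by repeated multiplication. */
    double reference = 1000.0;
    for (int y = 0; y < 25; y++)
        reference *= 1.05;

    double diff = fabs(ours - reference);
    if (diff < 0.5e-11)
        printf("correct: agrees to eleven decimal places\n");
    else if (diff < 0.5e-10)
        printf("sweet: agrees to ten decimal places\n");
    else
        printf("difference is %.12f\n", diff);
    return 0;
}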
 

Richard Heathfield

Malcolm McLean said:

if we're not doubling speed at least we aren't really speeding things up.

Wrong. I can cite a 20% reduction in run-time in a (fairly hefty) program
that had taken 25 seconds to produce one particular kind of quotation.
This optimisation (which resulted from the rewriting of just *one* line of
code) brought the runtime to within (barely) acceptable performance
levels. Asking a broker and his customer to wait 20 seconds for a
quotation was considered to be teetering on the edge between their
patience and their impatience. (I should stress that this was on 8086s.
Nowadays, obviously the broker and his customer would be much more
patient, since everyone knows it takes computers ages to do things.)

Had I not found this optimisation, I don't think it's unreasonable to say
that MumbleCo would have been at a severe disadvantage compared to those
of its competitors that also supplied brokers with quotation software.



Flush with success at knocking a fifth off the runtime, I went gunning for
the next hotspot, and managed to knock another 0.000017% (considerably
less than a millisecond) off the runtime, at the expense of greatly
complicating the code.
 

Charlie Gordon

Richard Heathfield said:
[...]
Flush with success at knocking a fifth off the runtime, I went gunning for
the next hotspot, and managed to knock another 0.000017% (considerably
less than a millisecond) off the runtime, at the expense of greatly
complicating the code.

Perfection is achieved, not when there is nothing more to add, but when
there is nothing left to take away. (Antoine de Saint-Exupéry)
 

Richard

Keith Thompson said:
100% is better than 99.9999%.

Ridiculous comment.

So do you use BIGINT libraries when you have any line which says

x+=y ?

My 99.9999% thing was about when the programmer *knows* the limits of the
operation are within "int" capabilities, and knows when they are not so
that he can specifically move to size_t. Yes, by using size_t all the
time he doesn't even have to think about it. But he's a pretty poor
programmer if he is writing loops etc. without some idea of the scales
involved.
And probabilities like that tend not to apply to software. If you're
convinced that something will work 99.9999% of the time, some
malicious hacker could find a way to exploit the 0.0001% case.


Too late.

Yes. Obviously. It's just a civil conversation; I'm saying it is an ugly
and misleading name, IMO, for such a supposedly "core" type.
 

Eric Sosman

Larry__Weiss wrote, On 10/29/07 22:14:
[...]
Many common tax return computations use floating point calculations (at
least for U.S. income taxes).

Sales taxes are usually done as a percentage of an amount.

Sales taxes are usually *stated* as a percentage
but *computed* by table-lookup.
 

Al Balmer

Eric Sosman said:
[...]
Sales taxes are usually *stated* as a percentage
but *computed* by table-lookup.

Not in our software. Given the fact that we handle multiple tax rates
in multiple taxing jurisdictions, there would be an awful lot of
tables.
 

Flash Gordon

Tor Rustad wrote, On 29/10/07 23:50:
Dann said:
[...]
Increasing the number of reads and writes by 10% would be unpopular
unless it brought some other massive benefit, and your proposed
change would not provide my customers with any benefit, only costs.

I suspect that MonetDB may be helpful for your application.
Frequently, read based problems have a huge benefit from in-memory
column storage (e.g. 10-100x faster).
You would need at least one machine with large memory, though.

I rather suspect nobody would really want to replace their financial
DB2 or Nonstop DB systems. Changing the DB infrastructure on a
business-critical system could be very expensive.

It would be, especially as we have not yet gone as far as using an SQL
engine. We are actually using DIsam.

We are actually looking at moving to SQL databases as the back end and
have a long-running project working on the port, but there are specific
commercial reasons for the SQL engines we need to support.
Performance requirements might be important, but correctness, risk,
fault-tolerance, disaster-recovery, scalability, fail-over-time etc. are
typically more important, at least if Gordon is not talking about some
back-office system, but rather an online transaction environment, like
e.g. a stock-exchange or an EFTPOS host.

It is a major back-office system. It is the system that a number of
large companies are using for entering all their requisitions, orders,
invoices and goods received notes, matching invoices to GRNs and lots of
other stuff. Also lots of reporting on all this data. It is business
critical for the companies using it.
There are no simple answers to tuning such high-integrity systems, and
the usual lesson applies: first perform measurements to identify the
bottlenecks. If there is no benefit, don't do it.

Actually, the first rule is not to even consider looking at it or
bother doing any measurements unless there is known to be a good
reason to start looking. As long as the code is fast enough (which it is
currently) we will concentrate on writing code to meet new and changed
requirements as correctly as possible and fixing the existing bugs. We
will only worry about measuring when there is a performance problem
significant enough to be worth the costs of investigating it.
 

Malcolm McLean

Ben Bacarisse said:
Yes, but has anyone actually advised size_t in a case when the array
index is certainly known to be less than INT_MAX?
I have. The problem is that a lot of arrays have no hard upper bound but
are in practice quite small. It's quite hard to say what the maximum
number of children in a class is, for instance, but 255 would be
unmanageable.
The temptation is to use int for Nchildren. However, you'll soon create a
situation where you've got a mix of signed and unsigned, int and size_t,
and you've got to juggle two types. It's manageable until you need to pass
indices by indirection; then all of a sudden the system gums up.
The '_t' is a bit ugly but what should the committee have done? I've
not seen a good alternative.
It should have realised that it wasn't fixing up a little problem with
malloc(). It was making a fundamental change to the language. Very few
size_t's hold sizes in terms of bytes of memory.
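
(A minimal C sketch of the "gums up" situation described above: once an
index is passed by indirection, int and size_t stop being
interchangeable, so a temporary of the right type, and strictly a range
check, is needed at each call site. The function and names are invented
for illustration.)

#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Reports the position of c in s through a size_t*, as a library
   function naturally would; returns 0 if not found. */
static int find_char(const char *s, char c, size_t *pos)
{
    const char *p = strchr(s, c);
    if (p == NULL)
        return 0;
    *pos = (size_t)(p - s);
    return 1;
}

int main(void)
{
    const char *record = "name:value";
    int column = -1;        /* the surrounding code keeps indices in int... */
    size_t where;           /* ...so a size_t temporary is needed here */

    if (find_char(record, ':', &where) && where <= INT_MAX)
        column = (int)where;

    printf("column = %d\n", column);
    return 0;
}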
 

Malcolm McLean

Richard Heathfield said:
Malcolm McLean said:



Wrong. I can cite a 20% reduction in run-time in a (fairly hefty) program
that had taken 25 seconds to produce one particular kind of quotation.
Another case is interactive graphics. It doesn't matter how long your
calculation takes as long as you make the next frame. If you miss it by a
few microseconds, you've got all of 17 milliseconds to play with until the
next one comes along.

However generally I regard a 100% speedup as about the threshold for code
worth my time optimising.
 
