Bug/Gross InEfficiency in HeathField's fgetline program

  • Thread starter Antoninus Twink

Keith Thompson

I quite agree. It makes sense to use size_t in libraries for generic
string-handling functions and the like, but if you've got an array of
employees or chess pieces then int is the natural choice.

My current employer, the University of California, has more than 32767
employees (and my employee number requires at least 20 bits).

Granted you're not likely to be running software that deals with UC
employees on a 16-bit system (though I'm sure it was done in the
past), but you still need to be careful about your assumptions.

Perhaps if size_t had been called count_t, and described as a type
capable of holding counts of objects, there wouldn't be so many
complaints about it.
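
(A minimal C sketch of the assumption being warned about, using an
assumed employee count purely for illustration: on an implementation
where int is 16 bits, INT_MAX may be only 32767, so an int cannot hold
such a count at all.)

#include <limits.h>
#include <stdio.h>

/* Illustration only: check whether a count of this size fits in an int
   on the current implementation.  The figure 200000 is an assumption
   for the example, not UC's actual headcount. */
int main(void)
{
    long employees = 200000L;

    if (employees > INT_MAX)
        printf("int cannot hold this count here (INT_MAX = %d)\n", INT_MAX);
    else
        printf("int is wide enough on this implementation\n");
    return 0;
}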
 

Keith Thompson

Richard said:
In what way? 0 to N. Nothing to worry about. N is well within the limits
99.9999% of the time. And you need to be aware if it isn't in order to
specifically move to size_t or long or whatever.

100% is better than 99.9999%.

And probabilities like that tend not to apply to software. If you're
convinced that something will work 99.9999% of the time, some
malicious hacker could find a way to exploit the 0.0001% case.
Used a different name?

Too late.
 

Flash Gordon

Malcolm McLean wrote, On 28/10/07 22:13:
It *is* reading and writing a lot of data. For example, it was over 7
years ago that the 2GB file size limit became a problem and it is
working with a lot of files. Running a report can easily involve reading
a million records and there can be a significant number of reports being
run on data which is being changed during peak periods.
Even in the very worst case you have to be incredibly unlucky to slow
down the program by more than 50%. (I was going to say "can't", but
that's not strictly true - you can construct a scenario in which you
thrash the cache by doubling int size). That might sound like a lot to an
electrical engineer, but in software terms it really is on the boundary
of a worthwhile optimisation - if we're not doubling speed at least we
aren't really speeding things up.

Most customers would complain loudly about a 10% slow down.
However that represents the server at peak capacity.

Due to the usage the time when it is running at peak is *also* the time
when performance is most important. Outside peak times most users would
not notice a 50% slow down (a few users on each server would), but at
peak times almost all users would notice a 5% slow down.
But if it has got
hundreds of simultaneous users, how often is the system running at peak
capacity?

Every month end. An even bigger peak every year end. Other peaks at
project start-ups and close-downs.
Doubling int sizes would doubtless have some impact on performance. But
not the sort that is being suggested.

Well, the slow-down does not have to be much before it has major impact
in some areas. There are always people running close to the limit.

There is also the impact on storage requirements, and the time to
convert the data currently stored on disk (10s of GB of data for small
customers, 100s of GB or terabytes for larger customers, so just the
time reading and then writing would be significant).
Then I don't believe that most of
the calculations are of amounts of money, anyway.

Have you seen the code? No? Well, I have and there is not much indexing
since the data that could be stored in arrays is actually stored in the
database on disk. We have a lot of manipulation of quantities other than
money, but they are also quantities we have to keep an exact tally of
and so use integers or fixed-point numbers.
I think the system
spends most of its time copying data from one location to another.

Well, the larger the data, the longer it takes to copy, so
assuming you are correct (and if you count copying between memory and
disk you are for the application I am referring to) then you have just
given a reason why increasing the size of an int is a bad idea.
It could be farming out memory to disk, however. In which case doubling
memory usage would have a significant impact.

Yes, it would MASSIVELY increase the cost of the servers. I am not
talking about desktop machines here, I am talking about nice high spec
servers currently being used where customers at a minimum buy as much
RAM as they can without hitting RAM being a ridiculously large cost and
lots of disk in a RAID set up to maximise bandwidth.
IO is still largely latency
bound, but the number of writes would double.

Increasing the number of reads and writes by 10% would be unpopular
unless it brought some other massive benefit, and your proposed change
would not provide my customers with any benefit, only costs.
 

Dann Corbit

Flash Gordon said:
[...]
Increasing the number of reads and writes by 10% would be unpopular unless
it brought some other massive benefit, and your proposed change would not
provide my customers with any benefit, only costs.

I suspect that MonetDB may be helpful for your application.
Frequently, read-based problems benefit hugely from in-memory column
storage (e.g. 10-100x faster).
You would need at least one machine with large memory, though.

http://monetdb.cwi.nl/
 

Tor Rustad

Malcolm said:
In reality we would use a floating point rather than an integer type.

No, most financial institutions are not even allowed to use
floating-point for financial calculations. In fact, I have only seen
fixed-point calculations when dealing with money.
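
(A minimal C sketch of the fixed-point style described above; the type
name, the helper, and the round-half-up rule are assumptions for
illustration, not any institution's actual practice.)

#include <stdio.h>

/* Amounts are held as a whole number of cents in a wide integer, so
   addition and comparison are exact.  Rounding is done explicitly. */
typedef long long cents_t;

/* Apply a rate given in basis points (1/100 of a percent) to a
   non-negative amount, rounding half up. */
static cents_t apply_bp(cents_t amount, long basis_points)
{
    return (amount * basis_points + 5000) / 10000;  /* 10000 bp = 100% */
}

int main(void)
{
    cents_t price = 19999;              /* 199.99 */
    cents_t fee = apply_bp(price, 825); /* 8.25% of the price */

    printf("price %lld.%02lld, fee %lld.%02lld\n",
           price / 100, price % 100, fee / 100, fee % 100);
    return 0;
}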
 

Tor Rustad

Ben said:
Your software is disk I/O *bandwidth* bound? You must be reading
or writing massive amounts of storage, then. My guess would have
been that it was disk I/O *latency* bound, in which case doubling
the size of the data wouldn't make much difference.

By I/O latency, you mean disk seek time + rotational delay?

The first rule in a transaction environment is to partition the DB
across multiple disks and use mirrors with independent I/O paths. The
disk controller could be an important bottleneck here.
 

Tor Rustad

Dann said:
[...]
Increasing the number of reads and writes by 10% would be unpopular unless
it brought some other massive benefit, and your proposed change would not
provide my customers with any benefit, only costs.

I suspect that MonetDB may be helpful for your application.
Frequently, read based problems have a huge benefit from in-memory column
storage (e.g. 10-100x faster).
You would need at least one machine with large memory, though.

I rather suspect nobody would really want to replace their financial
DB2 or Nonstop DB systems. Changing the DB infrastructure on a
business-critical system could be very expensive.

Performance requirements might be important, but correctness, risk,
fault-tolerance, disaster-recovery, scalability, fail-over-time etc. are
typically more important, at least if Gordon is not talking about some
back-office system, but rather an online transaction environment, like
e.g. a stock-exchange or an EFTPOS host.

There are no simple answers to tuning such high-integrity systems, and
the usual lesson applies: first perform measurements to identify the
bottlenecks. If there is no benefit, don't do it.
 

user923005

Tor Rustad said:
I rather suspect nobody would really want to replace their financial
DB2 or Nonstop DB systems. Changing the DB infrastructure on a
business-critical system could be very expensive.
[...]

I am not suggesting a replacement. I am suggesting a tool for OLAP.
 

Dann Corbit

Tor Rustad said:
No, most financial institutions are not even allowed to use floating-point
for financial calculations. In fact, I have only seen fixed point
calculations, when dealing with money.

Even for time value of money calculations (e.g. compound interest)?
 

Larry__Weiss

Tor said:
No, most financial institutions are not even allowed to use
floating-point for financial calculations. In fact, I have only seen
fixed point calculations, when dealing with money.

Sometimes you have to apply an allocation to multiple vested partners
(by percentage assignment, for example). You use floating point in the
calculations, but you account for every penny nonetheless.
This happens in calculations for partnerships and also in combined
mineral leases and I'm sure in myriad other cases I'm not familiar with.

Many common tax return computations use floating point calculations (at
least for U.S. income taxes).

Sales taxes are usually done as a percentage of an amount.

- Larry
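
(A minimal C sketch of the kind of allocation Larry describes: raw
shares computed in floating point, but every penny accounted for. The
largest-remainder rule and the figures are assumptions for illustration;
real agreements may specify a different rule.)

#include <math.h>
#include <stdio.h>

/* Split total_cents among partners by percentage, using floating point
   for the raw shares but guaranteeing that the allocated cents sum
   exactly to the total. */
#define NPARTNERS 3

int main(void)
{
    long long total_cents = 100001;                /* 1000.01 */
    double pct[NPARTNERS] = { 0.50, 0.30, 0.20 };
    long long share[NPARTNERS];
    double remainder[NPARTNERS];
    long long allocated = 0;

    for (int i = 0; i < NPARTNERS; i++) {
        double raw = total_cents * pct[i];
        share[i] = (long long)floor(raw);          /* whole cents */
        remainder[i] = raw - floor(raw);
        allocated += share[i];
    }

    /* Hand the leftover pennies, one at a time, to the largest remainders. */
    while (allocated < total_cents) {
        int best = 0;
        for (int i = 1; i < NPARTNERS; i++)
            if (remainder[i] > remainder[best])
                best = i;
        share[best]++;
        remainder[best] = 0.0;
        allocated++;
    }

    for (int i = 0; i < NPARTNERS; i++)
        printf("partner %d: %lld.%02lld\n", i, share[i] / 100, share[i] % 100);
    return 0;
}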
 

Richard Heathfield

Dann Corbit said:
Even for time value of money calculations (e.g. compound interest)?

In the insurance industry (at least in the UK), double-precision floating
point is used for compound interest and related calculations. The rule of
thumb for correctness is "if the program agrees with the actuaries'
calculations to eleven decimal places, it is correct". And if it agrees to
ten decimal places over a sufficiently long period, it is likely to be
considered "sweet".

Many years ago, I had to check with an actuary (hi James!) about whether my
interpretation of a bit of actuarial tomfoolery in a specification was or
was not a correct reading in accordance with the intent of the writer. He
looked at it, and eventually said, "yeah, it's sweet". I had occasionally
heard this expression 'it's sweet' from other actuaries, and it had even
leaked over into the programming team, who used it with the sense 'it's
correct'. But since I had never heard it defined, I decided to ask James
what it meant.

'Sweet', he said, 'means it's not actually *right* as such, but nobody
apart from you and me will ever know.'
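
(A minimal C sketch of the acceptance rule described above, in double
precision and with made-up figures; the second computation merely stands
in for "the actuaries' calculation".)

#include <math.h>
#include <stdio.h>

/* Compound growth in double precision.  The "agrees to eleven decimal
   places" rule is expressed as a tolerance on the absolute difference;
   agreement to ten decimal places counts as "sweet". */
static double compound(double principal, double annual_rate, int years)
{
    return principal * pow(1.0 + annual_rate, years);
}

int main(void)
{
    double ours = compound(1000.0, 0.05, 25);

    /* Stand-in for the actuaries' figure: the same quantity computed
       by repeated multiplication. */
    double reference = 1000.0;
    for (int y = 0; y < 25; y++)
        reference *= 1.05;

    double diff = fabs(ours - reference);
    if (diff < 0.5e-11)
        printf("correct: agrees to eleven decimal places\n");
    else if (diff < 0.5e-10)
        printf("sweet: agrees to ten decimal places\n");
    else
        printf("difference is %.12f\n", diff);
    return 0;
}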
 

Richard Heathfield

Malcolm McLean said:

if we're not doubling speed at least we aren't really speeding things up.

Wrong. I can cite a 20% reduction in run-time in a (fairly hefty) program
that had taken 25 seconds to produce one particular kind of quotation.
This optimisation (which resulted from the rewriting of just *one* line of
code) brought the runtime to within (barely) acceptable performance
levels. Asking a broker and his customer to wait 20 seconds for a
quotation was considered to be teetering on the edge between their
patience and their impatience. (I should stress that this was on 8086s.
Nowadays, obviously the broker and his customer would be much more
patient, since everyone knows it takes computers ages to do things.)

Had I not found this optimisation, I don't think it's unreasonable to say
that MumbleCo would have been at a severe disadvantage compared to those
of its competitors that also supplied brokers with quotation software.



Flush with success at knocking a fifth off the runtime, I went gunning for
the next hotspot, and managed to knock another 0.000017% (considerably
less than a millisecond) off the runtime, at the expense of greatly
complicating the code.
 

Charlie Gordon

Richard Heathfield said:
[...]
Flush with success at knocking a fifth off the runtime, I went gunning for
the next hotspot, and managed to knock another 0.000017% (considerably
less than a millisecond) off the runtime, at the expense of greatly
complicating the code.

Perfection is achieved, not when there is nothing more to add, but when
there is nothing left to take away. (Antoine de Saint-Exupéry)
 

Richard

Keith Thompson said:
100% is better than 99.9999%.

Ridiculous comment.

So do you use BIGINT libraries when you have any line which says

x+=y ?

My 99.9999% thing was about when the programmer *knows* the limits of the
operation are within "int" capabilities, and knows when they are not so
that he can specifically move to size_t. Yes, by using size_t all the
time he doesn't even have to think about it. But he's a pretty poor
programmer if he is writing loops etc. without some idea of the scales
involved.
And probabilities like that tend not to apply to software. If you're
convinced that something will work 99.9999% of the time, some
malicious hacker could find a way to exploit the 0.0001% case.


Too late.

Yes. Obviously. It's just a civil conversation; I'm saying it is an ugly
and misleading name, IMO, for such a supposedly "core" type.
 

Eric Sosman

Larry__Weiss wrote, On 10/29/07 22:14:
[...]
Many common tax return computations use floating point calculations (at
least for U.S. income taxes).

Sales taxes are usually done as a percentage of an amount.

Sales taxes are usually *stated* as a percentage
but *computed* by table-lookup.
 

Al Balmer

Eric Sosman said:
[...]
Sales taxes are usually *stated* as a percentage
but *computed* by table-lookup.

Not in our software. Given the fact that we handle multiple tax rates
in multiple taxing jurisdictions, there would be an awful lot of
tables.
 

Flash Gordon

Tor Rustad wrote, On 29/10/07 23:50:
Dann said:
[...]
Increasing the number of reads and writes by 10% would be unpopular
unless it brought some other massive benefit, and your proposed
change would not provide my customers with any benefit, only costs.

I suspect that MonetDB may be helpful for your application.
Frequently, read based problems have a huge benefit from in-memory
column storage (e.g. 10-100x faster).
You would need at least one machine with large memory, though.

I rather suspect nobody would really want to replace their financial
DB2 or Nonstop DB systems. Changing the DB infrastructure on a
business-critical system could be very expensive.

It would be, especially as we have not yet gone as far as using an SQL
engine. We are actually using DIsam.

We are actually looking at moving to SQL databases as the back end and
have a long-running project working on the port, but there are specific
commercial reasons for the SQL engines we need to support.
Performance requirements might be important, but correctness, risk,
fault-tolerance, disaster-recovery, scalability, fail-over-time etc. are
typically more important, at least if Gordon is not talking about some
back-office system, but rather an online transaction environment, like
e.g. a stock-exchange or an EFTPOS host.

It is a major back-office system. It is the system that a number of
large companies are using for entering all their requisitions, orders,
invoices and goods received notes, matching invoices to GRNs and lots of
other stuff. Also lots of reporting on all this data. It is business
critical for the companies using it.
There are no simple answers to tuning such high-integrity systems, and
the usual lesson applies: first perform measurements to identify the
bottlenecks. If there is no benefit, don't do it.

Actually, the first rule is not to even consider looking at it or
bother doing any measurements unless there is known to be a good
reason to start looking. As long as the code is fast enough (which it is
currently) we will concentrate on writing code to meet new and changed
requirements as correctly as possible and fixing the existing bugs. We
will only worry about measuring when there is a performance problem
significant enough to be worth the costs of investigating it.
 

Malcolm McLean

Ben Bacarisse said:
Yes, but has anyone actually advised size_t in a case when the array
index is certainly known to be less than INT_MAX?
I have. The problem is that a lot of arrays have no hard upper bound but
are in practice quite small. It's quite hard to say what the maximum
number of children in a class is, for instance, but 255 would be
unmanageable.
The temptation is to use int for Nchildren. However, you'll soon create a
situation where you've got a mix of signed and unsigned, int and size_t,
and you've got to juggle two types. It's manageable until you need to pass
indices by indirection; then all of a sudden the system gums up.
The '_t' is a bit ugly but what should the committee have done? I've
not seen a good alternative.
It should have realised that it wasn't fixing up a little problem with
malloc(). It was making a fundamental change to the language. Very few
size_t's hold sizes in terms of bytes of memory.
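
(A minimal C sketch of the "gums up" situation described above: once an
index is passed by indirection, int and size_t stop being
interchangeable, so a temporary of the right type, and strictly a range
check, is needed at each call site. The function and names are invented
for illustration.)

#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Reports the position of c in s through a size_t*, as a library
   function naturally would; returns 0 if not found. */
static int find_char(const char *s, char c, size_t *pos)
{
    const char *p = strchr(s, c);
    if (p == NULL)
        return 0;
    *pos = (size_t)(p - s);
    return 1;
}

int main(void)
{
    const char *record = "name:value";
    int column = -1;        /* the surrounding code keeps indices in int... */
    size_t where;           /* ...so a size_t temporary is needed here */

    if (find_char(record, ':', &where) && where <= INT_MAX)
        column = (int)where;

    printf("column = %d\n", column);
    return 0;
}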
 

Malcolm McLean

Richard Heathfield said:
Malcolm McLean said:



Wrong. I can cite a 20% reduction in run-time in a (fairly hefty) program
that had taken 25 seconds to produce one particular kind of quotation.
Another case is interactive graphics. It doesn't matter how long your
calculation takes as long as you make the next frame. If you miss it by a
few microseconds, you've got all of 17 milliseconds to play with until the
next one comes along.

However generally I regard a 100% speedup as about the threshold for code
worth my time optimising.
 
