# Testing As Proof

#### Alan Gutierrez

I'm working on implementing a B+tree and I'd like to be a little more
rigorous with testing than I have in the past. What I'd like to do is
write tests that prove the validity of the algorithm.

The algorithm works, mind you, but I don't feel like I can prove it. I
want more than a bug-stomping test suite. I want tests that capture the
reasoning behind the algorithm. Ideally I could point you to these tests
and you could see a proof. You could find fault in the reasoning or else
gain confidence in the implementation.

Nuts for me that I never got past geometry in high-school. I'll need to
weave a study of this sort of testing practice in with a study of
proofs. I need to learn how one might reasonably convince oneself of
something without deceiving oneself, which is a human tendency.

I'm getting good at identifying fallacies as a discussion survival tool
for the more raucous forums. I've used the resources on the Internet that
help you identify fallacies. I'm not yet confident in my ability to draw
new conclusions, only to spot misdirection. I imagine there might be
similar resources that help a person structure their own arguments,
rather than simply poke holes in the fallacious arguments of others.

Does this make sense to you? It may sound as if I'm asking for something
too basic, but there must be other programmers who learned the trade but
then later wanted to circle back and learn the formalities. It seems
like someone must have taken the testing as proof approach already.

#### Lew

I'm working on implementing a B+tree and I'd like to be a little more
rigorous with testing than I have in the past. What I'd like to do is
write tests that prove the validity of the algorithm.

The algorithm works, mind you, but I don't feel like I can prove it. I
want more than a bug-stomping test suite. I want tests that capture the
reasoning behind the algorithm. Ideally I could point you to these tests
and you could see a proof. You could find fault in the reasoning or else
gain confidence in the implementation.

Tests don't give a formal proof. They just show that certain cases produce
expected results.

An algorithm is either proven or not proven. You can use a proven algorithm,
and use published material to verify that the algorithm is valid. You can
look up the proof that the algorithm is valid.

That leaves the question of whether your implementation actually expresses
that proven algorithm. You can and should establish that by salting the code
with 'assert' statements to prove that the implementation adheres to the
algorithmic invariants.
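
For instance - a sketch with made-up names, where `ORDER` and the node
layout are assumptions for illustration rather than anything from Alan's
actual design - the constructor can assert the invariants the proof
depends on:

```java
// Sketch: assertion-checked invariants for a hypothetical B+tree node.
// ORDER and the key layout are assumptions, not the real design.
final class BTreeNode {
    static final int ORDER = 4;   // assumed maximum number of keys per node

    final int[] keys;
    final int count;              // number of keys actually in use

    BTreeNode(int[] keys, int count) {
        this.keys = keys;
        this.count = count;
        // Run with java -ea; the constructor refuses any node that
        // breaks the invariants the algorithm's proof relies on.
        assert invariantsHold(keys, count) : "node invariants violated";
    }

    // The invariants: the key count is in bounds and keys are ascending.
    static boolean invariantsHold(int[] keys, int count) {
        if (count < 0 || count > ORDER || count > keys.length) return false;
        for (int i = 1; i < count; i++)
            if (keys[i - 1] >= keys[i]) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(invariantsHold(new int[] {1, 3, 7, 0}, 3)); // true
        System.out.println(invariantsHold(new int[] {3, 1, 0, 0}, 2)); // false
    }
}
```

The asserts cost nothing in production (they're disabled by default) but
document the algorithmic invariants at exactly the point where they must
hold.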

#### Tom Anderson

Tests don't give a formal proof. They just show that certain cases produce
expected results.

True. I think what Alan is getting at is more 'testing as argument' than
'testing as proof'. For instance, if i was going to write such tests for
an implementation of quicksort, i might write tests like:

// i have an isTrivialCase to weed out arrays that don't need sorting
isTrivialCaseReturnsTrueForLengthZeroArray
isTrivialCaseReturnsTrueForLengthOneArray
isTrivialCaseReturnsFalseForLongerArrays // test 2, 3, 100

// and i'm going to deal with two-element arrays specially to save some headache
isTwosortCaseReturnsTrueForLengthTwoArray
isTwosortCaseReturnsFalseForLengthThreeArray
twosortSortsALengthTwoArray // maybe try a few examples

// i use median-of-three partitioning to choose the pivot
choosePivotFindsAnElementOfTheArray
choosePivotFindsAnElementOfTheArrayWithAtLeastOneSmallerElement
choosePivotFindsAnElementOfTheArrayWithAtLeastOneLargerElement

// i partition around the pivot
partitionLeavesTheArrayPartitionedAroundThePivot // perhaps the crucial test
// might have several variations on the above, for pathological cases and so on
partitionAppliedToASubarrayDoesNotAffectTheRestOfTheArray

// we need some kind of test of the fact that the two halves left after
// partitioning are always smaller than the starting array

// final test
quicksortLeavesTheArraySorted

At each step, i test one of the invariants which are the building blocks
of the proof of correctness. The tests aren't a proof, but they help
illustrate the proof - and check that my implementation conforms to it.
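
A sketch of what that crucial partition test might check, using a simple
Lomuto-style partition as a stand-in implementation (the real pivot
selection and partition scheme may well differ):

```java
// Sketch: the partition invariant as a checkable predicate, with a
// Lomuto-style partition standing in for the real implementation.
final class PartitionInvariant {
    // Partition a[lo..hi] around a[hi]; return the pivot's final index.
    static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        return i;
    }

    // The invariant: everything left of the pivot is <= it,
    // everything right of it is >= it.
    static boolean partitionedAround(int[] a, int p) {
        for (int i = 0; i < p; i++) if (a[i] > a[p]) return false;
        for (int i = p + 1; i < a.length; i++) if (a[i] < a[p]) return false;
        return true;
    }

    public static void main(String[] args) {
        int[] a = {5, 1, 9, 3, 7, 2, 4};
        int p = partition(a, 0, a.length - 1);
        System.out.println(partitionedAround(a, p));   // true
    }
}
```

The predicate is the interesting part: it states the invariant the proof
needs, independently of how the partition is implemented.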

tom

#### Lew

Tom said:
At each step, i test one of the invariants which are the building blocks
of the proof of correctness. The tests aren't a proof, but they help
illustrate the proof - and check that my implementation conforms to it.

Tests are good things. There are three legs on the stool of implementation
correctness here: assertions, conditional/exception blocks and unit tests.

#### Joshua Cranmer

I'm working on implementing a B+tree and I'd like to be a little more
rigorous with testing than I have in the past. What I'd like to do is
write tests that prove the validity of the algorithm.

To quote Dijkstra, "Testing can show the presence of bugs, but never
their absence."

Instead, the simplest thing to do is to write enough tests such that all
execution paths can be covered. Basically, find a few examples of common
cases and use every corner case you can think of--null parameters,
0-length arrays, etc.

If you're making a datatype, you probably need to have your tests ensure
that the internal structure never gets out of sync. In addition, you'd
need to ensure that the tree looks like how it's supposed to look after
each operation.

You can also (if the license is appropriate) reuse others' examples as
test cases.

#### Stefan Ram

Joshua Cranmer said:
To quote Dijkstra, "Testing can show the presence of bugs, but never
their absence."

A specification usually involves explicit or implied
universal quantifiers.

All these assertions can be tested indeed whenever the
quantification covers some small set (such as bool).
When all possibilities are tested, this is a proof.
So you can prove the operator »&« for booleans by testing
false&false, false&true, true&false, and true&true.
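
In Java, the whole domain fits in a four-iteration loop, so - hidden
state aside - running the following sketch is the proof:

```java
// Exhaustive check of »&« over booleans: four cases cover the entire
// domain, so passing them constitutes the proof described above
// (setting aside hidden state such as time-dependent behaviour).
final class ExhaustiveAnd {
    static boolean and(boolean a, boolean b) {
        return a & b;
    }

    public static void main(String[] args) {
        boolean[] values = { false, true };
        for (boolean a : values)
            for (boolean b : values)
                if (and(a, b) != (a && b))
                    throw new AssertionError(a + " & " + b + " is wrong");
        System.out.println("all 4 cases pass");
    }
}
```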

However, one can not prove Math.pow(double,double) in
this way. (Think of the Pentium bug.)

Well, actually, a malign programmer could have implemented
»bool&bool« to give false results only around midnight or
only between 2010-07-31T18:47:24+02:00 and
2010-07-31T18:47:26+02:00 or only if there is a certain file
in the current directory or only if the source file contains
a certain comment. So, even testing once for false&false,
false&true, true&false, and true&true does not prove the
correct implementation of »bool&bool« for all environments.

#### Joshua Cranmer

So you can prove the operator »&« for booleans by testing
false&false, false&true, true&false, and true&true.

Well, this works if you have the implicit assumption that there is no
reliance on other state. I suppose you could say that the test can turn
into a proof of correctness if you know that it covers all of the
possible cases for the implementation.
However, one can not prove Math.pow(double,double) in
this way. (Think of the Pentium bug.)

If you know the algorithm, you can reduce it to categories and then
prove all categories correct. If you change the algorithm, though, the
test is no longer necessarily complete test coverage.

Actually, I have had to struggle with finding representative tests of a
program when you have no knowledge of how it is implemented (i.e.,
writing an autograder). That was when I discovered that most people
don't know how to thoroughly test their own code to find problems such
as a really common one brought about by implicit type conversion.

#### Alan Gutierrez

Well, actually, a malign programmer could have implemented
»bool&bool« to give false results only around midnight or
only between 2010-07-31T18:47:24+02:00 and
2010-07-31T18:47:26+02:00 or only if there is a certain file
in the current directory or only if the source file contains
a certain comment. So, even testing once for false&false,
false&true, true&false, and true&true does not prove the
correct implementation of »bool&bool« for all environments.

I believe I'd make an exception for Byzantine fault tolerance.

Does that help define the concept? I'd like to be able to argue that my
program is at least as trustworthy as the platform on which it runs.

#### Alan Gutierrez

Joshua said:
If you know the algorithm, you can reduce it to categories and then
prove all categories correct. If you change the algorithm, though, the
test is no longer necessarily complete test coverage.
Actually, I have had to struggle with finding representative tests of a
program when you have no knowledge of how it is implemented (i.e.,
writing an autograder).

I'm not sure how you'd go about testing without source code and coverage
tools. I can imagine that you could investigate an implementation, using
a test framework, but I wouldn't have too much confidence in tests that
I wrote solely against an interface.

Joshua said:
That was when I discovered that most people don't know how to thoroughly
test their own code to find problems such as a really common one brought
about by implicit type conversion.

Which would that be? Curious. I'm probably one of the people that
doesn't know how to test for it.

#### Alan Gutierrez

Tom said:
True. I think what Alan is getting at is more 'testing as argument' than
'testing as proof'. For instance, if i was going to write such tests for
an implementation of quicksort, i might write tests like:

// i have an isTrivialCase to weed out arrays that don't need sorting
isTrivialCaseReturnsTrueForLengthZeroArray
isTrivialCaseReturnsTrueForLengthOneArray
isTrivialCaseReturnsFalseForLongerArrays // test 2, 3, 100
At each step, i test one of the invariants which are the building blocks
of the proof of correctness. The tests aren't a proof, but they help
illustrate the proof - and check that my implementation conforms to it.

Yes. I mean testing as argument, with the intent of persuading adopters
of the code that the code is of a certain quality, as well as persuading
myself.

I don't mean to get into a discussion about the Halting problem, nor do
I mean to assert through testing that the software will never fail under
any condition. I did however run across the field of formal verification
in my research on this topic.

CMU - Specification and Verification Center

http://www.cs.cmu.edu/~svc/

Verifying an Operating System Kernel

http://ertos.nicta.com.au/publications/papers/Klein_NEH_07.pdf

That would be overreach for me at this point.

Indeed, I'd like to structure testing so that another programmer can see
my reasoning, step through the logic for an aspect of the implementation,
and feel confident that it is correct or else tell me what I might be
missing. If someone can understand that I am trying to show that
`partitionLeavesTheArrayPartitionedAroundThePivot`, they are going to be
in a better position to find the edge case and show me that I'm wrong.

The JUnit FAQ has this algorithm:

becomeTimidAndTestEverything
while writingTheSameThingOverAndOverAgain
    becomeMoreAggressive
    writeFewerTests
    writeTestsForMoreInterestingCases
    if getBurnedByStupidDefect
        feelStupid
        becomeTimidAndTestEverything
    end
end

I don't see it as aggressive versus timid. I have no problem getting
100% line and branch coverage, because I favor testing: any best
practice that makes something impossible to test is rejected in favor of
one that is testable.

I'd like to see it as persuasive and communicative.

So I'm wondering how to structure testing to convey reasoning and
instill confidence. Seeing a bunch of unit tests named ticket81930(),
ticket81931(), ticket81932() doesn't really instill confidence because
you just see bug squashing, but no reasoning.

Anyway, thanks to Tom for the detailed response. Maybe that is all there
is to it, create a well documented JUnit `TestCase` with comments and
meaningful method names.

#### Martin Gregorie

I'm not sure how you'd go about testing without source code and coverage
tools. I can imagine that you could investigate an implementation, using
a test framework, but I wouldn't have too much confidence in tests that
I wrote solely against an interface.

IME testability is down to the designer, since that is presumably who
defined and documented the interface and the purpose of the unit being
tested. If all they did was publish an interface with about one sentence
saying what it's for, then you're right - it's untestable.

Put another way, if you have to read the source code to write tests, then
you've been trapped into testing what the coder wrote and not what you
should be testing, which is whether the code matches the requirement it
was written to fulfil. In this case the designer is at fault, since he
failed to provide a clear, unambiguous description of what the unit is
required to do, which must include its handling of bad inputs.

I've run into this problem with designers who think they're done when
they've produced a set of use cases that a non-technical user can
understand. They haven't. A one page use case with no exception handling
detail is plain inadequate. It doesn't help either if requests for
clarification only result in the designer changing the use case without
providing the missing detail or, worse, contradicting some of his
previous requirements. I was involved with writing automated tests for a
complex package a bit over two years ago which had exactly this level of
airy-fairy documentation. The test package wasn't complete when my
contract ended and I've just heard that it's apparently no further on
right now and said complex package is still far from being completely
tested.

Designers should at least document to the level of a fully comprehensive
Javadoc - in fact I'd go further and say that it is not unreasonable for
the designers to deliver module specifications at the package, interface
and externally visible class level either as a set of outline Java files
containing comment and skeletons acceptable to the javadocs utility, or
at least text that can be rapidly reformatted into something that
javadocs can process. And, before you ask, that's exactly the type of
documentation I aim to deliver regardless of whether the coder is myself
or somebody else.

#### Eric Sosman

[...]
All these assertions can be tested indeed whenever the
quantification covers some small set (such as bool).
When all possibilities are tested, this is a proof.
So you can prove the operator »&« for booleans by testing
false&false, false&true, true&false, and true&true.

What about `false & true | true', and so on? If the
implementation incorrectly treats `&' as having lower precedence
than `|', is that not an error? One could quibble about whether
the error is in `&' or in `|', but either way it is clear that
there's something wrong that your four cases plus four similar
cases for `|' would not catch.

Also, your four cases should be expanded to cover boolean
values of different provenance: Literal constants, static final
constants, local variables, static variables, instance variables,
method values, ... It is entirely possible that `&' could work
just fine for most of these but fail for some quirky combination
(e.g., through an incorrect optimization). Then there's JIT ...

At some point, I think a practical test must eventually rely
on "inside information" about the implementation, or on a notion
of "close enough for jazz." Without that -- well, you might be
able to demonstrate the correctness of many many results today,
but does that prove they will also be correct tomorrow?

#### Tom Anderson

[...]
All these assertions can be tested indeed whenever the
quantification covers some small set (such as bool).
When all possibilities are tested, this is a proof.
So you can prove the operator »&« for booleans by testing
false&false, false&true, true&false, and true&true.

What about `false & true | true', and so on? If the
implementation incorrectly treats `&' as having lower precedence
than `|', is that not an error? One could quibble about whether
the error is in `&' or in `|',

No, the error is clearly in ' '.

but either way it is clear that there's something wrong that your four
cases plus four similar cases for `|' would not catch.

Ah, but that's an integration test!

tom

#### Tom Anderson

IME testability is down to the designer, since that is presumably who
defined and documented the interface and the purpose of the unit being
tested. If all they did was publish an interface with about one sentence
saying what it's for, then you're right - it's untestable.

Put another way, if you have to read the source code to write tests, then
you've been trapped into testing what the coder wrote and not what you
should be testing, which is whether the code matches the requirement it
was written to fulfil. In this case the designer is at fault, since he
failed to provide a clear, unambiguous description of what the unit is
required to do, which must include its handling of bad inputs.

I've run into this problem with designers who think they're done when
they've produced a set of use cases that a non-technical user can
understand. They haven't. A one page use case with no exception handling
detail is plain inadequate. It doesn't help either if requests for
clarification only result in the designer changing the use case without
providing the missing detail or, worse, contradicting some of his
previous requirements. I was involved with writing automated tests for a
complex package a bit over two years ago which had exactly this level of
airy-fairy documentation. The test package wasn't complete when my
contract ended and I've just heard that it's apparently no further on
right now and said complex package is still far from being completely
tested.

Designers should at least document to the level of a fully comprehensive
Javadoc - in fact I'd go further and say that it is not unreasonable for
the designers to deliver module specifications at the package, interface
and externally visible class level either as a set of outline Java files
containing comment and skeletons acceptable to the javadocs utility, or
at least text that can be rapidly reformatted into something that
javadocs can process. And, before you ask, that's exactly the type of
documentation I aim to deliver regardless of whether the coder is myself
or somebody else.

This all sounds a bit mental to me. If this alleged designer is going to
design to this level of detail, why don't they just write the code?

tom

#### Joshua Cranmer

I'm not sure how you'd go about testing without source code and coverage
tools. I can imagine that you could investigate an implementation, using
a test framework, but I wouldn't have too much confidence in tests that
I wrote solely against an interface.

The most common cases I can think of where this is done are when you're
writing a test suite for some specification (e.g., CSS) and when you are
trying to write an autograder for homework submissions. In both cases,
you'll probably have a corpus of buggy implementations you can use to
test for specific bug classes. Even if you don't, a good programmer
should be able to guess where the most problems are likely to occur
(e.g., a cross of two features generally not intended to mix) and write
test cases for those specific situations.

You won't get perfect coverage, but it should satisfy most people's
expectations.

Alan said:
Which would that be? Curious. I'm probably one of the people that
doesn't know how to test for it.

The assignment was to write a simulator for a simple 16-bit processor
architecture. There is an instruction which basically says "load the
memory at the address of this register plus this (signed) immediate value".

The types: `u16 memory[65536]; s16 (or u16) reg; s16 immed;`

The problematic line: `reg = memory[reg + immed];`

The end result was that reg + immed should be equal to 0xB716 after
overflowing. The type conversion implicitly converted the computed index
into either a signed 16-bit value or a 32-bit value. Respectively, that
would produce either a negative index or an index of 0x1B716, both of
which are out of bounds.
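
A sketch of the same trap in Java, which has no unsigned 16-bit type;
the values here are chosen so that the correctly wrapped address is
0xB716:

```java
// Sketch of the implicit-conversion trap, with u16 values held in ints
// because Java has no unsigned 16-bit type.
final class Simulator {
    // Correct: a 16-bit machine address wraps, so mask back into range.
    static int address(int reg, short immed) {
        return (reg + immed) & 0xFFFF;
    }

    public static void main(String[] args) {
        int reg = 0xF000;                 // u16 register value (61440)
        short immed = (short) 0xC716;     // s16 immediate (-14570)

        // Mistake 1: sign-extend the register too -> negative index.
        System.out.println((short) reg + immed);            // -18666
        // Mistake 2: zero-extend the immediate -> index past the end.
        System.out.println(reg + (immed & 0xFFFF));         // 112406 (0x1B716)
        // Masked, the arithmetic wraps to the intended address.
        System.out.println(Integer.toHexString(address(reg, immed))); // b716
    }
}
```

A test that feeds in a register/immediate pair whose sum crosses the
16-bit boundary catches both mistakes at once.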

#### Stefan Ram

Joshua Cranmer said:
The end result was that reg + immed should be equal to 0xB716 after
overflowing. The type conversion implicitly converted the computed index
into either a signed 16-bit value or a 32-bit value. Respectively, that
would produce either a negative index or an index of 0x1B716, both of
which are out of bounds.

In

http://en.wikipedia.org/wiki/Cleanroom_Software_Engineering

one uses »statistically sound testing« that yields an
»estimate of the reliability of the software, and a
level of confidence in that estimate«.

And - for those who do not yet know it - I always enjoy reading

http://www.fastcompany.com/node/28121/print

.

#### Martin Gregorie

This all sounds a bit mental to me. If this alleged designer is going to
design to this level of detail, why don't they just write the code?

The level I'm advocating is pretty much what you see in a good Javadoc or
a C standard library manpage. There's a good argument to be made that if
you don't document to that level before coding starts, then it's pretty
much hit or miss whether the resulting code does what it's intended to do
if the coding is done by somebody else.

OK, it might stand a chance of doing its job if the designer wrote it,
but what are the chances of somebody else using it correctly without a
fair amount of coaching and hand-holding from the designer?

The project I mentioned with the use-case system docs was like that: for
some packages running Javadocs showed absolutely nothing but the class
names and method interfaces - and the method names weren't particularly
meaningful. The result was that the Javadocs were meaningless for
determining how and when to use the classes - we had to prise example
code and argument descriptions out of the author before we could even
think about using them. You may think that's acceptable but I don't.

Read the source? We didn't have it and couldn't get it. Besides, I expect
to use library objects without needing to see implementation detail.
Isn't that the whole point of OO?

#### Martin Gregorie

But if the domain of the API is large (or floating point), it's clearly
impossible to "fully test".

A particular common case is when a function utilises a cache. Clearly
the caching should be tested, and yet how (without some kind of
knowledge of cache size and replacement algorithms) could one know that
the cache has been tested?

I see no realistic way of getting trustworthy testing in this case
without some knowledge of the implementation.

I think I'd expect the cache capacity to be documented, assuming there
isn't a method for setting it.

Similarly, I'd want the cache's discard algorithm to be described. There
are two reasons for doing this:

(1) to give the user some idea of what to expect. A LRU discard/overwrite
algorithm will give a very different performance in some
circumstances from FIFO [*]. There are cases when either may be
appropriate, so the user needs to know which was used.

(2) you need to tell the coder which algorithm to implement, so why not
simply write it into the spec and deal with both points at once.

Besides, if more than a few milliseconds is spent writing the spec, you
may come to realise that, in this example, a selectable set of discard
algorithms should be provided along with a method of specifying which is
to be used in each instance.

[*] LRU is appropriate when you want to retain the most frequently
used items and item usage is heavily skewed, but FIFO may be more
appropriate if there is also a TTL requirement or item usage is
fairly evenly spread.

#### Alan Gutierrez

Read the source? We didn't have it and couldn't get it. Besides, I expect
to use library objects without needing to see implementation detail.
Isn't that the whole point of OO?

There are other points to OO. I want to be able to read the source of
the software I use for development. I use open source software, though
not 100% Pure Open Source; I don't take a zealot's approach.

That should actually clarify my intent, which is to publish a program as
open source, using the tests as a way to argue the quality of the
implementation, plus give people a way to structure the impromptu code
reviews that occur when a potential adopter investigates the implementation.

In any case, hiding implementation detail is the point of encapsulation,
but it is not the "whole point of OO."

Sometimes the devil is in the details. If you're testing, you need to
see those devils. Thus, I agree with Tom, and would say that the
generalized problem of specification conformance testing is not what I'm
trying to address, but rather the establishment of the validity of an
implementation.

#### Jim Janney

Alan Gutierrez said:
I'm working on implementing a B+tree and I'd like to be a little more
rigorous with testing than I have in the past. What I'd like to do is
write tests that prove the validity of the algorithm.

The algorithm works, mind you, but I don't feel like I can prove it. I
want more than a bug-stomping test suite. I want tests that capture the
reasoning behind the algorithm. Ideally I could point you to these
tests and you could see a proof. You could find fault in the reasoning
or else gain confidence in the implementation.

Nuts for me that I never got past geometry in high-school. I'll need
to weave a study of this sort of testing practice in with a study of
proofs. I need to learn how one might reasonably convince oneself of
something without deceiving oneself, which is a human tendency.

I'm getting good at identifying fallacies as a discussion survival
tool for the more raucous forums. I've used the resources on the
Internet that help you identify fallacies. I'm not yet confident in my
ability to draw new conclusions, only to spot misdirection. I imagine
there might be similar resources that help a person structure their
own arguments, rather than simply poke holes in the fallacious
arguments of others.

Does this make sense to you? It may sound as if I'm asking for
something too basic, but there must be other programmers who learned
the trade but then later wanted to circle back and learn the
formalities. It seems like someone must have taken the testing as
proof approach already.

It's possible to reason formally about code. This typically involves
pre- and post-conditions, things that are known to be true before the
execution of a piece of code and things that can be derived from the
code and its preconditions. A simple, probably not very good example
would be:

int x;
/* no precondition */
x = 10;
/* postcondition: x == 10 */

/* precondition: x == 10 */
while (x < 20) {
    /* precondition: x < 20 */
    x++;
    /* postcondition: x <= 20 */
}

/* postcondition: x >= 20 (from the condition of the while) */

It gets more complicated than this, but I hope you get the general
idea. In principle it's possible, but usually not easy, to prove the
correctness of entire classes or even whole programs this way. As Lew
has already pointed out, this kind of reasoning converts fairly
directly into assert statements embedded in the code.
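
In Java, that translates into something like this sketch (run with
`java -ea` so the assert statements are enabled):

```java
// The pre/post-condition chain written as Java assert statements;
// each assert mirrors one of the comments in the example above.
final class Postconditions {
    static int runToTwenty() {
        int x = 10;
        assert x == 10;        // postcondition of the assignment
        while (x < 20) {
            assert x < 20;     // precondition of the loop body
            x++;
            assert x <= 20;    // postcondition of the increment
        }
        assert x >= 20;        // from the negated loop condition
        return x;
    }

    public static void main(String[] args) {
        System.out.println(runToTwenty());   // prints 20
    }
}
```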

BUT -- and this is a big but -- this is not unit testing, because it
depends on a line-by-line knowledge of the code and the assertions
have to be embedded directly in the code. All that unit testing can
do is to call the publicly accessible methods of a class and check
the results. And normally you *don't* want a unit test to care how a
method works inside, only that it behaves in the required manner.

Sorry for the late response, I'm catching up after a vacation.
