Exhaustive Unit Testing

Emanuele D'Arrigo

Your experiences are one of the reasons that writing the tests *first*
can be so helpful. You think about the *behaviour* you want from your
units and you test for that behaviour - *then* you write the code
until the tests pass.

Thank you Michael, you are perfectly right in reminding me of this. At
this particular point in time I'm not yet architecturally foresighted
enough to be able to do that. While I was writing the design documents
I did write the list of methods each object would have needed, and from
that description I could theoretically have written the tests first. In
practice, while eventually writing the code for those methods, I've
come to realize that there was a large amount of variance between what
I -thought- I needed and what I -actually- needed. So, had I written
the tests beforehand, I would have had to rewrite them. That being
said, I agree that writing the tests first must be my goal. I hope
that as my experience increases I'll be able to know beforehand the
behaviors I need from each method/object/module of my applications.
One step at a time I'll get there... =)

Manu
 
Roel Schroeven

Fuzzyman wrote:
By the way, to reduce the number of independent code paths you need to
test you can use mocking. You only need to test the logic inside the
methods you create (testing behaviour), and not every possible
combination of paths.

I don't understand that. This is part of something I've never understood
about unit testing, something I bump up against each time I try to apply
unit testing and don't know how to resolve. I also find it difficult to
explain exactly what I mean.

Suppose I need to write method spam() that turns out to be somewhat
complex, like the class method Emanuele was talking about. When I try to
write test_spam() before the method, I have no way to know that I'm
going to need so many code paths, and that I'm going to split the code
out into a number of other functions spam_ham(), spam_eggs(), etc.

So whatever happens, I still have to test spam(), however many codepaths
it contains? Even if it only contains a few lines with fors and ifs and
calls to the other functions, it still needs to be tested? Or not? From
a number of postings in this thread I get the impression (though that
might be an incorrect interpretation) that many people are content to
only test the various helper functions, and not the spam() itself. You
say you don't have to test every possible combination of paths, but how
thorough is your test suite if you have untested code paths?

A related matter (at least in my mind) is this: after I've written
test_spam() but before spam() is correctly working, I find out that I
need to write spam_ham() and spam_eggs(), so I need test_spam_ham() and
test_spam_eggs(). That means that I can never have a green light while
coding test_spam_ham() and test_spam_eggs(), since test_spam() will
fail. That feels wrong. And this is a simple case; I've seen cases where
I've created various new classes in order to write one new function.
Maybe I shouldn't care so much about the failing unit test? Or perhaps I
should turn test_spam() off while working on test_spam_ham() and
test_spam_eggs().

I've read "test-driven development" by David Astels, but somehow it
seems that the issues I encounter in practice don't come up in his examples.

--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov

Roel Schroeven
 
Steven D'Aprano

Roel Schroeven wrote:

I don't understand that. This is part of something I've never understood
about unit testing, something I bump up against each time I try to apply
unit testing and don't know how to resolve. I also find it difficult to
explain exactly what I mean.

Suppose I need to write method spam() that turns out to be somewhat
complex, like the class method Emanuele was talking about. When I try to
write test_spam() before the method, I have no way to know that I'm
going to need so many code paths, and that I'm going to split the code
out into a number of other functions spam_ham(), spam_eggs(), etc.

So whatever happens, I still have to test spam(), however many codepaths
it contains? Even if it only contains a few lines with fors and ifs and
calls to the other functions, it still needs to be tested? Or not?

The first thing to remember is that it is impractical for unit tests to
be exhaustive. Consider the following trivial function:

def add(a, b): # a and b ints only
    return a+b+1

Clearly you're not expected to test *every imaginable* path through this
function (ignoring unit tests for error handling and bad input):

assert add(0, 0) == 1
assert add(1, 0) == 2
assert add(2, 0) == 3
assert add(3, 0) == 4
....
assert add(99736263, 8264891001) == 8364627265
....

Instead, your tests for add() can rely on the + operator being
sufficiently tested that you can trust it, and so you only need to test
the logic of your function. To do that, it would be sufficient to test a
relatively small representative sample of data. One test would probably
be sufficient:

assert add(1, 3) == 5

That test would detect almost all bugs in the function, although of
course it won't detect every imaginable bug. A second test will make the
chances of such false negatives virtually disappear.
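
As a minimal sketch of what such a test might look like (the function
name test_add is my own choice, not from the original post):

def test_add():
    # Two representative samples; we trust + itself and only check our logic.
    assert add(1, 3) == 5
    assert add(10, -2) == 9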

Now imagine a more complicated function:

def spam(a, b):
    return spam_eggs(a, b) + spam_ham(a) - 2*spam_tomato(b)

Suppose spam_eggs has four paths that need testing (paths A, B, C, D),
spam_ham and spam_tomato have two each (E F and G H), and let's assume
that they are all independent. Then your spam unit tests need to test
every path:

A E G
A E H
A F G
A F H
B E G
B E H
....
D F H

for a total of 4*2*2=16 paths, in the spam unit tests.
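
Just to make the combinatorics concrete, here is a small sketch (not
from the original post) that enumerates those combinations mechanically:

import itertools

# Four paths through spam_eggs, two through spam_ham, two through spam_tomato.
paths = list(itertools.product("ABCD", "EF", "GH"))
assert len(paths) == 16          # 4 * 2 * 2
print(paths[0], paths[-1])       # ('A', 'E', 'G') ... ('D', 'F', 'H')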

But suppose that we have tested spam_eggs independently. It has four
paths, so we need four tests to cover them all. Now our spam testing can
assume that spam_eggs is correct, in the same way that we earlier assumed
that the plus operator was correct, and reduce the number of tests to a
small set of representative data.

No matter which path through spam_eggs we take, we can trust the result,
because we have unit tests that will fail if spam_eggs has a bug. So
instead of testing every path, I choose a much more limited set:

A E G
A E H
A F G
A F H

I arbitrarily choose path A alone, confident that paths B C and D are
correct, but of course I could make other choices. There's no need to
test paths B C and D *within spam's unit tests*, because they are already
tested elsewhere. To test them again within spam doesn't gain me anything.

Consequently, we reduce our total number of tests from 16 to 8 (four
tests for spam, four for spam_eggs).

From a number of postings in this thread I get the impression (though that
might be an incorrect interpretation) that many people are content to
only test the various helper functions, and not the spam() itself. You
say you don't have to test every possible combination of paths, but how
thorough is your test suite if you have untested code paths?

The success of this tactic assumes that you can identify code paths and
make them independent. If they are dependent, then you can't be sure that
path E G after A is the same as E G after D.

Real world example: compare driving your car from home to the mall to the
park, compared to driving from work to the mall to the park. The journey
from the mall to the park is the same, no matter how you got to the mall.
If you can drive from home to the mall and then to the park, and you can
drive from work to the mall, then you can be sure that you can drive from
work to the mall to the park even though you've never done it before.

But if you can't be sure the paths are independent, then you can't make
that simplifying assumption, and you do have to test more paths in more
places.

A related matter (at least in my mind) is this: after I've written
test_spam() but before spam() is correctly working, I find out that I
need to write spam_ham() and spam_eggs(), so I need test_spam_ham() and
test_spam_eggs(). That means that I can never have a green light while
coding test_spam_ham() and test_spam_eggs(), since test_spam() will
fail. That feels wrong.

I would say that means you're letting your tests get too far ahead of
your code. In theory, you should never have more than one failing test at
a time: the last test you just wrote. If you have to refactor code so
much that a bunch of tests start failing, then you need to take those
tests out, and re-introduce them one at a time.

In practice, I can't imagine too many people have the discipline to
follow that practice precisely. I know I don't :)
 
Fuzzyman

Thank you Michael, you are perfectly right in reminding me of this. At
this particular point in time I'm not yet architecturally foresighted
enough to be able to do that. While I was writing the design documents
I did write the list of methods each object would have needed, and from
that description I could theoretically have written the tests first. In
practice, while eventually writing the code for those methods, I've
come to realize that there was a large amount of variance between what
I -thought- I needed and what I -actually- needed. So, had I written
the tests beforehand, I would have had to rewrite them. That being
said, I agree that writing the tests first must be my goal. I hope
that as my experience increases I'll be able to know beforehand the
behaviors I need from each method/object/module of my applications.
One step at a time I'll get there... =)


Personally I find writing the tests an invaluable part of the design
process. It works best if you do it 'top-down'. i.e. You have a
written feature specification (a user story) - you turn this into an
executable specification in the form of a functional test.

Next come mid-level unit tests, and so on downwards - so your tests
become your design documents (and the way you think about design), but
better than a document, they are executable. So just as code conveys
intent, so do the tests (it is important that tests are readable).
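
For instance, a user story such as "a visitor can add two items to a
cart and see the combined total" might become an executable
specification roughly like this (a sketch only; Cart and its methods
are hypothetical names, not taken from any real project):

def test_visitor_sees_combined_total():
    # High-level functional test derived directly from the user story.
    cart = Cart()
    cart.add_item("spam", price=3)
    cart.add_item("eggs", price=2)
    assert cart.total() == 5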

For the situation where you don't really know what the API should look
like, Extreme Programming (of which TDD is part) includes a practice
called spiking. I wrote a bit about that here:

http://www.voidspace.org.uk/python/weblog/arch_d7_2007_11_03.shtml#e867

Mocking can help reduce the number of code paths you need to test for
necessarily complex code. Say you have a method that looks something
like:

def method(self):
    if conditional:
        # do stuff
    else:
        # do other stuff
    # then do more stuff

You may be able to refactor this to look more like the following

def method(self):
    if conditional:
        self.method2()
    else:
        self.method3()
    self.method4()

You can then write unit tests that *only* tests methods 2 - 4 on their
own. That code is then tested. You can then test method by mocking out
methods 2 - 4 on the instance and only need to test that they are
called in the right conditions and with the right arguments (and you
can mock out the return values to test that method handles them
correctly).
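
A minimal sketch of that idea, assuming Python's unittest.mock (the
standard library descendant of the mock package mentioned below); the
Thing class is a made-up stand-in for the method sketched above:

from unittest import mock

class Thing:
    # Hypothetical stand-in for the refactored class above.
    def __init__(self, conditional):
        self.conditional = conditional
    def method(self):
        if self.conditional:
            self.method2()
        else:
            self.method3()
        self.method4()
    def method2(self): ...
    def method3(self): ...
    def method4(self): ...

def test_method_takes_true_branch():
    thing = Thing(conditional=True)
    # Mock out methods 2-4 so only method()'s own logic is under test.
    with mock.patch.object(thing, "method2") as m2, \
         mock.patch.object(thing, "method3") as m3, \
         mock.patch.object(thing, "method4") as m4:
        thing.method()
    m2.assert_called_once_with()
    m3.assert_not_called()
    m4.assert_called_once_with()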

Mocking in Python is very easy, but there are plenty of mock libraries
to make it even easier. My personal favourite (naturally) is my own:

http://www.voidspace.org.uk/python/mock.html

All the best,

Michael
 
Roel Schroeven

Thanks for your answer. I still don't understand completely, though. I
suppose it's me, but I've been trying to understand some of this for
quite some time and somehow I can't seem to wrap my head around it.

Steven D'Aprano wrote:
On Sat, 29 Nov 2008 11:36:56 +0100, Roel Schroeven wrote:

The first thing to remember is that it is impractical for unit tests to
be exhaustive. Consider the following trivial function:

def add(a, b): # a and b ints only
    return a+b+1

Clearly you're not expected to test *every imaginable* path through this
function (ignoring unit tests for error handling and bad input):

assert add(0, 0) == 1
assert add(1, 0) == 2
assert add(2, 0) == 3
assert add(3, 0) == 4
...
assert add(99736263, 8264891001) == 8364627265
...
OK

> ...
I arbitrarily choose path A alone, confident that paths B C and D are
correct, but of course I could make other choices. There's no need to
test paths B C and D *within spam's unit tests*, because they are already
tested elsewhere.

Except that I'm always told that the goal of unit tests, at least
partly, is to protect us against mistakes when we make changes to the
tested functions. They should tell me whether I can still trust spam()
after refactoring it. Doesn't that mean that the unit test should see
spam() as a black box, providing a certain (but probably not 100%)
guarantee that the unit test is still a good test even if I change the
implementation of spam()?

And I don't understand how that works in test-driven development; I
can't possibly adapt the tests to the code paths in my code, because the
code doesn't exist yet when I write the test.
> To test them again within spam doesn't gain me anything.

I would think it gains you the freedom of changing spam's implementation
while still being able to rely on the unit tests. Or maybe I'm thinking
too far?

The success of this tactic assumes that you can identify code paths and
make them independent. If they are dependent, then you can't be sure that
path E G after A is the same as E G after D.

Real world example: compare driving your car from home to the mall to the
park, compared to driving from work to the mall to the park. The journey
from the mall to the park is the same, no matter how you got to the mall.
If you can drive from home to the mall and then to the park, and you can
drive from work to the mall, then you can be sure that you can drive from
work to the mall to the park even though you've never done it before.

But if you can't be sure the paths are independent, then you can't make
that simplifying assumption, and you do have to test more paths in more
places.

OK, but that only works if I know the code paths, meaning I've already
written the code. Wasn't the whole point of TDD that you write the tests
before the code?

I would say that means you're letting your tests get too far ahead of
your code. In theory, you should never have more than one failing test at
a time: the last test you just wrote. If you have to refactor code so
much that a bunch of tests start failing, then you need to take those
tests out, and re-introduce them one at a time.

I still fail to see how that works. I know I must be wrong since so many
people successfully apply TDD, but I don't see what I'm missing.

Let's take a more-or-less realistic example: I want/need a function to
calculate the least common multiple of two numbers. First I write some
tests:

assert(lcm(1, 1) == 1)
assert(lcm(2, 5) == 10)
assert(lcm(2, 4) == 4)

Then I start to write the lcm() function. I do some research and I find
out that I can calculate the lcm from the gcd, so I write:

def lcm(a, b):
    return a / gcd(a, b) * b

But gcd() doesn't exist yet, so I write some tests for gcd(a, b) and
start writing the gcd function. But all the time while writing that, the
lcm tests will fail.

I don't see how I can avoid that, unless I create gcd() before I create
lcm(), but that only works if I know that I'm going to need it. In a
simple case like this I could know, but in many cases I don't know it
beforehand.

--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov

Roel Schroeven
 
Steven D'Aprano

Except that I'm always told that the goal of unit tests, at least
partly, is to protect us against mistakes when we make changes to the
tested functions. They should tell me whether I can still trust spam()
after refactoring it. Doesn't that mean that the unit test should see
spam() as a black box, providing a certain (but probably not 100%)
guarantee that the unit test is still a good test even if I change the
implementation of spam()?

Yes, but you get to choose how strong that guarantee is. If you want to
test the same thing in multiple places in your code, you're free to do
so. Refactoring merely reduces the minimum number of tests you need for
complete code coverage, not the maximum.

The aim here isn't to cut the number of unit tests down to the absolute
minimum number required to cover all paths through your code, but to
reduce that minimum number to something tractable: O(N) or O(N**2)
instead of O(2**N), where N = some appropriate measure of code complexity.

It is desirable to have some redundant tests, because they reduce the
chances of a freakish bug just happening to give the correct result for
the test but wrong results for everything else. (Assuming of course that
the redundant tests aren't identical -- you gain nothing by running the
exact same test twice.) They also give you extra confidence that you can
refactor the code without introducing such freakish bugs. But if you find
yourself making such sweeping changes to your code base that you no
longer have such confidence, then by all means add more tests!


And I don't understand how that works in test-driven development; I
can't possibly adapt the tests to the code paths in my code, because the
code doesn't exist yet when I write the test.

That's where you should be using mocks and stubs to ease the pain.

http://en.wikipedia.org/wiki/Mock_object
http://en.wikipedia.org/wiki/Method_stub


I would think it gains you the freedom of changing spam's implementation
while still being able to rely on the unit tests. Or maybe I'm thinking
too far?

No, you are right, and I over-stated the case.


[snip]
I still fail to see how that works. I know I must be wrong since so many
people successfully apply TDD, but I don't see what I'm missing.

Let's take a more-or-less realistic example: I want/need a function to
calculate the least common multiple of two numbers. First I write some
tests:

assert(lcm(1, 1) == 1)
assert(lcm(2, 5) == 10)
assert(lcm(2, 4) == 4)

(Aside: assert is not a function, you don't need the parentheses.)

Arguably, that's too many tests. Start with one.

assert lcm(1, 1) == 1

And now write lcm:

def lcm(a, b):
    return 1

That's a stub, and our test passes. So add another test:

assert lcm(2, 5) == 10

and the test fails. So let's fix the function by using gcd.

def lcm(a, b):
    return a/gcd(a, b)*b

(By the way: there's a subtle bug in lcm() that will hit you in Python 3.
Can you spot it? Here's a hint: your unit tests should also assert that
the result of lcm is always an int.)
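
Such a test might be as simple as this sketch:

result = lcm(2, 4)
assert result == 4
assert isinstance(result, int)   # guards against / turning the result into a float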

Now that we've introduced a new function, we need a stub and a test for
it:

def gcd(a, b):
    return 1

Why does the stub return 1? So it will make the lcm test pass. If we had
more lcm tests, it would be harder to write a gcd stub, hence the
insistence on adding only a single test at a time.

assert gcd(1, 1) == 1

Now all the tests pass and we get a nice green light. Let's add
another test. We need to add it to the gcd test suite, because it's the
latest, least-working function. If you add a test to the lcm test suite,
and it fails, you don't know if it failed because of an error in lcm() or
because of an error in gcd(). So leave lcm alone until gcd is working:

assert gcd(2, 4) == 2

Now go and fix gcd. At some time you have to decide to stop using a stub
for gcd, and write the function properly. For a function that simple,
"now" is that time, but just for the exercise let me write a slightly
more complicated stub. This is (probably) the next simplest stub which
allows the gcd tests to pass while still being "wrong":

def gcd(a, b):
    if a == b:
        return 1
    else:
        return 2

When you're convinced gcd() is working, you can go back and add
additional tests to lcm.
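
For completeness, the "proper" gcd Steven alludes to could be the
classic Euclidean algorithm (a sketch; the thread itself never shows it):

def gcd(a, b):
    # Euclid's algorithm: repeatedly replace (a, b) with (b, a mod b).
    while b:
        a, b = b, a % b
    return a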

In practice, of course, you can skip a few steps. It's hard to be
disciplined enough to program in such tiny little steps. But the cost of
being less disciplined is that it takes longer to have all the tests pass.
 
Steven D'Aprano

def lcm(a, b):
    return a/gcd(a, b)*b

(By the way: there's a subtle bug in lcm() that will hit you in Python
3. Can you spot it?

Er, ignore this. Division in Python 3 only returns a float if the
remainder is non-zero, and when dividing by the gcd the remainder should
always be zero.
Here's a hint: your unit tests should also assert that
the result of lcm is always an int.)

Still good advice.
 
Terry Reedy

Steven said:
there's a subtle bug in lcm() that will hit you in Python 3. Can you spot it?

Er, ignore this. Division in Python 3 only returns a float if the
remainder is non-zero, and when dividing by the gcd the remainder should
always be zero.

You were right the first time. In IDLE 3.0rc3, true division returns a
float even when the remainder is zero (e.g. 2.0 rather than 2).

lcm should return an int so should be written a//gcd(a,b) * b
to guarantee that.
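
A quick illustration of Terry's point (a sketch, assuming Python 3 and
the standard library's math.gcd, which did not exist at the time of the
thread):

from math import gcd

assert 4 / 2 == 2.0    # true division always returns a float
assert 4 // 2 == 2     # floor division keeps ints as ints

def lcm(a, b):
    return a // gcd(a, b) * b   # result stays an int for int inputs

assert lcm(2, 5) == 10 and isinstance(lcm(2, 5), int)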
 
Roel Schroeven

Steven D'Aprano wrote:

Thank you for the elaborate answer, Steven. I think I'm really starting
to get it now.

--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov

Roel Schroeven
 
James Harris

Ok, I've taken this wise suggestion on board and of course I found
immediately ways to improve the method. -However- this generates
another issue. I can fragment the code of the original method into one
public method and a few private support methods. But this doesn't
reduce the complexity of the testing because the number and complexity
of the possible path stays more or less the same. The solution to this
would be to test the individual methods separately, but is the only
way to test private methods in python to make them (temporarily) non
private? I guess ultimately this would only require the removal of the
appropriate double-underscores followed by method testing and then
adding the double-underscores back in place. There is no "cleaner"
way, is there?

Difficult to say without seeing the code. You could post it, perhaps.
On the other hand a general recommendation from Programming Pearls
(Jon Bentley) is to convert code to data structures. Maybe you could
convert some of the code to decision tables or similar.
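
For example, a chain of if/elif branches can sometimes be replaced by a
dispatch table, which is easier to extend and to test as data (a sketch
with made-up handler names, not code from this thread):

def handle_ham(msg):
    return "ham: " + msg

def handle_eggs(msg):
    return "eggs: " + msg

HANDLERS = {
    "ham": handle_ham,
    "eggs": handle_eggs,
}

def spam(kind, msg):
    # One code path regardless of how many kinds there are;
    # the table itself can be checked with a simple data-driven test.
    return HANDLERS[kind](msg)

assert spam("ham", "hello") == "ham: hello"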

James
 
