Probabilistic unit tests?

Nick Mellor

Hi,

I've got a unit test that usually succeeds but sometimes fails. An occasional failure is expected and fine. It's failing all the time that I want to test for.

What I want to test is "on average, there are the same number of males and females in a sample, give or take 2%."

Here's the unit test code:
import unittest
from collections import Counter

sex_count = Counter()
for contact in range(self.binary_check_sample_size):
    p = get_record_as_dict()
    sex_count[p['Sex']] += 1
self.assertAlmostEqual(sex_count['male'],
                       sex_count['female'],
                       delta=self.binary_check_sample_size * 2.0 / 100.0)

My question is: how would you run an identical test 5 times and pass the group *as a whole* if only one or two iterations passed the test? Something like:

for n in range(5):
    # self.assertAlmostEqual(...)
    # if test passed: break
else:
    self.fail()

(except that would create 5+1 tests as written!)

Thanks for any thoughts,

Best wishes,

Nick

Roy Smith

Nick Mellor said:
Hi,

I've got a unit test that usually succeeds but sometimes fails. An
occasional failure is expected and fine. It's failing all the time that I
want to test for.

What I want to test is "on average, there are the same number of males and
females in a sample, give or take 2%."
[...]
My question is: how would you run an identical test 5 times and pass the
group *as a whole* if only one or two iterations passed the test? Something
like:

for n in range(5):
    # self.assertAlmostEqual(...)
    # if test passed: break
else:
    self.fail()

I would do something like:

def do_test():
    """Return 1 if the test passes, 0 if it fails."""
    ...

results = [do_test() for n in range(number_of_trials)]
self.assertTrue(sum(results) > threshold)

That's the simple part.

The more complicated part is figuring out how many times to run the test
and what an appropriate threshold is. For that, you need to talk to a
statistician.
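
For a rough sense of the numbers, here is a sketch (the trial count and per-trial pass rate are illustrative only, and math.comb needs Python 3.8+): if a correct implementation passes one trial with probability p, the pass count over n trials is Binomial(n, p), and you can pick the cut-off so that correct code almost never trips it.

from math import comb

def p_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p): the chance that a correct
    implementation passes k or fewer of n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# With 20 trials, each passing 95% of the time for correct code,
# requiring at least 16 passes flags a correct implementation only
# about 0.26% of the time:
print(p_at_most(15, 20, 0.95))    # ~0.0026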

Steven D'Aprano

Nick Mellor said:
Hi,

I've got a unit test that usually succeeds but sometimes fails. An
occasional failure is expected and fine. It's failing all the time that
I want to test for.

Well, that's not really a task for unit testing. Unit tests, like most
tests, are well suited to deterministic tests, but not really to
probabilistic testing. As far as I know, there aren't really any good
frameworks for probabilistic testing, so you're stuck with inventing your
own. (Possibly on top of unittest.)

What I want to test is "on average, there are the same number of males
and females in a sample, give or take 2%."

Here's the unit test code:
import unittest
from collections import Counter

sex_count = Counter()
for contact in range(self.binary_check_sample_size):
    p = get_record_as_dict()
    sex_count[p['Sex']] += 1
self.assertAlmostEqual(sex_count['male'],
                       sex_count['female'],
                       delta=self.binary_check_sample_size * 2.0 / 100.0)

That's a cheap and easy way to almost get what you want, or at least what
I think you should want.

Rather than a "Succeed/Fail" boolean test result, I think it is worth
producing a float between 0 and 1 inclusive, where 0 is "definitely
failed" and 1 is "definitely passed", and intermediate values reflect
some sort of fuzzy logic score. In your case, you might look at the ratio
of males to females. If the ratio is exactly 1, the fuzzy score would be
1.0 ("definitely passed"), otherwise as the ratio gets further away from
1, the score would approach 0.0:

if males <= females:
    score = males/females
else:
    score = females/males

should do it.

Finally, your probabilistic-test framework could then either report the
score itself, or decide on a cut-off value below which you turn it into a
unittest failure.
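
A minimal sketch of that idea (the helper name, the stand-in counts, and
the cut-off value are illustrative assumptions, not part of the original
post):

import unittest

def sex_ratio_score(males, females):
    """Fuzzy score: 1.0 when the counts are equal, falling towards
    0.0 as the ratio moves away from 1."""
    if males == females == 0:
        return 0.0                    # no data at all: treat as a failure
    lo, hi = sorted((males, females))
    return lo / hi

class SexRatioTest(unittest.TestCase):
    def test_ratio_score(self):
        males, females = 5100, 4900   # stand-in counts for illustration
        score = sex_ratio_score(males, females)
        # A cut-off of 0.96 corresponds to roughly a 49:51 split,
        # i.e. about the 2% imbalance Nick's original test allowed.
        self.assertGreaterEqual(score, 0.96)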

That's still not quite right though. To be accurate, you're getting into
the realm of hypothesis testing and conditional probabilities:

- if these random samples of males and females came from a population of
equal numbers of each, what is the probability I could have got the
result I did?

- should I reject the hypothesis that the samples came from a population
with equal numbers of males and females?


Talk to a statistician on how to do this.

My question is: how would you run an identical test 5 times and pass the
group *as a whole* if only one or two iterations passed the test?
Something like:

for n in range(5):
    # self.assertAlmostEqual(...)
    # if test passed: break
else:
    self.fail()

(except that would create 5+1 tests as written!)


Simple -- don't use assertAlmostEqual, or any other of the unittest
assertSomething methods. Write your own function to decide whether or not
something passed, then count how many times it passed:

count = 0
for n in range(5):
    count += self.run_some_test()  # returns 0 or 1, or a fuzzy score
if count < some_cut_off:
    self.fail()

Steven D'Aprano

Nick Mellor said:
Hi,

I've got a unit test that usually succeeds but sometimes fails. An
occasional failure is expected and fine. It's failing all the time that
I want to test for.

What I want to test is "on average, there are the same number of males
and females in a sample, give or take 2%."
[...]

Another poster wrote:
Unit tests are for testing your code, not for checking that input data is
in the correct range, so unless you are writing a program intended to
generate test data I don't see why unit tests are appropriate in this
case.

I don't believe Nick is using unittest to check input data. As I
understand it, Nick has a program which generates random values. If his
program works correctly, it should generate approximately equal numbers
of "male" and "female" values. So he writes a unit test to check that the
numbers are roughly equal.

This is an appropriate test, although as I suggested earlier, unit tests
are not well suited for non-deterministic testing.

duncan smith

Nick Mellor said:
Hi,

I've got a unit test that usually succeeds but sometimes fails. An occasional failure is expected and fine. It's failing all the time that I want to test for.

What I want to test is "on average, there are the same number of males and females in a sample, give or take 2%."

Here's the unit test code:
import unittest
from collections import Counter

sex_count = Counter()
for contact in range(self.binary_check_sample_size):
    p = get_record_as_dict()
    sex_count[p['Sex']] += 1
self.assertAlmostEqual(sex_count['male'],
                       sex_count['female'],
                       delta=self.binary_check_sample_size * 2.0 / 100.0)

My question is: how would you run an identical test 5 times and pass the group *as a whole* if only one or two iterations passed the test? Something like:

for n in range(5):
    # self.assertAlmostEqual(...)
    # if test passed: break
else:
    self.fail()

(except that would create 5+1 tests as written!)

Thanks for any thoughts,

Best wishes,

Nick

The appropriateness of "give or take 2%" will depend on sample size.
E.g. if the underlying proportion of males really is 0.5, the difference
between the male and female counts has a standard deviation of about the
square root of the sample size, so for small samples a band of 2% of the
sample size is much narrower than the natural variation, and the test
will fail most of the time even though nothing is wrong.

What you could do is perform a statistical test. Generally this involves
generating a p-value and rejecting the null hypothesis if the p-value is
below some chosen threshold (Type I error rate), often taken to be 0.05.
Here the null hypothesis would be that the underlying proportion of
males is 0.5.
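
For this particular null hypothesis, a one-off version of such a test
might look like the following sketch (the counts are made up, and it
assumes SciPy 1.7+ for binomtest):

from scipy.stats import binomtest

males, n = 5150, 10000                # illustrative counts, not real data
result = binomtest(males, n, p=0.5)   # H0: underlying proportion is 0.5
if result.pvalue < 0.05:              # chosen Type I error rate
    print("reject H0: the sex ratio looks biased")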

A statistical test will incorrectly reject a true null in a proportion
of cases equal to the chosen Type I error rate. A test will also fail to
reject false nulls a certain proportion of the time (the Type II error
rate). The Type II error rate can be reduced by using larger samples. I
prefer to generate several samples and test whether the proportion of
failures is about equal to the error rate.

The above implies that, if the null hypothesis is true, the p-values
follow a uniform distribution on [0, 1]. So alternatively you could
generate many samples / p-values and test the p-values for uniformity.
That is what I generally do:


p_values = []
for _ in range(numtests):
    values = generate_data()            # data from the code under test
    p_values.append(stat_test(values))
check_uniformity(p_values)              # e.g. Kolmogorov-Smirnov vs. U(0,1)


The result is still a test that will fail a given proportion of the
time. You just have to live with that. Run your test suite several times
and check that no one test is "failing" too regularly (more often than
the chosen Type I error rate for the test of uniformity). My experience
is that any real issues generally result in the test of uniformity being
consistently rejected (which is why I do that rather than just
performing a single test on a single generated data set).

In your case you're testing a Binomial proportion and as long as you're
generating enough data (you need to take into account any test
assumptions / approximations) the observed proportions will be
approximately normally distributed. Samples of e.g. 100 would be fine.
P-values can be generated from the appropriate normal
(http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval),
and uniformity can be tested using e.g. the Kolmogorov-Smirnov or
Anderson-Darling test
(http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm).
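
Putting those pieces together, a sketch of the whole procedure might look
like this (it assumes SciPy; generate_sample() is a hypothetical stand-in
for the code under test; and note that with small samples the p-values
are discrete, which makes the Kolmogorov-Smirnov test only approximate):

import math
import random

from scipy.stats import kstest, norm

def generate_sample(n=100):
    # Hypothetical stand-in for the code under test.
    return [random.choice(['male', 'female']) for _ in range(n)]

def proportion_p_value(sample, p0=0.5):
    """Two-sided p-value for H0: P('male') == p0, via the normal
    approximation to the binomial."""
    n = len(sample)
    males = sum(1 for s in sample if s == 'male')
    z = (males - n * p0) / math.sqrt(n * p0 * (1 - p0))
    return 2 * norm.sf(abs(z))

# One p-value per generated sample; under H0 they should be
# (approximately) uniform on [0, 1].
p_values = [proportion_p_value(generate_sample()) for _ in range(200)]
statistic, ks_p = kstest(p_values, 'uniform')
print("KS statistic %.3f, p-value %.3f" % (statistic, ks_p))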

I'd have thought that something like this also exists somewhere. How do
people usually test e.g. functions that generate random variates, or
other cases where deterministic tests don't cut it?

Duncan

alex23

Steven D'Aprano said:
Well, that's not really a task for unit testing. Unit tests, like most
tests, are well suited to deterministic tests, but not really to
probabilistic testing. As far as I know, there aren't really any good
frameworks for probabilistic testing, so you're stuck with inventing your
own. (Possibly on top of unittest.)

One approach I've had success with is providing a seed to the RNG, so
that the random results are deterministic.
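
A minimal sketch of that approach, assuming the code under test draws
from Python's random module and can accept an injected RNG:

import random

def generate_sample(rng, n=1000):
    # Hypothetical stand-in for the code under test; it takes the RNG
    # as a parameter so tests can inject a seeded instance.
    return [rng.choice(['male', 'female']) for _ in range(n)]

def test_sample_is_reproducible():
    first = generate_sample(random.Random(12345))
    second = generate_sample(random.Random(12345))
    # Same seed, same stream: the test's input never varies between
    # runs, so any assertion made on it is fully deterministic.
    assert first == second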

Roy Smith

alex23 said:
One approach I've had success with is providing a seed to the RNG, so
that the random results are deterministic.

Sometimes, a hybrid approach is best.

I was once working on some code which had timing-dependent behavior.
The input space was so large, there was no way to exhaustively test all
conditions. What we did was use a PRNG to drive the test scenarios,
seeded with the time. We would print out the seed at the beginning of
the test. This let us explore a much larger range of the input space
than we could have with hand-written test scenarios.

There was also a mode where you could supply your own PRNG seed. So,
the typical deal would be to wait for a failure during normal (nightly
build) testing, then grab the seed from the test logs and use that to
replicate the behavior for further study.
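
A sketch of that scheme (the environment-variable name is an
illustrative choice, not from the original setup):

import os
import random
import time

# Seed from the clock by default, but allow a failure to be replayed
# by supplying the logged seed explicitly.
seed = int(os.environ.get('TEST_SEED', time.time()))
print("PRNG seed: %d" % seed)        # recorded in the test log for replay
rng = random.Random(seed)

# ... drive the randomised test scenarios from rng ...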

duncan smith

alex23 said:
One approach I've had success with is providing a seed to the RNG, so
that the random results are deterministic.

My ex-boss once instructed me to do the same thing to test functions for
generating random variates. I used a statistical approach instead.

There are often several ways of generating data that follow a particular
distribution. If you use a given seed so that you get a deterministic
sequence of uniform random variates, you will get deterministic outputs
for a specific implementation. But if you change the implementation, the
tests are likely to fail. E.g. to generate a negative exponential
variate, either -ln(U)/lambda or -ln(1-U)/lambda will do the job
correctly, but tests written for one implementation would fail with the
other. So each time you changed the implementation you'd need to change
the tests.
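
To illustrate (a sketch; the function names are mine):

import math
import random

def exp_variate_a(rng, lam):
    # -ln(U)/lambda. (Assumes rng.random() never returns exactly 0.0,
    # which is possible but astronomically unlikely.)
    return -math.log(rng.random()) / lam

def exp_variate_b(rng, lam):
    # -ln(1-U)/lambda: the same distribution, a different value stream.
    return -math.log(1.0 - rng.random()) / lam

# Both are correct Exponential(2.0) generators, but with the same seed
# they produce different numbers, so a seed-based exact-output test
# written for one fails for the other.
print(exp_variate_a(random.Random(1), 2.0))
print(exp_variate_b(random.Random(1), 2.0))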

I think my boss had in mind that I would write the code, seed the RNG,
call the function a few times, then use the generated values in the
test. That would not even have tested the original implementation. I
would have had a test that only checked whether the implementation had
changed -- worse, I would argue, than no test at all. If I'd gone to the
trouble of manually calculating the expected outputs so that I got valid
tests for the original implementation, then I would have had a test that
effectively just served as a reminder to go through the whole manual
calculation process again for any changed implementation.

A reasonably general statistical approach is possible. Any hypothesis
about generated data that lends itself to statistical testing can be
used to generate a sequence of p-values (one for each set of generated
values) that can be checked (statistically) for uniformity. This
effectively tests the distribution of the test statistic, so is better
than simply testing whether tests on generated data pass, say, 95% of
the time (for a chosen 5% Type I error rate). Cheers.

Duncan