Probabilistic unit tests?

Discussion in 'Python' started by Nick Mellor, Jan 11, 2013.

  1. Nick Mellor

    Nick Mellor Guest

    Hi,

    I've got a unit test that will usually succeed but sometimes fails. An occasional failure is expected and fine; it's failing all the time that I want to test for.

    What I want to test is "on average, there are the same number of males and females in a sample, give or take 2%."

    Here's the unit test code:
    import unittest
    from collections import Counter

    sex_count = Counter()
    for contact in range(self.binary_check_sample_size):
        p = get_record_as_dict()
        sex_count[p['Sex']] += 1
    self.assertAlmostEqual(sex_count['male'],
                           sex_count['female'],
                           delta=self.binary_check_sample_size * 2.0 / 100.0)

    My question is: how would you run an identical test 5 times and pass the group *as a whole* if only one or two iterations passed the test? Something like:

    for n in range(5):
        # self.assertAlmostEqual(...)
        # if test passed: break
    else:
        self.fail()

    (except that would create 5+1 tests as written!)

    Thanks for any thoughts,

    Best wishes,

    Nick
    Nick Mellor, Jan 11, 2013
    #1

  2. Nick Mellor

    Roy Smith Guest

    In article <>,
    Nick Mellor <> wrote:

    > Hi,
    >
    > I've got a unit test that will usually succeed but sometimes fails. An
    > occasional failure is expected and fine. It's failing all the time I want to
    > test for.
    >
    > What I want to test is "on average, there are the same number of males and
    > females in a sample, give or take 2%."
    > [...]
    > My question is: how would you run an identical test 5 times and pass the
    > group *as a whole* if only one or two iterations passed the test? Something
    > like:
    >
    > for n in range(5):
    >     # self.assertAlmostEqual(...)
    >     # if test passed: break
    > else:
    >     self.fail()


    I would do something like:

    def do_test_body():
        """Run one trial; return 1 if it passes, 0 if it fails."""
        ...

    results = [do_test_body() for n in range(number_of_trials)]
    self.assertTrue(sum(results) > threshold)

    That's the simple part.

    The more complicated part is figuring out how many times to run the test
    and what an appropriate threshold is. For that, you need to talk to a
    statistician.
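    For what it's worth, the binomial distribution gives a principled way to pick
    that threshold once you assume a per-trial pass rate. A rough sketch, assuming
    SciPy is available; the 0.95 pass rate and the one-in-a-million group failure
    rate are made-up numbers you would choose yourself:

    from scipy.stats import binom

    number_of_trials = 20        # how often the probabilistic test is repeated
    per_trial_pass_rate = 0.95   # assumed chance one trial passes when the code is correct
    group_alpha = 1e-6           # acceptable chance the whole group fails spuriously

    # Smallest pass count whose cumulative probability reaches group_alpha;
    # failing only when we see fewer passes than this keeps spurious group
    # failures rarer than group_alpha.
    threshold = int(binom.ppf(group_alpha, number_of_trials, per_trial_pass_rate))

    # later, inside the test:
    # self.assertGreaterEqual(sum(results), threshold)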
    Roy Smith, Jan 11, 2013
    #2

  3. On Thu, 10 Jan 2013 17:59:05 -0800, Nick Mellor wrote:

    > Hi,
    >
    > I've got a unit test that will usually succeed but sometimes fails. An
    > occasional failure is expected and fine. It's failing all the time I
    > want to test for.


    Well, that's not really a task for unit testing. Unit tests, like most
    tests, are well suited to deterministic tests, but not really to
    probabilistic testing. As far as I know, there aren't really any good
    frameworks for probabilistic testing, so you're stuck with inventing your
    own. (Possibly on top of unittest.)


    > What I want to test is "on average, there are the same number of males
    > and females in a sample, give or take 2%."
    >
    > Here's the unit test code:
    > import unittest
    > from collections import Counter
    >
    > sex_count = Counter()
    > for contact in range(self.binary_check_sample_size):
    >     p = get_record_as_dict()
    >     sex_count[p['Sex']] += 1
    > self.assertAlmostEqual(sex_count['male'],
    >                        sex_count['female'],
    >                        delta=self.binary_check_sample_size * 2.0 / 100.0)


    That's a cheap and easy way to almost get what you want, or at least what
    I think you should want.

    Rather than a "Succeed/Fail" boolean test result, I think it is worth
    producing a float between 0 and 1 inclusive, where 0 is "definitely
    failed" and 1 is "definitely passed", and intermediate values reflect
    some sort of fuzzy logic score. In your case, you might look at the ratio
    of males to females. If the ratio is exactly 1, the fuzzy score would be
    1.0 ("definitely passed"), otherwise as the ratio gets further away from
    1, the score would approach 0.0:

    if males <= females:
        score = males/females
    else:
        score = females/males

    should do it.

    Finally, your probabilistic-test framework could then either report the
    score itself, or decide on a cut-off value below which you turn it into a
    unittest failure.
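    A minimal sketch of the cut-off variant, where the 0.96 cut-off and the
    helper name are arbitrary choices:

    def fuzzy_sex_ratio_score(sex_count):
        """1.0 when males == females, approaching 0.0 as the ratio drifts away."""
        males, females = sex_count['male'], sex_count['female']
        if males == 0 or females == 0:
            return 0.0
        # true division (on Python 2, add "from __future__ import division")
        return males / females if males <= females else females / males

    # inside the TestCase, after filling sex_count as above:
    # self.assertGreaterEqual(fuzzy_sex_ratio_score(sex_count), 0.96)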

    That's still not quite right though. To be accurate, you're getting into
    the realm of hypothesis testing and conditional probabilities:

    - if these random samples of males and females came from a population of
    equal numbers of each, what is the probability I could have got the
    result I did?

    - should I reject the hypothesis that the samples came from a population
    with equal numbers of males and females?


    Talk to a statistician on how to do this.
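    In the meantime, here is a minimal sketch of that kind of test, assuming
    the normal approximation to the binomial is acceptable (standard library
    only; the function name and the 530/470 figures are just illustrative):

    import math

    def two_sided_p_value(males, females):
        """Approximate two-sided p-value for H0: P(male) = 0.5,
        using the normal approximation to Binomial(n, 0.5)."""
        n = males + females
        z = (males - 0.5 * n) / math.sqrt(n * 0.25)   # standardised male count
        return math.erfc(abs(z) / math.sqrt(2))       # two-sided normal tail

    # e.g. 530 males out of 1000:
    # two_sided_p_value(530, 470) -> about 0.058, so H0 is not rejected at the 5% level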


    > My question is: how would you run an identical test 5 times and pass the
    > group *as a whole* if only one or two iterations passed the test?
    > Something like:
    >
    > for n in range(5):
    >     # self.assertAlmostEqual(...)
    >     # if test passed: break
    > else:
    >     self.fail()
    >
    > (except that would create 5+1 tests as written!)



    Simple -- don't use assertAlmostEqual, or any other of the unittest
    assertSomething methods. Write your own function to decide whether or not
    something passed, then count how many times it passed:

    count = 0
    for n in range(5):
        count += self.run_some_test()  # returns 0 or 1, or a fuzzy score
    if count < some_cut_off:
        self.fail()


    --
    Steven
    Steven D'Aprano, Jan 11, 2013
    #3
  4. On Fri, 11 Jan 2013 16:26:20 +0000, Alister wrote:

    > On Thu, 10 Jan 2013 17:59:05 -0800, Nick Mellor wrote:
    >
    >> Hi,
    >>
    >> I've got a unit test that will usually succeed but sometimes fails. An
    >> occasional failure is expected and fine. It's failing all the time I
    >> want to test for.
    >>
    >> What I want to test is "on average, there are the same number of males
    >> and females in a sample, give or take 2%."

    [...]

    > unit tests are for testing your code, not checking whether input data is
    > in the correct range, so unless you are writing a program intended to
    > generate test data I don't see why unit tests are appropriate in this
    > case.


    I don't believe Nick is using unittest to check input data. As I
    understand it, Nick has a program which generates random values. If his
    program works correctly, it should generate approximately equal numbers
    of "male" and "female" values. So he writes a unit test to check that the
    numbers are roughly equal.

    This is an appropriate test, although as I already suggested earlier,
    unit tests are not well suited for non-deterministic testing.


    --
    Steven
    Steven D'Aprano, Jan 11, 2013
    #4
  5. Nick Mellor

    duncan smith Guest

    On 11/01/13 01:59, Nick Mellor wrote:
    > Hi,
    >
    > I've got a unit test that will usually succeed but sometimes fails. An occasional failure is expected and fine. It's failing all the time I want to test for.
    >
    > What I want to test is "on average, there are the same number of males and females in a sample, give or take 2%."
    >
    > Here's the unit test code:
    > import unittest
    > from collections import Counter
    >
    > sex_count = Counter()
    > for contact in range(self.binary_check_sample_size):
    >     p = get_record_as_dict()
    >     sex_count[p['Sex']] += 1
    > self.assertAlmostEqual(sex_count['male'],
    >                        sex_count['female'],
    >                        delta=self.binary_check_sample_size * 2.0 / 100.0)
    >
    > My question is: how would you run an identical test 5 times and pass the group *as a whole* if only one or two iterations passed the test? Something like:
    >
    > for n in range(5):
    >     # self.assertAlmostEqual(...)
    >     # if test passed: break
    > else:
    >     self.fail()
    >
    > (except that would create 5+1 tests as written!)
    >
    > Thanks for any thoughts,
    >
    > Best wishes,
    >
    > Nick
    >


    The appropriateness of "give or take 2%" will depend on sample size.
    e.g. if the underlying proportion of males really is 0.5 but your sample is
    small enough, random variation alone will push the counts more than 2% apart
    most of the time, so the test will fail even though the code is correct.

    What you could do is perform a statistical test. Generally this involves
    generating a p-value and rejecting the null hypothesis if the p-value is
    below some chosen threshold (Type I error rate), often taken to be 0.05.
    Here the null hypothesis would be that the underlying proportion of
    males is 0.5.

    A statistical test will incorrectly reject a true null in a proportion
    of cases equal to the chosen Type I error rate. A test will also fail to
    reject false nulls a certain proportion of the time (the Type II error
    rate). The Type II error rate can be reduced by using larger samples. I
    prefer to generate several samples and test whether the proportion of
    failures is about equal to the error rate.

    The above implies that p-values follow a [0,1] uniform density function
    if the null hypothesis is true. So alternatively you could generate many
    samples / p-values and test the p-values for uniformity. That is what I
    generally do:


    p_values = []
    for _ in range(numtests):
        values = generate_sample()          # data generated from the code under test
        p_values.append(stat_test(values))  # p-value from a test of the null hypothesis
    test_uniformity(p_values)               # e.g. Kolmogorov-Smirnov against Uniform[0, 1]


    The result is still a test that will fail a given proportion of the
    time. You just have to live with that. Run your test suite several times
    and check that no one test is "failing" too regularly (more often than
    the chosen Type I error rate for the test of uniformity). My experience
    is that any issues generally result in the test of uniformity being
    consistently rejected (which is why I do that rather than just
    performing a single test on a single generated data set).

    In your case you're testing a Binomial proportion and as long as you're
    generating enough data (you need to take into account any test
    assumptions / approximations) the observed proportions will be
    approximately normally distributed. Samples of e.g. 100 would be fine.
    P-values can be generated from the appropriate normal
    (http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval),
    and uniformity can be tested using e.g. the Kolmogorov-Smirnov or
    Anderson-Darling test
    (http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm).
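    A rough sketch of that recipe for this particular case, assuming SciPy is
    available; generate_record() stands in for whatever produces one record from
    the code under test, and the sample and repetition counts are arbitrary:

    import math
    from scipy.stats import kstest

    def binomial_p_value(males, n):
        """Two-sided p-value for H0: P(male) = 0.5, via the normal approximation."""
        z = (males - 0.5 * n) / math.sqrt(n * 0.25)
        return math.erfc(abs(z) / math.sqrt(2))

    p_values = []
    for _ in range(200):                                   # 200 repetitions, arbitrary
        sample = [generate_record() for _ in range(100)]   # placeholder for the code under test
        males = sum(1 for rec in sample if rec['Sex'] == 'male')
        p_values.append(binomial_p_value(males, len(sample)))

    # If the generator is unbiased, the p-values should look Uniform[0, 1].
    statistic, p = kstest(p_values, 'uniform')
    # e.g. self.assertGreater(p, 0.01), accepting the occasional false alarm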

    I'd have thought that something like this also exists somewhere. How do
    people usually test e.g. functions that generate random variates, or
    other cases where deterministic tests don't cut it?

    Duncan
    duncan smith, Jan 11, 2013
    #5
  6. Nick Mellor

    alex23 Guest

    On 11 Jan, 13:34, Steven D'Aprano <steve...> wrote:
    > Well, that's not really a task for unit testing. Unit tests, like most
    > tests, are well suited to deterministic tests, but not really to
    > probabilistic testing. As far as I know, there aren't really any good
    > frameworks for probabilistic testing, so you're stuck with inventing your
    > own. (Possibly on top of unittest.)


    One approach I've had success with is providing a seed to the RNG, so
    that the random results are deterministic.
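    For example (the test-case name is made up, and this assumes the code under
    test draws from the module-level random):

    import random
    import unittest

    class SexFieldTest(unittest.TestCase):
        def setUp(self):
            random.seed(12345)   # fixed seed: the same "random" records on every run

        def test_sex_ratio(self):
            ...                  # the existing assertions now see repeatable data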
    alex23, Jan 12, 2013
    #6
  7. Nick Mellor

    Roy Smith Guest

    In article <>, alex23 <> wrote:

    > On 11 Jan, 13:34, Steven D'Aprano <steve...> wrote:
    > > Well, that's not really a task for unit testing. Unit tests, like most
    > > tests, are well suited to deterministic tests, but not really to
    > > probabilistic testing. As far as I know, there aren't really any good
    > > frameworks for probabilistic testing, so you're stuck with inventing your
    > > own. (Possibly on top of unittest.)

    >
    > One approach I've had success with is providing a seed to the RNG, so
    > that the random results are deterministic.


    Sometimes, a hybrid approach is best.

    I was once working on some code which had timing-dependent behavior.
    The input space was so large, there was no way to exhaustively test all
    conditions. What we did was use a PRNG to drive the test scenarios,
    seeded with the time. We would print out the seed at the beginning of
    the test. This let us explore a much larger range of the input space
    than we could have with hand-written test scenarios.

    There was also a mode where you could supply your own PRNG seed. So,
    the typical deal would be to wait for a failure during normal (nightly
    build) testing, then grab the seed from the test logs and use that to
    replicate the behavior for further study.
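    Something along these lines, where the environment-variable name is just an
    example:

    import os
    import random
    import time

    # Seed from the clock for normal nightly runs, or from an environment
    # variable when replaying a failure recorded in the logs.
    seed = int(os.environ.get("TEST_SEED", time.time()))
    print("PRNG seed: %d" % seed)   # goes to the test log so failures can be reproduced
    rng = random.Random(seed)

    # drive scenario generation from rng.random(), rng.choice(), etc.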
    Roy Smith, Jan 12, 2013
    #7
  8. Nick Mellor

    duncan smith Guest

    On 12/01/13 08:07, alex23 wrote:
    > On 11 Jan, 13:34, Steven D'Aprano <steve...> wrote:
    >> Well, that's not really a task for unit testing. Unit tests, like most
    >> tests, are well suited to deterministic tests, but not really to
    >> probabilistic testing. As far as I know, there aren't really any good
    >> frameworks for probabilistic testing, so you're stuck with inventing your
    >> own. (Possibly on top of unittest.)

    >
    > One approach I've had success with is providing a seed to the RNG, so
    > that the random results are deterministic.
    >


    My ex-boss once instructed me to do the same thing to test functions for
    generating random variates. I used a statistical approach instead.

    There are often several ways of generating data that follow a particular
    distribution. If you use a given seed so that you get a deterministic
    sequence of uniform random variates you will get deterministic outputs
    for a specific implementation. But if you change the implementation the
    tests are likely to fail. e.g. to generate a negative exponential variate,
    either -ln(U)/lambda or -ln(1-U)/lambda will do the job correctly, but tests
    written for one implementation would fail with the other. So each time you
    changed the implementation you'd need to change the tests.
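    To make that concrete, here is a small sketch (seed 42 and lambda = 1.0 are
    arbitrary, and the vanishingly rare U == 0 case is ignored) showing that both
    formulas are valid generators yet disagree value for value under the same seed:

    import math
    import random

    lam = 1.0

    rng = random.Random(42)
    a = [-math.log(rng.random()) / lam for _ in range(3)]        # -ln(U)/lambda

    rng = random.Random(42)
    b = [-math.log(1.0 - rng.random()) / lam for _ in range(3)]  # -ln(1-U)/lambda

    print(a)   # both sequences are exponentially distributed with rate lam...
    print(b)   # ...but differ element by element, so a seeded "golden output" test breaks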

    I think my boss had in mind that I would write the code, seed the RNG,
    call the function a few times, then use the generated values in the
    test. That would not even have tested the original implementation. I
    would have had a test that would only have tested whether the
    implementation had changed. I would argue, worse than no test at all. If
    I'd gone to the trouble of manually calculating the expected outputs so
    that I got valid tests for the original implementation, then I would
    have had a test that would effectively just serve as a reminder to go
    through the whole manual calculation process again for any changed
    implementation.

    A reasonably general statistical approach is possible. Any hypothesis
    about generated data that lends itself to statistical testing can be
    used to generate a sequence of p-values (one for each set of generated
    values) that can be checked (statistically) for uniformity. This
    effectively tests the distribution of the test statistic, so is better
    than simply testing whether tests on generated data pass, say, 95% of
    the time (for a chosen 5% Type I error rate). Cheers.

    Duncan
    duncan smith, Jan 12, 2013
    #8
