Hypothesis and Pandera: Generate Synthetic Pandas DataFrames for Testing

    Create Clean and Robust Tests with Property-Based Testing

    Imagine you are trying to figure out whether the function processing_fn is working properly. You use pytest to test the function with an example.
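    The original code for this step is not reproduced here, so the following is only a minimal sketch. It assumes, purely for illustration, that processing_fn adds a column val3 holding the ratio of val1 to val2; the function body and the test data are hypothetical.

```python
import pandas as pd


def processing_fn(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: add val3 as the ratio of val1 to val2."""
    return df.assign(val3=df["val1"] / df["val2"])


def test_processing_fn():
    # Example-based test: one hand-picked input and its expected output.
    test_df = pd.DataFrame({"val1": [1, 2, 3], "val2": [2, 4, 6]})
    result = processing_fn(test_df)
    assert result["val3"].tolist() == [0.5, 0.5, 0.5]
```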

    The test passed, but you know that one example is not enough. You need to test the function with more examples to make sure that the function is working properly with any data.

    To do that, you might use pytest's parametrize marker, but it is difficult to come up with every example that might result in a failure.

    Even if you take the time to write all those examples, running all of the tests takes a long time.
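    To make the cost concrete, here is a sketch of what such a parametrized test could look like. The processing_fn below is the same hypothetical stand-in (val3 = val1 / val2), and the cases are invented for illustration. Every case has to be thought of and typed by hand, and nothing here covers an awkward input such as a zero in val2.

```python
import pandas as pd
import pytest


def processing_fn(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: add val3 as the ratio of val1 to val2."""
    return df.assign(val3=df["val1"] / df["val2"])


@pytest.mark.parametrize(
    "val1, val2, expected",
    [
        ([1, 2, 3], [2, 4, 6], [0.5, 0.5, 0.5]),
        ([-2, 0], [1, 2], [-2.0, 0.0]),
        ([3], [-1], [-3.0]),
    ],
)
def test_processing_fn(val1, val2, expected):
    # pytest runs this function once per tuple in the list above.
    df = pd.DataFrame({"val1": val1, "val2": val2})
    assert processing_fn(df)["val3"].tolist() == expected
```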

    Wouldn’t it be nice if there were a testing strategy that allows you to:

    • Write tests easily
    • Generate good data for testing
    • Detect falsifying examples quickly
    • Produce small and straightforward tests

    That is when Hypothesis and Pandera come in handy.

    Pandera is a simple Python library for validating a pandas DataFrame.

    To install Pandera, type:

    pip install pandera

    Hypothesis is a flexible and easy-to-use library for property-based testing.

    Example-based tests use concrete examples and concrete expected outputs. Property-based tests instead state general properties that must hold for every valid input, and let the library generate the inputs.

    As a result, property-based tests allow you to write cleaner tests and specify the behavior of the code better.
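    The difference is easiest to see side by side. The sketch below is not from the original article; it uses Python's built-in sorted as a neutral subject and checks it once with a fixed example and once with Hypothesis-generated lists.

```python
from hypothesis import given, strategies as st


def test_sorted_example():
    # Example-based: one concrete input, one concrete expected output.
    assert sorted([3, 1, 2]) == [1, 2, 3]


@given(st.lists(st.integers()))
def test_sorted_properties(values):
    # Property-based: these statements must hold for every generated list.
    result = sorted(values)
    assert len(result) == len(values)
    assert all(a <= b for a, b in zip(result, result[1:]))
```

    The second test never names a specific list. It only says what must be true of the output, and Hypothesis supplies many lists, including empty ones and ones with duplicates.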

    To install Hypothesis, type:

    pip install hypothesis

    This article will show you how to use these two tools to generate synthetic pandas DataFrames for testing.

    First, we will use Pandera to test if the output of a function satisfies some constraints when given one input.

    In the code below, we:

    • Use pandera.DataFrameSchema to specify some constraints for the output such as the datatype and the range of the values of a column.
    • Use the pandera.check_output decorator to test if the output of the function satisfies the constraints.

    Since there is no error when running this code, the output is valid.

    Next, we will use Hypothesis to create data for testing based on the constraints given by pandera.DataFrameSchema.

    Specifically, we will add:

    • schema.strategy(size=5) to specify the search strategy that describes how to generate and simplify the data; size=5 asks for DataFrames with five rows
    • @given to run the test function over a wide range of matching data from the specified strategy

    Run the tests with pytest:



    We found a falsifying example in less than 2 seconds! The output is also very simple. For example, instead of choosing an example like the following that could result in an error:

       val1  val2
    0     1     2
    1     2     1
    2     3     0
    3     4     0
    4     5     1

    Hypothesis chooses an example that is simpler and easier to understand:

       val1  val2
    0     0     0
    1     0     0
    2     0     0
    3     0     0
    4     0     0

    This is very cool because:

    • We do not need to specify any concrete examples.
    • The examples are straightforward enough for us to quickly understand the behavior of the tested function.
    • We find the falsifying example in a short amount of time.
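    To see why a small example helps, assume again (only for illustration, since the original function is not shown) that processing_fn divides val1 by val2. Feeding it the all-zeros DataFrame makes the problem obvious at a glance, whereas in the larger example the two offending rows are easy to miss.

```python
import math

import pandas as pd


def processing_fn(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: add val3 as the ratio of val1 to val2."""
    return df.assign(val3=df["val1"] / df["val2"])


# The minimal example: every row computes 0 / 0, which pandas reports as NaN.
zeros = pd.DataFrame({"val1": [0] * 5, "val2": [0] * 5})
zeros_result = processing_fn(zeros)
print(zeros_result)

# The larger example: only the rows where val2 is 0 go wrong (division by zero).
larger = pd.DataFrame({"val1": [1, 2, 3, 4, 5], "val2": [2, 1, 0, 0, 1]})
larger_result = processing_fn(larger)
print(larger_result)
```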

    Congratulations! You have just learned how to use Pandera and Hypothesis to generate synthetic data for testing. I hope this article gives you the knowledge you need to create robust and clean tests for your Python functions.

    Feel free to play with and fork the source code of this article here:
