
Unit Testing SQL in DBT #2354

Closed
MichelleArk opened this issue Apr 23, 2020 · 14 comments
Labels
dbt tests (Issues related to built-in dbt testing functionality) · enhancement (New feature or request)

Comments

@MichelleArk
Contributor

MichelleArk commented Apr 23, 2020

Describe the feature

In addition to the existing data test support DBT provides, it would be great if users had the capability to write unit tests to assert model behaviour generally and in edge cases. These would validate expected behaviour of models for data that isn't yet observed in production.

To do this, DBT would need to provide the ability to run models on a set of static inputs which could either be created at query-time or ahead of time.

We prototyped a solution where users encode static inputs in CSV files and configure a 'tests.yml' file that maps source/ref models to CSV files and specifies an expected output (also encoded as a CSV file). Our framework then generated a query that created a CTE for each static input, a CTE representing the model being tested (with source/ref macros replaced by the static input CTE names), and lastly ran a diff between the expected output and the output of the model built from the static inputs. This generated query was then fed to dbt test - if the diff returned 0 results, the test passed.
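The generated query had roughly this shape (a simplified sketch, not our framework's exact output; model, table, and column names are illustrative):

    with static_input_orders as (
        -- one CTE per static-input CSV file
        select 1 as order_id, 'completed' as status
        union all
        select 2 as order_id, 'cancelled' as status
    ),

    model_under_test as (
        -- the model's SQL, with ref('orders') rewritten to static_input_orders
        select status, count(*) as order_count
        from static_input_orders
        group by status
    ),

    expected as (
        -- the expected-output CSV, inlined
        select 'completed' as status, 1 as order_count
        union all
        select 'cancelled' as status, 1 as order_count
    )

    -- the diff fed to dbt test: the test passes when 0 rows come back
    select * from model_under_test
    except
    select * from expected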

Feedback from data scientists was that encoding static inputs in CSV files was cumbersome, that readability of tests was poor because of the many disparate files representing a single test case, and that more flexibility was desired: the ability to programmatically encode static inputs and to write custom expectations beyond equality.

Wondering if other DBT users have tried to achieve something similar, and how the community feels it's best to approach unit testing in DBT.

Describe alternatives you've considered

We have considered taking DBT's built-in data tests and running them on a small sample of production data locally. However, creating a representative sample of data covering all edge cases for all downstream models is challenging, and it is also bad practice - unit tests should have a single reason to fail. Creating many small tables, each representing an individual test case, could counter this, but our main concern was where/how these static datasets would be encoded - if they live in separate (say, CSV) files, reviewers and users have to jump between multiple files to understand a test case.

Another, more general issue with this approach is that writing assertions for unit tests feels quite unnatural in SQL - it's tricky even to get the right semantics for an equality check.
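For example, a one-way EXCEPT misses rows that exist only in the expected relation, and EXCEPT is set-based, so relations that differ only in duplicate-row counts still compare as equal. A commonly used (but still duplicate-blind) symmetric check is a sketch like:

    (select * from actual except select * from expected)
    union all
    (select * from expected except select * from actual)
    -- "equal" when this returns 0 rows; duplicate rows are still invisible,
    -- though NULLs do compare as equal under EXCEPT's row comparison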

Additional context

There are definitely aspects of this that are database-specific. For example, in BigQuery, we can create static inputs as CTEs using ARRAYs of STRUCT types; other databases will have a different syntax or a more idiomatic method of creating static data for testing. In addition, to create static inputs in BigQuery as ARRAYs of STRUCTs, the data type of each column needs to be specified.
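For illustration, a BigQuery static-input CTE with the column types pinned explicitly might look like this (a sketch; the table and column names are made up):

    with mock_orders as (
        select *
        -- a typed array of structs: each column's type is declared up front
        from unnest(array<struct<order_id int64, created_at date>>[
            (1, date '2020-01-01'),
            (2, date '2020-01-02')
        ])
    )
    select * from mock_orders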

Who will this benefit?

I think all DBT users would benefit, especially large organizations where there will be frequent updates and many collaborators for a single model. Unit testing will give users more confidence that the changes they are making will not break existing behaviour.

@MichelleArk MichelleArk added enhancement New feature or request triage labels Apr 23, 2020
@drewbanin drewbanin removed the triage label Apr 23, 2020
@bashyroger

This would indeed be useful for all orgs where a focus on data quality is of the utmost importance. AFAIK this is a hard problem to solve on the people / process side of things (as you mention), and not something that has been done before for DATA unit testing.

@boxysean
Contributor

Happy to find this issue! This is also an enhancement that would be useful to my team. I think this type of testing falls outside of the two existing kinds of dbt tests: schema tests and data tests.

I've implemented a form of unit testing in my company's codebase. It currently executes via pytest to test PL/pgSQL transformations on Postgres, but I think the technique could be adapted to other databases and to dbt.

Implementation sketch

My test suite folder looks like this:

tests
  -> orders
      -> 000_basic_order                     # the name of the unit test
          README.md                          # further details about the unit test
          -> input                           # setup pre-conditions with SQL and data from YAML files
              ddl.sql                        # creates tables
              source_order_1000.yml          # data to populate the tables with
              source_order_2000.yml          # more data to populate the tables with
          -> expected_output                 # assertions to make after executing dbt run
              reporting.dim_order.yml        # records that must exist in table reporting.dim_order
              reporting.dim_order_item.yml   # records that must exist in table reporting.dim_order_item
      -> 001_reprocess_orders               # another unit test
          ...
      -> 002_customer_pickup_order
          ...
      ...

(selected sample files)

The algorithm looks as follows:

  1. For each unit test case folder,
    a. Run SQL files under the input directory
    b. Load static data from YAML files under the input directory
    c. Execute SQL transforms
    d. For each YAML file under the expected_output directory (which must be named after a table in the database), assert that each record exists and matches the YAML record.
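Step (d) boils down to one SQL assertion per expected_output file; roughly like this (a sketch - the fixture table and join keys are hypothetical):

    -- records from reporting.dim_order.yml are loaded into expected_dim_order;
    -- every one of them must exist, with matching values, in reporting.dim_order
    select e.*
    from expected_dim_order e
    left join reporting.dim_order a
      on  a.order_id = e.order_id
      and a.status   = e.status
    where a.order_id is null;
    -- the assertion passes when this returns 0 rows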

The approach is probably similar to what @MichelleArk has reported with CSVs. These tests are cumbersome to set up, and I haven't been able to convince my team to do this kind of testing yet. :-)

I see that Dataform has unit testing. I guess one advantage of their implementation is that they are generating the test dataset in the database. Since I am defining the data in YAML, there could be issues translating data types from YAML into the database under test.

@dm03514

dm03514 commented Feb 22, 2021

Hello! I would love to see this feature as part of dbt core. I created a small MVP that shows an approach allowing end users to write their tests in Python, using Python's built-in unit testing framework.

https://discourse.getdbt.com/t/dbt-model-think-unit-tests-poc/2160

One of the core design constraints was the ability to exercise models one at a time. This means that the framework needed to provide some mechanism for stubbing out ref/source.

The approach I took with the MVP listed above was to namespace the stubbed tables with a prefix, which is set as an environment variable. The following describes the logical steps the MVP test harness takes to stub out ref/source and provide test-defined data:

  • The model test package provides overridden ref/source macros
  • The dbt project installs the model test package
  • The dbt project defines a standard test_*.py unit test file
  • The dbt project subclasses a model test class and defines a standard python test function
  • The test creates a pandas dataframe for each ref/source that needs to be stubbed
  • The test calls into the model test function, which creates the tables from the dataframes
  • The test then invokes the dbt model with the model test framework toggled on, which rewrites ref/source to point at the tables generated from the pandas dataframes
  • The model is effectively executed against the stub data

This allows very focused model (“unit”) tests. Tests configure a couple of rows of stub data, exercise the model, and then assert on the output using a pandas dataframe, enabling targeted, fast testing of model transformation code.

I’m most likely going to move forward with this approach at my day job.

If anyone is interested, it should be relatively easy to convert this python approach to a “configuration” yaml approach.

I would love to hear your thoughts.

@jmriego
Contributor

jmriego commented Mar 1, 2021

Hi!

I wanted to continue the conversation from #2740.
I haven't had time to work on this, but we are prioritizing it again. I'll paste the same comment I wrote over there:

I have been playing around with a way to automate this and I have a working concept here: https://github.com/jmriego/dbt-bdd

The tests are run with behave, a library for BDD testing, which in my opinion is a great fit for DBT as it makes the tests easy to understand by analysts, the same way it already does for ELT.
Here's an example unit test for a model called fom that counts the number of days in a month:

Scenario: run a sample unit test
  Given calendar is loaded with this data
    | dt         | first_day_of_month |
    | 2020-01-01 | 2020-01-01         |
    | 2020-01-02 | 2020-01-01         |
    | 2020-02-02 | 2020-02-01         |
  When we run the load for fom
  Then the results of the model are
    | first_day_of_month | count_days |
    | 2020-01-01         | 2          |
    | 2020-02-01         | 1          |
The main trick is that the phrase "calendar is loaded with this data" will generate a seed file with that info and with a name specific to this test, generated automatically. It does that with a scenario id prefix (e.g. abcd124_calendar).

Then, it will replace all refs to calendar with abcd124_calendar. This is really the main concept; I didn't find a better solution, but it does so by passing dbt a var with the following key and value: {calendar: abcd124}. The code that detects the reference is here: https://github.com/jmriego/dbt-bdd/blob/master/macros/ref.sql
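The override amounts to roughly this shape (a simplified sketch, not the repo's exact code; it assumes dbt's builtins.ref is available):

    {% macro ref(model_name) %}
        {# if dbt was invoked with --vars '{calendar: abcd124}', redirect
           ref('calendar') to the generated seed abcd124_calendar #}
        {% set prefix = var(model_name, none) %}
        {% if prefix is not none %}
            {{ return(builtins.ref(prefix ~ '_' ~ model_name)) }}
        {% else %}
            {{ return(builtins.ref(model_name)) }}
        {% endif %}
    {% endmacro %}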
I played with the idea of just renaming the alias of the models, but then there's a conflict with the seeds created while setting up the test case.

I see @dm03514 that you also created something similar, but with pytest.

@jtcohen6 jtcohen6 added the dbt tests Issues related to built-in dbt testing functionality label Mar 10, 2021
@reubster

We've been experimenting with unit tests. We've decided to (probably) use SQL mocks rather than seeds because they're faster.

For any given model we have one or more __source.sql files and an __expected.sql.
We run equality / other tests against the __expected.
The source for the real model is controlled by a customised ref function, which is driven by variables / environment settings.
Selector tags group these into a set for easy invocation.

It took me a while to realise that there's a fundamental paradox: either I have to deploy the model and change its sources, or I have to have different "versions" of the model itself pointing at different sources, because the source needs to be instantiated and the model deployed before it can be tested. Ideally, though, this would be easier to control with config.

@jmriego
Contributor

jmriego commented May 19, 2021

Hi @reubster! How are you creating those SQL mocks? Do you mean that people writing the tests need to create fake source models and expected values with a SQL query similar to this?

SELECT 1 id, 'A' name, 100 value UNION ALL
SELECT 2 id, 'B' name, 200 value

@Zatte

Zatte commented Jun 15, 2021

Getting (static) test data into the database is something that will be test-dependent: sometimes you want small test data, and then writing SQL that mocks the data is doable (select ... union all select ...); medium data sets fit well inside yaml files; and large files can be provisioned with regular dbt tooling practices.

Since these all have different approaches / tooling, I would rather see a solution to unit testing where we can get a model's sql code (parsed), but where refs and sources (and potentially variables as well) can be overridden in a test-local scope. Assume such a macro exists (where we can get the compiled sql code) and that it is called cte; then a unit test could look like this (note how it requires no changes to the model code, and no new concepts are introduced to allow testing).

Contrived example
models/adder.sql

    SELECT 
      *,
      a+b AS sum
    FROM {{ ref('some_other_model') }}

/tests/adder_001.sql

{% set model_to_test = cte('model.adder')
      | replace(ref('some_other_model'), "mocked_src1") 
%}

WITH
mocked_src1 AS (
  SELECT
    *
  FROM
  -- bigquery construct; same effect as union all select ....
  UNNEST([
     struct(1 as a,  2 as b,  3 as expectedSum),
     struct(2 as a,  2 as b,  4 as expectedSum),
     struct(0 as a,  0 as b,  0 as expectedSum),
     struct(1 as a, -2 as b, -1 as expectedSum)
   ])
),

final AS (
  {{ model_to_test }}
)

-- Some SQL assertion based on the specific mocked data.
SELECT * FROM final WHERE sum != expectedSum

This example doesn't have proper ref/source overrides, just a simple string replace, but you get the general idea. How the data is sourced in mocked_src1 is up to the test designer to decide.

(This is mostly a copy from this thread, where I've tried to get some feedback on this approach: https://discourse.getdbt.com/t/testing-with-fixed-data-set/564/9)

@jmriego
Contributor

jmriego commented Jun 17, 2021

hi @Zatte ,

I really like that approach. It definitely feels more DBT-onic than what I was proposing. As you say, there might be multiple ways of filling data for testing depending on the size of the tests. Nothing stops the yaml I was proposing from generating these test sqls automatically, so it's not even like these two approaches are exclusive.
What you are describing seems like a good way of getting the number one requirement for any type of unit test: a way of overriding a ref/source. Getting the compiled SQL in a macro seems like a good way to do it.
It could even be enhanced to take a parameter with the overrides you want to make, instead of doing the replace separately.
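Something like this, as a hypothetical signature:

    {% set model_to_test = cte('model.adder', overrides={
        'some_other_model': 'mocked_src1'
    }) %}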

@noel

noel commented Jun 25, 2021

I see the preference is to have the mock data in some file in the repo. I am curious why not have a different database / schema for the mock data, e.g. raw_mock_data, and replace sources with that, maybe using a var and adding a tag to these tests so you can include/exclude them on a given run.

@Zatte

Zatte commented Jun 25, 2021

/../ maybe using a var and adding a tag to these tests so you can include/exclude them on a given run /.../

I personally would like all tests to be able to run using just dbt test; at most, cli-args can be used to separate unit/integration tests, just because they are orders of magnitude slower/faster to run.
The second you have (different) cli-args to run (different) tests, you end up needing tooling (e.g. make files) to ensure all test suites are run; otherwise you risk missing tests. To me this seems like an anti-pattern. Ideally, dbt test would contain just about everything you need to run tests, so no additional tooling is needed.

/../ not have a different database / schema for the mock data e.g. raw_mock_data and replace sources with that /.../

I think this approach can work in many situations, but not all. If you can only swap out the schema, then you are limited to swapping 1:1 between production and mock data. What if you want to test a model using different mocks, and/or one which depends on 2 or more tables (let's call them A and B)? Testing with mocks A1, A2, B1, B2, B3 and combinations of these would be difficult.

@noel

noel commented Jun 25, 2021

makes sense, thanks for clarifying.

@cdiniz

cdiniz commented Oct 4, 2021

Hi

I'm doing unit tests in dbt with a couple of custom macro helpers; it has a few trade-offs and impractical aspects.
A test looks like this:

{% set model_to_test = ref('covid_19_cases_per_day') %}

{% set input %}
insert into covid_19_ingestion(day, payload) values ('2021-05-05', '[{}]');
insert into covid_19_ingestion(day, payload) values ('2021-05-06','[{"newCases": 20}]');
{% endset %}

{% set expectations %}
select '2021-05-05'::timestamp as day, 0 as cases
union all
select '2021-05-06'::timestamp as day, 20 as cases
{% endset %}

{{ unit_test(model_to_test, input, expectations) }}

It looks good:

  • the data setup is a set of inserts into a table used by the model_to_test.
  • the model_to_test needs to be a view when running the tests (so I don't have to do a dbt run in the middle of the testing macro to update the model with the inserted data).

But it has a couple of flaws:

  • Imagine the model_to_test uses a model in the middle of a model chain. I need to insert the data into the root table and rely on the logic of the models in the middle until the data reaches the model_to_test, because I can't properly mock what is being used by the model_to_test. @Zatte's PR is trying to achieve this mocking capability with another workaround, by having the sql for the model.
  • Every time that I change the model_to_test, I need to re-run dbt run before dbt test to update the models.

So it's far from a perfect setup.

I was having a look at Dataform, and they have the concept of unit tests as a feature. In Dataform, for each unit test, we need to define the model that we want to test (as in my approach), and we always need to provide the input data for each model used by the model_to_test.

Rewriting my initial test in Dataform looks like this:

config {
  type: "test",
  dataset: "covid19_cases_per_day_dataform"
}

input "covid19" {
  SELECT '2021-05-05' as day, '[{}]' as payload UNION ALL
  SELECT '2021-05-06' as day, '[{"newCases": 20}]' as payload
}

select cast('2021-05-05' as timestamp) as day, 0 as cases
union all
select cast('2021-05-06' as timestamp) as day, 20 as cases

That's neat imo.
The test setup is close to the test expectations, and it's possible (and mandatory) to mock the underlying models.
Also, the neatest part of this approach is the test execution: it doesn't use the database models, so there is no need to update the models before running the tests.

That being said, I was impressed with the Dataform approach, and I think an approach like that is the way to go for dbt. I think it is slightly hard to put into a PR by an 'outsider', though. Could someone from the dbt team please share the road-map for unit tests?

@jtcohen6
Contributor

jtcohen6 commented Oct 7, 2021

Just chiming in to link a solid Slack thread, prompted by the comment above. This is a topic I'm very interested in, and one I'd be interested in revisiting, in earnest, next year.

@cdiniz

cdiniz commented Oct 19, 2021

Hi.
Following up on my last comment, the intent of this update is to share our journey with unit tests, which may be valuable for others. I've been working with @psousa50 to improve the previous approach, and we reached a better one.
In our current approach, a unit test looks like this:

{{
    config(
        tags=['unit-test'],
        model_under_test='covid19_cases_per_day'
    )
}}

{% set inputs %}
covid19_raw as (
select cast('2021-05-05' as date) as day, '[{}]' as payload
UNION ALL
select cast('2021-05-06' as date) as day, '[{"newCases":20}]' as payload
)
{% endset %}

{% set expectations %}
select cast('2021-05-05' as Date) as day, 0 as cases
UNION ALL
select cast('2021-05-06' as Date) as day, 20 as cases
{% endset %}
 
{{ unit_test(inputs, expectations) }}

The unit test is composed of 4 separate parts:

  • Configuration: using tags to mark the test as a unit-test, so we can execute it in a slightly different way
  • Inputs: a set of with clauses, where the name of each clause needs to match the name of the source being 'mocked' and the column names should match the actual columns
  • Expectations: a set of rows defined by selects, with the column names matching the model's column names
  • Boilerplate: we need to execute the custom macro

Under the hood, the unit_test macro constructs a big sql query that doesn't depend on the models, just on the inputs. Because of that, we don't need to do a dbt run to refresh the models each time we want to test a new change, so the feedback loop is seconds.
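To give an idea of the structure, a simplified sketch of what such a macro might expand to (a hypothetical reconstruction, not the exact code we'll publish; render_model_with_mocked_refs is a made-up helper standing in for however the model's compiled SQL gets its refs redirected to the mocked with clauses):

    {% macro unit_test(inputs, expectations) %}
    with
    {{ inputs }},

    actual as (
        -- the model's compiled SQL, with refs/sources pointing
        -- at the mocked with clauses above
        {{ render_model_with_mocked_refs(config.get('model_under_test')) }}
    ),

    expected as (
        {{ expectations }}
    )

    -- empty result = test passed
    (select * from actual except select * from expected)
    union all
    (select * from expected except select * from actual)
    {% endmacro %}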

We solved our main problem, which was mocking the sources of a model, and we also improved the feedback loop. Still, we have a couple of ideas for improvement based on our customers' feedback:

  • Right now all the columns need to be present when defining an input; if a column is not important for the test we can use 'null' as column_x, but that's not doable/maintainable when we have a big number of columns. So we want to make it easier to define the inputs, and we'll try to find a solution that creates the non-defined columns as null automatically.

We'll share the custom macros on the Equal Experts GitHub.

@dbt-labs dbt-labs locked and limited conversation to collaborators Dec 8, 2021
@jtcohen6 jtcohen6 converted this issue into discussion #4455 Dec 8, 2021
