Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add upper_bound_column to order by #660

Merged

Conversation

sfc-gh-ancoleman
Copy link
Contributor

@sfc-gh-ancoleman sfc-gh-ancoleman commented Aug 30, 2022

resolves #659

This is a:

  • documentation update
  • bug fix with no breaking changes
  • new functionality
  • a breaking change

All pull requests from community contributors should target the main branch (default).

Description & motivation

The order by clause(s) used in the window functions in the mutually_exclusive_ranges test give non-deterministic results when more than one range within a partition has the same lower_bound. If zero-length ranges are not allowed, the test will FAIL (as it should) if this occurs. But if zero length ranges are allowed and gaps are not required, it should be expected that one could have more than one range within a partition with the same lower_bound.

This PR changes the order by clause(s) to use {{ lower_bound_column }}, {{ upper_bound_column }} as the ordering criteria. This will ensure that when multiple ranges have the same lower_bound, they are sorted based on their upper_bound, which will place all zero-length ranges together, with any non-zero-length range with the same lower_bound appearing as the last record in this group, causing it's upper_bound to be compared to the next distinct lower_bound.

Since the test only considers the lower and upper bound columns, the fact that records with the same lower and upper bounds won't have a guaranteed order shouldn't be problematic, since these are effectively interchangeable as far as the test is concerned. So this should be enough to guarantee deterministic results (i.e., the test will always PASS or will always FAIL, unless changes are made to the dataset or to the test configuration).

Checklist

  • This code is associated with an Issue which has been triaged and accepted for development.
  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I followed guidelines to ensure that my changes will work on "non-core" adapters by:
    • dispatching any new macro(s) so non-core adapters can also use them (e.g. the star() source)
    • using the limit_zero() macro in place of the literal string: limit 0
    • using dbt.type_* macros instead of explicit datatypes (e.g. dbt.type_timestamp() instead of TIMESTAMP
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md

@joellabes
Copy link
Contributor

@sfc-gh-ancoleman I like this! Would you be able to add a test case just like your one from the initial issue which fails under the current setup (subject to the vagaries of non-deterministic sort algorithms at least) and passes under the new system?

There are sample tests in the integration_tests project in this repo - let me know if you need a hand digging into them

@sfc-gh-ancoleman
Copy link
Contributor Author

@joellabes Since I believe that I just identified an edge-case that the existing test data doesn't include, I modified the existing data_test_mutually_exclusive_ranges_with_gaps_zero_length.csv seed so that it includes a zero-length record for 2020-05-08 as well as a positive-length record that also begins on 2020-05-08. I placed the positive-length record above the zero-length record in a (possibly misguided) attempt to influence ordering.

I had initially also extended the final record for supscription_id 3 so that it overlaps with some records for subscription_id 4, so that if at some point in the future the macro is changed in a way that breaks the functioning of the partition_by argument, the test on the dataset I modified should fail. I have undone this change, since it's not really related to this issue, and there is probably a better place to incorporate such a test.

@joellabes
Copy link
Contributor

Works for me! Thanks @sfc-gh-ancoleman 🌟

@joellabes joellabes merged commit 53f8352 into dbt-labs:main Sep 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants