deduplicate macro for Databricks now uses the QUALIFY clause, which fixes NULL column issues from the default natural join logic #786
Conversation
@graciegoheen Thank you very much for creating this pull request. Dropping this macro into my project's macro folder saved me at least a half-day of "fix it" work and frantic testing after I stumbled into #713.

To the maintainers: This worked for me as a one-off solution in a Databricks-powered project at work. While I have not exhaustively tested it, it definitely fixed the "I am missing most of the data for some strange reason" problems I was seeing post-deduplication from #713.
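For context, the Databricks-specific implementation this PR introduces can be sketched roughly as follows. The exact macro body in the merged PR may differ slightly; the core idea is replacing the natural join with a QUALIFY filter on row_number(), so rows containing NULLs are no longer dropped:

```sql
{%- macro databricks__deduplicate(relation, partition_by, order_by) -%}
    -- Keep only the first row per partition, as defined by order_by.
    -- QUALIFY filters on the window function directly, so no self-join
    -- (and therefore no NULL-equality comparison) is needed.
    select *
    from {{ relation }}
    qualify
        row_number() over (
            partition by {{ partition_by }}
            order by {{ order_by }}
        ) = 1
{%- endmacro -%}
```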
@graciegoheen I'm going to merge this as-is. Rationale below.
It looks like we don't have circle-ci running integration tests for Databricks - should we add that?
I made an initial attempt at adding Databricks to CI, but it didn't work. The cause appears related to the integration tests using pre-releases of dbt-core (1.6.0bx) that are somehow incompatible with the latest available version of dbt-databricks (1.5.x).
So I'm going to defer that decision to a later date.
I can alternatively add this to spark_utils.
Since we already have databricks__get_table_types_sql
(#769), it seems reasonable to add databricks__deduplicate
here also (rather than putting it in spark_utils).
Additionally, I'd like to add integration tests to confirm this macro works for null values, but I think we would need to update the default version of this macro and remove the natural join entirely.
There's a simple test case in #713 that would work great. But if we add it to CI without changing the default implementation, then Redshift will start failing CI.
Since removing the natural join will take more time and thought, I'm going to defer adding new integration tests to a later date as well.
Related to issues described in #713 and #621
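To illustrate why the default natural-join approach loses rows (the behavior reported in #713 and #621): a natural join equates every shared column, and in SQL `NULL = NULL` evaluates to NULL rather than true, so any row containing a NULL in any column can never match back to itself. A simplified sketch of the failure mode (table and column names here are illustrative, not the actual macro body):

```sql
-- Simplified version of a natural-join-based dedup:
with row_numbered as (
    select
        src.*,
        row_number() over (
            partition by user_id
            order by updated_at desc
        ) as rn
    from events as src
)
select events.*
from events
natural join (
    select * from row_numbered where rn = 1
) as deduped;
-- Any events row with a NULL in any joined column fails the
-- NULL = NULL comparison and silently disappears from the output.
```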
All pull requests from community contributors should target the main branch (default).

Description & motivation
In Databricks, the natural join was causing issues for a customer. We can use the new QUALIFY clause in Databricks to update the deduplicate macro.
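For reference, calling the macro from a model looks roughly like this (the relation and column names are illustrative; see the dbt-utils README for the authoritative argument list):

```sql
-- Deduplicate staged events, keeping the most recent row per user.
with deduped as (
    {{ dbt_utils.deduplicate(
        relation=ref('stg_events'),
        partition_by='user_id',
        order_by='updated_at desc',
    ) }}
)
select * from deduped
```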
Checklist
- use the star() macro instead of * (see source)
- use the limit_zero() macro in place of the literal string: limit 0
- use dbt.type_* macros instead of explicit datatypes (e.g. dbt.type_timestamp() instead of TIMESTAMP)