Methods to achieve null safety for deduplicate
#815
from {{ relation }} as _inner
)

select *
What databases allow for `minus` or `except` syntax? I know Snowflake does - that could be an option for removing the extra column. Though maybe in that case you'd just use `qualify`.
How would `minus` or `except` work to remove extra column(s)? Do you mean `select * exclude ( <col_name>, <col_name>, ... )`?
This would be the perfect solution if we could rely on it! 💡
But it is not in the SQL standard, and the databases that don't have `qualify` are probably missing `select * exclude (...)` as well. So I don't think we'll be able to reliably use it as part of the default implementation 😢.
`select * exclude (...)`
Snowflake has `select * exclude`, and so does DuckDB.
`select * except (...)`
And because it's not in the standard, other databases use `except` instead of `exclude`.
BigQuery uses `except`, as does Databricks.
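As a rough illustration of the two spellings discussed here (the table `my_table` and the helper column `rn` are hypothetical, and each statement only runs on the databases named in its comment):

```sql
-- Snowflake and DuckDB: drop the helper column with EXCLUDE.
select * exclude (rn)
from my_table;

-- BigQuery and Databricks: the same idea is spelled EXCEPT.
select * except (rn)
from my_table;
```

Because neither form is standard SQL, a default implementation that relied on either one would fail on adapters that support neither.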
Sorry, yes I meant `exclude`. What about using the `star` macro with the `except` argument?
The initial implementation in #512 used the `star` macro but it was removed in #548.
I haven't considered the details of how we might be able to bring it back or what those implications would be.
I think we'd still need to handle the case where the `relation` is a CTE name instead of a Relation. That's the case that this draft PR is covering with the `row_alias` parameter. An alternative way to cover it would be a `columns` parameter like suggested here. Allowing the end user to choose between either `row_alias` or `columns` would provide the most optionality.
@graciegoheen your idea about using the `star` macro inspired fe03f43.
It retrieves columns similarly to `dbt_utils.star` IFF:
- `relation` is a Relation
- `relation` is not an ephemeral CTE

Otherwise, a user can pass a list of `columns` manually (d46676e). Or they can specify a `row_alias` that is acceptable to them.
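A sketch of what the manual path might look like from a model, assuming the `columns` keyword argument lands as drafted in this PR. The model name, CTE name, and column names are hypothetical.

```sql
-- models/deduped_example.sql (hypothetical model)
with events as (

    select * from {{ ref('my_events') }}

)

-- `events` is passed as a CTE name (a string), so its columns cannot be
-- looked up automatically; `columns` (draft keyword argument) lists them.
{{ dbt_utils.deduplicate(
    relation='events',
    partition_by='user_id',
    order_by='updated_at desc',
    columns=['user_id', 'updated_at', 'status']
) }}
```

Passing `row_alias='rn'` instead of `columns` would avoid listing the columns, at the cost of an extra `rn` column in the result.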
@@ -104,7 +104,10 @@ path: {}
{% set row_alias = kwargs.get('row_alias') %}
{% set columns = kwargs.get('columns') %}

{% if row_alias != None or columns != None %}
{% if relation.is_cte is defined and not relation.is_cte %}
`deduplicate` via `row_alias` keyword argument
`deduplicate`
This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.
resolves #814
resolves #621
This is a bug fix with no breaking changes.
It also adds two new features:
- `row_alias` keyword argument (type: string, default: `none`)
- `columns` keyword argument (type: list, default: `none`)

Description & motivation
This PR is still in draft status, and more description will be added at a later date.
In the meantime, see #814 (and everything it links to, in particular #713) for background motivation and discussion to-date.
As a summary, this PR gives the user multiple options to achieve null safety for `deduplicate`:
- If `relation` is not a CTE, its columns can be fetched via the `get_filtered_columns_in_relation` macro.
- `row_alias` keyword argument: if the `row_alias` keyword argument is set, then we can deduplicate via the `row_number()` window function (at the cost of the `row_alias` being an extra column that wasn't in the original data set).
- `columns` keyword argument: if the `columns` keyword argument is set, then we can deduplicate via the `row_number()` window function and only return the requested columns.

Outside of those options, the deduplication will not be null-safe.
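For orientation, here is roughly the shape of query that the `row_alias` and `columns` options describe. This is a hand-written sketch, not the macro's compiled output; `my_table`, `user_id`, `updated_at`, `status`, and `rn` are hypothetical names.

```sql
-- row_alias-style: keep `select *`, accept the extra `rn` column.
with numbered as (
    select
        _inner.*,
        row_number() over (
            partition by user_id       -- rows with a null user_id share one partition
            order by updated_at desc
        ) as rn
    from my_table as _inner
)
select *
from numbered
where rn = 1;

-- columns-style: list the wanted columns so `rn` is not returned.
with numbered as (
    select
        _inner.*,
        row_number() over (
            partition by user_id
            order by updated_at desc
        ) as rn
    from my_table as _inner
)
select user_id, updated_at, status
from numbered
where rn = 1;
```

Both variants filter on the row number directly rather than joining back on the partition columns, so a null partition key does not cause rows to be dropped by an equality comparison.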
Option 1
models/my_model_1.sql
models/deduped_1.sql
Option 2
models/my_model_2.sql
models/deduped_2.sql
Option 3
models/my_model_3.sql
models/deduped_3.sql
Option 4
models/my_model_4.sql
Warning
This is the one not guaranteed to be null-safe (depending on the adapter).
models/deduped_4.sql
Here's the warning that will be logged:
Key history of `deduplicate` macro
- `alias` argument to `deduplicate` macro #526
- `deduplicate()` arguments #548

Checklist