-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADAP-465] [Bug] Breaking schema changes for nested fields #673
Comments
@amrutha-sreedharane Thanks so much for opening, and for trying out the prerelease functionality! You're right, this is a gap we should look to close — we want you to be able to test/document the nested fields, and validate contracts on them. I think we could support this by recognizing the dots in the column names ( This should be doable with a reimplementation of |
Agree it's doable there - this is where the columns dictionary is created from the yaml spec. Arguably it'd be worth implementing this functionality (constructing a list of |
@MichelleArk It would be possible to implement this via a one-off Python method, on the BigQuery adapter only. Longer-term, we may want to move more of this logic into I played around with something like: @available.parse(lambda *a, **k: {})
@classmethod
def handle_nested_columns(self, columns: Dict[str, Dict[str, Any]]) -> Dict[str, str]:
final_columns = {}
struct_columns = {}
for column in columns.values():
# if this is a nested field in a struct, we need to add it to the struct
if '.' in column['name']:
struct_field_name, nested_name = column['name'].split('.')
nested_col = f"{nested_name} {column['data_type']}"
# have we seen this struct before? add this nested field in
if struct_field_name in struct_columns:
struct_columns[struct_field_name]['nested'].append(nested_col)
# otherwise, initialize the struct + first nested field
else:
struct_columns.update({struct_field_name: {"name": struct_field_name, "nested": [nested_col]}})
# this is a non-struct column
else:
final_columns.update({column['name']: column})
struct_formatted = {col["name"]: {"name": col['name'], "data_type": f"""struct<{", ".join(col['nested'])}>"""} for col in struct_columns.values()}
return {**final_columns, **struct_formatted} And then: {% macro bigquery__get_empty_schema_sql(columns) %}
{%- set columns = adapter.handle_nested_columns(columns) -%}
{{ return(dbt.default__get_empty_schema_sql(columns)) }}
{% endmacro %}
{% macro bigquery__get_select_subquery(sql) %}
{%- set columns = adapter.handle_nested_columns(model['columns']) -%}
select
{% for column in columns %}
{{ column }}{{ ", " if not loop.last }}
{% endfor %}
from (
{{ sql }}
) as model_subq
{%- endmacro %} This's close, but not quite there:
Acceptance case-- models/nested_fields.sql
select
'string' as a,
struct(
1 as id,
'name' as name,
struct(2 as id, 'another' as another) as double_nested
) as b # models/nested_fields.yml
models:
- name: nested_fields
config:
contract: {enforced: true}
materialized: table
columns:
- name: a
data_type: string
- name: b.id
data_type: integer
constraints:
- type: not_null
- name: b.name
data_type: string
- name: b.double_nested.id
data_type: integer
- name: b.double_nested.another
data_type: string Should pass the preflight check by producing SQL like: select * from (
select
cast(null as string) as a, cast(null as struct<id integer, name string, double_nested struct<id int, another string>>
) as b
) as __dbt_sbq
where false
limit 0 And final DDL like: create or replace table `dbt-test-env`.`dbt_jcohen`.`nested_fields`(
a string,
b struct<id integer not null, name string, double_nested struct<id int, another string>>
)
as (
select a, b
from (
select
'string' as a,
struct(
1 as id,
'name' as name,
struct(2 as id, 'another' as another) as double_nested
) as b
) as model_subq
); |
We also encountered this error when upgrading to 1.5. To add to the acceptance-case above, we also need to be able to add descriptions to the # models/nested_fields.yml
models:
- name: nested_fields
config:
contract: {enforced: true}
materialized: table
columns:
- name: a
data_type: string
- name: b
description: "A description of the b-struct"
- name: b.id
data_type: integer
constraints:
- type: not_null
- name: b.name
data_type: string
- name: b.double_nested.id
data_type: integer
- name: b.double_nested.another
data_type: string |
Re-labeling this as a |
Fixing the The Dataform docs happen to show a nested example, copied below. Being able to put columns within a column seems like the obvious way to do this (with less repetition and clearer structure than the
e.g. taking the above example, with nesting there's no need to repeat the # models/nested_fields.yml
models:
- name: nested_fields
config:
contract: {enforced: true}
materialized: table
columns:
- name: a
data_type: string
- name: b
description: "A description of the b-struct"
columns:
- name: id
data_type: integer
constraints:
- type: not_null
- name: name
data_type: string
- name: double_nested
columns:
- name: id
data_type: integer
- name: another
data_type: string |
This is a really nice example @elyobo ! Adopting such a syntax might even make the implementation easier (but might also make it harder 🤷). Definitely something for the implementer to consider. |
@jtcohen6 assigning to you as it relates to core contracts, but please remove yourself if you think you should be off the hook ;) |
@jtcohen6 mentioned it briefly in one of the comments but I just wanted to raise that this is also a problem preventing people to test their nested fields (which was working in 1.4). |
I haven't been able to reproduce a regression case - persisting descriptions and testing columns on models with nested fields all seem to work on 1.5 and the latest dbt-core HEAD. Both features do fail when the new Here's how I've tested for regression: -- schema.yml
models:
- name: nested_fields
config:
materialized: table
columns:
- name: a
data_type: string
description: "field a"
- name: b.id
data_type: integer
description: "nested field b.id"
tests:
- not_null -- dbt_project.yml
...
models:
+persist_docs:
relation: true
columns: true Column tests: @b-per @dbeatty10 - is there a regression case here that doesn't involve setting |
Good point about the regression. Yes, the issue is only with |
Gotcha! In that case I'm going to remove the regression label - this issue is still very much a priority though! |
Is this your first time submitting a feature request?
Describe the feature
As part of v1.5 and the new breaking schema changes identification feature, there is a new format required for nested fields. Our team was utilizing the
.
notation for nested fields. For example:However, as part of the new feature, to identify a breaking change to the contract, we now need to use this format:
This doesn't allow us to add a description or policy tags for nested fields. This is a blocker for us to use this feature. Please let us know if there is a way we can use the original syntax or define the description/policy tags in a different manner!
Describe alternatives you've considered
No response
Who will this benefit?
This will allow us and other teams in our company to use the contracts feature.
Are you interested in contributing this feature?
No response
Anything else?
#446
The text was updated successfully, but these errors were encountered: