[ADAP-465] [Bug] Breaking schema changes for nested fields #673

amrutha-sreedharane · 2023-04-19T20:34:06Z

Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

As part of v1.5 and the new breaking schema changes identification feature, there is a new format required for nested fields. Our team was utilizing the . notation for nested fields. For example:

model:
    - name: a
      policy_tags:
      data_type: string
    - name: b.id
       policy_tags:
       data_type: integer
    - name: b.name
       policy_tags:
       data_type: string

However, as part of the new feature, to identify a breaking change to the contract, we now need to use this format:

model:
      - name: a
        policy_tags:
        data_type: string
     - name: b
       data_type: struct<id integer, name string>

This doesn't allow us to add a description or policy tags for nested fields. This is a blocker for us to use this feature. Please let us know if there is a way we can use the original syntax or define the description/policy tags in a different manner!

Describe alternatives you've considered

No response

Who will this benefit?

This will allow us and other teams in our company to use the contracts feature.

Are you interested in contributing this feature?

No response

Anything else?

#446

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2023-04-20T17:20:26Z

@amrutha-sreedharane Thanks so much for opening, and for trying out the prerelease functionality! You're right, this is a gap we should look to close — we want you to be able to test/document the nested fields, and validate contracts on them.

I think we could support this by recognizing the dots in the column names (b.id, b.name), and then constructing the appropriate data type during the pre-flight check (cast(null as struct<id integer, name string>)). That wouldn't support columns that contain dots in their names — but I'm pretty sure BigQuery doesn't support column names containing special characters to begin with.

This should be doable with a reimplementation of get_empty_schema_sql. (@MichelleArk Does that feel like the right place, or can you think of a better one offhand?)

MichelleArk · 2023-04-20T19:46:26Z

This should be doable with a reimplementation of get_empty_schema_sql. (@MichelleArk Does that feel like the right place, or can you think of a better one offhand?)

Agree it's doable there - this is where the columns dictionary is created from the yaml spec. Arguably it'd be worth implementing this functionality (constructing a list of cast(null as <>) statements from a column dictionary) in python because its functionality the end user probably won't be interested in overriding, and has one 'correct' implementation per adapter. But it would be a larger lift to do that way and require changes across all adapters, so I'd start with an override of get_empty_schema_sql at this point.

jtcohen6 · 2023-04-21T11:47:58Z

@MichelleArk It would be possible to implement this via a one-off Python method, on the BigQuery adapter only. Longer-term, we may want to move more of this logic into dbt-core, to share similar code with other data platforms that also have strongly typed nested fields.

I played around with something like:

    @available.parse(lambda *a, **k: {})
    @classmethod
    def handle_nested_columns(self, columns: Dict[str, Dict[str, Any]]) -> Dict[str, str]:
        final_columns = {}
        struct_columns = {}
        for column in columns.values():
            # if this is a nested field in a struct, we need to add it to the struct
            if '.' in column['name']:
                struct_field_name, nested_name = column['name'].split('.')
                nested_col = f"{nested_name} {column['data_type']}"
                # have we seen this struct before? add this nested field in
                if struct_field_name in struct_columns:
                    struct_columns[struct_field_name]['nested'].append(nested_col)
                # otherwise, initialize the struct + first nested field
                else:
                    struct_columns.update({struct_field_name: {"name": struct_field_name, "nested": [nested_col]}})
            # this is a non-struct column
            else:
                final_columns.update({column['name']: column})
        
        struct_formatted = {col["name"]: {"name": col['name'], "data_type": f"""struct<{", ".join(col['nested'])}>"""} for col in struct_columns.values()}
        return {**final_columns, **struct_formatted}

And then:

{% macro bigquery__get_empty_schema_sql(columns) %}
    {%- set columns = adapter.handle_nested_columns(columns) -%}
    {{ return(dbt.default__get_empty_schema_sql(columns)) }}
{% endmacro %}

{% macro bigquery__get_select_subquery(sql) %}
    {%- set columns = adapter.handle_nested_columns(model['columns']) -%}
    select
    {% for column in columns %}
      {{ column }}{{ ", " if not loop.last }}
    {% endfor %}
    from (
        {{ sql }}
    ) as model_subq
{%- endmacro %}

This's close, but not quite there:

We'd also need a custom implementation of bigquery__get_table_columns_and_constraints (macro), or maybe just render_raw_columns_constraints, to handle the DDL that gets templated up top within the create table as.
Which made me realize, BQ supports constraints on nested fields.
BQ also supports arbitrary levels of nesting (apparently the soft limit is 100?), so we need the "handle structs" function to be recursive :)

Acceptance case

-- models/nested_fields.sql
select
  'string' as a,
  struct(
    1 as id,
    'name' as name,
    struct(2 as id, 'another' as another) as double_nested
  ) as b

# models/nested_fields.yml
models:
  - name: nested_fields
    config:
      contract: {enforced: true}
      materialized: table
    columns:
      - name: a
        data_type: string
      - name: b.id
        data_type: integer
        constraints:
          - type: not_null
      - name: b.name
        data_type: string
      - name: b.double_nested.id
        data_type: integer
      - name: b.double_nested.another
        data_type: string

Should pass the preflight check by producing SQL like:

select * from (
    select
cast(null as string) as a, cast(null as struct<id integer, name string, double_nested struct<id int, another string>>
) as b
) as __dbt_sbq
where false
limit 0

And final DDL like:

create or replace table `dbt-test-env`.`dbt_jcohen`.`nested_fields`(
    a string,
    b struct<id integer not null, name string, double_nested struct<id int, another string>>
)
as (
  select a, b
  from (
    select
      'string' as a,
      struct(
        1 as id,
        'name' as name,
        struct(2 as id, 'another' as another) as double_nested
      ) as b
  ) as model_subq
);

bendiktv2 · 2023-05-02T08:29:48Z

We also encountered this error when upgrading to 1.5.

To add to the acceptance-case above, we also need to be able to add descriptions to the STRUCT column itself, so the yml should be something like:

# models/nested_fields.yml
models:
  - name: nested_fields
    config:
      contract: {enforced: true}
      materialized: table
    columns:
      - name: a
        data_type: string
      - name: b
        description: "A description of the b-struct"
      - name: b.id
        data_type: integer
        constraints:
          - type: not_null
      - name: b.name
        data_type: string
      - name: b.double_nested.id
        data_type: integer
      - name: b.double_nested.another
        data_type: string

dbeatty10 · 2023-05-02T12:31:34Z

Re-labeling this as a regression since the functionality in dbt-labs/dbt-core#2549 is working in dbt-bigquery 1.4.x but not in 1.5.x.

elyobo · 2023-05-04T06:42:57Z

Fixing the . format to work again is good, but nesting the columns seems much more intuitive if this was ever to be intentionally revised - this is how Dataform does it and it took me a while to figure out how to make it work in dbt when we converted.

The Dataform docs happen to show a nested example, copied below. Being able to put columns within a column seems like the obvious way to do this (with less repetition and clearer structure than the . format).

config {
  type: "table",
  description: "This table defines something",
  columns: {
    column1: "A test column",
    column2: "A test column",
    record1: {
      description: "A struct",
      columns: {
        column3: "A nested column"
      }
    }
  }
}

e.g. taking the above example, with nesting there's no need to repeat the b name, and the structure is visually much clearer

# models/nested_fields.yml
models:
  - name: nested_fields
    config:
      contract: {enforced: true}
      materialized: table
    columns:
      - name: a
        data_type: string
      - name: b
        description: "A description of the b-struct"
        columns:
          - name: id
            data_type: integer
            constraints:
              - type: not_null
          - name: name
            data_type: string
          - name: double_nested
             columns:
               - name: id
                 data_type: integer
               - name: another
                 data_type: string

dbeatty10 · 2023-05-04T16:14:57Z

This is a really nice example @elyobo !

Adopting such a syntax might even make the implementation easier (but might also make it harder 🤷). Definitely something for the implementer to consider.

Fleid · 2023-05-09T21:24:42Z

@jtcohen6 assigning to you as it relates to core contracts, but please remove yourself if you think you should be off the hook ;)

b-per · 2023-05-23T14:02:30Z

@jtcohen6 mentioned it briefly in one of the comments but I just wanted to raise that this is also a problem preventing people to test their nested fields (which was working in 1.4).

MichelleArk · 2023-05-24T17:48:27Z

I haven't been able to reproduce a regression case - persisting descriptions and testing columns on models with nested fields all seem to work on 1.5 and the latest dbt-core HEAD. Both features do fail when the new contract: {enforced: true} config is added - so there's definitely an issue but not a regression because contract: {enforced: true} is new in 1.5.

Here's how I've tested for regression:

-- schema.yml
models:
  - name: nested_fields
    config:
      materialized: table
    columns:
      - name: a
        data_type: string
        description: "field a"
      - name: b.id
        data_type: integer
        description: "nested field b.id"
        tests:
          - not_null

-- dbt_project.yml
...
models:
  +persist_docs:
    relation: true
    columns: true

Docs persisted:

Column tests:
dbt run --nested_fields and dbt test --nested_fields passed

@b-per @dbeatty10 - is there a regression case here that doesn't involve setting contract: {enforced: true}?

b-per · 2023-05-25T06:17:58Z

Good point about the regression.

Yes, the issue is only with contract: {enforced: true}, so while this is not purely a regression (e.g. if people upgrade to 1.5 without setting up contracts, it is not raising issues), they can't use contracts without losing the ability to test/document nested fields.

MichelleArk · 2023-05-26T15:36:27Z

Gotcha! In that case I'm going to remove the regression label - this issue is still very much a priority though!

amrutha-sreedharane added enhancement New feature or request triage labels Apr 19, 2023

github-actions bot changed the title ~~[Feature] Breaking schema changes for nested fields~~ [ADAP-465] [Feature] Breaking schema changes for nested fields Apr 19, 2023

jtcohen6 removed the triage label Apr 20, 2023

jtcohen6 mentioned this issue Apr 24, 2023

[CT-2421] [Epic] Multi-project collaboration - Milestone 2 dbt-labs/dbt-core#7372

Closed

bendiktv2 mentioned this issue May 2, 2023

[ADAP-505] [Regression] v1.5 breaks models with descriptions of BigQuery struct columns #691

Closed

2 tasks

dbeatty10 added bug Something isn't working regression and removed enhancement New feature or request labels May 2, 2023

dbeatty10 changed the title ~~[ADAP-465] [Feature] Breaking schema changes for nested fields~~ [ADAP-465] [Regression] Breaking schema changes for nested fields May 2, 2023

lauriemerrell mentioned this issue May 9, 2023

Revert "upgrade dbt-bigquery and re-generate artifact types" cal-itp/data-infra#2570

Merged

Fleid assigned jtcohen6 May 9, 2023

leahwicz assigned emmyoop and unassigned jtcohen6 May 9, 2023

github-actions bot added the triage label May 23, 2023

dbeatty10 removed the triage label May 23, 2023

leahwicz assigned MichelleArk and unassigned emmyoop May 24, 2023

github-actions bot added the triage label May 24, 2023

Fleid removed the triage label May 24, 2023

Fleid added the awaiting_response label May 24, 2023

MichelleArk mentioned this issue May 24, 2023

model contracts + constraints with nested fields #738

Merged

6 tasks

github-actions bot added triage and removed awaiting_response labels May 25, 2023

MichelleArk removed the regression label May 26, 2023

MichelleArk changed the title ~~[ADAP-465] [Regression] Breaking schema changes for nested fields~~ [ADAP-465] [Bug] Breaking schema changes for nested fields May 26, 2023

MichelleArk removed the triage label May 26, 2023

MichelleArk closed this as completed in #738 Jun 14, 2023

jtcohen6 mentioned this issue Jun 22, 2023

[ADAP-638] [Bug] dbt crashes if for nested fields (e.g. struct) both the parent and subfields are defined in the YAML file #782

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ADAP-465] [Bug] Breaking schema changes for nested fields #673

[ADAP-465] [Bug] Breaking schema changes for nested fields #673

amrutha-sreedharane commented Apr 19, 2023 •

edited

Loading

jtcohen6 commented Apr 20, 2023 •

edited

Loading

MichelleArk commented Apr 20, 2023 •

edited

Loading

jtcohen6 commented Apr 21, 2023

bendiktv2 commented May 2, 2023

dbeatty10 commented May 2, 2023

elyobo commented May 4, 2023 •

edited

Loading

dbeatty10 commented May 4, 2023

Fleid commented May 9, 2023

b-per commented May 23, 2023

MichelleArk commented May 24, 2023

b-per commented May 25, 2023

MichelleArk commented May 26, 2023

[ADAP-465] [Bug] Breaking schema changes for nested fields #673

[ADAP-465] [Bug] Breaking schema changes for nested fields #673

Comments

amrutha-sreedharane commented Apr 19, 2023 • edited Loading

Is this your first time submitting a feature request?

Describe the feature

Describe alternatives you've considered

Who will this benefit?

Are you interested in contributing this feature?

Anything else?

jtcohen6 commented Apr 20, 2023 • edited Loading

MichelleArk commented Apr 20, 2023 • edited Loading

jtcohen6 commented Apr 21, 2023

Acceptance case

bendiktv2 commented May 2, 2023

dbeatty10 commented May 2, 2023

elyobo commented May 4, 2023 • edited Loading

dbeatty10 commented May 4, 2023

Fleid commented May 9, 2023

b-per commented May 23, 2023

MichelleArk commented May 24, 2023

b-per commented May 25, 2023

MichelleArk commented May 26, 2023

amrutha-sreedharane commented Apr 19, 2023 •

edited

Loading

jtcohen6 commented Apr 20, 2023 •

edited

Loading

MichelleArk commented Apr 20, 2023 •

edited

Loading

elyobo commented May 4, 2023 •

edited

Loading