Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-465] [Bug] Breaking schema changes for nested fields #673

Closed
3 tasks done
Tracked by #7372
amrutha-sreedharane opened this issue Apr 19, 2023 · 12 comments · Fixed by #738
Closed
3 tasks done
Tracked by #7372

[ADAP-465] [Bug] Breaking schema changes for nested fields #673

amrutha-sreedharane opened this issue Apr 19, 2023 · 12 comments · Fixed by #738
Assignees
Labels
bug Something isn't working

Comments

@amrutha-sreedharane
Copy link

amrutha-sreedharane commented Apr 19, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

As part of v1.5 and the new breaking schema changes identification feature, there is a new format required for nested fields. Our team was utilizing the . notation for nested fields. For example:

model:
    - name: a
      policy_tags:
      data_type: string
    - name: b.id
       policy_tags:
       data_type: integer
    - name: b.name
       policy_tags:
       data_type: string

However, as part of the new feature, to identify a breaking change to the contract, we now need to use this format:

model:
      - name: a
        policy_tags:
        data_type: string
     - name: b
       data_type: struct<id integer, name string>

This doesn't allow us to add a description or policy tags for nested fields. This is a blocker for us to use this feature. Please let us know if there is a way we can use the original syntax or define the description/policy tags in a different manner!

Describe alternatives you've considered

No response

Who will this benefit?

This will allow us and other teams in our company to use the contracts feature.

Are you interested in contributing this feature?

No response

Anything else?

#446

@amrutha-sreedharane amrutha-sreedharane added enhancement New feature or request triage labels Apr 19, 2023
@github-actions github-actions bot changed the title [Feature] Breaking schema changes for nested fields [ADAP-465] [Feature] Breaking schema changes for nested fields Apr 19, 2023
@jtcohen6
Copy link
Contributor

jtcohen6 commented Apr 20, 2023

@amrutha-sreedharane Thanks so much for opening, and for trying out the prerelease functionality! You're right, this is a gap we should look to close — we want you to be able to test/document the nested fields, and validate contracts on them.

I think we could support this by recognizing the dots in the column names (b.id, b.name), and then constructing the appropriate data type during the pre-flight check (cast(null as struct<id integer, name string>)). That wouldn't support columns that contain dots in their names — but I'm pretty sure BigQuery doesn't support column names containing special characters to begin with.

This should be doable with a reimplementation of get_empty_schema_sql. (@MichelleArk Does that feel like the right place, or can you think of a better one offhand?)

@jtcohen6 jtcohen6 removed the triage label Apr 20, 2023
@MichelleArk
Copy link
Contributor

MichelleArk commented Apr 20, 2023

This should be doable with a reimplementation of get_empty_schema_sql. (@MichelleArk Does that feel like the right place, or can you think of a better one offhand?)

Agree it's doable there - this is where the columns dictionary is created from the yaml spec. Arguably it'd be worth implementing this functionality (constructing a list of cast(null as <>) statements from a column dictionary) in python because its functionality the end user probably won't be interested in overriding, and has one 'correct' implementation per adapter. But it would be a larger lift to do that way and require changes across all adapters, so I'd start with an override of get_empty_schema_sql at this point.

@jtcohen6
Copy link
Contributor

@MichelleArk It would be possible to implement this via a one-off Python method, on the BigQuery adapter only. Longer-term, we may want to move more of this logic into dbt-core, to share similar code with other data platforms that also have strongly typed nested fields.

I played around with something like:

    @available.parse(lambda *a, **k: {})
    @classmethod
    def handle_nested_columns(self, columns: Dict[str, Dict[str, Any]]) -> Dict[str, str]:
        final_columns = {}
        struct_columns = {}
        for column in columns.values():
            # if this is a nested field in a struct, we need to add it to the struct
            if '.' in column['name']:
                struct_field_name, nested_name = column['name'].split('.')
                nested_col = f"{nested_name} {column['data_type']}"
                # have we seen this struct before? add this nested field in
                if struct_field_name in struct_columns:
                    struct_columns[struct_field_name]['nested'].append(nested_col)
                # otherwise, initialize the struct + first nested field
                else:
                    struct_columns.update({struct_field_name: {"name": struct_field_name, "nested": [nested_col]}})
            # this is a non-struct column
            else:
                final_columns.update({column['name']: column})
        
        struct_formatted = {col["name"]: {"name": col['name'], "data_type": f"""struct<{", ".join(col['nested'])}>"""} for col in struct_columns.values()}
        return {**final_columns, **struct_formatted}

And then:

{% macro bigquery__get_empty_schema_sql(columns) %}
    {%- set columns = adapter.handle_nested_columns(columns) -%}
    {{ return(dbt.default__get_empty_schema_sql(columns)) }}
{% endmacro %}

{% macro bigquery__get_select_subquery(sql) %}
    {%- set columns = adapter.handle_nested_columns(model['columns']) -%}
    select
    {% for column in columns %}
      {{ column }}{{ ", " if not loop.last }}
    {% endfor %}
    from (
        {{ sql }}
    ) as model_subq
{%- endmacro %}

This's close, but not quite there:

Acceptance case

-- models/nested_fields.sql
select
  'string' as a,
  struct(
    1 as id,
    'name' as name,
    struct(2 as id, 'another' as another) as double_nested
  ) as b
# models/nested_fields.yml
models:
  - name: nested_fields
    config:
      contract: {enforced: true}
      materialized: table
    columns:
      - name: a
        data_type: string
      - name: b.id
        data_type: integer
        constraints:
          - type: not_null
      - name: b.name
        data_type: string
      - name: b.double_nested.id
        data_type: integer
      - name: b.double_nested.another
        data_type: string

Should pass the preflight check by producing SQL like:

select * from (
    select
cast(null as string) as a, cast(null as struct<id integer, name string, double_nested struct<id int, another string>>
) as b
) as __dbt_sbq
where false
limit 0

And final DDL like:

create or replace table `dbt-test-env`.`dbt_jcohen`.`nested_fields`(
    a string,
    b struct<id integer not null, name string, double_nested struct<id int, another string>>
)
as (
  select a, b
  from (
    select
      'string' as a,
      struct(
        1 as id,
        'name' as name,
        struct(2 as id, 'another' as another) as double_nested
      ) as b
  ) as model_subq
);

@bendiktv2
Copy link

We also encountered this error when upgrading to 1.5.

To add to the acceptance-case above, we also need to be able to add descriptions to the STRUCT column itself, so the yml should be something like:

# models/nested_fields.yml
models:
  - name: nested_fields
    config:
      contract: {enforced: true}
      materialized: table
    columns:
      - name: a
        data_type: string
      - name: b
        description: "A description of the b-struct"
      - name: b.id
        data_type: integer
        constraints:
          - type: not_null
      - name: b.name
        data_type: string
      - name: b.double_nested.id
        data_type: integer
      - name: b.double_nested.another
        data_type: string

@dbeatty10 dbeatty10 added bug Something isn't working regression and removed enhancement New feature or request labels May 2, 2023
@dbeatty10 dbeatty10 changed the title [ADAP-465] [Feature] Breaking schema changes for nested fields [ADAP-465] [Regression] Breaking schema changes for nested fields May 2, 2023
@dbeatty10
Copy link
Contributor

Re-labeling this as a regression since the functionality in dbt-labs/dbt-core#2549 is working in dbt-bigquery 1.4.x but not in 1.5.x.

@elyobo
Copy link

elyobo commented May 4, 2023

Fixing the . format to work again is good, but nesting the columns seems much more intuitive if this was ever to be intentionally revised - this is how Dataform does it and it took me a while to figure out how to make it work in dbt when we converted.

The Dataform docs happen to show a nested example, copied below. Being able to put columns within a column seems like the obvious way to do this (with less repetition and clearer structure than the . format).

config {
  type: "table",
  description: "This table defines something",
  columns: {
    column1: "A test column",
    column2: "A test column",
    record1: {
      description: "A struct",
      columns: {
        column3: "A nested column"
      }
    }
  }
}

e.g. taking the above example, with nesting there's no need to repeat the b name, and the structure is visually much clearer

# models/nested_fields.yml
models:
  - name: nested_fields
    config:
      contract: {enforced: true}
      materialized: table
    columns:
      - name: a
        data_type: string
      - name: b
        description: "A description of the b-struct"
        columns:
          - name: id
            data_type: integer
            constraints:
              - type: not_null
          - name: name
            data_type: string
          - name: double_nested
             columns:
               - name: id
                 data_type: integer
               - name: another
                 data_type: string

@dbeatty10
Copy link
Contributor

This is a really nice example @elyobo !

Adopting such a syntax might even make the implementation easier (but might also make it harder 🤷). Definitely something for the implementer to consider.

@Fleid
Copy link
Contributor

Fleid commented May 9, 2023

@jtcohen6 assigning to you as it relates to core contracts, but please remove yourself if you think you should be off the hook ;)

@leahwicz leahwicz assigned emmyoop and unassigned jtcohen6 May 9, 2023
@b-per
Copy link
Contributor

b-per commented May 23, 2023

@jtcohen6 mentioned it briefly in one of the comments but I just wanted to raise that this is also a problem preventing people to test their nested fields (which was working in 1.4).

@dbeatty10 dbeatty10 removed the triage label May 23, 2023
@leahwicz leahwicz assigned MichelleArk and unassigned emmyoop May 24, 2023
@MichelleArk
Copy link
Contributor

I haven't been able to reproduce a regression case - persisting descriptions and testing columns on models with nested fields all seem to work on 1.5 and the latest dbt-core HEAD. Both features do fail when the new contract: {enforced: true} config is added - so there's definitely an issue but not a regression because contract: {enforced: true} is new in 1.5.

Here's how I've tested for regression:

-- schema.yml
models:
  - name: nested_fields
    config:
      materialized: table
    columns:
      - name: a
        data_type: string
        description: "field a"
      - name: b.id
        data_type: integer
        description: "nested field b.id"
        tests:
          - not_null
-- dbt_project.yml
...
models:
  +persist_docs:
    relation: true
    columns: true

Docs persisted:
Screen Shot 2023-05-24 at 12 34 03 PM

Column tests:
dbt run --nested_fields and dbt test --nested_fields passed

@b-per @dbeatty10 - is there a regression case here that doesn't involve setting contract: {enforced: true}?

@Fleid Fleid removed the triage label May 24, 2023
@b-per
Copy link
Contributor

b-per commented May 25, 2023

Good point about the regression.

Yes, the issue is only with contract: {enforced: true}, so while this is not purely a regression (e.g. if people upgrade to 1.5 without setting up contracts, it is not raising issues), they can't use contracts without losing the ability to test/document nested fields.

@MichelleArk
Copy link
Contributor

Gotcha! In that case I'm going to remove the regression label - this issue is still very much a priority though!

@MichelleArk MichelleArk changed the title [ADAP-465] [Regression] Breaking schema changes for nested fields [ADAP-465] [Bug] Breaking schema changes for nested fields May 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants