Skip to content

Conversation

@thejens
Copy link
Contributor

@thejens thejens commented Apr 14, 2021

With this change we implement a new operator that handles patching of table schemas in bigquery.

This is needed as typing out an entire schema data structure (schema), in order to set e.g. a field description on a single field requires a lot of overhead. Also, many times the schema is not known or very complex as it may be the result of a Query or parsed automatically when importing files as tables.

This operator is useful for a workflow like:
Upstream: Create a BigQuery table as the output of a Query or import operator. Writer of job/operator knows the names of the fields, perhaps the types, but not necessarily how other schema fields are defined.

Downstream (this operator): Supply a partial schema definition that only contains field names and description values that will be patched on to the "generated by bigquery" schema from upstream.

@boring-cyborg boring-cyborg bot added area:providers kind:documentation provider:google Google (including GCP) related issues labels Apr 14, 2021
@thejens thejens force-pushed the master branch 6 times, most recently from 5d878e2 to a797f32 Compare April 14, 2021 10:51
@thejens thejens force-pushed the master branch 4 times, most recently from 0f8cca8 to 791d158 Compare April 14, 2021 12:17
@thejens thejens force-pushed the master branch 2 times, most recently from 630d250 to 0fb0247 Compare April 16, 2021 15:29
@github-actions
Copy link

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@thejens
Copy link
Contributor Author

thejens commented Apr 19, 2021

@marcosmarxm
Before I change more; I believe these are the three outstanding items from your review. How strongly do you feel about these? As in - before I go ahead and modify I'm keen to present some "why's" for my decisions...

How important would you say it is to move functionality into a hook? Solving the use-case was easy using two already existing hooks and I don't think I need to create an additional hook given I don't implement any new functionality in how we talk with BigQuery

Further I think mutating a mutable object shouldn't be a problem in this case and I am not too keen to deepcopy it for the sake of it.

Lastly the naming of the schema_fields parameter, I wanted to be consistent with naming and input format of that field between this and other operators, do you think it's a problem? Changing it into updates_to_schema_fields or similar is an easy possibility.

@marcosmarxm
Copy link
Contributor

@marcosmarxm
Before I change more; I believe these are the three outstanding items from your review. How strongly do you feel about these? As in - before I go ahead and modify I'm keen to present some "why's" for my decisions...

How important would you say it is to move functionality into a hook? Solving the use-case was easy using two already existing hooks and I don't think I need to create an additional hook given I don't implement any new functionality in how we talk with BigQuery

Further I think mutating a mutable object shouldn't be a problem in this case and I am not too keen to deepcopy it for the sake of it.

Lastly the naming of the schema_fields parameter, I wanted to be consistent with naming and input format of that field between this and other operators, do you think it's a problem? Changing it into updates_to_schema_fields or similar is an easy possibility.

  1. This is my point of view, but makes more sense you create a function call update_schema in the BigQueryHook. Why? Follow the convention, all functions that interact with bigquery are in the Hook not with Operators, in the future if someone search for a method to update_schema in the hook they won't find it. I know that the function you created doesn't interact directly with bigquery but you are creating the action update_schema and this interacts. Imagine a user reading the docs: BigQuery Hook and BigQuery Operators. Maybe you can talk to a PMC/Committer to get better feedback on this topic =D
  2. It's not a problem and maybe tricky is not the best word to describe. It's just a suggestion to be more readable and easy to follow the execute flow. Again maybe a PMC/Committer can you better on this also
  3. I read other operators and it's not a problem keeping the schema_fields. It's a suggestion to think more about users because keep the same name users think they need to send the info as in other Operators? (so it's a minor in my opinion)

The CI broke because of pylint can you run pre-commit locally to organize imports?

@thejens thejens force-pushed the master branch 2 times, most recently from ee376d1 to 3f3f5b7 Compare April 23, 2021 16:51
@thejens thejens changed the title Implement BigQuery Table Schema Patch Operator Implement BigQuery Table Schema Update Operator Apr 23, 2021
@thejens
Copy link
Contributor Author

thejens commented Apr 23, 2021

@marcosmarxm
Thanks for the feedback, I broke out the functionality into a new hook and added a unit-test for it. I also changed the name of the parameter and made the recursive function not rely on mutability.

@thejens thejens force-pushed the master branch 2 times, most recently from b059345 to bd9af54 Compare April 23, 2021 17:44
@thejens thejens force-pushed the master branch 2 times, most recently from 22a7093 to 7e128e2 Compare April 27, 2021 07:53
@github-actions
Copy link

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

@thejens thejens force-pushed the master branch 2 times, most recently from 89964c5 to 0f960c3 Compare April 30, 2021 13:20
@thejens thejens force-pushed the master branch 2 times, most recently from 044e0d0 to 8d8bc74 Compare April 30, 2021 13:32
Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

@potiuk potiuk merged commit cf6324e into apache:master May 4, 2021
@potiuk
Copy link
Member

potiuk commented May 4, 2021

Thanks for the thorough reviews @marcosmarxm and @tswast. Great jiob @thejens !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers kind:documentation provider:google Google (including GCP) related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants