Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate Coral's RelNode for views from base table schema #409

Merged
merged 5 commits into from
May 19, 2023

Conversation

aastha25
Copy link
Contributor

@aastha25 aastha25 commented May 15, 2023

What changes are proposed in this pull request, and why are they necessary?

This PR includes the following fixes:
(1) Includes changes from #394 which aimed to remove CAST in HiveViewTable.toRel to solve Nested View Stale Schema Issue.
(2) Enables avro schema generation for a dataset with map field with NULL key
(3) When a field schema is a UnionType of a single data type, such as uniontype<array<string>>, surface the underlying field & its schema in Coral's SqlNode RelNode representation, such as array<string>. SingleUnionFieldReferenceTransformer has been introduced for backward compatibility.

For a table t1: (f1: uniontype ) and a view v1 defined on top of it SELECT custom_udf_to_remove_single_type_union(f1) AS f2 FROM t1
v1 has a schema: (f2: int)
The RelNode representation, prior to dropping the CAST operator in the RelNode, would force the derived data type to be int for f2, as expected. Hence, the generated avro schema could be analyzed by the Spark engine. However, after dropping the CAST, as documented here, the derived type is Struct<tag_0:int>. The Avro schema is incorrectly generated and cannot be analysed by the Spark engine.

The resolution is to derive the data type of single uniontype fields like f1 as the underlying data type (int in this case).

How was this patch tested?

./gradlew spotlessApply
./gradlew build
Updated UTs
Thoroughly tested production views across all supported engines.

@ljfgem
Copy link
Collaborator

ljfgem commented May 15, 2023

@aastha25: Thanks for this PR. Could we create a PR for union type handling only and keep #394 ? Or keep the author information for the part in #394 if this PR can't be splitted?

@aastha25
Copy link
Contributor Author

@ljfgem Unfortunately the PR cant be split up. All these changes need to go in together. I have updated the author's information for the commits related to #394.

Copy link
Collaborator

@ljfgem ljfgem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @aastha25 for this PR!
Could you please add more details to the PR description and the code comment to clarify the necessity of the uniontype part changes? i.e. why it will fail if we don't make those changes, preferably with simple example?

@aastha25
Copy link
Contributor Author

@ljfgem I have added an example clarifying the usecase in the PR description.

Copy link
Collaborator

@ljfgem ljfgem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @aastha25 !

@aastha25 aastha25 merged commit 76789bf into linkedin:master May 19, 2023
@aastha25 aastha25 deleted the singleUnionType branch May 23, 2023 16:53
@wmoustafa
Copy link
Contributor

wmoustafa commented May 28, 2023

Should not this PR change the behavior of CoalesceStructUtility so that if the input is a field with single type, it returns a struct? (e.g., extract_union(string) returns struct<tag_0: string>).

@aastha25
Copy link
Contributor Author

This PR sets the return type of extract_union(string) as string, making the operator a no-op. I can modify the comment on the class to document this change if required.

@wmoustafa
Copy link
Contributor

Discussed offline. We will create a follow-up PR to address this since the current approach: 1- breaks extract_union contract of returning only structs, 2- manages this new contract in another unrelated operator (DOT) with a method to guess the context.

aastha25 added a commit to aastha25/coral that referenced this pull request Feb 12, 2024
aastha25 added a commit that referenced this pull request Feb 13, 2024
…) (#489)

* Revert "Generate Coral's RelNode for views from base table schema (#409)"

This reverts commit 76789bf.

* ./gradlew spotlessApply

* modify workflow action
KevinGe00 pushed a commit to KevinGe00/coral that referenced this pull request Feb 22, 2024
…nkedin#409) (linkedin#489)

* Revert "Generate Coral's RelNode for views from base table schema (linkedin#409)"

This reverts commit 76789bf.

* ./gradlew spotlessApply

* modify workflow action
KevinGe00 added a commit to KevinGe00/coral that referenced this pull request May 8, 2024
…hema (linkedin#409) (linkedin#489)"

This reverts commit 40f6958.

# Conflicts:
#	.github/workflows/ci.yml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants