[Coral-Common] Convert Hive uniontype into a struct-RelDataType that conforms to Trino's schema #192

Merged
wmoustafa merged 12 commits into linkedin:master from autumnust:union_trino_coalsce
Nov 19, 2021

Contributor

@autumnust autumnust commented Nov 9, 2021

Background:

  • Previously, to make union types translatable between engines, Coral IR represented a union as a struct in which each member of the union becomes a subfield named tag_N. For example, a uniontype<int,string> value appears as struct<1,null> or struct<null,"h">. This format is based on Hive's extract_union UDF.
  • Trino natively supports reading a union field by exploding it into a struct, but with a different schema; see "Support reading uniontype as struct from Avro/ORC Hive tables" (trinodb/trino#3483).
  • This PR makes Coral IR's exploded-union representation conform to Trino's format, so that the type an engine deserializes no longer differs from the IR type of a view's field.
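
For illustration, the difference between the two layouts can be sketched as below. This is a hypothetical, dependency-free sketch and not Coral code: the class and method names are made up, types are plain strings rather than Calcite RelDataType instances, and the tag is written as tinyint on the assumption that it follows Trino's PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Coral code: types are plain strings
// rather than Calcite RelDataType instances.
class UnionSchemaSketch {

  // Earlier Coral IR layout (modeled on Hive's extract_union):
  // one subfield per union member, named tag_N.
  static String legacyStruct(List<String> memberTypes) {
    List<String> fields = new ArrayList<>();
    for (int i = 0; i < memberTypes.size(); i++) {
      fields.add("tag_" + i + ":" + memberTypes.get(i));
    }
    return "struct<" + String.join(",", fields) + ">";
  }

  // Trino-style layout: a leading tag field, then field0..field(N-1).
  // The tinyint tag type is an assumption taken from Trino's PR.
  static String trinoStruct(List<String> memberTypes) {
    List<String> fields = new ArrayList<>();
    fields.add("tag:tinyint");
    for (int i = 0; i < memberTypes.size(); i++) {
      fields.add("field" + i + ":" + memberTypes.get(i));
    }
    return "struct<" + String.join(",", fields) + ">";
  }

  public static void main(String[] args) {
    List<String> members = Arrays.asList("int", "string");
    System.out.println(legacyStruct(members)); // struct<tag_0:int,tag_1:string>
    System.out.println(trinoStruct(members));  // struct<tag:tinyint,field0:int,field1:string>
  }
}
```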

Also added unit tests for the different union conversion logic.


Updates from the long discussion in this PR:

  • For corner cases such as an empty union, we decided to stay consistent with Trino's PR, which converts an empty union into a struct with a single field named tag.
  • The tag field has type TINYINT. Note that this type does not exist in Iceberg, so further testing is needed to verify that the whole flow works.

@autumnust
Contributor Author

@rzhang10 @funcheetah Can you take a look when you get a chance? Thanks!

Collaborator

@ljfgem ljfgem left a comment

Thanks @autumnust !
Could you also add the unit tests for Coral-Trino in HiveToTrinoConverterTest?

- List<String> fNames = IntStream.range(0, unionType.getAllUnionObjectTypeInfos().size()).mapToObj(i -> "tag_" + i)
+ List<String> fNames = IntStream.range(0, unionType.getAllUnionObjectTypeInfos().size()).mapToObj(i -> "field" + i)
      .collect(Collectors.toList());
  if (fNames.size() > 0) {
Member

I think we don't need this if statement, since an empty union is an invalid schema to begin with. Maybe we could add a Preconditions.checkState to check for it.

Contributor Author

Currently, how do we prevent an empty union from appearing?
The reason for this branch is to preserve the semantics of the earlier version of this code block. If we add a precondition check, encountering an empty union will result in an unchecked exception, whereas before the change it simply returned an empty RelDataType. Is that what we want?

Contributor

Seems that here https://github.com/trinodb/trino/pull/3483/files# the size is not checked. Does the Avro or ORC standard say anything about this?

Contributor Author

The ORC spec doesn't mention it, although I do see uniontype<> occurring in production. Regardless, this check is there to keep parity with the original (otherwise we could end up with an additional tag field).

Member

So with the current code, uniontype<> ends up as an empty struct? I recall there are some issues with empty structs somewhere, either in Iceberg or in Spark, so I'm not sure whether this corner case will work.
My point is that if we explicitly declare this special case's behavior undefined, then users are responsible for creating a correct schema. It's a tradeoff between us and the users; I'm willing to accept either way.

Contributor

Our references should be (1) what the standard says (again, for Avro or ORC) and (2) what the Trino transformation does. Hopefully they all align. If not, we can discuss how to move forward.

Contributor

Another special case is a union with only a single type, uniontype<[type]>. We should consider whether we want to support it and make sure its semantics are consistent across the Trino and Iceberg implementations.

Contributor Author

I couldn't find any spec covering this point in either Avro or ORC, and it looks like Trino doesn't check the size either.
Taking a step back, I also don't think Coral is in the right position to gate such usage. So I am leaning towards letting such cases pass through (which means a check like fNames.size > 0 is necessary, since we treat struct<> as a legitimate type).

I am also OK to add a precondition check to fail the translation. But again, leaning towards the former option.

Collaborator

+1 to the current approach

Contributor

Discussed offline. The objective is to match the Trino schema. We should remove the check from here since it is not used there.
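
As a rough sketch of the three behaviors weighed in this thread for uniontype<>, the following compares the guarded branch, a precondition-style rejection, and the unconditional tag. It is hypothetical code, not the Coral implementation: the class and method names are invented, only field names are modeled, and a plain IllegalStateException stands in for Guava's Preconditions.checkState. Treating the unconditional variant as the agreed outcome is an assumption based on the comment above about removing the check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the options discussed for empty unions.
class EmptyUnionSketch {

  // Builds field0..field(N-1) for a union with memberCount members.
  private static List<String> memberNames(int memberCount) {
    List<String> names = new ArrayList<>();
    for (int i = 0; i < memberCount; i++) {
      names.add("field" + i);
    }
    return names;
  }

  // Guarded: add the tag only when the union has members,
  // so an empty union maps to an empty struct.
  static List<String> guarded(int memberCount) {
    List<String> names = memberNames(memberCount);
    if (!names.isEmpty()) {
      names.add(0, "tag");
    }
    return names;
  }

  // Strict: reject an empty union with an unchecked exception.
  static List<String> strict(int memberCount) {
    if (memberCount == 0) {
      throw new IllegalStateException("empty uniontype is not supported");
    }
    List<String> names = memberNames(memberCount);
    names.add(0, "tag");
    return names;
  }

  // Unconditional: no size check, so an empty union maps to a struct
  // whose only field is the tag.
  static List<String> unconditional(int memberCount) {
    List<String> names = memberNames(memberCount);
    names.add(0, "tag");
    return names;
  }

  public static void main(String[] args) {
    System.out.println(guarded(0));       // []
    System.out.println(unconditional(0)); // [tag]
    System.out.println(unconditional(2)); // [tag, field0, field1]
  }
}
```

The three variants only differ for an empty union; for any union with at least one member they produce the same field list.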

Collaborator

@ljfgem ljfgem left a comment

LGTM except for the type issue, thanks @autumnust!

List<String> fNames = IntStream.range(0, unionType.getAllUnionObjectTypeInfos().size()).mapToObj(i -> "field" + i)
    .collect(Collectors.toList());
if (fNames.size() > 0) {
  fTypes.add(0, dtFactory.createSqlType(SqlTypeName.INTEGER));
Collaborator

@ljfgem ljfgem Nov 16, 2021

According to the corresponding PR:
https://github.com/trinodb/trino/pull/3483/files#diff-6a37b030f26bfeb6eca302ca157050cad2ead0476c1785fad57fc5f14fab88cfR264
the type of tag is TINYINT, rather than INT, so it should be

fTypes.add(0, dtFactory.createSqlType(SqlTypeName.TINYINT));

Querying the view will fail if the types mismatch.
I think the test also needs to be modified.

Enhanced coral-trino test suite helped to catch this hidden issue 😉
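
To illustrate why the tag type matters, here is a minimal sketch of a strict field-by-field row comparison. It is a hypothetical stand-in and not how Trino or Coral actually validates view schemas: the class, the helper names, and the lowercase type strings are assumptions made for the example.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: compares two row types field by field.
class RowTypeCheckSketch {

  // Builds an ordered name -> type map from alternating name/type arguments.
  static Map<String, String> row(String... nameTypePairs) {
    Map<String, String> fields = new LinkedHashMap<>();
    for (int i = 0; i + 1 < nameTypePairs.length; i += 2) {
      fields.put(nameTypePairs[i], nameTypePairs[i + 1]);
    }
    return fields;
  }

  // Returns a description of the first differing field, or null if the rows match.
  static String firstMismatch(Map<String, String> viewRow, Map<String, String> engineRow) {
    if (viewRow.size() != engineRow.size()) {
      return "field count differs";
    }
    Iterator<Map.Entry<String, String>> v = viewRow.entrySet().iterator();
    Iterator<Map.Entry<String, String>> e = engineRow.entrySet().iterator();
    while (v.hasNext()) {
      Map.Entry<String, String> a = v.next();
      Map.Entry<String, String> b = e.next();
      if (!a.getKey().equals(b.getKey()) || !a.getValue().equals(b.getValue())) {
        return a.getKey() + ":" + a.getValue() + " vs " + b.getKey() + ":" + b.getValue();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // What the engine reads for uniontype<int,string>, with a tinyint tag.
    Map<String, String> engine = row("tag", "tinyint", "field0", "integer", "field1", "varchar");
    // What the IR would declare if the tag were created as INTEGER.
    Map<String, String> view = row("tag", "integer", "field0", "integer", "field1", "varchar");
    System.out.println(firstMismatch(view, engine)); // tag:integer vs tag:tinyint
  }
}
```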

Contributor Author

Wow that's a great catch, thanks ! I will address this.

Contributor

Nice work on the enhanced coral-trino test suite! This is awesome! We can catch more issues before they hit production.

Collaborator

ljfgem commented Nov 16, 2021

FYI, enhanced i-test succeeded after fixing the type mismatch issue, we can merge it once all the comments are addressed.

@autumnust
Contributor Author

> FYI, enhanced i-test succeeded after fixing the type mismatch issue, we can merge it once all the comments are addressed.

I addressed the RowType comparison comment. The remaining discussion is mostly about how Coral deals with corner cases in the struct representation. Let's reach agreement on that, if it's needed before merging.

@wmoustafa wmoustafa merged commit 4324471 into linkedin:master Nov 19, 2021