[Coral-Common] Convert Hive uniontype into a struct-RelDataType that conforms Trino' schema#192
Conversation
|
@rzhang10 @funcheetah Can you take a look when you get a chance? thanks ! |
ljfgem
left a comment
There was a problem hiding this comment.
Thanks @autumnust !
Could you also add the unit tests for Coral-Trino in HiveToTrinoConverterTest?
coral-trino/src/test/java/com/linkedin/coral/trino/rel2trino/TestUtils.java
Outdated
Show resolved
Hide resolved
coral-spark/src/main/java/com/linkedin/coral/spark/IRRelToSparkRelTransformer.java
Outdated
Show resolved
Hide resolved
| List<String> fNames = IntStream.range(0, unionType.getAllUnionObjectTypeInfos().size()).mapToObj(i -> "tag_" + i) | ||
| List<String> fNames = IntStream.range(0, unionType.getAllUnionObjectTypeInfos().size()).mapToObj(i -> "field" + i) | ||
| .collect(Collectors.toList()); | ||
| if (fNames.size() > 0) { |
There was a problem hiding this comment.
I think we don't need this if statement since an empty union is a wrong schema, to begin with, maybe we could add a Preconditions.checkState to check it.
There was a problem hiding this comment.
currently how do we prevent empty union to appear?
The reason for this branch is to preserve the semantic of the same code block earlier. If we add a precondition check, encountering empty union will result in unchecked exception but it will simply return an empty RelDataType before the change. is that what we want ?
There was a problem hiding this comment.
Seems that here https://github.com/trinodb/trino/pull/3483/files# the size is not checked. Does the Avro or ORC standard say anything about this?
There was a problem hiding this comment.
ORC spec doesn't mention it while I do see uniontype<> happening in production. Regardless, this check is there to keep parity with the original (or we could end up with an additional tag field).
There was a problem hiding this comment.
So if using the current code, uniontype<> ends up in an empty struct? I recall there are some issues with empty struct somewhere, either in Iceberg or in Spark. So I'm not sure if this whole corner case will work or not.
My point is if we explicitly announce this special case's behavior is undefined, then users have the responsibility to create the correct schema, it's a tradeoff between us and the users, I'm willing to accept either way.
There was a problem hiding this comment.
Our references should be: 1- what the standard says (again for Avro or ORC) 2- what the Trino transformation does. Hopefully all of them align. If not, we can discuss how to move forward.
There was a problem hiding this comment.
Another special case is union with only a single type uniontype<[type]>. We should consider whether we want to support it and make sure its semantics is consistent across Trino and Iceberg implementation
There was a problem hiding this comment.
There aren't any spec that I could find regarding this point in either Avro or ORC, and looks like Trino doesn't check the size either.
Taking a step back, I also don't think Coral is in right position to gate such usage. So I am leaning towards let such case pass through (which means a check like fNames.size > 0 is necessary since we see struct<> as a legit type).
I am also OK to add a precondition check to fail the translation. But again, leaning towards the former option.
There was a problem hiding this comment.
+1 to the current approach
There was a problem hiding this comment.
Discussed offline. The objective is to match the Trino schema. We should remove the check from here since it is not used there.
1631e52 to
e7fac25
Compare
| List<String> fNames = IntStream.range(0, unionType.getAllUnionObjectTypeInfos().size()).mapToObj(i -> "tag_" + i) | ||
| List<String> fNames = IntStream.range(0, unionType.getAllUnionObjectTypeInfos().size()).mapToObj(i -> "field" + i) | ||
| .collect(Collectors.toList()); | ||
| if (fNames.size() > 0) { |
There was a problem hiding this comment.
Seems that here https://github.com/trinodb/trino/pull/3483/files# the size is not checked. Does the Avro or ORC standard say anything about this?
coral-hive/src/test/java/com/linkedin/coral/hive/hive2rel/HiveTableTest.java
Show resolved
Hide resolved
coral-hive/src/test/java/com/linkedin/coral/hive/hive2rel/HiveTableTest.java
Outdated
Show resolved
Hide resolved
e7fac25 to
f0f64a9
Compare
ljfgem
left a comment
There was a problem hiding this comment.
LTGM except the type issue, thanks @autumnust !
| List<String> fNames = IntStream.range(0, unionType.getAllUnionObjectTypeInfos().size()).mapToObj(i -> "field" + i) | ||
| .collect(Collectors.toList()); | ||
| if (fNames.size() > 0) { | ||
| fTypes.add(0, dtFactory.createSqlType(SqlTypeName.INTEGER)); |
There was a problem hiding this comment.
According to the corresponding PR:
https://github.com/trinodb/trino/pull/3483/files#diff-6a37b030f26bfeb6eca302ca157050cad2ead0476c1785fad57fc5f14fab88cfR264
the type of tag is TINYINT, rather than INT, so it should be
fTypes.add(0, dtFactory.createSqlType(SqlTypeName.TINYINT));
querying the view will fail if the types mismatch.
I think the test also needs to be modified.
Enhanced coral-trino test suite helped to catch this hidden issue 😉
There was a problem hiding this comment.
Wow that's a great catch, thanks ! I will address this.
There was a problem hiding this comment.
Nice work for enhanced coral-trino test suite! This is awesome! We can catch more issues before hitting production.
|
FYI, enhanced i-test succeeded after fixing the type mismatch issue, we can merge it once all the comments are addressed. |
I addressed the RowType comparison comment. The remaining discussion is mostly on how does Coral deal with corner cases in |
Background:
tag_N. For example we would seeuniontype<int,string>becomes something likestruct<1,null>orstruct<null, "h">. This format is based on Hive'sextract_unionUDF.Also added unit tests for the different union conversion logic.
Updates from long discussion in this PR:
tag.