Add support to coerce structural type#9131
Conversation
22a3a52 to
464f8ea
Compare
|
@geraint0923 I noticed you initially implement the coercion for primitive types. Could you advise who can take a review for this patch? |
c9f2258 to
fe410ef
Compare
|
I really appreciate this PR. We are strugling with schema evolution problems in our infrastructure. This PR looks to be solving them all :) I hope it won't be skipped like some older PRs :) |
|
@haozhun, can you review this? |
|
@martint, @haozhun just reminding myself about ancient PR which does a bit of refactor and simplification of coercion mechanism in hive connector. As well as adding support for decimal types. I've just rebased it to current master and new incarnation is here: #9422. It would be great if this one is put on top of refactorings from my PR (assuming we want those). |
4554f6d to
ea59c88
Compare
Yeah - corrected |
|
@losipiuk I think there won't be much difficulty to refactor this PR later against your refactorings. There is no interface/public method/class change in this PR. All added methods/classes are private in the existing class, so there won't have hard conflicts for further refactoring. I would suggest we can work on this first and then work on the refactoring as a whole if that's necessary. Thoughts ? |
I am a bit concerned mostly about changes in test as I made a bunch of changes there so much less code needs to be written. I do not expect much work in resolving conflicts of coercion code itself. |
There was a problem hiding this comment.
@Yaliang I reviewed the first 3 commits. Unfortunately, my comments ask for non-trivial changes to this pull request.
I understand that there was a major delay on our side reviewing your changes. I apologize for that. I understand your priorities might have changed. I would like to understand (1) whether you're still interested in driving this pull request (2) an estimate on how long it will take you to address the comments.
Thank you for your contribution, @Yaliang
There was a problem hiding this comment.
I do not see why t_struct should only exist in _parquet test table.
Add it to presto_test_types_textfile so that all presto_test_types_ tables have it.
I would like to point out that that this would be a change not really related to this pull request. However, I find it a reasonable change. It's ok as long as it's continues to be a separate commit (as it currently is).
There was a problem hiding this comment.
I understand that you added this dummy partitioning column so that you can alter column type without affecting pre-existing data. However, parquet test is just a piece in a suites of presto_test_types_ tests. It's not supposed to be different other than when absolutely necessary (for example, due to limitations of this file format). When one test out of the series fails, we're going to try to find out how one file format is affected differently by whatever changes made prior to that failure. It's completely unintuitive that one has a different set up.
There was a problem hiding this comment.
Same comment here: presto_test_types_ is a series.
I see that you also added tests in AbstractTestHiveClient.testMismatchSchemaTable. Those tests are specific for Hive connector as well. And I feel like you don't need this here any more.
There was a problem hiding this comment.
We format .stream() like this:
return ((StructTypeInfo) hiveType.getTypeInfo()).getAllStructFieldTypeInfos().stream()
.map(typeInfo -> HiveType.valueOf(typeInfo.getTypeName()))
.collect(toImmutableList());
There was a problem hiding this comment.
I have concern about not validating here that names match.
Hive's behavior is complex and inconsistent when it comes to schema evolution of rows, dependent on file format and configuration. As a result, we have two options: (1) not allow mismatched names at all for now, (2) implement correctly (matching Hive behavior) and test extensively. I prefer option 1.
There was a problem hiding this comment.
Addressed as the option 1
There was a problem hiding this comment.
Call them ListCoercer, MapCoercer, StructCoercer. Ditto for the ones in HivePageSource.
There was a problem hiding this comment.
Do not call getType in a tight loop. It can be quite expensive.
Ditto for the rest of this commit. In particular, don't miss rewriteBlock
There was a problem hiding this comment.
Do not use getSingleValueBlock. It is legacy, and is exorbitantly expensive in terms of both CPU and GC overhead. Take a look at ColumnarArray, ColumnarMap, and ColumnarRow. As a positive side effect, by using those, you'll be forced to implement these coercers in a way that is significantly more efficient. Ditto for map and struct.
There was a problem hiding this comment.
You should keep this. But to make it clearer for readers, you could rename this to column_to_drop or alike.
There was a problem hiding this comment.
You can use mapType in HiveTestUtils to make initialization of MISMATCH_SCHEMA_MAP_TYPE_AFTER much simpler.
I recommend adding two additional utility methods (arrayType and rowType) to HiveTestUtils to make code here much simpler. Once you do that, please inline all MISMATCH_SCHEMA_..._TYPE_... and MISMATCH_SCHEMA_PRIMITIVE_COLUMN_...
|
I'm happy to drive it through. I will address comment before the end of week |
2a63f62 to
dd63022
Compare
bdf3e78 to
a1e8418
Compare
|
@haozhun I addressed most of the comments except the one about testing all 5 Presto types and ready for another round of review(The travis CI is passed in my repo's branch, FYI). About the ask about the testing, do we want to test it specifically on the |
haozhun
left a comment
There was a problem hiding this comment.
Comments for "Add support to coerce structural type".
It's mostly minor stuff
There was a problem hiding this comment.
fromElementHiveType and toElementHiveType are only used in this constructor. Make them local variables.
There was a problem hiding this comment.
setObject(toType.getObject(blockBuilder, blockBuilder.getPositionCount() - 1));
There was a problem hiding this comment.
extract toHiveType.getType(typeManager)) as a field called toType. Put toType assignment right before toElementType.
There was a problem hiding this comment.
ditto (see ListCoercer comments)
There was a problem hiding this comment.
ditto (see ListCoercer comments)
There was a problem hiding this comment.
ditto (see ListCoercer comments)
There was a problem hiding this comment.
Is this field parameter meaningful? If not, remove it.
There was a problem hiding this comment.
I don't see this being used in this code.
You can change this to throw new UnsupportedOperationException() here, and remove setType.
There was a problem hiding this comment.
return value == null;
I don't see value of Objects.isNull over == null. And there's only 1 place in Presto code that this is used.
There was a problem hiding this comment.
Use if here instead. We don't nest LazyBlocks
haozhun
left a comment
There was a problem hiding this comment.
"Add tests for structural coercers" - some comments
There was a problem hiding this comment.
I acknowledge the need for this function in AbstractTestClient. But I don't find this generally useful.
Move this to AbstractTestClient. Make it private. Call it toRowType.
There was a problem hiding this comment.
return (RowType) TYPE_MANAGER.getParameterizedType(
StandardTypes.ROW,
elementTypeSignatures.stream()
.map(TypeSignatureParameter::of)
.collect(toImmutableList()));
There was a problem hiding this comment.
return (ArrayType) TYPE_MANAGER.getParameterizedType(
StandardTypes.ARRAY,
ImmutableList.of(TypeSignatureParameter.of(elementType.getTypeSignature())));
There was a problem hiding this comment.
Your test is very complete. However, I find it a bit too hard to read. I would also like to cover a few more aspects.
For readability, I make a few proposals:
- Use abbreviations. (We avoid abbreviation in Presto in general. But I find it excessively wordy in this case otherwise)
- Avoid unnecessary spaces.
- Don't cover all conversions in each of row_to_row, list_to_list, map_to_map. Instead, cover all of them across the three. (And add comment to point this out.)
In addition, I find it useful to also test these:
- Non coerced items in the structs. In your implementation, they go to quite different code paths.
- Add a field at the end.
- Remove the last field.
// all primitive/varchar coercions are covered across row_to_row, list_to_list, and map_to_map
" row_to_row STRUCT<vc: VARCHAR, ti2si: TINYINT, si2int: SMALLINT, int2bi: INT, bi2vc: BIGINT>," +
" list_to_list ARRAY<STRUCT<ti2int: TINYINT, si2bi: SMALLINT, bi2vc: BIGINT, remove: VARCHAR>>," +
" map_to_map MAP<TINYINT, STRUCT<ti2bi: TINYINT, int2bi: INT, float2double: " + floatToDoubleType + ">>" +
There was a problem hiding this comment.
I find the cost you pay in order to avoid duplication in the VALUES clause too high. By introducing the dummy table, a nontrivial amount of complexity is introduced to this test. It makes the test harder to read and understand.
Instead, I would manually write it out, and remove the dummy table:
"(-1, 2, -3, 100, -101, 2323, 12345, 0.5," +
" named_struct(...)," +
" array(named_struct(...))," +
" map(..., named_struct(...)))," +
"(1, -2, null, -100, 101, -2323, -12345, -1.5," +
" named_struct(...)," +
" array(named_struct(...))," +
" map(..., named_struct(...)))," +
There was a problem hiding this comment.
Just realized I have tried this before and it seems impossible.
Unable to create temp file for insert values Expression of type TOK_FUNCTION not supported in insert/values
There was a problem hiding this comment.
There is actually no reason that these data is inserted with Hive. (Maybe there was at the time.) I changed it to use Presto, which offers the necessary SQL expressiveness.
There was a problem hiding this comment.
ditto here. (See my comment in tableDefinitionBuilder in AbstractTestHiveClient)
There was a problem hiding this comment.
Remove this block of code. Instead, write them out directly in the lines above (without any logic).
There was a problem hiding this comment.
Realized assertThat doesn't support JAVA_OBJECT directly. Have to extract each nested field.
java.lang.RuntimeException: Unsupported sql type JAVA_OBJECT
at io.prestodb.tempto.internal.query.QueryResultValueComparator.compare(QueryResultValueComparator.java:108)
at io.prestodb.tempto.assertions.QueryAssert.isAnyValueEqual(QueryAssert.java:344)
at io.prestodb.tempto.assertions.QueryAssert.rowsEqual(QueryAssert.java:334)
at io.prestodb.tempto.assertions.QueryAssert.containsRow(QueryAssert.java:316)
at io.prestodb.tempto.assertions.QueryAssert.contains(QueryAssert.java:176)
at io.prestodb.tempto.assertions.QueryAssert.containsOnly(QueryAssert.java:207)
at io.prestodb.tempto.assertions.QueryAssert.containsOnly(QueryAssert.java:219)
at com.facebook.presto.tests.hive.TestHiveCoercion.doTestHiveCoercion(TestHiveCoercion.java:285)
at com.facebook.presto.tests.hive.TestHiveCoercion.testHiveCoercionOrc(TestHiveCoercion.java:199)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1177)
at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
There was a problem hiding this comment.
With ImmutableMap.of(...), you don't need this helper (up to 5 key-value entries).
When you go over, we can't do this any more. But we wouldn't do it anyways because readability is pretty bad at that point anyways. Instead, we do:
ImmutableMap.<K, V>builder()
.put(..., ...)
.put(..., ...)
.put(..., ...)
.put(..., ...)
.put(..., ...)
.build()
There was a problem hiding this comment.
I initially come across ImmutableMap but then I realized I couldn't use it because I want to test null value but ImmutableMap don't allow null.
|
Please address my comments, and rebase on top of latest master. Thank you, @Yaliang. The code is in generally good shape. And I think we will be able to merge to this in the next week or the week after. |
|
I made edits to a comment. Please use the latest version on github, and avoid reading the email notification from github. |
acccdf9 to
b25cf97
Compare
|
I have addressed most comments, rebased on top of master, tried best to make it easy to read and ready for another round of review (I wish it's the final one). Thanks for all the feedbacks through, @haozhun. It seems to be impossible to avoid a dummy table. But I avoid putting meaningful values so it would be easy to understand now. |
|
Rebased again |
|
I made quite a few cleanups in I will merge this with my changes after 0.195 is out. Schema evolution for row types in Hive connector will be available in 0.196. |
Add support to coerce structural type:
+cc @dabaitu