Skip to content

Add support to coerce structural type#9131

Closed
Yaliang wants to merge 4 commits intoprestodb:masterfrom
Yaliang:yaliangw/oss/structural-type-coerce
Closed

Add support to coerce structural type#9131
Yaliang wants to merge 4 commits intoprestodb:masterfrom
Yaliang:yaliangw/oss/structural-type-coerce

Conversation

@Yaliang
Copy link
Contributor

@Yaliang Yaliang commented Oct 10, 2017

Add support to coerce structural type:

  1. Nested primitive/structural field coercion
  2. Adding/removing tailing fields in (nested) struct field

+cc @dabaitu

@Yaliang
Copy link
Contributor Author

Yaliang commented Oct 10, 2017

@geraint0923 I noticed you initially implement the coercion for primitive types. Could you advise who can take a review for this patch?

@Yaliang Yaliang force-pushed the yaliangw/oss/structural-type-coerce branch 6 times, most recently from c9f2258 to fe410ef Compare October 14, 2017 19:45
@DaimonPl
Copy link

DaimonPl commented Nov 2, 2017

I really appreciate this PR.

We are strugling with schema evolution problems in our infrastructure. This PR looks to be solving them all :)

I hope it won't be skipped like some older PRs :)

@martint
Copy link
Contributor

martint commented Nov 20, 2017

@haozhun, can you review this?

@losipiuk
Copy link
Contributor

losipiuk commented Nov 27, 2017

@martint, @haozhun just reminding myself about ancient PR which does a bit of refactor and simplification of coercion mechanism in hive connector. As well as adding support for decimal types.

I've just rebased it to current master and new incarnation is here: #9422.

It would be great if this one is put on top of refactorings from my PR (assuming we want those).

@martint
Copy link
Contributor

martint commented Nov 27, 2017

I've just rebased it to current master and new incarnation is here: #9409.

Did you mean #9422 ?

@Yaliang Yaliang force-pushed the yaliangw/oss/structural-type-coerce branch 3 times, most recently from 4554f6d to ea59c88 Compare November 28, 2017 05:39
@losipiuk
Copy link
Contributor

Did you mean #9422

Yeah - corrected

@Yaliang
Copy link
Contributor Author

Yaliang commented Nov 29, 2017

@losipiuk I think there won't be much difficulty to refactor this PR later against your refactorings. There is no interface/public method/class change in this PR. All added methods/classes are private in the existing class, so there won't have hard conflicts for further refactoring. I would suggest we can work on this first and then work on the refactoring as a whole if that's necessary. Thoughts ?

@losipiuk
Copy link
Contributor

@losipiuk I think there won't be much difficulty to refactor this PR later against your refactorings.

I am a bit concerned mostly about changes in test as I made a bunch of changes there so much less code needs to be written. I do not expect much work in resolving conflicts of coercion code itself.

Copy link
Contributor

@haozhun haozhun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yaliang I reviewed the first 3 commits. Unfortunately, my comments ask for non-trivial changes to this pull request.

I understand that there was a major delay on our side reviewing your changes. I apologize for that. I understand your priorities might have changed. I would like to understand (1) whether you're still interested in driving this pull request (2) an estimate on how long it will take you to address the comments.

Thank you for your contribution, @Yaliang

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see why t_struct should only exist in _parquet test table.

Add it to presto_test_types_textfile so that all presto_test_types_ tables have it.

I would like to point out that that this would be a change not really related to this pull request. However, I find it a reasonable change. It's ok as long as it's continues to be a separate commit (as it currently is).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that you added this dummy partitioning column so that you can alter column type without affecting pre-existing data. However, parquet test is just a piece in a suites of presto_test_types_ tests. It's not supposed to be different other than when absolutely necessary (for example, due to limitations of this file format). When one test out of the series fails, we're going to try to find out how one file format is affected differently by whatever changes made prior to that failure. It's completely unintuitive that one has a different set up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here: presto_test_types_ is a series.

I see that you also added tests in AbstractTestHiveClient.testMismatchSchemaTable. Those tests are specific for Hive connector as well. And I feel like you don't need this here any more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We format .stream() like this:

        return ((StructTypeInfo) hiveType.getTypeInfo()).getAllStructFieldTypeInfos().stream()
                .map(typeInfo -> HiveType.valueOf(typeInfo.getTypeName()))
                .collect(toImmutableList());

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have concern about not validating here that names match.

Hive's behavior is complex and inconsistent when it comes to schema evolution of rows, dependent on file format and configuration. As a result, we have two options: (1) not allow mismatched names at all for now, (2) implement correctly (matching Hive behavior) and test extensively. I prefer option 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed as the option 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call them ListCoercer, MapCoercer, StructCoercer. Ditto for the ones in HivePageSource.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not call getType in a tight loop. It can be quite expensive.

Ditto for the rest of this commit. In particular, don't miss rewriteBlock

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use getSingleValueBlock. It is legacy, and is exorbitantly expensive in terms of both CPU and GC overhead. Take a look at ColumnarArray, ColumnarMap, and ColumnarRow. As a positive side effect, by using those, you'll be forced to implement these coercers in a way that is significantly more efficient. Ditto for map and struct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should keep this. But to make it clearer for readers, you could rename this to column_to_drop or alike.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use mapType in HiveTestUtils to make initialization of MISMATCH_SCHEMA_MAP_TYPE_AFTER much simpler.

I recommend adding two additional utility methods (arrayType and rowType) to HiveTestUtils to make code here much simpler. Once you do that, please inline all MISMATCH_SCHEMA_..._TYPE_... and MISMATCH_SCHEMA_PRIMITIVE_COLUMN_...

@Yaliang
Copy link
Contributor Author

Yaliang commented Jan 11, 2018

I'm happy to drive it through. I will address comment before the end of week

@Yaliang Yaliang force-pushed the yaliangw/oss/structural-type-coerce branch 7 times, most recently from 2a63f62 to dd63022 Compare January 15, 2018 02:41
@Yaliang Yaliang force-pushed the yaliangw/oss/structural-type-coerce branch from bdf3e78 to a1e8418 Compare February 1, 2018 00:24
@Yaliang
Copy link
Contributor Author

Yaliang commented Feb 1, 2018

@haozhun I addressed most of the comments except the one about testing all 5 Presto types and ready for another round of review(The travis CI is passed in my repo's branch, FYI). About the ask about the testing, do we want to test it specifically on the rewriteBlock method or just add some exception tests on unsupported coercion(which I think may not actually reach rewriteBlock method)?

Copy link
Contributor

@haozhun haozhun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments for "Add support to coerce structural type".

It's mostly minor stuff

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fromElementHiveType and toElementHiveType are only used in this constructor. Make them local variables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setObject(toType.getObject(blockBuilder, blockBuilder.getPositionCount() - 1));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract toHiveType.getType(typeManager)) as a field called toType. Put toType assignment right before toElementType.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto (see ListCoercer comments)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto (see ListCoercer comments)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto (see ListCoercer comments)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this field parameter meaningful? If not, remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this being used in this code.

You can change this to throw new UnsupportedOperationException() here, and remove setType.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return value == null;

I don't see value of Objects.isNull over == null. And there's only 1 place in Presto code that this is used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use if here instead. We don't nest LazyBlocks

Copy link
Contributor

@haozhun haozhun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Add tests for structural coercers" - some comments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I acknowledge the need for this function in AbstractTestClient. But I don't find this generally useful.

Move this to AbstractTestClient. Make it private. Call it toRowType.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        return (RowType) TYPE_MANAGER.getParameterizedType(
                StandardTypes.ROW,
                elementTypeSignatures.stream()
                        .map(TypeSignatureParameter::of)
                        .collect(toImmutableList()));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        return (ArrayType) TYPE_MANAGER.getParameterizedType(
                StandardTypes.ARRAY,
                ImmutableList.of(TypeSignatureParameter.of(elementType.getTypeSignature())));

Copy link
Contributor

@haozhun haozhun Feb 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your test is very complete. However, I find it a bit too hard to read. I would also like to cover a few more aspects.

For readability, I make a few proposals:

  • Use abbreviations. (We avoid abbreviation in Presto in general. But I find it excessively wordy in this case otherwise)
  • Avoid unnecessary spaces.
  • Don't cover all conversions in each of row_to_row, list_to_list, map_to_map. Instead, cover all of them across the three. (And add comment to point this out.)

In addition, I find it useful to also test these:

  • Non coerced items in the structs. In your implementation, they go to quite different code paths.
  • Add a field at the end.
  • Remove the last field.
// all primitive/varchar coercions are covered across row_to_row, list_to_list, and map_to_map 
"    row_to_row                 STRUCT<vc: VARCHAR, ti2si: TINYINT, si2int: SMALLINT, int2bi: INT, bi2vc: BIGINT>," +
"    list_to_list               ARRAY<STRUCT<ti2int: TINYINT, si2bi: SMALLINT, bi2vc: BIGINT, remove: VARCHAR>>," +
"    map_to_map                 MAP<TINYINT, STRUCT<ti2bi: TINYINT, int2bi: INT, float2double: " + floatToDoubleType + ">>" +

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the cost you pay in order to avoid duplication in the VALUES clause too high. By introducing the dummy table, a nontrivial amount of complexity is introduced to this test. It makes the test harder to read and understand.

Instead, I would manually write it out, and remove the dummy table:

                        "(-1, 2, -3, 100, -101, 2323, 12345, 0.5," +
                        "    named_struct(...)," +
                        "    array(named_struct(...))," +
                        "    map(..., named_struct(...)))," +
                        "(1, -2, null, -100, 101, -2323, -12345, -1.5," +
                        "    named_struct(...)," +
                        "    array(named_struct(...))," +
                        "    map(..., named_struct(...)))," +

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized I have tried this before and it seems impossible.

Unable to create temp file for insert values Expression of type TOK_FUNCTION not supported in insert/values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is actually no reason that these data is inserted with Hive. (Maybe there was at the time.) I changed it to use Presto, which offers the necessary SQL expressiveness.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto here. (See my comment in tableDefinitionBuilder in AbstractTestHiveClient)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this block of code. Instead, write them out directly in the lines above (without any logic).

Copy link
Contributor Author

@Yaliang Yaliang Feb 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realized assertThat doesn't support JAVA_OBJECT directly. Have to extract each nested field.

java.lang.RuntimeException: Unsupported sql type JAVA_OBJECT
	at io.prestodb.tempto.internal.query.QueryResultValueComparator.compare(QueryResultValueComparator.java:108)
	at io.prestodb.tempto.assertions.QueryAssert.isAnyValueEqual(QueryAssert.java:344)
	at io.prestodb.tempto.assertions.QueryAssert.rowsEqual(QueryAssert.java:334)
	at io.prestodb.tempto.assertions.QueryAssert.containsRow(QueryAssert.java:316)
	at io.prestodb.tempto.assertions.QueryAssert.contains(QueryAssert.java:176)
	at io.prestodb.tempto.assertions.QueryAssert.containsOnly(QueryAssert.java:207)
	at io.prestodb.tempto.assertions.QueryAssert.containsOnly(QueryAssert.java:219)
	at com.facebook.presto.tests.hive.TestHiveCoercion.doTestHiveCoercion(TestHiveCoercion.java:285)
	at com.facebook.presto.tests.hive.TestHiveCoercion.testHiveCoercionOrc(TestHiveCoercion.java:199)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1177)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With ImmutableMap.of(...), you don't need this helper (up to 5 key-value entries).

When you go over, we can't do this any more. But we wouldn't do it anyways because readability is pretty bad at that point anyways. Instead, we do:

ImmutableMap.<K, V>builder()
        .put(..., ...)
        .put(..., ...)
        .put(..., ...)
        .put(..., ...)
        .put(..., ...)
        .build()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially come across ImmutableMap but then I realized I couldn't use it because I want to test null value but ImmutableMap don't allow null.

@haozhun
Copy link
Contributor

haozhun commented Feb 3, 2018

Please address my comments, and rebase on top of latest master. Thank you, @Yaliang.

The code is in generally good shape. And I think we will be able to merge to this in the next week or the week after.

@haozhun
Copy link
Contributor

haozhun commented Feb 3, 2018

I made edits to a comment. Please use the latest version on github, and avoid reading the email notification from github.

@Yaliang Yaliang force-pushed the yaliangw/oss/structural-type-coerce branch 16 times, most recently from acccdf9 to b25cf97 Compare February 6, 2018 22:08
@Yaliang
Copy link
Contributor Author

Yaliang commented Feb 8, 2018

I have addressed most comments, rebased on top of master, tried best to make it easy to read and ready for another round of review (I wish it's the final one). Thanks for all the feedbacks through, @haozhun.

It seems to be impossible to avoid a dummy table. But I avoid putting meaningful values so it would be easy to understand now. QueryAssert in tempto can't compare JAVA_OBJECT, so I write some helper functions to extract/project the value from QueryResult and the list of expected Rows.

@Yaliang
Copy link
Contributor Author

Yaliang commented Feb 12, 2018

Rebased again

@haozhun
Copy link
Contributor

haozhun commented Feb 14, 2018

I made quite a few cleanups in TestHiveCoercion. I added comments to HiveCoercionPolicy to explain what we allow.

I will merge this with my changes after 0.195 is out. Schema evolution for row types in Hive connector will be available in 0.196.

@haozhun
Copy link
Contributor

haozhun commented Feb 22, 2018

Merged #9131.

Thank you for your contribution, @Yaliang.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants