
Conversation

@dongkelun
Contributor

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@dongkelun
Contributor Author

This PR is to solve this issue

@dongkelun
Contributor Author

@xushiyan @nsivabalan Hello, could you please review this?

@nsivabalan added the priority:critical (Production degraded; pipelines stalled) label Jan 7, 2022
@nsivabalan
Contributor

@xiarixiaoyao: Can you please review this patch? Thanks.

@nsivabalan
Contributor

@dongkelun: Can you please check whether HUDI-3192 and https://issues.apache.org/jira/browse/HUDI-2682 are duplicates? If so, please mark one of them as a duplicate and close it.

@dongkelun changed the title from "[HUDI-3192] Spark metastore schema evolution broken" to "[HUDI-2682] Spark schema not updated with new columns on hive sync" Jan 7, 2022
@dongkelun
Contributor Author

dongkelun commented Jan 7, 2022

@dongkelun: Can you please check whether HUDI-3192 and https://issues.apache.org/jira/browse/HUDI-2682 are duplicates? If so, please mark one of them as a duplicate and close it.

I think it's a duplicate. HUDI-3192 has been closed.

LOG.info("Sync table properties for " + tableName + ", table properties is: " + cfg.tableProperties);
}
hoodieHiveClient.updateTableProperties(tableName, tableProperties);
LOG.info("Sync table properties for " + tableName + ", table properties is: " + cfg.tableProperties);
Contributor

cfg.tableProperties? I think it should be tableProperties.

Contributor

How about changing this code to:

if (cfg.tableProperties != null || cfg.syncAsSparkDataSourceTable) {
  hoodieHiveClient.updateTableProperties(tableName, tableProperties);
  LOG.info("Sync table properties for " + tableName + ", table properties is: "
      + (cfg.tableProperties == null ? "" : cfg.tableProperties));
}

Contributor Author

If new columns are added and cfg.tableProperties is null, then updateTableProperties is not executed, and Spark SQL will not see the new columns.
I'm not sure whether deleting columns and updating columns behave the same way.
If not, I think it can be decided with schemaDiff.getAddColumnTypes().isEmpty().

Contributor Author

How about changing this code to:

if (cfg.tableProperties != null || cfg.syncAsSparkDataSourceTable) {
  hoodieHiveClient.updateTableProperties(tableName, tableProperties);
  LOG.info("Sync table properties for " + tableName + ", table properties is: "
      + (cfg.tableProperties == null ? "" : cfg.tableProperties));
}

Sorry, I only saw this message just now. Let me think about it first.

Contributor

No need to use schemaDiff.getAddColumnTypes().isEmpty(); your modification is OK. Just pay attention that cfg.tableProperties may be null, and we only need this logic when syncing as a Spark DataSource table.

Contributor Author

OK, I see. Thank you for the reminder. Your idea is better.

Contributor Author

How about changing the log like this?

LOG.info("Sync table properties for " + tableName + ", table properties is: " + tableProperties);

Contributor Author

I have submitted the newly modified code.
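For reference, the updated block presumably ends up along these lines (a sketch reconstructed from the review comments above, not the exact merged diff):

// Sketch based on the discussion: only sync when there is something to sync,
// and log the merged tableProperties rather than cfg.tableProperties, which may be null.
if (cfg.tableProperties != null || cfg.syncAsSparkDataSourceTable) {
  hoodieHiveClient.updateTableProperties(tableName, tableProperties);
  LOG.info("Sync table properties for " + tableName + ", table properties is: " + tableProperties);
}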

@parisni
Contributor

parisni commented Jan 7, 2022

Hi @xiarixiaoyao, thanks for looking at this. I'm not sure we can solve this from Hudi; the problem happens on vanilla Spark too. See my explanation here: https://lists.apache.org/thread/9mmrnc5o7w42z723s2yqgcrdpwwtts3x

@dongkelun
Contributor Author

Hi @xiarixiaoyao, thanks for looking at this. I'm not sure we can solve this from Hudi; the problem happens on vanilla Spark too. See my explanation here: https://lists.apache.org/thread/9mmrnc5o7w42z723s2yqgcrdpwwtts3x

Hello, I think this PR explains why it is necessary.

@dongkelun
Contributor Author

Hi @xiarixiaoyao, thanks for looking at this. I'm not sure we can solve this from Hudi; the problem happens on vanilla Spark too. See my explanation here: https://lists.apache.org/thread/9mmrnc5o7w42z723s2yqgcrdpwwtts3x

I built and verified it today. It should solve this problem. However, adding columns with Hive SQL is not supported.

@xiarixiaoyao
Contributor

@parisni We want Spark SQL to treat Hudi as a DataSource table to get better performance.
When Spark reads a DataSource table, it restores the table metadata (including the table schema) from the table properties.
You can see the original code in Spark's HiveExternalCatalog.restoreTableMetadata:

  • It reads table schema, provider, partition column names and bucket specification from table properties, and filters out these special entries from the table properties.
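To make this concrete, here is a minimal sketch (illustration only, not Hudi's actual sync code; the class and helper names and the maxPartLen parameter are made up for the example) of the kind of spark.sql.sources.* table properties that HiveExternalCatalog.restoreTableMetadata reads back to rebuild a DataSource table's schema:

import java.util.HashMap;
import java.util.Map;

// Illustration: Spark stores the schema as JSON split across numbered "part"
// properties so each value stays under the metastore's per-property size limit.
public class SparkDataSourceTablePropsSketch {

  static Map<String, String> buildSparkTableProperties(String schemaJson, int maxPartLen) {
    Map<String, String> props = new HashMap<>();
    props.put("spark.sql.sources.provider", "hudi");
    int numParts = (schemaJson.length() + maxPartLen - 1) / maxPartLen;
    props.put("spark.sql.sources.schema.numParts", String.valueOf(numParts));
    for (int i = 0; i < numParts; i++) {
      int start = i * maxPartLen;
      int end = Math.min(start + maxPartLen, schemaJson.length());
      props.put("spark.sql.sources.schema.part." + i, schemaJson.substring(start, end));
    }
    return props;
  }
}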

@xiarixiaoyao
Contributor

@dongkelun We have no way to control the behavior of Hive, so I think this PR is OK. Thanks for your contribution.
LGTM, waiting for CI to pass.

@xiarixiaoyao self-assigned this Jan 7, 2022
@parisni
Contributor

parisni commented Jan 7, 2022

adding columns with Hive SQL is not supported

We have no way to control the behavior of Hive

Does this mean the hive sync will behave the same via jdbc/hms but differently via hiveql when syncing the metastore?

@hudi-bot
Collaborator

hudi-bot commented Jan 7, 2022

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@xiarixiaoyao merged commit 4f6cdd7 into apache:master Jan 8, 2022
@vinishjail97 mentioned this pull request Jan 24, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022