
Conversation

@dongkelun
Contributor

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@dongkelun
Contributor Author

This PR is to solve this issue

@dongkelun
Contributor Author

@xushiyan @nsivabalan Hello, could you please review this?

@nsivabalan added the priority:critical (Production degraded; pipelines stalled) label Jan 7, 2022
@nsivabalan
Contributor

@xiarixiaoyao: Can you please review this patch? Thanks.

@nsivabalan
Contributor

@dongkelun: Can you please check whether HUDI-3192 and https://issues.apache.org/jira/browse/HUDI-2682 are duplicates? If so, please mark one of them as a duplicate and close it.

@dongkelun changed the title from "[HUDI-3192] Spark metastore schema evolution broken" to "[HUDI-2682] Spark schema not updated with new columns on hive sync" Jan 7, 2022
@dongkelun
Contributor Author

dongkelun commented Jan 7, 2022

@dongkelun: Can you please check whether HUDI-3192 and https://issues.apache.org/jira/browse/HUDI-2682 are duplicates? If so, please mark one of them as a duplicate and close it.

I think it's a duplicate. HUDI-3192 has been closed.

LOG.info("Sync table properties for " + tableName + ", table properties is: " + cfg.tableProperties);
}
hoodieHiveClient.updateTableProperties(tableName, tableProperties);
LOG.info("Sync table properties for " + tableName + ", table properties is: " + cfg.tableProperties);
Contributor

cfg.tableProperties? I think it should be tableProperties.

Contributor

How about changing this code to:

if (cfg.tableProperties != null || cfg.syncAsSparkDataSourceTable) {
  hoodieHiveClient.updateTableProperties(tableName, tableProperties);
  LOG.info("Sync table properties for " + tableName + ", table properties is: "
      + (cfg.tableProperties == null ? "" : cfg.tableProperties));
}

Contributor Author

If new columns are added and cfg.tableProperties is null, then updateTableProperties is not executed, and Spark SQL will not see the new columns.
I'm not sure whether deleting columns and updating columns behave the same way.
If not, I think it can be decided with schemaDiff.getAddColumnTypes().isEmpty().

Contributor Author

How about changing this code to:

if (cfg.tableProperties != null || cfg.syncAsSparkDataSourceTable) {
  hoodieHiveClient.updateTableProperties(tableName, tableProperties);
  LOG.info("Sync table properties for " + tableName + ", table properties is: "
      + (cfg.tableProperties == null ? "" : cfg.tableProperties));
}

Sorry, I only saw this message just now. Let me think about it first.

Contributor

No need to use schemaDiff.getAddColumnTypes().isEmpty(); your modification is OK. Just pay attention that cfg.tableProperties may be null, and we only need this logic when syncing as a Spark DataSource table.

Contributor Author

OK, I see. Thank you for the reminder. Your idea is better.

Contributor Author

How about changing the log like this?

LOG.info("Sync table properties for " + tableName + ", table properties is: " + tableProperties);

Contributor Author

I have submitted the newly modified code.
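For reference, the updated block presumably ends up along these lines (a sketch reconstructed from the review comments above, not the exact merged diff):

// Sketch based on the discussion: only sync when there is something to sync,
// and log the merged tableProperties rather than cfg.tableProperties, which may be null.
if (cfg.tableProperties != null || cfg.syncAsSparkDataSourceTable) {
  hoodieHiveClient.updateTableProperties(tableName, tableProperties);
  LOG.info("Sync table properties for " + tableName + ", table properties is: " + tableProperties);
}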

@parisni
Contributor

parisni commented Jan 7, 2022

Hi @xiarixiaoyao, thanks for looking at this. I'm not sure we can solve this from Hudi; the problem happens on vanilla Spark too. See my explanation here: https://lists.apache.org/thread/9mmrnc5o7w42z723s2yqgcrdpwwtts3x

@dongkelun
Contributor Author

Hi @xiarixiaoyao, thanks for looking at this. I'm not sure we can solve this from Hudi; the problem happens on vanilla Spark too. See my explanation here: https://lists.apache.org/thread/9mmrnc5o7w42z723s2yqgcrdpwwtts3x

Hello, I think this PR explains why it is necessary.

@dongkelun
Contributor Author

Hi @xiarixiaoyao, thanks for looking at this. I'm not sure we can solve this from Hudi; the problem happens on vanilla Spark too. See my explanation here: https://lists.apache.org/thread/9mmrnc5o7w42z723s2yqgcrdpwwtts3x

I built and verified it today. It should solve this problem. However, adding columns with Hive SQL is not supported.

@xiarixiaoyao
Contributor

@parisni We want Spark SQL to treat Hudi as a DataSource table to get better performance.
When Spark reads a DataSource table, it restores the table metadata (including the table schema) from the table properties.
You can see the original code in Spark's HiveExternalCatalog.restoreTableMetadata:

  • It reads table schema, provider, partition column names and bucket specification from table properties, and filters out these special entries from the table properties.
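To make this concrete, here is a minimal sketch (illustration only, not Hudi's actual sync code; the class and helper names and the maxPartLen parameter are made up for the example) of the kind of spark.sql.sources.* table properties that HiveExternalCatalog.restoreTableMetadata reads back to rebuild a DataSource table's schema:

import java.util.HashMap;
import java.util.Map;

// Illustration: Spark stores the schema as JSON split across numbered "part"
// properties so each value stays under the metastore's per-property size limit.
public class SparkDataSourceTablePropsSketch {

  static Map<String, String> buildSparkTableProperties(String schemaJson, int maxPartLen) {
    Map<String, String> props = new HashMap<>();
    props.put("spark.sql.sources.provider", "hudi");
    int numParts = (schemaJson.length() + maxPartLen - 1) / maxPartLen;
    props.put("spark.sql.sources.schema.numParts", String.valueOf(numParts));
    for (int i = 0; i < numParts; i++) {
      int start = i * maxPartLen;
      int end = Math.min(start + maxPartLen, schemaJson.length());
      props.put("spark.sql.sources.schema.part." + i, schemaJson.substring(start, end));
    }
    return props;
  }
}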

@xiarixiaoyao
Contributor

@dongkelun We have no way to control the behavior of Hive, so I think this PR is OK. Thanks for your contribution.
LGTM, waiting for CI to pass.

@xiarixiaoyao self-assigned this Jan 7, 2022
@parisni
Contributor

parisni commented Jan 7, 2022

adding columns with Hive SQL is not supported

We have no way to control the behavior of Hive

Does this mean the hive sync will behave the same via jdbc/hms but differently via hiveql when syncing the metastore?

@hudi-bot
Collaborator

hudi-bot commented Jan 7, 2022

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@xiarixiaoyao merged commit 4f6cdd7 into apache:master Jan 8, 2022
@vinishjail97 mentioned this pull request Jan 24, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022