[HUDI-2837] Add support for using database name in incremental query #4083
Conversation
Force-pushed from 2ae82de to 3fc3118
@xushiyan @YannByron Hi, can you please help review this PR?
xushiyan left a comment:
@YannByron can you take a look please?
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
@dongkelun @xushiyan
IMO, Hudi's mountain of configs already gives it a steep learning curve. We should favor solutions that balance functionality and user experience as far as possible.
For the 2nd point above, we could consider combining the four or five configs into two. It's just my personal idea; I look forward to further discussion.
@YannByron Hello
@hudi-bot run azure
(1 similar comment)
@dongkelun @xushiyan Querying incrementally in Hive needs to set … Also, @xushiyan, what do you think?
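For context on the per-table Hive settings being discussed: Hudi's Hive-side incremental query options are keyed by table name (a `hoodie.%s.consume.*` pattern). The exact pattern string below mirrors that convention but is written from memory, so treat it as an assumption; the sketch only illustrates why two same-named tables in different databases would collide on one key, and why a database-qualified name disambiguates them.

```java
// Sketch (under assumptions): how per-table Hive incremental-query keys are
// formed, and why a bare table name collides across databases.
public class ConsumeKeys {
    // Pattern assumed from Hudi's Hive-side convention; not copied from the PR.
    static final String CONSUME_MODE_PATTERN = "hoodie.%s.consume.mode";

    // Key from the bare table name: identical for db1.trips and db2.trips.
    static String keyFor(String tableName) {
        return String.format(CONSUME_MODE_PATTERN, tableName);
    }

    // Key from a database-qualified name, which keeps the tables distinct.
    static String qualifiedKeyFor(String databaseName, String tableName) {
        return String.format(CONSUME_MODE_PATTERN, databaseName + "." + tableName);
    }

    public static void main(String[] args) {
        // Two databases, one table name: a single shared key...
        System.out.println(keyFor("trips"));                 // hoodie.trips.consume.mode
        // ...versus distinct keys once the database is part of the name.
        System.out.println(qualifiedKeyFor("db1", "trips")); // hoodie.db1.trips.consume.mode
        System.out.println(qualifiedKeyFor("db2", "trips")); // hoodie.db2.trips.consume.mode
    }
}
```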
@xushiyan @YannByron I think I understand the solution. SQL will persist the database name to … So, which parameter should DF use to persist the database name?
@xushiyan Hello, do you think this idea is OK? If so, I'll submit a version based on it first.
@YannByron @dongkelun Sorry for the late reply. Instead of setting a switch to use the database name, setting the config itself and checking its value is cleaner. The idea sounds good to me. Thanks.
@hudi-bot run azure |
Force-pushed from 46053bb to 03e042a
@YannByron @xushiyan Hello, I have modified and submitted the code according to the new solution. Could you take a look?
@xushiyan: Are we looking to get this into 0.10.1? If yes, can you mark it with sev:critical?
@nsivabalan no, this won't go into 0.10.1 as it introduces a new config. @dongkelun since this won't be included in 0.10.1, can we hold it off until next week to land? Just trying to avoid potential conflicts.
OK. If you're free, could you review it first? I'll submit the changes that are needed first, and then land it next week.
// It is here so that both the client and deltastreamer use the same reference
public static final String DELTASTREAMER_CHECKPOINT_KEY = "deltastreamer.checkpoint.key";

public static final ConfigProperty<String> DATABASE_NAME = ConfigProperty
Better to point to the definition of HoodieTableConfig.DATABASE_NAME directly, to avoid defining it repeatedly.
Yes, that was just to stay consistent with the parameters defined before. If not, should we leave the other parameters unchanged for now, or would it be better to revise them all uniformly?
Just changing the configs related to this PR is OK.
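The reviewer's suggestion above can be sketched as follows. This is a minimal, self-contained illustration, not Hudi's actual code: `ConfigProperty` here is a stand-in for Hudi's real builder, and the key string `"hoodie.database.name"` is assumed from HoodieTableConfig. The point is that `DataSourceOptions` would alias the existing table-level definition rather than declare a second property with the same key.

```java
// Self-contained sketch: alias the existing config definition instead of
// re-declaring it, so both call sites share one key string.
public class ConfigAliasSketch {
    // Stand-in for Hudi's ConfigProperty (simplified for illustration).
    static final class ConfigProperty<T> {
        final String key;
        final T defaultValue;
        ConfigProperty(String key, T defaultValue) {
            this.key = key;
            this.defaultValue = defaultValue;
        }
    }

    // The single source of truth, as HoodieTableConfig would define it
    // (key string assumed, not copied from the PR).
    static final ConfigProperty<String> TABLE_LEVEL_DATABASE_NAME =
        new ConfigProperty<>("hoodie.database.name", "");

    // DataSourceOptions points at the existing definition: an alias,
    // not a fresh (and potentially divergent) declaration.
    static final ConfigProperty<String> DATASOURCE_DATABASE_NAME = TABLE_LEVEL_DATABASE_NAME;

    public static void main(String[] args) {
        // Both references resolve to the identical object and key string.
        System.out.println(DATASOURCE_DATABASE_NAME == TABLE_LEVEL_DATABASE_NAME); // true
    }
}
```

Aliasing keeps the key, default value, and documentation in one place, so a later rename cannot leave the two definitions out of sync.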
@hudi-bot run azure
(4 similar comments)
@hudi-bot run azure
(1 similar comment)
...i-spark-common/src/main/scala/org/apache/spark/sql/catalyst/catalog/HoodieCatalogTable.scala
LGTM. This can probably land after a very small change.
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/InputPathHandler.java
@hudi-bot run azure |
xushiyan left a comment:
LGTM
When querying Hudi incrementally in Hive, we set the start query time per table. That setting applies to every table with the same name, not only to the table in the current database. In real deployments it cannot be guaranteed that table names are unique across databases, so this can be addressed by setting hoodie.table.name to database name + table name. However, at present the original value of hoodie.table.name is not kept consistent in Spark SQL, so I want to implement that in this PR.
In addition, I think we can add a configuration item at table-creation time to support database name + table name.
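A sketch of the naming scheme the description proposes: qualify hoodie.table.name with the database when one is set, and fall back to the bare table name otherwise. The helper name and the empty-database fallback are assumptions for illustration, not the PR's actual code.

```java
// Sketch (under assumptions): building a database-qualified table name for
// hoodie.table.name, falling back to the bare name when no database is set.
public class TableNameQualifier {
    static String qualify(String databaseName, String tableName) {
        if (databaseName == null || databaseName.isEmpty()) {
            // No database configured: keep the original behavior.
            return tableName;
        }
        return databaseName + "." + tableName;
    }

    public static void main(String[] args) {
        System.out.println(qualify("sales", "orders")); // sales.orders
        System.out.println(qualify("", "orders"));      // orders
    }
}
```

With this scheme, two tables named `orders` in databases `sales` and `ops` resolve to `sales.orders` and `ops.orders`, so per-table incremental settings no longer collide.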
What is the purpose of the pull request
The original hoodie.table.name should be maintained in Spark SQL
Brief change log
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.