Conversation

@dongkelun
Contributor

@dongkelun dongkelun commented Nov 23, 2021

When querying Hudi incrementally in Hive, we set the start query time for a table. This setting applies to every table with the same name, not only to the table in the current database. In real deployments it cannot be guaranteed that table names differ across databases, so this could be addressed by setting hoodie.table.name to database name + table name. However, the original value of hoodie.table.name is currently not kept consistent in Spark SQL, so I want to implement that in this PR.

In addition, I think we can add a configuration option when creating tables to support database name + table name.
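
To illustrate why same-named tables collide, here is a minimal, hypothetical Java sketch (the table name "orders" and databases db1/db2 are made up; the key pattern is the one discussed later in this thread) of how the Hive consume-start property key is derived from the table name alone:

```java
// Sketch only: hoodie.%s.consume.start.timestamp is the key pattern used by
// Hive incremental queries; table/database names here are hypothetical.
public class ConsumeKeyDemo {

    // Build the Hive session property key for a table's incremental start time.
    static String consumeStartKey(String tableName) {
        return String.format("hoodie.%s.consume.start.timestamp", tableName);
    }

    public static void main(String[] args) {
        // db1.orders and db2.orders both map to the SAME key, so one setting
        // affects every table named "orders", regardless of database:
        System.out.println(consumeStartKey("orders"));
        // Prefixing the database name yields a distinct key per database:
        System.out.println(consumeStartKey("db1.orders"));
    }
}
```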

What is the purpose of the pull request

The original hoodie.table.name should be maintained in Spark SQL

Brief change log


  • The original hoodie.table.name should be maintained in Spark SQL

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:


Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@dongkelun dongkelun force-pushed the HUDI-2837 branch 4 times, most recently from 2ae82de to 3fc3118 Compare November 25, 2021 06:59
@dongkelun
Contributor Author

@xushiyan @YannByron Hi, can you please help review this PR?

@xushiyan xushiyan self-assigned this Nov 27, 2021
Member

@xushiyan xushiyan left a comment


@YannByron can you take a look please?

@YannByron
Contributor

YannByron commented Dec 28, 2021

@dongkelun @xushiyan
I'm sorry, but I don't support this PR as the solution to the problem of setting a table's start query time and querying incrementally. Some points we should think about:

  1. As written, this PR only works for Spark SQL. What about Spark DataFrame writes? We should support both.
  2. After adding a database config, whether we obtain the database value from a dedicated config such as hoodie.datasource.write.database.name or parse it from the existing hoodie.datasource.write.table.name/hoodie.table.name when hoodie.sql.uses.database.table.name is enabled, we end up with four related options: hoodie.datasource.hive_sync.table, hoodie.datasource.hive_sync.database, and the two mentioned above. Users then have to learn all of these. Can we combine and simplify them?

IMO, Hudi, with its mountain of configs, already has a high barrier to use. We should choose solutions that balance functionality and user experience as far as possible.

@YannByron
Contributor

For the 2nd point above, we can consider combining the four or five configs into two: just hoodie.database.name and hoodie.table.name. If syncing to Hive is enabled, these two configs represent the database and table in the metastore.
Currently, hoodie.datasource.write.table.name/hoodie.table.name is required. Under this idea, the user only needs to provide one additional config, hoodie.database.name (or hoodie.datasource.write.database.name), so that we can sync and identify a specific Hudi table for incremental queries in Hive.

It's just my personal idea, and I look forward to further discussion.
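
As a rough illustration of this two-config idea (the option names are taken from the comment above; the helper method and values are hypothetical, not Hudi code), the metastore target could be derived from just the two settings:

```java
import java.util.Map;

// Hypothetical sketch: deriving the Hive metastore target from only
// hoodie.database.name and hoodie.table.name, instead of separate
// hive_sync.database / hive_sync.table options.
public class CombinedConfigDemo {

    // Join the two combined configs into a metastore identifier.
    static String metastoreIdentifier(Map<String, String> props) {
        return props.get("hoodie.database.name") + "." + props.get("hoodie.table.name");
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of(
            "hoodie.database.name", "db1",   // made-up values
            "hoodie.table.name", "orders");
        System.out.println(metastoreIdentifier(props));
    }
}
```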

@dongkelun
Contributor Author

dongkelun commented Dec 28, 2021


@YannByron Hello,
1. For Spark DataFrame writes, we can use hoodie.table.name to specify the table name.
2. The database name can be specified when creating tables in Spark SQL, so it is not specified through hoodie.database.name or other configurations. I think hoodie.sql.use.database.table.name is just a switch that decides whether SQL should prefix hoodie.table.name with the database name; it does not conflict with other configurations.
As for combining the other duplicate configuration items, I think we can address that in a separate PR.

@dongkelun
Contributor Author

@hudi-bot run azure

1 similar comment

@YannByron
Contributor

@dongkelun @xushiyan
I'd like to offer another solution for discussion.

Querying incrementally in Hive requires setting hoodie.%s.consume.start.timestamp, which is read in HoodieHiveUtils.readStartCommitTime. Currently, we pass hoodie.table.name as the tableName to this function.
We could add the configs hoodie.datasource.write.database.name in DataSourceWriteOptions and hoodie.database.name in HoodieTableConfig. If a database name is provided, we join the database name and the table name and pass the result to readStartCommitTime. The user can then set hoodie.dbName.tableName.consume.start.timestamp in Hive and query.

Also, hoodie.datasource.write.database.name and hoodie.database.name can be reused in other scenarios.
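
A minimal sketch of the joining step described above (the class and method names are illustrative, not the actual Hudi code):

```java
public class QualifiedNameDemo {

    // Join database and table name; fall back to the bare table name when
    // no database name is configured.
    static String qualifiedName(String databaseName, String tableName) {
        if (databaseName == null || databaseName.isEmpty()) {
            return tableName;
        }
        return databaseName + "." + tableName;
    }

    // The value passed on to readStartCommitTime then determines which
    // hoodie.<name>.consume.start.timestamp key Hive consults.
    static String consumeStartKey(String databaseName, String tableName) {
        return String.format("hoodie.%s.consume.start.timestamp",
            qualifiedName(databaseName, tableName));
    }

    public static void main(String[] args) {
        System.out.println(consumeStartKey("db1", "orders"));
        System.out.println(consumeStartKey(null, "orders"));
    }
}
```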

@xushiyan What do you think?

@dongkelun
Contributor Author


@xushiyan @YannByron I think I understand the solution.

SQL persists the database name to hoodie.properties by default, while the DataFrame path persists it optionally through a database parameter. Then, for an incremental query: if databaseName.tableName is set, we match on databaseName.tableName; if it is inconsistent, or the table has no persisted databaseName, no incremental query is performed; if it is consistent, the incremental query runs. If the incremental query setting does not include a database name, we do not match on the database name, only on the table name.
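
That matching rule could look roughly like this (a hypothetical helper for illustration only; the real change would live in the Hive query path):

```java
public class IncrementalMatchDemo {

    // queryName is the name the user put in hoodie.<name>.consume.start.timestamp;
    // databaseName/tableName are what the table persisted in hoodie.properties.
    static boolean matches(String queryName, String databaseName, String tableName) {
        int dot = queryName.indexOf('.');
        if (dot >= 0) {
            // Query specified databaseName.tableName: both parts must match,
            // and a table with no persisted database name never matches.
            String qDb = queryName.substring(0, dot);
            String qTable = queryName.substring(dot + 1);
            return qDb.equals(databaseName) && qTable.equals(tableName);
        }
        // Query specified only a table name: match on the table name alone.
        return queryName.equals(tableName);
    }

    public static void main(String[] args) {
        System.out.println(matches("db1.orders", "db1", "orders")); // true
        System.out.println(matches("db1.orders", "db2", "orders")); // false
        System.out.println(matches("orders", "db1", "orders"));     // true
    }
}
```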

So, which parameter should the DataFrame path use to persist the database name?

@dongkelun
Contributor Author


@xushiyan Hello, do you think this idea is OK? If so, I'll submit a version based on it first.

@xushiyan
Member

xushiyan commented Jan 2, 2022


@YannByron @dongkelun Sorry for the late reply. Instead of adding a switch for using the database name, setting the config itself and checking its value is cleaner. The idea sounds good to me. Thanks.

@dongkelun
Contributor Author

@hudi-bot run azure

@dongkelun dongkelun force-pushed the HUDI-2837 branch 2 times, most recently from 46053bb to 03e042a Compare January 2, 2022 10:13
@dongkelun
Contributor Author

@YannByron @xushiyan Hello, I have modified and submitted the code according to the new solution. Can you take a look?

@nsivabalan
Contributor

@xushiyan: Are we looking to get this into 0.10.1? If yes, can you mark it with sev:critical?

@xushiyan
Member

xushiyan commented Jan 5, 2022

@nsivabalan No, this won't go into 0.10.1, as it introduces a new config. @dongkelun since this won't be included in 0.10.1, can we hold it off until next week to land? Just trying to avoid potential conflicts.

@dongkelun
Contributor Author


OK. If you're free, can you review it first? I'll submit the code that needs to be modified first, and then land it next week.

// It is here so that both the client and deltastreamer use the same reference
public static final String DELTASTREAMER_CHECKPOINT_KEY = "deltastreamer.checkpoint.key";

public static final ConfigProperty<String> DATABASE_NAME = ConfigProperty
Contributor


Better to point to the definition of HoodieTableConfig.DATABASE_NAME directly, to avoid defining it repeatedly.

Contributor Author


Yes, this was just to stay consistent with how the other parameters were defined before. If not, should the other parameters be left unchanged for now, or would it be better to revise them all uniformly?

Contributor


Just changing the configs related to this PR is OK.

@dongkelun
Contributor Author

@hudi-bot run azure

4 similar comments

@dongkelun
Contributor Author

@hudi-bot run azure

1 similar comment

@YannByron
Contributor

@YannByron Hello, I submitted the newly modified code and added a test case for when the databaseName is empty or null.

LGTM. Maybe this can land after one very small change.

@dongkelun
Contributor Author

@hudi-bot run azure

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Member

@xushiyan xushiyan left a comment


LGTM

@xushiyan xushiyan merged commit 56cd8ff into apache:master Jan 23, 2022
@xushiyan xushiyan added the priority:high Significant impact; potential bugs label Jan 23, 2022
@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022