Conversation

@dongkelun
Contributor

@dongkelun dongkelun commented Nov 23, 2021

When querying Hudi incrementally in Hive, we set the start query time for a table. This setting applies to every table with the same name, not only to the table in the current database. In real deployments it cannot be guaranteed that table names differ across databases, so this could be addressed by setting hoodie.table.name to database name + table name. However, the original value of hoodie.table.name is currently not kept consistent in Spark SQL, so I want to implement that in this PR.

In addition, I think we can add a configuration option when creating tables to support database name + table name.
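
To illustrate why same-named tables collide, here is a minimal, hypothetical Java sketch (the table name "orders" and databases db1/db2 are made up; the key pattern is the one discussed later in this thread) of how the Hive consume-start property key is derived from the table name alone:

```java
// Sketch only: hoodie.%s.consume.start.timestamp is the key pattern used by
// Hive incremental queries; table/database names here are hypothetical.
public class ConsumeKeyDemo {

    // Build the Hive session property key for a table's incremental start time.
    static String consumeStartKey(String tableName) {
        return String.format("hoodie.%s.consume.start.timestamp", tableName);
    }

    public static void main(String[] args) {
        // db1.orders and db2.orders both map to the SAME key, so one setting
        // affects every table named "orders", regardless of database:
        System.out.println(consumeStartKey("orders"));
        // Prefixing the database name yields a distinct key per database:
        System.out.println(consumeStartKey("db1.orders"));
    }
}
```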

What is the purpose of the pull request

The original hoodie.table.name should be maintained in Spark SQL

Brief change log


  • The original hoodie.table.name should be maintained in Spark SQL

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:


Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@dongkelun dongkelun force-pushed the HUDI-2837 branch 4 times, most recently from 2ae82de to 3fc3118 Compare November 25, 2021 06:59
@dongkelun
Contributor Author

@xushiyan @YannByron Hi, can you please help review this PR?

@xushiyan xushiyan self-assigned this Nov 27, 2021
Member

@xushiyan xushiyan left a comment


@YannByron can you take a look please?

@YannByron
Contributor

YannByron commented Dec 28, 2021

@dongkelun @xushiyan
I'm sorry, but I don't support this PR as the solution to the problem of setting a table's start query time and querying incrementally. Some points we should think about:

  1. As written, this PR only works for Spark SQL. What about Spark DataFrame writes? We should support both.
  2. After adding a database config, whether we obtain the database value from a dedicated config such as hoodie.datasource.write.database.name or parse it from the existing hoodie.datasource.write.table.name/hoodie.table.name when hoodie.sql.uses.database.table.name is enabled, we end up with four related options: hoodie.datasource.hive_sync.table, hoodie.datasource.hive_sync.database, and the two mentioned above. Users then have to learn all of these. Can we combine and simplify them?

IMO, Hudi, with its mountain of configs, already has a high barrier to use. We should choose solutions that balance functionality and user experience as far as possible.

@YannByron
Contributor

For the 2nd point above, we can consider combining the four or five configs into two: just hoodie.database.name and hoodie.table.name. If syncing to Hive is enabled, these two configs represent the database and table in the metastore.
Currently, hoodie.datasource.write.table.name/hoodie.table.name is required. Under this idea, the user only needs to provide one additional config, hoodie.database.name (or hoodie.datasource.write.database.name), so that we can sync and identify a specific Hudi table for incremental queries in Hive.

It's just my personal idea, and I look forward to further discussion.
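
As a rough illustration of this two-config idea (the option names are taken from the comment above; the helper method and values are hypothetical, not Hudi code), the metastore target could be derived from just the two settings:

```java
import java.util.Map;

// Hypothetical sketch: deriving the Hive metastore target from only
// hoodie.database.name and hoodie.table.name, instead of separate
// hive_sync.database / hive_sync.table options.
public class CombinedConfigDemo {

    // Join the two combined configs into a metastore identifier.
    static String metastoreIdentifier(Map<String, String> props) {
        return props.get("hoodie.database.name") + "." + props.get("hoodie.table.name");
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of(
            "hoodie.database.name", "db1",   // made-up values
            "hoodie.table.name", "orders");
        System.out.println(metastoreIdentifier(props));
    }
}
```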

@dongkelun
Contributor Author

dongkelun commented Dec 28, 2021


@YannByron Hello,
1. For Spark DataFrame writes, we can use hoodie.table.name to specify the table name.
2. The database name can be specified when creating tables in Spark SQL, so it is not specified through hoodie.database.name or other configurations. I think hoodie.sql.use.database.table.name is just a switch that decides whether SQL should prefix hoodie.table.name with the database name; it does not conflict with other configurations.
As for combining the other duplicate configuration items, I think we can address that in a separate PR.

@dongkelun
Contributor Author

@hudi-bot run azure

1 similar comment

@YannByron
Contributor

@dongkelun @xushiyan
I'd like to offer another solution for discussion.

Querying incrementally in Hive requires setting hoodie.%s.consume.start.timestamp, which is read in HoodieHiveUtils.readStartCommitTime. Currently, we pass hoodie.table.name as the tableName to this function.
We could add the configs hoodie.datasource.write.database.name in DataSourceWriteOptions and hoodie.database.name in HoodieTableConfig. If a database name is provided, we join the database name and the table name and pass the result to readStartCommitTime. The user can then set hoodie.dbName.tableName.consume.start.timestamp in Hive and query.

Also, hoodie.datasource.write.database.name and hoodie.database.name can be reused in other scenarios.
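
A minimal sketch of the joining step described above (the class and method names are illustrative, not the actual Hudi code):

```java
public class QualifiedNameDemo {

    // Join database and table name; fall back to the bare table name when
    // no database name is configured.
    static String qualifiedName(String databaseName, String tableName) {
        if (databaseName == null || databaseName.isEmpty()) {
            return tableName;
        }
        return databaseName + "." + tableName;
    }

    // The value passed on to readStartCommitTime then determines which
    // hoodie.<name>.consume.start.timestamp key Hive consults.
    static String consumeStartKey(String databaseName, String tableName) {
        return String.format("hoodie.%s.consume.start.timestamp",
            qualifiedName(databaseName, tableName));
    }

    public static void main(String[] args) {
        System.out.println(consumeStartKey("db1", "orders"));
        System.out.println(consumeStartKey(null, "orders"));
    }
}
```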

@xushiyan What do you think?

@dongkelun
Contributor Author


@xushiyan @YannByron I think I understand the solution.

SQL persists the database name to hoodie.properties by default, while the DataFrame path persists it optionally through a database parameter. Then, for an incremental query: if databaseName.tableName is set, we match on databaseName.tableName; if it is inconsistent, or the table has no persisted databaseName, no incremental query is performed; if it is consistent, the incremental query runs. If the incremental query setting does not include a database name, we do not match on the database name, only on the table name.
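
That matching rule could look roughly like this (a hypothetical helper for illustration only; the real change would live in the Hive query path):

```java
public class IncrementalMatchDemo {

    // queryName is the name the user put in hoodie.<name>.consume.start.timestamp;
    // databaseName/tableName are what the table persisted in hoodie.properties.
    static boolean matches(String queryName, String databaseName, String tableName) {
        int dot = queryName.indexOf('.');
        if (dot >= 0) {
            // Query specified databaseName.tableName: both parts must match,
            // and a table with no persisted database name never matches.
            String qDb = queryName.substring(0, dot);
            String qTable = queryName.substring(dot + 1);
            return qDb.equals(databaseName) && qTable.equals(tableName);
        }
        // Query specified only a table name: match on the table name alone.
        return queryName.equals(tableName);
    }

    public static void main(String[] args) {
        System.out.println(matches("db1.orders", "db1", "orders")); // true
        System.out.println(matches("db1.orders", "db2", "orders")); // false
        System.out.println(matches("orders", "db1", "orders"));     // true
    }
}
```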

So, which parameter should the DataFrame path use to persist the database name?

@dongkelun
Contributor Author


@xushiyan Hello, do you think this idea is OK? If so, I'll submit a version based on it first.

@xushiyan
Member

xushiyan commented Jan 2, 2022


@YannByron @dongkelun Sorry for the late reply. Instead of adding a switch for using the database name, setting the config itself and checking its value is cleaner. The idea sounds good to me. Thanks.

@dongkelun
Contributor Author

@hudi-bot run azure

@dongkelun dongkelun force-pushed the HUDI-2837 branch 2 times, most recently from 46053bb to 03e042a Compare January 2, 2022 10:13
@dongkelun
Contributor Author

@YannByron @xushiyan Hello, I have modified and submitted the code according to the new solution. Can you take a look?

@nsivabalan
Contributor

@xushiyan: Are we looking to get this into 0.10.1? If yes, can you mark it with sev:critical?

@xushiyan
Member

xushiyan commented Jan 5, 2022

@nsivabalan No, this won't go into 0.10.1, as it introduces a new config. @dongkelun since this won't be included in 0.10.1, can we hold it off until next week to land? Just trying to avoid potential conflicts.

@dongkelun
Contributor Author


OK. If you're free, can you review it first? I'll submit the code that needs to be modified first, and then land it next week.

// It is here so that both the client and deltastreamer use the same reference
public static final String DELTASTREAMER_CHECKPOINT_KEY = "deltastreamer.checkpoint.key";

public static final ConfigProperty<String> DATABASE_NAME = ConfigProperty
Contributor


Better to point to the definition of HoodieTableConfig.DATABASE_NAME directly, to avoid defining it repeatedly.

Contributor Author


Yes, this was just to stay consistent with how the other parameters were defined before. If not, should the other parameters be left unchanged for now, or would it be better to revise them all uniformly?

Contributor


Just changing the configs related to this PR is OK.

@dongkelun
Contributor Author

@hudi-bot run azure

4 similar comments

@dongkelun
Contributor Author

@hudi-bot run azure

1 similar comment

@YannByron
Contributor

@YannByron Hello, I submitted the newly modified code and added a test case for when the databaseName is empty or null.

LGTM. Maybe this can land after one very small change.

@dongkelun
Contributor Author

@hudi-bot run azure

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Member

@xushiyan xushiyan left a comment


LGTM

@xushiyan xushiyan merged commit 56cd8ff into apache:master Jan 23, 2022
@xushiyan xushiyan added the priority:high Significant impact; potential bugs label Jan 23, 2022
@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022