[HUDI-1527] Automatically infer the full partition path when user only specifies the table path #3353
Conversation
@vinothchandar It works for CoW, the final test in
@pengzhiwei2018 are you able to take one pass at this PR?
Hi @rmahindra123, PR #2651 already supports querying by the table path, so I don't quite understand what this PR does. Can you explain it for me?
That's pretty much the functionality: supporting reads of a table without having to pass globs ("basePath///*"). You can see the tests.
Yes, I see. But PR #2651 already supports reading a table without passing globs. Is there any difference?
@pengzhiwei2018 It's similar functionality, but there were a few bugs and the tests were not actually testing the functionality. It required some refactoring to work for the different cases, so I put it in a new PR.
Thanks for the explanation, @rmahindra123. Can you implement this based on the
@pengzhiwei2018. I am planning on taking this over. If you have cycles, please feel free to give it a shot as well.
Please go ahead. I am currently busy with the SQL integration.
What is the purpose of the pull request
This is a PR for the changes that were done in #2475, with the following fixes:
Original PR description: To read a Hudi table, you need to specify a path, but that path is not just the table path: it has to match the partition directory structure, and different key generators produce different partition directory structures. A table with a first-level partition directory uses path=.../table//, and one with a second-level partition directory uses path=.../table///*, so making the user specify the data path is cumbersome. With this change, the user only needs to specify the table path: .../table
In addition, once a Hudi table can be read by configuring path=.../table, it is more convenient to query it with Spark SQL. You only need to add a table property to the Hive table metadata, spark.sql.sources.provider=hudi, and the Hive table is automatically treated as a Hudi table.
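The dependence of the read path on partition depth, as described above, can be illustrated with a small sketch. The helper name `legacy_glob_path` and the example table path are hypothetical, and the rule of one `*` per partition level plus one for the data files is an assumption based on the common Hudi read convention, not something taken from this PR's code.

```python
def legacy_glob_path(table_path: str, partition_depth: int) -> str:
    """Hypothetical helper: build the glob a reader had to pass by hand.

    Assumption: one '*' per partition level, plus one more for the data files.
    """
    if partition_depth < 0:
        raise ValueError("partition_depth must be non-negative")
    base = table_path.rstrip("/")  # avoid a doubled slash before the glob
    return base + "/*" * (partition_depth + 1)


if __name__ == "__main__":
    # Example table path is made up for illustration.
    for depth in (0, 1, 3):
        print(depth, legacy_glob_path("/data/hudi/trips", depth))
```

Because the number of wildcards changes with the key generator's partition layout, a caller who only knows the table path cannot build this string without inspecting the table, which is the inconvenience the PR aims to remove.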
Brief change log
Added logic in the createRelation() method in DefaultSource to detect when a user specifies only the table path, and not a glob, when reading the entire table. If so, the path is automatically rewritten as a glob so that the rest of the logic works as before.
Verify this pull request
Added unit and functional tests to cover all cases: no partition, single partition, date partition, and custom partition/key generation.
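The detect-and-rewrite step from the change log can be sketched roughly as follows. This is not the actual DefaultSource code (which is Scala); the function names are hypothetical, and the partition depth is passed in as a parameter because the PR text does not say how it is derived from the table.

```python
import re

# Characters treated as glob syntax in this sketch (an assumption, not Hudi's rule).
_GLOB_CHARS = re.compile(r"[*?\[\]{}]")


def is_glob(path: str) -> bool:
    """Return True if the path already contains glob characters."""
    return bool(_GLOB_CHARS.search(path))


def resolve_read_path(path: str, partition_depth: int) -> str:
    """Hypothetical helper: leave globs untouched, expand a bare table path.

    Assumption: a full-table read needs one '*' per partition level plus
    one for the data files.
    """
    if is_glob(path):
        return path  # the user supplied a glob, keep the existing behavior
    return path.rstrip("/") + "/*" * (partition_depth + 1)


if __name__ == "__main__":
    print(resolve_read_path("/data/hudi/trips", 3))
    print(resolve_read_path("/data/hudi/trips/*/*", 3))
```

Running the helper on its own output returns the same string, so code downstream of the rewrite can keep assuming it always receives a glob.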