[HUDI-1527] Automatically infer the full partition path when user only specifies the table path #3353

rmahindra123 · 2021-07-27T07:52:31Z

What is the purpose of the pull request

This PR carries forward the changes from #2475, with the following fixes:

The inference logic is fixed so that it actually works
The code is restructured and cleaned up, and the detection of full-table reads is decoupled from the existing logic
New tests are added, and the tests now actually verify their outputs

Original PR description: To read a Hudi table, you need to specify a path, but that path is not just the tablePath of the table; it has to match the partition directory structure. Different keyGenerators produce different partition directory structures. A table with one partition level needs path=.../table/*/*, and a table with two partition levels needs path=.../table/*/*/*, so asking the user to specify the data path is troublesome. The user should only need to specify the tablePath: .../table
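To make the relationship between partition depth and glob concrete, here is a minimal Python sketch. It is not code from this PR: the function name `glob_for_partition_depth` is hypothetical, and the non-partitioned case (a single `/*`) is an assumption extrapolated from the two examples in the description.

```python
def glob_for_partition_depth(table_path: str, partition_depth: int) -> str:
    """Build the read glob for a table with the given number of partition levels.

    Follows the pattern in the PR description: one partition level needs
    `table/*/*`, two levels need `table/*/*/*`, i.e. one `/*` per partition
    level plus one more for the data files themselves.
    """
    if partition_depth < 0:
        raise ValueError("partition_depth must be non-negative")
    # Drop any trailing slash so the result never contains `//`.
    base = table_path.rstrip("/")
    return base + "/*" * (partition_depth + 1)
```

For example, a table partitioned by a yyyy/MM/dd date path has three partition levels, so under this sketch it would be read with four `/*` segments appended to the table path.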

In addition, once a Hudi table can be read with path=.../table, it becomes more convenient to query it with Spark SQL. You only need to add one table property to the Hive table metadata, spark.sql.sources.provider=hudi, and the Hive table is automatically treated as a Hudi table.

Brief change log

Added logic to the createRelation() method in DefaultSource to detect when a user specifies only the table path, rather than a glob, to read the entire table. When this is detected, the path is automatically rewritten as a glob so that the rest of the logic works as before.
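The detect-and-rewrite step described above can be sketched in Python against a local filesystem. This illustrates the idea only and is not the Scala code in DefaultSource: all function names are hypothetical, identifying the table base path by its `.hoodie` metadata directory and measuring depth by descending into the first non-hidden subdirectory are assumptions of this sketch, and the PR may determine the partition structure differently.

```python
import os

# Characters that mark a path as a glob pattern rather than a plain path.
GLOB_CHARS = frozenset("*?[]{}")


def has_glob(path: str) -> bool:
    """Return True if the path contains any glob metacharacter."""
    return any(ch in GLOB_CHARS for ch in path)


def is_table_base_path(path: str) -> bool:
    """Assumption: a table base path is one that holds a `.hoodie` directory."""
    return os.path.isdir(os.path.join(path, ".hoodie"))


def partition_depth(table_path: str) -> int:
    """Count directory levels below the table path, ignoring hidden entries.

    Assumption: every partition has the same depth, so following the first
    non-hidden subdirectory at each level is enough.
    """
    depth = 0
    current = table_path
    while True:
        subdirs = sorted(
            name
            for name in os.listdir(current)
            if not name.startswith(".")
            and os.path.isdir(os.path.join(current, name))
        )
        if not subdirs:
            return depth
        current = os.path.join(current, subdirs[0])
        depth += 1


def resolve_read_path(path: str) -> str:
    """Rewrite a bare table path into a glob; leave every other path as-is."""
    if has_glob(path) or not is_table_base_path(path):
        return path
    return path.rstrip("/") + "/*" * (partition_depth(path) + 1)
```

A path that already contains a glob is returned untouched without touching the filesystem, which mirrors the stated goal that existing glob-based reads keep working as before.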

Verify this pull request

Added unit and functional tests to cover all cases: no partition, single partition, date partition, and custom partition / key generation.

rmahindra123 · 2021-07-27T23:34:41Z

@vinothchandar It works for CoW, but the final test in TestMORDataSource.java fails, so I have special-cased it for CoW for now.

vinothchandar · 2021-07-29T02:42:52Z

@pengzhiwei2018 are you able to take one pass at this PR?

pengzhiwei2018 · 2021-07-29T03:41:15Z

Hi @rmahindra123, PR #2651 already supports querying by the table path, so I do not quite understand what this PR does. Can you explain it to me?

vinothchandar · 2021-07-29T03:44:45Z

That's pretty much the functionality: supporting reads of a table without having to pass globs ("basePath/*/*/*"). You can see the tests.

pengzhiwei2018 · 2021-07-29T03:52:34Z

Yes, I see. But PR #2651 already supports reading a table without passing globs. Is there any difference?

rmahindra123 · 2021-07-29T05:06:30Z

@pengzhiwei2018 It's similar functionality, but there were a few bugs and the tests were not actually testing the functionality. It required some refactoring to work for the different cases, so I put it in a new PR.

pengzhiwei2018 · 2021-07-29T07:49:13Z

Thanks for the explanation, @rmahindra123. Can you implement this on top of HoodieFileIndex? From a quick review, I found some file listing operations that duplicate what HoodieFileIndex already does.

vinothchandar · 2021-08-02T21:11:56Z

@pengzhiwei2018 I am planning on taking this over. If you have cycles, please feel free to give it a shot as well.

pengzhiwei2018 · 2021-08-03T03:25:14Z

Please go ahead. I am currently busy with the SQL integration.

hudi-bot · 2021-11-05T02:53:35Z

CI report:

540a91e Azure: FAILURE

Rajesh Mahindra added 5 commits July 26, 2021 23:08

ce4ae24 Auto infer partition path from table path
f5d17f5 Add unit tests / functional tests to cover all cases for auto infer of full path
cfaf608 Fix tests
fa24314 Fix checkstyle issues
540a91e Special case for CoW

vinothchandar self-assigned this Jul 29, 2021

vinothchandar added the priority:blocker Production down; release blocker label Jul 29, 2021

vinothchandar mentioned this pull request Jul 29, 2021

[HUDI-1527] automatically infer the data directory, users only need to specify the table directory #2475

Closed

5 tasks

nsivabalan removed the priority:blocker Production down; release blocker label Aug 11, 2021

rmahindra123 closed this Nov 24, 2021

hudi-bot mentioned this pull request Nov 30, 2025

Automatically infer the data directory, users only need to specify the table directory #14439

Open


5 participants
