Skip to content

Conversation

@rmahindra123
Copy link
Contributor

What is the purpose of the pull request

This is a PR for the changes that were done in #2475, with following fixes:

  1. The actual logic is fixed to ensure it works
  2. Code re-structure / cleanup and decouple the logic to detect full table reads from the current logic
  3. Added new tests, and ensure tests actually verify the outputs

Original PR description: To read the hudi table, you need to specify the path, but the path is not only the tablePath corresponding to the table, but needs to be determined by the partition directory structure. Different keyGenerators correspond to different partition directory structures. The first-level partition directory uses path=.../table//, the secondary partition directory path=.../table///*,so it is troublesome to let the user specify the data path, the user only needs to specify the tablePath: .../table

At the same time, after reading the hudi table by configuring path=.../table, it is more convenient to use sparksql to query the hudi table. You only need to add tab properties to the hive table metadata: spark.sql.sources.provider= hudi, you can automatically convert the hive table to the hudi table.

Brief change log

Added logic in createRelation() method in DefaultSource to detect when a user only specifies the full table path and not a blob when reading the entire table. If detected correctly, the path is rewritten automatically as a blob that will ensure the rest of the logic works as before.

Verify this pull request

Added unit and functional tests to cover all cases: no partition, single partition, date partition, and custom partition/ key generation.

@rmahindra123
Copy link
Contributor Author

rmahindra123 commented Jul 27, 2021

@vinothchandar It works for CoW, the final test in TestMORDataSource.java fails, so I have special cased it for CoW for now.

@vinothchandar
Copy link
Member

@pengzhiwei2018 are you able to take one pass at this PR?

@pengzhiwei2018
Copy link

Hi @rmahindra123 , the PR #2651 has already support query by the table path. So I am not very understand what is the PR do, can you explain it for me?

@vinothchandar
Copy link
Member

That's pretty much the functionality, supporting reads of table without having to pass globs ("basePath///*") You can see the tests.

@pengzhiwei2018
Copy link

That's pretty much the functionality, supporting reads of table without having to pass globs ("basePath///*") You can see the tests.

Yes, I see. But the PR #2651 has already supported reading table without pass globs. is there any difference?

@rmahindra123
Copy link
Contributor Author

That's pretty much the functionality, supporting reads of table without having to pass globs ("basePath///*") You can see the tests.

Yes, I see. But the PR #2651 has already supported reading table without pass globs. is there any difference?

@pengzhiwei2018 Its similar functionality but there were a few bugs and the tests were not actually testing the functionality. It required some refactoring to work for different cases, so I put it in a new PR.

@pengzhiwei2018
Copy link

That's pretty much the functionality, supporting reads of table without having to pass globs ("basePath///*") You can see the tests.

Yes, I see. But the PR #2651 has already supported reading table without pass globs. is there any difference?

@pengzhiwei2018 Its similar functionality but there were a few bugs and the tests were not actually testing the functionality. It required some refactoring to work for different cases, so I put it in a new PR.

Thanks for your instructions @rmahindra123 , can you implement this based on the HoodieFileIndex? By a simple review, I found some duplicate file list operation with the HoodieFileIndex.

@vinothchandar
Copy link
Member

@pengzhiwei2018 . I am planning on taking this over. If you have cycles, please feel free give it a shot as well.

@pengzhiwei2018
Copy link

please feel free give it a shot as well

@vinothchandar It works for CoW, the final test in TestMORDataSource.java fails, so I have special cased it for CoW for now.

@pengzhiwei2018 . I am planning on taking this over. If you have cycles, please feel free give it a shot as well.

Please feel free for this. I am currently busy with the sql integration.

@nsivabalan nsivabalan removed the priority:blocker Production down; release blocker label Aug 11, 2021
@hudi-bot
Copy link
Collaborator

hudi-bot commented Nov 5, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants