Conversation

@teeyog
Contributor

@teeyog teeyog commented Jan 22, 2021


What is the purpose of the pull request

To read a Hudi table you need to specify a path, but that path is not simply the tablePath of the table; it has to reflect the partition directory structure, and different key generators produce different structures. A single-level partition layout needs path=.../table/*/*, a two-level layout needs path=.../table/*/*/*, and so on. Making the user spell out the data path is troublesome; with this change the user only needs to specify the tablePath: .../table
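
For illustration, a minimal before/after sketch (the location /data/table and a two-level partition layout are hypothetical):

    // Before: the user must know the partition depth and spell out the glob.
    val before = spark.read.format("org.apache.hudi").load("/data/table/*/*/*")

    // After this change: the table path alone is enough; the glob is inferred.
    val after = spark.read.format("org.apache.hudi").load("/data/table")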

At the same time, once the Hudi table can be read by configuring path=.../table, it becomes more convenient to query it with Spark SQL: you only need to add the table property spark.sql.sources.provider=hudi to the Hive table metadata, and the Hive table is automatically resolved as a Hudi table.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-io

codecov-io commented Jan 22, 2021

Codecov Report

Merging #2475 (62cff1e) into master (43a0776) will increase coverage by 0.03%.
The diff coverage is 93.33%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2475      +/-   ##
============================================
+ Coverage     51.14%   51.17%   +0.03%     
- Complexity     3215     3219       +4     
============================================
  Files           438      438              
  Lines         20041    20055      +14     
  Branches       2064     2067       +3     
============================================
+ Hits          10250    10264      +14     
  Misses         8946     8946              
  Partials        845      845              
Flag                 Coverage Δ                Complexity Δ
hudicli              36.87% <ø> (ø)            0.00 <ø> (ø)
hudiclient           100.00% <ø> (ø)           0.00 <ø> (ø)
hudicommon           51.36% <ø> (ø)            0.00 <ø> (ø)
hudiflink            45.44% <ø> (ø)            0.00 <ø> (ø)
hudihadoopmr         33.16% <ø> (ø)            0.00 <ø> (ø)
hudisparkdatasource  69.96% <93.33%> (+0.20%)  0.00 <0.00> (ø)
hudisync             48.61% <ø> (ø)            0.00 <ø> (ø)
huditimelineservice  66.49% <ø> (ø)            0.00 <ø> (ø)
hudiutilities        69.51% <ø> (+0.05%)       0.00 <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files                                          Coverage Δ                Complexity Δ
...src/main/scala/org/apache/hudi/DefaultSource.scala   85.41% <93.33%> (+1.27%)  20.00 <0.00> (+3.00)
...apache/hudi/utilities/deltastreamer/DeltaSync.java   70.35% <0.00%> (+0.35%)   51.00% <0.00%> (+1.00%)

@rubenssoto

This is a great and important feature that makes Hudi easier for non-power users.

@vinothchandar
Member

@zhedoubushishi @umehrot2 could you please take a first pass?

Member

@vinothchandar vinothchandar left a comment


Thanks for this! Seems very useful.

One thing I wanted to understand: can a user still do basePath/2020/*/* and have only the parquet files for 2020 read out, for example?

Member


This short-circuits the recursive listing once we get one partition path, I guess.

Member


So, I am wondering if we can use the HoodieTableMetadata abstraction to read a partition path, instead of listing directly. We are trying to avoid introducing any single-point listings. There is already a method to get all partition paths, FSUtils.getAllPartitionPaths(); let's just use that for now? I am thinking it would be a little bit of an overkill to list all partition paths without the metadata table.

Contributor


Yeah, I agree it would be better to use HoodieTableMetadata to avoid fs.listStatus. But what about tables without the metadata feature enabled? Will it take a very long time for a table with many partitions?

Also, hoodie_partition_metadata stores a parameter called partitionDepth; could we take advantage of that?
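
For reference, a minimal sketch of reading that parameter (helper name hypothetical; error handling omitted):

    import java.util.Properties
    import org.apache.hadoop.fs.{FileSystem, Path}

    // .hoodie_partition_metadata is a properties file written into each partition;
    // among other keys it records the partition's depth relative to the base path.
    def readPartitionDepth(fs: FileSystem, partitionPath: Path): Int = {
      val props = new Properties()
      val in = fs.open(new Path(partitionPath, ".hoodie_partition_metadata"))
      try props.load(in) finally in.close()
      props.getProperty("partitionDepth").toInt
    }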

Contributor Author


Thank you for your review. This method of obtaining a partition is very fast: as soon as one partition path is found, it returns directly. FSUtils.getAllPartitionPaths obtains all partition paths, which is very time-consuming.
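
A minimal sketch of such a short-circuiting traversal (the helper name is hypothetical; the .hoodie_partition_metadata marker file is used to recognize a partition):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Depth-first search that returns as soon as the first partition path
    // (a directory containing .hoodie_partition_metadata) is found.
    def findFirstPartitionPath(fs: FileSystem, dir: Path): Option[Path] = {
      if (fs.exists(new Path(dir, ".hoodie_partition_metadata"))) {
        Some(dir)
      } else {
        fs.listStatus(dir).iterator
          .filter(s => s.isDirectory && !s.getPath.getName.startsWith("."))
          .map(s => findFirstPartitionPath(fs, s.getPath))
          .collectFirst { case Some(p) => p }
      }
    }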

Member


@teeyog we could even add a new overload/method for this under the HoodieTableMetadata interface; it is really good to keep all of this under the interface. With the metadata table, it's actually okay to call getAllPartitionPaths(); it's pretty fast.

@teeyog
Contributor Author

teeyog commented Feb 4, 2021

Thanks for this! Seems very useful.

One thing I wanted to understand: can a user still do basePath/2020/*/* and have only the parquet files for 2020 read out, for example?

Of course it can, but it must be specified via the parameter basePath=../2020/*/*

@teeyog teeyog requested a review from vinothchandar February 4, 2021 03:06
@vinothchandar
Member

@teeyog could you call FSUtils.getAllPartitionPaths(), or add a new method getNPartitionPaths() that returns the first N partition paths using your traversal, in the class FileSystemBackedTableMetadata?

@vinothchandar
Member

You only need to add the table property spark.sql.sources.provider=hudi to the Hive table metadata, and the Hive table is automatically resolved as a Hudi table.

@teeyog can you please expand on this? Is this related to this PR, or a general comment?

@vinothchandar
Member

Of course it can, but it must be specified via the parameter basePath=../2020/*/*

I want to clarify this a bit. Do you mean val READ_PATHS_OPT_KEY = "hoodie.datasource.read.paths"?

If I do the following, I see that we reset the path in options to basePath + "/*/*/*/*". How does the Spark parquet source know to only look for 2020 and 2021, e.g.?

val snapshotDF1 = spark.read.format("org.apache.hudi")
      .load(basePath + "/202*/*/*/*")

@vinothchandar
Member

Other than these two, I am good with this.

@teeyog
Contributor Author

teeyog commented Feb 7, 2021

You only need to add the table property spark.sql.sources.provider=hudi to the Hive table metadata, and the Hive table is automatically resolved as a Hudi table.

@teeyog can you please expand on this? Is this related to this PR, or a general comment?

If the Hive table's TBLPROPERTIES contain spark.sql.sources.provider=hudi, Spark SQL resolves a read of the Hive table as follows:
First step
https://github.com/apache/spark/blob/62be2483d7d78e61fd2f77929cf41c76eff17869/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L302

Second step
https://github.com/apache/spark/blob/62be2483d7d78e61fd2f77929cf41c76eff17869/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L261

The resolveRelation call in the second step goes directly to Hudi's DefaultSource, so a read of the Hive table is automatically turned into a read of the Hudi table.
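
A minimal sketch of the effect, assuming an active SparkSession spark and a Hive-synced table named hudi_table (hypothetical):

    // Mark the Hive table so Spark SQL routes reads through Hudi's DefaultSource.
    spark.sql("ALTER TABLE hudi_table SET TBLPROPERTIES ('spark.sql.sources.provider' = 'hudi')")

    // A plain Spark SQL query now resolves the relation via Hudi rather than plain Hive/Parquet.
    spark.sql("SELECT * FROM hudi_table").show()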

@teeyog
Contributor Author

teeyog commented Feb 7, 2021

Of course it can, but it must be specified via the parameter basePath=../2020/*/*

I want to clarify this a bit. Do you mean val READ_PATHS_OPT_KEY = "hoodie.datasource.read.paths"?

If I do the following, I see that we reset the path in options to basePath + "/*/*/*/*". How does the Spark parquet source know to only look for 2020 and 2021, e.g.?

val snapshotDF1 = spark.read.format("org.apache.hudi")
      .load(basePath + "/202*/*/*/*")

I understand what you mean. The situation you describe is indeed not supported, because the automatically inferred data path overrides the path configured by the user. If you only need 2020 and 2021, you can filter the resulting DataFrame afterwards. Alternatively, I could check whether the path specified by the user contains *; if it does, the data path would not be automatically inferred. What do you think?
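
A sketch of the DataFrame-filter workaround, assuming a hypothetical partition column named year:

    import org.apache.spark.sql.functions.col

    // Read the whole table (glob inferred from tablePath), then prune by partition column.
    val df = spark.read.format("org.apache.hudi")
      .load(tablePath)
      .filter(col("year").isin("2020", "2021"))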

@teeyog
Contributor Author

teeyog commented Feb 7, 2021

@teeyog could you call FSUtils.getAllPartitionPaths(), or add a new method getNPartitionPaths() that returns the first N partition paths using your traversal, in the class FileSystemBackedTableMetadata?

It has been modified to obtain the partition paths via FSUtils.getAllPartitionPaths().

Member


@teeyog Sorry, it's still not clear to me. I supplied the globbed path 2015/*/*/*, and even that gets overridden to path -> tablePath/*/*/*/*.

Won't this incur reading all partitions under the tablePath, as opposed to only 2015's?


Contributor Author


The path specified by the user is overwritten by the automatically inferred data directory, so your use case cannot be met at the moment.

Contributor Author


I will try to see whether I can keep the automatic inference while also meeting your needs.

Contributor Author


@vinothchandar It now supports your use case: if the path specified by the user is the table path, the glob is automatically inferred; otherwise it is not.
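
A usage sketch of the resulting behavior, assuming a year/month/day-partitioned table at a hypothetical basePath:

    // Table path: the partition glob is inferred automatically; the whole table is read.
    val allDF = spark.read.format("org.apache.hudi").load(basePath)

    // Explicit glob: used as-is (no inference), so only 2020's partitions are listed.
    val df2020 = spark.read.format("org.apache.hudi").load(basePath + "/2020/*/*/*")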

…o specify the table directory [adapt no partition]
@teeyog teeyog requested a review from vinothchandar February 25, 2021 07:56
// Diff context: list all partition paths using the file-system-backed table metadata.
val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
val fsBackedTableMetadata =
  new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
Contributor


@teeyog hello, the partition paths are now inferred via getAllPartitionPaths from the metadata table. The partition scheme is set by hoodie.datasource.write.partitionpath.field when the Hudi table is written. Could we persist hoodie.datasource.write.partitionpath.field to the metadata table, so that a read just fetches that property instead of listing all the partition paths? cc @vinothchandar
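
A sketch of what the proposal could look like on the read side, assuming the write config key were persisted in the table's hoodie.properties (this is the suggestion, not current behavior):

    import java.util.Properties

    // props would be loaded from <tablePath>/.hoodie/hoodie.properties.
    // The partition depth falls out of the comma-separated partition field list.
    def partitionDepthFromConfig(props: Properties): Option[Int] =
      Option(props.getProperty("hoodie.datasource.write.partitionpath.field"))
        .map(_.split(",").length)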

Contributor Author


@lw309637554 Thank you for your review. The path of the Hudi table can now also be supplied through configuration instead of being inferred.

@vinothchandar vinothchandar added the priority:high Significant impact; potential bugs label Mar 14, 2021
@vinothchandar vinothchandar added the priority:blocker Production down; release blocker label Jul 5, 2021
@hudi-bot
Collaborator

hudi-bot commented Jul 5, 2021

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@vinothchandar
Member

Closing in favor of #3353
