Conversation

@teeyog
Contributor

@teeyog teeyog commented Jan 22, 2021


What is the purpose of the pull request

To read a Hudi table you need to specify a path, but that path is not simply the tablePath of the table; it has to reflect the partition directory structure, and different key generators produce different structures. A single-level partition layout needs path=.../table/*/*, a two-level layout needs path=.../table/*/*/*, and so on. Making the user spell out the data path is troublesome; with this change the user only needs to specify the tablePath: .../table
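
For illustration, a minimal before/after sketch (the location /data/table and a two-level partition layout are hypothetical):

    // Before: the user must know the partition depth and spell out the glob.
    val before = spark.read.format("org.apache.hudi").load("/data/table/*/*/*")

    // After this change: the table path alone is enough; the glob is inferred.
    val after = spark.read.format("org.apache.hudi").load("/data/table")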

At the same time, once the Hudi table can be read by configuring path=.../table, it becomes more convenient to query it with Spark SQL: you only need to add the table property spark.sql.sources.provider=hudi to the Hive table metadata, and the Hive table is automatically resolved as a Hudi table.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-io

codecov-io commented Jan 22, 2021

Codecov Report

Merging #2475 (62cff1e) into master (43a0776) will increase coverage by 0.03%.
The diff coverage is 93.33%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2475      +/-   ##
============================================
+ Coverage     51.14%   51.17%   +0.03%     
- Complexity     3215     3219       +4     
============================================
  Files           438      438              
  Lines         20041    20055      +14     
  Branches       2064     2067       +3     
============================================
+ Hits          10250    10264      +14     
  Misses         8946     8946              
  Partials        845      845              
Flag                 Coverage Δ                Complexity Δ
hudicli              36.87% <ø> (ø)            0.00 <ø> (ø)
hudiclient           100.00% <ø> (ø)           0.00 <ø> (ø)
hudicommon           51.36% <ø> (ø)            0.00 <ø> (ø)
hudiflink            45.44% <ø> (ø)            0.00 <ø> (ø)
hudihadoopmr         33.16% <ø> (ø)            0.00 <ø> (ø)
hudisparkdatasource  69.96% <93.33%> (+0.20%)  0.00 <0.00> (ø)
hudisync             48.61% <ø> (ø)            0.00 <ø> (ø)
huditimelineservice  66.49% <ø> (ø)            0.00 <ø> (ø)
hudiutilities        69.51% <ø> (+0.05%)       0.00 <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files                                          Coverage Δ                Complexity Δ
...src/main/scala/org/apache/hudi/DefaultSource.scala   85.41% <93.33%> (+1.27%)  20.00 <0.00> (+3.00)
...apache/hudi/utilities/deltastreamer/DeltaSync.java   70.35% <0.00%> (+0.35%)   51.00% <0.00%> (+1.00%)

@rubenssoto

This is a great and important feature that makes Hudi easier for non-power users.

@vinothchandar
Member

@zhedoubushishi @umehrot2 could you please take a first pass?

Member

@vinothchandar vinothchandar left a comment


Thanks for this! Seems very useful.

One thing I wanted to understand: can a user still do basePath/2020/*/* and have only the parquet files for 2020 read out, for example?

Member


This short-circuits the recursive listing once we get one partition path, I guess.

Member


So, I am wondering if we can use the HoodieTableMetadata abstraction to read a partition path, instead of listing directly. We are trying to avoid introducing any single-point listings. There is already a method to get all partition paths, FSUtils.getAllPartitionPaths(); let's just use that for now? I am thinking it would be a little bit of an overkill to list all partition paths without the metadata table.

Contributor


Yeah, I agree it would be better to use HoodieTableMetadata to avoid fs.listStatus. But what about tables without the metadata feature enabled? Will it take a very long time for a table with many partitions?

Also, hoodie_partition_metadata stores a parameter called partitionDepth; could we take advantage of that?
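
For reference, a minimal sketch of reading that parameter (helper name hypothetical; error handling omitted):

    import java.util.Properties
    import org.apache.hadoop.fs.{FileSystem, Path}

    // .hoodie_partition_metadata is a properties file written into each partition;
    // among other keys it records the partition's depth relative to the base path.
    def readPartitionDepth(fs: FileSystem, partitionPath: Path): Int = {
      val props = new Properties()
      val in = fs.open(new Path(partitionPath, ".hoodie_partition_metadata"))
      try props.load(in) finally in.close()
      props.getProperty("partitionDepth").toInt
    }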

Contributor Author


Thank you for your review. This method of obtaining a partition is very fast: as soon as one partition path is found, it returns directly. FSUtils.getAllPartitionPaths obtains all partition paths, which is very time-consuming.
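
A minimal sketch of such a short-circuiting traversal (the helper name is hypothetical; the .hoodie_partition_metadata marker file is used to recognize a partition):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Depth-first search that returns as soon as the first partition path
    // (a directory containing .hoodie_partition_metadata) is found.
    def findFirstPartitionPath(fs: FileSystem, dir: Path): Option[Path] = {
      if (fs.exists(new Path(dir, ".hoodie_partition_metadata"))) {
        Some(dir)
      } else {
        fs.listStatus(dir).iterator
          .filter(s => s.isDirectory && !s.getPath.getName.startsWith("."))
          .map(s => findFirstPartitionPath(fs, s.getPath))
          .collectFirst { case Some(p) => p }
      }
    }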

Member


@teeyog we could even add a new overload/method for this under the HoodieTableMetadata interface; it is really good to keep all of this under the interface. With the metadata table, it's actually okay to call getAllPartitionPaths(); it's pretty fast.

@teeyog
Contributor Author

teeyog commented Feb 4, 2021

Thanks for this! Seems very useful.

One thing I wanted to understand: can a user still do basePath/2020/*/* and have only the parquet files for 2020 read out, for example?

Of course it can, but it must be specified via the parameter basePath=../2020/*/*

@teeyog teeyog requested a review from vinothchandar February 4, 2021 03:06
@vinothchandar
Member

@teeyog could you call FSUtils.getAllPartitionPaths(), or add a new method getNPartitionPaths() that returns the first N partition paths using your traversal, in the class FileSystemBackedTableMetadata?

@vinothchandar
Member

You only need to add the table property spark.sql.sources.provider=hudi to the Hive table metadata, and the Hive table is automatically resolved as a Hudi table.

@teeyog can you please expand on this? Is this related to this PR, or a general comment?

@vinothchandar
Member

Of course it can, but it must be specified via the parameter basePath=../2020/*/*

I want to clarify this a bit. Do you mean val READ_PATHS_OPT_KEY = "hoodie.datasource.read.paths"?

If I do the following, I see that we reset the path in options to basePath + "/*/*/*/*". How does the Spark parquet source know to only look for 2020 and 2021, e.g.?

val snapshotDF1 = spark.read.format("org.apache.hudi")
      .load(basePath + "/202*/*/*/*")

@vinothchandar
Member

Other than these two, I am good with this.

@teeyog
Contributor Author

teeyog commented Feb 7, 2021

You only need to add the table property spark.sql.sources.provider=hudi to the Hive table metadata, and the Hive table is automatically resolved as a Hudi table.

@teeyog can you please expand on this? Is this related to this PR, or a general comment?

If the Hive table's TBLPROPERTIES contain spark.sql.sources.provider=hudi, Spark SQL resolves a read of the Hive table as follows:
First step
https://github.com/apache/spark/blob/62be2483d7d78e61fd2f77929cf41c76eff17869/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L302

Second step
https://github.com/apache/spark/blob/62be2483d7d78e61fd2f77929cf41c76eff17869/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L261

The resolveRelation call in the second step goes directly to Hudi's DefaultSource, so a read of the Hive table is automatically turned into a read of the Hudi table.
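
A minimal sketch of the effect, assuming an active SparkSession spark and a Hive-synced table named hudi_table (hypothetical):

    // Mark the Hive table so Spark SQL routes reads through Hudi's DefaultSource.
    spark.sql("ALTER TABLE hudi_table SET TBLPROPERTIES ('spark.sql.sources.provider' = 'hudi')")

    // A plain Spark SQL query now resolves the relation via Hudi rather than plain Hive/Parquet.
    spark.sql("SELECT * FROM hudi_table").show()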

@teeyog
Contributor Author

teeyog commented Feb 7, 2021

Of course it can, but it must be specified via the parameter basePath=../2020/*/*

I want to clarify this a bit. Do you mean val READ_PATHS_OPT_KEY = "hoodie.datasource.read.paths"?

If I do the following, I see that we reset the path in options to basePath + "/*/*/*/*". How does the Spark parquet source know to only look for 2020 and 2021, e.g.?

val snapshotDF1 = spark.read.format("org.apache.hudi")
      .load(basePath + "/202*/*/*/*")

I understand what you mean. The situation you describe is indeed not supported, because the automatically inferred data path overrides the path configured by the user. If you only need 2020 and 2021, you can filter the resulting DataFrame afterwards. Alternatively, I could check whether the path specified by the user contains *; if it does, the data path would not be automatically inferred. What do you think?
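
A sketch of the DataFrame-filter workaround, assuming a hypothetical partition column named year:

    import org.apache.spark.sql.functions.col

    // Read the whole table (glob inferred from tablePath), then prune by partition column.
    val df = spark.read.format("org.apache.hudi")
      .load(tablePath)
      .filter(col("year").isin("2020", "2021"))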

@teeyog
Contributor Author

teeyog commented Feb 7, 2021

@teeyog could you call FSUtils.getAllPartitionPaths(), or add a new method getNPartitionPaths() that returns the first N partition paths using your traversal, in the class FileSystemBackedTableMetadata?

It has been modified to obtain the partition paths via FSUtils.getAllPartitionPaths().

Member


@teeyog Sorry, it's still not clear to me. I supplied the globbed path 2015/*/*/*, and even that gets overridden to path -> tablePath/*/*/*/*.

Won't this incur reading all partitions under the tablePath, as opposed to only 2015's?


Contributor Author


The path specified by the user is overwritten by the automatically inferred data directory, so your use case cannot be met at the moment.

Contributor Author


I will try to see whether I can keep the automatic inference while also meeting your needs.

Contributor Author


@vinothchandar It now supports your use case: if the path specified by the user is the table path, the glob is automatically inferred; otherwise it is not.
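
A usage sketch of the resulting behavior, assuming a year/month/day-partitioned table at a hypothetical basePath:

    // Table path: the partition glob is inferred automatically; the whole table is read.
    val allDF = spark.read.format("org.apache.hudi").load(basePath)

    // Explicit glob: used as-is (no inference), so only 2020's partitions are listed.
    val df2020 = spark.read.format("org.apache.hudi").load(basePath + "/2020/*/*/*")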

…o specify the table directory [adapt no partition]
@teeyog teeyog requested a review from vinothchandar February 25, 2021 07:56
// Diff context: list all partition paths using the file-system-backed table metadata.
val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
val fsBackedTableMetadata =
  new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
Contributor


@teeyog hello, the partition paths are now inferred via getAllPartitionPaths from the metadata table. The partition scheme is set by hoodie.datasource.write.partitionpath.field when the Hudi table is written. Could we persist hoodie.datasource.write.partitionpath.field to the metadata table, so that a read just fetches that property instead of listing all the partition paths? cc @vinothchandar
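
A sketch of what the proposal could look like on the read side, assuming the write config key were persisted in the table's hoodie.properties (this is the suggestion, not current behavior):

    import java.util.Properties

    // props would be loaded from <tablePath>/.hoodie/hoodie.properties.
    // The partition depth falls out of the comma-separated partition field list.
    def partitionDepthFromConfig(props: Properties): Option[Int] =
      Option(props.getProperty("hoodie.datasource.write.partitionpath.field"))
        .map(_.split(",").length)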

Contributor Author


@lw309637554 Thank you for your review. The path of the Hudi table can now also be supplied through configuration instead of being inferred.

@vinothchandar vinothchandar added the priority:high Significant impact; potential bugs label Mar 14, 2021
@vinothchandar vinothchandar added the priority:blocker Production down; release blocker label Jul 5, 2021
@hudi-bot
Collaborator

hudi-bot commented Jul 5, 2021

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@vinothchandar
Member

Closing in favor of #3353
