Skip to content

Conversation

@jonvex
Copy link
Contributor

@jonvex jonvex commented Mar 28, 2023

Change Logs

Reads from bootstrapped tables in spark are around twice as slow as from regular tables. Even if the bootstrap is a full bootstrap, which is just a bulk insert. This means that bootstrap relation is reading regular files much slower than HadoopFsRelation. To fix this, we only query the bootstrap base files and don't read and merge the skeleton files. This means that you cannot read hudi metadata columns when using the bootstrap fast path.

Introduces new config hoodie.bootstrap.data.queries.only that is disabled by default. To read the Hudi metadata fields, it needs to be set to false.

Impact

Spark query performance only 5-10% slower than regular hudi tables instead of 100% slower.

Risk level (write none, low medium or high below)

High

This heavily modifies a read path

Documentation Update

Need to put in release notes

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@jonvex jonvex changed the title use hadoopfsrelation for bootstrap [HUDI-5998] Speed up reads from bootstrapped tables in spark Mar 29, 2023
@jonvex jonvex marked this pull request as ready for review March 29, 2023 13:22
@jonvex
Copy link
Contributor Author

jonvex commented Mar 29, 2023

@hudi-bot run azure

@jonvex jonvex force-pushed the bootstrap_perf_hadoopfs branch from 859af38 to 78befd9 Compare March 30, 2023 15:18
parameters: Map[String, String]): BaseRelation = {
val enableFileIndex = HoodieSparkConfUtils.getConfigValue(parameters, sqlContext.sparkSession.sessionState.conf,
ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
if (!enableFileIndex || globPaths.nonEmpty || parameters.getOrElse(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key(), "true") != "true") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do away with the config and rely on the condition here to decide whether or not to use the fast read path (which should be done by default). Wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to read the metadata columns you need to disable it. I found a few tests that use the metadata columns and I would assume that some users must

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it. But, does it need to be inferred through a separate config? Can we not infer from the already available parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to know at the point of creating the relation, so I don't think this can be done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonvex : Wouldn't this change cause user queries which includes hoodie metadata columns to fail ? Can't we just userschema being passed here to determine if there are any hoodie metadata columns being queried to determine appropriate next steps ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, @jonvex : if you look at HoodieBootstrapRelation.composeRDD (the relation is being instantiated in below line), we segregate the skeleton schema and base file schema. Can we move the optimization logic inside that ? My main concern is this would break the existing functionality of bootstrap queries including hudi metafields failing unless user turn off the feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark applies special optimizations to HadoopFsRelation so unless we contribute PRs to spark, this is the only way to do it as far as I can tell

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate what optimization are being done to HadoopFsRelation that causes 100% speed up ? I don't seem to find this information from the PR description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://issues.apache.org/jira/browse/HUDI-3896 I am not sure if this is the only optimization, but it is one of them. The query plans for non bootstrapped and bootstrap tables look pretty much identical except non bootstrap says "FileScan parquet" when reading and bootstrap reading says "scan HoodieBootstrapRelation"

I started by comparing time to run tpcds queries on boostrapped tables vs non bootstrapped. For a full bootstrap, the runtime ratio was 1.997 and for a metadata only bootstrap it was 1.638.

I thought that was surprising that the full bootstrap was so slow, so I tried to replicate what was being done in BaseFileOnlyRelation in the first commit in this pr. We create a HoodieFileScanRDD instead of a HoodieBootstrapRDD. The ratio of tpcds runtime compared to reading from a non bootstrap table was 1.48 for a full bootstrap table, and 1.35 for a metadata only bootstrap.

With the changes in this pr to leverage HadoopFsRelation the ratio was 1.12 for metadata only bootstrap, and 1.09 for full bootstrap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonvex : Can we make HoodieBootstrapRelation/HoodieBaseRelation extend HadoopFsRelation to get the behavior ?

@jonvex jonvex requested a review from codope April 4, 2023 14:55

// Copy-On-Write Bootstrapped table
// Copy-On-Write Bootstrapped table
spark.sql("set hoodie.bootstrap.data.queries.only=false")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any integration test for bootstrap where we test with this feature on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated it so now it will use the feature in this test on the queries that don't use the meta fields

} else {
Map()
}) ++ DataSourceOptionsHelper.parametersWithReadDefaults(optParams)
}) ++ DataSourceOptionsHelper.parametersWithReadDefaults(sqlContext.getAllConfs.filter(k => k._1.startsWith("hoodie.")) ++ optParams)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we can't set read configs in spark sql using the syntax like "set hoodie.bootstrap.data.queries.only=false". It only works for write configs. This was something we wanted to add anyways: https://issues.apache.org/jira/browse/HUDI-5361

parameters: Map[String, String]): BaseRelation = {
val enableFileIndex = HoodieSparkConfUtils.getConfigValue(parameters, sqlContext.sparkSession.sessionState.conf,
ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
if (!enableFileIndex || globPaths.nonEmpty || parameters.getOrElse(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key(), "true") != "true") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonvex : Wouldn't this change cause user queries which includes hoodie metadata columns to fail ? Can't we just userschema being passed here to determine if there are any hoodie metadata columns being queried to determine appropriate next steps ?

parameters: Map[String, String]): BaseRelation = {
val enableFileIndex = HoodieSparkConfUtils.getConfigValue(parameters, sqlContext.sparkSession.sessionState.conf,
ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
if (!enableFileIndex || globPaths.nonEmpty || parameters.getOrElse(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key(), "true") != "true") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, @jonvex : if you look at HoodieBootstrapRelation.composeRDD (the relation is being instantiated in below line), we segregate the skeleton schema and base file schema. Can we move the optimization logic inside that ? My main concern is this would break the existing functionality of bootstrap queries including hudi metafields failing unless user turn off the feature.

@jonvex jonvex force-pushed the bootstrap_perf_hadoopfs branch from 3da3f92 to 05334f4 Compare April 26, 2023 03:26
@jonvex jonvex force-pushed the bootstrap_perf_hadoopfs branch from 05334f4 to 732fbf0 Compare April 26, 2023 03:28
Copy link
Contributor

@bvaradar bvaradar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonvex : Few questions

sqlContext.sparkSession.sessionState.conf, DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.key,
DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.defaultValue.toString).toBoolean
if (!enableFileIndex || isSchemaEvolutionEnabledOnRead
|| globPaths.nonEmpty || !parameters.getOrElse(DATA_QUERIES_ONLY.key, DATA_QUERIES_ONLY.defaultValue).toBoolean) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why globPaths.nonEmpty is included here. Not following it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, How are we ensuring that for MOR, the behavior is unchanged ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To answer your first question: I got that condition from BaseFileOnlyRelation.toHadoopFsRelation.
For the second question, I need to go through today and update the existing bootstrap tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the existing testing for bootstrap, there are probably a lot of cases that we are not testing currently.
It doesn't seem like we support MOR with bootstrap very well https://issues.apache.org/jira/browse/HUDI-2071 .

@jonvex jonvex force-pushed the bootstrap_perf_hadoopfs branch from aeefd9b to b8772a7 Compare May 17, 2023 18:19
@apache apache deleted a comment from hudi-bot May 19, 2023
@bvaradar
Copy link
Contributor

@jonvex : Is this ready for review ?

@jonvex
Copy link
Contributor Author

jonvex commented May 19, 2023

@bvaradar Yes, it is ready for review. I wrote a a lot of tests to ensure that this matched the functionality of the regular bootstrap read. However, I discovered that there were some issues with bootstrap such as #8666 and https://issues.apache.org/jira/browse/HUDI-6201 (which is still unsolved).

Copy link
Contributor

@bvaradar bvaradar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question about config. Otherwise looks good to me.

@jonvex jonvex requested a review from bvaradar May 23, 2023 16:22
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config change looks good.

@codope
Copy link
Member

codope commented May 26, 2023

@bvaradar The changes looks good to me. Can you take another pass?

Copy link
Contributor

@bvaradar bvaradar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made one final pass. LGTM.

Copy link
Contributor

@bvaradar bvaradar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made one final pass. LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

4 participants