[HUDI-5998] Speed up reads from bootstrapped tables in spark #8303
@hudi-bot run azure
859af38 to
78befd9
Compare
| parameters: Map[String, String]): BaseRelation = { | ||
| val enableFileIndex = HoodieSparkConfUtils.getConfigValue(parameters, sqlContext.sparkSession.sessionState.conf, | ||
| ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean | ||
| if (!enableFileIndex || globPaths.nonEmpty || parameters.getOrElse(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key(), "true") != "true") { |
I think we should do away with the config and rely on the condition here to decide whether or not to use the fast read path (which should be done by default). Wdyt?
If you want to read the metadata columns you need to disable it. I found a few tests that use the metadata columns and I would assume that some users must
I get it. But, does it need to be inferred through a separate config? Can we not infer from the already available parameters?
We need to know at the point of creating the relation, so I don't think this can be done
@jonvex : Wouldn't this change cause user queries that include Hoodie metadata columns to fail? Can't we just use the user schema being passed here to check whether any Hoodie metadata columns are being queried, and decide the appropriate next steps from that?
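The schema-based check suggested above could look roughly like the sketch below. It is only an illustration: `MetaFieldCheck`, `referencesMetaFields` and `canUseFastPath` are hypothetical names that do not come from the PR, and the sketch assumes the requested column names are already known, which the earlier reply says is not the case at the point where the relation is created.

```scala
object MetaFieldCheck {
  // The five standard Hudi meta columns. Treating this set as the complete
  // list is an assumption of this sketch.
  val HoodieMetaFields: Set[String] = Set(
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name"
  )

  // Hypothetical helper: true if any requested column is a Hudi meta column.
  def referencesMetaFields(requestedColumns: Seq[String]): Boolean =
    requestedColumns.exists(HoodieMetaFields.contains)

  // Hypothetical helper: an empty column list is treated as "unknown",
  // so the fast path is only chosen when columns are known and none is a meta column.
  def canUseFastPath(requestedColumns: Seq[String]): Boolean =
    requestedColumns.nonEmpty && !referencesMetaFields(requestedColumns)
}
```

With such a check, a query selecting only data columns would take the fast path, while one selecting `_hoodie_commit_time` would keep using the regular bootstrap relation.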
hmmm, @jonvex : if you look at HoodieBootstrapRelation.composeRDD (the relation is instantiated on the line below), we separate the skeleton schema from the base file schema. Can we move the optimization logic inside that? My main concern is that this would break existing functionality: bootstrap queries that include Hudi meta fields would fail unless the user turns off the feature.
Spark applies special optimizations to HadoopFsRelation, so unless we contribute PRs to Spark, this is the only way to do it as far as I can tell.
Can you elaborate on which optimizations applied to HadoopFsRelation produce the 100% speedup? I can't find this information in the PR description.
https://issues.apache.org/jira/browse/HUDI-3896 I am not sure if this is the only optimization, but it is one of them. The query plans for non-bootstrapped and bootstrapped tables look pretty much identical, except that the non-bootstrapped plan says "FileScan parquet" when reading and the bootstrapped plan says "scan HoodieBootstrapRelation".
I started by comparing the time to run TPC-DS queries on bootstrapped tables vs. non-bootstrapped tables. For a full bootstrap, the runtime ratio was 1.997, and for a metadata-only bootstrap it was 1.638.
I was surprised that the full bootstrap was so slow, so in the first commit in this PR I tried to replicate what BaseFileOnlyRelation does: we create a HoodieFileScanRDD instead of a HoodieBootstrapRDD. The ratio of TPC-DS runtime compared to reading from a non-bootstrapped table was then 1.48 for a full bootstrap table and 1.35 for a metadata-only bootstrap.
With the changes in this PR to leverage HadoopFsRelation, the ratio was 1.12 for a metadata-only bootstrap and 1.09 for a full bootstrap.
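To make the ratios above easier to compare, they can be converted to a percentage slowdown relative to a non-bootstrapped table using (ratio - 1) * 100. The following is a small sketch of that arithmetic; the object and label names are made up for illustration, and only the six ratio values come from the comment above.

```scala
object BootstrapOverhead {
  // Runtime ratios quoted above: bootstrapped runtime / non-bootstrapped runtime.
  val ratios: Seq[(String, Double)] = Seq(
    "HoodieBootstrapRDD, full bootstrap"          -> 1.997,
    "HoodieBootstrapRDD, metadata-only bootstrap" -> 1.638,
    "HoodieFileScanRDD, full bootstrap"           -> 1.48,
    "HoodieFileScanRDD, metadata-only bootstrap"  -> 1.35,
    "HadoopFsRelation, full bootstrap"            -> 1.09,
    "HadoopFsRelation, metadata-only bootstrap"   -> 1.12
  )

  // A ratio of 1.0 means "same speed"; anything above it is extra runtime.
  def overheadPercent(ratio: Double): Double = (ratio - 1.0) * 100.0

  def main(args: Array[String]): Unit =
    ratios.foreach { case (label, ratio) =>
      println(f"$label%-48s ${overheadPercent(ratio)}%.1f%% slower")
    }
}
```

Read this way, the original read path is roughly 64% to 100% slower than a regular table, while the HadoopFsRelation path is roughly 9% to 12% slower.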
@jonvex : Can we make HoodieBootstrapRelation/HoodieBaseRelation extend HadoopFsRelation to get this behavior?
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
...-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala
docker/demo/sparksql-batch2.commands
| // Copy-On-Write Bootstrapped table | ||
| // Copy-On-Write Bootstrapped table | ||
| spark.sql("set hoodie.bootstrap.data.queries.only=false") |
Are there any integration tests for bootstrap where we test with this feature on?
I updated it so now it will use the feature in this test on the queries that don't use the meta fields
| } else { | ||
| Map() | ||
| }) ++ DataSourceOptionsHelper.parametersWithReadDefaults(optParams) | ||
| }) ++ DataSourceOptionsHelper.parametersWithReadDefaults(sqlContext.getAllConfs.filter(k => k._1.startsWith("hoodie.")) ++ optParams) |
Why is this needed ?
Currently we can't set read configs in Spark SQL with syntax like "set hoodie.bootstrap.data.queries.only=false"; that only works for write configs. This was something we wanted to add anyway: https://issues.apache.org/jira/browse/HUDI-5361
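The merge in the changed line can be illustrated with plain Scala maps. This is a sketch only: `ReadConfigMerge` and `effectiveReadParams` are hypothetical names, and the real code passes the merged map on to DataSourceOptionsHelper.parametersWithReadDefaults, which is not modelled here.

```scala
object ReadConfigMerge {
  // Keep only session-level settings whose key starts with "hoodie.", then
  // append the per-read options. With ++ the right-hand map wins on duplicate
  // keys, so an explicit read option overrides a session-level "set".
  def effectiveReadParams(sessionConfs: Map[String, String],
                          optParams: Map[String, String]): Map[String, String] =
    sessionConfs.filter { case (key, _) => key.startsWith("hoodie.") } ++ optParams
}
```

The ordering matters: because optParams is the right-hand operand, a value passed directly to the reader takes precedence over one set at the session level, and non-Hudi session settings never reach the read parameters.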
3da3f92 to
05334f4
Compare
05334f4 to
732fbf0
Compare
bvaradar
left a comment
@jonvex : Few questions
| sqlContext.sparkSession.sessionState.conf, DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.key, | ||
| DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.defaultValue.toString).toBoolean | ||
| if (!enableFileIndex || isSchemaEvolutionEnabledOnRead | ||
| || globPaths.nonEmpty || !parameters.getOrElse(DATA_QUERIES_ONLY.key, DATA_QUERIES_ONLY.defaultValue).toBoolean) { |
Can you explain why globPaths.nonEmpty is included here? I'm not following it.
Also, how are we ensuring that the behavior is unchanged for MOR?
To answer your first question: I got that condition from BaseFileOnlyRelation.toHadoopFsRelation.
For the second question, I need to go through today and update the existing bootstrap tests
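The condition under discussion can be restated as a small pure function, which makes each fallback trigger explicit. This is a sketch under assumptions: `FastPathDecision`, `ReadContext` and `fallsBackToBootstrapRelation` are hypothetical names, and it assumes that the true branch of the `if` in the diff selects the existing bootstrap relation while the false branch takes the HadoopFsRelation fast path.

```scala
object FastPathDecision {
  // Hypothetical container for the inputs the diff reads from configs and arguments.
  final case class ReadContext(
    enableFileIndex: Boolean,       // ENABLE_HOODIE_FILE_INDEX
    schemaEvolutionOnRead: Boolean, // SCHEMA_EVOLUTION_ENABLED
    globPaths: Seq[String],         // glob paths passed to the reader, if any
    dataQueriesOnly: Boolean        // hoodie.bootstrap.data.queries.only
  )

  // Mirrors the boolean expression in the diff: any single trigger is enough
  // to skip the fast path.
  def fallsBackToBootstrapRelation(ctx: ReadContext): Boolean =
    !ctx.enableFileIndex ||
      ctx.schemaEvolutionOnRead ||
      ctx.globPaths.nonEmpty ||
      !ctx.dataQueriesOnly
}
```

Under this reading, the fast path is used only when the file index is enabled, schema evolution on read is off, no glob paths are given, and the data-queries-only config is on.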
Looking at the existing tests for bootstrap, there are probably a lot of cases that we are not currently testing.
It doesn't seem like we support MOR with bootstrap very well: https://issues.apache.org/jira/browse/HUDI-2071.
…ition appending functionality
aeefd9b to
b8772a7
Compare
@jonvex : Is this ready for review?
@bvaradar Yes, it is ready for review. I wrote a lot of tests to ensure that this matches the functionality of the regular bootstrap read. However, I discovered that there were some issues with bootstrap, such as #8666 and https://issues.apache.org/jira/browse/HUDI-6201 (which is still unsolved).
...rk-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBootstrapRelation.scala
bvaradar
left a comment
One question about config. Otherwise looks good to me.
codope
left a comment
Config change looks good.
@bvaradar The changes look good to me. Can you take another pass?
bvaradar
left a comment
Made one final pass. LGTM.
Change Logs
Reads from bootstrapped tables in Spark are around twice as slow as reads from regular tables, even when the bootstrap is a full bootstrap, which is just a bulk insert. This means that the bootstrap relation reads regular files much more slowly than HadoopFsRelation does. To fix this, we query only the bootstrap base files and do not read and merge the skeleton files. As a result, you cannot read Hudi metadata columns when using the bootstrap fast path.
Introduces a new config, hoodie.bootstrap.data.queries.only, that is disabled by default. To read the Hudi metadata fields, it needs to be set to false.
Impact
Spark query performance on bootstrapped tables is only 5-10% slower than on regular Hudi tables, instead of 100% slower.
Risk level (write none, low medium or high below)
High
This heavily modifies a read path
Documentation Update
Need to put in release notes
Contributor's checklist