[HUDI-5998] Speed up reads from bootstrapped tables in spark #8303

jonvex · 2023-03-28T02:36:07Z

Change Logs

Reads from bootstrapped tables in Spark are around twice as slow as reads from regular tables, even if the bootstrap is a full bootstrap, which is just a bulk insert. This means that the bootstrap relation reads regular files much more slowly than HadoopFsRelation does. To fix this, we query only the bootstrap base files and do not read and merge the skeleton files. As a result, you cannot read Hudi metadata columns when using the bootstrap fast path.

Introduces a new config, hoodie.bootstrap.data.queries.only, which is disabled by default. To read the Hudi metadata fields, it needs to be set to false.
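As a rough illustration of how the config described above might be used from a Spark job, here is a minimal sketch. It assumes Spark and the Hudi Spark bundle are on the classpath; the table path `basePath`, the object name, and the application name are hypothetical, and the behavior shown is only what the description above states, not a verified contract.

```scala
import org.apache.spark.sql.SparkSession

object BootstrapReadExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical location of an existing bootstrapped Hudi table.
    val basePath = "/tmp/hudi/bootstrapped_table"

    val spark = SparkSession.builder()
      .appName("bootstrap-read-example")
      .master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // Fast path: only the bootstrap base files are queried, so the
    // Hudi metadata columns are not available in this DataFrame.
    val dataOnly = spark.read.format("hudi")
      .option("hoodie.bootstrap.data.queries.only", "true")
      .load(basePath)
    dataOnly.show(10)

    // Regular bootstrap read: skeleton files are merged, so metadata
    // columns such as _hoodie_commit_time can be selected.
    val withMeta = spark.read.format("hudi")
      .option("hoodie.bootstrap.data.queries.only", "false")
      .load(basePath)
    withMeta.select("_hoodie_commit_time").show(10)

    spark.stop()
  }
}
```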

Impact

Spark query performance is only 5-10% slower than on regular Hudi tables, instead of 100% slower.

Risk level (write none, low medium or high below)

High

This heavily modifies a read path

Documentation Update

Need to put in release notes

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

jonvex · 2023-03-29T17:19:31Z

@hudi-bot run azure

codope · 2023-04-04T05:24:44Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala

+                                             parameters: Map[String, String]): BaseRelation = {
+    val enableFileIndex = HoodieSparkConfUtils.getConfigValue(parameters, sqlContext.sparkSession.sessionState.conf,
+      ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
+    if (!enableFileIndex || globPaths.nonEmpty || parameters.getOrElse(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key(), "true") != "true") {


I think we should do away with the config and rely on the condition here to decide whether or not to use the fast read path (which should be done by default). Wdyt?

If you want to read the metadata columns you need to disable it. I found a few tests that use the metadata columns and I would assume that some users must

I get it. But, does it need to be inferred through a separate config? Can we not infer from the already available parameters?

We need to know at the point of creating the relation, so I don't think this can be done

@jonvex : Wouldn't this change cause user queries that include Hudi metadata columns to fail? Can't we just use the user schema being passed here to check whether any Hudi metadata columns are being queried, and decide the next steps from that?

Hmm, @jonvex : if you look at HoodieBootstrapRelation.composeRDD (the relation is instantiated in the line below), we separate the skeleton schema and the base file schema. Can we move the optimization logic inside that? My main concern is that this would break existing functionality: bootstrap queries that include Hudi meta fields would fail unless the user turns off the feature.

Spark applies special optimizations to HadoopFsRelation so unless we contribute PRs to spark, this is the only way to do it as far as I can tell
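For readers following the suggestion above about inspecting the requested schema, this is a minimal sketch of what such a check could look like. It is not code from this PR: the object and method names are hypothetical, and the author's point stands that the decision has to be made when the relation is created, before the queried columns are known.

```scala
object MetaFieldCheck {
  // The five Hudi metadata column names.
  val HoodieMetaColumns: Set[String] = Set(
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name"
  )

  // True if any requested column is a Hudi metadata column.
  // `requestedColumns` is a hypothetical input; in Spark it would only be
  // known at scan time, not when the relation is created.
  def requestsMetaColumns(requestedColumns: Seq[String]): Boolean =
    requestedColumns.exists(HoodieMetaColumns.contains)
}
```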

Can you elaborate on which optimizations applied to HadoopFsRelation cause the 100% speedup? I can't find this information in the PR description.

https://issues.apache.org/jira/browse/HUDI-3896 I am not sure if this is the only optimization, but it is one of them. The query plans for non-bootstrapped and bootstrapped tables look pretty much identical, except that the non-bootstrapped plan says "FileScan parquet" when reading and the bootstrapped plan says "scan HoodieBootstrapRelation".

I started by comparing the time to run TPC-DS queries on bootstrapped tables vs non-bootstrapped tables. For a full bootstrap, the runtime ratio was 1.997, and for a metadata-only bootstrap it was 1.638.

I thought it was surprising that the full bootstrap was so slow, so in the first commit of this PR I tried to replicate what BaseFileOnlyRelation does: we create a HoodieFileScanRDD instead of a HoodieBootstrapRDD. The ratio of TPC-DS runtime compared to reading from a non-bootstrapped table was 1.48 for a full bootstrap table and 1.35 for a metadata-only bootstrap.

With the changes in this PR to leverage HadoopFsRelation, the ratio was 1.12 for a metadata-only bootstrap and 1.09 for a full bootstrap.
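To make the runtime ratios above easier to compare with the "percent slower" wording in the PR description, here is a small sketch that converts each ratio with (ratio - 1) * 100. The ratios are the ones quoted in the comments above; the object name and labels are made up for illustration.

```scala
object BootstrapSlowdown {
  // Runtime ratios quoted in the thread (bootstrapped / non-bootstrapped).
  val ratios: Seq[(String, Double)] = Seq(
    "baseline, full bootstrap" -> 1.997,
    "baseline, metadata-only bootstrap" -> 1.638,
    "HoodieFileScanRDD, full bootstrap" -> 1.48,
    "HoodieFileScanRDD, metadata-only bootstrap" -> 1.35,
    "HadoopFsRelation, metadata-only bootstrap" -> 1.12,
    "HadoopFsRelation, full bootstrap" -> 1.09
  )

  // Percent slower than a regular table: (ratio - 1) * 100.
  def percentSlower(ratio: Double): Double = (ratio - 1.0) * 100.0

  def main(args: Array[String]): Unit =
    ratios.foreach { case (label, ratio) =>
      println(f"$label%-45s ${percentSlower(ratio)}%.1f%% slower")
    }
}
```

By this arithmetic, the HadoopFsRelation path measured here is roughly 9% to 12% slower than a regular table, down from roughly 64% to 100% at baseline.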

@jonvex : Can we make HoodieBootstrapRelation/HoodieBaseRelation extend HadoopFsRelation to get the behavior ?

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala

hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala

...-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala

bvaradar · 2023-04-11T03:03:21Z

docker/demo/sparksql-batch2.commands


- // Copy-On-Write Bootstrapped table
+// Copy-On-Write Bootstrapped table
+spark.sql("set hoodie.bootstrap.data.queries.only=false")


Are there any integration tests for bootstrap that run with this feature on?

I updated it so that this test now uses the feature for the queries that don't use the meta fields.

bvaradar · 2023-04-11T05:14:25Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala

    } else {
      Map()
-    }) ++ DataSourceOptionsHelper.parametersWithReadDefaults(optParams)
+    }) ++ DataSourceOptionsHelper.parametersWithReadDefaults(sqlContext.getAllConfs.filter(k => k._1.startsWith("hoodie.")) ++ optParams)


Why is this needed ?

Currently we can't set read configs in Spark SQL using syntax like "set hoodie.bootstrap.data.queries.only=false"; it only works for write configs. This was something we wanted to add anyway: https://issues.apache.org/jira/browse/HUDI-5361
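The merge in the diff above can be illustrated with plain Scala maps. This is a simplified sketch, not the actual DefaultSource code: it leaves out DataSourceOptionsHelper.parametersWithReadDefaults, and the object and method names are hypothetical. It shows that only session confs starting with "hoodie." are picked up and that explicit read options win on conflict, because they are on the right-hand side of `++`.

```scala
object ReadConfMerge {
  // Keep only session-level confs whose key starts with "hoodie.",
  // then overlay the per-read options so they take precedence.
  def effectiveReadParams(
      sessionConfs: Map[String, String],
      optParams: Map[String, String]): Map[String, String] =
    sessionConfs.filter { case (key, _) => key.startsWith("hoodie.") } ++ optParams
}
```

For example, a session-level `set hoodie.bootstrap.data.queries.only=false` would be visible to the read unless the same key is also passed as a read option.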

bvaradar · 2023-04-11T05:26:58Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala

+                                             parameters: Map[String, String]): BaseRelation = {
+    val enableFileIndex = HoodieSparkConfUtils.getConfigValue(parameters, sqlContext.sparkSession.sessionState.conf,
+      ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
+    if (!enableFileIndex || globPaths.nonEmpty || parameters.getOrElse(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key(), "true") != "true") {


@jonvex : Wouldn't this change cause user queries that include Hudi metadata columns to fail? Can't we just use the user schema being passed here to check whether any Hudi metadata columns are being queried, and decide the next steps from that?

bvaradar · 2023-04-12T18:54:22Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala

+                                             parameters: Map[String, String]): BaseRelation = {
+    val enableFileIndex = HoodieSparkConfUtils.getConfigValue(parameters, sqlContext.sparkSession.sessionState.conf,
+      ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
+    if (!enableFileIndex || globPaths.nonEmpty || parameters.getOrElse(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key(), "true") != "true") {


Hmm, @jonvex : if you look at HoodieBootstrapRelation.composeRDD (the relation is instantiated in the line below), we separate the skeleton schema and the base file schema. Can we move the optimization logic inside that? My main concern is that this would break existing functionality: bootstrap queries that include Hudi meta fields would fail unless the user turns off the feature.

bvaradar

@jonvex : Few questions

bvaradar · 2023-04-28T08:22:08Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala

+      sqlContext.sparkSession.sessionState.conf, DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.key,
+      DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.defaultValue.toString).toBoolean
+    if (!enableFileIndex || isSchemaEvolutionEnabledOnRead
+      || globPaths.nonEmpty || !parameters.getOrElse(DATA_QUERIES_ONLY.key, DATA_QUERIES_ONLY.defaultValue).toBoolean) {


Can you explain why globPaths.nonEmpty is included here? I'm not following it.

Also, how are we ensuring that the behavior is unchanged for MOR?

To answer your first question: I got that condition from BaseFileOnlyRelation.toHadoopFsRelation.
For the second question, I need to go through today and update the existing bootstrap tests

Looking at the existing tests for bootstrap, there are probably a lot of cases that we are not currently testing.
It doesn't seem like we support MOR with bootstrap very well: https://issues.apache.org/jira/browse/HUDI-2071
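The condition discussed in this thread can be read as a pure function of four inputs. The sketch below restates it in standalone Scala under two assumptions: that the branch guarded by the condition falls back to the regular HoodieBootstrapRelation (as the review comments suggest), and that the config default is "false" (per the PR description). The object, method, and constant names are hypothetical.

```scala
object BootstrapReadPath {
  val DataQueriesOnlyKey = "hoodie.bootstrap.data.queries.only"
  // Assumed default, taken from the PR description ("disabled by default").
  val DataQueriesOnlyDefault = "false"

  // Returns true when the regular (slower) bootstrap relation would be used,
  // false when the fast path is eligible.
  def usesRegularBootstrapRelation(
      enableFileIndex: Boolean,
      schemaEvolutionOnRead: Boolean,
      globPaths: Seq[String],
      parameters: Map[String, String]): Boolean = {
    val dataQueriesOnly =
      parameters.getOrElse(DataQueriesOnlyKey, DataQueriesOnlyDefault).toBoolean
    !enableFileIndex || schemaEvolutionOnRead || globPaths.nonEmpty || !dataQueriesOnly
  }
}
```

Under these assumptions the fast path is taken only when the file index is enabled, schema evolution on read is off, no glob paths are given, and the config is explicitly set to true.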

bvaradar · 2023-05-19T05:08:30Z

@jonvex : Is this ready for review ?

jonvex · 2023-05-19T15:04:09Z

@bvaradar Yes, it is ready for review. I wrote a lot of tests to ensure that this matches the functionality of the regular bootstrap read. However, I discovered that there were some issues with bootstrap, such as #8666 and https://issues.apache.org/jira/browse/HUDI-6201 (which is still unsolved).

...rk-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBootstrapRelation.scala

bvaradar

One question about config. Otherwise looks good to me.

hudi-bot · 2023-05-25T04:22:08Z

CI report:

f361b40 UNKNOWN
551c52d Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

codope

Config change looks good.

codope · 2023-05-26T05:27:58Z

@bvaradar The changes look good to me. Can you take another pass?

bvaradar

Made one final pass. LGTM.

Jonathan Vexler added 4 commits March 27, 2023 22:35

use hadoopfsrelation for bootstrap

27b8276

got rid of unused var

221af92

Get the partitionpath column, and clean up the code

4bce566

fix style

7f9a12f

jonvex changed the title ~~use hadoopfsrelation for bootstrap~~ [HUDI-5998] Speed up reads from bootstrapped tables in spark Mar 29, 2023

jonvex marked this pull request as ready for review March 29, 2023 13:22

fix some tests that bootstrap and query hudi metadata cols

4a20074

Jonathan Vexler added 2 commits March 29, 2023 17:40

disable bootstrap fast read in integraton tests

498f23e

get spark sqlconf to work

78befd9

jonvex force-pushed the bootstrap_perf_hadoopfs branch from 859af38 to 78befd9 Compare March 30, 2023 15:18

codope reviewed Apr 4, 2023

addressed some pr comments

5bca709

jonvex requested a review from codope April 4, 2023 14:55

codope approved these changes Apr 11, 2023

codope added the bootstrap label Apr 11, 2023

bvaradar reviewed Apr 11, 2023

updated integration tests to use fast bootrap read path when possible

6a7ae70

bvaradar requested changes Apr 12, 2023

refresh table after updating read configs

9cda89b

jonvex force-pushed the bootstrap_perf_hadoopfs branch from 3da3f92 to 05334f4 Compare April 26, 2023 03:26

make it toggle based on a read optimized query instead of a config

732fbf0

jonvex force-pushed the bootstrap_perf_hadoopfs branch from 05334f4 to 732fbf0 Compare April 26, 2023 03:28

jonvex and others added 3 commits April 25, 2023 23:48

Merge branch 'apache:master' into bootstrap_perf_hadoopfs

c6908a1

switched back to config

3cfef7f

change some tiny things

76394b7

bvaradar reviewed Apr 28, 2023

Jonathan Vexler added 3 commits April 28, 2023 11:50

fixed issue and added to a test

e779563

have tests passing. Need to manually verify that I didn't remove partition appending functionality

3ad5ae5

added to the tests

e4144fb

Jonathan Vexler and others added 3 commits May 17, 2023 13:26

revert uri patch changes

2e63f7a

Merge branch 'apache:master' into bootstrap_perf_hadoopfs

0ed2644

add testing for fast bootstrap

b8772a7

jonvex force-pushed the bootstrap_perf_hadoopfs branch from aeefd9b to b8772a7 Compare May 17, 2023 18:19

apache deleted a comment from hudi-bot May 19, 2023

bvaradar reviewed May 23, 2023

bvaradar reviewed May 23, 2023

Jonathan Vexler and others added 2 commits May 23, 2023 12:14

remove use of additional config

f361b40

Merge branch 'master' into bootstrap_perf_hadoopfs

27375ab

jonvex requested a review from bvaradar May 23, 2023 16:22

fix checkstyle

551c52d

codope approved these changes May 26, 2023

bvaradar approved these changes May 26, 2023

codope merged commit 5978611 into apache:master May 26, 2023

jonvex mentioned this pull request Aug 1, 2023

[HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading #9276

Merged

4 tasks
