[SPARK-34346][CORE][SQL][3.1] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression by yaooqinn · Pull Request #31482 · apache/spark

yaooqinn · 2021-02-05T05:57:57Z

backport #31460 to branch 3.1

What changes were proposed in this pull request?

In many real-world deployments, users who access the Hive catalog through Spark SQL simply copy the hive-site.xml used by their Hive jobs into SPARK_HOME/conf without modification. When Spark generates its Hadoop configuration, it uses spark.buffer.size (65536) to override io.file.buffer.size (4096). However, when hive-site.xml is loaded, that override may be ignored and io.file.buffer.size is reset again to whatever hive-site.xml specifies.

The configuration priority applied when setting Hadoop and Hive config here is wrong; the order should be spark > spark.hive > spark.hadoop > hive > hadoop.
This breaks the spark.buffer.size config's role in tuning IO performance with HDFS whenever hive-site.xml contains an io.file.buffer.size entry.
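
The priority order above can be illustrated with a minimal sketch. This is not Spark's actual implementation: the class `ConfigPrecedence`, its methods, and the sample `hive-site.xml` value are hypothetical, and plain maps stand in for Hadoop `Configuration` objects. Layers are applied from lowest to highest priority, so a later layer overwrites an earlier one.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of layered config resolution; not Spark code. */
final class ConfigPrecedence {
    static final String KEY = "io.file.buffer.size";

    /** Applies layers in ascending priority, so later layers win. */
    static Map<String, String> merge(List<Map<String, String>> layersLowToHigh) {
        Map<String, String> merged = new HashMap<>();
        for (Map<String, String> layer : layersLowToHigh) {
            merged.putAll(layer);
        }
        return merged;
    }

    /** Returns the effective buffer size for the two load orders discussed in the PR. */
    static String effectiveBufferSize(boolean hiveSiteAppliedLast) {
        Map<String, String> hadoopDefaults = Map.of(KEY, "4096");
        // Assumed entry, as if copied unmodified from a Hive deployment.
        Map<String, String> hiveSite = Map.of(KEY, "4096");
        Map<String, String> sparkHadoop = Map.of(); // no spark.hadoop.* overrides set
        Map<String, String> sparkHive = Map.of();   // no spark.hive.* overrides set
        // Value derived from spark.buffer.size.
        Map<String, String> spark = Map.of(KEY, "65536");

        List<Map<String, String>> order = hiveSiteAppliedLast
            // Problematic order: hive-site.xml overwrites the Spark-derived value.
            ? List.of(hadoopDefaults, sparkHadoop, sparkHive, spark, hiveSite)
            // Intended order: spark > spark.hive > spark.hadoop > hive > hadoop.
            : List.of(hadoopDefaults, hiveSite, sparkHadoop, sparkHive, spark);
        return merge(order).get(KEY);
    }

    public static void main(String[] args) {
        System.out.println("hive-site.xml applied last: " + effectiveBufferSize(true));
        System.out.println("intended priority order:    " + effectiveBufferSize(false));
    }
}
```

With hive-site.xml applied last the sketch resolves to 4096; with the intended order it resolves to 65536, which is the value spark.buffer.size is meant to enforce.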

Why are the changes needed?

This is a bug fix for the configuration behavior, and it fixes the performance regression caused by that behavior change.

Does this PR introduce any user-facing change?

This PR restores the behavior that a silent user-facing change had altered.

How was this patch tested?

New tests.

… will override by loading hive-site.xml accidentally may cause perf regression

Closes #31460 from yaooqinn/SPARK-34346.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

yaooqinn · 2021-02-05T06:02:19Z

cc @cloud-fan @maropu @HyukjinKwon here is the backport PR for 3.1, thanks

SparkQA · 2021-02-05T06:20:35Z

Test build #134913 has finished for PR 31482 at commit 905405f.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2021-02-05T06:27:45Z

Thank you for making a backport, @yaooqinn .

SparkQA · 2021-02-05T06:48:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39496/

SparkQA · 2021-02-05T06:53:09Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39496/

SparkQA · 2021-02-05T07:52:44Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39500/

SparkQA · 2021-02-05T07:57:06Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39500/

SparkQA · 2021-02-05T09:17:37Z

Test build #134917 has finished for PR 31482 at commit c9d2248.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

yaooqinn · 2021-02-05T09:28:46Z

retest this please

SparkQA · 2021-02-05T10:59:29Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39515/

SparkQA · 2021-02-05T11:28:36Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39515/

SparkQA · 2021-02-05T13:05:12Z

Test build #134932 has finished for PR 31482 at commit c9d2248.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

….size will override by loading hive-site.xml accidentally may cause perf regression

Closes #31482 from yaooqinn/SPARK-34346-31.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan · 2021-02-05T14:16:16Z

thanks, merging to 3.1!

cloud-fan · 2021-02-05T14:16:44Z

@yaooqinn can you open a backport PR for 3.0? It conflicts. thanks!

yaooqinn · 2021-02-05T14:29:37Z

OK. I got it.

c9d2248

github-actions bot added CORE SQL labels Feb 5, 2021

cloud-fan approved these changes Feb 5, 2021

HyukjinKwon approved these changes Feb 5, 2021

cloud-fan closed this Feb 5, 2021
