Skip to content

[SPARK-34346][CORE][SQL][3.1] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression#31482

Closed
yaooqinn wants to merge 2 commits intoapache:branch-3.1from
yaooqinn:SPARK-34346-31
Closed

[SPARK-34346][CORE][SQL][3.1] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression#31482
yaooqinn wants to merge 2 commits intoapache:branch-3.1from
yaooqinn:SPARK-34346-31

Conversation

@yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Feb 5, 2021

backport #31460 to branch 3.1

What changes were proposed in this pull request?

In many real-world cases, when interacting with hive catalog through Spark SQL, users may just share the hive-site.xml for their hive jobs and make a copy to SPARK_HOME/conf w/o modification. In Spark, when we generate Hadoop configurations, we will use spark.buffer.size(65536) to reset io.file.buffer.size(4096). But when we load the hive-site.xml, we may ignore this behavior and reset io.file.buffer.size again according to hive-site.xml.

  1. The configuration priority for setting Hadoop and Hive config here is not right, while literally, the order should be spark > spark.hive > spark.hadoop > hive > hadoop

  2. This breaks spark.buffer.size congfig's behavior for tuning the IO performance w/ HDFS if there is an existing io.file.buffer.size in hive-site.xml

Why are the changes needed?

bugfix for configuration behavior and fix performance regression by that behavior change

Does this PR introduce any user-facing change?

this pr restores silent user face change

How was this patch tested?

new tests

… will override by loading hive-site.xml accidentally may cause perf regression

In many real-world cases, when interacting with hive catalog through Spark SQL, users may just share the `hive-site.xml` for their hive jobs and make a copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop configurations, we will use `spark.buffer.size(65536)` to reset `io.file.buffer.size(4096)`. But when we load the hive-site.xml, we may ignore this behavior and reset `io.file.buffer.size` again according to `hive-site.xml`.

1. The configuration priority for setting Hadoop and Hive config here is not right, while literally, the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`

2. This breaks `spark.buffer.size` congfig's behavior for tuning the IO performance w/ HDFS if there is an existing `io.file.buffer.size` in hive-site.xml

bugfix for configuration behavior and fix performance regression by that behavior change

this pr restores silent user face change

new tests

Closes #31460 from yaooqinn/SPARK-34346.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@yaooqinn
Copy link
Member Author

yaooqinn commented Feb 5, 2021

cc @cloud-fan @maropu @HyukjinKwon here is the backport PR for 3.1, thanks

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Test build #134913 has finished for PR 31482 at commit 905405f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-34346][CORE][SQL] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression [SPARK-34346][CORE][SQL][3.1] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression Feb 5, 2021
@dongjoon-hyun
Copy link
Member

Thank you for making a backport, @yaooqinn .

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39496/

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39496/

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39500/

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39500/

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Test build #134917 has finished for PR 31482 at commit c9d2248.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Copy link
Member Author

yaooqinn commented Feb 5, 2021

retest this please

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39515/

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39515/

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Test build #134932 has finished for PR 31482 at commit c9d2248.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan pushed a commit that referenced this pull request Feb 5, 2021
….size will override by loading hive-site.xml accidentally may cause perf regression

backport  #31460  to branch 3.1

### What changes were proposed in this pull request?

In many real-world cases, when interacting with hive catalog through Spark SQL, users may just share the `hive-site.xml` for their hive jobs and make a copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop configurations, we will use `spark.buffer.size(65536)` to reset `io.file.buffer.size(4096)`. But when we load the hive-site.xml, we may ignore this behavior and reset `io.file.buffer.size` again according to `hive-site.xml`.

1. The configuration priority for setting Hadoop and Hive config here is not right, while literally, the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`

2. This breaks `spark.buffer.size` congfig's behavior for tuning the IO performance w/ HDFS if there is an existing `io.file.buffer.size` in hive-site.xml

### Why are the changes needed?

bugfix for configuration behavior and fix performance regression by that behavior change

### Does this PR introduce _any_ user-facing change?

this pr restores silent user face change

### How was this patch tested?

new tests

Closes #31482 from yaooqinn/SPARK-34346-31.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor

thanks, merging to 3.1!

@cloud-fan
Copy link
Contributor

@yaooqinn can you open a backport PR for 3.0? It conflicts. thanks!

@cloud-fan cloud-fan closed this Feb 5, 2021
@yaooqinn
Copy link
Member Author

yaooqinn commented Feb 5, 2021

OK. I got it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants