[SPARK-34346][CORE][SQL] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression#31460
yaooqinn wants to merge 6 commits into apache:master from yaooqinn:SPARK-34346
Conversation
… will override by loading hive-site.xml may cause perf regression
cc @cloud-fan @maropu @dongjoon-hyun @HyukjinKwon thanks

spark > spark.hive > spark.hadoop > hive > hadoop makes sense to me.
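The intended layering can be sketched as successive overlays, lowest priority first, so that a later source overwrites an earlier one. This is a simplified illustration, not Spark's actual code; the concrete values come from this PR's discussion (4096 Hadoop default, 65536 from `spark.buffer.size`):

```scala
// Simplified sketch of the intended priority order (not Spark's actual
// implementation): later sources overwrite earlier ones, so Spark-specific
// settings win over spark.hive.*, spark.hadoop.*, hive-site.xml, and the
// Hadoop defaults, in that order.
object ConfPriority {
  def merge(sources: Seq[Map[String, String]]): Map[String, String] =
    sources.foldLeft(Map.empty[String, String])(_ ++ _)

  def main(args: Array[String]): Unit = {
    val hadoopDefault = Map("io.file.buffer.size" -> "4096")
    val hiveSite      = Map("io.file.buffer.size" -> "20181117") // hypothetical hive-site.xml entry
    val sparkHadoop   = Map.empty[String, String]                // spark.hadoop.* entries
    val sparkHive     = Map.empty[String, String]                // spark.hive.* entries
    val sparkSpecific = Map("io.file.buffer.size" -> "65536")    // derived from spark.buffer.size
    val merged =
      merge(Seq(hadoopDefault, hiveSite, sparkHadoop, sparkHive, sparkSpecific))
    println(merged("io.file.buffer.size")) // prints 65536
  }
}
```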
assert(sc.listJars().exists(_.contains("commons-lang_commons-lang-2.6.jar")))
}

test("SPARK-34346: hadoop configuration priority for spark/hive/hadoop configs") {
For good measure, do we want to also test that the default io.file.buffer.size you get from a plain Spark configuration is in fact 65536?
I guess that would cause test flakiness if it got a non-default value somewhere, due to the non-deterministic test order in CI. I did a pre-check that loads hive-site.xml explicitly, to ensure it gets loaded without polluting the final result.
oh, I can just set it explicitly too
 * the very first created SparkSession instance.
 */
def loadHiveConfFile(
def determineWarehouse(
why the return type is still a Map?
We should document what this method returns, as it's not clear from the name.
Makes sense to me. The returned map is unrelated to this change; I will document it better.
sc = new SparkContext(sparkConf)
assert(sc.hadoopConfiguration.get(testKey) === "/tmp/hive_two",
  "spark.hadoop configs have higher priority than hive/hadoop ones")
assert(sc.hadoopConfiguration.get(bufferKey).toInt === 65536,
shouldn't this be 20181117?
Here it is the same sparkConf w/ two more configs; it will not change to respect spark.hadoop.xxx.
OK, so for the buffer size, we ignore everything but only respect the BUFFER_SIZE config.
assert(conf.asInstanceOf[Configuration].get("fs.defaultFS") == "file:///")
}

test("SPARK-33740: hadoop configs in hive-site.xml can overrides pre-existing hadoop ones") {
Is this removed because this is merged to the new test coverage?
yes, the newly added case will take over this one
hadoopConfTemp.clear()
hadoopConfTemp.addResource(configFile)
for (entry <- hadoopConfTemp.asScala if !containsInSparkConf(entry.getKey)) {
  hadoopConf.set(entry.getKey, entry.getValue)
This removal seems to have some side-effect. Is it okay?
This behavior exists before SPARK-33740. So, I'm curious about the side-effect.
According to the current usage restrictions of Hive in Spark, for documented behaviors there is no side effect that matters in practice. But in some undocumented areas there are side effects: e.g., if a hive-site.xml that is unreachable at the start of a Spark app is added later through some APIs and loaded dynamically, those configurations will not be added anymore.
Loading hive-site.xml dynamically at runtime is really hacky, I don't think anyone would rely on that...
appendSparkHadoopConfigs(conf, hadoopConf)
appendSparkHiveConfigs(conf, hadoopConf)
val bufferSize = conf.get(BUFFER_SIZE).toString
hadoopConf.set("io.file.buffer.size", bufferSize)
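The diff above can be read as: after the `spark.hadoop.*` and `spark.hive.*` entries are appended, the `spark.buffer.size` value is written last, so a stray `io.file.buffer.size` from hive-site.xml can no longer win. A hypothetical simplification (not the actual SparkHadoopUtil code; the 20181117 value is the test value mentioned in this review):

```scala
import scala.collection.mutable

// Hypothetical simplification of the fix: because mutation order decides the
// final value, applying spark.buffer.size last restores it over whatever
// hive-site.xml had set.
object BufferSizeLastWins {
  def finalBufferSize(hiveSiteValue: String, sparkBufferSize: String): String = {
    val hadoopConf = mutable.Map[String, String]()
    hadoopConf("io.file.buffer.size") = hiveSiteValue   // loaded from hive-site.xml
    // spark.hadoop.* / spark.hive.* entries would be appended here ...
    hadoopConf("io.file.buffer.size") = sparkBufferSize // spark.buffer.size applied last
    hadoopConf("io.file.buffer.size")
  }

  def main(args: Array[String]): Unit =
    println(finalBufferSize("20181117", "65536")) // prints 65536
}
```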
Hi, @MaxGekk. According to your email, was this the only property affected?
retest this please
Hi @dongjoon-hyun, do you have more concerns about this fix?
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @yaooqinn , @srowen , @cloud-fan , @MaxGekk .
cc @HyukjinKwon
Merged to master.

@yaooqinn it has a conflict in the test. Do you mind opening a backporting PR? I would also like to make sure the tests pass before merging it in, since we're very close to the release.

OK, it's my pleasure
….size will override by loading hive-site.xml accidentally may cause perf regression

backport #31460 to branch 3.1

### What changes were proposed in this pull request?

In many real-world cases, when interacting with the hive catalog through Spark SQL, users may just share the `hive-site.xml` for their hive jobs and make a copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop configurations, we will use `spark.buffer.size` (65536) to reset `io.file.buffer.size` (4096). But when we load the hive-site.xml, we may ignore this behavior and reset `io.file.buffer.size` again according to `hive-site.xml`.

1. The configuration priority for setting Hadoop and Hive configs here is not right; literally, the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`.
2. This breaks the `spark.buffer.size` config's behavior for tuning the IO performance w/ HDFS if there is an existing `io.file.buffer.size` in hive-site.xml.

### Why are the changes needed?

Bugfix for configuration behavior, and fixes the performance regression caused by that behavior change.

### Does this PR introduce _any_ user-facing change?

This PR restores a silent user-facing change.

### How was this patch tested?

New tests.

Closes #31482 from yaooqinn/SPARK-34346-31.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
….size will override by loading hive-site.xml accidentally may cause perf regression

Backport #31460 to 3.0

### What changes were proposed in this pull request?

In many real-world cases, when interacting with the hive catalog through Spark SQL, users may just share the `hive-site.xml` for their hive jobs and make a copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop configurations, we will use `spark.buffer.size` (65536) to reset `io.file.buffer.size` (4096). But when we load the hive-site.xml, we may ignore this behavior and reset `io.file.buffer.size` again according to `hive-site.xml`.

1. The configuration priority for setting Hadoop and Hive configs here is not right; literally, the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`.
2. This breaks the `spark.buffer.size` config's behavior for tuning the IO performance w/ HDFS if there is an existing `io.file.buffer.size` in hive-site.xml.

### Why are the changes needed?

Bugfix for configuration behavior, and fixes the performance regression caused by that behavior change.

### Does this PR introduce _any_ user-facing change?

This PR restores a silent user-facing change.

### How was this patch tested?

New tests.

Closes #31492 from yaooqinn/SPARK-34346-30.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
This seems to make the following UT flaky in both GitHub Actions and Jenkins in

Even in this PR, the last commit has the same failure in GitHub Actions.

@yaooqinn would you mind taking a look please?

It turns out

I made a follow-up here. It switched the way of testing by replacing
What changes were proposed in this pull request?

In many real-world cases, when interacting with the Hive catalog through Spark SQL, users may just share the `hive-site.xml` for their Hive jobs and make a copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop configurations, we will use `spark.buffer.size` (65536) to reset `io.file.buffer.size` (4096). But when we load the hive-site.xml, we may ignore this behavior and reset `io.file.buffer.size` again according to `hive-site.xml`.

1. The configuration priority for setting Hadoop and Hive configs here is not right; literally, the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`.
2. This breaks the `spark.buffer.size` config's behavior for tuning the IO performance w/ HDFS if there is an existing `io.file.buffer.size` in hive-site.xml.

Why are the changes needed?

Bugfix for configuration behavior, and fixes the performance regression caused by that behavior change.

Does this PR introduce any user-facing change?

This PR restores a silent user-facing change.

How was this patch tested?

New tests.
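As a concrete illustration of the scenario described above, a `hive-site.xml` copied verbatim into `SPARK_HOME`/conf might carry an `io.file.buffer.size` tuned for Hive jobs. Before this fix, an entry like the following (the value here is hypothetical) would silently replace the value derived from `spark.buffer.size`:

```xml
<configuration>
  <!-- Hypothetical entry: before this fix, this value would override the
       io.file.buffer.size derived from spark.buffer.size (65536). -->
  <property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
  </property>
</configuration>
```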