
[SPARK-34346][CORE][SQL] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression#31460

Closed
yaooqinn wants to merge 6 commits into apache:master from yaooqinn:SPARK-34346

Conversation

@yaooqinn (Member) commented Feb 3, 2021

What changes were proposed in this pull request?

In many real-world cases, when interacting with the Hive catalog through Spark SQL, users may simply share the hive-site.xml from their Hive jobs and copy it to SPARK_HOME/conf without modification. In Spark, when we generate Hadoop configurations, we use spark.buffer.size (default 65536) to reset io.file.buffer.size (default 4096). But when we load hive-site.xml, this behavior is ignored and io.file.buffer.size is reset again according to hive-site.xml.

  1. The configuration priority for setting Hadoop and Hive configs here is wrong; the order should be spark > spark.hive > spark.hadoop > hive > hadoop

  2. This breaks the spark.buffer.size config's behavior for tuning IO performance with HDFS if there is an existing io.file.buffer.size in hive-site.xml
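The intended precedence can be sketched as a plain map merge (an illustrative, self-contained sketch; the function and map names are hypothetical, not Spark's actual implementation):

```scala
// Sources applied from lowest to highest priority; later entries win on merge:
// hadoop defaults < hive-site.xml < spark.hadoop.* < spark.hive.* < spark.*
def effectiveConf(
    hadoopDefaults: Map[String, String],
    hiveSite: Map[String, String],
    sparkHadoop: Map[String, String],
    sparkHive: Map[String, String],
    spark: Map[String, String]): Map[String, String] =
  hadoopDefaults ++ hiveSite ++ sparkHadoop ++ sparkHive ++ spark

// io.file.buffer.size from hive-site.xml must lose to spark.buffer.size:
val merged = effectiveConf(
  hadoopDefaults = Map("io.file.buffer.size" -> "4096"),
  hiveSite = Map("io.file.buffer.size" -> "201811"),
  sparkHadoop = Map.empty,
  sparkHive = Map.empty,
  spark = Map("io.file.buffer.size" -> "65536"))
assert(merged("io.file.buffer.size") == "65536")
```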

Why are the changes needed?

Bug fix for the configuration priority behavior, and a fix for the performance regression caused by that behavior change.

Does this PR introduce any user-facing change?

Yes, this PR restores behavior that was altered by an accidental, silent user-facing change.

How was this patch tested?

new tests

@yaooqinn (Member, Author) commented Feb 3, 2021

cc @cloud-fan @maropu @dongjoon-hyun @HyukjinKwon thanks

@srowen (Member) commented Feb 3, 2021

spark > spark.hive > spark.hadoop > hive > hadoop makes sense to me.

assert(sc.listJars().exists(_.contains("commons-lang_commons-lang-2.6.jar")))
}

test("SPARK-34346: hadoop configuration priority for spark/hive/hadoop configs") {
Member:

For good measure, do we want to also test that the default io.file.buffer.size you get from a plain Spark configuration is in fact 65536?

Member Author:

I guess that would cause test flakiness if it got a non-default value somewhere, due to the non-deterministic test order in CI. I did a pre-check that loads hive-site.xml explicitly to ensure it gets loaded, without polluting the final result.

Member Author:

oh, I can just set it explicitly too

@SparkQA

SparkQA commented Feb 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39424/

* the very first created SparkSession instance.
*/
def loadHiveConfFile(
def determineWarehouse(
Contributor:

why the return type is still a Map?

Contributor:

We should document what this method returns, as it's not clear from the name.

Member Author:

Makes sense to me. The returned map is unrelated to this change; I will document it better.

sc = new SparkContext(sparkConf)
assert(sc.hadoopConfiguration.get(testKey) === "/tmp/hive_two",
"spark.hadoop configs have higher priority than hive/hadoop ones")
assert(sc.hadoopConfiguration.get(bufferKey).toInt === 65536,
Contributor:

shouldn't this be 20181117?

Member Author:

Here it is the same sparkConf with two more configs; it will not change to respect spark.hadoop.xxx.

Contributor:

OK, so for the buffer size, we ignore everything else and respect only the BUFFER_SIZE config.

@SparkQA

SparkQA commented Feb 3, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39424/

@SparkQA

SparkQA commented Feb 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39427/

@SparkQA

SparkQA commented Feb 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39427/

@SparkQA

SparkQA commented Feb 3, 2021

Test build #134837 has finished for PR 31460 at commit 6f49965.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(conf.asInstanceOf[Configuration].get("fs.defaultFS") == "file:///")
}

test("SPARK-33740: hadoop configs in hive-site.xml can overrides pre-existing hadoop ones") {
Member:

Is this removed because this is merged to the new test coverage?

Member Author:

Yes, the newly added test case supersedes this one.

hadoopConfTemp.clear()
hadoopConfTemp.addResource(configFile)
for (entry <- hadoopConfTemp.asScala if !containsInSparkConf(entry.getKey)) {
hadoopConf.set(entry.getKey, entry.getValue)
Member:

This removal seems to have some side-effect. Is it okay?

Member:

This behavior exists before SPARK-33740. So, I'm curious about the side-effect.

Member Author:

According to the current usage restrictions of Hive in Spark, for documented behaviors there is no side effect that matters in practice. But in some undocumented areas there are side effects, e.g. dynamically loading a hive-site.xml that is unreachable at the start of a Spark app but added later through some APIs: those configurations will no longer be added.

Contributor:

Loading hive-site.xml dynamically at runtime is really hacky, I don't think anyone would rely on that...

appendSparkHadoopConfigs(conf, hadoopConf)
appendSparkHiveConfigs(conf, hadoopConf)
val bufferSize = conf.get(BUFFER_SIZE).toString
hadoopConf.set("io.file.buffer.size", bufferSize)
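Because the `hadoopConf.set` call above runs after the hive/hadoop configs have been appended, the last write always wins. A minimal sketch of that ordering effect (using a plain mutable map as a stand-in for Hadoop's `Configuration`; the values are illustrative):

```scala
import scala.collection.mutable

// Stand-in for org.apache.hadoop.conf.Configuration: a later set() overwrites
// an earlier one for the same key.
val hadoopConf = mutable.Map[String, String]()
hadoopConf("io.file.buffer.size") = "201811"   // value appended from hive-site.xml
val bufferSize = "65536"                       // conf.get(BUFFER_SIZE), i.e. spark.buffer.size
hadoopConf("io.file.buffer.size") = bufferSize // applied last, so it always wins
assert(hadoopConf("io.file.buffer.size") == "65536")
```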
@dongjoon-hyun (Member) commented Feb 3, 2021:

Hi, @MaxGekk . According to your email, was this the only property affected?


@SparkQA

SparkQA commented Feb 3, 2021

Test build #134841 has finished for PR 31460 at commit 6ab49ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-34346][Core][SQL] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression [SPARK-34346][CORE][SQL] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression Feb 4, 2021
@cloud-fan (Contributor):

spark > spark.hive > spark.hadoop > hive > hadoop makes sense and the new change is much easier to prove that it follows this order. Previously, we load the hive-site.xml at the end, and try to simulate that it has a lower priority, which is very hacky.

@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39438/

@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39438/

@SparkQA

SparkQA commented Feb 4, 2021

Test build #134851 has finished for PR 31460 at commit 58a532e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

retest this please

1 similar comment
@yaooqinn (Member, Author) commented Feb 4, 2021

retest this please

@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39444/

@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39444/

@SparkQA

SparkQA commented Feb 4, 2021

Test build #134856 has finished for PR 31460 at commit 58a532e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

Hi @dongjoon-hyun , do you have more concerns about this fix?

@dongjoon-hyun (Member) left a comment:

+1, LGTM. Thank you, @yaooqinn , @srowen , @cloud-fan , @MaxGekk .

cc @HyukjinKwon

@HyukjinKwon (Member) commented Feb 5, 2021

Merged to master.

@HyukjinKwon (Member):

@yaooqinn it has a conflict in the test. Do you mind opening a backporting PR? I would also like to make sure the tests pass before merging it in since we're very close to the release.

@yaooqinn (Member, Author) commented Feb 5, 2021

OK, it's my pleasure

cloud-fan pushed a commit that referenced this pull request Feb 5, 2021
….size will override by loading hive-site.xml accidentally may cause perf regression

backport  #31460  to branch 3.1


Closes #31482 from yaooqinn/SPARK-34346-31.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
dongjoon-hyun pushed a commit that referenced this pull request Feb 5, 2021
….size will override by loading hive-site.xml accidentally may cause perf regression

Backport  #31460 to 3.0


Closes #31492 from yaooqinn/SPARK-34346-30.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun (Member) commented Feb 7, 2021

This seems to make the following UT flaky in both GitHub Action and Jenkins in master/branch-3.1/branch-3.0.

YarnClusterSuite.yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630)
org.scalatest.exceptions.TestFailedException: FAILED did not equal FINISHED (stdout/stderr was not captured)

@dongjoon-hyun (Member):

Even in this PR, the last commit has the same failure in GitHub Action.

@HyukjinKwon (Member):

@yaooqinn would you mind taking a look please?

@dongjoon-hyun (Member) commented Feb 7, 2021

It turns out core/src/test/resources/core-site.xml (of this PR) is loaded in the YARN test. I'll make a fix for this.

@dongjoon-hyun (Member):

I made a follow-up here. It switches the testing approach by replacing core-site.xml with an ad-hoc set command.
