[SPARK-34346][CORE][SQL] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression#31460
yaooqinn wants to merge 6 commits into apache:master from yaooqinn:SPARK-34346
Conversation
… will override by loading hive-site.xml may cause perf regression
cc @cloud-fan @maropu @dongjoon-hyun @HyukjinKwon thanks

spark > spark.hive > spark.hadoop > hive > hadoop makes sense to me.
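The intended layering can be sketched as successive overlays, lowest priority first, so that a later source overwrites an earlier one. This is a simplified illustration, not Spark's actual code; the concrete values come from this PR's discussion (4096 Hadoop default, 65536 from `spark.buffer.size`):

```scala
// Simplified sketch of the intended priority order (not Spark's actual
// implementation): later sources overwrite earlier ones, so Spark-specific
// settings win over spark.hive.*, spark.hadoop.*, hive-site.xml, and the
// Hadoop defaults, in that order.
object ConfPriority {
  def merge(sources: Seq[Map[String, String]]): Map[String, String] =
    sources.foldLeft(Map.empty[String, String])(_ ++ _)

  def main(args: Array[String]): Unit = {
    val hadoopDefault = Map("io.file.buffer.size" -> "4096")
    val hiveSite      = Map("io.file.buffer.size" -> "20181117") // hypothetical hive-site.xml entry
    val sparkHadoop   = Map.empty[String, String]                // spark.hadoop.* entries
    val sparkHive     = Map.empty[String, String]                // spark.hive.* entries
    val sparkSpecific = Map("io.file.buffer.size" -> "65536")    // derived from spark.buffer.size
    val merged =
      merge(Seq(hadoopDefault, hiveSite, sparkHadoop, sparkHive, sparkSpecific))
    println(merged("io.file.buffer.size")) // prints 65536
  }
}
```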
assert(sc.listJars().exists(_.contains("commons-lang_commons-lang-2.6.jar")))
}

test("SPARK-34346: hadoop configuration priority for spark/hive/hadoop configs") {
For good measure, do we want to also test that the default io.file.buffer.size you get from a plain Spark configuration is in fact 65536?
I guess that would cause test flakiness if it got a non-default value somewhere, due to the non-deterministic test order in CI. I did a pre-check that loads hive-site.xml explicitly, to ensure it gets loaded without polluting the final result.
oh, I can just set it explicitly too
 * the very first created SparkSession instance.
 */
def loadHiveConfFile(
def determineWarehouse(
why the return type is still a Map?
We should document what this method returns, as it's not clear from the name.
Makes sense to me. The returned map is unrelated to this change; I will document it better.
sc = new SparkContext(sparkConf)
assert(sc.hadoopConfiguration.get(testKey) === "/tmp/hive_two",
  "spark.hadoop configs have higher priority than hive/hadoop ones")
assert(sc.hadoopConfiguration.get(bufferKey).toInt === 65536,
shouldn't this be 20181117?
Here it is the same sparkConf w/ two more configs; it will not change to respect spark.hadoop.xxx.
OK, so for the buffer size, we ignore everything but only respect the BUFFER_SIZE config.
assert(conf.asInstanceOf[Configuration].get("fs.defaultFS") == "file:///")
}

test("SPARK-33740: hadoop configs in hive-site.xml can overrides pre-existing hadoop ones") {
Is this removed because this is merged to the new test coverage?
yes, the newly added case will take over this one
hadoopConfTemp.clear()
hadoopConfTemp.addResource(configFile)
for (entry <- hadoopConfTemp.asScala if !containsInSparkConf(entry.getKey)) {
  hadoopConf.set(entry.getKey, entry.getValue)
This removal seems to have some side-effect. Is it okay?
This behavior exists before SPARK-33740. So, I'm curious about the side-effect.
According to the current usage restrictions of Hive in Spark, for documented behaviors there is no side effect that matters in practice. But in some undocumented areas there are side effects: e.g., if a hive-site.xml that is unreachable at the start of a Spark app is added later through some APIs and loaded dynamically, those configurations will not be added anymore.
Loading hive-site.xml dynamically at runtime is really hacky, I don't think anyone would rely on that...
appendSparkHadoopConfigs(conf, hadoopConf)
appendSparkHiveConfigs(conf, hadoopConf)
val bufferSize = conf.get(BUFFER_SIZE).toString
hadoopConf.set("io.file.buffer.size", bufferSize)
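The diff above can be read as: after the `spark.hadoop.*` and `spark.hive.*` entries are appended, the `spark.buffer.size` value is written last, so a stray `io.file.buffer.size` from hive-site.xml can no longer win. A hypothetical simplification (not the actual SparkHadoopUtil code; the 20181117 value is the test value mentioned in this review):

```scala
import scala.collection.mutable

// Hypothetical simplification of the fix: because mutation order decides the
// final value, applying spark.buffer.size last restores it over whatever
// hive-site.xml had set.
object BufferSizeLastWins {
  def finalBufferSize(hiveSiteValue: String, sparkBufferSize: String): String = {
    val hadoopConf = mutable.Map[String, String]()
    hadoopConf("io.file.buffer.size") = hiveSiteValue   // loaded from hive-site.xml
    // spark.hadoop.* / spark.hive.* entries would be appended here ...
    hadoopConf("io.file.buffer.size") = sparkBufferSize // spark.buffer.size applied last
    hadoopConf("io.file.buffer.size")
  }

  def main(args: Array[String]): Unit =
    println(finalBufferSize("20181117", "65536")) // prints 65536
}
```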
Hi, @MaxGekk. According to your email, was this the only property affected?
retest this please
Hi @dongjoon-hyun, do you have more concerns about this fix?
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @yaooqinn , @srowen , @cloud-fan , @MaxGekk .
cc @HyukjinKwon
Merged to master.

@yaooqinn it has a conflict in the test. Do you mind opening a backporting PR? I would also like to make sure the tests pass before merging it in, since we're very close to the release.

OK, it's my pleasure
….size will override by loading hive-site.xml accidentally may cause perf regression

backport #31460 to branch 3.1

### What changes were proposed in this pull request?

In many real-world cases, when interacting with the hive catalog through Spark SQL, users may just share the `hive-site.xml` for their hive jobs and make a copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop configurations, we will use `spark.buffer.size` (65536) to reset `io.file.buffer.size` (4096). But when we load the hive-site.xml, we may ignore this behavior and reset `io.file.buffer.size` again according to `hive-site.xml`.

1. The configuration priority for setting Hadoop and Hive configs here is not right; literally, the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`.
2. This breaks the `spark.buffer.size` config's behavior for tuning the IO performance w/ HDFS if there is an existing `io.file.buffer.size` in hive-site.xml.

### Why are the changes needed?

Bugfix for configuration behavior, and fixes the performance regression caused by that behavior change.

### Does this PR introduce _any_ user-facing change?

This PR restores a silent user-facing change.

### How was this patch tested?

New tests.

Closes #31482 from yaooqinn/SPARK-34346-31.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
….size will override by loading hive-site.xml accidentally may cause perf regression

Backport #31460 to 3.0

### What changes were proposed in this pull request?

In many real-world cases, when interacting with the hive catalog through Spark SQL, users may just share the `hive-site.xml` for their hive jobs and make a copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop configurations, we will use `spark.buffer.size` (65536) to reset `io.file.buffer.size` (4096). But when we load the hive-site.xml, we may ignore this behavior and reset `io.file.buffer.size` again according to `hive-site.xml`.

1. The configuration priority for setting Hadoop and Hive configs here is not right; literally, the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`.
2. This breaks the `spark.buffer.size` config's behavior for tuning the IO performance w/ HDFS if there is an existing `io.file.buffer.size` in hive-site.xml.

### Why are the changes needed?

Bugfix for configuration behavior, and fixes the performance regression caused by that behavior change.

### Does this PR introduce _any_ user-facing change?

This PR restores a silent user-facing change.

### How was this patch tested?

New tests.

Closes #31492 from yaooqinn/SPARK-34346-30.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
This seems to make the following UT flaky in both GitHub Actions and Jenkins in

Even in this PR, the last commit has the same failure in GitHub Actions.

@yaooqinn would you mind taking a look please?

It turns out

I made a follow-up here. It switched the way of testing by replacing
What changes were proposed in this pull request?

In many real-world cases, when interacting with the Hive catalog through Spark SQL, users may just share the `hive-site.xml` for their Hive jobs and make a copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop configurations, we will use `spark.buffer.size` (65536) to reset `io.file.buffer.size` (4096). But when we load the hive-site.xml, we may ignore this behavior and reset `io.file.buffer.size` again according to `hive-site.xml`.

1. The configuration priority for setting Hadoop and Hive configs here is not right; literally, the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`.
2. This breaks the `spark.buffer.size` config's behavior for tuning the IO performance w/ HDFS if there is an existing `io.file.buffer.size` in hive-site.xml.

Why are the changes needed?

Bugfix for configuration behavior, and fixes the performance regression caused by that behavior change.

Does this PR introduce any user-facing change?

This PR restores a silent user-facing change.

How was this patch tested?

New tests.
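As a concrete illustration of the scenario described above, a `hive-site.xml` copied verbatim into `SPARK_HOME`/conf might carry an `io.file.buffer.size` tuned for Hive jobs. Before this fix, an entry like the following (the value here is hypothetical) would silently replace the value derived from `spark.buffer.size`:

```xml
<configuration>
  <!-- Hypothetical entry: before this fix, this value would override the
       io.file.buffer.size derived from spark.buffer.size (65536). -->
  <property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
  </property>
</configuration>
```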