[SPARK-34346][CORE][SQL][3.1] io.file.buffer.size set by spark.buffer.size will be overridden by loading hive-site.xml accidentally, which may cause perf regression #31482
yaooqinn wants to merge 2 commits into apache:branch-3.1 from yaooqinn:SPARK-34346-31
Conversation
… will be overridden by loading hive-site.xml accidentally, which may cause perf regression

In many real-world cases, when interacting with the hive catalog through Spark SQL, users may just share the `hive-site.xml` used for their hive jobs and copy it to `SPARK_HOME`/conf without modification. In Spark, when we generate Hadoop configurations, we use `spark.buffer.size` (65536) to reset `io.file.buffer.size` (4096). But when we load the hive-site.xml, we ignore this behavior and reset `io.file.buffer.size` again according to `hive-site.xml`.

1. The configuration priority for setting Hadoop and Hive config here is not right; the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`.
2. This breaks the `spark.buffer.size` config's behavior for tuning the IO performance w/ HDFS if there is an existing `io.file.buffer.size` in hive-site.xml.

This is a bugfix for the configuration precedence behavior and fixes the performance regression caused by that behavior change. It restores the behavior before the silent user-facing change. Tested with new tests.

Closes #31460 from yaooqinn/SPARK-34346.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
|
cc @cloud-fan @maropu @HyukjinKwon here is the backport PR for 3.1, thanks |
|
Test build #134913 has finished for PR 31482 at commit
|
|
Thank you for making a backport, @yaooqinn . |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #134917 has finished for PR 31482 at commit
|
|
retest this please |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #134932 has finished for PR 31482 at commit
|
….size will be overridden by loading hive-site.xml accidentally, which may cause perf regression (backport of #31460 to branch-3.1; the full description appears in the PR body). Closes #31482 from yaooqinn/SPARK-34346-31. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
|
thanks, merging to 3.1! |
|
@yaooqinn can you open a backport PR for 3.0? It conflicts. thanks! |
|
OK. I got it. |
backport #31460 to branch 3.1
What changes were proposed in this pull request?
In many real-world cases, when interacting with hive catalog through Spark SQL, users may just share the
`hive-site.xml` for their hive jobs and make a copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop configurations, we will use `spark.buffer.size` (65536) to reset `io.file.buffer.size` (4096). But when we load the hive-site.xml, we may ignore this behavior and reset `io.file.buffer.size` again according to `hive-site.xml`.

1. The configuration priority for setting Hadoop and Hive config here is not right; the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`.
2. This breaks the `spark.buffer.size` config's behavior for tuning the IO performance w/ HDFS if there is an existing `io.file.buffer.size` in hive-site.xml.

Why are the changes needed?
Bugfix for the configuration precedence behavior; it also fixes the performance regression caused by that behavior change.
Does this PR introduce any user-facing change?
Yes. This PR restores the behavior before the silent user-facing change: `spark.buffer.size` again takes precedence over `io.file.buffer.size` from hive-site.xml.
How was this patch tested?
new tests
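The intended precedence described above can be sketched as a simple layered merge. This is a hypothetical illustration, not Spark's actual implementation: the helper name `build_hadoop_conf` and the `131072` value for hive-site.xml are made up for the example; only `io.file.buffer.size`, its Hadoop default 4096, and the 65536 default of `spark.buffer.size` come from the PR description.

```python
# Hypothetical sketch of the fixed precedence, NOT Spark's actual code.
# Layers are applied from lowest to highest priority, so later layers win:
# hadoop < hive < spark.hadoop.* < spark.hive.* < spark

def build_hadoop_conf(layers):
    """Merge config layers in order; a later (higher-priority) layer
    overrides keys set by an earlier one."""
    conf = {}
    for layer in layers:
        conf.update(layer)
    return conf

hadoop_defaults = {"io.file.buffer.size": "4096"}   # Hadoop default
hive_site = {"io.file.buffer.size": "131072"}       # illustrative hive-site.xml value
spark_hadoop = {}                                   # spark.hadoop.* overrides
spark_hive = {}                                     # spark.hive.* overrides
spark = {"io.file.buffer.size": "65536"}            # derived from spark.buffer.size

conf = build_hadoop_conf([hadoop_defaults, hive_site, spark_hadoop, spark_hive, spark])
print(conf["io.file.buffer.size"])  # 65536: spark.buffer.size wins over hive-site.xml
```

Before the fix, hive-site.xml was effectively applied last, so the merge above would have ended at `131072` instead of `65536`, silently defeating `spark.buffer.size` tuning.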