[SPARK-33806][SQL] limit partition num to 1 when distributing by foldable expressions #30800
Conversation
cc @cloud-fan @HyukjinKwon @maropu thanks very much

retest this please

Test build #132875 has finished for PR 30800 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #132888 has finished for PR 30800 at commit
dongjoon-hyun left a comment:
+1, LGTM. Thanks, @yaooqinn .
Merged to master for Apache Spark 3.2.0.
[SPARK-33806][SQL] limit partition num to 1 when distributing by foldable expressions

It seems a very popular pattern: people use the DISTRIBUTE BY clause with a literal to coalesce partitions in pure SQL data processing. For example:

```
insert into table src select * from values (1), (2), (3) t(a) distribute by 1
```

Users may want the final output to be one single data file, but the reality is not always so. Spark will always create a file for partition 0 whether it contains data or not, so when all the data goes to a partition with index > 0, there will always be two files, and part-00000 will be empty. On the other hand, a lot of empty tasks will be launched too, which is unnecessary. When users repeat the insert statement daily, hourly, or minutely, this causes small-file issues.

```
spark-sql> set spark.sql.shuffle.partitions=3;drop table if exists test2;create table test2 using parquet as select * from values (1), (2), (3) t(a) distribute by 1;

kentyaohulk ~/spark SPARK-33806 tree /Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/ -s
/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/
├── [   0]  _SUCCESS
├── [ 298]  part-00000-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet
└── [ 426]  part-00001-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet
```

To avoid this, there are some options you can take:

1. Use `distribute by null` to let the data go to partition 0.
2. Set spark.sql.adaptive.enabled to true so that Spark coalesces partitions automatically.
3. Use hints instead of `distribute by`.
4. Set spark.sql.shuffle.partitions to 1.

In this PR, we set the partition number to 1 in this particular case, to:

1. avoid small-file issues;
2. avoid unnecessary empty tasks when adaptive execution is not enabled.

No user-facing change. Tested with a new test.

Closes apache#30800 from yaooqinn/SPARK-33806.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 728a129)
Signed-off-by: Dongjoon Hyun <[email protected]>
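Below is a minimal spark-sql sketch of the intended effect of this change, assuming the same warehouse setup as the repro above; the expected file layout is inferred from the PR description, not captured output.

```
-- With this patch, a foldable DISTRIBUTE BY expression plans exactly one
-- shuffle partition, regardless of spark.sql.shuffle.partitions.
set spark.sql.shuffle.partitions=3;
drop table if exists test2;
create table test2 using parquet as
select * from values (1), (2), (3) t(a) distribute by 1;
-- expected warehouse layout: _SUCCESS plus a single part-00000 file,
-- with no empty sibling part files and no empty tasks launched.
```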
What changes were proposed in this pull request?
It seems a very popular pattern: people use the DISTRIBUTE BY clause with a literal to coalesce partitions in pure SQL data processing.
For example:
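```
insert into table src select * from values (1), (2), (3) t(a) distribute by 1
```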
Users may want the final output to be one single data file, but the reality is not always so. Spark will always create a file for partition 0 whether it contains data or not, so when all the data goes to a partition with index > 0, there will always be two files, and part-00000 will be empty. On the other hand, a lot of empty tasks will be launched too, which is unnecessary.
When users repeat the insert statement daily, hourly, or minutely, this causes small-file issues.
To avoid this, there are some options you can take (sketched in spark-sql below):

1. Use `distribute by null` to let the data go to partition 0.
2. Set spark.sql.adaptive.enabled to true so that Spark coalesces partitions automatically.
3. Use hints instead of `distribute by`.
4. Set spark.sql.shuffle.partitions to 1.

In this PR, we set the partition number to 1 in this particular case.
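The workarounds as spark-sql statements, assuming the same example table `src` as above; the hint in option 3 is one possible choice, since the list does not name a specific hint:

```
-- Option 1: a null distribution key sends every row to partition 0.
insert into table src select * from values (1), (2), (3) t(a) distribute by null;

-- Option 2: let adaptive query execution coalesce small shuffle partitions.
set spark.sql.adaptive.enabled=true;

-- Option 3: a repartition hint instead of DISTRIBUTE BY.
insert into table src select /*+ REPARTITION(1) */ * from values (1), (2), (3) t(a);

-- Option 4: force a single shuffle partition for the whole session.
set spark.sql.shuffle.partitions=1;
```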
Why are the changes needed?
1. To avoid small-file issues.
2. To avoid unnecessary empty tasks when adaptive execution is not enabled.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New test.