[SPARK-33806][SQL] limit partition num to 1 when distributing by foldable expressions #30800

yaooqinn · 2020-12-16T08:31:04Z

What changes were proposed in this pull request?

In pure SQL data processing, it is a popular practice to use a DISTRIBUTE BY clause with a literal to coalesce partitions.

For example:

insert into table src select * from values (1), (2), (3) t(a) distribute by 1

Users may expect the final output to be a single data file, but that is not always what happens. Spark always creates a file for partition 0 whether or not it contains data, so when all the data goes to another partition (index > 0), there are always two files, and part-00000 is empty. In addition, many empty tasks are launched, which is unnecessary.

When users repeat the insert statement daily, hourly, or every minute, this causes small file issues.

spark-sql> set spark.sql.shuffle.partitions=3;drop table if exists test2;create table test2 using parquet as select * from values (1), (2), (3) t(a) distribute by 1;


$ tree /Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/ -s
/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/
├── [          0]  _SUCCESS
├── [        298]  part-00000-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet
└── [        426]  part-00001-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet

To avoid this, there are some options you can take:

1. Use `distribute by null` to let the data go to partition 0.
2. Set `spark.sql.adaptive.enabled` to true so that Spark coalesces partitions automatically.
3. Use hints instead of `distribute by`.
4. Set `spark.sql.shuffle.partitions` to 1.
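The options above can be sketched in Spark SQL as follows. This is an illustrative sketch rather than part of the patch: the table `src` is taken from the example above, and the `COALESCE(1)` hint is only one of the hints that could stand in for `distribute by` (it is my choice for illustration, not one named in the description).

```sql
-- Option 1: distribute by null instead of a literal such as 1
-- (the description states this sends the data to partition 0).
insert into table src select * from values (1), (2), (3) t(a) distribute by null;

-- Option 2: let adaptive execution coalesce the shuffle partitions.
set spark.sql.adaptive.enabled=true;
insert into table src select * from values (1), (2), (3) t(a) distribute by 1;

-- Option 3: use a hint instead of distribute by (COALESCE(1) is an assumed choice).
insert into table src select /*+ COALESCE(1) */ * from values (1), (2), (3) t(a);

-- Option 4: force a single shuffle partition for the session.
set spark.sql.shuffle.partitions=1;
insert into table src select * from values (1), (2), (3) t(a) distribute by 1;
```

Options 2 and 4 change session-level settings, so they also affect every other shuffle in the same session.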

In this PR, we set the partition number to 1 in this particular case, that is, when the data is distributed by foldable expressions such as a literal.
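One way to check the effect, sketched here as an assumption rather than as the test added by this PR, is to inspect the physical plan of a query that distributes by a literal while `spark.sql.shuffle.partitions` is larger than 1, and look at the partitioning of the shuffle exchange.

```sql
-- Use more than one shuffle partition so the difference is visible.
set spark.sql.shuffle.partitions=3;

-- Inspect the plan; the exchange node shows how the data is partitioned.
explain select * from values (1), (2), (3) t(a) distribute by 1;
```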

Why are the changes needed?

1. Avoid small file issues.
2. Avoid unnecessary empty tasks when adaptive execution is not enabled.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a new test.


yaooqinn · 2020-12-16T12:51:56Z

cc @cloud-fan @HyukjinKwon @maropu thanks very much

cloud-fan · 2020-12-16T13:46:19Z

retest this please

SparkQA · 2020-12-16T14:04:44Z

Test build #132875 has finished for PR 30800 at commit 2952048.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-16T14:57:34Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37490/

SparkQA · 2020-12-16T15:02:16Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37490/

SparkQA · 2020-12-16T18:45:42Z

Test build #132888 has finished for PR 30800 at commit 2952048.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thanks, @yaooqinn .
Merged to master for Apache Spark 3.2.0.

…able expressions

Closes apache#30800 from yaooqinn/SPARK-33806.

Authored-by: Kent Yao
Signed-off-by: Dongjoon Hyun
(cherry picked from commit 728a129)
Signed-off-by: Dongjoon Hyun

github-actions bot added the SQL label Dec 16, 2020

cloud-fan approved these changes Dec 16, 2020


dongjoon-hyun approved these changes Dec 16, 2020


dongjoon-hyun closed this in 728a129 Dec 16, 2020
