Conversation

@AngersZhuuuu (Contributor) commented Jan 17, 2020

What changes were proposed in this pull request?

Add a mechanism to mitigate the small-files problem when writing query results.

Why are the changes needed?

Spark SQL often generates too many small files as the result of INSERT statements. This change adds a mechanism to automatically minimize the number of output files.

Does this PR introduce any user-facing change?

When spark.sql.files.mergeSmallFile.enabled is set to true,
neighbouring small partitions are combined, which reduces the number of small output files.
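The idea of combining neighbouring small partitions can be sketched as a greedy pass over estimated partition sizes: adjacent partitions are grouped until a target size is reached, and each group becomes one output file. A minimal Python simulation of that idea follows; the function name, the target-size rule, and the sample numbers are illustrative assumptions, not the PR's exact implementation:

```python
# Greedily group neighbouring partition sizes (in bytes) so that each
# output group reaches at least `target` bytes. Each group of adjacent
# small partitions would then be written as a single output file.
def merge_small_partitions(sizes, target):
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        current.append(index)
        current_size += size
        if current_size >= target:
            groups.append(current)
            current, current_size = [], 0
    if current:  # leftover tail stays as its own (smaller) group
        groups.append(current)
    return groups

# Eight 1 MB partitions with a 4 MB target collapse into two output groups.
print(merge_small_partitions([1_000_000] * 8, 4_000_000))
# → [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Only adjacent partitions are grouped, so the relative row order of the output is preserved, which is why the description talks about "neighbouring" partitions rather than arbitrary repartitioning.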

How was this patch tested?

Manually tested:

```sql
create table test_dy_data(id string, dt string);
create table test_dy_part(id string) partitioned by (dt string);

insert into table test_dy_data select 1, '20191201';
insert into table test_dy_data select 2, '20191201';
insert into table test_dy_data select 3, '20191201';
insert into table test_dy_data select 4, '20191201';
insert into table test_dy_data select 5, '20191201';
insert into table test_dy_data select 6, '20191201';
insert into table test_dy_data select 7, '20191201';
insert into table test_dy_data select 8, '20191201';

insert into table test_dy_data select 1, '20191202';
insert into table test_dy_data select 2, '20191202';
insert into table test_dy_data select 3, '20191202';
insert into table test_dy_data select 4, '20191202';
insert into table test_dy_data select 5, '20191202';
insert into table test_dy_data select 6, '20191202';
insert into table test_dy_data select 7, '20191202';
insert into table test_dy_data select 8, '20191202';

insert into table test_dy_data select 1, '20191203';
insert into table test_dy_data select 2, '20191203';
insert into table test_dy_data select 3, '20191203';
insert into table test_dy_data select 4, '20191203';
insert into table test_dy_data select 5, '20191203';
insert into table test_dy_data select 6, '20191203';
insert into table test_dy_data select 7, '20191203';
insert into table test_dy_data select 8, '20191203';

set hive.exec.stagingdir=/Users/yi.zhu/Documents/data/;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table test_dy_part select * from test_dy_data;
```

With spark.sql.files.mergeSmallFile.enabled=false:

(screenshot)

With spark.sql.files.mergeSmallFile.enabled=true:

(screenshot)

@wangshuo128 (Contributor) left a comment:
The JIRA number "SPARK-20538" looks wrong — do you mean "SPARK-30538"?

```scala
val partitionIndexToSize = parent.mapPartitionsWithIndexInternal((index, part) => {
  // TODO make it more accurate
  Map(index -> rowSize * part.size).iterator
}).collectAsMap()
```
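The quoted snippet estimates each partition's byte size as a fixed per-row size times the row count, and collecting the map triggers a job that materialises every partition once (the point raised in the review comments below). A plain Python rendering of that estimate, with illustrative data and a hypothetical function name:

```python
# Rough per-partition size estimate: fixed row size * number of rows,
# analogous to Map(index -> rowSize * part.size) in the quoted Scala.
# Like collectAsMap(), building the full dict requires visiting every
# partition's contents once.
def estimate_partition_sizes(partitions, row_size):
    return {index: row_size * len(rows) for index, rows in enumerate(partitions)}

parts = [["a", "b"], ["c"], ["d", "e", "f"]]
print(estimate_partition_sizes(parts, row_size=100))
# → {0: 200, 1: 100, 2: 300}
```

The TODO in the snippet acknowledges that a single fixed rowSize is a coarse estimate; variable-width rows would make the per-partition totals inaccurate.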
Contributor:

Is it too costly to trigger an action to compute all the RDDs?

@AngersZhuuuu (Author):
> Is it too costly to trigger an action to compute all the RDDs?

This runs after the RDDs have been computed; we coalesce the already-computed RDD.

@AngersZhuuuu (Author):

> Is it too costly to trigger an action to compute all the RDDs?

Sorry, I made a mistake: it will recompute the last stage, but it won't recompute all stages.

@AngersZhuuuu AngersZhuuuu changed the title [WIP][SPARK-20538][SQL] Control spark sql output small file by merge small partition [WIP][SPARK-30538][SQL] Control spark sql output small file by merge small partition Jan 19, 2020
@AngersZhuuuu (Author):

> The JIRA number "SPARK-20538" looks wrong — do you mean "SPARK-30538"?

Yea, thanks

@AmplabJenkins:
Can one of the admins verify this patch?

github-actions bot commented Jul 8, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 8, 2020
@github-actions github-actions bot closed this Jul 9, 2020
SubhamSinghal commented May 10, 2024

Will someone be able to offer review here?
