Conversation

@AngersZhuuuu (Contributor) commented Jan 17, 2020

What changes were proposed in this pull request?

Add a mechanism to mitigate the small-files problem when writing query results.

Why are the changes needed?

Spark SQL often generates too many small files as the result of INSERT statements. This change adds a mechanism to automatically minimize the number of output files.

Does this PR introduce any user-facing change?

When spark.sql.files.mergeSmallFile.enabled is set to true,
neighbouring small partitions are combined, which reduces the number of small output files.
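The idea of combining neighbouring small partitions can be sketched as a greedy pass over estimated partition sizes: adjacent partitions are grouped until a target size is reached, and each group becomes one output file. A minimal Python simulation of that idea follows; the function name, the target-size rule, and the sample numbers are illustrative assumptions, not the PR's exact implementation:

```python
# Greedily group neighbouring partition sizes (in bytes) so that each
# output group reaches at least `target` bytes. Each group of adjacent
# small partitions would then be written as a single output file.
def merge_small_partitions(sizes, target):
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        current.append(index)
        current_size += size
        if current_size >= target:
            groups.append(current)
            current, current_size = [], 0
    if current:  # leftover tail stays as its own (smaller) group
        groups.append(current)
    return groups

# Eight 1 MB partitions with a 4 MB target collapse into two output groups.
print(merge_small_partitions([1_000_000] * 8, 4_000_000))
# → [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Only adjacent partitions are grouped, so the relative row order of the output is preserved, which is why the description talks about "neighbouring" partitions rather than arbitrary repartitioning.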

How was this patch tested?

Manually tested:

```sql
create table test_dy_data(id string, dt string);
create table test_dy_part(id string) partitioned by (dt string);

insert into table test_dy_data select 1, '20191201';
insert into table test_dy_data select 2, '20191201';
insert into table test_dy_data select 3, '20191201';
insert into table test_dy_data select 4, '20191201';
insert into table test_dy_data select 5, '20191201';
insert into table test_dy_data select 6, '20191201';
insert into table test_dy_data select 7, '20191201';
insert into table test_dy_data select 8, '20191201';

insert into table test_dy_data select 1, '20191202';
insert into table test_dy_data select 2, '20191202';
insert into table test_dy_data select 3, '20191202';
insert into table test_dy_data select 4, '20191202';
insert into table test_dy_data select 5, '20191202';
insert into table test_dy_data select 6, '20191202';
insert into table test_dy_data select 7, '20191202';
insert into table test_dy_data select 8, '20191202';

insert into table test_dy_data select 1, '20191203';
insert into table test_dy_data select 2, '20191203';
insert into table test_dy_data select 3, '20191203';
insert into table test_dy_data select 4, '20191203';
insert into table test_dy_data select 5, '20191203';
insert into table test_dy_data select 6, '20191203';
insert into table test_dy_data select 7, '20191203';
insert into table test_dy_data select 8, '20191203';

set hive.exec.stagingdir=/Users/yi.zhu/Documents/data/;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table test_dy_part select * from test_dy_data;
```

With spark.sql.files.mergeSmallFile.enabled=false:

(screenshot)

With spark.sql.files.mergeSmallFile.enabled=true:

(screenshot)

@wangshuo128 (Contributor) left a comment:
The JIRA number "SPARK-20538" looks wrong — do you mean "SPARK-30538"?

```scala
val partitionIndexToSize = parent.mapPartitionsWithIndexInternal((index, part) => {
  // TODO make it more accurate
  Map(index -> rowSize * part.size).iterator
}).collectAsMap()
```
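The quoted snippet estimates each partition's byte size as a fixed per-row size times the row count, and collecting the map triggers a job that materialises every partition once (the point raised in the review comments below). A plain Python rendering of that estimate, with illustrative data and a hypothetical function name:

```python
# Rough per-partition size estimate: fixed row size * number of rows,
# analogous to Map(index -> rowSize * part.size) in the quoted Scala.
# Like collectAsMap(), building the full dict requires visiting every
# partition's contents once.
def estimate_partition_sizes(partitions, row_size):
    return {index: row_size * len(rows) for index, rows in enumerate(partitions)}

parts = [["a", "b"], ["c"], ["d", "e", "f"]]
print(estimate_partition_sizes(parts, row_size=100))
# → {0: 200, 1: 100, 2: 300}
```

The TODO in the snippet acknowledges that a single fixed rowSize is a coarse estimate; variable-width rows would make the per-partition totals inaccurate.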
Contributor:

Is it too costly to trigger an action to compute all the RDDs?

@AngersZhuuuu (Author):
> Is it too costly to trigger an action to compute all the RDDs?

This runs after the RDDs have been computed; we coalesce the already-computed RDD.

@AngersZhuuuu (Author):

> Is it too costly to trigger an action to compute all the RDDs?

Sorry, I made a mistake: it will recompute the last stage, but it won't recompute all stages.

@AngersZhuuuu AngersZhuuuu changed the title [WIP][SPARK-20538][SQL] Control spark sql output small file by merge small partition [WIP][SPARK-30538][SQL] Control spark sql output small file by merge small partition Jan 19, 2020
@AngersZhuuuu (Author):

> The JIRA number "SPARK-20538" looks wrong — do you mean "SPARK-30538"?

Yea, thanks

@AmplabJenkins:
Can one of the admins verify this patch?

github-actions bot commented Jul 8, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 8, 2020
@github-actions github-actions bot closed this Jul 9, 2020
SubhamSinghal commented May 10, 2024

Will someone be able to offer review here?
