Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes #6588
Conversation
@anuragmantri + @aokolnychyi + @rdblue - This is a bit of a big default behavior change, but it's been biting a lot of our users lately and the change is relatively safe.
// Controls how many physical file deletes to execute in parallel when not otherwise specified
public static final String DELETE_PARALLELISM = "driver-delete-default-parallelism";
public static final String DELETE_PARALLELISM_DEFAULT = "25";
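A minimal sketch of how a property like this might be consumed: read the configured value (falling back to the default of 25) and size a fixed thread pool for driver-side deletes. The property names match the snippet above, but the pool construction here is illustrative, not the actual Iceberg implementation.

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DeletePoolSketch {
  static final String DELETE_PARALLELISM = "driver-delete-default-parallelism";
  static final String DELETE_PARALLELISM_DEFAULT = "25";

  // Resolve the parallelism from a config map, falling back to the default of 25
  public static int deleteParallelism(Map<String, String> conf) {
    return Integer.parseInt(
        conf.getOrDefault(DELETE_PARALLELISM, DELETE_PARALLELISM_DEFAULT));
  }

  public static void main(String[] args) throws Exception {
    int threads = deleteParallelism(Map.of()); // no override configured -> 25
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    AtomicInteger deleted = new AtomicInteger();
    for (int i = 0; i < 100; i++) {
      pool.submit(deleted::incrementAndGet); // stand-in for one physical file delete
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.SECONDS);
    System.out.println(threads + " " + deleted.get()); // 25 100
  }
}
```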
With S3's request throttling at around 4k requests per second, this gives us a lot of headroom. Assuming a 50 ms response time per delete, each thread can issue about 20 requests per second, so the throttle allows roughly 4000 / 20 = 200 concurrent requests.
Another option is to also incorporate the "bulk delete" APIs, but that would only help with S3-based filesystems.
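The back-of-envelope math above can be checked directly. The throttle limit and latency figures are the comment's assumptions, not measured values:

```java
public class ThrottleMath {
  public static void main(String[] args) {
    int s3LimitPerSec = 4000;      // assumed S3 request throttle, requests/second
    int responseTimeMs = 50;       // assumed latency of one DELETE request

    // One thread issuing sequential 50 ms requests does 1000 / 50 = 20 requests/s
    int requestsPerThreadPerSec = 1000 / responseTimeMs;

    // Threads that would saturate the throttle: 4000 / 20 = 200
    int maxConcurrentThreads = s3LimitPerSec / requestsPerThreadPerSec;
    System.out.println(maxConcurrentThreads); // 200, so a default of 25 is well under the limit
  }
}
```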
summary.deletedFile(path, type);
});

withDefaultDeleteService(executorService, (deleteService) -> {
This covers ExpireSnapshots and DropTable with Purge (via DeleteReachableFilesAction)
.suppressFailureWhenFinished()
.onFailure((file, exc) -> LOG.warn("Failed to delete file: {}", file, exc))
.run(deleteFunc::accept);
withDefaultDeleteService(deleteExecutorService, deleteService ->
This change covers RemoveOrphanFiles
anuragmantri left a comment
Thanks for this change @RussellSpitzer. It looks safe to me.
I left a minor comment. Also, the build failed with code format violations; you may want to run gradlew spotlessApply on the file.
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java
Let me take a look soon.
Talking with @aokolnychyi, we decided we are going to take a slightly broader approach here. Rather than allowing each action to define its own method of deleting and use custom executor services, we will fall back to each FileIO's bulk delete support. We will then add a basic parallel default delete to HDFS for bulk delete. After doing this we will deprecate all the "parallel delete" methods from the actions and procedures, instead instructing users to use their IO-specific parallelism controls.
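The fallback described here could look roughly like the sketch below: use the IO's bulk delete when it advertises one (as S3FileIO does), otherwise fan the deletes out over a thread pool. The interfaces and the deleteAll helper are simplified stand-ins, not the exact Iceberg API shape.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BulkDeleteFallback {
  // Simplified stand-ins for Iceberg's FileIO abstractions
  public interface FileIO { void deleteFile(String path); }
  public interface SupportsBulkOperations extends FileIO {
    void deleteFiles(Iterable<String> paths);
  }

  // Prefer the IO's native bulk delete; otherwise delete in parallel on a pool
  public static void deleteAll(FileIO io, List<String> paths, int parallelism)
      throws InterruptedException {
    if (io instanceof SupportsBulkOperations) {
      ((SupportsBulkOperations) io).deleteFiles(paths); // e.g. one S3 DeleteObjects call
    } else {
      ExecutorService pool = Executors.newFixedThreadPool(parallelism);
      for (String path : paths) {
        pool.submit(() -> io.deleteFile(path));
      }
      pool.shutdown();
      pool.awaitTermination(10, TimeUnit.SECONDS);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicInteger deleted = new AtomicInteger();
    FileIO plainIo = path -> deleted.incrementAndGet(); // HDFS-like IO, no bulk support
    deleteAll(plainIo, List.of("a", "b", "c", "d"), 2);
    System.out.println(deleted.get()); // 4
  }
}
```

With this split, the actions no longer need their own executor-service plumbing; parallelism becomes a property of the IO rather than of each procedure.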
Thanks for clarifying @RussellSpitzer. I think it makes a ton of sense to leave the specifics of bulk vs. parallel deletes to the FileIO abstraction. In this case, we leverage bulk delete wherever possible (S3) and use parallel deletions for file systems that don't support bulk ops, like HDFS, to improve the throughput of deletions.
An issue we've run into frequently is that several Spark actions perform deletes on the driver with a default parallelism of 1. This is quite slow for S3 and painfully slow for very large tables. To fix this, we change the default behavior to always use multithreaded deletes.
The default for all Spark-related actions can then be changed with a SQL conf parameter, as well as within each command via its own parallelism parameter.