Spark: Parallelize initializing readTasks when localityPreferred is true #2800
Conversation
    .stopOnFailure()
    .executeWith(readTasksInitExecutorService)
    .run(task -> {
      readTasks.add(new ReadTask<>(
This can't alter readTasks from multiple threads because array lists aren't thread-safe.
Oh, that was a mistake. Fixed it!
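The thread-safety concern above is worth seeing concretely. Below is a minimal, hypothetical sketch (not the PR's actual code; class and variable names are invented) showing the pattern the fix adopts: worker threads adding to a plain `ArrayList` must hold the same lock, because `ArrayList.add` is not safe to call concurrently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SynchronizedAddSketch {
    // Fill a plain ArrayList from multiple threads, guarding each add
    // with the list itself as the lock (ArrayList is not thread-safe).
    public static List<Integer> parallelFill(int n) throws InterruptedException {
        List<Integer> results = new ArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < n; i++) {
            final int task = i;
            pool.submit(() -> {
                synchronized (results) {
                    results.add(task);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> out = parallelFill(1000);
        // every task lands exactly once, though the order is nondeterministic
        if (out.size() != 1000) throw new AssertionError("lost adds: " + out.size());
    }
}
```

Without the `synchronized` block, concurrent `add` calls can lose elements or throw `ArrayIndexOutOfBoundsException` during internal resizing.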
Oops, I also fixed this a few days ago and created another PR #2803 just a moment ago.
        task, tableBroadcast, expectedSchemaString, caseSensitive,
        localityPreferred, InternalRowReaderFactory.INSTANCE);
    synchronized (readTasks) {
      readTasks.add(readTask);
Maybe it's better to use an array instead of a list, so we can avoid adding a lock or a temporary variable?
Yes, using an array would avoid the lock, but the time spent holding the lock should be insignificant compared to the block-location lookup. Also, the return value is a list, and I didn't want to do all the conversion.
Yeah! But another reason is that we can keep the style consistent with the spark3 module, which has a small optimization using an array, as at :R153. What do you think?
Okay!
    InputPartition<ColumnarBatch> readTask = new ReadTask<>(
        task, tableBroadcast, expectedSchemaString, caseSensitive,
        localityPreferred, new BatchReaderFactory(batchSize));
    synchronized (readTasks) {
Why not use a concurrent list?
At first I wanted to use a concurrent list. But that would change the return value from a plain list to a concurrent one, and since the subsequent operations on this list are read-only, the concurrent version might hurt performance.
I changed the code to use arrays like in spark3 to avoid synchronization.
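The array-based approach the author switched to can be sketched as follows. This is an illustrative, simplified example (names are hypothetical, not the PR's actual code): because each worker writes to a distinct, pre-assigned array index, no lock is needed at all.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ArrayFillSketch {
    // Initialize a fixed-size array in parallel. Each worker writes only
    // its own slot, so the writes never conflict and need no lock.
    public static String[] parallelInit(int n) throws InterruptedException {
        String[] tasks = new String[n]; // size is known up front
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < n; i++) {
            final int index = i;
            pool.submit(() -> tasks[index] = "task-" + index);
        }
        pool.shutdown();
        // awaitTermination establishes the happens-before edge that makes
        // the workers' writes visible to the reading thread
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return tasks;
    }

    public static void main(String[] args) throws InterruptedException {
        String[] out = parallelInit(8);
        for (int i = 0; i < out.length; i++) {
            if (!("task-" + i).equals(out[i])) throw new AssertionError("slot " + i);
        }
    }
}
```

This mirrors the `Tasks.range(readTasks.length)` pattern in the diff above: the range index doubles as the array slot, which also keeps the result in the original task order.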
Force-pushed from 7000ecc to 1313fa9
    Tasks.range(readTasks.length)
        .stopOnFailure()
        .executeWith(readTasksInitExecutorService)
        .run(index -> {
The braces are unnecessary because the body is a single expression; can you remove them?
Done
    private Filter[] pushedFilters = NO_FILTERS;
    private final boolean localityPreferred;
    private final int batchSize;
    private ExecutorService readTasksInitExecutorService = DEFAULT_READTASKS_INIT_EXECUTOR_SERVICE;
I don't think that there is a need for this field and I'd prefer not to add mutable state. Can you refactor this to call executeWith(ThreadPools.getWorkerPool()) instead? You can pass null to that method so it can also check localityPreferred inline:
.executeWith(localityPreferred ? ThreadPools.getWorkerPool() : null)
I like this suggestion because it really simplifies the code path without introducing any static or local variables.
That really simplifies the code. Thanks @rdblue
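The inline-executor pattern suggested above can be sketched with plain JDK types. This is a hypothetical stand-in for Iceberg's `Tasks.range(...).executeWith(service).run(body)` API (the helper below is invented for illustration): passing `null` as the executor means "run on the calling thread", which is exactly what makes `localityPreferred ? ThreadPools.getWorkerPool() : null` work as a one-liner.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

public class ConditionalExecutorSketch {
    // Hypothetical stand-in for Tasks.range(n).executeWith(service).run(body).
    // A null executor falls back to running sequentially on the caller.
    static void runRange(int n, ExecutorService service, IntConsumer body)
            throws InterruptedException {
        if (service == null) {
            for (int i = 0; i < n; i++) {
                body.accept(i);
            }
            return;
        }
        for (int i = 0; i < n; i++) {
            final int index = i;
            service.submit(() -> body.accept(index));
        }
        service.shutdown();
        service.awaitTermination(1, TimeUnit.MINUTES);
    }

    public static void main(String[] args) throws InterruptedException {
        boolean localityPreferred = true;
        AtomicInteger count = new AtomicInteger();
        // choose the pool inline, as the review suggests; no field needed
        runRange(16, localityPreferred ? Executors.newFixedThreadPool(4) : null,
                 index -> count.incrementAndGet());
        if (count.get() != 16) throw new AssertionError("expected 16, got " + count.get());
    }
}
```

Shutting the pool down inside the helper is a simplification for this sketch; a shared worker pool like `ThreadPools.getWorkerPool()` would of course be left running.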
Thanks to @jshmchenxi for the optimization work for Spark on HDFS (I just noticed this and #2577; it's impressive). Do we also need to parallelize SparkMicroBatchStream#planInputPartitions in this PR?
@openinx Thanks for the reminder, I've added parallelization to SparkMicroBatchStream#planInputPartitions in this PR.
Thanks, @jshmchenxi! I merged this.
Follows #2577: use a thread pool to initialize readTasks when Spark locality is preferred.
Before this, the Spark planning phase could be slow because it used a single thread to obtain the block locations of every file in the scan.
More information can be found in this comment
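The bottleneck this PR addresses, fetching block locations for many files one at a time, can be illustrated with a generic fan-out sketch. The lookup function below is a hypothetical stand-in for an HDFS `getFileBlockLocations`-style call (names invented; this is not Iceberg's code): issuing all lookups concurrently and joining the futures turns N sequential round trips into roughly N / poolSize.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class ParallelLocationsSketch {
    // Hypothetical stand-in for a per-file block-location lookup,
    // which in practice is a slow NameNode round trip.
    static String lookupLocations(String file) {
        return "host-for-" + file;
    }

    public static List<String> planWithLocations(List<String> files) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            // issue every lookup concurrently instead of one at a time
            List<CompletableFuture<String>> futures = files.stream()
                .map(f -> CompletableFuture.supplyAsync(() -> lookupLocations(f), pool))
                .collect(Collectors.toList());
            // join preserves the input order, so results line up with files
            return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> out = planWithLocations(List.of("a", "b"));
        if (!out.get(0).equals("host-for-a")) throw new AssertionError(out.toString());
    }
}
```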