Conversation

@jshmchenxi
Contributor

Follows #2577: use a thread pool to initialize readTasks when Spark locality is preferred.
Before this change, the Spark planning phase could be slow because it used a single thread to obtain the block locations of all files in the scan.
More information can be found in this comment.
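For readers skimming the thread, a minimal before/after sketch of the idea (Tasks is org.apache.iceberg.util.Tasks; scanTasks, readTasksInitExecutorService, and the ReadTask constructor follow the hunks quoted below, while the surrounding scaffolding is assumed):

// Before (sketch): ReadTask construction, and with it each HDFS
// block-location lookup, runs one task at a time on the driver.
for (CombinedScanTask task : scanTasks) {
  readTasks.add(new ReadTask<>(task, tableBroadcast, expectedSchemaString,
      caseSensitive, localityPreferred, InternalRowReaderFactory.INSTANCE));
}

// After (sketch): the same work fanned out to a thread pool, so the
// block-location lookups run concurrently.
Tasks.foreach(scanTasks)
    .stopOnFailure()
    .executeWith(readTasksInitExecutorService)
    .run(task -> { /* construct and collect the ReadTask; see the hunks below */ });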

github-actions bot added the spark label on Jul 9, 2021
    .stopOnFailure()
    .executeWith(readTasksInitExecutorService)
    .run(task -> {
      readTasks.add(new ReadTask<>(
Contributor

This can't safely alter readTasks from multiple threads, because ArrayList isn't thread-safe.

Contributor Author

Oh, that was a mistake. Fixed it!

@southernriver
Contributor

Oops, I also fixed this a few days ago and created another PR, #2803, just a moment ago.

        task, tableBroadcast, expectedSchemaString, caseSensitive,
        localityPreferred, InternalRowReaderFactory.INSTANCE);
    synchronized (readTasks) {
      readTasks.add(readTask);
Contributor

Maybe it's better to use an array instead of a list, so we can avoid adding a lock or a temporary variable?

Contributor Author

Yes, using an array would avoid the lock, but the time spent in the lock should be insignificant compared to the get-block-locations operation. Also, the return value is a list, and I didn't want to do all the transformation.

Contributor

Yeah! But another reason is that we can keep the style consistent with the spark3 module, which has a small optimization using an array, as at R153. What do you think?

Contributor Author

Okay!

    InputPartition<ColumnarBatch> readTask = new ReadTask<>(
        task, tableBroadcast, expectedSchemaString, caseSensitive,
        localityPreferred, new BatchReaderFactory(batchSize));
    synchronized (readTasks) {
Contributor

Why not use a concurrent list?

Contributor Author

At first I wanted to use a concurrent list, but that would change the return value from a normal list to a concurrent list, and the subsequent operations on that list should only be reads. It might affect performance.
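As an aside, the two alternatives weighed in this thread look roughly like this (a sketch; readTaskList, readTaskArray, and scanTaskCount are illustrative names, not from the diff):

// (a) java.util.Collections.synchronizedList: adds become thread-safe, but
// every later read also synchronizes, and callers receive a List with
// hidden locking behavior.
List<InputPartition<InternalRow>> readTaskList =
    Collections.synchronizedList(new ArrayList<>());

// (b) Plain array indexed by task position: no locking needed at all,
// because each worker thread writes to its own distinct slot.
InputPartition<InternalRow>[] readTaskArray = new InputPartition[scanTaskCount];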

Contributor Author

I changed the code to use arrays like in spark3 to avoid synchronization.

@jshmchenxi force-pushed the multi-readtask-init branch from 7000ecc to 1313fa9 on July 11, 2021
Tasks.range(readTasks.length)
    .stopOnFailure()
    .executeWith(readTasksInitExecutorService)
    .run(index -> {
Contributor

The block is unnecessary because there is only one expression; can you remove it?

Contributor Author

Done
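With the block removed, the loop presumably reads like this (a sketch: scanTasks is an assumed local holding the scan's CombinedScanTasks, and the constructor arguments follow the earlier hunks):

Tasks.range(readTasks.length)
    .stopOnFailure()
    .executeWith(readTasksInitExecutorService)
    // Each index writes its own array slot, so no synchronization is needed.
    .run(index -> readTasks[index] = new ReadTask<>(
        scanTasks.get(index), tableBroadcast, expectedSchemaString,
        caseSensitive, localityPreferred, InternalRowReaderFactory.INSTANCE));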

private Filter[] pushedFilters = NO_FILTERS;
private final boolean localityPreferred;
private final int batchSize;
private ExecutorService readTasksInitExecutorService = DEFAULT_READTASKS_INIT_EXECUTOR_SERVICE;
Contributor

I don't think that there is a need for this field and I'd prefer not to add mutable state. Can you refactor this to call executeWith(ThreadPools.getWorkerPool()) instead? You can pass null to that method so it can also check localityPreferred inline:

    .executeWith(localityPreferred ? ThreadPools.getWorkerPool() : null)

Member

I like this suggestion because it really simplifies the code path without introducing any static or local variables.

Contributor Author

That really simplifies the code. Thanks @rdblue
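Putting both suggestions together, the final shape is presumably (a sketch; createReadTask is a hypothetical stand-in for the ReadTask constructor call shown earlier):

Tasks.range(readTasks.length)
    .stopOnFailure()
    // Tasks runs on the current thread when given a null executor, so the
    // worker pool is engaged only when locality lookups are actually needed.
    .executeWith(localityPreferred ? ThreadPools.getWorkerPool() : null)
    .run(index -> readTasks[index] = createReadTask(index));

This removes the mutable readTasksInitExecutorService field and needs no lock, since each index writes a distinct array slot.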

@openinx
Member

openinx commented Jul 12, 2021

Thanks to @jshmchenxi for the optimization work for Spark on HDFS (I just noticed this and #2577; it's impressive).

Do we also need to parallelize SparkMicroBatchStream#planInputPartitions in this PR?

@jshmchenxi
Contributor Author

@openinx Thanks for the reminder! I've added parallelization to SparkMicroBatchStream#planInputPartitions in this PR.

@rdblue merged commit 0bb89d0 into apache:master on Jul 12, 2021
@rdblue
Contributor

rdblue commented Jul 12, 2021

Thanks, @jshmchenxi! I merged this.
