
Conversation

@southernriver (Contributor) commented Jul 10, 2021

For Spark 2/Spark 3, it always takes about 30 minutes to reach the Job Submitted state when a scan covers over 100,000 files; the more files, the longer the driver takes to plan the scan. The current code takes the data-locality strategy into account and calls getBlockLocations sequentially in a single thread.
Here is the thread log:

[screenshot: thread log]

We can use multithreading to solve this problem.
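To make the idea concrete, here is a minimal, illustrative sketch (not the PR's actual diff): a dedicated ForkJoinPool drives a parallel stream over the task indices, and resolveLocality stands in for the code path that ends in getBlockLocations. The pool size and all names here are assumptions.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class ParallelLocalitySketch {

  // Stand-in for the per-task work that ultimately calls getBlockLocations.
  static String resolveLocality(int taskIndex) {
    return "preferred-host-for-task-" + taskIndex;
  }

  public static void main(String[] args) throws ExecutionException, InterruptedException {
    int taskSize = 184_082;                   // roughly the read-task count from the manual test below
    String[] localities = new String[taskSize];
    ForkJoinPool pool = new ForkJoinPool(16); // dedicated pool; the size is an assumption
    try {
      // Running the parallel stream inside pool.submit() keeps the work off the
      // JVM-wide common ForkJoinPool and fans the lookups out across the pool's threads.
      pool.submit(() ->
          IntStream.range(0, taskSize)
              .parallel()
              .forEach(i -> localities[i] = resolveLocality(i)))
          .get();
    } finally {
      pool.shutdown();
    }
    System.out.println("Resolved " + localities.length + " preferred localities");
  }
}
```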

Manual Test

Before this:

It took about 30 minutes.

After this:

21/07/09 18:41:12 INFO Reader: It took 164 s to construct 184082 readTasks with localityPreferred = true.

@southernriver (Contributor, Author)

cc @StefanXiepj

@jshmchenxi (Contributor)

@southernriver Hi, you can refer to the comments in #2577 if you are interested, like using org.apache.iceberg.util.ThreadPools instead of creating a thread pool from scratch.
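For reference, a rough sketch of that suggestion, assuming the ThreadPools.newWorkerPool and Tasks.range utilities in iceberg-core; the helper, the pool name, and the pool size are illustrative, not existing Iceberg code:

```java
import java.util.concurrent.ExecutorService;
import java.util.function.IntFunction;
import org.apache.iceberg.util.Tasks;
import org.apache.iceberg.util.ThreadPools;

public class ThreadPoolsSketch {

  // Illustrative helper: build one array entry per index on an Iceberg worker pool
  // instead of hand-rolling an executor. Each index is queued as its own task.
  static <T> void fillInParallel(T[] results, IntFunction<T> buildOne) {
    ExecutorService pool = ThreadPools.newWorkerPool("iceberg-plan-read-tasks", 16);
    try {
      Tasks.range(results.length)
          .executeWith(pool)
          .stopOnFailure()
          .run(index -> results[index] = buildOne.apply(index));
    } finally {
      pool.shutdown();
    }
  }
}
```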

@southernriver (Contributor, Author)

> @southernriver Hi, you can refer to the comments in #2577 if you are interested, like using org.apache.iceberg.util.ThreadPools instead of creating a thread pool from scratch.

Got it. I'm new to Iceberg, thank you.

@kbendick (Contributor) commented Jul 12, 2021

Thank you for your submission @southernriver!

This seems like a somewhat large behavior change. Can you open a GitHub issue to track this and for general discussion? Or possibly this has been discussed on the mailing list, in which case you could open an issue and link to the dev list discussion. Adding an example query or situation where you encounter this would also help greatly. I might be thinking of something else, but I thought planning becomes distributed once a certain heuristic is passed.

One thing I'd like to see discussed is the ability to opt into this behavior. I'd be more comfortable with this change if we could (at least initially) allow users to opt into or out of using the thread pool to do the planning; I imagine there are scenarios in which it's a detriment. An issue (and then either the dev list or that issue) would be a much better place to discuss that, in my opinion.
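As one concrete shape for such an opt-in (purely hypothetical; this property name does not exist in Iceberg), the parallel planning path could be gated behind a read property that defaults to off:

```java
import java.util.Map;

public class PlanningOptions {

  // Hypothetical property name and default; not an existing Iceberg table property.
  static final String PARALLEL_PLANNING = "read.plan.parallel-locality.enabled";
  static final boolean PARALLEL_PLANNING_DEFAULT = false; // opt-in: off unless explicitly set

  static boolean parallelPlanningEnabled(Map<String, String> tableProperties) {
    return Boolean.parseBoolean(
        tableProperties.getOrDefault(PARALLEL_PLANNING, String.valueOf(PARALLEL_PLANNING_DEFAULT)));
  }
}
```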

I'm not necessarily averse to this change, but I do think having an issue to track a relatively large change in functionality (single-threaded to multithreaded) would be nice, as this is a pretty large change in behavior with no way to really opt out (and no documentation on the new table property).

Also, if you could more completely describe the query you're encountering these times with and what your dataset looks like (ideally in the issue), that would be great. It's possible that with so many files, OPTIMIZE and other table maintenance tasks need to be run.

/**
 * This is called in the Spark Driver when data is to be materialized into {@link ColumnarBatch}
 */
@SuppressWarnings({"checkstyle:LocalVariableName", "checkstyle:RegexpSinglelineJava"})
Contributor

What variable names are causing these checkstyle issues? Would it be possible to simply change them, instead of suppressing the warning for the whole function?

Contributor Author

"Long startTime = System.currentTimeMillis();" would cause "checkstyle:LocalVariableName".
And I will solve another checkstyle of "RegexpSinglelineJava"

Contributor

Ah. In that case, renaming startTime so that it doesn't shadow the outer variable would be preferable.

InputPartition<ColumnarBatch>[] readTasks = new InputPartition[taskSize];
Long startTime = System.currentTimeMillis();
try {
pool.submit(() -> IntStream.range(0, taskSize).parallel()
@kbendick (Contributor) Jul 12, 2021

Should the construction of all read tasks be done in a single submit to the thread pool? I don’t see any way to slow the parallelism down here so as to not potentially overwhelm the name node.

For example, I would have expected each of the ranges in the int stream to be submitted to the pool individually, so that tasks queue up waiting for their turn. Here, it looks like the parallelism is rather unbounded. Totally open to reading this wrong (it is Sunday for me, after all!).
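To illustrate the shape being suggested (a sketch only, not the PR's code): submitting each index as its own task means at most poolSize lookups run concurrently, and the rest wait in the executor's queue. The names and sizes below are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedSubmitSketch {

  // Stand-in for building a single read task (the part that talks to the NameNode).
  static String buildOne(int index) {
    return "read-task-" + index;
  }

  public static void main(String[] args) throws Exception {
    int taskSize = 1_000;
    int poolSize = 8;                                    // at most 8 concurrent NameNode calls
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    String[] results = new String[taskSize];
    try {
      List<Future<?>> futures = new ArrayList<>(taskSize);
      for (int i = 0; i < taskSize; i++) {
        final int index = i;
        // One submit per index: tasks queue up and run with bounded concurrency.
        futures.add(pool.submit(() -> results[index] = buildOne(index)));
      }
      for (Future<?> future : futures) {
        future.get();                                    // surfaces any failure
      }
    } finally {
      pool.shutdown();
    }
  }
}
```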

}).collect(Collectors.toList())).get();
} catch (Exception e) {
e.printStackTrace();
System.exit(-1);
@kbendick (Contributor) Jul 12, 2021

I would avoid explicitly calling System.exit and instead let the exception bubble up (possibly catching it and then rethrowing it with additional information added or as another exception type). This would make it easier for end users to track down their exceptions, particularly when working in a notebook where there's limited space to display stack traces already.

Is there a specific reason you chose to call System.exit that possibly I'm not aware of?

Contributor Author

Thank you for reviewing this! I wanted to throw the exception at first, but doing so would also require changing org.apache.spark.sql.connector.read.Batch and other code outside Iceberg. Here is the error message:

planInputPartitions()' in 'org.apache.iceberg.spark.source.SparkBatchScan' clashes with 'planInputPartitions()' in 'org.apache.spark.sql.connector.read.Batch'; overridden method does not throw 'java.lang.Exception'
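One common way around that clash (a sketch, not the PR's code) is to wrap the checked exceptions in an unchecked one inside planInputPartitions, so the overridden signature stays unchanged while the original cause still bubbles up. The helper and messages below are illustrative.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;

public class RethrowSketch {

  // Illustrative helper: block on the planning work and rethrow failures unchecked
  // instead of calling System.exit, so callers see the real stack trace.
  static void planReadTasks(ExecutorService pool, Runnable buildAllTasks) {
    try {
      pool.submit(buildAllTasks).get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException("Interrupted while planning read tasks", e);
    } catch (ExecutionException e) {
      throw new RuntimeException("Failed to plan read tasks", e.getCause());
    }
  }
}
```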

@kbendick (Contributor) commented Jul 13, 2021

@southernriver I was mistaken about the existence of distributed job planning (and hence my initial concern about how necessary this PR was).

I thought that there was distributed job planning already, but the PR is still open: #1421

I do know that this is being reprioritized again, but I'm not sure of any official timeline on that. Wanted to let you know. 🙂

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 17, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Jul 24, 2024
