Conversation

@yabola
Contributor

@yabola yabola commented Nov 8, 2022

No description provided.

@github-actions github-actions bot added the CORE label Nov 8, 2022
@mridulm
Contributor

mridulm commented Nov 8, 2022

+CC @otterc

@yabola
Contributor Author

yabola commented Nov 8, 2022

One thing that I know needs to be addressed: some merged-shuffle data info is not saved on the driver because the shuffle is too small (controlled by spark.shuffle.push.minShuffleSizeToWait).
Please see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2295
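
To illustrate the behaviour (a simplified sketch only, not the real DAGScheduler code; the names and threshold value are just examples):

```scala
// Illustrative sketch: when the total shuffle output is below
// spark.shuffle.push.minShuffleSizeToWait, merge results are not registered
// on the driver, so the driver never records which partitions were merged.
object MergeFinalizeSketch {
  def shouldRegisterMergeResults(totalShuffleBytes: Long, minShuffleSizeToWait: Long): Boolean =
    totalShuffleBytes >= minShuffleSizeToWait

  def main(args: Array[String]): Unit = {
    // A 10 MiB shuffle with a 500 MiB threshold: merge results are skipped,
    // so the driver has no merged reduceIds to send later.
    println(shouldRegisterMergeResults(10L << 20, 500L << 20)) // false
  }
}
```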

@yabola
Contributor Author

yabola commented Nov 9, 2022

I am wondering whether the driver needs to pass the merged reduceIds to the external shuffle service (but currently the driver cannot fully record the merged info), or whether the shuffle service should record the merged reduceIds itself, so that the driver only needs to pass the shuffleId and other identifying information.

@yabola
Contributor Author

yabola commented Nov 10, 2022

I am wondering whether the driver needs to pass the merged reduceIds to the external shuffle service (but currently the driver cannot fully record the merged info), or whether the shuffle service should record the merged reduceIds itself, so that the driver only needs to pass the shuffleId and other identifying information.

I decided to change the driver to not send reduceIds (it only sends the shuffleId and appId), because only the shuffle service ultimately knows which merged shuffle data it stores, regardless of how the driver processes the message.
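
Roughly, the idea looks like this (a hypothetical sketch; the message and method names below are illustrative, not the final protocol classes):

```scala
import scala.collection.mutable

// Sketch: the driver only identifies the application and the shuffle, while the
// shuffle service itself knows which reduce partitions it actually merged and
// therefore which files to delete.
case class RemoveShuffleMergeSketch(appId: String, shuffleId: Int)

class MergeCleanupSketch {
  // shuffleId -> reduceIds this shuffle service finished merging locally
  private val finalizedReduceIds = mutable.Map[Int, Set[Int]]()

  def removeShuffleMerge(msg: RemoveShuffleMergeSketch): Unit = {
    // The service, not the driver, decides which merged partition files exist.
    finalizedReduceIds.remove(msg.shuffleId).getOrElse(Set.empty).foreach { reduceId =>
      deleteMergedPartitionFiles(msg.appId, msg.shuffleId, reduceId)
    }
  }

  private def deleteMergedPartitionFiles(appId: String, shuffleId: Int, reduceId: Int): Unit = {
    // placeholder: remove the merged data/index/meta files for this partition
  }
}
```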

@mridulm
Contributor

mridulm commented Nov 10, 2022

This is related quite a lot to #37922 by @wankunde
That PR seems to be having build issues, and so has not made progress.

@AmplabJenkins

Can one of the admins verify this patch?

@yabola
Contributor Author

yabola commented Nov 11, 2022

@mridulm Yes... these two issues are similar. @wankunde, can I continue working on my PR for this issue?

@yabola
Contributor Author

yabola commented Nov 16, 2022

@mridulm Per your comment #37922 (comment), I want to improve this part of the deletion logic.

Contributor Author

@yabola yabola left a comment

~~@mridulm @wankunde @otterc I'm not sure if I missed any logic; please help review my code, thanks~ I will improve my code style later.
For now I haven't changed the code in BlockManagerMasterEndpoint as #37922 does. Could this be split into two PRs, where I implement the shuffle service part first and @wankunde finishes the rest, since he has already done it?~~

Contributor Author

Unified the push-based shuffle identification variables here; they will be used by both the YARN external shuffle service and the Spark core module.

Contributor Author

@yabola yabola Nov 16, 2022

I checked the appAttemptShuffleMergeId in the code before.
I think if we want to delete a partition's merged data, then we should also delete the corresponding ShuffleMergeId in the DB (otherwise an inconsistency will occur when restoring shuffle info from the DB).
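
A rough sketch of what I mean (the key layout and DB interface here are hypothetical, not the actual recovery schema): the file deletion and the DB-entry deletion have to go together.

```scala
import scala.collection.mutable

// Sketch: deleting the merged partition files and the persisted shuffleMergeId
// entry must happen together, otherwise a restarted shuffle service would
// reload a merge id from the DB whose files no longer exist on disk.
class MergeStateSketch {
  private val recoveryDb = mutable.Map[String, Int]()   // stand-in for the on-disk DB

  def removeMergedShuffle(appId: String, shuffleId: Int): Unit = {
    deleteMergedPartitionFiles(appId, shuffleId)            // remove data/index/meta files
    recoveryDb.remove(s"shuffleMergeId:$appId:$shuffleId")  // and the matching DB entry
  }

  private def deleteMergedPartitionFiles(appId: String, shuffleId: Int): Unit = {
    // placeholder for the actual file deletion
  }
}
```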

Contributor Author

@yabola yabola Nov 16, 2022

And I think that when deleting partitions, we should no longer keep the shuffleMergeId stored in the DB.

Contributor Author

The previous code is difficult to understand because of the Scala syntax. For example, in mapOutputTracker.shuffleStatuses.get(shuffleId).**foreach**, the foreach is operating on an Option, not iterating over a collection; likewise in externalBlockStoreClient.map, the map is applied to an Option, not a collection.

I didn't change the code logic, just the style.
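
A small self-contained example of what I mean (not code from the PR):

```scala
// Option.foreach / Option.map run the given function at most once, on the
// contained value; they are null-safe accessors rather than loops.
object OptionStyleExample {
  def main(args: Array[String]): Unit = {
    val status: Option[String] = Some("finalized")
    val missing: Option[String] = None

    status.foreach(println)   // runs once, prints "finalized"
    missing.foreach(println)  // runs zero times

    // The equivalent, more explicit form:
    status match {
      case Some(s) => println(s)
      case None    => // nothing to do
    }
  }
}
```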

Contributor

What if the shuffle statuses do not exist?

Contributor Author

@yabola yabola Nov 24, 2022

What if the shuffle statuses do not exist?

I think that cannot happen; please see the case match code.

@mridulm
Contributor

mridulm commented Nov 17, 2022

I will try to get to this later this week, do let me know if you are still working on it though.

@mridulm
Contributor

mridulm commented Nov 21, 2022

One thing that I know needs to be addressed: some merged-shuffle data info is not saved on the driver because the shuffle is too small (controlled by spark.shuffle.push.minShuffleSizeToWait).
Please see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2295

In this case, we should fire a remove immediately - we are not going to use that merged data for this app anyway ...

@mridulm
Contributor

mridulm commented Nov 21, 2022

@yabola, there is quite a lot of nontrivial overlap between this PR and what @wankunde's PR is trying to do.
It would be great if you both could coordinate on this - I would love to get this functionality merged before we get closer to the code freeze for 3.4.

@yabola
Contributor Author

yabola commented Nov 21, 2022

@mridulm
I will speed up and finish the unfinished parts of the previous PR in this PR.
Based on your comments in the previous PR, #37922 (comment) and #37922 (comment), I have addressed these points in my PR; please help review whether the approach is suitable: the code to remove shuffle merge state
and the code to clean up shuffle files.

Yes, there is nontrivial overlap with #37922; I can cherry-pick some of that code and address the review comments in this PR.

@yabola
Contributor Author

yabola commented Nov 21, 2022

@mridulm @wankunde @otterc I'm not sure if I missed any logic; please help review my code, thanks~ I will improve my code style later. For now I haven't changed the code in BlockManagerMasterEndpoint as #37922 does. Could this be split into two PRs, where I implement the shuffle service part first and @wankunde finishes the rest, since he has already done it?

I wrote some thoughts on these changes here.

@mridulm
Contributor

mridulm commented Nov 21, 2022

Thanks for taking over the PR, @yabola!
I am heading out on vacation soon, and I am not sure if @otterc will have bandwidth to take a look in the meantime.
I will definitely circle back to this once you are done and I am back.

@mridulm
Contributor

mridulm commented Nov 21, 2022

To add, @wankunde's PR is very close to being done.
One approach would be:
a) We get the protocol changes and the immediate implementation from that PR merged (once the pending comments are addressed), with you taking over that PR and completing it.
b) We follow it up with other changes to make it more robust, for example the issue you identified. There might be others as well.

Thoughts?

@yabola
Contributor Author

yabola commented Nov 21, 2022

One thing that I know needs to be addressed: some merged-shuffle data info is not saved on the driver because the shuffle is too small (controlled by spark.shuffle.push.minShuffleSizeToWait). Please see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2295

@mridulm Sorry, in my previous implementation I needed to pass the reduceIds to the external shuffle service, but I found a problem: the driver cannot record the complete set of merged reduceIds (see my comment for the reason)...
I have since changed my implementation, so this may no longer be a problem (we can save the merged reduceIds in the shuffle service; please see the code).
But it would still be better if we could clean up this unused merged data early.

Contributor

What if the shuffle statuses do not exist?

mergeStatuses = new MergeStatuses(msg.shuffleId, msg.shuffleMergeId,
    bitmaps.toArray(new RoaringBitmap[bitmaps.size()]), Ints.toArray(reduceIds),
    Longs.toArray(sizes));
appShuffleInfo.shuffles.get(msg.shuffleId).setFinalizedPartitions(Ints.toArray(reduceIds));
Contributor

The finalized partitions will be empty after a shuffle service restart, which will cause the merged shuffle files to leak.

Contributor Author

@yabola yabola Nov 23, 2022

Thanks for your review, @wankunde!
Yes, to handle this situation completely, we would need to store the finalized partitions in the DB.
On the other hand, when the application is removed, all of its merged data will eventually be cleaned up.
I'm not sure whether we can just ignore this case to simplify the logic, since it will be cleaned up in the end.
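
If we did want to solve it fully, a rough sketch of the direction (the key/value layout here is hypothetical, not the actual shuffle-service DB schema):

```scala
import scala.collection.mutable

// Sketch: persist the finalized reduceIds when a shuffle merge is finalized, so
// a restarted shuffle service still knows which merged partition files it owns
// and can delete, at the cost of extra DB writes.
class FinalizedPartitionStoreSketch {
  private val db = mutable.Map[String, Array[Int]]()   // stand-in for the recovery DB

  def recordFinalized(appId: String, shuffleId: Int, reduceIds: Array[Int]): Unit =
    db.put(s"finalized:$appId:$shuffleId", reduceIds)

  // After a restart, reload the entry to learn which merged partitions exist on disk.
  def loadFinalized(appId: String, shuffleId: Int): Array[Int] =
    db.getOrElse(s"finalized:$appId:$shuffleId", Array.empty[Int])
}
```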

Contributor Author

What if the shuffle statuses do not exist?

I think that cannot happen; please see the case match code.

Comment on lines 541 to 553
try {
  File metaFile =
      shuffleInfo.getMergedShuffleMetaFile(shuffleId, mergeId, partition);
  File indexFile = new File(
      shuffleInfo.getMergedShuffleIndexFilePath(shuffleId, mergeId, partition));
  File dataFile =
      shuffleInfo.getMergedShuffleDataFile(shuffleId, mergeId, partition);
  metaFile.delete();
  indexFile.delete();
  dataFile.delete();
} catch (Exception e) {
  logger.error("Error deleting merged shuffle files for {}", shuffleMergeId, e);
}
Contributor

Just like the closeAllFilesAndDeleteIfNeeded method, can we continue deleting the other files if one delete() fails?
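
Something along these lines (a Scala sketch of the pattern for brevity; the actual shuffle-service code is Java):

```scala
import java.io.File

// Sketch: attempt every deletion independently and report failures per file,
// so that one bad file does not prevent deleting the others.
object DeleteEachFileSketch {
  def deleteAll(files: Seq[File]): Unit = {
    files.foreach { f =>
      try {
        if (!f.delete()) {
          println(s"Could not delete ${f.getAbsolutePath}")   // stand-in for logger.error
        }
      } catch {
        case e: Exception =>
          println(s"Error deleting ${f.getAbsolutePath}: ${e.getMessage}")
      }
    }
  }
}
```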

Contributor Author

@yabola yabola Nov 23, 2022

done

@github-actions github-actions bot added the YARN label Nov 23, 2022
val removeShuffleFromExecutorsFutures = blockManagerInfo.values.map { bm =>
  bm.storageEndpoint.ask[Boolean](removeMsg).recover {
    // use false as default value means no shuffle data were removed
    handleBlockRemovalFailure("shuffle", shuffleId.toString, bm.blockManagerId, false)
Contributor Author

@yabola yabola Nov 24, 2022

I just moved removeShuffleFromExecutorsFutures to the end.
It needs to be invoked last to avoid cleaning up shuffleStatuses in mapOutputTracker too early; otherwise mapOutputTracker.shuffleStatuses.get(shuffleId) may sometimes be None.
Please refer to the unregisterShuffle code.
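
A simplified sketch of the ordering concern (the names are illustrative, not the actual BlockManagerMasterEndpoint code):

```scala
// Sketch: read the merge info from the map output tracker first and issue the
// executor-side removal last, so mapOutputTracker.shuffleStatuses.get(shuffleId)
// is still defined when it is needed.
object RemoveShuffleOrderingSketch {
  def removeShuffle(shuffleId: Int): Unit = {
    val hasMergeInfo = lookupMergeInfo(shuffleId)                         // 1. needs the shuffle status to still exist
    if (hasMergeInfo) removeMergedDataFromShuffleServices(shuffleId)      // 2. clean merged data on the ESS
    removeShuffleFromExecutors(shuffleId)                                 // 3. last: this path unregisters the shuffle
  }

  private def lookupMergeInfo(shuffleId: Int): Boolean = true
  private def removeMergedDataFromShuffleServices(shuffleId: Int): Unit = ()
  private def removeShuffleFromExecutors(shuffleId: Int): Unit = ()
}
```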

mergeManager2, mergeManager2DB) == 1)
assert(ShuffleTestAccessor.getOutdatedFinalizedShuffleCountDuringDBReload(
mergeManager2, mergeManager2DB) == 2)
mergeManager2, mergeManager2DB) == 1)
Contributor Author

This is as expected, because we delete the current merged partitions and also the current outdated merge status in the DB (which was not cleaned up before this PR).
Please refer to the code.

@yabola yabola changed the title [WIP][SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service [SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service Nov 30, 2022
@yabola
Contributor Author

yabola commented Dec 1, 2022

@mridulm If you are back and have time, please review my PR; I think the functionality is mostly done. Let me know if anything is inappropriate and I will fix it soon, thanks!

@mridulm
Contributor

mridulm commented Dec 3, 2022

Apologies for the delay in getting to this @yabola - I will try to get to this next week.
Thanks for your patience.

My recommendation would be to keep the change as close to @wankunde's PR as possible, and fix the pending issues there to expedite the reviews (since that change was already reviewed quite a lot). If there are additional functional gaps in that PR, we can address them in follow-up PRs.

@yabola
Contributor Author

yabola commented Dec 13, 2022

Apologies for the delay in getting to this @yabola - I will try to get to this next week. Thanks for your patience.

My recommendation would be to keep the change as close to @wankunde's PR as possible, and fix the pending issues there to expedite the reviews (since that change was already reviewed quite a lot). If there are additional functional gaps in that PR, we can address them in follow-up PRs.

@wankunde According to the comments above, could you address the remaining review comments in your PR?

@wankunde
Contributor

Hi @yabola @mridulm, I will update SPARK-40480 this weekend.

@yabola
Contributor Author

yabola commented Dec 23, 2022

Closing in favor of #37922.

@yabola yabola closed this Dec 23, 2022