Conversation

@yikf
Contributor

@yikf yikf commented Feb 26, 2021

What changes were proposed in this pull request?

Fixed an issue where shuffle data could not be cleaned up in unregisterShuffle.

Why are the changes needed?

While we use the old shuffle fetch protocol, we use partitionId as the mapId in the ShuffleBlockId construction, but getWriter[K, V] caches context.taskAttemptId() as the mapId in taskIdMapsForShuffle.

As a result, shuffle data cannot be cleaned up in unregisterShuffle: when we remove a shuffle's metadata, the mapIds recorded in taskIdMapsForShuffle are context.taskAttemptId() values rather than partitionIds, so the shuffle files written under partitionId are never deleted.
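To make the mismatch concrete, here is a minimal, self-contained Scala sketch of the bookkeeping bug. The names mirror Spark's, but this is an illustrative model, not Spark's actual implementation:

```scala
import scala.collection.mutable

// Illustrative model: the shuffle manager records mapIds at write time and
// later derives from them the file names to delete in unregisterShuffle.
object UnregisterShuffleModel {
  val taskIdMapsForShuffle = mutable.Map[Int, mutable.Set[Long]]()

  // getWriter caches context.taskAttemptId() as the mapId...
  def getWriter(shuffleId: Int, taskAttemptId: Long): Unit =
    taskIdMapsForShuffle.getOrElseUpdate(shuffleId, mutable.Set[Long]()) += taskAttemptId

  // ...but with the old fetch protocol the data file is named with partitionId.
  def dataFileName(shuffleId: Int, mapId: Long): String =
    s"shuffle_${shuffleId}_${mapId}_0.data"

  // unregisterShuffle deletes only the files named by the recorded mapIds.
  def filesToDelete(shuffleId: Int): Set[String] =
    taskIdMapsForShuffle.getOrElse(shuffleId, mutable.Set.empty[Long])
      .map(dataFileName(shuffleId, _)).toSet
}
```

If an earlier job has already consumed some task attempt ids, the writer records, say, mapId 5 while the file on disk is `shuffle_0_0_0.data`; `filesToDelete` then never matches the real file, and it leaks.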

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Added a new test.

@yikf
Contributor Author

yikf commented Feb 26, 2021

Gentle ping @srowen @otterc @HyukjinKwon, thanks for taking a look.

import org.apache.spark.rdd.ShuffledRDD
import org.apache.spark.serializer.{JavaSerializer, KryoSerializer, Serializer}
import org.apache.spark.util.Utils

Member

nit: redundant empty line.


test("SPARK-34541 Data could not be cleaned up when unregisterShuffle") {
val conf = new SparkConf(loadDefaults = false)
val tempDir: File = Utils.createTempDir()
Member

use withTempDir?
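The reviewer's suggestion is to use the test helper that Spark's suites provide instead of calling Utils.createTempDir() directly. Below is an illustrative, stdlib-only sketch of the loan pattern behind such a withTempDir helper (not Spark's actual implementation): the directory is created before the body runs and recursively deleted afterwards, even if the body throws.

```scala
import java.io.File
import java.nio.file.Files

// Illustrative loan-pattern sketch, assuming nothing beyond the Java stdlib.
object TempDirLoan {
  def withTempDir(f: File => Unit): Unit = {
    val dir = Files.createTempDirectory("spark-test").toFile
    try f(dir) finally deleteRecursively(dir)
  }

  // Delete children first, then the directory itself.
  def deleteRecursively(file: File): Unit = {
    Option(file.listFiles).foreach(_.foreach(deleteRecursively))
    file.delete()
  }
}
```

The advantage over a bare createTempDir is that cleanup is guaranteed by the `finally` block rather than depending on the test body reaching its last line.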

assert (!file.exists(), s"Shuffle file $file was not cleaned up")
}
}

Member

nit: unnecessary line

)))
}

test("SPARK-34541 Data could not be cleaned up when unregisterShuffle") {
Member

nit: SPARK-34541:

Contributor Author

Updated.

@Ngone51
Member

Ngone51 commented Feb 26, 2021

cc @xuanyuanking

@srowen
Member

srowen commented Feb 27, 2021

Jenkins retest this please

@SparkQA

SparkQA commented Feb 27, 2021

Test build #135545 has finished for PR 31664 at commit 69467ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SortShuffleManagerSuite extends SparkFunSuite with Matchers with LocalSparkContext

)))
}

test("Data could not be cleaned up when unregisterShuffle") {
Member

nit: SPARK-34541: Data ...

Member

BTW, this test requires spark.shuffle.useOldFetchProtocol=true?

And it's better to test both true and false.

Contributor Author
@yikf yikf Mar 1, 2021

It seems that context.taskAttemptId and partitionId are the same, both increasing from 0, so I don't understand why the protocol needs to be differentiated on the write side.

I ran into this problem before, but I haven't been able to reproduce the scenario since.

Member

We can run a simple job before our target job to make the taskAttemptId start from 1, e.g.,

sc.parallelize(1 to 10, 1).count()

I tried this way and the issue can be reproduced.
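The trick works because task attempt ids are allocated from a single driver-side counter across all jobs, while partition ids restart at 0 for every stage. A self-contained model of the divergence (illustrative counters, not Spark internals):

```scala
// Illustrative model: one global attempt-id counter, per-stage partition ids
// restarting at 0. Not Spark's actual scheduler code.
object AttemptIdModel {
  private var nextTaskAttemptId = 0L

  // Returns (partitionId, taskAttemptId) pairs for one stage.
  def runStage(numPartitions: Int): Seq[(Int, Long)] =
    (0 until numPartitions).map { pid =>
      val attempt = nextTaskAttemptId
      nextTaskAttemptId += 1
      (pid, attempt)
    }
}
```

In the very first stage of an application the two ids happen to coincide, which is why the leak was invisible in a test without a preceding job.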

Contributor Author
@yikf yikf Mar 1, 2021

OK, thank you. I'll add the test later. And I still don't understand why the protocol should be differentiated on the write side. See ShuffleMapTask#runTask:

// While we use the old shuffle fetch protocol, we use partitionId as mapId in the ShuffleBlockId construction.
val mapId = if (SparkEnv.get.conf.get(config.SHUFFLE_USE_OLD_FETCH_PROTOCOL)) {
  partitionId
} else context.taskAttemptId()

On the read side, we need the protocol to distinguish messages. But on the write side, the executor registers with the ExternalShuffleService via RegisterExecutor, which passes the localDirs to the shuffle service, so the shuffle service can locate the intermediate shuffle files. That seems unrelated to mapId.

Member

I roughly remember that's because we want to ensure the unique file name at write size. cc @xuanyuanking

Contributor Author

Updated.

Member

at write size -> at write side? :)
Yes, you can check the description in #25620. TL;DR: we need a unique file name to resolve the indeterminate shuffle issue.

)))
}

test("Shuffle data can be cleaned up whether spark.shuffle.useOldFetchProtocol=true/false") {
Member
@Ngone51 Ngone51 Mar 2, 2021

We usually test different config values like this:

Seq(true, false).foreach { value =>
  test(s"SPARK-34541: shuffle data can be cleaned up whether spark.shuffle.useOldFetchProtocol=$value") {
    ...
    conf.set(spark.shuffle.useOldFetchProtocol, value)
    ...
  }
}

Could you follow this way?

Contributor Author
@yikf yikf Mar 2, 2021

OK, thanks very much for the code-style guidance!

}

Seq("true", "false").foreach { value =>
test(s"SPARK-34541: shuffle can be removed when spark.shuffle.useOldFetchProtocol=$value") {
Member

Is it possible to add this in ShuffleSuite? If so, we don't need to consider spark.shuffle.useOldFetchProtocol, since ShuffleOldFetchProtocolSuite would cover the old-protocol case.
If it's hard to move it to ShuffleSuite, I'm also OK with the current code.

Member
@xuanyuanking xuanyuanking left a comment

Thanks for the fix! LGTM

@srowen
Member

srowen commented Mar 4, 2021

Jenkins retest this please

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40345/

@SparkQA

SparkQA commented Mar 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40345/

@SparkQA

SparkQA commented Mar 4, 2021

Test build #135762 has finished for PR 31664 at commit 65be2c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yikf
Contributor Author

yikf commented Mar 5, 2021

Jenkins retest this please

Gentle ping @srowen.

Member
@srowen srowen left a comment

@Ngone51 OK with you?

Member
@Ngone51 Ngone51 left a comment

@srowen Thanks for the ping.

Overall, the change still looks good to me. I have left another two minor comments. Let's address them and trigger another round of testing to pass the k8s tests.

manager.unregisterShuffle(0)
}

test(s"SPARK-34541: shuffle can be removed when spark.shuffle.useOldFetchProtocol=true") {
Member

nit: SPARK-34541: shuffle files should be removed normally

(spark.shuffle.useOldFetchProtocol=true is not valid for ShuffleSuite.)

Contributor Author

ShuffleSuite is an abstract class; ShuffleOldFetchProtocolSuite would cover the spark.shuffle.useOldFetchProtocol=true case.

Member

I mean, the test name is not correct for the other extended shuffle suites. For those suites, spark.shuffle.useOldFetchProtocol should be false, right?

mapSideCombine = true
)))
}

Member

nit: revert the unrelated change.

Contributor Author

Line 133 was an empty line. We originally added the test in SortShuffleManagerSuite; after #31664 (comment), we moved the test to ShuffleSuite, so I changed the empty line along with it.

Member

I know the reason, but it's still not a necessary change, especially when there are no other changes in the file.

Contributor Author

OK, I will revert it.

@Ngone51
Member

Ngone51 commented Mar 6, 2021

LGTM

@srowen
Member

srowen commented Mar 8, 2021

Jenkins retest this please

@SparkQA

SparkQA commented Mar 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40430/

@SparkQA

SparkQA commented Mar 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40430/

@SparkQA

SparkQA commented Mar 8, 2021

Test build #135848 has finished for PR 31664 at commit cd907bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen closed this in f340857 Mar 8, 2021
@srowen
Member

srowen commented Mar 8, 2021

Merged to master
