
Conversation

@attilapiros
Contributor

This is a copy of #28618, merged with the current master and with all merge conflicts resolved.
All the credit goes to @mccheah; I just would like to help out here and avoid his progress being lost.

What changes were proposed in this pull request?

Adds a ShuffleOutputTracker API that can be used for managing shuffle metadata on the driver. The tracker accepts the map output metadata returned by the map output writers.
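To give a feel for the shape of the API, here is a rough sketch of such a driver-side tracker. The trait and method names below are illustrative only, not the exact interface in this diff:

// Illustrative sketch only; not the actual interface from this patch.
// Plugin-defined, opaque metadata describing where/how a map output was stored.
trait MapOutputMetadata extends Serializable

trait ShuffleOutputTracker {
  // Called on the driver when a shuffle is registered.
  def registerShuffle(shuffleId: Int): Unit

  // Called with the metadata a map output writer returned, so the tracker
  // can record where the output for this map task ended up.
  def registerMapOutput(shuffleId: Int, mapIndex: Int, metadata: MapOutputMetadata): Unit

  // Called when the shuffle is unregistered, so external state can be cleaned up.
  def unregisterShuffle(shuffleId: Int): Unit
}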

Requires #28616.

Why are the changes needed?

Part of the design as discussed in this document, and part of the wider effort of SPARK-25299.

Does this PR introduce any user-facing change?

Enables additional APIs for the shuffle storage plugin tree. Usage will become more apparent when the read side of the shuffle plugin tree is introduced.

How was this patch tested?

We've added a mock implementation of the shuffle plugin tree to prove that a Spark job using a different implementation of the plugin can use all of the plugin points for an alternative shuffle data storage solution. We don't include it in this patch, in order to minimize the diff and the code to review; see #28902.

@SparkQA

SparkQA commented Dec 14, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37373/

@SparkQA

SparkQA commented Dec 14, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37373/

@SparkQA

SparkQA commented Dec 14, 2020

Test build #132771 has finished for PR 30763 at commit 5cb3e9d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39154/

@SparkQA

SparkQA commented Jan 27, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39154/

@SparkQA

SparkQA commented Jan 27, 2021

Test build #134568 has finished for PR 30763 at commit b350258.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40438/

@SparkQA

SparkQA commented Mar 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40438/

@SparkQA

SparkQA commented Mar 8, 2021

Test build #135856 has finished for PR 30763 at commit abcf8f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

The failure is totally unrelated:

  • org.apache.spark.sql.kafka010.KafkaMicroBatchV2SourceWithAdminSuite.subscribing topic by pattern from latest offsets (failOnDataLoss: false)

@attilapiros
Contributor Author

Let me move the MiMa excludes from 3.1.x to 3.2.x.

@attilapiros
Contributor Author

@Ngone51

Regarding cutting this into smaller pieces, I can identify two potential sub-PRs:

  • introduction of MapTaskResult
  • introduction of ShuffleOutputTracker

I can do this cut if you think it is really needed and if you agree with the content of the sub-PRs.

@SparkQA

SparkQA commented Mar 8, 2021

Test build #135864 has finished for PR 30763 at commit 8e54b41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@Ngone51 left a comment


Hi, @attilapiros @hiboyang I'd like to discuss more about the way to register the metadata before we consider how to split this PR.

I actually have a different idea about this. As mentioned briefly in @hiboyang's PR, I'd rather redesign the location of MapStatus:

private[spark] sealed trait MapStatus {
  /** Location where this task output is. */
  def location: BlockManagerId
  ...
}

If we want to introduce custom storage, I think the location should be abstracted to represent different storages; e.g., we can introduce a Location class. The Location would have some common attributes, e.g., a type, and we could add metadata as an interface on Location to provide arbitrary info. Then BlockManagerId would be the native implementation for Spark, and users could implement Location to support custom storage.
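For illustration, a rough sketch of what I mean (the names and fields here are just for discussion, not a concrete proposal):

// Sketch for discussion only; names and fields are illustrative.
trait Location {
  // Common attribute identifying the kind of storage, e.g. "block-manager" or "s3".
  def storageType: String
  // Arbitrary storage-specific information attached to this location.
  def metadata: Map[String, String]
}

// BlockManagerId (heavily simplified here) stays the native Spark implementation.
class BlockManagerId(val executorId: String, val host: String, val port: Int)
    extends Location {
  override def storageType: String = "block-manager"
  override def metadata: Map[String, String] = Map.empty
}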

Also, this way wouldn't change the existing framework of shuffle read/write. It'd allow us to reuse the existing features and code paths without extra effort.

WDYT?

@attilapiros
Contributor Author

attilapiros commented Mar 8, 2021

Then BlockManagerId would be the native implementation for Spark, and users could implement Location to support custom storage.

To test the idea I will try to come up with hard situations, but this does not mean I am against it.

So if I understand correctly, BlockManagerId would extend the Location class, right?
And here MapStatus#location would be a generic Location?

In this case we should check the references of MapStatus#location and, based on that, decide where it is safe to cast Location to BlockManagerId and where we would pass the location further as a generic Location (or at least what else the generic location should contain to keep the existing things working...).

As the current reader uses MapOutputTracker#getMapSizesByExecutorId, would you like to keep that method and throw an exception at runtime when it's called and the location is not a BlockManagerId? This is a central method to get blocksByAddress for fetching in the Spark shuffle.

For example, as I see it, MapOutputTracker is tailored to the current shuffle solution. This should be checked against the idea.

On the other hand, the write side might be easier, as there MapStatus is filled with the id of the current block manager, so a new writer implementation just uses its own location.

But for the read side my worry is having runtime checks/asserts/guards to enforce what is allowed to be used where.

@attilapiros
Contributor Author

I still think the location abstraction is a good idea. I just have my doubts about the amount of effort it needs:

Also, this way wouldn't change the existing framework of shuffle read/write. It'd allow us to be able to reuse the features and code paths without extra effort.

@Ngone51
Member

Ngone51 commented Mar 8, 2021

So if I understand correctly BlockManagerId would extend the Location class, right?
And here MapStatus#location would be a generic Location?

Yes

As the current reader uses MapOutputTracker#getMapSizesByExecutorId, would you like to keep that method and throw an exception at runtime when it's called and the location is not a BlockManagerId?

We don't. Actually, MapOutputTracker should be refactored to work with the Location instead of the specific BlockManagerId if Location is introduced. Accordingly, blocksByAddress would be refactored to store the unique "address" generated by Location. That also means we'd always keep the generic Location inside ShuffleBlockFetcherIterator instead of a specific Location, so we don't need casting.

I think it also answers this question:

In this case we should check the references of MapStatus#location and, based on that, decide where it is safe to cast Location to BlockManagerId and where we would pass the location further as a generic Location (or at least what else the generic location should contain to keep the existing things working...).

Actually, I only find one reference that needs a cast:

val execId = status.location.executorId

And yes, the custom reader should care more about casting. It should definitely cast the generic Location to its implemented one to get the specific information. But the casting should always succeed because Spark would only use one type of storage at a time.
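For example, reusing the Location trait sketched above and a hypothetical S3Location, the custom reader's cast could look like:

// Hypothetical custom location for an S3-backed shuffle store.
case class S3Location(bucket: String, prefix: String) extends Location {
  override def storageType: String = "s3"
  override def metadata: Map[String, String] = Map("bucket" -> bucket, "prefix" -> prefix)
}

// The custom reader downcasts the generic Location to its own implementation.
// The cast is expected to succeed because only one storage type is active at a time.
def asS3(location: Location): S3Location = location match {
  case s3: S3Location => s3
  case other => throw new IllegalStateException(s"Unexpected location type: $other")
}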

@Ngone51
Member

Ngone51 commented Mar 8, 2021

I just have my doubts about the amount of effort it needs

As far as I see,

  1. In @hiboyang's PR, he added getAllMapOutputStatusMetadata in MapOutputTracker. IIUC, this must need a corresponding change to handle the metadata at the reader side, which would be a new code path.

  2. In this PR, I see we added ShuffleOutputTracker, which is very similar to MapOutputTracker. Also, MapOutputTracker has recently added a new interface, updateMapOutput, to support node decommissioning, but ShuffleOutputTracker doesn't have it. Do we want to support decommissioning for custom storages too, or only for the BlockManager? With ShuffleOutputTracker, I think we'd need extra effort to support it in custom storages. However, if we have the generic Location, we can reuse MapOutputTracker directly.

@Ngone51
Member

Ngone51 commented Mar 8, 2021

BTW, I'm thinking we still need to think carefully about the Location solution. I worry we'd be overengineering if the use cases don't require such flexibility, because I can imagine how widely Location would refactor the code base, especially the central parts. So I hope we'd discuss more use cases if you think the idea is generally good.

@hiboyang

Just saw the discussion here. The location abstraction is a good idea. Different shuffle solutions could have different location implementations: e.g., Spark's default sort shuffle has BlockManagerId as the location, a remote shuffle service has shuffle servers as the location, and disaggregated shuffle storage (e.g., S3) has an S3 bucket/path as the location.

MapOutputTracker#getMapSizesByExecutorId may not need to throw an exception? It could return a list of Locations and sizes.
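E.g., a purely illustrative sketch, with the signature simplified (the real getMapSizesByExecutorId takes more parameters and groups the blocks by BlockManagerId):

// Illustrative only: group map output blocks by the generic Location instead
// of BlockManagerId, so no runtime exception is needed for custom storages.
// BlockId is Spark's existing org.apache.spark.storage.BlockId; the Long is
// the block size and the Int is the map index, as in the current API.
def getMapSizesByLocation(
    shuffleId: Int,
    startPartition: Int,
    endPartition: Int): Iterator[(Location, Seq[(BlockId, Long, Int)])]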

@hiboyang

@Ngone51 @attilapiros do we want to proceed with the location idea?

@Ngone51
Member

Ngone51 commented Mar 17, 2021

I'm waiting for @attilapiros's feedback.

@attilapiros
Contributor Author

We all agree that more abstraction here is really a good idea, and reading #30763 (comment) gives me the impression we both worry about the impact of the change, but as I see it you have solutions for all the concerns: #30763 (comment).

@Ngone51 I am fine if you proceed, and when it is ready we can see the real price of this change.

@hiboyang

We all agree that more abstraction here is really a good idea, and reading #30763 (comment) gives me the impression we both worry about the impact of the change, but as I see it you have solutions for all the concerns: #30763 (comment).

@Ngone51 I am fine if you proceed, and when it is ready we can see the real price of this change.

+1

@Ngone51
Member

Ngone51 commented Mar 17, 2021

Sure, I'll give it a try these days.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
