[SPARK-33114][CORE] Add metadata in MapStatus to support custom shuffle manager #31763
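The diff in this PR (shown below) covers only `MapOutputTrackerSuite`, but the tests pin down the shape of the proposed API. The following is a minimal, self-contained sketch of that shape using simplified stand-in types; it is an illustration inferred from the tests, not Spark's actual source.

```scala
import scala.collection.mutable

// Simplified stand-ins for Spark's classes, inferred from the tests in this
// PR; not Spark's actual source.
final case class BlockManagerId(executorId: String, host: String, port: Int)

// MapStatus gains an optional, opaque metadata payload that a custom shuffle
// manager can attach when registering a map output (e.g., a handle into
// remote shuffle storage).
final case class MapStatus(
    location: BlockManagerId,
    sizes: Array[Long],
    mapTaskId: Long,
    metadata: Option[Any] = None)

final class MapOutputTracker {
  private val statuses = mutable.Map.empty[Int, mutable.ArrayBuffer[MapStatus]]

  def registerShuffle(shuffleId: Int, numMaps: Int): Unit =
    statuses(shuffleId) = mutable.ArrayBuffer.fill[MapStatus](numMaps)(null)

  def registerMapOutput(shuffleId: Int, mapIndex: Int, status: MapStatus): Unit =
    statuses(shuffleId)(mapIndex) = status

  // Returns the metadata attached by map tasks, in map-partition order,
  // skipping outputs that carry no metadata.
  def getAllMapOutputStatusMetadata(shuffleId: Int): Seq[Any] =
    statuses.getOrElse(shuffleId, mutable.ArrayBuffer.empty[MapStatus])
      .iterator.flatMap(s => Option(s).flatMap(_.metadata)).toSeq
}
```

The key point is that `metadata` is opaque from the tracker's point of view: Spark only stores and returns it, and the custom shuffle manager decides what it means.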
Status: Closed
Commits (6):

- `36cf66c` Apply https://github.com/apache/spark/pull/30004 (hiboyang)
- `1309378` Update per comments (hiboyang)
- `17bcff2` Rename `getAllMapOutputStatuses` to `getAllMapOutputStatusMetadata` (hiboyang)
- `525973b` Fix test issues (hiboyang)
- `1fbc7a3` Fix comment of return value for `getAllMapOutputStatusMetadata` (hiboyang)
- `798addf` Fix import style issue (hiboyang)
File changed: MapOutputTrackerSuite.scala
```diff
@@ -62,17 +62,38 @@ class MapOutputTrackerSuite extends SparkFunSuite {
     assert(tracker.containsShuffle(10))
     val size1000 = MapStatus.decompressSize(MapStatus.compressSize(1000L))
     val size10000 = MapStatus.decompressSize(MapStatus.compressSize(10000L))
-    tracker.registerMapOutput(10, 0, MapStatus(BlockManagerId("a", "hostA", 1000),
-      Array(1000L, 10000L), 5))
-    tracker.registerMapOutput(10, 1, MapStatus(BlockManagerId("b", "hostB", 1000),
-      Array(10000L, 1000L), 6))
+    val mapStatus1 = MapStatus(BlockManagerId("a", "hostA", 1000), Array(1000L, 10000L), 5)
+    val mapStatus2 = MapStatus(BlockManagerId("b", "hostB", 1000), Array(10000L, 1000L), 6)
+    tracker.registerMapOutput(10, 0, mapStatus1)
+    tracker.registerMapOutput(10, 1, mapStatus2)
     val statuses = tracker.getMapSizesByExecutorId(10, 0)
     assert(statuses.toSet ===
       Seq((BlockManagerId("a", "hostA", 1000),
           ArrayBuffer((ShuffleBlockId(10, 5, 0), size1000, 0))),
         (BlockManagerId("b", "hostB", 1000),
           ArrayBuffer((ShuffleBlockId(10, 6, 0), size10000, 1)))).toSet)
     assert(0 == tracker.getNumCachedSerializedBroadcast)
+    val allStatusMetadata = tracker.getAllMapOutputStatusMetadata(10)
+    assert(0 == allStatusMetadata.size)
     tracker.stop()
     rpcEnv.shutdown()
   }
 
+  test("master register shuffle with map status metadata") {
+    val rpcEnv = createRpcEnv("test")
+    val tracker = newTrackerMaster()
+    tracker.trackerEndpoint = rpcEnv.setupEndpoint(MapOutputTracker.ENDPOINT_NAME,
+      new MapOutputTrackerMasterEndpoint(rpcEnv, tracker, conf))
+    tracker.registerShuffle(10, 2)
+    val mapStatus1 = MapStatus(BlockManagerId("a", "hostA", 1000),
+      Array(1000L, 10000L), 5, Some("metadata1"))
+    val mapStatus2 = MapStatus(BlockManagerId("b", "hostB", 1000),
+      Array(10000L, 1000L), 6, Some(1001))
+    tracker.registerMapOutput(10, 0, mapStatus1)
+    tracker.registerMapOutput(10, 1, mapStatus2)
+    assert(0 == tracker.getNumCachedSerializedBroadcast)
+    val allStatusMetadata = tracker.getAllMapOutputStatusMetadata(10)
+    assert(allStatusMetadata === Array(mapStatus1.metadata.get, mapStatus2.metadata.get))
+    tracker.stop()
+    rpcEnv.shutdown()
+  }
```
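Taken together, the two tests in the hunk above pin down the intended behavior: `getAllMapOutputStatusMetadata` returns nothing when map outputs carry no metadata, and returns the attached objects in map order when they do. In terms of the simplified sketch near the top of this page (hypothetical types, not Spark's source):

```scala
val tracker = new MapOutputTracker
tracker.registerShuffle(10, 2)
tracker.registerMapOutput(10, 0,
  MapStatus(BlockManagerId("a", "hostA", 1000), Array(1000L, 10000L), 5, Some("metadata1")))
tracker.registerMapOutput(10, 1,
  MapStatus(BlockManagerId("b", "hostB", 1000), Array(10000L, 1000L), 6, Some(1001)))
// Metadata comes back in map-partition order; any value type is allowed.
assert(tracker.getAllMapOutputStatusMetadata(10) == Seq("metadata1", 1001))
```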
```diff
@@ -92,9 +113,11 @@ class MapOutputTrackerSuite extends SparkFunSuite {
     assert(tracker.containsShuffle(10))
     assert(tracker.getMapSizesByExecutorId(10, 0).nonEmpty)
     assert(0 == tracker.getNumCachedSerializedBroadcast)
+    assert(0 == tracker.getAllMapOutputStatusMetadata(10).size)
     tracker.unregisterShuffle(10)
     assert(!tracker.containsShuffle(10))
     assert(tracker.getMapSizesByExecutorId(10, 0).isEmpty)
+    assert(0 == tracker.getAllMapOutputStatusMetadata(11).size)
 
     tracker.stop()
     rpcEnv.shutdown()
```
```diff
@@ -122,6 +145,7 @@ class MapOutputTrackerSuite extends SparkFunSuite {
     // this should cause it to fail, and the scheduler will ignore the failure due to the
     // stage already being aborted.
     intercept[FetchFailedException] { tracker.getMapSizesByExecutorId(10, 1) }
+    intercept[FetchFailedException] { tracker.getAllMapOutputStatusMetadata(10) }
 
     tracker.stop()
     rpcEnv.shutdown()
```
```diff
@@ -146,13 +170,15 @@ class MapOutputTrackerSuite extends SparkFunSuite {
     intercept[FetchFailedException] { mapWorkerTracker.getMapSizesByExecutorId(10, 0) }
 
     val size1000 = MapStatus.decompressSize(MapStatus.compressSize(1000L))
-    masterTracker.registerMapOutput(10, 0, MapStatus(
-      BlockManagerId("a", "hostA", 1000), Array(1000L), 5))
+    val mapStatus = MapStatus(BlockManagerId("a", "hostA", 1000), Array(1000L), 5)
+    masterTracker.registerMapOutput(10, 0, mapStatus)
     mapWorkerTracker.updateEpoch(masterTracker.getEpoch)
     assert(mapWorkerTracker.getMapSizesByExecutorId(10, 0).toSeq ===
       Seq((BlockManagerId("a", "hostA", 1000),
           ArrayBuffer((ShuffleBlockId(10, 5, 0), size1000, 0)))))
     assert(0 == masterTracker.getNumCachedSerializedBroadcast)
+    val allMapOutputStatusMetadata = mapWorkerTracker.getAllMapOutputStatusMetadata(10)
+    assert(0 == allMapOutputStatusMetadata.size)
 
     val masterTrackerEpochBeforeLossOfMapOutput = masterTracker.getEpoch
     masterTracker.unregisterMapOutput(10, 0, BlockManagerId("a", "hostA", 1000))
```

Review thread on the `assert(0 == masterTracker.getNumCachedSerializedBroadcast)` line:

Contributor: Please do not remove this assert.

hiboyang (author): It seems this was caused by a merge issue; I will add it back.
```diff
@@ -170,6 +196,33 @@ class MapOutputTrackerSuite extends SparkFunSuite {
     mapWorkerRpcEnv.shutdown()
   }
 
+  test("remote get all map output statuses with metadata") {
+    val hostname = "localhost"
+    val rpcEnv = createRpcEnv("spark", hostname, 0, new SecurityManager(conf))
+
+    val masterTracker = newTrackerMaster()
+    masterTracker.trackerEndpoint = rpcEnv.setupEndpoint(MapOutputTracker.ENDPOINT_NAME,
+      new MapOutputTrackerMasterEndpoint(rpcEnv, masterTracker, conf))
+
+    val mapWorkerRpcEnv = createRpcEnv("spark-worker", hostname, 0, new SecurityManager(conf))
+    val mapWorkerTracker = new MapOutputTrackerWorker(conf)
+    mapWorkerTracker.trackerEndpoint =
+      mapWorkerRpcEnv.setupEndpointRef(rpcEnv.address, MapOutputTracker.ENDPOINT_NAME)
+
+    masterTracker.registerShuffle(10, 1)
+    val mapStatus = MapStatus(BlockManagerId("a", "hostA", 1000), Array(1000L), 5,
+      Some("metadata1"))
+    masterTracker.registerMapOutput(10, 0, mapStatus)
+    val allMapOutputStatusMetadata = mapWorkerTracker.getAllMapOutputStatusMetadata(10)
+    assert(allMapOutputStatusMetadata.size === 1)
+    assert(allMapOutputStatusMetadata(0) === mapStatus.metadata.get)
+
+    masterTracker.stop()
+    mapWorkerTracker.stop()
+    rpcEnv.shutdown()
+    mapWorkerRpcEnv.shutdown()
+  }
+
   test("remote fetch below max RPC message size") {
     val newConf = new SparkConf
     newConf.set(RPC_MESSAGE_MAX_SIZE, 1)
```
```diff
@@ -311,10 +364,12 @@ class MapOutputTrackerSuite extends SparkFunSuite {
     val size0 = MapStatus.decompressSize(MapStatus.compressSize(0L))
     val size1000 = MapStatus.decompressSize(MapStatus.compressSize(1000L))
     val size10000 = MapStatus.decompressSize(MapStatus.compressSize(10000L))
-    tracker.registerMapOutput(10, 0, MapStatus(BlockManagerId("a", "hostA", 1000),
-      Array(size0, size1000, size0, size10000), 5))
-    tracker.registerMapOutput(10, 1, MapStatus(BlockManagerId("b", "hostB", 1000),
-      Array(size10000, size0, size1000, size0), 6))
+    val mapStatus1 = MapStatus(BlockManagerId("a", "hostA", 1000),
+      Array(size0, size1000, size0, size10000), 5)
+    val mapStatus2 = MapStatus(BlockManagerId("b", "hostB", 1000),
+      Array(size10000, size0, size1000, size0), 6)
+    tracker.registerMapOutput(10, 0, mapStatus1)
+    tracker.registerMapOutput(10, 1, mapStatus2)
     assert(tracker.containsShuffle(10))
     assert(tracker.getMapSizesByExecutorId(10, 0, 2, 0, 4).toSeq ===
       Seq(
```
```diff
@@ -326,6 +381,7 @@ class MapOutputTrackerSuite extends SparkFunSuite {
           (ShuffleBlockId(10, 6, 2), size1000, 1)))
       )
     )
+    assert(0 == tracker.getAllMapOutputStatusMetadata(10).size)
 
     tracker.unregisterShuffle(10)
     tracker.stop()
```
Ngone51: Hmm... what's the relationship between SPARK-33114 and SPARK-25299? According to the JIRA descriptions, SPARK-33114 seems to enhance support for custom shuffle managers, while SPARK-25299 only customizes the storage used by the default `SortShuffleManager`. So if we are only talking about SPARK-33114, adding `metadata` may be a good choice for its own scenario. But if we bring in SPARK-25299 as well (IIUC, what this PR does would also benefit SPARK-25299), I personally think we need a more general design here. For example, I'd prefer to redesign the `location` of `MapStatus` so that it can support the different scenarios mentioned in SPARK-25299 (e.g., the Spark BlockManager, the Spark external shuffle service, custom remote storage, etc.). That way, the different scenarios could reuse existing features such as decommissioning (which may update a map status's location at runtime) and reuse the existing code paths: we wouldn't need the extra `getAllMapOutputStatuses`, and everything would work the same way shuffle reading already does. WDYT?
attilapiros: @Ngone51 I agree with you that finishing the design laid out in SPARK-25299 would be much better. This is why I opened #30763 as a copy of Matthew Cheah's original PR for SPARK-31801 (he is busy with other projects) and have kept it up to date with master several times. But it hasn't gotten enough reviews, and I wouldn't want to block @hiboyang further, see #30004 (comment). I am sure that with your help we can complete SPARK-31801 and be on the road to SPARK-25299. So next week I will do the conflict resolution and ping you when the PR is ready for review. Is this okay?
hiboyang (author): Cool, thanks @attilapiros for continuing to work on SPARK-25299 while unblocking this PR. @Ngone51 SPARK-33114 is a small change to support a remote shuffle service/storage by adding a metadata object to MapStatus. It can be viewed as a subset of SPARK-25299's work.
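As a rough illustration of what "a metadata object in MapStatus" buys a remote shuffle implementation, a reducer-side reader might resolve its inputs from the metadata alone. `getAllMapOutputStatusMetadata` follows this PR's tests, while `RemoteShuffleClient` and `readPartition` are invented for the example:

```scala
object RemoteShuffleReadExample {
  // Hypothetical remote-storage client; not a real Spark or third-party API.
  trait RemoteShuffleClient {
    def readPartition(outputHandle: String, reduceId: Int): Iterator[Array[Byte]]
  }

  // Each metadata entry is assumed to be a handle (e.g., a path or object
  // key) identifying where one map task wrote its output in remote storage.
  def readReducePartition(
      tracker: MapOutputTracker, // the simplified sketch type from earlier
      client: RemoteShuffleClient,
      shuffleId: Int,
      reduceId: Int): Iterator[Array[Byte]] =
    tracker.getAllMapOutputStatusMetadata(shuffleId).iterator.flatMap { meta =>
      client.readPartition(meta.toString, reduceId)
    }
}
```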
Ngone51: Sure, please. @attilapiros
Ngone51: @hiboyang Thanks for your explanation. I agree that #30763 is too big to review. But I think we can discuss there first to make sure we're heading in the same direction before we dive into the details. And once we're on the same page, we can split the big PR into smaller pieces and start working on them together. Does that sound good to you?
hiboyang (author): SPARK-31801 is very big and may take a very long time to finish (it has already been open for 10 months). Could we merge this PR first? If SPARK-31801 finds a better way to support this and breaks `getAllMapOutputStatusMetadata`, that is actually a good thing :) We could have multiple iterations: this PR is the first iteration, with a very small change, and SPARK-31801 is the iteration after that. The latter does not need to block the former.
Ngone51: I don't think we should develop this way... As you mentioned above, SPARK-33114 can be considered a subtask of SPARK-25299. So how can we treat this PR as a first iteration when SPARK-25299 is still under discussion and development, especially when people haven't reached agreement on the solution and a possible alternative exists at the same time? Also, I think custom shuffle managers aren't officially supported by Spark, because the `ShuffleManager` interface is private. So it doesn't make sense for Spark to add an internal API for unofficial use cases without a strong reason. SPARK-31801 is certainly big. But as I mentioned earlier, we can split it. Once the solution is finalized, we can start by refactoring `MapStatus`, which I think would be a much smaller task and enough for your case. After that, we'd do the remaining work (e.g., using the new `MapStatus` wherever it is referenced), which you wouldn't need to care about. I understand you have put a lot of effort into this work, and I'm sorry we cannot get it in quickly. And, unfortunately, I don't have permission to merge. You could try to persuade committers to merge the PR if you insist on it.
hiboyang (author): Yeah, it would also be a good idea to split SPARK-31801 and start by refactoring `MapStatus`. Do you or the community have ideas about how to split SPARK-31801?
Ngone51: We're still discussing the solution in #30763, so I can't give you a concrete split plan yet. But I think we'd be able to start with refactoring `MapStatus` either way.
hiboyang (author): I see, I had missed the latest discussion in #30763. I will check there as well, thanks!