[SPARK-31801][API][SHUFFLE] Register map output metadata #28618
Conversation
Test build #123022 has finished for PR 28618 at commit
```diff
  * number of bytes written by the partition writer for that partition id.
  */
-long[] commitAllPartitions() throws IOException;
+MapOutputCommitMessage commitAllPartitions() throws IOException;
```
Suggest updating the related comments for this code; the return value is now an object instead of an array.
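For example, the updated Javadoc could read along these lines (the wording is only a suggestion, not the PR's final text):

```java
/**
 * Commits the writes done by all partition writers returned by all calls to
 * this object's {@link #getPartitionWriter(int)}.
 *
 * @return A commit message that wraps, among other things, an array with the
 *         number of bytes written by the partition writer for each partition id.
 */
MapOutputCommitMessage commitAllPartitions() throws IOException;
```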
Test build #123502 has finished for PR 28618 at commit
Test build #123504 has finished for PR 28618 at commit
Test build #123507 has finished for PR 28618 at commit
```diff
-mapStatus = MapStatus$.MODULE$.apply(
-    blockManager.shuffleServerId(), partitionLengths, mapId);
+mapOutputCommitMessage = mapOutputWriter.commitAllPartitions();
+taskResult = new MapTaskResult(
```
As these lines are repeating, you could extract them into a new def, like:

```java
protected void setTaskResult(MapOutputCommitMessage mapOutputCommitMessage) {
  taskResult = new MapTaskResult(
      MapStatus$.MODULE$.apply(
          blockManager.shuffleServerId(),
          mapOutputCommitMessage.getPartitionLengths(),
          mapId),
      OptionConverters.toScala(mapOutputCommitMessage.getMapOutputMetadata()));
}
```

With the help of this new def and Mockito's spy you can even get rid of storing the mapOutputCommitMessage for testing purposes only, but it has a price (this class cannot be final). For details you can check:
attilapiros@f4578a3
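Roughly, the spy-based test could look like this; the writer class, the records, and the assertion values here are illustrative assumptions, not the code in the linked commit:

```java
import static org.junit.Assert.assertArrayEquals;
import static org.mockito.Mockito.spy;
import static org.mockito.Mockito.verify;

import org.mockito.ArgumentCaptor;

// Hypothetical test fragment: spy on the writer and capture the commit message
// handed to setTaskResult, instead of storing it on the writer just for tests.
BypassMergeSortShuffleWriter<Object, Object> writer = spy(realWriter);
writer.write(records.iterator());

ArgumentCaptor<MapOutputCommitMessage> captor =
    ArgumentCaptor.forClass(MapOutputCommitMessage.class);
verify(writer).setTaskResult(captor.capture());
assertArrayEquals(expectedLengths, captor.getValue().getPartitionLengths());
```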
Ack - didn't address this in my latest patch, but I will get around to it.
```java
mapOutputCommitMessage = maybeMetadata.map(
    metadata -> MapOutputCommitMessage.of(spills[0].partitionLengths, metadata))
    .orElse(MapOutputCommitMessage.of(spills[0].partitionLengths));
```
I cannot see why transferMapSpillFile cannot return a MapOutputCommitMessage, which would simplify this part:
attilapiros@289050e
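That is, something like the following signature change, which is my guess at what the linked commit does (unverified):

```java
// Hypothetical: the single-spill writer assembles the commit message itself,
// so the caller no longer needs the map(...).orElse(...) dance over the
// optional metadata.
MapOutputCommitMessage transferMapSpillFile(
    File mapSpillFile,
    long[] partitionLengths) throws IOException;
```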
Hm, I think this was originally designed this way because we didn't want the single spill writer to set a list of partition lengths that was different from what was passed into the writer's transfer function. But, maybe we can wrap this with a preconditions check to ensure that the state remains consistent, and that's good enough along with Javadoc.
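A minimal sketch of such a check, assuming Guava's Preconditions (already on Spark's classpath) and a hypothetical call site:

```java
import java.util.Arrays;
import com.google.common.base.Preconditions;

// Hypothetical call site: fail fast if the commit message reports partition
// lengths different from the ones that were passed into the transfer call.
MapOutputCommitMessage commitMessage =
    mapOutputWriter.transferMapSpillFile(spillFile, spills[0].partitionLengths);
Preconditions.checkState(
    Arrays.equals(commitMessage.getPartitionLengths(), spills[0].partitionLengths),
    "Commit message partition lengths must match the lengths passed to the writer");
```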
I was thinking a lot about this. The problem, I believe, is that this class tries to fulfil two very separate roles: it is a builder and the result of the building at the same time. This is why we need this kind of check:
spark/core/src/main/java/org/apache/spark/shuffle/sort/io/LocalDiskShuffleDriverComponents.java, lines 47 to 49 in 289050e
If the building is cleanly separated from the result of the building, then we can be sure the prerequisites are fulfilled beforehand. I would change it by transforming it to be the result of the building in the following way:
One more idea / question:
@mccheah what do you think?
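For illustration, a builder/result split along those lines could look roughly like this; every name below is hypothetical:

```java
import java.util.Map;

// Hypothetical names throughout: the mutable builder validates its state once
// in build(), and only an immutable, fully-initialized result reaches callers,
// so no "has this been initialized?" check is needed afterwards.
final class ShuffleDriverComponentsBuilder {
  private Map<String, String> extraConfigs;

  ShuffleDriverComponentsBuilder withExtraConfigs(Map<String, String> configs) {
    this.extraConfigs = configs;
    return this;
  }

  InitializedShuffleDriverComponents build() {
    if (extraConfigs == null) {
      throw new IllegalStateException("extraConfigs must be set before build()");
    }
    return new InitializedShuffleDriverComponents(extraConfigs);
  }
}

final class InitializedShuffleDriverComponents {
  private final Map<String, String> extraConfigs;

  InitializedShuffleDriverComponents(Map<String, String> extraConfigs) {
    this.extraConfigs = extraConfigs; // immutable after construction
  }

  Map<String, String> extraConfigs() {
    return extraConfigs;
  }
}
```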
Test build #124261 has finished for PR 28618 at commit
Retest this please.
Test build #124307 has finished for PR 28618 at commit
You bring up a good point. I can adjust the PR accordingly. It does seem like the components class has both an initialization mode and a runtime mode, and it would be more ideal to separate the two. Thanks for critically thinking about this!
Also, I think it makes sense for the executor side and the driver side to be mirrored.
Force-pushed from e7c9988 to dc8d15c
I rebased on master in my latest patch. I also addressed your comments @attilapiros. Thanks for the feedback! The diff was growing extremely large (> 1000 lines), so I removed all the tests for now. I'm going to open a separate patch with tests for this.
Test build #124385 has finished for PR 28618 at commit
In streaming/src/test/scala/org/apache/spark/streaming/ReceivedBlockHandlerSuite.scala:

```diff
index 0976494b6d..5f6a0f164f 100644
--- a/streaming/src/test/scala/org/apache/spark/streaming/ReceivedBlockHandlerSuite.scala
+++ b/streaming/src/test/scala/org/apache/spark/streaming/ReceivedBlockHandlerSuite.scala
@@ -70,7 +70,7 @@ abstract class BaseReceivedBlockHandlerSuite(enableEncryption: Boolean)
   val streamId = 1
   val securityMgr = new SecurityManager(conf, encryptionKey)
   val broadcastManager = new BroadcastManager(true, conf, securityMgr)
-  val mapOutputTracker = new MapOutputTrackerMaster(conf, broadcastManager, true)
+  val mapOutputTracker = new MapOutputTrackerMaster(conf, null, broadcastManager, true)
   val shuffleManager = new SortShuffleManager(conf)
   val serializer = new KryoSerializer(conf)
   var serializerManager = new SerializerManager(serializer, conf, encryptionKey)
```
```scala
private var _driver: ShuffleDriverComponents = _
private var _executor: ShuffleExecutorComponents = _

def getOrCreateDriverComponents(): ShuffleDriverComponents = synchronized {
  if (_driver == null) {
    _driver = delegate.initializeShuffleDriverComponents()
  }
  _driver
}

def getOrCreateExecutorComponents(
    appId: String,
    execId: String,
    extraConfigs: Map[String, String]): ShuffleExecutorComponents = synchronized {
  if (_executor == null) {
    _executor = delegate.initializeShuffleExecutorComponents(appId, execId, extraConfigs.asJava)
  }
  _executor
}
```
I think we can improve the methods of this class, especially getOrCreateExecutorComponents: reading it, I have a feeling it is easy to use badly, for example by calling it twice with two different values (for any of its params), as the object generated for the first call will always be returned.
But first getOrCreateDriverComponents, which is the easier one here as it has no input params at all. It could be replaced with a simple lazy val (which has been thread-safe for a long time: scala/bug#3007).
And the same is true for getOrCreateExecutorComponents, as all its parameters basically come from a SparkEnv instance.
So my idea is:
Suggested change:

```diff
-private var _driver: ShuffleDriverComponents = _
-private var _executor: ShuffleExecutorComponents = _
-
-def getOrCreateDriverComponents(): ShuffleDriverComponents = synchronized {
-  if (_driver == null) {
-    _driver = delegate.initializeShuffleDriverComponents()
-  }
-  _driver
-}
-
-def getOrCreateExecutorComponents(
-    appId: String,
-    execId: String,
-    extraConfigs: Map[String, String]): ShuffleExecutorComponents = synchronized {
-  if (_executor == null) {
-    _executor = delegate.initializeShuffleExecutorComponents(appId, execId, extraConfigs.asJava)
-  }
-  _executor
-}
+lazy val driverComponents = delegate.initializeShuffleDriverComponents()
+lazy val executorComponents = {
+  val env = SparkEnv.get
+  delegate.initializeShuffleExecutorComponents(
+    env.conf.getAppId,
+    env.executorId,
+    env.conf.getAllWithPrefix(ShuffleDataIOUtils.SHUFFLE_SPARK_CONF_PREFIX).toMap.asJava)
+}
```
I still have to test it. What is your opinion, @mccheah?
I certainly gave lazy val consideration. I'm not entirely familiar with the most modern Scala conventions - if lazy is preferred over explicit initialization methods in general, then I'm OK with the above recommendation.
I think transforming getOrCreateExecutorComponents into a method without any arguments is already a big win.
I can let the lazy val go; it is just an implementation detail.
LGTM, I'll make those changes. Thanks!
Test build #128474 has finished for PR 28618 at commit
Btw, noticed that +CC @holdenk
I looked at the changes proposed here so that we can use the interfaces here for push-based shuffle (SPIP and code). I think we will be able to use the current interfaces. I still have to wrap my head around how we can model the push-based case with them. We may need to evolve them to fit the push-based shuffle use case. As long as we are open to potentially making some backward incompatible changes, these APIs look good to me for now.
@mridulm Can we also do this as a follow-up? The main thing is, this patch is already at ~900 lines changed total (+s and -s combined) and I really don't want to increase its scope. This patch has already stalled from merging for a while, and I'd rather get something completed and have follow-up tasks than try to make the entire feature perfect in a single patch. Can we add follow-up JIRA tasks for the purely additive API changes that came out of the comments, and move forward with the scope of this patch as-is?
Sure @mccheah, we can do that in follow-up work to keep things more bite-sized.
attilapiros left a comment:
LGTM (just a few nits)
```diff
 final ShuffleMapOutputWriter mapWriter = shuffleExecutorComponents
     .createMapOutputWriter(shuffleId, mapId, partitioner.numPartitions());
-return mapWriter.commitAllPartitions().getPartitionLengths();
+return mapWriter.commitAllPartitions();
```
Nit:

```diff
-return mapWriter.commitAllPartitions();
+mapOutputCommitMessage = mapWriter.commitAllPartitions();
```
```diff
 // output file would have already been counted as shuffle bytes written.
-partitionLengths = spills[0].partitionLengths;
+long[] partitionLengths = spills[0].partitionLengths;
+logger.debug("Merge shuffle spills for mapId {} with length {}", mapId,
```
Isn't this log a bit misleading? I mean, no merge is done here, as there is only one spill. Even transferMapSpillFile's Javadoc says:
"The map spill file already has the proper format, and it contains all of the partition data. So just transfer it directly to the destination without any merging."
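A hypothetical rewording for the single-spill branch (not the PR's actual text):

```java
// No merging happens on this path; the lone spill file already has the final
// format and is transferred to the destination as-is.
logger.debug("Transferring single shuffle spill for mapId {} with {} partitions",
    mapId, partitionLengths.length);
```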
Thanks for working on this :)
Sorry for my delay in getting back to this; could you up-merge to latest?
Updating PR 28618 with master and applying my comments
Test build #129906 has finished for PR 28618 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129950 has finished for PR 28618 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Jenkins retest this please.
Test build #130236 has finished for PR 28618 at commit
Kubernetes integration test starting
Kubernetes integration test status success
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Adds a ShuffleOutputTracker API that can be used for managing shuffle metadata on the driver. It accepts the map output metadata returned by the map output writers. Requires #28616.
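As a rough sketch, a driver-side tracker of this shape would receive the metadata that map output writers return from commitAllPartitions(); the method names and signatures below are illustrative guesses, not the actual API in this PR:

```java
import java.util.Optional;

// Illustrative guesses at the shape of the driver-side tracker; only the idea
// (register writer-provided metadata, clean it up per shuffle) is from the PR.
public interface ShuffleOutputTracker {
  // Invoked when a map task commits its output; the metadata originates from
  // ShuffleMapOutputWriter#commitAllPartitions().
  void registerMapOutput(int shuffleId, long mapId,
      Optional<MapOutputMetadata> metadata);

  // Invoked when a shuffle is unregistered, so plugin-side state can be freed.
  void unregisterShuffle(int shuffleId);
}
```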
Why are the changes needed?
Part of the design as discussed in this document, and part of the wider effort of SPARK-25299.
Does this PR introduce any user-facing change?
Enables additional APIs for the shuffle storage plugin tree. Usage will become more apparent when the read side of the shuffle plugin tree is introduced.
How was this patch tested?
We've added a mock implementation of the shuffle plugin tree here, to prove that a Spark job using a different implementation of the plugin can use all of the plugin points for an alternative shuffle data storage solution. But we don't include it here, in order to minimize the diff and the code to review in this specific patch. See #28902.