[SPARK-30667][CORE] Add all gather method to BarrierTaskContext #27640
sarthfrey wants to merge 7 commits into apache:master
change method to allGather fix docstring fix test test2 test3 test4 test5 Change API to send and receive bytes rather than strings doc fix doc fix 2 fix test fix test 2 fix test 3 fix test 4 fix test 5 fix test 6 fix test 7 fix test final add python test fix test final 2 address review round 1 Change allGather API to accept string over bytes addressed review feedback round 2 comments rm trailing whitespace address review feedback round 2 address review round 3 address review feedback round 4 address review round 5 address review round 6 fix test retrigger build retrigger build add mima exclusion rule fix semicolon fix tests fix python unit test fix python unit test final temp
OK to test

jenkins, add to whitelist

jenkins, test this please

Test build #118733 has finished for PR 27640 at commit

Test build #118737 has finished for PR 27640 at commit

Test build #118738 has finished for PR 27640 at commit
Fix for #27395

### What changes were proposed in this pull request?

The `allGather` method is added to `BarrierTaskContext`. It provides the same functionality as `BarrierTaskContext.barrier`: it blocks the task until all tasks make the call, at which point they may continue execution. In addition, `allGather` takes an input message. Upon returning from `allGather`, the task receives a list of the messages sent by all the tasks that made the `allGather` call.

### Why are the changes needed?

There are many situations where having the tasks communicate in a synchronized way is useful. One simple example is when each task needs to start a server to serve requests from the others: first each task must find a free port (which cannot be determined beforehand) and then start making requests, but to do so each task must know the ports chosen by the other tasks. An `allGather` method would allow them to inform each other of the port they will run on.

### Does this PR introduce any user-facing change?

Yes, a `BarrierTaskContext.allGather` method will be available through the Scala, Java, and Python APIs.

### How was this patch tested?

Most of the code path is already covered by the tests for the `barrier` method, since this PR includes a refactor so that much of the code is shared by the `barrier` and `allGather` methods. In addition, a test is added to assert that an all-gather on each task's partition ID returns a list of every partition ID.

An example through the Python API:

```python
>>> from pyspark import BarrierTaskContext
>>>
>>> def f(iterator):
...     context = BarrierTaskContext.get()
...     return [context.allGather('{}'.format(context.partitionId()))]
...
>>> sc.parallelize(range(4), 4).barrier().mapPartitions(f).collect()[0]
[u'3', u'1', u'0', u'2']
```

Closes #27640 from sarthfrey/master.

Lead-authored-by: sarthfrey-db <sarth.frey@databricks.com>
Co-authored-by: sarthfrey <sarth.frey@gmail.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
(cherry picked from commit 274b328)
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
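To make the described semantics concrete without a Spark cluster, here is a minimal sketch that imitates the behavior with plain Python threads. It is not Spark's implementation: `all_gather_simulation`, the `slots` list, and the made-up port numbers are all hypothetical stand-ins, and `threading.Barrier` plays the role of the barrier stage. Each "task" posts one message, blocks until every task has posted, and then receives the full list.

```python
import threading


def all_gather_simulation(num_tasks):
    """Imitate allGather semantics; threads stand in for barrier tasks."""
    slots = [None] * num_tasks      # one slot per (hypothetical) partition ID
    results = [None] * num_tasks    # what each task sees after the gather
    barrier = threading.Barrier(num_tasks)

    def task(partition_id):
        # Assumption for illustration: the message is a made-up port number.
        message = str(50000 + partition_id)
        slots[partition_id] = message
        # Block until every task has posted its message.
        barrier.wait()
        # After the barrier, every task can read all posted messages.
        results[partition_id] = list(slots)

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


gathered = all_gather_simulation(4)
print(gathered[0])
```

Every simulated task ends up with the same list, which is the property the port-exchange use case relies on: once the call returns, each task knows the value chosen by all the others.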
Thanks, merged to master/3.0!
 */
@Experimental
@Since("3.0.0")
def allGather(message: String): ArrayBuffer[String] = {
Just out of curiosity, why return an ArrayBuffer[String] instead of an Array[String] here?
I didn't have a particular reason in mind for ArrayBuffer[String] over Array[String], @zhengruifeng do you think the latter is preferable here, and if so, why? The returned collection is indexed and sorted by partition ID so I preferred those over Seq which is vague about whether it is naturally indexed or linear.
OK sure, IndexedSeq or Array is fine. Just something immutable.
Gotcha, will submit a PR.
… type

This PR proposes that we change the return type of the `BarrierTaskContext.allGather` method to `Array[String]` instead of `ArrayBuffer[String]` since it is immutable. Based on discussion in #27640.

cc zhengruifeng srowen

Closes #27951 from sarthfrey/all-gather-api.

Authored-by: sarthfrey-db <sarth.frey@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 6fd3138)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
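The review thread concerns Scala types, but the underlying design point (return something indexed by partition ID that callers are not expected to modify) can be illustrated loosely in Python. The sketch below is only an analogy, not how PySpark returns results: `gather_by_partition` and its dictionary input are hypothetical, and a tuple is used as the read-only, indexed container.

```python
def gather_by_partition(messages_by_partition):
    """Return messages ordered by partition ID as a read-only tuple.

    `messages_by_partition` is a hypothetical dict of partition ID -> message.
    """
    ordered = [messages_by_partition[pid] for pid in sorted(messages_by_partition)]
    return tuple(ordered)


snapshot = gather_by_partition({2: "c", 0: "a", 1: "b"})
print(snapshot[1])  # index 1 holds the message from partition 1

try:
    snapshot[0] = "z"  # tuples reject item assignment
    modified = True
except TypeError:
    modified = False
```

Indexing by position mirrors the argument made in the thread for an indexed collection over a generic `Seq`, and the failed assignment shows why a read-only container makes the result safer to share between callers.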
Fix for #27395

### What changes were proposed in this pull request?

The `allGather` method is added to `BarrierTaskContext`. It provides the same functionality as `BarrierTaskContext.barrier`: it blocks the task until all tasks make the call, at which point they may continue execution. In addition, `allGather` takes an input message. Upon returning from `allGather`, the task receives a list of the messages sent by all the tasks that made the `allGather` call.

### Why are the changes needed?

There are many situations where having the tasks communicate in a synchronized way is useful. One simple example is when each task needs to start a server to serve requests from the others: first each task must find a free port (which cannot be determined beforehand) and then start making requests, but to do so each task must know the ports chosen by the other tasks. An `allGather` method would allow them to inform each other of the port they will run on.

### Does this PR introduce any user-facing change?

Yes, a `BarrierTaskContext.allGather` method will be available through the Scala, Java, and Python APIs.

### How was this patch tested?

Most of the code path is already covered by the tests for the `barrier` method, since this PR includes a refactor so that much of the code is shared by the `barrier` and `allGather` methods. In addition, a test is added to assert that an all-gather on each task's partition ID returns a list of every partition ID.

An example through the Python API: