[SPARK-30667][CORE] Add all gather method to BarrierTaskContext by sarthfrey · Pull Request #27640 · apache/spark

sarthfrey · 2020-02-20T01:06:57Z

What changes were proposed in this pull request?

This PR adds an allGather method to BarrierTaskContext. Like BarrierTaskContext.barrier, it blocks the task until all tasks have made the call, after which they may continue execution. In addition, allGather takes an input message; on return, each task receives a list of the messages sent by all the tasks that made the allGather call.
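
The semantics described above can be illustrated outside Spark with a small thread-based simulation. This is only a sketch: `SimulatedBarrierContext`, `all_gather`, and `run_stage` are hypothetical names invented for illustration, the code is not how Spark implements allGather, and ordering the result by partition ID is an assumption of the sketch.

```python
import threading


class SimulatedBarrierContext:
    """Toy stand-in for one barrier stage (hypothetical, not part of Spark)."""

    def __init__(self, num_tasks):
        self._messages = [None] * num_tasks
        self._barrier = threading.Barrier(num_tasks)

    def all_gather(self, partition_id, message):
        # Record this task's message in its own slot.
        self._messages[partition_id] = message
        # Block until every task has made the call, as barrier() would.
        self._barrier.wait()
        # Each task receives its own copy of all messages.
        # Note: this toy version supports a single round only.
        return list(self._messages)


def run_stage(num_tasks):
    """Run one thread per simulated task; return what each task received."""
    stage = SimulatedBarrierContext(num_tasks)
    results = [None] * num_tasks

    def task(partition_id):
        results[partition_id] = stage.all_gather(partition_id, str(partition_id))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

No task returns from `all_gather` before the last one has called it, and every task ends up with the same list of messages.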

Why are the changes needed?

There are many situations where having the tasks communicate in a synchronized way is useful. One simple example is when each task needs to start a server to serve requests from the others: each task must first find a free port (which cannot be known beforehand) and can then start making requests, but to do so it must know the ports chosen by the other tasks. An allGather method lets the tasks inform each other of the ports they will run on.
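
A rough sketch of that port exchange follows. The helper names `find_free_port` and `exchange_ports` are hypothetical; the only thing assumed about `context` is that it offers the `allGather(message)` method described in this PR (in a real job it would come from `BarrierTaskContext.get()` inside a barrier stage).

```python
import socket


def find_free_port():
    """Ask the OS for a currently free TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS pick a free port
        # Caveat: the port is released when the socket closes, so another
        # process could in principle claim it before the server starts.
        return s.getsockname()[1]


def exchange_ports(context):
    """Pick a local port and learn the ports picked by all other tasks."""
    port = find_free_port()
    # Blocks until every task has called allGather, then returns all messages.
    all_ports = [int(p) for p in context.allGather(str(port))]
    return port, all_ports
```

Messages are sent as strings here because the API in this PR accepts a string message, so each port is converted with `str` before sending and with `int` after receiving.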

Does this PR introduce any user-facing change?

Yes, a BarrierTaskContext.allGather method will be available through the Scala, Java, and Python APIs.

How was this patch tested?

Most of the code path is already covered by the tests for the barrier method, since this PR includes a refactor so that much of the code is shared by the barrier and allGather methods. In addition, a test is added to assert that an allGather on each task's partition ID returns a list of every partition ID.

An example through the Python API:

>>> from pyspark import BarrierTaskContext
>>>
>>> def f(iterator):
...     context = BarrierTaskContext.get()
...     return [context.allGather('{}'.format(context.partitionId()))]
...
>>> sc.parallelize(range(4), 4).barrier().mapPartitions(f).collect()[0]
[u'3', u'1', u'0', u'2']
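
A check of the kind described in the testing section could compare the gathered messages with the expected partition IDs without depending on their order. The function name `gathered_every_partition` is hypothetical, and this is a sketch of the idea rather than the test added in the PR.

```python
def gathered_every_partition(gathered, num_partitions):
    """Return True if `gathered` holds each partition ID exactly once."""
    expected = sorted(str(i) for i in range(num_partitions))
    # Sort both sides so the check does not depend on result ordering.
    return sorted(gathered) == expected
```

Applied to the output shown above, `gathered_every_partition([u'3', u'1', u'0', u'2'], 4)` holds, while a list with a missing or repeated ID does not pass.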

change method to allGather fix docstring fix test test2 test3 test4 test5 Change API to send and receive bytes rather than strings doc fix doc fix 2 fix test fix test 2 fix test 3 fix test 4 fix test 5 fix test 6 fix test 7 fix test final add python test fix test final 2 address review round 1 Change allGather API to accept string over bytes addressed review feedback round 2 comments rm trailing whitespace address review feedback round 2 address review round 3 address review feedback round 4 address review round 5 address review round 6 fix test retrigger build retrigger build add mima exclusion rule fix semicolon fix tests fix python unit test fix python unit test final temp

jiangxb1987 · 2020-02-20T01:30:34Z

OK to test

mengxr · 2020-02-20T06:24:25Z

"""
/home/runner/work/spark/spark/python/pyspark/taskcontext.py:docstring of pyspark.BarrierTaskContext.getTaskInfos:2:Explicit markup ends without a blank line; unexpected unindent.
"""

mengxr · 2020-02-20T20:15:08Z

jenkins, add to whitelist

mengxr · 2020-02-20T20:15:47Z

jenkins, test this please

SparkQA · 2020-02-20T22:08:25Z

Test build #118733 has finished for PR 27640 at commit c1d1b0e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-21T02:55:00Z

Test build #118737 has finished for PR 27640 at commit 3f1f709.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-21T03:49:46Z

Test build #118738 has finished for PR 27640 at commit 7c259ac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987

LGTM

Fix for #27395

Closes #27640 from sarthfrey/master.

Lead-authored-by: sarthfrey-db <sarth.frey@databricks.com>
Co-authored-by: sarthfrey <sarth.frey@gmail.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
(cherry picked from commit 274b328)
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>

jiangxb1987 · 2020-02-21T19:41:34Z

Thanks, merged to master/3.0 !

zhengruifeng · 2020-03-04T07:39:53Z

core/src/main/scala/org/apache/spark/BarrierTaskContext.scala

+   */
+  @Experimental
+  @Since("3.0.0")
+  def allGather(message: String): ArrayBuffer[String] = {


Just out of curiosity, why return an ArrayBuffer[String] instead of an Array[String] here?

zhengruifeng · Mar 12, 2020

Friendly ping @jiangxb1987 @sarthfrey

zhengruifeng · Mar 16, 2020

cc @gatorsmile @srowen

srowen · Mar 16, 2020

Fair point; why not just Seq?

sarthfrey · Mar 16, 2020

I didn't have a particular reason in mind for ArrayBuffer[String] over Array[String]. @zhengruifeng, do you think the latter is preferable here, and if so, why? The returned collection is indexed and sorted by partition ID, so I preferred those over Seq, which is vague about whether it is naturally indexed or linear.

srowen · Mar 16, 2020

OK, sure: IndexedSeq or Array is fine. Just something immutable.

sarthfrey · Mar 16, 2020

Gotcha, will submit a PR.

… type

This PR proposes that we change the return type of the `BarrierTaskContext.allGather` method to `Array[String]` instead of `ArrayBuffer[String]` since it is immutable. Based on discussion in #27640.

cc zhengruifeng srowen

Closes #27951 from sarthfrey/all-gather-api.

Authored-by: sarthfrey-db <sarth.frey@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

(cherry picked from commit 6fd3138)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

sarthfrey changed the title ~~Add all gather method to BarrierTaskContext~~ [SPARK-30667][CORE] Add all gather method to BarrierTaskContext Feb 20, 2020

sarthfrey requested review from jiangxb1987 and mengxr February 20, 2020 01:08

sarthfrey force-pushed the master branch from 4b8e533 to 9c6c3ce Compare February 20, 2020 01:20

sarthfrey force-pushed the master branch from 9c6c3ce to 692b8ab Compare February 20, 2020 01:26

fix merge

3f8e9bf

sarthfrey added 2 commits February 19, 2020 17:32

fix spaces

711803a

fix spaces

6817643

fix indent

c1d1b0e

sarthfrey added 2 commits February 20, 2020 15:45

fix flaky test

3f1f709

revert and relax test assertion

7c259ac

jiangxb1987 approved these changes Feb 21, 2020

jiangxb1987 closed this in 274b328 Feb 21, 2020

zhengruifeng reviewed Mar 4, 2020

sarthfrey mentioned this pull request Mar 18, 2020

[SPARK-30667][FOLLOW-UP][CORE] Change BarrierTaskContext allGather method return type #27951

Closed
