[SPARK-30667][CORE] Add allGather method to BarrierTaskContext by sarthfrey · Pull Request #27395 · apache/spark

sarthfrey · 2020-01-30T06:39:42Z

What changes were proposed in this pull request?

This PR adds an allGather method to BarrierTaskContext. Like BarrierTaskContext.barrier, it blocks the task until all tasks have made the call, at which point they all continue execution. In addition, allGather takes an input message. When the call returns, each task receives a list of the messages sent by all the tasks that made the allGather call.
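The semantics described above can be illustrated without Spark. The following is a minimal sketch that simulates an all-gather across local threads with `threading.Barrier`; it is not how this PR implements the method, and `simulate_all_gather` is a hypothetical name used only for illustration. The sketch returns messages in task-index order, which is an assumption of the sketch and not a guarantee stated by the PR.

```python
import threading


def simulate_all_gather(messages):
    """Run one thread per message; every thread ends up with all messages.

    Hypothetical helper: a local simulation of all-gather semantics,
    not Spark's implementation.
    """
    n = len(messages)
    slots = [None] * n      # slot i holds the message published by task i
    results = [None] * n    # what task i observes after the call returns
    barrier = threading.Barrier(n)

    def task(task_id):
        slots[task_id] = messages[task_id]  # publish this task's message
        barrier.wait()                      # block until every task has published
        results[task_id] = list(slots)      # each task receives the full list

    threads = [threading.Thread(target=task, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each simulated task blocks at the barrier exactly as it would with a plain barrier call; the only addition is the shared list that every task reads once the barrier releases.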

Why are the changes needed?

There are many situations where it is useful for tasks to communicate in a synchronized way. One simple example: each task needs to start a server to serve requests from the others. Each task must first find a free port (which cannot be known in advance) and then start making requests, but to do so it must know the ports chosen by the other tasks. An allGather method would allow the tasks to inform each other of the ports they will run on.
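The port-exchange scenario could look roughly like the sketch below. It assumes `context` exposes `partitionId()` and a string-based `allGather(message)` as in this description's Python example; `pick_free_port`, `exchange_ports`, and the `"id:port"` message format are hypothetical choices made for illustration.

```python
import socket


def pick_free_port():
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def exchange_ports(context):
    """Publish this task's port and return {partition_id: port} for all tasks.

    Assumption: `context` behaves like BarrierTaskContext in the example
    from this description (partitionId() and allGather(str) -> list of str).
    """
    port = pick_free_port()
    message = "{}:{}".format(context.partitionId(), port)
    peers = {}
    for entry in context.allGather(message):  # blocks until all tasks call it
        partition_id, peer_port = entry.split(":")
        peers[int(partition_id)] = int(peer_port)
    return peers
```

Inside a barrier stage this would be called with `BarrierTaskContext.get()`. Note that the port is released when the probing socket closes, so a real server would need to tolerate another process claiming it before it binds.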

Does this PR introduce any user-facing change?

Yes, a BarrierTaskContext.allGather method will be available through the Scala, Java, and Python APIs.

How was this patch tested?

Most of the code path is already covered by the tests for the barrier method, since this PR includes a refactor so that much of the code is shared by the barrier and allGather methods. In addition, a test is added to assert that an allGather on each task's partition ID returns a list of every partition ID.

An example through the Python API:

>>> from pyspark import BarrierTaskContext
>>>
>>> def f(iterator):
...     context = BarrierTaskContext.get()
...     return [context.allGather('{}'.format(context.partitionId()))]
...
>>> sc.parallelize(range(4), 4).barrier().mapPartitions(f).collect()[0]
[u'3', u'1', u'0', u'2']

HyukjinKwon · 2020-01-30T09:12:56Z

@sarthfrey, please include the JIRA ID in the PR title. See also https://spark.apache.org/contributing.html

core/src/main/scala/org/apache/spark/BarrierCoordinator.scala

core/src/main/scala/org/apache/spark/BarrierTaskContext.scala

SparkQA · 2020-02-14T20:06:33Z

Test build #118446 has finished for PR 27395 at commit 6398066.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-14T23:23:56Z

Test build #118452 has finished for PR 27395 at commit 377d8d2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-19T00:14:15Z

Test build #118655 has finished for PR 27395 at commit ff7f3dd.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-19T03:22:20Z

Test build #118659 has finished for PR 27395 at commit d2fffe1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-19T03:34:46Z

Test build #118658 has finished for PR 27395 at commit 24adef3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2020-02-19T17:36:51Z

retest this please

mengxr · 2020-02-19T17:38:08Z

The failed test seems unrelated: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.(It is not a test it is a sbt.testing.SuiteSelector)

SparkQA · 2020-02-19T20:06:11Z

Test build #118681 has finished for PR 27395 at commit d2fffe1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Closes #27395 from sarthfrey/master.

Lead-authored-by: sarthfrey-db <sarth.frey@databricks.com>
Co-authored-by: sarthfrey <sarth.frey@gmail.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
(cherry picked from commit 57254c9)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

mengxr · 2020-02-19T20:13:44Z

LGTM. Merged into both master and branch-3.0. Thanks!

gengliangwang · 2020-02-19T22:26:29Z

It seems that this PR breaks the MiMa test in the Jenkins PR builder job (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118684/console):

dev/mima -Phadoop-2.7 -Phive-2.3 -Pkinesis-asl -Phive -Phive-thriftserver -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pyarn

sarthfrey · 2020-02-19T23:59:26Z

> It seems that this PR breaks the MiMa test in the Jenkins PR builder job (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118684/console)
> dev/mima -Phadoop-2.7 -Phive-2.3 -Pkinesis-asl -Phive -Phive-thriftserver -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pyarn

Hmm, odd. This PR adds ProblemFilters.exclude[IncompatibleTemplateDefProblem]("org.apache.spark.RequestToSync")

jiangxb1987 · 2020-02-20T01:09:06Z

Reverted from both master and 3.0

The merge script seems to behave oddly: when you tried to merge this PR, it automatically cherry-picked the latest commit (which had been reverted before).
Here is the output from my local environment:

Enter a branch name [branch-3.0]:       
git fetch apache branch-3.0:PR_TOOL_PICK_PR_27395_BRANCH-3.0
remote: Enumerating objects: 307, done.
remote: Counting objects: 100% (307/307), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 692 (delta 288), reused 305 (delta 288), pack-reused 385
Receiving objects: 100% (692/692), 98.68 KiB | 5.48 MiB/s, done.
Resolving deltas: 100% (336/336), completed with 79 local objects.
From https://github.com/apache/spark
 * [new branch]            branch-3.0 -> PR_TOOL_PICK_PR_27395_BRANCH-3.0
 * [new branch]            branch-3.0 -> apache/branch-3.0
git checkout PR_TOOL_PICK_PR_27395_BRANCH-3.0
Switched to branch 'PR_TOOL_PICK_PR_27395_BRANCH-3.0'
git cherry-pick -sx 57254c9719f9af9ad985596ed7fbbaafa4052002
The previous cherry-pick is now empty, possibly due to conflict resolution.

jiangxb1987 · 2020-02-20T01:10:07Z

@sarthfrey Please open a new PR instead and then let's try to merge it again.

Fix for #27395. Closes #27640 from sarthfrey/master.

Lead-authored-by: sarthfrey-db <sarth.frey@databricks.com>
Co-authored-by: sarthfrey <sarth.frey@gmail.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>

Fix for #27395. Closes #27640 from sarthfrey/master.

Lead-authored-by: sarthfrey-db <sarth.frey@databricks.com>
Co-authored-by: sarthfrey <sarth.frey@gmail.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
(cherry picked from commit 274b328)
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>

sarthfrey added 2 commits January 29, 2020 22:18

Add all gather method to BarrierTaskContext

7370857

change method to allGather

fec40fe

sarthfrey changed the title ~~Add all Gather method to BarrierTaskContext~~ Add allGather method to BarrierTaskContext Jan 30, 2020

sarthfrey requested review from jiangxb1987 and mengxr January 30, 2020 06:41

sarthfrey added 2 commits January 29, 2020 22:44

fix docstring

390fb1f

fix test

a1229c9

sarthfrey changed the title ~~Add allGather method to BarrierTaskContext~~ [SPARK-30667] Add allGather method to BarrierTaskContext Jan 30, 2020

sarthfrey and others added 17 commits January 30, 2020 02:03

test2

2b8a199

test3

d63eab3

test4

ebd102c

test5

7fad912

Change API to send and receive bytes rather than strings

62b8a30

doc fix

f17cdd5

doc fix 2

d52f0ba

fix test

f62a1d5

fix test 2

ec198f1

fix test 3

f7bdd8a

fix test 4

2079548

fix test 5

aca81bc

fix test 6

76ea287

fix test 7

8a4c450

fix test final

adfab5d

add python test

149e1f3

fix test final 2

47c514a

dongjoon-hyun changed the title ~~[SPARK-30667] Add allGather method to BarrierTaskContext~~ [SPARK-30667][CORE] Add allGather method to BarrierTaskContext Jan 31, 2020

jiangxb1987 reviewed Jan 31, 2020

address review round 1

c047af8

sarthfrey requested a review from jiangxb1987 February 1, 2020 02:53

fix semicolon

377d8d2

fix tests

ff7f3dd

sarthfrey added 2 commits February 18, 2020 17:21

fix python unit test

24adef3

fix python unit test final

d2fffe1

asfgit closed this in af63971 Feb 19, 2020

gengliangwang mentioned this pull request Feb 19, 2020

[SPARK-30881][SQL][DOCS]Revise the doc of spark.sql.sources.parallelPartitionDiscovery.threshold #27639

Closed

sarthfrey mentioned this pull request Feb 20, 2020

[SPARK-30667][CORE] Add all gather method to BarrierTaskContext #27640

Closed

Ngone51 mentioned this pull request Apr 4, 2020

[SPARK-31344][CORE] Polish implementation of barrier() and allGather() #28117

Closed

9 participants