Skip to content

Conversation

@wbo4958
Copy link
Contributor

@wbo4958 wbo4958 commented Oct 27, 2022

What changes were proposed in this pull request?

The messages returned by allGather may be overridden by the following barrier APIs, eg,

      val messages: Array[String] = context.allGather("ABC")
      context.barrier()

the messages may be like Array("", ""), but we're expecting Array("ABC", "ABC")

The root cause of this issue is the messages got by allGather pointing to the original message in the local mode. So when the following barrier APIs changed the messages, then the allGather message will be changed accordingly.
Finally, users can't get the correct result.

This PR fixed this issue by sending back the cloned messages.

Why are the changes needed?

The bug mentioned in this description may block some external SPARK ML libraries which heavily depend on the spark barrier API to do some synchronization. If the barrier mechanism can't guarantee the correctness of the barrier APIs, it will be a disaster for external SPARK ML libraries.

Does this PR introduce any user-facing change?

No

How was this patch tested?

I added a unit test, with this PR, the unit test can pass

@wbo4958
Copy link
Contributor Author

wbo4958 commented Oct 27, 2022

@jiangxb1987 could you help to review it?

@github-actions github-actions bot added the CORE label Oct 27, 2022
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@wbo4958
Copy link
Contributor Author

wbo4958 commented Oct 28, 2022

@cloud-fan could you help to review this PR?

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.3!

@cloud-fan cloud-fan closed this in 0b892a5 Oct 28, 2022
cloud-fan pushed a commit that referenced this pull request Oct 28, 2022
### What changes were proposed in this pull request?

The messages returned by allGather may be overridden by the following barrier APIs, eg,

``` scala
      val messages: Array[String] = context.allGather("ABC")
      context.barrier()
```

the  `messages` may be like Array("", ""), but we're expecting Array("ABC", "ABC")

The root cause of this issue is the [messages got by allGather](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/BarrierTaskContext.scala#L102) pointing to the [original message](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/BarrierCoordinator.scala#L107) in the local mode. So when the following barrier APIs changed the messages, then the allGather message will be changed accordingly.
Finally, users can't get the correct result.

This PR fixed this issue by sending back the cloned messages.

### Why are the changes needed?

The bug mentioned in this description may block some external SPARK ML libraries which heavily depend on the spark barrier API to do some synchronization. If the barrier mechanism can't guarantee the correctness of the barrier APIs, it will be a disaster for external SPARK ML libraries.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

I added a unit test, with this PR, the unit test can pass

Closes #38410 from wbo4958/allgather-issue.

Authored-by: Bobby Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 0b892a5)
Signed-off-by: Wenchen Fan <[email protected]>
@wbo4958 wbo4958 deleted the allgather-issue branch November 1, 2022 10:35
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
### What changes were proposed in this pull request?

The messages returned by allGather may be overridden by the following barrier APIs, eg,

``` scala
      val messages: Array[String] = context.allGather("ABC")
      context.barrier()
```

the  `messages` may be like Array("", ""), but we're expecting Array("ABC", "ABC")

The root cause of this issue is the [messages got by allGather](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/BarrierTaskContext.scala#L102) pointing to the [original message](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/BarrierCoordinator.scala#L107) in the local mode. So when the following barrier APIs changed the messages, then the allGather message will be changed accordingly.
Finally, users can't get the correct result.

This PR fixed this issue by sending back the cloned messages.

### Why are the changes needed?

The bug mentioned in this description may block some external SPARK ML libraries which heavily depend on the spark barrier API to do some synchronization. If the barrier mechanism can't guarantee the correctness of the barrier APIs, it will be a disaster for external SPARK ML libraries.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

I added a unit test, with this PR, the unit test can pass

Closes apache#38410 from wbo4958/allgather-issue.

Authored-by: Bobby Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants