Conversation

@dongjoon-hyun (Member) commented Oct 15, 2024

What changes were proposed in this pull request?

This PR aims to fix the `BarrierTaskContextSuite.successively sync with allGather and barrier` test case to be robust.

Why are the changes needed?

The test case asserts that all partitions finish the first round of global sync within a 1-second window. However, this is flaky because we don't know when each partition is triggered before the `barrier` sync.

https://github.com/apache/spark/blob/0e75d19a736aa18fe77414991ebb7e3577a43af8/core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala#L116-L118

```scala
// All the tasks shall finish the first round of global sync within a short time slot.
val times1 = times.map(_._1)
assert(times1.max - times1.min <= 1000)
```

Although we added `TestUtils.waitUntilExecutorsUp` at Apache Spark 3.0.0 like the following (see also the usage sketch after the failure output below),

- #28658

let's say one partition starts 38ms later than the others and every partition sleeps exactly 1s. Then the test case fails like the following.
- https://github.com/apache/spark/actions/runs/11298639789/job/31428018075

```
BarrierTaskContextSuite:
...
- successively sync with allGather and barrier *** FAILED ***
  1038 was not less than or equal to 1000 (BarrierTaskContextSuite.scala:118)
```
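
For reference, `TestUtils.waitUntilExecutorsUp` only guarantees that the executors are registered before the job is submitted; it does not control when each individual barrier task is launched afterwards. Its use looks roughly like the sketch below (the master string, executor count, and timeout are illustrative values, not necessarily this suite's exact setup):

```scala
// TestUtils is private[spark], so a caller has to live under the org.apache.spark package.
package org.apache.spark.scheduler

import org.apache.spark.{SparkConf, SparkContext, TestUtils}

object WaitForExecutorsSketch {
  def main(args: Array[String]): Unit = {
    // A small local cluster with 4 single-core executors, enough slots for a
    // 4-partition barrier stage.
    val conf = new SparkConf()
      .setMaster("local-cluster[4, 1, 1024]")
      .setAppName("barrier-suite-sketch")
    val sc = new SparkContext(conf)
    try {
      // Block until all 4 executors are registered, or fail after 60 seconds,
      // so the barrier stage is not additionally delayed by executor start-up.
      TestUtils.waitUntilExecutorsUp(sc, 4, 60000)
    } finally {
      sc.stop()
    }
  }
}
```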

According to the failure history here (SPARK-49983) and SPARK-31730, the slowness seems to be less than 200ms when it happens. So, this PR reduces the flakiness by capping the random sleep at 500ms while keeping the 1s validation. There is no change in test coverage because this test case still focuses on the successive sync with `allGather` and `barrier`.
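
To make the headroom argument concrete, here is a Spark-free sketch of that reasoning (an illustration, not the suite's code; it assumes the spread of the recorded times is roughly the difference in sleep durations plus start-up jitter):

```scala
import java.util.concurrent.Executors

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.util.Random

// Spark-free sketch: four simulated "partitions" each sleep a random, capped
// amount and record a timestamp. The spread of those timestamps is at most the
// sleep cap plus start-up jitter, so a 500ms cap leaves roughly 500ms of
// headroom under the unchanged 1000ms bound.
object SleepCapSketch {
  def main(args: Array[String]): Unit = {
    val numPartitions = 4
    val sleepCapMs = 500 // the cap from this PR; the previous effective cap was ~1000ms
    val pool = Executors.newFixedThreadPool(numPartitions)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    val work = Future.sequence((0 until numPartitions).map { _ =>
      Future {
        Thread.sleep(Random.nextInt(sleepCapMs)) // capped random sleep
        System.currentTimeMillis()               // stands in for the recorded sync time
      }
    })
    val times = Await.result(work, 30.seconds)
    println(s"spread = ${times.max - times.min} ms (bound stays at 1000 ms)")
    assert(times.max - times.min <= 1000)

    pool.shutdown()
  }
}
```

With the observed slowness of under 200ms, a 500ms cap keeps the worst case around 700ms, comfortably below the unchanged 1000ms bound.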

Does this PR introduce any user-facing change?

No, this is a test-only change.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

…sync with allGather and barrier` test case to be robust
@github-actions github-actions bot added the CORE label Oct 15, 2024
@xinrong-meng (Member)

LGTM, the tests in Build (pull_request_target) all passed. Thank you!

@dongjoon-hyun (Member, Author)

Thank you so much, @xinrong-meng .
Merged to master/3.5/3.4.

dongjoon-hyun added a commit that referenced this pull request Oct 16, 2024
…sync with allGather and barrier` test case to be robust

Closes #48487 from dongjoon-hyun/SPARK-49983.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit bcfe62b)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Oct 16, 2024
…sync with allGather and barrier` test case to be robust

Closes #48487 from dongjoon-hyun/SPARK-49983.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit bcfe62b)
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d37a8b9)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun deleted the SPARK-49983 branch October 16, 2024 14:26