Conversation

@dongjoon-hyun (Member) commented Oct 15, 2024

What changes were proposed in this pull request?

This PR aims to fix the `BarrierTaskContextSuite.successively sync with allGather and barrier` test case to be robust.

Why are the changes needed?

The test case asserts that all partitions finish the first round of global sync within a 1-second window. However, this is flaky because we don't know when each partition is triggered before the `barrier` sync.

https://github.com/apache/spark/blob/0e75d19a736aa18fe77414991ebb7e3577a43af8/core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala#L116-L118

```scala
// All the tasks shall finish the first round of global sync within a short time slot.
val times1 = times.map(_._1)
assert(times1.max - times1.min <= 1000)
```

Although we added `TestUtils.waitUntilExecutorsUp` at Apache Spark 3.0.0 like the following (see also the usage sketch after the failure output below),

- #28658

let's say one partition starts 38ms later than the others and every partition sleeps exactly 1s. Then the test case fails like the following.
- https://github.com/apache/spark/actions/runs/11298639789/job/31428018075

```
BarrierTaskContextSuite:
...
- successively sync with allGather and barrier *** FAILED ***
  1038 was not less than or equal to 1000 (BarrierTaskContextSuite.scala:118)
```
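
For reference, `TestUtils.waitUntilExecutorsUp` only guarantees that the executors are registered before the job is submitted; it does not control when each individual barrier task is launched afterwards. Its use looks roughly like the sketch below (the master string, executor count, and timeout are illustrative values, not necessarily this suite's exact setup):

```scala
// TestUtils is private[spark], so a caller has to live under the org.apache.spark package.
package org.apache.spark.scheduler

import org.apache.spark.{SparkConf, SparkContext, TestUtils}

object WaitForExecutorsSketch {
  def main(args: Array[String]): Unit = {
    // A small local cluster with 4 single-core executors, enough slots for a
    // 4-partition barrier stage.
    val conf = new SparkConf()
      .setMaster("local-cluster[4, 1, 1024]")
      .setAppName("barrier-suite-sketch")
    val sc = new SparkContext(conf)
    try {
      // Block until all 4 executors are registered, or fail after 60 seconds,
      // so the barrier stage is not additionally delayed by executor start-up.
      TestUtils.waitUntilExecutorsUp(sc, 4, 60000)
    } finally {
      sc.stop()
    }
  }
}
```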

According to the failure history here (SPARK-49983) and SPARK-31730, the slowness seems to be less than 200ms when it happens. So, this PR reduces the flakiness by capping the random sleep at 500ms while keeping the 1s validation. There is no change in test coverage because this test case still focuses on the successive sync with `allGather` and `barrier`.
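
To make the headroom argument concrete, here is a Spark-free sketch of that reasoning (an illustration, not the suite's code; it assumes the spread of the recorded times is roughly the difference in sleep durations plus start-up jitter):

```scala
import java.util.concurrent.Executors

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.util.Random

// Spark-free sketch: four simulated "partitions" each sleep a random, capped
// amount and record a timestamp. The spread of those timestamps is at most the
// sleep cap plus start-up jitter, so a 500ms cap leaves roughly 500ms of
// headroom under the unchanged 1000ms bound.
object SleepCapSketch {
  def main(args: Array[String]): Unit = {
    val numPartitions = 4
    val sleepCapMs = 500 // the cap from this PR; the previous effective cap was ~1000ms
    val pool = Executors.newFixedThreadPool(numPartitions)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    val work = Future.sequence((0 until numPartitions).map { _ =>
      Future {
        Thread.sleep(Random.nextInt(sleepCapMs)) // capped random sleep
        System.currentTimeMillis()               // stands in for the recorded sync time
      }
    })
    val times = Await.result(work, 30.seconds)
    println(s"spread = ${times.max - times.min} ms (bound stays at 1000 ms)")
    assert(times.max - times.min <= 1000)

    pool.shutdown()
  }
}
```

With the observed slowness of under 200ms, a 500ms cap keeps the worst case around 700ms, comfortably below the unchanged 1000ms bound.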

Does this PR introduce any user-facing change?

No, this is a test-only change.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

…sync with allGather and barrier` test case to be robust
@github-actions github-actions bot added the CORE label Oct 15, 2024
@xinrong-meng (Member)

LGTM, the tests in Build (pull_request_target) all passed. Thank you!

@dongjoon-hyun (Member, Author)

Thank you so much, @xinrong-meng .
Merged to master/3.5/3.4.

dongjoon-hyun added a commit that referenced this pull request Oct 16, 2024
…sync with allGather and barrier` test case to be robust

Closes #48487 from dongjoon-hyun/SPARK-49983.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit bcfe62b)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Oct 16, 2024
…sync with allGather and barrier` test case to be robust

Closes #48487 from dongjoon-hyun/SPARK-49983.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit bcfe62b)
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d37a8b9)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun deleted the SPARK-49983 branch October 16, 2024 14:26