@jiangxb1987
Contributor

What changes were proposed in this pull request?

Wait until all the executors have started before submitting any job. This avoids flakiness caused by executors that have not yet come up.

How was this patch tested?

Existing tests.
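The fix boils down to a poll-until-ready wait before job submission. As a rough illustration of the idea behind `TestUtils.waitUntilExecutorsUp` (not Spark's actual implementation; the helper and the simulated executor counter below are hypothetical), a generic wait helper might look like:

```python
import time

def wait_until(condition, timeout_s=60.0, poll_interval_s=0.1):
    """Poll `condition` until it returns True or `timeout_s` elapses.

    Mirrors the idea behind TestUtils.waitUntilExecutorsUp: block the
    test until the cluster is ready instead of submitting jobs eagerly.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval_s)
    raise TimeoutError(f"condition not met within {timeout_s}s")

# Example: wait until a (simulated) executor count reaches the target.
executors = {"count": 0}

def executor_registered():
    executors["count"] += 1  # stand-in for an executor registering
    return executors["count"] >= 4

wait_until(executor_registered, timeout_s=5.0)
```

Without such a wait, a barrier job submitted before all executors register can hang or produce skewed timings, which is exactly the flakiness the PR targets.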

Closes apache#28584 from jiangxb1987/barrierTest.

Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Xingbo Jiang <[email protected]>
@SparkQA

SparkQA commented May 28, 2020

Test build #123203 has finished for PR 28658 at commit 4359923.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor Author

Retest this please

@SparkQA

SparkQA commented May 28, 2020

Test build #123240 has finished for PR 28658 at commit 4359923.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jiangxb1987 added a commit that referenced this pull request May 28, 2020
…uite

Closes #28658 from jiangxb1987/barrierTest.

Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Xingbo Jiang <[email protected]>
@jiangxb1987
Copy link
Contributor Author

Merged to 3.0!

dongjoon-hyun added a commit that referenced this pull request Oct 16, 2024
…sync with allGather and barrier` test case to be robust

### What changes were proposed in this pull request?

This PR aims to make the `BarrierTaskContextSuite.successively sync with allGather and barrier` test case robust.

### Why are the changes needed?

The test case asserts the duration of partitions. However, this is flaky because we don't know when a partition is triggered before `barrier` sync.

https://github.com/apache/spark/blob/0e75d19a736aa18fe77414991ebb7e3577a43af8/core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala#L116-L118

Although we added `TestUtils.waitUntilExecutorsUp` in Apache Spark 3.0.0 via the following PR,

- #28658

suppose a partition starts more than `38ms` late while all partitions sleep exactly `1s`. Then the test case fails like the following.
- https://github.com/apache/spark/actions/runs/11298639789/job/31428018075
```
BarrierTaskContextSuite:
...
- successively sync with allGather and barrier *** FAILED ***
  1038 was not less than or equal to 1000 (BarrierTaskContextSuite.scala:118)
```
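For intuition about why start skew inflates the measured duration: the test's successive `barrier()` calls behave like a reusable thread barrier, where every task blocks until the slowest starter arrives. A rough Python analogue using the standard library (not the Spark API; task count and skew values are illustrative) is:

```python
import threading
import time

NUM_TASKS = 4
# Stand-in for BarrierTaskContext.barrier(); reusable across syncs.
barrier = threading.Barrier(NUM_TASKS)
durations_ms = [0.0] * NUM_TASKS

def task(idx, start_delay_s):
    time.sleep(start_delay_s)  # simulated start skew
    start = time.monotonic()
    barrier.wait()             # first sync: blocks until the slowest starter
    barrier.wait()             # second, successive sync
    durations_ms[idx] = (time.monotonic() - start) * 1000

threads = [
    threading.Thread(target=task, args=(i, 0.05 * i))
    for i in range(NUM_TASKS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The last task to start spends almost no time blocked, while the earliest
# starter waits roughly the full skew at the first barrier -- so per-task
# measured durations differ by the start skew, just like in the flaky test.
```

This is why a duration assertion with no headroom over the sleep time is sensitive to even a few tens of milliseconds of start skew.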

According to the failure history here (SPARK-49983) and in SPARK-31730, the slowness appears to be less than `200ms` when it happens. So, this PR reduces the flakiness by capping the sleep at 500ms while keeping the `1s` validation. There is no test coverage change because this test case focuses on `successively sync with allGather and barrier`.
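The margin can be checked with simple arithmetic. The `38ms` failure, the `200ms` skew bound, and the `500ms` cap come from the discussion above; the model itself is just illustrative:

```python
ASSERT_LIMIT_MS = 1000   # the test keeps its `1s` validation
WORST_SKEW_MS = 200      # worst-case start slowness seen in the history

def observed_ms(skew_ms, sleep_ms):
    # Duration the assertion sees: start skew plus the partition's sleep.
    return skew_ms + sleep_ms

# Before the fix: a full 1s sleep leaves no margin, so 38ms of skew fails.
assert observed_ms(38, 1000) > ASSERT_LIMIT_MS   # 1038ms, as in the CI log

# After the fix: capping the sleep at 500ms leaves 500ms of headroom,
# comfortably above the 200ms worst-case skew.
assert observed_ms(WORST_SKEW_MS, 500) <= ASSERT_LIMIT_MS   # 700ms
```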

### Does this PR introduce _any_ user-facing change?

No, this is a test-only change.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48487 from dongjoon-hyun/SPARK-49983.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Oct 16, 2024
…sync with allGather and barrier` test case to be robust

(cherry picked from commit bcfe62b)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Oct 16, 2024
…sync with allGather and barrier` test case to be robust

(cherry picked from commit bcfe62b)
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d37a8b9)
Signed-off-by: Dongjoon Hyun <[email protected]>