@jiangxb1987
Contributor

What changes were proposed in this pull request?

Wait until all the executors have started before submitting any job. This avoids flakiness caused by executors that have not yet come up.

How was this patch tested?

Existing tests.
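The fix boils down to a poll-until-ready wait before job submission. As a rough illustration of the idea behind `TestUtils.waitUntilExecutorsUp` (not Spark's actual implementation; the helper and the simulated executor counter below are hypothetical), a generic wait helper might look like:

```python
import time

def wait_until(condition, timeout_s=60.0, poll_interval_s=0.1):
    """Poll `condition` until it returns True or `timeout_s` elapses.

    Mirrors the idea behind TestUtils.waitUntilExecutorsUp: block the
    test until the cluster is ready instead of submitting jobs eagerly.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval_s)
    raise TimeoutError(f"condition not met within {timeout_s}s")

# Example: wait until a (simulated) executor count reaches the target.
executors = {"count": 0}

def executor_registered():
    executors["count"] += 1  # stand-in for an executor registering
    return executors["count"] >= 4

wait_until(executor_registered, timeout_s=5.0)
```

Without such a wait, a barrier job submitted before all executors register can hang or produce skewed timings, which is exactly the flakiness the PR targets.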

Closes apache#28584 from jiangxb1987/barrierTest.

Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Xingbo Jiang <[email protected]>
@SparkQA

SparkQA commented May 28, 2020

Test build #123203 has finished for PR 28658 at commit 4359923.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor Author

Retest this please

@SparkQA

SparkQA commented May 28, 2020

Test build #123240 has finished for PR 28658 at commit 4359923.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jiangxb1987 added a commit that referenced this pull request May 28, 2020
…uite

Closes #28658 from jiangxb1987/barrierTest.

Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Xingbo Jiang <[email protected]>
@jiangxb1987
Copy link
Contributor Author

Merged to 3.0!

dongjoon-hyun added a commit that referenced this pull request Oct 16, 2024
…sync with allGather and barrier` test case to be robust

### What changes were proposed in this pull request?

This PR aims to make the `BarrierTaskContextSuite.successively sync with allGather and barrier` test case robust.

### Why are the changes needed?

The test case asserts the duration of partitions. However, this is flaky because we don't know when a partition is triggered before `barrier` sync.

https://github.com/apache/spark/blob/0e75d19a736aa18fe77414991ebb7e3577a43af8/core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala#L116-L118

Although we added `TestUtils.waitUntilExecutorsUp` in Apache Spark 3.0.0 via the following PR,

- #28658

suppose a partition starts more than `38ms` late while all partitions sleep exactly `1s`. Then the test case fails like the following.
- https://github.com/apache/spark/actions/runs/11298639789/job/31428018075
```
BarrierTaskContextSuite:
...
- successively sync with allGather and barrier *** FAILED ***
  1038 was not less than or equal to 1000 (BarrierTaskContextSuite.scala:118)
```
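For intuition about why start skew inflates the measured duration: the test's successive `barrier()` calls behave like a reusable thread barrier, where every task blocks until the slowest starter arrives. A rough Python analogue using the standard library (not the Spark API; task count and skew values are illustrative) is:

```python
import threading
import time

NUM_TASKS = 4
# Stand-in for BarrierTaskContext.barrier(); reusable across syncs.
barrier = threading.Barrier(NUM_TASKS)
durations_ms = [0.0] * NUM_TASKS

def task(idx, start_delay_s):
    time.sleep(start_delay_s)  # simulated start skew
    start = time.monotonic()
    barrier.wait()             # first sync: blocks until the slowest starter
    barrier.wait()             # second, successive sync
    durations_ms[idx] = (time.monotonic() - start) * 1000

threads = [
    threading.Thread(target=task, args=(i, 0.05 * i))
    for i in range(NUM_TASKS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The last task to start spends almost no time blocked, while the earliest
# starter waits roughly the full skew at the first barrier -- so per-task
# measured durations differ by the start skew, just like in the flaky test.
```

This is why a duration assertion with no headroom over the sleep time is sensitive to even a few tens of milliseconds of start skew.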

According to the failure history here (SPARK-49983) and in SPARK-31730, the slowness appears to be less than `200ms` when it happens. So, this PR reduces the flakiness by capping the sleep at 500ms while keeping the `1s` validation. There is no test coverage change because this test case focuses on `successively sync with allGather and barrier`.
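The margin can be checked with simple arithmetic. The `38ms` failure, the `200ms` skew bound, and the `500ms` cap come from the discussion above; the model itself is just illustrative:

```python
ASSERT_LIMIT_MS = 1000   # the test keeps its `1s` validation
WORST_SKEW_MS = 200      # worst-case start slowness seen in the history

def observed_ms(skew_ms, sleep_ms):
    # Duration the assertion sees: start skew plus the partition's sleep.
    return skew_ms + sleep_ms

# Before the fix: a full 1s sleep leaves no margin, so 38ms of skew fails.
assert observed_ms(38, 1000) > ASSERT_LIMIT_MS   # 1038ms, as in the CI log

# After the fix: capping the sleep at 500ms leaves 500ms of headroom,
# comfortably above the 200ms worst-case skew.
assert observed_ms(WORST_SKEW_MS, 500) <= ASSERT_LIMIT_MS   # 700ms
```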

### Does this PR introduce _any_ user-facing change?

No, this is a test-only change.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48487 from dongjoon-hyun/SPARK-49983.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Oct 16, 2024
…sync with allGather and barrier` test case to be robust

(cherry picked from commit bcfe62b)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Oct 16, 2024
…sync with allGather and barrier` test case to be robust

(cherry picked from commit bcfe62b)
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d37a8b9)
Signed-off-by: Dongjoon Hyun <[email protected]>