Commit 3106324
[SPARK-25184][SS] Fixed race condition in StreamExecution that caused flaky test in FlatMapGroupsWithState
## What changes were proposed in this pull request?
The race condition that caused the test failure is between two threads:
- The MicroBatchExecution thread, which processes inputs to produce answers and then generates progress events.
- The test thread, which generates some input data, checks the answer, and then verifies that the query generated a progress event.
The synchronization structure between these threads is as follows:
1. The MicroBatchExecution thread, in every batch, does the following in order:
   a. Processes the batch input to generate the answer.
   b. Signals `awaitProgressLockCondition` to wake up threads waiting for progress using `awaitOffset`.
   c. Generates the progress event.
2. The test execution thread:
   a. Calls `awaitOffset` to wait for progress, which waits on `awaitProgressLockCondition`.
   b. As soon as `awaitProgressLockCondition` is signaled, moves on in the test to check the answer.
   c. Finally, verifies the last generated progress event.
What can happen is the following sequence of events: 2a -> 1a -> 1b -> 2b -> 2c -> 1c.
In other words, the progress event may be generated after the test tries to verify it.
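The interleaving above can be replayed deterministically. The following is only an illustrative sketch in Python, not Spark code: the `replay` function, the `state` dictionary, and the event name `progress-0` are hypothetical stand-ins for the steps labeled 1a to 2c. It shows that step 2c sees no progress event in the racy order, and sees one when 1c is moved ahead of the signal in 1b.

```python
def replay(order):
    """Replay the named steps in `order`; return the progress events step 2c saw."""
    state = {"answer": None, "signaled": False, "progress": [], "seen": None}

    def check_answer():
        # 2b may only run after the condition has been signaled (1b).
        assert state["signaled"], "2b ran before the signal"
        assert state["answer"] is not None, "2b ran before the batch was processed"

    steps = {
        "1a": lambda: state.update(answer="batch-0 answer"),       # process batch
        "1b": lambda: state.update(signaled=True),                 # signal waiters
        "1c": lambda: state["progress"].append("progress-0"),      # emit progress event
        "2a": lambda: None,                                        # test starts waiting
        "2b": check_answer,                                        # test checks answer
        "2c": lambda: state.update(seen=list(state["progress"])),  # test reads progress
    }
    for name in order:
        steps[name]()
    return state["seen"]


racy = replay(["2a", "1a", "1b", "2b", "2c", "1c"])     # order from the description
reordered = replay(["2a", "1a", "1c", "1b", "2b", "2c"])  # event generated before signal
print(racy)       # []
print(reordered)  # ['progress-0']
```

In the racy order the test thread reads an empty list because 1c has not run yet; once the event is generated before the signal, the same check finds it.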
The solution has two steps.
1. Signal the waiting thread after the progress event has been generated, that is, after `finishTrigger()`.
2. Increase the timeout of `awaitProgressLockCondition.await(100 ms)` to a large value.
The latter ensures that the test thread keeps waiting on `awaitProgressLockCondition` until the MicroBatchExecution thread explicitly signals it. With the existing small timeout of 100 ms, the following sequence can occur:
- MicroBatchExecution thread updates committed offsets
- Test thread waiting on `awaitProgressLockCondition` accidentally times out after 100 ms, finds that the committed offsets have been updated, therefore returns from `awaitOffset` and moves on to the progress event tests.
- MicroBatchExecution thread then generates progress event and signals. But the test thread has already attempted to verify the event and failed.
By increasing the timeout to a large value (e.g., `streamingTimeoutMs = 60 seconds`, similar to `awaitInitialization`), this type of race condition is also avoided.
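The effect of the timeout can be sketched with two real threads. This is a minimal Python analogue under stated assumptions, not the Spark implementation: `run_once`, `batch_thread`, `gap_s`, and the `committed`/`progress` containers are hypothetical names, and `threading.Condition` stands in for `awaitProgressLockCondition`. The batch thread updates the committed offset, pauses, generates the progress event, and only then signals, while the waiter re-checks the committed offset every time its wait returns.

```python
import threading
import time


def run_once(wait_timeout_s, gap_s=0.5):
    """Return the progress events visible to the waiter when its wait loop exits."""
    cond = threading.Condition()
    waiting = threading.Event()
    committed = {"offset": 0}
    progress = []

    def batch_thread():
        waiting.wait()                 # start once the waiter holds the lock
        with cond:                     # acquired when the waiter is inside cond.wait()
            committed["offset"] = 1    # committed offsets are updated first
        time.sleep(gap_s)              # window between commit and progress event
        progress.append("progress-0")  # progress event is generated
        with cond:
            cond.notify_all()          # signal only after the event exists

    worker = threading.Thread(target=batch_thread)
    worker.start()
    with cond:
        waiting.set()
        # Analogue of the awaitOffset loop: wait until the offset is committed.
        while committed["offset"] < 1:
            cond.wait(timeout=wait_timeout_s)
        seen = list(progress)
    worker.join()
    return seen


short_timeout = run_once(wait_timeout_s=0.01)  # wakes by timeout inside the window
long_timeout = run_once(wait_timeout_s=60.0)   # wakes only on the explicit signal
```

On an unloaded machine, `short_timeout` is expected to be an empty list, because the waiter times out, finds the offset already committed, and returns before the event exists. `long_timeout` contains the event, because the waiter can only wake on the signal sent after the event was generated.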
## How was this patch tested?
Ran locally many times.
Closes #22182 from tdas/SPARK-25184.
Authored-by: Tathagata Das <[email protected]>
Signed-off-by: Tathagata Das <[email protected]>
1 parent 68ec4d6 commit 3106324
File tree
5 files changed (+33, -25)
- external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010
- sql/core/src
  - main/scala/org/apache/spark/sql/execution/streaming
  - test/scala/org/apache/spark/sql/streaming
Lines changed: 2 additions & 1 deletion
Lines changed: 4 additions & 1 deletion
Lines changed: 2 additions & 2 deletions
Lines changed: 24 additions & 20 deletions
Lines changed: 1 addition & 1 deletion