[SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite #15519

tdas · 2016-10-17T21:01:58Z

This work has largely been done by @lw-lin in his PR #15497. This is a slight refactoring of it.

What changes were proposed in this pull request?

There were two sources of flakiness in the StreamingQueryListener tests.

When testing with a manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing any work between the two attempts. Hence the following can happen with the current ManualClock:

+-----------------------------------+--------------------------------+
|      StreamExecution thread       |         testing thread         |
+-----------------------------------+--------------------------------+
|  ManualClock.waitTillTime(100) {  |                                |
|        _isWaiting = true          |                                |
|            wait(10)               |                                |
|        still in wait(10)          |  if (_isWaiting) advance(100)  |
|        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
|        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
|      wake up from wait(10)        |                                |
|       current time is 600         |                                |
|       _isWaiting = false          |                                |
|  }                                |                                |
+-----------------------------------+--------------------------------+
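
The interleaving in the table can be reproduced deterministically with a small model. The sketch below is not Spark's ManualClock: `FlagRaceSketch`, `FlagClock`, and the `stall` latch (which stands in for the waiter being stuck inside `wait(10)`) are hypothetical names introduced only for illustration. It suggests why a boolean "is waiting" guard cannot tell one wait apart from the next, so all three advances pass the check.

```scala
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicLong

// Illustrative model only; none of these names come from Spark.
object FlagRaceSketch {

  // A clock that exposes nothing but a boolean "someone is waiting" flag.
  class FlagClock {
    private var now = 0L
    @volatile private var waiting = false

    def isWaiting: Boolean = waiting
    def getTimeMillis(): Long = synchronized { now }
    def advance(delta: Long): Unit = synchronized { now += delta }

    // Simplified wait: `target` is ignored and the latch simulates the
    // waiter not yet having woken up from wait(10).
    def waitTillTime(target: Long, stall: CountDownLatch): Long = {
      waiting = true
      stall.await()
      val observed = getTimeMillis()
      waiting = false
      observed
    }
  }

  // Returns the time the waiter sees once it finally wakes up.
  def demo(): Long = {
    val clock = new FlagClock
    val stall = new CountDownLatch(1)
    val observed = new AtomicLong(-1L)

    val waiter = new Thread(new Runnable {
      def run(): Unit = observed.set(clock.waitTillTime(100L, stall))
    })
    waiter.start()

    // Testing thread: wait until the flag is set, then advance three times.
    while (!clock.isWaiting) Thread.sleep(1)
    for (delta <- Seq(100L, 200L, 300L)) {
      // The guard passes every time because the flag never changed.
      if (clock.isWaiting) clock.advance(delta)
    }

    stall.countDown() // let the waiter "wake up"
    waiter.join()
    observed.get()
  }
}
```

Calling `FlagRaceSketch.demo()` returns 600: the waiter asked for time 100 but wakes up after all three advances have already been applied, which matches the last rows of the table.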

The second source of flakiness is that data added to the memory stream may get processed in any trigger, not just the first trigger.

My fix is to make the manual clock wait for the stream execution thread to start waiting for the clock at the right wait start time. That is, advance(200) (see above) will wait for the stream execution thread to complete the wait that started at time 0 and to start a new wait at time 200 (i.e. the timestamp after the previous advance(100)).
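
A minimal standalone sketch of this idea follows. It does not depend on Spark and is not the code in this patch: `ClockSketch`, `SketchManualClock`, `isWaitingAt`, and `advanceWhenWaitingAt` are hypothetical names, and the sketch uses increments of 100 throughout, so the waits start at 0, 100, and 200. The point it tries to show is that recording the time at which a wait started lets the testing thread advance only when the other thread has begun the wait it expects.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch only; none of these names come from Spark.
object ClockSketch {

  class SketchManualClock(private var now: Long = 0L) {
    // Time at which the current wait began, or None if nobody is waiting.
    private var waitStartTime: Option[Long] = None

    def getTimeMillis(): Long = synchronized { now }

    def advance(delta: Long): Unit = synchronized {
      now += delta
      notifyAll()
    }

    def waitTillTime(target: Long): Long = synchronized {
      waitStartTime = Some(now)
      try {
        // wait(10) releases the monitor, so advance() can run meanwhile.
        while (now < target) wait(10)
        now
      } finally {
        waitStartTime = None
      }
    }

    def isWaitingAt(time: Long): Boolean = synchronized {
      waitStartTime == Some(time)
    }
  }

  // Advance only once the other thread is in a wait that began at
  // `expectedStart`. Returns false if that does not happen in time.
  def advanceWhenWaitingAt(
      clock: SketchManualClock,
      expectedStart: Long,
      delta: Long,
      timeoutMs: Long = 5000L): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (!clock.isWaitingAt(expectedStart) &&
        System.currentTimeMillis() < deadline) {
      Thread.sleep(1)
    }
    val ready = clock.isWaitingAt(expectedStart)
    if (ready) clock.advance(delta)
    ready
  }

  // Returns the times seen by a thread that waits for 100, 200, then 300.
  def demo(): List[Long] = {
    val clock = new SketchManualClock()
    val observed = ArrayBuffer.empty[Long] // written only by the waiter
    val waiter = new Thread(new Runnable {
      def run(): Unit =
        Seq(100L, 200L, 300L).foreach(t => observed += clock.waitTillTime(t))
    })
    waiter.setDaemon(true)
    waiter.start()

    // Each advance is gated on the waiter having started the matching wait.
    Seq(0L, 100L, 200L).forall(s => advanceWhenWaitingAt(clock, s, 100L))
    waiter.join(5000L)
    observed.toList
  }
}
```

In this sketch the second advance cannot run while the waiter is still inside the wait that began at time 0, because `isWaitingAt(100)` stays false until the waiter has returned and started a new wait. `ClockSketch.demo()` therefore returns `List(100, 200, 300)` rather than a single jump to a later time.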

In addition, since this feature is used solely by StreamExecution, I removed all the non-generic code from ManualClock and put it in StreamManualClock inside StreamTest.

How was this patch tested?

Ran the existing unit tests many times in Jenkins.

This reverts commit 5bc47b6.

tdas · 2016-10-17T21:40:13Z

@lw-lin please take a look.

SparkQA · 2016-10-17T22:18:15Z

Test build #3359 has started for PR 15519 at commit 6fdbae3.

SparkQA · 2016-10-17T23:04:02Z

Test build #3358 has finished for PR 15519 at commit 6fdbae3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-18T01:12:34Z

Test build #67087 has finished for PR 15519 at commit 6fdbae3.

This patch fails from timeout after a configured wait of 250m.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2016-10-18T01:35:23Z

I have tested this enough in Jenkins. There was a single failure in a different flaky test, not in StreamingQueryListenerSuite.

lw-lin · 2016-10-18T02:12:20Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala

      /* Stop then restart the Stream  */
      StopStream,
-      StartStream(ProcessingTime("10 seconds"), new ManualClock),
+      StartStream(ProcessingTime("10 seconds"), new ManualClock(60 * 1000)),


This should also be StreamManualClock, shouldn't it? It is trivial, though.

Huh! I wonder how the test passed without this change.

Oh, I never ran StreamSuite in Jenkins until now. I was running StreamingQuery* repeatedly.

lw-lin · 2016-10-18T02:13:49Z

This looks good to me, thanks!

tdas · 2016-10-18T02:38:25Z

@lw-lin Fixed the bug.

SparkQA · 2016-10-18T03:22:55Z

Test build #67101 has finished for PR 15519 at commit 4ce3093.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-18T04:53:51Z

Test build #67104 has finished for PR 15519 at commit 3229095.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class StreamManualClock(time: Long = 0L) extends ManualClock(time)

tdas · 2016-10-18T07:45:27Z

Merging this to master and branch-2.0.

This work has largely been done by lw-lin in his PR #15497. This is a slight refactoring of it.

## What changes were proposed in this pull request?

There were two sources of flakiness in StreamingQueryListener test.

- When testing with manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock.
```
+-----------------------------------+--------------------------------+
|      StreamExecution thread       |         testing thread         |
+-----------------------------------+--------------------------------+
|  ManualClock.waitTillTime(100) {  |                                |
|        _isWaiting = true          |                                |
|            wait(10)               |                                |
|        still in wait(10)          |  if (_isWaiting) advance(100)  |
|        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
|        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
|      wake up from wait(10)        |                                |
|       current time is 600         |                                |
|       _isWaiting = false          |                                |
|  }                                |                                |
+-----------------------------------+--------------------------------+
```
- Second source of flakiness is that the adding data to memory stream may get processing in any trigger, not just the first trigger.

My fix is to make the manual clock wait for the other stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for stream execution thread to complete the wait that started at time 0, and start a new wait at time 200 (i.e. time stamp after the previous `advance(100)`).

In addition, since this is a feature that is solely used by StreamExecution, I removed all the non-generic code from ManualClock and put them in StreamManualClock inside StreamTest.

## How was this patch tested?

Ran existing unit test MANY TIME in Jenkins

Author: Tathagata Das <[email protected]>
Author: Liwei Lin <[email protected]>

Closes #15519 from tdas/metrics-flaky-test-fix.

(cherry picked from commit 7d878cf)
Signed-off-by: Tathagata Das <[email protected]>

lw-lin and others added 4 commits October 15, 2016 10:36

Fix flaky test

5bc47b6

Revert "Fix flaky test"

eb59a98

This reverts commit 5bc47b6.

Fix flaky test again

7ae7782

Refactored Manual clock

6fdbae3

tdas changed the title ~~[SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite~~ [WIP][SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite Oct 17, 2016

tdas mentioned this pull request Oct 17, 2016

[Test][SPARK-16002][Follow-up] Fix flaky test in StreamingQueryListenerSuite #15497

Closed

tdas mentioned this pull request Oct 17, 2016

[SPARK-17731][SQL][STREAMING] Metrics for structured streaming for branch-2.0 #15472

Closed

Reverted run tests

4ce3093

lw-lin reviewed Oct 18, 2016

Fixed StreamSuite

3229095

tdas changed the title ~~[WIP][SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite~~ [SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite Oct 18, 2016

asfgit closed this in 7d878cf Oct 18, 2016
