Skip to content

Conversation

@tdas
Copy link
Contributor

@tdas tdas commented Oct 17, 2016

This work has largely been done by @lw-lin in his PR #15497. This is a slight refactoring of it.

What changes were proposed in this pull request?

There were two sources of flakiness in StreamingQueryListener test.

  • When testing with manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock.
+-----------------------------------+--------------------------------+
|      StreamExecution thread       |         testing thread         |
+-----------------------------------+--------------------------------+
|  ManualClock.waitTillTime(100) {  |                                |
|        _isWaiting = true          |                                |
|            wait(10)               |                                |
|        still in wait(10)          |  if (_isWaiting) advance(100)  |
|        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
|        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
|      wake up from wait(10)        |                                |
|       current time is 600         |                                |
|       _isWaiting = false          |                                |
|  }                                |                                |
+-----------------------------------+--------------------------------+
  • Second source of flakiness is that the adding data to memory stream may get processing in any trigger, not just the first trigger.

My fix is to make the manual clock wait for the other stream execution thread to start waiting for the clock at the right wait start time. That is, advance(200) (see above) will wait for stream execution thread to complete the wait that started at time 0, and start a new wait at time 200 (i.e. time stamp after the previous advance(100)).

In addition, since this is a feature that is solely used by StreamExecution, I removed all the non-generic code from ManualClock and put them in StreamManualClock inside StreamTest.

How was this patch tested?

Ran existing unit test MANY TIME in Jenkins

@tdas tdas changed the title [SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite [WIP][SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite Oct 17, 2016
@tdas
Copy link
Contributor Author

tdas commented Oct 17, 2016

@lw-lin please take a look.

@SparkQA
Copy link

SparkQA commented Oct 17, 2016

Test build #3359 has started for PR 15519 at commit 6fdbae3.

@SparkQA
Copy link

SparkQA commented Oct 17, 2016

Test build #3358 has finished for PR 15519 at commit 6fdbae3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 18, 2016

Test build #67087 has finished for PR 15519 at commit 6fdbae3.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Copy link
Contributor Author

tdas commented Oct 18, 2016

I have tested this enough in Jenkins. There was a single failure in a different flaky test, not in StreamingQueryListenerSuite.

/* Stop then restart the Stream */
StopStream,
StartStream(ProcessingTime("10 seconds"), new ManualClock),
StartStream(ProcessingTime("10 seconds"), new ManualClock(60 * 1000)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should also be StreamManualClock? but this is trivial

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh! I wonder how the test passed without this change.

Copy link
Contributor Author

@tdas tdas Oct 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I never ran the StreamSuite in jenkins till now. I was running StreamingQuery* repeatedly.

@lw-lin
Copy link
Contributor

lw-lin commented Oct 18, 2016

This looks good to me, thanks!

@tdas
Copy link
Contributor Author

tdas commented Oct 18, 2016

@lw-lin Fixed the bug.

@SparkQA
Copy link

SparkQA commented Oct 18, 2016

Test build #67101 has finished for PR 15519 at commit 4ce3093.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 18, 2016

Test build #67104 has finished for PR 15519 at commit 3229095.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamManualClock(time: Long = 0L) extends ManualClock(time)

@tdas
Copy link
Contributor Author

tdas commented Oct 18, 2016

Merging this to master and branch 2.0

@tdas tdas changed the title [WIP][SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite [SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite Oct 18, 2016
asfgit pushed a commit that referenced this pull request Oct 18, 2016
This work has largely been done by lw-lin in his PR #15497. This is a slight refactoring of it.

## What changes were proposed in this pull request?
There were two sources of flakiness in StreamingQueryListener test.

- When testing with manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock.
```
+-----------------------------------+--------------------------------+
|      StreamExecution thread       |         testing thread         |
+-----------------------------------+--------------------------------+
|  ManualClock.waitTillTime(100) {  |                                |
|        _isWaiting = true          |                                |
|            wait(10)               |                                |
|        still in wait(10)          |  if (_isWaiting) advance(100)  |
|        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
|        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
|      wake up from wait(10)        |                                |
|       current time is 600         |                                |
|       _isWaiting = false          |                                |
|  }                                |                                |
+-----------------------------------+--------------------------------+
```

- Second source of flakiness is that the adding data to memory stream may get processing in any trigger, not just the first trigger.

My fix is to make the manual clock wait for the other stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for stream execution thread to complete the wait that started at time 0, and start a new wait at time 200 (i.e. time stamp after the previous `advance(100)`).

In addition, since this is a feature that is solely used by StreamExecution, I removed all the non-generic code from ManualClock and put them in StreamManualClock inside StreamTest.

## How was this patch tested?
Ran existing unit test MANY TIME in Jenkins

Author: Tathagata Das <[email protected]>
Author: Liwei Lin <[email protected]>

Closes #15519 from tdas/metrics-flaky-test-fix.

(cherry picked from commit 7d878cf)
Signed-off-by: Tathagata Das <[email protected]>
@asfgit asfgit closed this in 7d878cf Oct 18, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
This work has largely been done by lw-lin in his PR apache#15497. This is a slight refactoring of it.

## What changes were proposed in this pull request?
There were two sources of flakiness in StreamingQueryListener test.

- When testing with manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock.
```
+-----------------------------------+--------------------------------+
|      StreamExecution thread       |         testing thread         |
+-----------------------------------+--------------------------------+
|  ManualClock.waitTillTime(100) {  |                                |
|        _isWaiting = true          |                                |
|            wait(10)               |                                |
|        still in wait(10)          |  if (_isWaiting) advance(100)  |
|        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
|        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
|      wake up from wait(10)        |                                |
|       current time is 600         |                                |
|       _isWaiting = false          |                                |
|  }                                |                                |
+-----------------------------------+--------------------------------+
```

- Second source of flakiness is that the adding data to memory stream may get processing in any trigger, not just the first trigger.

My fix is to make the manual clock wait for the other stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for stream execution thread to complete the wait that started at time 0, and start a new wait at time 200 (i.e. time stamp after the previous `advance(100)`).

In addition, since this is a feature that is solely used by StreamExecution, I removed all the non-generic code from ManualClock and put them in StreamManualClock inside StreamTest.

## How was this patch tested?
Ran existing unit test MANY TIME in Jenkins

Author: Tathagata Das <[email protected]>
Author: Liwei Lin <[email protected]>

Closes apache#15519 from tdas/metrics-flaky-test-fix.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
This work has largely been done by lw-lin in his PR apache#15497. This is a slight refactoring of it.

## What changes were proposed in this pull request?
There were two sources of flakiness in StreamingQueryListener test.

- When testing with manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock.
```
+-----------------------------------+--------------------------------+
|      StreamExecution thread       |         testing thread         |
+-----------------------------------+--------------------------------+
|  ManualClock.waitTillTime(100) {  |                                |
|        _isWaiting = true          |                                |
|            wait(10)               |                                |
|        still in wait(10)          |  if (_isWaiting) advance(100)  |
|        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
|        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
|      wake up from wait(10)        |                                |
|       current time is 600         |                                |
|       _isWaiting = false          |                                |
|  }                                |                                |
+-----------------------------------+--------------------------------+
```

- Second source of flakiness is that the adding data to memory stream may get processing in any trigger, not just the first trigger.

My fix is to make the manual clock wait for the other stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for stream execution thread to complete the wait that started at time 0, and start a new wait at time 200 (i.e. time stamp after the previous `advance(100)`).

In addition, since this is a feature that is solely used by StreamExecution, I removed all the non-generic code from ManualClock and put them in StreamManualClock inside StreamTest.

## How was this patch tested?
Ran existing unit test MANY TIME in Jenkins

Author: Tathagata Das <[email protected]>
Author: Liwei Lin <[email protected]>

Closes apache#15519 from tdas/metrics-flaky-test-fix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants