Skip to content

Conversation

@HeartSaVioR
Copy link
Contributor

What changes were proposed in this pull request?

This patch fixes the flaky test "query without test harness" on ContinuousSuite, via adding some more gaps on waiting query to commit the epoch which writes output rows.

The observation of this issue is below (injected some debug logs to get them):

reader creation time                                   1562225320210
epoch 1 launched                                       1562225320593 (+380ms from reader creation time)
epoch 13 launched                                      1562225321702 (+1.5s from reader creation time)
partition reader creation time                         1562225321715 (+1.5s from reader creation time)

next read time for first next call                     1562225321210 (+1s from reader creation time)
first next called in partition reader                  1562225321746 (immediately after creation of partition reader)
wait finished in next called in partition reader       1562225321746 (no wait)

second next called in partition reader                 1562225321747 (immediately after first next())

epoch 0 commit started                                 1562225321861

writing rows (0, 1) (belong to epoch 13)               1562225321866 (+100ms after first next())

wait start in waitForRateSourceTriggers(2)             1562225322059

next read time for second next call                    1562225322210 (+1s from previous "next read time")
wait finished in next called in partition reader       1562225322211 (+450ms wait)

writing rows (2, 3) (belong to epoch 13)               1562225322211 (immediately after next())

epoch 14 launched                                      1562225322246

desired wait time in waitForRateSourceTriggers(2)      1562225322510 (+2.3s from reader creation time)

epoch 12 committed                                     1562225323034

These rows were written within desired wait time, but the epoch 13 couldn't be committed within it. Interestingly, epoch 12 was lucky to be committed within a gap between finished waiting in waitForRateSourceTriggers and query.stop() - but even suppose the rows were written in epoch 12, it would be just in luck and epoch should be committed within desired wait time.

This patch modifies Rate continuous stream to track the highest committed value, so that test can wait until desired value is reported to the stream as committed.

This patch also modifies Rate continuous stream to track the timestamp at stream gets the first committed offset, and let waitForRateSourceTriggers use the timestamp. This also relies on waiting for specific period, but safer approach compared to current based on the observation above. Based on the change, this patch saves couple of seconds in test time.

How was this patch tested?

3 sequential test runs succeeded locally.

…inuousSuite

This patch fixes the flaky test "query without test harness" on ContinuousSuite, via adding some more gaps on waiting query to commit the epoch which writes output rows.

The observation of this issue is below (injected some debug logs to get them):

```
reader creation time                                   1562225320210
epoch 1 launched                                       1562225320593 (+380ms from reader creation time)
epoch 13 launched                                      1562225321702 (+1.5s from reader creation time)
partition reader creation time                         1562225321715 (+1.5s from reader creation time)

next read time for first next call                     1562225321210 (+1s from reader creation time)
first next called in partition reader                  1562225321746 (immediately after creation of partition reader)
wait finished in next called in partition reader       1562225321746 (no wait)

second next called in partition reader                 1562225321747 (immediately after first next())

epoch 0 commit started                                 1562225321861

writing rows (0, 1) (belong to epoch 13)               1562225321866 (+100ms after first next())

wait start in waitForRateSourceTriggers(2)             1562225322059

next read time for second next call                    1562225322210 (+1s from previous "next read time")
wait finished in next called in partition reader       1562225322211 (+450ms wait)

writing rows (2, 3) (belong to epoch 13)               1562225322211 (immediately after next())

epoch 14 launched                                      1562225322246

desired wait time in waitForRateSourceTriggers(2)      1562225322510 (+2.3s from reader creation time)

epoch 12 committed                                     1562225323034
```

These rows were written within desired wait time, but the epoch 13 couldn't be committed within it. Interestingly, epoch 12 was lucky to be committed within a gap between finished waiting in waitForRateSourceTriggers and query.stop() - but even suppose the rows were written in epoch 12, it would be just in luck and epoch should be committed within desired wait time.

This patch modifies Rate continuous stream to track the highest committed value, so that test can wait until desired value is reported to the stream as committed.

This patch also modifies Rate continuous stream to track the timestamp at stream gets the first committed offset, and let `waitForRateSourceTriggers` use the timestamp. This also relies on waiting for specific period, but safer approach compared to current based on the observation above. Based on the change, this patch saves couple of seconds in test time.

10 sequential test runs succeeded locally.

Closes apache#25048 from HeartSaVioR/SPARK-28247.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
@HeartSaVioR
Copy link
Contributor Author

cc. @srowen This is the backport patch of #25048 - given #25048 provided more stable approach to verify the test as well as reducing wait times in tests, I guess this is worth to port back.

StartStream(longContinuousTrigger),
AwaitEpoch(0),
Execute(waitForRateSourceTriggers(_, 201)),
Execute { exec =>
Copy link
Contributor Author

@HeartSaVioR HeartSaVioR Jul 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the approach of calculating wait time is changed, this can reduce around 150 seconds without breaking test.

AwaitEpoch(0),
Execute(waitForRateSourceTriggers(_, 201)),
Execute { exec =>
waitForRateSourceTriggers(exec.asInstanceOf[ContinuousExecution], 50)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here: reduced around 150 seconds

@SparkQA
Copy link

SparkQA commented Jul 15, 2019

Test build #107653 has finished for PR 25154 at commit c3b870e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Copy link
Contributor

We've reviewed the change which has been merged to the master. What's the change here (resolved conflict)?

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep this is just a rebase.

@HeartSaVioR
Copy link
Contributor Author

Yeah #25048 wasn't clearly ported back to branch-2.4 so this patch addressed it.

srowen pushed a commit that referenced this pull request Jul 16, 2019
…ess" on ContinuousSuite

## What changes were proposed in this pull request?

This patch fixes the flaky test "query without test harness" on ContinuousSuite, via adding some more gaps on waiting query to commit the epoch which writes output rows.

The observation of this issue is below (injected some debug logs to get them):

```
reader creation time                                   1562225320210
epoch 1 launched                                       1562225320593 (+380ms from reader creation time)
epoch 13 launched                                      1562225321702 (+1.5s from reader creation time)
partition reader creation time                         1562225321715 (+1.5s from reader creation time)

next read time for first next call                     1562225321210 (+1s from reader creation time)
first next called in partition reader                  1562225321746 (immediately after creation of partition reader)
wait finished in next called in partition reader       1562225321746 (no wait)

second next called in partition reader                 1562225321747 (immediately after first next())

epoch 0 commit started                                 1562225321861

writing rows (0, 1) (belong to epoch 13)               1562225321866 (+100ms after first next())

wait start in waitForRateSourceTriggers(2)             1562225322059

next read time for second next call                    1562225322210 (+1s from previous "next read time")
wait finished in next called in partition reader       1562225322211 (+450ms wait)

writing rows (2, 3) (belong to epoch 13)               1562225322211 (immediately after next())

epoch 14 launched                                      1562225322246

desired wait time in waitForRateSourceTriggers(2)      1562225322510 (+2.3s from reader creation time)

epoch 12 committed                                     1562225323034
```

These rows were written within desired wait time, but the epoch 13 couldn't be committed within it. Interestingly, epoch 12 was lucky to be committed within a gap between finished waiting in waitForRateSourceTriggers and query.stop() - but even suppose the rows were written in epoch 12, it would be just in luck and epoch should be committed within desired wait time.

This patch modifies Rate continuous stream to track the highest committed value, so that test can wait until desired value is reported to the stream as committed.

This patch also modifies Rate continuous stream to track the timestamp at stream gets the first committed offset, and let `waitForRateSourceTriggers` use the timestamp. This also relies on waiting for specific period, but safer approach compared to current based on the observation above. Based on the change, this patch saves couple of seconds in test time.

## How was this patch tested?

3 sequential test runs succeeded locally.

Closes #25154 from HeartSaVioR/SPARK-28247-branch-2.4.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
@srowen
Copy link
Member

srowen commented Jul 16, 2019

Merged to 2.4

@srowen srowen closed this Jul 16, 2019
@HeartSaVioR HeartSaVioR deleted the SPARK-28247-branch-2.4 branch July 26, 2019 03:12
rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
…ess" on ContinuousSuite

## What changes were proposed in this pull request?

This patch fixes the flaky test "query without test harness" on ContinuousSuite, via adding some more gaps on waiting query to commit the epoch which writes output rows.

The observation of this issue is below (injected some debug logs to get them):

```
reader creation time                                   1562225320210
epoch 1 launched                                       1562225320593 (+380ms from reader creation time)
epoch 13 launched                                      1562225321702 (+1.5s from reader creation time)
partition reader creation time                         1562225321715 (+1.5s from reader creation time)

next read time for first next call                     1562225321210 (+1s from reader creation time)
first next called in partition reader                  1562225321746 (immediately after creation of partition reader)
wait finished in next called in partition reader       1562225321746 (no wait)

second next called in partition reader                 1562225321747 (immediately after first next())

epoch 0 commit started                                 1562225321861

writing rows (0, 1) (belong to epoch 13)               1562225321866 (+100ms after first next())

wait start in waitForRateSourceTriggers(2)             1562225322059

next read time for second next call                    1562225322210 (+1s from previous "next read time")
wait finished in next called in partition reader       1562225322211 (+450ms wait)

writing rows (2, 3) (belong to epoch 13)               1562225322211 (immediately after next())

epoch 14 launched                                      1562225322246

desired wait time in waitForRateSourceTriggers(2)      1562225322510 (+2.3s from reader creation time)

epoch 12 committed                                     1562225323034
```

These rows were written within desired wait time, but the epoch 13 couldn't be committed within it. Interestingly, epoch 12 was lucky to be committed within a gap between finished waiting in waitForRateSourceTriggers and query.stop() - but even suppose the rows were written in epoch 12, it would be just in luck and epoch should be committed within desired wait time.

This patch modifies Rate continuous stream to track the highest committed value, so that test can wait until desired value is reported to the stream as committed.

This patch also modifies Rate continuous stream to track the timestamp at stream gets the first committed offset, and let `waitForRateSourceTriggers` use the timestamp. This also relies on waiting for specific period, but safer approach compared to current based on the observation above. Based on the change, this patch saves couple of seconds in test time.

## How was this patch tested?

3 sequential test runs succeeded locally.

Closes apache#25154 from HeartSaVioR/SPARK-28247-branch-2.4.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019
…ess" on ContinuousSuite

## What changes were proposed in this pull request?

This patch fixes the flaky test "query without test harness" on ContinuousSuite, via adding some more gaps on waiting query to commit the epoch which writes output rows.

The observation of this issue is below (injected some debug logs to get them):

```
reader creation time                                   1562225320210
epoch 1 launched                                       1562225320593 (+380ms from reader creation time)
epoch 13 launched                                      1562225321702 (+1.5s from reader creation time)
partition reader creation time                         1562225321715 (+1.5s from reader creation time)

next read time for first next call                     1562225321210 (+1s from reader creation time)
first next called in partition reader                  1562225321746 (immediately after creation of partition reader)
wait finished in next called in partition reader       1562225321746 (no wait)

second next called in partition reader                 1562225321747 (immediately after first next())

epoch 0 commit started                                 1562225321861

writing rows (0, 1) (belong to epoch 13)               1562225321866 (+100ms after first next())

wait start in waitForRateSourceTriggers(2)             1562225322059

next read time for second next call                    1562225322210 (+1s from previous "next read time")
wait finished in next called in partition reader       1562225322211 (+450ms wait)

writing rows (2, 3) (belong to epoch 13)               1562225322211 (immediately after next())

epoch 14 launched                                      1562225322246

desired wait time in waitForRateSourceTriggers(2)      1562225322510 (+2.3s from reader creation time)

epoch 12 committed                                     1562225323034
```

These rows were written within desired wait time, but the epoch 13 couldn't be committed within it. Interestingly, epoch 12 was lucky to be committed within a gap between finished waiting in waitForRateSourceTriggers and query.stop() - but even suppose the rows were written in epoch 12, it would be just in luck and epoch should be committed within desired wait time.

This patch modifies Rate continuous stream to track the highest committed value, so that test can wait until desired value is reported to the stream as committed.

This patch also modifies Rate continuous stream to track the timestamp at stream gets the first committed offset, and let `waitForRateSourceTriggers` use the timestamp. This also relies on waiting for specific period, but safer approach compared to current based on the observation above. Based on the change, this patch saves couple of seconds in test time.

## How was this patch tested?

3 sequential test runs succeeded locally.

Closes apache#25154 from HeartSaVioR/SPARK-28247-branch-2.4.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants