-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-14942][SQL][Streaming] Reduce delay between batch construction and execution #12725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #57084 has finished for PR 12725 at commit
|
|
This makes sense. Thanks for writing a very clear description! Perhaps a better title would be "Reduce delay between batch construction and execution" Is there any way we can test this? What if we injected a manual timed executor? |
|
Sure, I'll add a manual timed executor and some dedicated tests as well. |
|
To make things easier to review, I've added the manual timed executor for testing general cases in a separate PR. |
| populateStartOffsets() | ||
| logDebug(s"Stream running from $committedOffsets to $availableOffsets") | ||
| } | ||
| else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: merge this line and the previous one.
|
Looks pretty good. |
…ainst the `ProcessingTime(intervalMS > 0)` trigger and `ManualClock` ## What changes were proposed in this pull request? Currently in `StreamTest`, we have a `StartStream` which will start a streaming query against trigger `ProcessTime(intervalMS = 0)` and `SystemClock`. We also need to test cases against `ProcessTime(intervalMS > 0)`, which often requires `ManualClock`. This patch: - fixes an issue of `ProcessingTimeExecutor`, where for a batch it should run `batchRunner` only once but might run multiple times under certain conditions; - adds support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `AdvanceManualClock`, by specifying them as fields for `StartStream`, and by adding an `AdvanceClock` action; - adds a test, which takes advantage of the new `StartStream` and `AdvanceManualClock`, to test against [PR#[SPARK-14942] Reduce delay between batch construction and execution ](#12725). ## How was this patch tested? N/A Author: Liwei Lin <[email protected]> Closes #12797 from lw-lin/add-trigger-test-support.
…ainst the `ProcessingTime(intervalMS > 0)` trigger and `ManualClock` ## What changes were proposed in this pull request? Currently in `StreamTest`, we have a `StartStream` which will start a streaming query against trigger `ProcessTime(intervalMS = 0)` and `SystemClock`. We also need to test cases against `ProcessTime(intervalMS > 0)`, which often requires `ManualClock`. This patch: - fixes an issue of `ProcessingTimeExecutor`, where for a batch it should run `batchRunner` only once but might run multiple times under certain conditions; - adds support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `AdvanceManualClock`, by specifying them as fields for `StartStream`, and by adding an `AdvanceClock` action; - adds a test, which takes advantage of the new `StartStream` and `AdvanceManualClock`, to test against [PR#[SPARK-14942] Reduce delay between batch construction and execution ](#12725). ## How was this patch tested? N/A Author: Liwei Lin <[email protected]> Closes #12797 from lw-lin/add-trigger-test-support. (cherry picked from commit e597ec6) Signed-off-by: Shixiong Zhu <[email protected]>
| outputMode: OutputMode, | ||
| checkpointLocation: String, | ||
| currentBatchId: Long) | ||
| val currentBatchId: Long) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's expose this to test suits
|
Test build #58051 has finished for PR 12725 at commit
|
|
Test build #58054 has finished for PR 12725 at commit
|
|
Now that the manual timed executor patch has been merged, and I've addressed comments and expanded tests for this patch -- @zsxwing would you mind taking another look? Thanks! |
|
@zsxwing would you take another look? Thanks! |
| StopStream, | ||
| StartStream(ProcessingTime("10 seconds"), new ManualClock), | ||
|
|
||
| /* -- batch 1 rerun ----------------- */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we can avoid to rerun a batch that has already finished before stopping. How about storing the offsets after finishing a batch instead of storing it before running a batch? @marmbrus what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failure is the rare case, so I don't think its that bad to rerun if it reduces the complexity of the implementation.
|
Test build #2986 has finished for PR 12725 at commit
|
|
LGTM. Merging to master / 2.0. Thanks, @lw-lin |
… and execution ## Problem Currently in `StreamExecution`, [we first run the batch, then construct the next](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L165): ```scala if (dataAvailable) runBatch() constructNextBatch() ``` This is good when we run batches ASAP, where data would get processed in the **very next batch**:  However, when we run batches at trigger like `ProcessTime("1 minute")`, data - such as _y_ below - may not get processed in the very next batch i.e. _batch 1_, but in _batch 2_:  ## What changes were proposed in this pull request? This patch reverses the order of `constructNextBatch()` and `runBatch()`. After this patch, data would get processed in the **very next batch**, i.e. _batch 1_:  In addition, this patch alters when we do `currentBatchId += 1`: let's do that when the processing of the current batch's data is completed, so we won't bother passing `currentBatchId + 1` or `currentBatchId - 1` to states or sinks. ## How was this patch tested? New added test case. Also this should be covered by existing test suits, e.g. stress tests and others. Author: Liwei Lin <[email protected]> Closes #12725 from lw-lin/construct-before-run-3. (cherry picked from commit 95f4fba) Signed-off-by: Shixiong Zhu <[email protected]>
Problem
Currently in
StreamExecution, we first run the batch, then construct the next:if (dataAvailable) runBatch() constructNextBatch()This is good when we run batches ASAP, where data would get processed in the very next batch:
However, when we run batches at trigger like
ProcessTime("1 minute"), data - such as y below - may not get processed in the very next batch i.e. batch 1, but in batch 2:What changes were proposed in this pull request?
This patch reverses the order of
constructNextBatch()andrunBatch(). After this patch, data would get processed in the very next batch, i.e. batch 1:In addition, this patch alters when we do
currentBatchId += 1: let's do that when the processing of the current batch's data is completed, so we won't bother passingcurrentBatchId + 1orcurrentBatchId - 1to states or sinks.How was this patch tested?
New added test case. Also this should be covered by existing test suits, e.g. stress tests and others.