-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31593][SS] Remove unnecessary streaming query progress update #28391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| if (hasExecuted) { | ||
| // Reset noDataEventTimestamp if we processed any data | ||
| lastNoExecutionProgressEventTime = Long.MinValue | ||
| lastNoExecutionProgressEventTime = triggerClock.getTimeMillis() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When reset "lastNoExecutionProgressEventTime =Long.MinValue", it will make "now - noDataProgressEventInterval >= lastNoExecutionProgressEventTime" always true when there is no new data. Then progress reporter will report an empty progress.
|
Test build #121989 has finished for PR 28391 at commit
|
|
This was actually proposed in #28040 but indicated UT failure hence abandoned. Happy to see it addressed, but looks like you'll also need to deal with UT failure. (UT failure is relevant.) |
|
I think the change here is reasonable since the condition changed from |
|
@HeartSaVioR @xuanyuanking Thanks for reviewing, let me add ut. |
|
Test build #122460 has finished for PR 28391 at commit
|
|
Test build #122461 has finished for PR 28391 at commit
|
|
retest this please |
|
Test build #122462 has finished for PR 28391 at commit
|
|
Test build #122464 has finished for PR 28391 at commit
|
xuanyuanking
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just some nits of comments.
| noDataBatchEnableKey -> flag.toString, | ||
| // set `STREAMING_NO_DATA_PROGRESS_EVENT_INTERVAL` a small value to | ||
| // report an `empty` progress when no data come. | ||
| noDataProgressIntervalKey -> "1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way for fixing the failed UTs seems a little flaky. As your approach deletes empty progress generally, so here you set STREAMING_NO_DATA_PROGRESS_EVENT_INTERVAL to a small enough value to generate an empty one as expected, I think we can do better in either way:
- Add comments to explain how small
STREAMING_NO_DATA_PROGRESS_EVENT_INTERVALneeds to be set here is enough, seems less than 10 seconds is OK? - Delete the empty progress checking with explanation in comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's either use the manual clock or fix the UT like the way it doesn't depend on STREAMING_NO_DATA_PROGRESS_EVENT_INTERVAL.
For latter like we can set noDataProgressIntervalKey to 1000000 and remove the last assertion, which isn't actually testing the behavior of deduplication. Even better, you can still keep the last assertion for only when flag == true, which has a meaning of verification that state cleanup in empty input batch doesn't produce new outputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SPARK-31928 is filed to address the flakiness of this test on the master branch - that said, if we can make sure this test become no longer flaky, that should be nice.
|
retest this please |
|
Test build #123227 has finished for PR 28391 at commit
|
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenerSuite.scala
Outdated
Show resolved
Hide resolved
| StartStream( | ||
| Trigger.ProcessingTime(100), | ||
| triggerClock = clock, | ||
| Map(noDataProgressIntervalKey -> "100")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set "noDataProgressIntervalKey" to 100. After advance manual clock at batch 3 (with no data), it will report an empty progress event.
| spark.streams.addListener(listener) | ||
|
|
||
| def checkProgressEvent(count: Int): StreamAction = { | ||
| AssertOnQuery { _ => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add an inner function to simplify test.
|
Test build #123471 has finished for PR 28391 at commit
|
HeartSaVioR
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
@HeartSaVioR @xuanyuanking Thanks for reviewing. |
gaborgsomogyi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
I am going to merge this in few days if there's no more comments given that multiple LGTMs from the community. Let me know if there's any concern @tdas @zsxwing @jose-torres. |
|
retest this please |
|
retest this please. |
|
Test build #123888 has finished for PR 28391 at commit
|
|
Test build #123893 has finished for PR 28391 at commit
|
|
retest this please |
|
Test build #123901 has finished for PR 28391 at commit
|
|
Merged to master and branch-3.0. |
### What changes were proposed in this pull request? Structured Streaming progress reporter will always report an `empty` progress when there is no new data. As design, we should provide progress updates every 10s (default) when there is no new data. Before PR:    After PR:  ### Why are the changes needed? Fixes a bug around incorrect progress report ### Does this PR introduce any user-facing change? Fixes a bug around incorrect progress report ### How was this patch tested? existing ut and manual test Closes #28391 from uncleGen/SPARK-31593. Authored-by: uncleGen <[email protected]> Signed-off-by: HyukjinKwon <[email protected]> (cherry picked from commit 1e40bcc) Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request? Structured Streaming progress reporter will always report an `empty` progress when there is no new data. As design, we should provide progress updates every 10s (default) when there is no new data. Before PR:    After PR:  ### Why are the changes needed? Fixes a bug around incorrect progress report ### Does this PR introduce any user-facing change? Fixes a bug around incorrect progress report ### How was this patch tested? existing ut and manual test Closes apache#28391 from uncleGen/SPARK-31593. Authored-by: uncleGen <[email protected]> Signed-off-by: HyukjinKwon <[email protected]> (cherry picked from commit 1e40bcc) Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
Structured Streaming progress reporter will always report an
emptyprogress when there is no new data. As design, we should provide progress updates every 10s (default) when there is no new data.Before PR:
After PR:
Why are the changes needed?
Fixes a bug around incorrect progress report
Does this PR introduce any user-facing change?
Fixes a bug around incorrect progress report
How was this patch tested?
existing ut and manual test