[SPARK-25674][SQL] If the records are incremented by more than 1 at a time, the number of bytes might rarely ever get updated #22594
Conversation
Test build #96806 has finished for PR 22594 at commit
This needs a JIRA as it's a non-trivial bug fix. This also doesn't explain the problem at all. The problem is that records may be incremented by more than 1 at a time, right? That should be in a comment here as well.
The original goal here is to avoid updating it on every record, because that is too expensive. I am not sure what the goal of your changes is. Try to write a test case in SQLMetricsSuite?
I think the issue is that in line 108, this value can be incremented by more than 1. It might skip over the count that is an exact multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS. If that code path is common, it might rarely ever get updated. This now just checks whether the increment causes the value to exceed a higher multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS, which sounds more correct. But yeah, it needs a description and ideally a little test.
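To make the failure mode concrete, here is a minimal standalone sketch (not the actual FileScanRDD code; the interval value and batch size are made-up placeholders) showing how a counter that grows by a whole ColumnarBatch at a time can step over every exact multiple of the update interval, so an exact-multiple check almost never fires:

```scala
object MetricsSkipSketch {
  // Stand-in for SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS (assumed value).
  val UpdateInterval = 1000L

  def main(args: Array[String]): Unit = {
    var recordsRead = 0L
    var bytesReadUpdates = 0

    // A columnar scan bumps the counter by a whole batch (here 4096 rows) per next().
    for (_ <- 1 to 100) {
      recordsRead += 4096
      // Old-style check: fires only when the counter lands exactly on a multiple.
      if (recordsRead % UpdateInterval == 0) bytesReadUpdates += 1
    }

    // Prints "records read = 409600, bytesRead updates = 0": the exact-multiple
    // check is skipped every time because 4096 * k is rarely divisible by 1000.
    println(s"records read = $recordsRead, bytesRead updates = $bytesReadUpdates")
  }
}
```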
@srowen Yes, I will update, thanks
Force-pushed from c332716 to 8134249
Test build #97096 has finished for PR 22594 at commit
Test build #97095 has finished for PR 22594 at commit
retest this please
Test build #97104 has finished for PR 22594 at commit
This isn't really testing the code you changed. It's replicating something similar and testing that. I don't think this test helps. Ideally you would write a test for any path that uses FileScanRDD and check its metrics. Are there tests around here that you could 'piggyback' onto? Maybe an existing test of the metrics involving ColumnarBatch could be changed to trigger this case.
It may be hard, I don't know. Worth looking to see if there's an easy way to test this.
It is too hard; the test needs to involve ColumnarBatch.
In addition, we must capture the bytesRead during execution, not at task end.
srowen left a comment
Looks like a reasonable test.
Test build #97154 has finished for PR 22594 at commit
Ah you know, we have spark.testing for this purpose too.
I'm actually on the fence here about whether it's worth this extra complexity for the test. It's a simple change and the overall effect is tested by other tests of input metrics.
Hm, what do you think, should we just drop this? Is it necessary for the test to work correctly?
If this is controlled by spark.testing, other unit tests may fail.
Yeah, I agree with you; this is a simple change, so it is better to drop this.
Thanks @srowen
Force-pushed from 95376b3 to 04eba30
Test build #97230 has finished for PR 22594 at commit
… time, the number of bytes might rarely ever get updated

## What changes were proposed in this pull request?
If the records are incremented by more than 1 at a time, the number of bytes might rarely ever get updated, because it might skip over the count that is an exact multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS. This PR just checks whether the increment causes the value to exceed a higher multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS.

## How was this patch tested?
Existing unit tests.

Closes #22594 from 10110346/inputMetrics.
Authored-by: liuxian <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 69f5e9c)
Signed-off-by: Sean Owen <[email protected]>
Merged to master/2.4/2.3 as a clean simple bug fix
## What changes were proposed in this pull request?
This PR is a follow-up of #22594. This alternative can avoid the unneeded computation in the hot code path.
- For row-based scan, we keep the original way.
- For the columnar scan, we just need to update the stats after each batch.

## How was this patch tested?
N/A

Closes #22731 from gatorsmile/udpateStatsFileScanRDD.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 4cee191)
Signed-off-by: Sean Owen <[email protected]>
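For reference, a simplified sketch of that follow-up idea (illustrative only; the names below are placeholders, not the real FileScanRDD members): the columnar path refreshes bytesRead once per batch, while the row-based path keeps the cheap interval check, so the per-record hot path gains no extra arithmetic:

```scala
object PerBatchStatsSketch {
  // Placeholder for SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
  val UpdateInterval = 1000L
  var recordsRead = 0L

  // Placeholder for refreshing bytesRead from the Hadoop read callback.
  def updateBytesRead(): Unit = ()

  // Columnar path: one element is a whole batch, so update stats once per batch.
  def onNextBatch(numRows: Int): Unit = {
    updateBytesRead()
    recordsRead += numRows
  }

  // Row-based path: unchanged, refresh only every UpdateInterval records.
  def onNextRow(): Unit = {
    if (recordsRead % UpdateInterval == 0) updateBytesRead()
    recordsRead += 1
  }
}
```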
What changes were proposed in this pull request?
If the records are incremented by more than 1 at a time, the number of bytes might rarely ever get updated, because the count might skip over an exact multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
This PR instead checks whether the increment causes the value to exceed a higher multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
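As a rough illustration of that check (a hedged sketch, not a verbatim copy of the FileScanRDD change; the interval constant and helper names are placeholders), the update can be triggered whenever an increment crosses a multiple of the interval, which integer division detects even when the counter never lands exactly on one:

```scala
object CrossingCheckSketch {
  // Placeholder for SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
  val UpdateInterval = 1000L

  // Returns the new record count and whether bytesRead should be refreshed.
  def incRecordsRead(recordsBefore: Long, delta: Long): (Long, Boolean) = {
    val recordsAfter = recordsBefore + delta
    // True iff some multiple of UpdateInterval lies in (recordsBefore, recordsAfter],
    // i.e. the increment crossed a higher multiple of the interval.
    val crossed = recordsBefore / UpdateInterval != recordsAfter / UpdateInterval
    (recordsAfter, crossed)
  }

  def main(args: Array[String]): Unit = {
    // A 4096-row batch starting at 1500 records crosses 2000, 3000, ... 5000,
    // so bytesRead gets refreshed; an exact-multiple test would have missed it.
    println(incRecordsRead(1500L, 4096L)) // (5596,true)
    // A single-row increment that does not cross a multiple leaves it alone.
    println(incRecordsRead(1500L, 1L))    // (1501,false)
  }
}
```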
How was this patch tested?
Existing unit tests.