[SPARK-25674][SQL] If the records are incremented by more than 1 at a time, the number of bytes might rarely ever get updated #22594
Conversation
Test build #96806 has finished for PR 22594 at commit
This needs a JIRA as it's a non-trivial bug fix. This also doesn't explain the problem at all. The problem is that records may be incremented by more than 1 at a time, right? That should be in a comment here as well.
The original goal here is to avoid updating it on every record, because that is too expensive. I am not sure what the goal of your changes is. Try to write a test case in SQLMetricsSuite?
I think the issue is that in line 108, this value can be incremented by more than 1. It might skip over the count that is an exact multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS. If that code path is common, it might rarely ever get updated. This now just checks whether the increment causes the value to exceed a higher multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS, which sounds more correct. But yeah, it needs a description and ideally a little test.
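To make the failure mode concrete, here is a minimal standalone sketch (not the actual FileScanRDD code; the interval value and batch size are made-up placeholders) showing how a counter that grows by a whole ColumnarBatch at a time can step over every exact multiple of the update interval, so an exact-multiple check almost never fires:

```scala
object MetricsSkipSketch {
  // Stand-in for SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS (assumed value).
  val UpdateInterval = 1000L

  def main(args: Array[String]): Unit = {
    var recordsRead = 0L
    var bytesReadUpdates = 0

    // A columnar scan bumps the counter by a whole batch (here 4096 rows) per next().
    for (_ <- 1 to 100) {
      recordsRead += 4096
      // Old-style check: fires only when the counter lands exactly on a multiple.
      if (recordsRead % UpdateInterval == 0) bytesReadUpdates += 1
    }

    // Prints "records read = 409600, bytesRead updates = 0": the exact-multiple
    // check is skipped every time because 4096 * k is rarely divisible by 1000.
    println(s"records read = $recordsRead, bytesRead updates = $bytesReadUpdates")
  }
}
```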
@srowen Yes, I will update, thanks
Force-pushed from c332716 to 8134249
Test build #97096 has finished for PR 22594 at commit
Test build #97095 has finished for PR 22594 at commit
retest this please
Test build #97104 has finished for PR 22594 at commit
This isn't really testing the code you changed. It's replicating something similar and testing that. I don't think this test helps. Ideally you would write a test for any path that uses FileScanRDD and check its metrics. Are there tests around here that you could 'piggyback' onto? Maybe an existing test of the metrics involving ColumnarBatch could be changed to trigger this case.
It may be hard, I don't know. Worth looking to see if there's an easy way to test this.
It is too hard; the test needs to involve ColumnarBatch.
In addition, we must capture the bytesRead during execution, not at task end.
srowen left a comment
Looks like a reasonable test.
Test build #97154 has finished for PR 22594 at commit
Ah you know, we have spark.testing for this purpose too.
I'm actually on the fence here about whether it's worth this extra complexity for the test. It's a simple change and the overall effect is tested by other tests of input metrics.
Hm, what do you think, should we just drop this? Is it necessary for the test to work correctly?
If this is controlled by spark.testing, other unit tests may fail.
Yeah, I agree with you; this is a simple change, so it is better to drop this.
Thanks @srowen
Force-pushed from 95376b3 to 04eba30
Test build #97230 has finished for PR 22594 at commit
… time, the number of bytes might rarely ever get updated

## What changes were proposed in this pull request?
If the records are incremented by more than 1 at a time, the number of bytes might rarely ever get updated, because it might skip over the count that is an exact multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS. This PR just checks whether the increment causes the value to exceed a higher multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS.

## How was this patch tested?
Existing unit tests.

Closes #22594 from 10110346/inputMetrics.
Authored-by: liuxian <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 69f5e9c)
Signed-off-by: Sean Owen <[email protected]>
Merged to master/2.4/2.3 as a clean simple bug fix
## What changes were proposed in this pull request?
This PR is a follow-up of #22594. This alternative can avoid the unneeded computation in the hot code path.
- For row-based scan, we keep the original way.
- For the columnar scan, we just need to update the stats after each batch.

## How was this patch tested?
N/A

Closes #22731 from gatorsmile/udpateStatsFileScanRDD.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 4cee191)
Signed-off-by: Sean Owen <[email protected]>
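For reference, a simplified sketch of that follow-up idea (illustrative only; the names below are placeholders, not the real FileScanRDD members): the columnar path refreshes bytesRead once per batch, while the row-based path keeps the cheap interval check, so the per-record hot path gains no extra arithmetic:

```scala
object PerBatchStatsSketch {
  // Placeholder for SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
  val UpdateInterval = 1000L
  var recordsRead = 0L

  // Placeholder for refreshing bytesRead from the Hadoop read callback.
  def updateBytesRead(): Unit = ()

  // Columnar path: one element is a whole batch, so update stats once per batch.
  def onNextBatch(numRows: Int): Unit = {
    updateBytesRead()
    recordsRead += numRows
  }

  // Row-based path: unchanged, refresh only every UpdateInterval records.
  def onNextRow(): Unit = {
    if (recordsRead % UpdateInterval == 0) updateBytesRead()
    recordsRead += 1
  }
}
```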
What changes were proposed in this pull request?
If the records are incremented by more than 1 at a time, the number of bytes might rarely ever get updated, because the count might skip over an exact multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
This PR instead checks whether the increment causes the value to exceed a higher multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
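As a rough illustration of that check (a hedged sketch, not a verbatim copy of the FileScanRDD change; the interval constant and helper names are placeholders), the update can be triggered whenever an increment crosses a multiple of the interval, which integer division detects even when the counter never lands exactly on one:

```scala
object CrossingCheckSketch {
  // Placeholder for SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
  val UpdateInterval = 1000L

  // Returns the new record count and whether bytesRead should be refreshed.
  def incRecordsRead(recordsBefore: Long, delta: Long): (Long, Boolean) = {
    val recordsAfter = recordsBefore + delta
    // True iff some multiple of UpdateInterval lies in (recordsBefore, recordsAfter],
    // i.e. the increment crossed a higher multiple of the interval.
    val crossed = recordsBefore / UpdateInterval != recordsAfter / UpdateInterval
    (recordsAfter, crossed)
  }

  def main(args: Array[String]): Unit = {
    // A 4096-row batch starting at 1500 records crosses 2000, 3000, ... 5000,
    // so bytesRead gets refreshed; an exact-multiple test would have missed it.
    println(incRecordsRead(1500L, 4096L)) // (5596,true)
    // A single-row increment that does not cross a multiple leaves it alone.
    println(incRecordsRead(1500L, 1L))    // (1501,false)
  }
}
```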
How was this patch tested?
Existing unit tests.