
Conversation

@10110346
Contributor

@10110346 10110346 commented Sep 30, 2018

What changes were proposed in this pull request?

If the record count is incremented by more than 1 at a time, the number of bytes read may rarely get updated, because the running count can skip over the exact multiples of UPDATE_INPUT_METRICS_INTERVAL_RECORDS.

This PR instead checks whether the increment causes the value to cross a higher multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
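The difference between the two checks can be sketched as follows (a minimal Python illustration, not the actual Scala change in FileScanRDD; the interval value of 1000 and the function names are assumptions for illustration only):

```python
UPDATE_INPUT_METRICS_INTERVAL_RECORDS = 1000  # assumed value for illustration

def should_update_old(records_read: int) -> bool:
    # Old check: fires only when the running count lands exactly on a
    # multiple of the interval.
    return records_read % UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0

def should_update_new(records_before: int, increment: int) -> bool:
    # New check: fires whenever the increment crosses into a higher
    # multiple of the interval, even if it skips the exact multiple.
    interval = UPDATE_INPUT_METRICS_INTERVAL_RECORDS
    return (records_before + increment) // interval > records_before // interval

# A columnar batch may add thousands of records at once. The old check
# misses every boundary unless the total lands exactly on a multiple:
print(should_update_old(4096))     # False: 4096 is not a multiple of 1000
print(should_update_new(0, 4096))  # True: four interval boundaries crossed
```

The names here are hypothetical; in Spark the corresponding logic sits in the metrics iterator that wraps the scan and triggers the bytesRead update.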

How was this patch tested?

Existing unit tests.

@SparkQA

SparkQA commented Sep 30, 2018

Test build #96806 has finished for PR 22594 at commit e589e1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Oct 4, 2018

This needs a JIRA as it's a non-trivial bug fix. This also doesn't explain the problem at all. The problem is that records may be incremented by more than 1 at a time, right?

That should be in a comment here as well.

Member

The original goal here is to avoid updating it on every record, because that is too expensive. I am not sure what the goal of your changes is. Try to write a test case in SQLMetricsSuite?

Member

I think the issue is that in line 108, this value can be incremented by more than 1. It might skip over the count that is an exact multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS. If that code path is common, it might rarely ever get updated. This now just checks whether the increment causes the value to exceed a higher multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS, which sounds more correct. But yeah needs a description and ideally a little test.

@10110346
Contributor Author

10110346 commented Oct 5, 2018

@srowen Yes, I will update, thanks

@10110346 10110346 changed the title [MINOR][SQL] When batch reading, the number of bytes can not be updated as expected. [SPARK-25674][SQL] If the records are incremented by more than 1 at a time,the number of bytes might rarely ever get updated Oct 8, 2018
@10110346 10110346 force-pushed the inputMetrics branch 2 times, most recently from c332716 to 8134249 Compare October 8, 2018 03:20
@SparkQA

SparkQA commented Oct 8, 2018

Test build #97096 has finished for PR 22594 at commit 8134249.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 8, 2018

Test build #97095 has finished for PR 22594 at commit c332716.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor

retest this please

@SparkQA

SparkQA commented Oct 8, 2018

Test build #97104 has finished for PR 22594 at commit 8134249.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

This isn't really testing the code you changed. It's replicating something similar and testing that. I don't think this test helps. Ideally you would write a test for any path that uses FileScanRDD and check its metrics. Are there tests around here that you could 'piggyback' onto? Maybe an existing test of the metrics involving ColumnarBatch that can be changed to trigger this case.

It may be hard, I don't know. Worth looking to see if there's an easy way to test this.

Contributor Author

It is too hard; the test needs to involve ColumnarBatch.
In addition, we must capture bytesRead during execution, not at task end.

Member

@srowen srowen left a comment


Looks like a reasonable test.

@SparkQA

SparkQA commented Oct 9, 2018

Test build #97154 has finished for PR 22594 at commit 95376b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Ah you know, we have spark.testing for this purpose too.

I'm actually on the fence here about whether it's worth this extra complexity for the test. It's a simple change and the overall effect is tested by other tests of input metrics.

Hm, what do you think, just drop this? Is it necessary for the test to work correctly?

Contributor Author

@10110346 10110346 Oct 11, 2018


If this place is controlled by spark.testing, other unit tests may fail.
Yeah, I agree with you, this is a simple change; it is better to drop this.
Thanks @srowen

@SparkQA

SparkQA commented Oct 11, 2018

Test build #97230 has finished for PR 22594 at commit 04eba30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Oct 11, 2018
… time,the number of bytes might rarely ever get updated

## What changes were proposed in this pull request?
If the records are incremented by more than 1 at a time,the number of bytes might rarely ever get updated,because it might skip over the count that is an exact multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS.

This PR just checks whether the increment causes the value to exceed a higher multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS.

## How was this patch tested?
existed unit tests

Closes #22594 from 10110346/inputMetrics.

Authored-by: liuxian <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 69f5e9c)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Member

srowen commented Oct 11, 2018

Merged to master/2.4/2.3 as a clean simple bug fix

asfgit pushed a commit that referenced this pull request Oct 11, 2018
… time,the number of bytes might rarely ever get updated


Closes #22594 from 10110346/inputMetrics.

Authored-by: liuxian <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 69f5e9c)
Signed-off-by: Sean Owen <[email protected]>
@asfgit asfgit closed this in 69f5e9c Oct 11, 2018
asfgit pushed a commit that referenced this pull request Oct 16, 2018
## What changes were proposed in this pull request?
This PR is a follow-up of #22594. This alternative avoids unneeded computation in the hot code path.

- For row-based scans, we keep the original approach.
- For columnar scans, we only need to update the stats after each batch.
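The split between the two paths can be sketched as follows (illustrative Python, not the actual FileScanRDD code; `Batch`, `MetricsIterator`, and the counter names are hypothetical stand-ins, and the interval value is assumed):

```python
from collections import namedtuple

UPDATE_INPUT_METRICS_INTERVAL_RECORDS = 1000  # assumed value for illustration

Batch = namedtuple("Batch", ["num_rows"])  # stand-in for ColumnarBatch

class MetricsIterator:
    """Wraps a row or batch iterator and counts metric updates."""

    def __init__(self, source, columnar):
        self.source = iter(source)
        self.columnar = columnar
        self.records_read = 0
        self.metric_updates = 0  # stand-in for updateBytesRead() calls

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self.source)
        if self.columnar:
            # Columnar scan: each element is a whole batch, so update
            # the stats once per batch; no interval check is needed.
            self.records_read += item.num_rows
            self.metric_updates += 1
        else:
            # Row-based scan: keep the original per-interval check.
            self.records_read += 1
            if self.records_read % UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0:
                self.metric_updates += 1
        return item
```

Under this sketch, a columnar source of two batches updates the stats twice (once per batch), while 2,500 single rows update them at counts 1,000 and 2,000, avoiding the modular check on every batch element.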

## How was this patch tested?
N/A

Closes #22731 from gatorsmile/udpateStatsFileScanRDD.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 4cee191)
Signed-off-by: Sean Owen <[email protected]>
asfgit pushed a commit that referenced this pull request Oct 16, 2018

Closes #22731 from gatorsmile/udpateStatsFileScanRDD.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 4cee191)
Signed-off-by: Sean Owen <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… time,the number of bytes might rarely ever get updated


Closes apache#22594 from 10110346/inputMetrics.

Authored-by: liuxian <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#22731 from gatorsmile/udpateStatsFileScanRDD.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>