[SPARK-43240][SQL][3.3] Fix the wrong result issue when calling df.describe() method. #40914
Conversation
@cloud-fan Please help to review. Thanks.
Hm, we're taking the head row anyway. Do you have any e2e example that produces a wrong result?
+1, I think we'd better add a test for it
We encountered this issue when running the Gluten UTs, because Gluten rewrites ColumnarToRow in a way that releases the row after it is used. The issue seems hard to reproduce on Apache Spark, because Apache Spark's ColumnarToRow uses on-heap memory that is reclaimed by GC. Do you have any suggestions for reproducing this issue on Apache Spark? @HyukjinKwon @cloud-fan @zhengruifeng
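To make the failure mode concrete, here is a minimal sketch in plain Scala (MutableRow and rows() are hypothetical stand-ins, not Spark classes) of what happens when an operator reuses a row buffer and the consumer stores references without copying:

```scala
// Hypothetical illustration of the row-reuse hazard; not Spark code.
final class MutableRow(var value: Int) {
  def copy(): MutableRow = new MutableRow(value)
}

object RowReuseDemo {
  // Simulates an operator that reuses one row object per iteration,
  // the way UnsafeRow-producing operators can.
  def rows(values: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    values.iterator.map { v => shared.value = v; shared }
  }

  def main(args: Array[String]): Unit = {
    // Buggy: every array slot holds the same shared object, so all entries
    // end up showing the last value written into it.
    val buggy = rows(Seq(1, 2, 3)).toArray
    println(buggy.map(_.value).mkString(",")) // prints 3,3,3

    // Fixed: copy each row before storing it, which is the shape of this fix.
    val fixed = rows(Seq(1, 2, 3)).map(_.copy()).toArray
    println(fixed.map(_.value).mkString(",")) // prints 1,2,3
  }
}
```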
This is only a bug if Spark installs third-party physical operators that release memory eagerly.
According to the doc of QueryExecution.toRdd, I think adding a copy is the right thing to do:
```scala
/**
 * Internal version of the RDD. Avoids copies and has no schema.
 * Note for callers: Spark may apply various optimization including reusing object: this means
 * the row is valid only for the iteration it is retrieved. You should avoid storing row and
 * accessing after iteration. (Calling `collect()` is one of known bad usage.)
 * If you want to store these rows into collection, please apply some converter or copy row
 * which produces new object per iteration.
 * Given QueryExecution is not a public class, end users are discouraged to use this: please
 * use `Dataset.rdd` instead where conversion will be applied.
 */
lazy val toRdd: RDD[InternalRow] = new SQLExecutionRDD(
  executedPlan.execute(), sparkSession.sessionState.conf)
```
It happens to work here because the result only has one row (it's a global aggregate). I'm fine without testing it, as this follows the guidance of the QueryExecution.toRdd doc.
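For illustration, a minimal runnable sketch of the pattern the scaladoc recommends; the DataFrame here is an arbitrary example, and the added `.map(_.copy())` is the essence of this PR's change:

```scala
import org.apache.spark.sql.SparkSession

object ToRddCopyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("toRdd-copy").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

    // Risky per the scaladoc: collected InternalRows may alias buffers that an
    // operator reuses (or, with third-party operators, frees) during iteration.
    // val rows = df.queryExecution.toRdd.collect()

    // Safe: copy each InternalRow so every collected object owns its own data.
    val rows = df.queryExecution.toRdd.map(_.copy()).collect()
    rows.foreach(r => println(r.numFields)) // prints 2 twice

    spark.stop()
  }
}
```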
Thanks @cloud-fan for the valuable information. @HyukjinKwon @zhengruifeng, can we follow Wenchen's suggestion and merge this PR without a test?
SGTM
@HyukjinKwon Do you have any further comment? Thanks.
Do 3.3/3.4/master have the same issue?
Spark 3.3 has this issue. Spark 3.4 and master don't seem to have it, because StatFunctions.scala was reimplemented there and no longer calls the rdd.collect() method.
Oh, you're fixing branch-3.2. It reached EOL, and there won't be more releases in 3.2.x.
I am fine if we can land this to branch-3.3 alone but would need to fix the JIRA's affected version.
Sure. I have changed the version to branch-3.3. Please help to review again. Thanks.
Thanks for your review. @cloud-fan Can you help to merge?
thanks, merging to 3.3! |
Merged as: [SPARK-43240][SQL][3.3] Fix the wrong result issue when calling df.describe() method. Closes #40914 from JkSelf/describe. Authored-by: Jia Ke <[email protected]>. Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
The df.describe() method caches the collected rows of the RDD. If the RDD is an RDD[UnsafeRow], the rows may be released after they are used, and the cached result will be wrong. We need to copy the rows before caching them, as the TakeOrderedAndProjectExec operator does. See the sketch below.
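As a hedged illustration (a simplified sketch, not the actual StatFunctions diff), here is how the fix applies to a describe()-style global aggregate:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DescribeFixSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("describe-fix").getOrCreate()
    import spark.implicits._

    val df = Seq(1.0, 2.0, 3.0).toDF("v")
    // A describe()-like global aggregate that yields a single statistics row.
    val agg = df.select(count($"v"), mean($"v"), stddev($"v"), min($"v"), max($"v"))

    // The defensive copy: without .map(_.copy()), the collected head row could
    // alias a buffer that a third-party operator (e.g. Gluten's ColumnarToRow)
    // has already released by the time the statistics are read.
    val head = agg.queryExecution.toRdd.map(_.copy()).collect().head
    println(head.numFields) // prints 5

    spark.stop()
  }
}
```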
Why are the changes needed?
bug fix
Does this PR introduce any user-facing change?
no
How was this patch tested?