
Conversation

@JkSelf
Contributor

@JkSelf JkSelf commented Apr 14, 2023

What changes were proposed in this pull request?

When df.describe() is called, it internally calls rdd.collect() and caches the resulting RDD[UnsafeRow]. If the last RDD is a GlutenColumnarToRowRDD, each UnsafeRow is released once the next batch is accessed, so df.describe() ends up reading released UnsafeRows and returns wrong results. This PR falls back to the vanilla ColumnarToRow when ColumnarToRow is the last operator in the plan.
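
A minimal sketch of the idea (not the actual Gluten rule): a physical-plan rule that swaps a non-vanilla columnar-to-row transition at the root of the plan for Spark's JVM ColumnarToRowExec. It assumes the native transition extends Spark's ColumnarToRowTransition trait, which may not match Gluten's real class hierarchy.

```scala
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{ColumnarToRowExec, ColumnarToRowTransition, SparkPlan}

// Sketch only: if the very last (top-most) operator is a columnar-to-row
// transition other than the vanilla one, replace it with the JVM
// ColumnarToRowExec. The JVM implementation materializes UnsafeRows in heap
// memory, so a downstream collect()/cache (as in df.describe()) can hold on
// to them safely.
object FallbackLastColumnarToRow extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan match {
    case c2r: ColumnarToRowTransition if !c2r.isInstanceOf[ColumnarToRowExec] =>
      ColumnarToRowExec(c2r.child)
    case other => other
  }
}
```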

Fixes #1354

How was this patch tested?

Added a unit test.

@github-actions

#1354

@FelixYBW
Contributor

The root cause is that the native columnar-to-row operator releases the underlying memory once a row has been consumed. That is fine for most non-caching operations, but if the next operator caches the row directly, it breaks because the buffer backing the row has already been released. Fortunately, in Spark most operators copy a row before caching it. That is not the case when the last operation is df.describe(), which caches the rows produced by the previous operator directly. The workaround here is to use the JVM columnar-to-row, which allocates rows in heap memory so they can be cached safely.

In theory this is a bug in vanilla Spark as well: if the previous operator produces rows allocated off-heap, they will likewise be released once consumed.
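
A JVM-only illustration of the same hazard (not Gluten-specific, runnable in spark-shell): Spark's UnsafeProjection reuses a single buffer for every row it produces, analogous to the native buffer being released and reused. Caching the returned UnsafeRow without copy() keeps references into memory that gets overwritten, while copy() detaches each row onto its own heap buffer.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{DataType, IntegerType}

// A projection whose output UnsafeRows all point into one reused buffer.
val proj = UnsafeProjection.create(Array[DataType](IntegerType))

// "Caching" rows without copying: every reference sees the same reused
// buffer, so earlier values are overwritten -- typically prints 3, 3, 3.
val cachedWithoutCopy = (1 to 3).map(i => proj(InternalRow(i)))
println(cachedWithoutCopy.map(_.getInt(0)))

// Copying detaches each row onto its own heap buffer -- prints 1, 2, 3.
val cachedWithCopy = (1 to 3).map(i => proj(InternalRow(i)).copy())
println(cachedWithCopy.map(_.getInt(0)))
```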

@FelixYBW
Contributor


If we fall back to the JVM ColumnarToRow, will there be a Velox2Arrow conversion before it?
@zhztheplayer

@zhztheplayer zhztheplayer self-requested a review April 25, 2023 05:23
@JkSelf
Contributor Author

JkSelf commented Apr 27, 2023

This is a bug in Spark 3.2 and 3.3. We have fixed it in PR #40914 for Spark 3.3. We will close this PR; this issue will be fixed when the Spark version is upgraded to 3.4.

@JkSelf JkSelf closed this Apr 27, 2023
