Conversation


@zhengruifeng zhengruifeng commented Oct 22, 2022

What changes were proposed in this pull request?

Reimplement `summary` with DataFrame operations.

Why are the changes needed?

1. Do not truncate the SQL plan any more;
2. Enable SQL optimizations such as column pruning:

```
scala> val df = spark.range(0, 3, 1, 10).withColumn("value", lit("str"))
df: org.apache.spark.sql.DataFrame = [id: bigint, value: string]

scala> df.summary("max", "50%").show
+-------+---+-----+
|summary| id|value|
+-------+---+-----+
|    max|  2|  str|
|    50%|  1| null|
+-------+---+-----+

scala> df.summary("max", "50%").select("id").show
+---+
| id|
+---+
|  2|
|  1|
+---+

scala> df.summary("max", "50%").select("id").queryExecution.optimizedPlan
res4: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [element_at(id#367, summary#376, None, false) AS id#371]
+- Generate explode([max,50%]), false, [summary#376]
   +- Aggregate [map(max, cast(max(id#153L) as string), 50%, cast(percentile_approx(id#153L, [0.5], 10000, 0, 0)[0] as string)) AS id#367]
      +- Range (0, 3, step=1, splits=Some(10))
```
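To make the reshaping in that plan concrete, here is a minimal Python sketch (not Spark code) of the same idea: aggregate each column into a stat-to-value map, then explode the requested stats into one row per stat, looking each value up per column. The helper names are hypothetical, plain dicts stand in for Spark map columns, and a simple nearest-rank percentile stands in for `percentile_approx`.

```python
# Sketch of the Aggregate(map(...)) -> Generate(explode) -> Project(element_at)
# shape used by the DataFrame-based summary. Hypothetical helpers; plain Python
# dicts stand in for Spark's map columns.
from math import floor


def approx_percentile(values, p):
    """Nearest-rank percentile on a sorted copy (stand-in for percentile_approx)."""
    s = sorted(values)
    idx = min(floor(p * len(s)), len(s) - 1)
    return s[idx]


def summarize(columns, stats):
    """columns: {name: list of values}; stats: e.g. ["max", "50%"].

    Step 1 (like Aggregate): build one stat -> value map per column.
    Step 2 (like Generate/explode + element_at): one output row per stat.
    """
    agg = {}
    for name, values in columns.items():
        numeric = [v for v in values if isinstance(v, (int, float))]
        m = {}
        for stat in stats:
            if stat == "max":
                m[stat] = str(max(values))
            elif stat.endswith("%"):
                # percentiles apply only to numeric columns -> null otherwise
                p = float(stat[:-1]) / 100.0
                m[stat] = str(approx_percentile(numeric, p)) if numeric else None
        agg[name] = m
    # explode the stats into rows; element_at-style lookup per column
    return [[stat] + [agg[name].get(stat) for name in columns] for stat in stats]


rows = summarize({"id": [0, 1, 2], "value": ["str"] * 3}, ["max", "50%"])
for r in rows:
    print(r)  # ['max', '2', 'str'] then ['50%', '1', None]
```

Because the per-column maps are independent, a `select("id")` over the result only needs the `id` aggregation, which is exactly the column pruning the optimized plan above shows.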


Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UTs and manual checks.

@github-actions github-actions bot added the SQL label Oct 22, 2022
@zhengruifeng

cc @HyukjinKwon

@HyukjinKwon

Merged to master.

@zhengruifeng zhengruifeng deleted the sql_stat_summary branch October 24, 2022 01:59
@zhengruifeng

thank you @HyukjinKwon !

@zhengruifeng

CI is green in this PR, but it may conflict with the previous one, #38340.

The build in master is then broken; I fixed it in #38362.

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
Closes apache#38346 from zhengruifeng/sql_stat_summary.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
