Conversation


@zhengruifeng zhengruifeng commented Oct 22, 2022

What changes were proposed in this pull request?

Reimplement `summary` with DataFrame operations.

Why are the changes needed?

1. Do not truncate the SQL plan any more;
2. Enable SQL optimizations such as column pruning:

```
scala> val df = spark.range(0, 3, 1, 10).withColumn("value", lit("str"))
df: org.apache.spark.sql.DataFrame = [id: bigint, value: string]

scala> df.summary("max", "50%").show
+-------+---+-----+
|summary| id|value|
+-------+---+-----+
|    max|  2|  str|
|    50%|  1| null|
+-------+---+-----+

scala> df.summary("max", "50%").select("id").show
+---+
| id|
+---+
|  2|
|  1|
+---+

scala> df.summary("max", "50%").select("id").queryExecution.optimizedPlan
res4: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [element_at(id#367, summary#376, None, false) AS id#371]
+- Generate explode([max,50%]), false, [summary#376]
   +- Aggregate [map(max, cast(max(id#153L) as string), 50%, cast(percentile_approx(id#153L, [0.5], 10000, 0, 0)[0] as string)) AS id#367]
      +- Range (0, 3, step=1, splits=Some(10))
```
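To make the reshaping in that plan concrete, here is a minimal Python sketch (not Spark code) of the same idea: aggregate each column into a stat-to-value map, then explode the requested stats into one row per stat, looking each value up per column. The helper names are hypothetical, plain dicts stand in for Spark map columns, and a simple nearest-rank percentile stands in for `percentile_approx`.

```python
# Sketch of the Aggregate(map(...)) -> Generate(explode) -> Project(element_at)
# shape used by the DataFrame-based summary. Hypothetical helpers; plain Python
# dicts stand in for Spark's map columns.
from math import floor


def approx_percentile(values, p):
    """Nearest-rank percentile on a sorted copy (stand-in for percentile_approx)."""
    s = sorted(values)
    idx = min(floor(p * len(s)), len(s) - 1)
    return s[idx]


def summarize(columns, stats):
    """columns: {name: list of values}; stats: e.g. ["max", "50%"].

    Step 1 (like Aggregate): build one stat -> value map per column.
    Step 2 (like Generate/explode + element_at): one output row per stat.
    """
    agg = {}
    for name, values in columns.items():
        numeric = [v for v in values if isinstance(v, (int, float))]
        m = {}
        for stat in stats:
            if stat == "max":
                m[stat] = str(max(values))
            elif stat.endswith("%"):
                # percentiles apply only to numeric columns -> null otherwise
                p = float(stat[:-1]) / 100.0
                m[stat] = str(approx_percentile(numeric, p)) if numeric else None
        agg[name] = m
    # explode the stats into rows; element_at-style lookup per column
    return [[stat] + [agg[name].get(stat) for name in columns] for stat in stats]


rows = summarize({"id": [0, 1, 2], "value": ["str"] * 3}, ["max", "50%"])
for r in rows:
    print(r)  # ['max', '2', 'str'] then ['50%', '1', None]
```

Because the per-column maps are independent, a `select("id")` over the result only needs the `id` aggregation, which is exactly the column pruning the optimized plan above shows.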


Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UTs and manual checks.

@github-actions github-actions bot added the SQL label Oct 22, 2022
@zhengruifeng

cc @HyukjinKwon

@HyukjinKwon

Merged to master.

@zhengruifeng zhengruifeng deleted the sql_stat_summary branch October 24, 2022 01:59
@zhengruifeng

thank you @HyukjinKwon !

@zhengruifeng

CI is green in this PR, but it may conflict with the previous one, #38340.

The build in master is then broken; I fixed it in #38362.

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
Closes apache#38346 from zhengruifeng/sql_stat_summary.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
