Conversation

@zhengruifeng (Contributor):
### What changes were proposed in this pull request?

On Spark Connect, `df.col("*")` should be resolved against the target plan.

### Why are the changes needed?

```python
In [6]: df1 = spark.createDataFrame([{"id": 1}])

In [7]: df2 = spark.createDataFrame([{"id": 1, "val": "v"}])

In [8]: df1.join(df2)
Out[8]: DataFrame[id: bigint, id: bigint, val: string]

In [9]: df1.join(df2).select(df1["*"])
Out[9]: DataFrame[id: bigint, id: bigint, val: string]
```

It should be:

```python
In [3]: df1.join(df2).select(df1["*"])
Out[3]: DataFrame[id: bigint]
```
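The intended semantics: `df1["*"]` should expand against the plan `df1` came from, not against the whole joined plan. Below is a minimal toy model in plain Python of that idea (all names here are hypothetical illustrations, not Spark internals; the real fix tracks a plan id per DataFrame and resolves the star server-side):

```python
# Toy model of plan-scoped star expansion (hypothetical, not Spark internals).
# Each DataFrame remembers a plan id; df["*"] expands to only the columns
# that plan produced, even after a join widens the combined output.

class ToyDataFrame:
    _next_plan_id = 0

    def __init__(self, columns):
        self.columns = list(columns)
        self.plan_id = ToyDataFrame._next_plan_id
        ToyDataFrame._next_plan_id += 1
        # Map plan id -> that plan's output columns, for star resolution.
        self.plans = {self.plan_id: self.columns}

    def join(self, other):
        joined = ToyDataFrame(self.columns + other.columns)
        joined.plans.update(self.plans)
        joined.plans.update(other.plans)
        return joined

    def select_star_of(self, df):
        # Resolve df["*"] against df's own plan, not the joined output.
        return self.plans[df.plan_id]

df1 = ToyDataFrame(["id"])
df2 = ToyDataFrame(["id", "val"])
joined = df1.join(df2)
print(joined.columns)              # ['id', 'id', 'val']
print(joined.select_star_of(df1))  # ['id']
```

Here `select_star_of` plays the role of the server-side resolution in the actual fix: look up the originating plan by id and return only that plan's output.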

### Does this PR introduce any user-facing change?

Yes.

### How was this patch tested?

Added a unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Contributor Author (@zhengruifeng):
Don't add ambiguity detection for now, since in vanilla Spark:

```python
In [7]: df1 = spark.createDataFrame([{"id": 1}])

In [8]: df1.join(df1)
Out[8]: DataFrame[id: bigint, id: bigint]

In [9]: df1.join(df1).select(df1["id"])
...
AnalysisException: Column id#0L are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same.

In [10]: df1.join(df1).select(df1["*"])
Out[10]: DataFrame[id: bigint]
```
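The asymmetry above can be illustrated with a toy in plain Python (greatly simplified and hypothetical: real Spark compares attribute ids and plan ids, not bare column names). A single attribute reference can match both copies of a column after a self-join, while star expansion just reuses the originating plan's own output list:

```python
# Toy illustration (hypothetical, not Spark code) of why df1["id"] is
# ambiguous after a self-join while df1["*"] is not.

def resolve_attribute(name, combined_output):
    # An attribute reference is matched against the combined join output;
    # two matches means the reference is ambiguous.
    matches = [c for c in combined_output if c == name]
    if len(matches) > 1:
        raise ValueError(f"Column {name} is ambiguous")
    return matches[0]

def expand_star(plan_output):
    # df1["*"] reuses df1's own output list; no matching, no ambiguity.
    return list(plan_output)

df1_output = ["id"]
self_join_output = df1_output + df1_output   # df1.join(df1) -> ["id", "id"]

try:
    resolve_attribute("id", self_join_output)
except ValueError as e:
    print(e)                     # Column id is ambiguous

print(expand_star(df1_output))   # ['id']
```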

Contributor:

It's probably a bug in vanilla Spark...

Contributor Author (@zhengruifeng):

Yeah, let me fail it in Spark Connect anyway.

Contributor Author (@zhengruifeng), Jan 11, 2024:

Deleted this helper function due to the behavior difference between `Dataset#col` and `functions#col`:

```scala
def col(colName: String): Column = colName match {
  case "*" =>
    Column(ResolvedStar(queryExecution.analyzed.output))
  case _ =>
    if (sparkSession.sessionState.conf.supportQuotedRegexColumnName) {
      colRegex(colName)
    } else {
      Column(addDataFrameIdToCol(resolve(colName)))
    }
}

def this(name: String) = this(withOrigin {
  name match {
    case "*" => UnresolvedStar(None)
    case _ if name.endsWith(".*") =>
      val parts = UnresolvedAttribute.parseAttributeName(name.substring(0, name.length - 2))
      UnresolvedStar(Some(parts))
    case _ => UnresolvedAttribute.quotedString(name)
  }
})
```

@zhengruifeng changed the title from [SPARK-46677][SQL][CONNECT] Fix df.col("*") resolution to [SPARK-46677][SQL][CONNECT] Fix dataframe["*"] resolution on Jan 11, 2024.
Contributor Author (@zhengruifeng):

TODO for myself: revisit the implementation of `colRegex`.

Contributor:

We can probably skip it in Spark Connect. It's really a weird, non-standard feature.

Contributor:

It's off by default anyway, so we can throw a proper error if it's enabled in Spark Connect.

@zhengruifeng force-pushed the py_df_star branch 4 times, most recently from 15487aa to 4ba519f, on January 11, 2024 at 12:57.
Contributor:

This does not need to be a star. It's just a placeholder and will be replaced by `ResolvedStar` after finding the matching plan.

Contributor:

Then we can handle it in `ColumnResolutionHelper` and reuse code.
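The placeholder approach discussed above can be sketched in plain Python (the node names mirror the discussion, but these classes and the `resolve` helper are hypothetical stand-ins, not Spark's actual implementation): the analyzer walks expressions and swaps the placeholder for a `ResolvedStar` built from the matching plan's output.

```python
# Hypothetical sketch of placeholder substitution during analysis.
from dataclasses import dataclass

@dataclass
class UnresolvedDataFrameStar:
    plan_id: int          # which DataFrame's plan the star belongs to

@dataclass
class ResolvedStar:
    attributes: list      # the expanded output columns

def resolve(expr, plan_outputs):
    # Replace the placeholder with the output of the matching plan;
    # leave any other expression untouched.
    if isinstance(expr, UnresolvedDataFrameStar):
        if expr.plan_id not in plan_outputs:
            raise ValueError(f"no plan with id {expr.plan_id} in scope")
        return ResolvedStar(plan_outputs[expr.plan_id])
    return expr

plan_outputs = {0: ["id"], 1: ["id", "val"]}
star = UnresolvedDataFrameStar(plan_id=0)
print(resolve(star, plan_outputs))   # ResolvedStar(attributes=['id'])
```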

Contributor Author (@zhengruifeng):

Got it.

Contributor Author (@zhengruifeng):

CI runs this test in both Connect and vanilla Spark.

Contributor Author (@zhengruifeng):

CI runs this test only in Connect.

Contributor:

What's the difference between Connect and Classic for this test?

Contributor Author (@zhengruifeng):

```python
In [4]: cdf1 = spark.createDataFrame([Row(a=1, b=2, c=3)])

In [5]: cdf2 = spark.createDataFrame([Row(a=2, b=0)])

In [6]: cdf3 = cdf1.select(cdf1.a)

In [7]: cdf3.select(cdf1["*"]).schema
...
AnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_MISSING_FROM_INPUT] Resolved attribute(s) "b", "c" missing from "a" in operator !Project [a#0L, b#1L, c#2L]. SQLSTATE: XX000;
!Project [a#0L, b#1L, c#2L]
+- Project [a#0L]
   +- LogicalRDD [a#0L, b#1L, c#2L], false

In [8]: cdf1.select(cdf2["*"]).schema
...
AnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] Resolved attribute(s) "a", "b" missing from "a", "b", "c" in operator !Project [a#6L, b#7L]. Attribute(s) with the same name appear in the operation: "a", "b". Please check if the right attribute(s) are used. SQLSTATE: XX000;
!Project [a#6L, b#7L]
+- LogicalRDD [a#0L, b#1L, c#2L], false

In [9]: cdf1.join(cdf1).select(cdf1["*"]).schema
Out[9]: StructType([StructField('a', LongType(), True), StructField('b', LongType(), True), StructField('c', LongType(), True)])
```

`cdf1.join(cdf1).select(cdf1["*"])` won't fail with `AMBIGUOUS_COLUMN_REFERENCE`.

@HyukjinKwon (Member):
Merged to master.

@zhengruifeng deleted the py_df_star branch on January 16, 2024 at 00:35.
zhengruifeng added a commit that referenced this pull request Jan 16, 2024
…)` on client side

### What changes were proposed in this pull request?
Before #44689, `df["*"]` and `sf.col("*")` were both converted to `UnresolvedStar`, and `Count(UnresolvedStar)` was then converted to `Count(1)` in the Analyzer:
https://github.com/apache/spark/blob/381f3691bd481abc8f621ca3f282e06db32bea31/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1893-L1897

In that fix, we introduced a new node `UnresolvedDataFrameStar` for `df["*"]`, which is replaced with `ResolvedStar` later. Unfortunately, it no longer matches `Count(UnresolvedStar)`. So it causes:
```
In [1]: from pyspark.sql import functions as sf

In [2]: df1 = spark.createDataFrame([{"id": 1, "val": "v"}])

In [3]: df1.select(sf.count(df1["*"]))
Out[3]: DataFrame[count(id, val): bigint]
```

which should be
```
In [3]: df1.select(sf.count(df1["*"]))
Out[3]: DataFrame[count(1): bigint]
```

In vanilla Spark, it is up to the `count` function to perform this conversion (`sf.count(df1["*"])` -> `sf.count(sf.lit(1))`); see

https://github.com/apache/spark/blob/e8dfcd3081abe16b2115bb2944a2b1cb547eca8e/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L422-L436

So it is natural to fix this behavior on the client side as well.
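That client-side conversion can be sketched as follows (a minimal toy: `Column`, `lit`, and `is_star` here are simplified hypothetical stand-ins, not the real PySpark API): when `count` receives a star column, substitute `lit(1)` before building the expression.

```python
# Hedged sketch of the client-side fix: count(df["*"]) -> count(1),
# mirroring what vanilla Spark's count function does server-side.
# All classes/functions below are simplified stand-ins.

class Column:
    def __init__(self, expr):
        self.expr = expr
    def __repr__(self):
        return f"Column({self.expr})"

def lit(v):
    return Column(repr(v))

def is_star(col):
    # Toy check; the real client inspects the expression node type.
    return col.expr == "*"

def count(col):
    if is_star(col):
        col = lit(1)   # df["*"] -> count(1)
    return Column(f"count({col.expr})")

print(count(Column("*")))    # Column(count(1))
print(count(Column("id")))   # Column(count(id))
```

The design point matches the quoted `functions.scala` snippet: the star is rewritten before it ever reaches the analyzer, so the new `UnresolvedDataFrameStar` node never needs to handle the `Count` special case.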

### Why are the changes needed?
To keep the previous behavior.

### Does this PR introduce _any_ user-facing change?
It fixes a behavior change introduced in #44689.

### How was this patch tested?
Added a unit test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44752 from zhengruifeng/connect_fix_count_df_star.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
zhengruifeng added a commit that referenced this pull request Jan 16, 2024
### What changes were proposed in this pull request?
This PR is a follow-up of #44689, fixing `dataset.col("*")` in the Scala client.

### Why are the changes needed?
To fix `dataset.col("*")` resolution.

### Does this PR introduce _any_ user-facing change?
Yes, a bug fix.

### How was this patch tested?
Added a unit test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44748 from zhengruifeng/connect_scala_df_star.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>