[SPARK-46538][ML] Fix the ambiguous column reference issue in `ALSModel.transform` #44526

zhengruifeng · 2023-12-28T11:58:58Z

What changes were proposed in this pull request?

The column references in `ALSModel.transform` may be ambiguous in some cases.

Why are the changes needed?

To fix a bug.

Before this fix, the test fails with:

JVM stacktrace:
org.apache.spark.sql.AnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] Resolved attribute(s) "features", "features" missing from "user", "item", "id", "features", "id", "features" in operator !Project [user#60, item#63, UDF(features#50, features#54) AS prediction#94]. Attribute(s) with the same name appear in the operation: "features", "features".
Please check if the right attribute(s) are used. SQLSTATE: XX000;

and


pyspark.errors.exceptions.captured.AnalysisException: Column features#50, features#46 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.

JVM stacktrace:
org.apache.spark.sql.AnalysisException: Column features#50, features#46 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
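Both errors above come from two inputs of a join exposing a column with the same name (`features`). The following is a toy, pure-Python model of that situation, not Spark's analyzer (which tracks attribute IDs such as `features#50`). The `resolve` and `qualify` helpers and the aliases `u` and `i` are hypothetical names introduced only for illustration. It shows that a lookup by bare name cannot pick a single column, while a lookup by alias-qualified name can.

```python
from typing import List


def resolve(columns: List[str], name: str) -> int:
    """Return the index of the single column called `name`; fail if it is not unique."""
    matches = [i for i, c in enumerate(columns) if c == name]
    if len(matches) != 1:
        raise LookupError(f"column {name!r} matches {len(matches)} columns")
    return matches[0]


def qualify(alias: str, columns: List[str]) -> List[str]:
    """Prefix every column name with a (hypothetical) dataset alias."""
    return [f"{alias}.{c}" for c in columns]


# Both factor tables expose the same column names, as in the error messages.
user_factor_cols = ["id", "features"]
item_factor_cols = ["id", "features"]

# Output schema of the join without and with alias qualification.
joined_plain = ["user", "item"] + user_factor_cols + item_factor_cols
joined_qualified = (
    ["user", "item"] + qualify("u", user_factor_cols) + qualify("i", item_factor_cols)
)

# A bare name matches two columns, so resolution fails.
try:
    resolve(joined_plain, "features")
    ambiguous = False
except LookupError:
    ambiguous = True

# Qualified names each match exactly one column.
user_features_idx = resolve(joined_qualified, "u.features")
item_features_idx = resolve(joined_qualified, "i.features")
```

In real Spark code the analogous move is the one the error message itself suggests: alias each Dataset and refer to columns by qualified name.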

Does this PR introduce any user-facing change?

Yes, this is a bug fix.

How was this patch tested?

Added a unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

zhengruifeng · 2023-12-28T12:25:17Z

python/pyspark/ml/tests/test_als.py

+            model.write().overwrite().save(d)
+            loaded_model = ALSModel().load(d)
+
+            with self.sql_conf({"spark.sql.analyzer.failAmbiguousSelfJoin": False}):


Before this PR, this fails with `[MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION]`.

zhengruifeng · 2023-12-28T12:26:19Z

python/pyspark/ml/tests/test_als.py

+                predictions = loaded_model.transform(users.crossJoin(items))
+                self.assertTrue(predictions.count() > 0)
+
+            with self.sql_conf({"spark.sql.analyzer.failAmbiguousSelfJoin": True}):


Before this PR, this fails with `org.apache.spark.sql.AnalysisException: Column features#50, features#46 are ambiguous`.

zhengruifeng · 2023-12-28T12:40:52Z

cc @cloud-fan and @WeichenXu123

cloud-fan · 2023-12-28T12:46:30Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

-        validatedItems === itemFactors("id"), "left")
-      .select(dataset("*"),
-        predict(userFactors("features"), itemFactors("features")).as($(predictionCol)))
+      .withColumns(Map($(userCol) -> validatedUsers, $(itemCol) -> validatedItems))


Shall we use the Seq version of `withColumns`, so that the column order is deterministic?

Why do we need a `withColumns` now?

It may not be needed. I want to use `withColumns` to validate the columns first (while keeping the column names) and then reference the validated column by `s"${validatedInputAlias}.${$(itemCol)}"`.

cloud-fan · 2023-12-28T13:02:35Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

-        validatedItems === itemFactors("id"), "left")
-      .select(dataset("*"),
-        predict(userFactors("features"), itemFactors("features")).as($(predictionCol)))
+      .withColumns(Seq($(userCol), $(itemCol)), Seq(validatedUsers, validatedItems))


Oh, previously `validatedUsers` was used directly in the join condition; now we materialize it first and reference only columns in the join condition.
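The "materialize first, then join on plain columns" idea can be sketched in pure Python, without Spark. This is a minimal illustration under stated assumptions, not the code in `ALS.scala`: `validate_id` is a hypothetical stand-in for whatever the validated column expressions check, the factor tables are modeled as plain dicts keyed by `id`, the prediction is assumed to be a dot product of the two factor vectors, and a missing match on either side of the left join is assumed to yield NaN.

```python
import math
from typing import Dict, List, Optional, Sequence


def validate_id(value: float) -> int:
    """Hypothetical validation step: accept only integral ids and return them as int."""
    if value != int(value):
        raise ValueError(f"id {value!r} is not an integer")
    return int(value)


def predict(u: Optional[Sequence[float]], i: Optional[Sequence[float]]) -> float:
    """Dot product of two factor vectors; NaN when either side is missing (assumption)."""
    if u is None or i is None:
        return math.nan
    return sum(a * b for a, b in zip(u, i))


def transform(
    rows: List[Dict[str, float]],
    user_factors: Dict[int, Sequence[float]],
    item_factors: Dict[int, Sequence[float]],
) -> List[Dict[str, float]]:
    # Step 1: materialize the validated ids as ordinary columns, keeping their names.
    validated = [
        {"user": validate_id(r["user"]), "item": validate_id(r["item"])} for r in rows
    ]
    # Step 2: left-join on those plain columns; unmatched ids give None factors.
    out = []
    for r in validated:
        u = user_factors.get(r["user"])
        i = item_factors.get(r["item"])
        out.append({**r, "prediction": predict(u, i)})
    return out


user_factors = {0: [1.0, 2.0]}
item_factors = {10: [3.0, 4.0]}
rows = [{"user": 0.0, "item": 10.0}, {"user": 1.0, "item": 10.0}]
result = transform(rows, user_factors, item_factors)
```

Here the first row matches both factor tables and gets a numeric prediction, while the second row has no user factors and gets NaN. The join step only ever looks at already-materialized `user` and `item` values, never at the validation expression itself.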

zhengruifeng · 2023-12-29T01:27:59Z

Merged to master.

fix

e88be5c

fix fix fix

github-actions bot added ML BUILD PYTHON labels Dec 28, 2023

zhengruifeng added 3 commits December 28, 2023 19:59

nit

d0a6459

add config

f5109ae

add config

6d3019c

zhengruifeng commented Dec 28, 2023

zhengruifeng changed the title ~~[SPARK-46538][ML] Fix an ambiguous column reference issue in ALSModel.transform~~ [SPARK-46538][ML] Fix the ambiguous column reference issue in ALSModel.transform Dec 28, 2023

nit

26c6305

cloud-fan reviewed Dec 28, 2023

seq with columns

c28b8ef

cloud-fan reviewed Dec 28, 2023

cloud-fan approved these changes Dec 28, 2023

zhengruifeng closed this in b249cb8 Dec 29, 2023

zhengruifeng deleted the ml_als_reference branch December 29, 2023 01:28
