[SPARK-43922][SQL] Add named parameter support in parser for function calls #41796

learningchess2003 · 2023-06-29T20:48:20Z

What changes were proposed in this pull request?

This PR adds two new parser rules, namedArgumentExpression and functionArgument, to enable this feature. It also changes ASTBuilder so that, while parsing a function call, it can detect whether each argument is a named argument or a positional one.
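The distinction between named and positional arguments can be sketched in plain Python. This is only an illustration of the idea, not the grammar or the ASTBuilder code from this PR: the `Positional` and `Named` classes, the `classify_arguments` helper, and the regular expression are all hypothetical names introduced here, and the sketch assumes a named argument has the form `identifier => expression`.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Positional:
    expr: str


@dataclass
class Named:
    key: str
    expr: str


# Hypothetical pattern for `identifier => expression`; a real SQL grammar is far richer.
_NAMED = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=>\s*(.+?)\s*$", re.S)


def split_top_level(arg_list: str) -> List[str]:
    """Split on commas that are not nested in parentheses or single quotes."""
    parts, depth, in_quote, start = [], 0, False, 0
    for i, ch in enumerate(arg_list):
        if ch == "'":
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                parts.append(arg_list[start:i])
                start = i + 1
    tail = arg_list[start:]
    if tail.strip():
        parts.append(tail)
    return parts


def classify_arguments(arg_list: str) -> List[Union[Positional, Named]]:
    """Tag each argument as Named (`key => expr`) or Positional (anything else)."""
    result: List[Union[Positional, Named]] = []
    for raw in split_top_level(arg_list):
        match = _NAMED.match(raw)
        if match:
            result.append(Named(match.group(1), match.group(2)))
        else:
            result.append(Positional(raw.strip()))
    return result
```

For instance, `classify_arguments("10, b => 'x'")` yields one `Positional` entry followed by one `Named` entry with key `b`.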

Why are the changes needed?

This is part of a larger project to implement named parameter support for user defined functions, built-in functions, and table valued functions.

Does this PR introduce any user-facing change?

Yes, the user would be able to call functions with argument lists that contain named arguments.

How was this patch tested?

We add tests in PlanParserSuite that verify the parsed plan is as intended.

learningchess2003 · 2023-06-29T21:08:28Z

@MaxGekk Please let me know if the end-to-end tests were what you had in mind.

dtenedor · 2023-06-29T21:14:20Z

> @MaxGekk Please let me know if the end-to-end tests were what you had in mind.

These seem reasonable to me for this PR that only covers parser changes so far.

dtenedor

LGTM since this was ported from #41429 which I already approved.

MaxGekk

Waiting for CI.

MaxGekk · 2023-06-30T10:08:29Z

+1, LGTM. Merging to master.
Thank you, @learningchess2003, and @dtenedor for the review.

### What changes were proposed in this pull request?

Supports named arguments in Python UDTF.

For example:

```py
>>> from pyspark.sql.functions import lit, udtf
>>> @udtf(returnType="a: int")
... class TestUDTF:
...     def eval(self, a, b):
...         yield a,
...
>>> spark.udtf.register("test_udtf", TestUDTF)

>>> TestUDTF(a=lit(10), b=lit("x")).show()
+---+
|  a|
+---+
| 10|
+---+

>>> TestUDTF(b=lit("x"), a=lit(10)).show()
+---+
|  a|
+---+
| 10|
+---+

>>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x')").show()
+---+
|  a|
+---+
| 10|
+---+

>>> spark.sql("SELECT * FROM test_udtf(b=>'x', a=>10)").show()
+---+
|  a|
+---+
| 10|
+---+
```

or:

```py
>>> from pyspark.sql.types import StructField, StructType
>>> from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult
>>> @udtf
... class TestUDTF:
...     @staticmethod
...     def analyze(**kwargs: AnalyzeArgument) -> AnalyzeResult:
...         return AnalyzeResult(
...             StructType(
...                 [StructField(key, arg.data_type) for key, arg in sorted(kwargs.items())]
...             )
...         )
...     def eval(self, **kwargs):
...         yield tuple(value for _, value in sorted(kwargs.items()))
...
>>> spark.udtf.register("test_udtf", TestUDTF)

>>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x', x=>100.0)").show()
+---+---+-----+
|  a|  b|    x|
+---+---+-----+
| 10|  x|100.0|
+---+---+-----+

>>> spark.sql("SELECT * FROM test_udtf(x=>10, a=>'x', z=>100.0)").show()
+---+---+-----+
|  a|  x|    z|
+---+---+-----+
|  x| 10|100.0|
+---+---+-----+
```

### Why are the changes needed?

Now that named arguments are supported (#41796, #42020), they should also be supported in Python UDTF.

### Does this PR introduce _any_ user-facing change?

Yes, named arguments will be available for Python UDTF.

### How was this patch tested?

Added related tests.

Closes #42422 from ueshin/issues/SPARK-44749/kwargs.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
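The `**kwargs` variant of the example orders both the output schema and the output values by argument name, which is why `x=>10, a=>'x', z=>100.0` comes back as columns `a`, `x`, `z`. A minimal Spark-free sketch of that ordering follows; `column_names` and `row_values` are hypothetical helpers introduced here to mirror the `sorted(kwargs.items())` expressions in the example, not part of any PySpark API.

```python
def column_names(**kwargs):
    """Mirror the example's analyze(): one column per argument, ordered by name."""
    return [key for key, _ in sorted(kwargs.items())]


def row_values(**kwargs):
    """Mirror the example's eval(): emit the values in the same name order."""
    return tuple(value for _, value in sorted(kwargs.items()))


# The call-site order of the keyword arguments does not affect the result.
names = column_names(x=10, a="x", z=100.0)   # ['a', 'x', 'z']
values = row_values(x=10, a="x", z=100.0)    # ('x', 10, 100.0)
```

Because keyword names are unique, sorting the `(name, value)` pairs only ever compares names, so values of different types can be mixed safely.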

…andas UDFs

### What changes were proposed in this pull request?

Supports named arguments in scalar Python/Pandas UDF.

For example:

```py
>>> from pyspark.sql.functions import col, udf
>>> @udf("int")
... def test_udf(a, b):
...     return a + 10 * b
...
>>> spark.udf.register("test_udf", test_udf)

>>> spark.range(2).select(test_udf(b=col("id") * 10, a=col("id"))).show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+

>>> spark.sql("SELECT test_udf(b => id * 10, a => id) FROM range(2)").show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+
```

or:

```py
>>> from pyspark.sql.functions import col, pandas_udf
>>> @pandas_udf("int")
... def test_udf(a, b):
...     return a + 10 * b
...
>>> spark.udf.register("test_udf", test_udf)

>>> spark.range(2).select(test_udf(b=col("id") * 10, a=col("id"))).show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+

>>> spark.sql("SELECT test_udf(b => id * 10, a => id) FROM range(2)").show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+
```

### Why are the changes needed?

Now that named-argument support has been added (#41796, #42020), scalar Python/Pandas UDFs can support it.

### Does this PR introduce _any_ user-facing change?

Yes, named arguments will be available for scalar Python/Pandas UDFs.

### How was this patch tested?

Added related tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42617 from ueshin/issues/SPARK-44918/kwargs.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>

…s UDFs

### What changes were proposed in this pull request?

Supports named arguments in aggregate Pandas UDFs.

For example:

```py
>>> import pandas as pd
>>> from pyspark.sql.functions import pandas_udf
>>> @pandas_udf("double")
... def weighted_mean(v: pd.Series, w: pd.Series) -> float:
...     import numpy as np
...     return np.average(v, weights=w)
...
>>> df = spark.createDataFrame(
...     [(1, 1.0, 1.0), (1, 2.0, 2.0), (2, 3.0, 1.0), (2, 5.0, 2.0), (2, 10.0, 3.0)],
...     ("id", "v", "w"))

>>> df.groupby("id").agg(weighted_mean(v=df["v"], w=df["w"])).show()
+---+-----------------------------+
| id|weighted_mean(v => v, w => w)|
+---+-----------------------------+
|  1|           1.6666666666666667|
|  2|            7.166666666666667|
+---+-----------------------------+

>>> df.groupby("id").agg(weighted_mean(w=df["w"], v=df["v"])).show()
+---+-----------------------------+
| id|weighted_mean(w => w, v => v)|
+---+-----------------------------+
|  1|           1.6666666666666667|
|  2|            7.166666666666667|
+---+-----------------------------+
```

or with window:

```py
>>> from pyspark.sql import Window
>>> w = Window.partitionBy("id").orderBy("v").rowsBetween(-2, 1)

>>> df.withColumn("wm", weighted_mean(v=df.v, w=df.w).over(w)).show()
+---+----+---+------------------+
| id|   v|  w|                wm|
+---+----+---+------------------+
|  1| 1.0|1.0|1.6666666666666667|
|  1| 2.0|2.0|1.6666666666666667|
|  2| 3.0|1.0| 4.333333333333333|
|  2| 5.0|2.0| 7.166666666666667|
|  2|10.0|3.0| 7.166666666666667|
+---+----+---+------------------+

>>> df.withColumn("wm", weighted_mean(w=df.w, v=df.v).over(w)).show()
+---+----+---+------------------+
| id|   v|  w|                wm|
+---+----+---+------------------+
|  1| 1.0|1.0|1.6666666666666667|
|  1| 2.0|2.0|1.6666666666666667|
|  2| 3.0|1.0| 4.333333333333333|
|  2| 5.0|2.0| 7.166666666666667|
|  2|10.0|3.0| 7.166666666666667|
+---+----+---+------------------+
```

### Why are the changes needed?

Now that named-argument support has been added (#41796, #42020), aggregate Pandas UDFs can support it.

### Does this PR introduce _any_ user-facing change?

Yes, named arguments will be available for aggregate Pandas UDFs.

### How was this patch tested?

Added related tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42663 from ueshin/issues/SPARK-44952/kwargs.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
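The grouped results in the example can be checked by hand. The sketch below recomputes the per-`id` weighted means in plain Python, without Spark, pandas, or NumPy. It is an independent re-derivation under the assumption that the weighted mean is `sum(v * w) / sum(w)`; `ROWS` and `weighted_mean_by_id` are hypothetical names introduced here.

```python
from collections import defaultdict

# Same (id, v, w) rows as the DataFrame in the example.
ROWS = [(1, 1.0, 1.0), (1, 2.0, 2.0), (2, 3.0, 1.0), (2, 5.0, 2.0), (2, 10.0, 3.0)]


def weighted_mean(v, w):
    """Weighted mean of values v with weights w: sum(v * w) / sum(w)."""
    return sum(x * y for x, y in zip(v, w)) / sum(w)


def weighted_mean_by_id(rows):
    """Group rows by id, then call weighted_mean with keyword arguments."""
    groups = defaultdict(lambda: ([], []))
    for key, v, w in rows:
        groups[key][0].append(v)
        groups[key][1].append(w)
    # Keyword order is deliberately reversed; binding is by name, not position.
    return {key: weighted_mean(w=ws, v=vs) for key, (vs, ws) in groups.items()}


result = weighted_mean_by_id(ROWS)
```

Group 1 works out to `(1*1 + 2*2) / 3 = 5/3` and group 2 to `(3*1 + 5*2 + 10*3) / 6 = 43/6`, matching the `1.666...` and `7.166...` values shown for both keyword orders.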

learningchess2003 added 2 commits June 29, 2023 13:46

Initial commit

5611950

Removing weird file

a1b93f3

github-actions bot added the SQL label Jun 29, 2023

Addinh back some stuff

9f51b04

learningchess2003 mentioned this pull request Jun 29, 2023

[SPARK-43922] Add named parameter support in parser for function calls #41429

Closed

dtenedor approved these changes Jun 29, 2023

MaxGekk approved these changes Jun 30, 2023

MaxGekk changed the title ~~[SPARK-43922] Add named parameter support in parser for function calls~~ [SPARK-43922][SQL] Add named parameter support in parser for function calls Jun 30, 2023

MaxGekk closed this in 91c4581 Jun 30, 2023

ueshin mentioned this pull request Aug 11, 2023

[SPARK-44749][SQL][PYTHON] Support named arguments in Python UDTF #42422

Closed

ueshin mentioned this pull request Aug 22, 2023

[SPARK-44918][SQL][PYTHON] Support named arguments in scalar Python/Pandas UDFs #42617

Closed

ueshin mentioned this pull request Aug 24, 2023

[SPARK-44952][SQL][PYTHON] Support named arguments in aggregate Pandas UDFs #42663

Closed

learningchess2003 commented Jun 29, 2023 (edited by gatorsmile)

3 participants
