@learningchess2003 learningchess2003 commented Jun 29, 2023

What changes were proposed in this pull request?

We add two new grammar rules, namedArgumentExpression and functionArgument, to enable this feature. We also change AstBuilder so that, while parsing a function call, it can detect whether each argument is a named argument or a positional one.
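As a rough illustration of the named-versus-positional distinction (this is not the ANTLR grammar or the builder code from this PR), the Python sketch below classifies argument strings that have already been split apart. The `=>` separator matches the SQL syntax used in the examples later in this thread; `classify_argument`, `NamedArgument`, and `PositionalArgument` are hypothetical names chosen for the sketch.

```python
import re
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class PositionalArgument:
    expression: str


@dataclass(frozen=True)
class NamedArgument:
    name: str
    expression: str


# Hypothetical pattern: an identifier, then "=>", then the value expression.
_NAMED = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=>\s*(.+?)\s*$")


def classify_argument(text: str) -> Union[NamedArgument, PositionalArgument]:
    """Return a NamedArgument for `name => expr`, otherwise a PositionalArgument."""
    match = _NAMED.match(text)
    if match:
        return NamedArgument(match.group(1), match.group(2))
    return PositionalArgument(text.strip())


# A call such as f(10, b => 'x') mixes one positional and one named argument.
args = [classify_argument(a) for a in ["10", "b => 'x'"]]
print(args)
```

A real parser works on tokens rather than raw strings, so this only shows the shape of the decision, not how it is implemented.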

Why are the changes needed?

This is part of a larger project to implement named parameter support for user defined functions, built-in functions, and table valued functions.

Does this PR introduce any user-facing change?

Yes, the user would be able to call functions with argument lists that contain named arguments.

How was this patch tested?

We add tests to PlanParserSuite that verify that the parsed plan is as intended.

@github-actions github-actions bot added the SQL label Jun 29, 2023
@learningchess2003
Contributor Author

@MaxGekk Please let me know if the end-to-end tests were what you had in mind.

@dtenedor
Contributor

> @MaxGekk Please let me know if the end-to-end tests were what you had in mind.

These seem reasonable to me for this PR that only covers parser changes so far.

@dtenedor dtenedor left a comment

LGTM since this was ported from #41429 which I already approved.

@MaxGekk MaxGekk left a comment

Waiting for CI.

@MaxGekk MaxGekk changed the title from "[SPARK-43922] Add named parameter support in parser for function calls" to "[SPARK-43922][SQL] Add named parameter support in parser for function calls" Jun 30, 2023
@MaxGekk
Member

MaxGekk commented Jun 30, 2023

+1, LGTM. Merging to master.
Thank you, @learningchess2003 and @dtenedor for review.

@MaxGekk MaxGekk closed this in 91c4581 Jun 30, 2023
ueshin added a commit that referenced this pull request Aug 14, 2023
### What changes were proposed in this pull request?

Supports named arguments in Python UDTFs.

For example:

```py
>>> from pyspark.sql.functions import lit, udtf
>>> @udtf(returnType="a: int")
... class TestUDTF:
...     def eval(self, a, b):
...         yield a,
...
>>> spark.udtf.register("test_udtf", TestUDTF)

>>> TestUDTF(a=lit(10), b=lit("x")).show()
+---+
|  a|
+---+
| 10|
+---+

>>> TestUDTF(b=lit("x"), a=lit(10)).show()
+---+
|  a|
+---+
| 10|
+---+

>>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x')").show()
+---+
|  a|
+---+
| 10|
+---+

>>> spark.sql("SELECT * FROM test_udtf(b=>'x', a=>10)").show()
+---+
|  a|
+---+
| 10|
+---+
```
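All four calls return the same row because named arguments are bound to the parameters of `eval` by name, much like Python keyword arguments. A Spark-free sketch of that binding, assuming plain Python values stand in for columns:

```python
class TestUDTF:
    def eval(self, a, b):
        # Only `a` is emitted; `b` is accepted but unused, as in the example above.
        yield a,


# Keyword order does not matter: each value is matched to its parameter by name.
rows_ab = list(TestUDTF().eval(a=10, b="x"))
rows_ba = list(TestUDTF().eval(b="x", a=10))
print(rows_ab, rows_ba)
```

This only mirrors the calling convention; it says nothing about how Spark passes the arguments to the Python worker.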

or:

```py
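>>> from pyspark.sql.functions import udtf
>>> from pyspark.sql.types import StructField, StructType
>>> from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult
>>> @udtf
... class TestUDTF:
...     @staticmethod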
...     def analyze(**kwargs: AnalyzeArgument) -> AnalyzeResult:
...         return AnalyzeResult(
...             StructType(
...                 [StructField(key, arg.data_type) for key, arg in sorted(kwargs.items())]
...             )
...         )
...     def eval(self, **kwargs):
...         yield tuple(value for _, value in sorted(kwargs.items()))
...
>>> spark.udtf.register("test_udtf", TestUDTF)

>>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x', x=>100.0)").show()
+---+---+-----+
|  a|  b|    x|
+---+---+-----+
| 10|  x|100.0|
+---+---+-----+

>>> spark.sql("SELECT * FROM test_udtf(x=>10, a=>'x', z=>100.0)").show()
+---+---+-----+
|  a|  x|    z|
+---+---+-----+
|  x| 10|100.0|
+---+---+-----+
```
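In the second query the columns come out as `a, x, z` even though the call passes `x, a, z`, because both `analyze` and `eval` sort the keyword arguments by name. A minimal plain-Python sketch of that ordering (the helper names `output_columns` and `eval_rows` are hypothetical):

```python
def output_columns(**kwargs):
    # Mirrors analyze(): one output column per argument, ordered by name.
    return [key for key, _ in sorted(kwargs.items())]


def eval_rows(**kwargs):
    # Same expression as eval() above: values ordered by argument name.
    yield tuple(value for _, value in sorted(kwargs.items()))


columns = output_columns(x=10, a="x", z=100.0)
row = next(eval_rows(x=10, a="x", z=100.0))
print(columns, row)
```

Because the argument names are unique, sorting the `(name, value)` pairs never needs to compare values of different types.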

### Why are the changes needed?

Named arguments are now supported (#41796, #42020), so they should also be supported in Python UDTFs.

### Does this PR introduce _any_ user-facing change?

Yes, named arguments will be available for Python UDTFs.

### How was this patch tested?

Added related tests.

Closes #42422 from ueshin/issues/SPARK-44749/kwargs.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023
ueshin added a commit that referenced this pull request Aug 24, 2023
…andas UDFs

### What changes were proposed in this pull request?

Supports named arguments in scalar Python/Pandas UDF.

For example:

```py
>>> from pyspark.sql.functions import col, udf
>>> @udf("int")
... def test_udf(a, b):
...     return a + 10 * b
...
>>> spark.udf.register("test_udf", test_udf)

>>> spark.range(2).select(test_udf(b=col("id") * 10, a=col("id"))).show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+

>>> spark.sql("SELECT test_udf(b => id * 10, a => id) FROM range(2)").show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+
```

or:

```py
>>> from pyspark.sql.functions import col, pandas_udf
>>> @pandas_udf("int")
... def test_udf(a, b):
...     return a + 10 * b
...
>>> spark.udf.register("test_udf", test_udf)

>>> spark.range(2).select(test_udf(b=col("id") * 10, a=col("id"))).show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+

>>> spark.sql("SELECT test_udf(b => id * 10, a => id) FROM range(2)").show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+
```
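The `101` in the second row follows from binding by name: with `id = 1`, `a` is `1` and `b` is `10`, so `a + 10 * b` is `101`, even though `b` is written first in the call. A quick check in plain Python, using a hypothetical stand-in `scalar_fn` for the UDF body:

```python
def scalar_fn(a, b):
    # Same body as the UDF in the examples above.
    return a + 10 * b


# Mirror the call in the examples: b is passed first, but binds by name.
results = [scalar_fn(b=i * 10, a=i) for i in range(2)]
print(results)
```

Had the arguments been bound by position instead, the second row would be `10 + 10 * 1 = 20`, which is a simple way to tell the two behaviors apart.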

### Why are the changes needed?

Now that named-argument support has been added (#41796, #42020), scalar Python/Pandas UDFs can support it as well.

### Does this PR introduce _any_ user-facing change?

Yes, named arguments will be available for scalar Python/Pandas UDFs.

### How was this patch tested?

Added related tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42617 from ueshin/issues/SPARK-44918/kwargs.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
ueshin added a commit that referenced this pull request Sep 1, 2023
…s UDFs

### What changes were proposed in this pull request?

Supports named arguments in aggregate Pandas UDFs.

For example:

```py
>>> import pandas as pd
>>> from pyspark.sql.functions import pandas_udf
>>> @pandas_udf("double")
... def weighted_mean(v: pd.Series, w: pd.Series) -> float:
...     import numpy as np
...     return np.average(v, weights=w)
...
>>> df = spark.createDataFrame(
...     [(1, 1.0, 1.0), (1, 2.0, 2.0), (2, 3.0, 1.0), (2, 5.0, 2.0), (2, 10.0, 3.0)],
...     ("id", "v", "w"))

>>> df.groupby("id").agg(weighted_mean(v=df["v"], w=df["w"])).show()
+---+-----------------------------+
| id|weighted_mean(v => v, w => w)|
+---+-----------------------------+
|  1|           1.6666666666666667|
|  2|            7.166666666666667|
+---+-----------------------------+

>>> df.groupby("id").agg(weighted_mean(w=df["w"], v=df["v"])).show()
+---+-----------------------------+
| id|weighted_mean(w => w, v => v)|
+---+-----------------------------+
|  1|           1.6666666666666667|
|  2|            7.166666666666667|
+---+-----------------------------+
```
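The grouped results can be reproduced without Spark. The sketch below recomputes the weighted mean per `id` in plain Python, assuming the usual definition sum(v * w) / sum(w), and confirms that swapping the keyword order leaves the result unchanged:

```python
from collections import defaultdict

rows = [(1, 1.0, 1.0), (1, 2.0, 2.0), (2, 3.0, 1.0), (2, 5.0, 2.0), (2, 10.0, 3.0)]


def weighted_mean(v, w):
    # Weighted average of the values v using the weights w.
    return sum(x * y for x, y in zip(v, w)) / sum(w)


# Collect the v and w values of each id, like groupby("id").
groups = defaultdict(lambda: ([], []))
for key, v, w in rows:
    groups[key][0].append(v)
    groups[key][1].append(w)

by_id = {key: weighted_mean(v=vs, w=ws) for key, (vs, ws) in groups.items()}
swapped = {key: weighted_mean(w=ws, v=vs) for key, (vs, ws) in groups.items()}
print(by_id, swapped)
```

For `id = 1` this is 5 / 3 and for `id = 2` it is 43 / 6, which match the values shown in the tables above.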

or with window:

```py
>>> from pyspark.sql import Window
>>> w = Window.partitionBy("id").orderBy("v").rowsBetween(-2, 1)

>>> df.withColumn("wm", weighted_mean(v=df.v, w=df.w).over(w)).show()
+---+----+---+------------------+
| id|   v|  w|                wm|
+---+----+---+------------------+
|  1| 1.0|1.0|1.6666666666666667|
|  1| 2.0|2.0|1.6666666666666667|
|  2| 3.0|1.0| 4.333333333333333|
|  2| 5.0|2.0| 7.166666666666667|
|  2|10.0|3.0| 7.166666666666667|
+---+----+---+------------------+

>>> df.withColumn("wm", weighted_mean(w=df.w, v=df.v).over(w)).show()
+---+----+---+------------------+
| id|   v|  w|                wm|
+---+----+---+------------------+
|  1| 1.0|1.0|1.6666666666666667|
|  1| 2.0|2.0|1.6666666666666667|
|  2| 3.0|1.0| 4.333333333333333|
|  2| 5.0|2.0| 7.166666666666667|
|  2|10.0|3.0| 7.166666666666667|
+---+----+---+------------------+
```
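To see where the windowed values come from, the sketch below emulates the `rowsBetween(-2, 1)` frame for the `id = 2` partition in plain Python. The helper `rows_between` is hypothetical and only approximates the frame semantics for this small example (two rows before through one row after, clamped to the partition).

```python
def weighted_mean(v, w):
    # Weighted average of the values v using the weights w.
    return sum(x * y for x, y in zip(v, w)) / sum(w)


def rows_between(values, start, end):
    """Yield, for each row i, the rows i+start .. i+end clamped to the partition."""
    n = len(values)
    for i in range(n):
        lo = max(0, i + start)
        hi = min(n, i + end + 1)
        yield values[lo:hi]


# (v, w) pairs of the id = 2 partition, already ordered by v.
partition = [(3.0, 1.0), (5.0, 2.0), (10.0, 3.0)]
wm = [
    weighted_mean(v=[v for v, _ in frame], w=[w for _, w in frame])
    for frame in rows_between(partition, -2, 1)
]
print(wm)
```

The first row sees only itself and the next row, giving 13 / 3; the other two rows see the whole partition, giving 43 / 6.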

### Why are the changes needed?

Now that named-argument support has been added (#41796, #42020), aggregate Pandas UDFs can support it as well.

### Does this PR introduce _any_ user-facing change?

Yes, named arguments will be available for aggregate Pandas UDFs.

### How was this patch tested?

Added related tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42663 from ueshin/issues/SPARK-44952/kwargs.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
a0x8o added a commit to a0x8o/spark that referenced this pull request Sep 1, 2023