Upgrade to log4j 2.17 by dcirne · Pull Request #14 · viirya/spark-1

dcirne · 2021-12-17T17:51:19Z

After another issue has been found on log4j 2.16, this upgrades its
version to 2.17.

After anotherr issue has been found on log4j 2.16, this upgrades its version to 2.17.

viirya · 2021-12-20T04:02:50Z

Thanks for this PR. I was upgrading it in Spark.

…onicalized expressions ### What changes were proposed in this pull request? Make PullOutNonDeterministic use canonicalized expressions to dedup group and aggregate expressions. This affects pyspark udfs in particular. Example: ``` from pyspark.sql.functions import col, avg, udf pythonUDF = udf(lambda x: x).asNondeterministic() spark.range(10)\ .selectExpr("id", "id % 3 as value")\ .groupBy(pythonUDF(col("value")))\ .agg(avg("id"), pythonUDF(col("value")))\ .explain(extended=True) ``` Currently results in a plan like this: ``` Aggregate [_nondeterministic#15](apache#15), [_nondeterministic#15 AS dummyNondeterministicUDF(value)#12, avg(id#0L) AS avg(id)#13, dummyNondeterministicUDF(value#6L)#8 AS dummyNondeterministicUDF(value)#14](apache#15%20AS%20dummyNondeterministicUDF(value)#12,%20avg(id#0L)%20AS%20avg(id)#13,%20dummyNondeterministicUDF(value#6L)#8%20AS%20dummyNondeterministicUDF(value)#14) +- Project [id#0L, value#6L, dummyNondeterministicUDF(value#6L)#7 AS _nondeterministic#15](#0L,%20value#6L,%20dummyNondeterministicUDF(value#6L)#7%20AS%20_nondeterministic#15) +- Project [id#0L, (id#0L % cast(3 as bigint)) AS value#6L](#0L,%20(id#0L%20%%20cast(3%20as%20bigint))%20AS%20value#6L) +- Range (0, 10, step=1, splits=Some(2)) ``` and then it throws: ``` [[MISSING_AGGREGATION] The non-aggregating expression "value" is based on columns which are not participating in the GROUP BY clause. Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value(value)" if you do not care which of the values within a group is returned. SQLSTATE: 42803 ``` - how canonicalized fixes this: - nondeterministic PythonUDF expressions always have distinct resultIds per udf - The fix is to canonicalize the expressions when matching. Canonicalized means that we're setting the resultIds to -1, allowing us to dedup the PythonUDF expressions. - for deterministic UDFs, this rule does not apply and "Post Analysis" batch extracts and deduplicates the expressions, as expected ### Why are the changes needed? - the output of the query with the fix applied still makes sense - the nondeterministic UDF is invoked only once, in the project. ### Does this PR introduce _any_ user-facing change? Yes, it's additive, it enables queries to run that previously threw errors. ### How was this patch tested? - added unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52061 from benrobby/adhoc-fix-pull-out-nondeterministic. Authored-by: Ben Hurdelhey <ben.hurdelhey@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Upgrade to log4j 2.17

750e147

After anotherr issue has been found on log4j 2.16, this upgrades its version to 2.17.

github-actions bot added the BUILD label Dec 17, 2021

viirya closed this Dec 20, 2021

dcirne deleted the log4j2.17 branch December 20, 2021 17:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upgrade to log4j 2.17#14

Upgrade to log4j 2.17#14
dcirne wants to merge 1 commit intoviirya:log4j2from
dcirne:log4j2.17

dcirne commented Dec 17, 2021 •

edited

Loading

Uh oh!

viirya commented Dec 20, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dcirne commented Dec 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

viirya commented Dec 20, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dcirne commented Dec 17, 2021 •

edited

Loading