forked from apache/spark
Combine two configure jobs #8
Closed
Conversation
…cal operators
### What changes were proposed in this pull request?
1. Override `maxRowsPerPartition` in `Sort`, `Expand`, `Sample`, `CollectMetrics`.
2. Override `maxRows` in `Except`, `Expand`, `CollectMetrics`.
### Why are the changes needed?
To provide an accurate value where possible.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added test suites.
Closes apache#35250 from zhengruifeng/add_some_maxRows_maxRowsPerPartition.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
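As a rough illustration of the kind of override described in the commit above, here is a hedged, self-contained sketch using simplified stand-in types (not Catalyst's actual `LogicalPlan` classes): operators such as `Sort` that neither add nor remove rows can simply forward the child's bounds.

```scala
// Simplified stand-ins for illustration only; Catalyst's real LogicalPlan API differs in detail.
trait PlanLike {
  def maxRows: Option[Long] = None              // upper bound on total rows, if known
  def maxRowsPerPartition: Option[Long] = None  // upper bound per partition, if known
}

// A Sort-like unary operator: sorting never changes the number of rows,
// so both bounds can be forwarded from the child.
case class SortLike(child: PlanLike) extends PlanLike {
  override def maxRows: Option[Long] = child.maxRows
  override def maxRowsPerPartition: Option[Long] = child.maxRowsPerPartition
}
```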
### What changes were proposed in this pull request?
This PR aims to update ORC to version 1.7.5.
### Why are the changes needed?
ORC 1.7.5 is the latest version with the following bug fixes:
- https://orc.apache.org/news/2022/06/16/ORC-1.7.5/
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes apache#36892 from williamhyun/orc175.
Authored-by: William Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…or `HiveClientVersions`
### What changes were proposed in this pull request?
This PR aims to introduce a test environment variable `SPARK_TEST_HIVE_CLIENT_VERSIONS` to control the test target HiveClient Versions in `HiveClientVersions` trait.
### Why are the changes needed?
Currently, `HiveClientVersions` is used in three test suites.
```
$ git grep 'with HiveClientVersions'
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuites.scala:class HiveClientSuites extends SparkFunSuite with HiveClientVersions {
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientUserNameSuites.scala:class HiveClientUserNameSuites extends Suite with HiveClientVersions {
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuites.scala:class HivePartitionFilteringSuites extends Suite with HiveClientVersions {
```
### Does this PR introduce _any_ user-facing change?
No. This is a test only change.
### How was this patch tested?
Pass the CIs and manually test like the following.
```
SPARK_TEST_HIVE_CLIENT_VERSIONS='' build/sbt "hive/testOnly *.HiveClientSuites" -Phive
SPARK_TEST_HIVE_CLIENT_VERSIONS=3.1 build/sbt "hive/testOnly *.HiveClientSuites" -Phive
SPARK_TEST_HIVE_CLIENT_VERSIONS=3.0,3.1 build/sbt "hive/testOnly *.HiveClientSuites" -Phive
```
Closes apache#36894 from dongjoon-hyun/SPARK-39495.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
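A hedged sketch of how such an environment-variable override could be read in Scala; the helper name and default version list below are illustrative assumptions, not the trait's actual code.

```scala
// Hypothetical helper: an unset or empty SPARK_TEST_HIVE_CLIENT_VERSIONS falls back to the
// default list, otherwise the comma-separated versions are used.
def hiveClientTestVersions(defaults: Seq[String]): Seq[String] =
  sys.env.get("SPARK_TEST_HIVE_CLIENT_VERSIONS")
    .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSeq)
    .filter(_.nonEmpty)
    .getOrElse(defaults)

// Example usage with an illustrative default list:
val versions = hiveClientTestVersions(Seq("2.3", "3.0", "3.1"))
```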
… SBT tests
### What changes were proposed in this pull request?
This PR aims to propagate `java.net.preferIPv6Addresses=true` in SBT tests.
- This also fixes several `hive` module failures (the module downloads Hive or Hadoop dependencies).
- `Maven` will be handled separately.
### Why are the changes needed?
To run IPv6 SBT tests, we need to set this property consistently via `SBT_OPTS`.
- https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/doc-files/net-properties.html
**java.net.preferIPv6Addresses (default: false)**
> When dealing with a host which has both IPv4 and IPv6 addresses, and if IPv6 is available on the operating system, the default behavior is to prefer using IPv4 addresses over IPv6 ones. This is to ensure backward compatibility, for example applications that depend on the representation of an IPv4 address (e.g. 192.168.1.1). This property can be set to true to change that preference and use IPv6 addresses over IPv4 ones where possible, or system to preserve the order of the addresses as returned by the operating system.
### Does this PR introduce _any_ user-facing change?
This is a test-only change.
### How was this patch tested?
Pass the GitHub Action with new test cases.
1. The new test case is disabled by default or when `java.net.preferIPv6Addresses` is not `true`.
2. With `java.net.preferIPv6Addresses=true` in an IPv6 environment, the new test case passes.
```
[info] SparkSubmitUtilsSuite:
:: loading settings :: url = jar:file:/Users/dongjoon/Library/Caches/Coursier/v1/https/maven-central.storage-download.googleapis.com/maven2/org/apache/ivy/ivy/2.5.0/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
[info] - SPARK-39501: Resolve maven dependenicy in IPv6 (243 milliseconds)
[info] Run completed in 1 second, 141 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```
Closes apache#36898 from dongjoon-hyun/SPARK-39501.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
SPARK-39409 upgraded `scala-maven-plugin` to 4.6.2, but I found that there is a compilation issue when running **`mvn test` with Java 8.** The reproduction steps are as follows:
```
mvn clean install -DskipTests -pl core -am
mvn test -pl core
```
```
[ERROR] ## Exception when compiling 669 sources to /basedir/spark-mine/core/target/scala-2.12/classes
java.lang.RuntimeException: rt.jar (class sbt.internal.inc.DummyVirtualFile) is not supported
scala.sys.package$.error(package.scala:27)
sbt.internal.inc.Locate$.definesClass(Locate.scala:92)
sbt.internal.inc.Locate.definesClass(Locate.scala)
sbt_inc.SbtIncrementalCompiler$1.definesClass(SbtIncrementalCompiler.java:119)
sbt.internal.inc.Locate$.$anonfun$entry$1(Locate.scala:60)
scala.collection.Iterator$$anon$9.next(Iterator.scala:575)
scala.collection.IterableOnceOps.collectFirst(IterableOnce.scala:1079)
scala.collection.IterableOnceOps.collectFirst$(IterableOnce.scala:1071)
scala.collection.AbstractIterator.collectFirst(Iterator.scala:1288)
sbt.internal.inc.Locate$.$anonfun$entry$2(Locate.scala:67)
sbt.internal.inc.LookupImpl.lookupOnClasspath(LookupImpl.scala:51)
sbt.internal.inc.IncrementalCommon$.$anonfun$isLibraryModified$3(IncrementalCommon.scala:764)
sbt.internal.inc.IncrementalCommon$.$anonfun$isLibraryModified$3$adapted(IncrementalCommon.scala:754)
scala.collection.IterableOnceOps.exists(IterableOnce.scala:591)
scala.collection.IterableOnceOps.exists$(IterableOnce.scala:588)
scala.collection.AbstractIterable.exists(Iterable.scala:919)
sbt.internal.inc.IncrementalCommon$.isLibraryChanged$1(IncrementalCommon.scala:754)
sbt.internal.inc.IncrementalCommon$.$anonfun$isLibraryModified$1(IncrementalCommon.scala:774)
sbt.internal.inc.IncrementalCommon$.$anonfun$isLibraryModified$1$adapted(IncrementalCommon.scala:732)
scala.collection.parallel.AugmentedIterableIterator.filter2combiner(RemainsIterator.scala:136)
scala.collection.parallel.AugmentedIterableIterator.filter2combiner$(RemainsIterator.scala:133)
scala.collection.parallel.immutable.ParVector$ParVectorIterator.filter2combiner(ParVector.scala:72)
scala.collection.parallel.ParIterableLike$Filter.leaf(ParIterableLike.scala:1083)
scala.collection.parallel.Task.$anonfun$tryLeaf$1(Tasks.scala:52)
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
scala.util.control.Breaks$$anon$1.catchBreak(Breaks.scala:97)
scala.collection.parallel.Task.tryLeaf(Tasks.scala:55)
scala.collection.parallel.Task.tryLeaf$(Tasks.scala:49)
scala.collection.parallel.ParIterableLike$Filter.tryLeaf(ParIterableLike.scala:1079)
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask.internal(Tasks.scala:159)
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask.internal$(Tasks.scala:156)
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:303)
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask.compute(Tasks.scala:149)
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask.compute$(Tasks.scala:148)
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:303)
java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
java.util.concurrent.ForkJoinTask.doJoin(ForkJoinTask.java:389)
java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:719)
scala.collection.parallel.ForkJoinTasks$WrappedTask.sync(Tasks.scala:242)
scala.collection.parallel.ForkJoinTasks$WrappedTask.sync$(Tasks.scala:242)
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync(Tasks.scala:303)
scala.collection.parallel.ForkJoinTasks.executeAndWaitResult(Tasks.scala:286)
scala.collection.parallel.ForkJoinTasks.executeAndWaitResult$(Tasks.scala:279)
scala.collection.parallel.ForkJoinTaskSupport.executeAndWaitResult(TaskSupport.scala:59)
scala.collection.parallel.ExecutionContextTasks.executeAndWaitResult(Tasks.scala:409)
scala.collection.parallel.ExecutionContextTasks.executeAndWaitResult$(Tasks.scala:409)
scala.collection.parallel.ExecutionContextTaskSupport.executeAndWaitResult(TaskSupport.scala:75)
scala.collection.parallel.ParIterableLike$ResultMapping.leaf(ParIterableLike.scala:932)
scala.collection.parallel.Task.$anonfun$tryLeaf$1(Tasks.scala:52)
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
scala.util.control.Breaks$$anon$1.catchBreak(Breaks.scala:97)
scala.collection.parallel.Task.tryLeaf(Tasks.scala:55)
scala.collection.parallel.Task.tryLeaf$(Tasks.scala:49)
scala.collection.parallel.ParIterableLike$ResultMapping.tryLeaf(ParIterableLike.scala:927)
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask.compute(Tasks.scala:152)
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask.compute$(Tasks.scala:148)
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:303)
java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
```
This issue does not occur with Java 11 and 17, and downgrading `scala-maven-plugin` to 4.6.1 also avoids it, so this PR downgrades `scala-maven-plugin` to 4.6.1.
### Why are the changes needed?
Fix a compilation issue when running `mvn test` with Java 8.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GAs
Closes apache#36899 from LuciferYang/SPARK-39502.
Authored-by: yangjie01 <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Escape log content rendered to the UI.
### Why are the changes needed?
Log content may contain reserved characters or other code that is misinterpreted in the UI as HTML.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes apache#36902 from srowen/LogViewEscape.
Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
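A minimal sketch of the escaping idea in Scala, assuming scala-xml is on the classpath; the PR's actual helper and call site may differ.

```scala
import scala.xml.Utility

// Escape raw log text before embedding it in a page, so characters such as '<' and '&'
// are rendered literally instead of being interpreted as HTML.
def renderLogLine(line: String): String = Utility.escape(line)
```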
### What changes were proposed in this pull request?
Change `Inline.eval` to return a row of null values rather than a null row in the case of a null input struct.
### Why are the changes needed?
Consider the following query:
```
set spark.sql.codegen.wholeStage=false;
select inline(array(named_struct('a', 1, 'b', 2), null));
```
This query fails with a `NullPointerException`:
```
22/06/16 15:10:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122)
```
(In Spark 3.1.3, you don't need to set `spark.sql.codegen.wholeStage` to false to reproduce the error, since Spark 3.1.3 has no codegen path for `Inline`).
This query fails regardless of the setting of `spark.sql.codegen.wholeStage`:
```
val dfWide = (Seq((1))
.toDF("col0")
.selectExpr(Seq.tabulate(99)(x => s"$x as col${x + 1}"): _*))
val df = (dfWide
.selectExpr("*", "array(named_struct('a', 1, 'b', 2), null) as struct_array"))
df.selectExpr("*", "inline(struct_array)").collect
```
It fails with
```
22/06/16 15:18:55 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:80)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_8$(Unknown Source)
```
When `Inline.eval` returns a null row in the collection, GenerateExec gets a NullPointerException either when joining the null row with required child output, or projecting the null row.
This PR avoids producing the null row and produces a row of null values instead:
```
spark-sql> set spark.sql.codegen.wholeStage=false;
spark.sql.codegen.wholeStage false
Time taken: 3.095 seconds, Fetched 1 row(s)
spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
1 2
NULL NULL
Time taken: 1.214 seconds, Fetched 2 row(s)
spark-sql>
```
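A hedged, self-contained sketch of the idea behind the fix (not the actual `Inline.eval` code): when an element of the input array is a null struct, emit a row whose fields are all null, sized to the struct's arity, instead of emitting the null element itself.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow

// Replace a null struct element with a row of `numFields` null fields,
// so downstream operators never see a null row.
def nullSafeStructRow(row: InternalRow, numFields: Int): InternalRow =
  if (row == null) new GenericInternalRow(numFields) else row
```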
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New unit test.
Closes apache#36903 from bersprockets/inline_eval_null_struct_issue.
Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…erated pandas API support list documentation
### What changes were proposed in this pull request?
In the auto-generated documentation for the pandas API support list, there are cases where the link for a function property is broken, so it needs to be corrected.
The current "supported API generation" function dynamically compares the modules of `pyspark.pandas` and `pandas` to find the differences. Inherited members are also aggregated, and their links are not generated correctly (for example, `CategoricalIndex.all()` is used internally by inheriting `Index.all()`) because they do not match the pattern of each API document. So, I modified the generation to exclude methods that exist in the parent class.
### Why are the changes needed?
To link to the correct API document.
### Does this PR introduce _any_ user-facing change?
Yes, the "Supported pandas APIs" page has changed as below.
<img width="992" alt="Screen Shot 2022-06-16 at 7 54 05 PM" src="https://user-images.githubusercontent.com/7010554/174196507-ea8a2577-1e2c-44cd-9564-e7dd81f5c799.png">
### How was this patch tested?
Manually check the links in the documents; the existing doc build should pass.
Closes apache#36895 from beobest2/SPARK-39456.
Authored-by: beobest2 <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Replace the error class MISSING_COLUMN with UNRESOLVED_COLUMN with the following new text:
A column or function parameter with name <objectName> cannot be resolved. Did you mean one of the following? [<objectList>]
### Why are the changes needed?
The existing class name and text do not reflect what is really happening. The column may well exist; it may just not be within scope.
### Does this PR introduce _any_ user-facing change?
Yes, we replace an error class name.
### How was this patch tested?
Tested all affected test suites with MISSING_COLUMN.
Closes apache#36891 from srielau/SPARK-39492-Rework-MISSING_COLUMN.
Lead-authored-by: Serge Rielau <[email protected]>
Co-authored-by: Serge Rielau <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
…license`
### What changes were proposed in this pull request?
This PR aims to make the `check-license` script support IPv6 environments via `DEFAULT_ARTIFACT_REPOSITORY`.
### Why are the changes needed?
The Apache Maven Central repository has two separate URLs.
- https://repo.maven.apache.org/maven2/ (IPv4)
- https://ipv6.repo1.maven.org/maven2/ (IPv6)
`DEFAULT_ARTIFACT_REPOSITORY` allows IPv6 users to use `ipv6.repo1.maven.org` or the Google Maven Central mirror according to their needs.
### Does this PR introduce _any_ user-facing change?
No. This is a dev-only change.
### How was this patch tested?
Pass the CIs.
Closes apache#36907 from dongjoon-hyun/SPARK-39509.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This PR aims to fix `SocketAuthServer` to respect the Java IPv6 option, `java.net.preferIPv6Addresses=true`.
### Why are the changes needed?
This can be tested easily on all systems.
**BEFORE**
```
$ SPARK_LOCAL_IP=::1 SBT_OPTS=-Djava.net.preferIPv6Addresses=true build/sbt "core/testOnly *.PythonRDDSuite -- -z server"
[info] PythonRDDSuite:
[info] - python server error handling *** FAILED *** (63 milliseconds)
[info]   java.net.ConnectException: Connection refused (Connection refused)
[info]   at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
...
[info]   at java.base/java.net.Socket.<init>(Socket.java:264)
[info]   at org.apache.spark.api.python.PythonRDDSuite.$anonfun$new$3(PythonRDDSuite.scala:81)
...
[info] Run completed in 1 second, 434 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
[info] *** 1 TEST FAILED ***
```
**AFTER**
```
$ SPARK_LOCAL_IP=::1 SBT_OPTS=-Djava.net.preferIPv6Addresses=true build/sbt "core/testOnly *.PythonRDDSuite -- -z server"
[info] PythonRDDSuite:
[info] - python server error handling (75 milliseconds)
[info] Run completed in 1 second, 35 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```
### Does this PR introduce _any_ user-facing change?
In general, there is no side effect because this only affects the users who already have `java.net.preferIPv6Addresses=true`. For those users, the new behavior is correct.
### How was this patch tested?
Pass the CIs and manually run this command.
```
SPARK_LOCAL_IP=::1 SBT_OPTS=-Djava.net.preferIPv6Addresses=true build/sbt "core/testOnly *.PythonRDDSuite -- -z server"
```
Closes apache#36905 from dongjoon-hyun/SPARK-39507.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
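A hedged sketch of the underlying idea: derive the local loopback host from the JVM's IPv6 preference instead of hard-coding `127.0.0.1`. The variable names below are illustrative, not `SocketAuthServer`'s actual fields.

```scala
// If java.net.preferIPv6Addresses=true, bind/connect to the IPv6 loopback;
// otherwise keep the IPv4 loopback.
val preferIPv6: Boolean = sys.props.get("java.net.preferIPv6Addresses").contains("true")
val loopbackHost: String = if (preferIPv6) "::1" else "127.0.0.1"
```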
### What changes were proposed in this pull request?
1. Remove unused `extractInstances` and `extractLabeledPoints` in `Predictor`.
2. Remove unused `checkNonNegativeWeight` in `function`.
3. Move `getNumClasses` from `Classifier` to `DatasetUtils`.
4. Move `getNumFeatures` from `MetadataUtils` to `DatasetUtils`.
### Why are the changes needed?
To unify these methods.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test suites.
Closes apache#36049 from zhengruifeng/validate_cleanup.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…` to `SQLQueryTestSuite`
### What changes were proposed in this pull request?
Move the test from `SQLQuerySuite` to `SQLQueryTestSuite`.
### Why are the changes needed?
Make the code easier to maintain.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes apache#36910 from wangyum/SPARK-36979.
Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…t.preferIPv6Addresses` conf
### What changes were proposed in this pull request?
This PR aims to make `LauncherBackendSuite` use the `java.net.preferIPv6Addresses` conf during launching.
### Why are the changes needed?
**BEFORE**
```
$ SPARK_LOCAL_IP=::1 SBT_OPTS=-Djava.net.preferIPv6Addresses=true build/sbt "core/testOnly *.LauncherBackendSuite"
...
[info] LauncherBackendSuite:
[info] - local: launcher handle *** FAILED *** (30 seconds, 139 milliseconds)
[info]   The code passed to eventually never returned normally. Attempted 296 times over 30.078724458999996 seconds. Last failure message: The reference was null. (LauncherBackendSuite.scala:60)
...
```
**AFTER**
```
SPARK_LOCAL_IP=::1 SBT_OPTS=-Djava.net.preferIPv6Addresses=true build/sbt "core/testOnly *.LauncherBackendSuite"
...
[info] LauncherBackendSuite:
[info] - local: launcher handle (1 second, 432 milliseconds)
[info] - standalone/client: launcher handle (1 second, 600 milliseconds)
[info] Run completed in 3 seconds, 904 milliseconds.
[info] Total number of tests run: 2
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```
### Does this PR introduce _any_ user-facing change?
No, this is a test-only change.
### How was this patch tested?
Pass the CIs and do the manual test.
Closes apache#36911 from dongjoon-hyun/SPARK-39514.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
…n in PySpark
### What changes were proposed in this pull request?
This PR aims to use `IPv6` between Spark and the Python daemon in an IPv6-only system. Unlike `spark-shell`, `pyspark` starts the Python shell and `java-gateway` first. We need a new environment variable, `SPARK_PREFER_IPV6=True`, in the `pyspark` shell, like the following.
```
SPARK_PREFER_IPV6=True bin/pyspark --driver-java-options=-Djava.net.preferIPv6Addresses=true
```
### Why are the changes needed?
Currently, PySpark uses `127.0.0.1` for inter-communication between the Python daemon and the JVM.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes apache#36906 from dongjoon-hyun/SPARK-39508.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
… script missing
### What changes were proposed in this pull request?
Currently scheduled jobs fail as below (https://github.com/apache/spark/runs/6913767185?check_suite_focus=true):
```
Run build=`./dev/is-changed.py -m avro,build,catalyst,core,docker-integration-tests,examples,graphx,hadoop-cloud,hive,hive-thriftserver,kubernetes,kvstore,launcher,mesos,mllib,mllib-local,network-common,network-shuffle,pyspark-core,pyspark-ml,pyspark-mllib,pyspark-pandas,pyspark-pandas-slow,pyspark-resource,pyspark-sql,pyspark-streaming,repl,sketch,spark-ganglia-lgpl,sparkr,sql,sql-kafka-0-10,streaming,streaming-kafka-0-10,streaming-kinesis-asl,tags,unsafe,yarn`
/home/runner/work/_temp/f86503c6-6b49-448e-b4b7-ac31411a87db.sh: line 1: ./dev/is-changed.py: No such file or directory
Error: Process completed with exit code 127.
```
because the `is-changed.py` script is missing. The scheduled jobs are created by the `master` branch. This PR fixes it by explicitly setting all modules to test if the script is missing.
### Why are the changes needed?
To recover the scheduled builds in branch-3.2.
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
I manually tested the fixed shell commands locally.
Closes apache#36915 from HyukjinKwon/SPARK-39517.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
This PR creates a scheduled job for branch-3.3.
### Why are the changes needed?
To make sure branch-3.3 builds fine.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
This is a copy of branch-3.2. Should work. Also, scheduled jobs are already broken now. I will fix them in parallel to recover.
Closes apache#36914 from HyukjinKwon/SPARK-39516.
Lead-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…2` profile
### What changes were proposed in this pull request?
Building the `yarn` module with the `-Phadoop-2` profile currently fails as follows:
```
[ERROR] [Error] /basedir/spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala:454: value DECOMMISSIONING is not a member of object org.apache.hadoop.yarn.api.records.NodeState
```
The compilation error occurs because Hadoop 2.7 does not support `NodeState.DECOMMISSIONING`, so this PR changes the code to use a string comparison instead, and the test `Test YARN container decommissioning` in `YarnAllocatorSuite` only runs when `VersionUtils.isHadoop3` is true.
### Why are the changes needed?
Fix the yarn module compilation error with the `-Phadoop-2` profile.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass GA
- Manual test: run `mvn clean install -DskipTests -pl resource-managers/yarn -am -Pyarn -Phadoop-2`
**Before**
```
[ERROR] [Error] /basedir/spark-source/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala:454: value DECOMMISSIONING is not a member of object org.apache.hadoop.yarn.api.records.NodeState
[ERROR] one error found
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.4.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 3.252 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 5.735 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 5.492 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 8.251 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 6.334 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 15.326 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 4.905 s]
[INFO] Spark Project Core ................................. SUCCESS [02:07 min]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 17.382 s]
[INFO] Spark Project YARN ................................. FAILURE [ 7.718 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:22 min
[INFO] Finished at: 2022-06-20T11:57:54+08:00
[INFO] ------------------------------------------------------------------------
```
**After**
```
[INFO] Reactor Summary for Spark Project Parent POM 3.4.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 5.451 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 5.739 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 5.908 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 8.310 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 5.857 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 8.439 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 4.795 s]
[INFO] Spark Project Core ................................. SUCCESS [02:36 min]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 15.044 s]
[INFO] Spark Project YARN ................................. SUCCESS [ 32.517 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:09 min
[INFO] Finished at: 2022-06-20T13:10:04+08:00
[INFO] ------------------------------------------------------------------------
```
Run `mvn clean install -pl resource-managers/yarn -Pyarn -Phadoop-2 -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnAllocatorSuite`
```
- Test YARN container decommissioning !!! CANCELED !!!
  org.apache.spark.util.VersionUtils.isHadoop3 was false (YarnAllocatorSuite.scala:749)
Run completed in 2 seconds, 140 milliseconds.
Total number of tests run: 16
Suites: completed 2, aborted 0
Tests: succeeded 16, failed 0, canceled 6, ignored 0, pending 0
All tests passed.
```
Closes apache#36890
Closes apache#36917 from LuciferYang/SPARK-39491.
Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: Abhishek Dixit <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
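A hedged sketch of the compatibility trick described above; the helper name is illustrative and not the actual code in `YarnAllocator`.

```scala
import org.apache.hadoop.yarn.api.records.NodeState

// Compare the enum's name as a string so the code also compiles against Hadoop 2.x,
// where NodeState.DECOMMISSIONING does not exist.
def isDecommissioning(state: NodeState): Boolean = state.toString == "DECOMMISSIONING"
```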
…stead of Utils.localCanonicalHostName in tests
### What changes were proposed in this pull request?
This PR aims to use `Utils.localHostNameForURI` instead of `Utils.localCanonicalHostName` in the following suites, which changed in apache#36866:
- `MasterSuite`
- `MasterWebUISuite`
- `RocksDBBackendHistoryServerSuite`
### Why are the changes needed?
These test cases fail when we run with `SPARK_LOCAL_IP=::1` and `-Djava.net.preferIPv6Addresses=true`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass GA
- Manual test:
1. `export SPARK_LOCAL_IP=::1`
```
echo $SPARK_LOCAL_IP
::1
```
2. Add `-Djava.net.preferIPv6Addresses=true` to MAVEN_OPTS, for example:
```
diff --git a/pom.xml b/pom.xml
index 1ce3b43..3356622 100644
--- a/pom.xml
+++ b/pom.xml
@@ -2943,7 +2943,7 @@
         <include>**/*Suite.java</include>
       </includes>
       <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
-      <argLine>-ea -Xmx4g -Xss4m -XX:MaxMetaspaceSize=2g -XX:ReservedCodeCacheSize=${CodeCacheSize} ${extraJavaTestArgs} -Dio.netty.tryReflectionSetAccessible=true</argLine>
+      <argLine>-ea -Xmx4g -Xss4m -XX:MaxMetaspaceSize=2g -XX:ReservedCodeCacheSize=${CodeCacheSize} ${extraJavaTestArgs} -Dio.netty.tryReflectionSetAccessible=true -Djava.net.preferIPv6Addresses=true</argLine>
       <environmentVariables>
         <!-- Setting SPARK_DIST_CLASSPATH is a simple way to make sure any child processes
```
3. Maven test `RocksDBBackendHistoryServerSuite`, `MasterSuite` and `MasterWebUISuite`:
```
mvn clean install -DskipTests -pl core -am
mvn clean test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.deploy.history.RocksDBBackendHistoryServerSuite
mvn clean test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.deploy.master.MasterSuite
mvn clean test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.deploy.master.ui.MasterWebUISuite
```
**Before**
RocksDBBackendHistoryServerSuite:
```
- Redirect to the root page when accessed to /history/ *** FAILED ***
  java.net.ConnectException: Connection refused (Connection refused)
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
  at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
  at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
  at java.net.Socket.connect(Socket.java:613)
  at java.net.Socket.connect(Socket.java:561)
  at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
  ...
Run completed in 31 seconds, 745 milliseconds.
Total number of tests run: 73
Suites: completed 2, aborted 0
Tests: succeeded 3, failed 70, canceled 0, ignored 0, pending 0
*** 70 TESTS FAILED ***
```
MasterSuite:
```
- master/worker web ui available behind front-end reverseProxy *** FAILED ***
  The code passed to eventually never returned normally. Attempted 487 times over 50.079685917 seconds. Last failure message: Connection refused (Connection refused). (MasterSuite.scala:405)
Run completed in 3 minutes, 48 seconds.
Total number of tests run: 32
Suites: completed 2, aborted 0
Tests: succeeded 29, failed 3, canceled 0, ignored 0, pending 0
*** 3 TESTS FAILED ***
```
MasterWebUISuite:
```
- Kill multiple hosts *** FAILED ***
  java.net.ConnectException: Connection refused (Connection refused)
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
  at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
  at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
  at java.net.Socket.connect(Socket.java:613)
  at java.net.Socket.connect(Socket.java:561)
  at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
  ...
Run completed in 7 seconds, 83 milliseconds.
Total number of tests run: 4
Suites: completed 2, aborted 0
Tests: succeeded 0, failed 4, canceled 0, ignored 0, pending 0
*** 4 TESTS FAILED ***
```
**After**
RocksDBBackendHistoryServerSuite:
```
Run completed in 38 seconds, 205 milliseconds.
Total number of tests run: 73
Suites: completed 2, aborted 0
Tests: succeeded 73, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
MasterSuite:
```
Run completed in 1 minute, 10 seconds.
Total number of tests run: 32
Suites: completed 2, aborted 0
Tests: succeeded 32, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
MasterWebUISuite:
```
Run completed in 6 seconds, 330 milliseconds.
Total number of tests run: 4
Suites: completed 2, aborted 0
Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
Closes apache#36876 from LuciferYang/SPARK-39464-FOLLOWUP.
Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
EnricoMi pushed a commit that referenced this pull request on Mar 7, 2024
…n properly
### What changes were proposed in this pull request?
Make `ResolveRelations` handle plan id properly
### Why are the changes needed?
bug fix for Spark Connect, it won't affect classic Spark SQL
before this PR:
```
from pyspark.sql import functions as sf
spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")
df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")
join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index==df3.id)
join2.schema
```
fails with
```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
```
That is because the existing plan caching in `ResolveRelations` doesn't work with Spark Connect
```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
'[#12]Join LeftOuter, '`==`('index, 'id) '[#12]Join LeftOuter, '`==`('index, 'id)
!:- '[#9]UnresolvedRelation [test_table_1], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!+- '[#11]Project ['index, 'value_2] : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
! +- '[#10]Join Inner, '`==`('id, 'index) +- '[#11]Project ['index, 'value_2]
! :- '[#7]UnresolvedRelation [test_table_1], [], false +- '[#10]Join Inner, '`==`('id, 'index)
! +- '[#8]UnresolvedRelation [test_table_2], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
! : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
! +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
! +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false
Can not resolve 'id with plan 7
```
`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one
```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
+- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
```
### Does this PR introduce _any_ user-facing change?
yes, bug fix
### How was this patch tested?
added ut
### Was this patch authored or co-authored using generative AI tooling?
ci
Closes apache#45214 from zhengruifeng/connect_fix_read_join.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
EnricoMi pushed a commit that referenced this pull request on Oct 21, 2024
…plan properly
### What changes were proposed in this pull request?
Make `ResolveRelations` handle plan id properly.
Cherry-pick of bugfix apache#45214 to 3.5.
### Why are the changes needed?
bug fix for Spark Connect, it won't affect classic Spark SQL
before this PR:
```
from pyspark.sql import functions as sf
spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")
df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")
join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index==df3.id)
join2.schema
```
fails with
```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
```
That is because the existing plan caching in `ResolveRelations` doesn't work with Spark Connect
```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
'[#12]Join LeftOuter, '`==`('index, 'id) '[#12]Join LeftOuter, '`==`('index, 'id)
!:- '[#9]UnresolvedRelation [test_table_1], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!+- '[#11]Project ['index, 'value_2] : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
! +- '[#10]Join Inner, '`==`('id, 'index) +- '[#11]Project ['index, 'value_2]
! :- '[#7]UnresolvedRelation [test_table_1], [], false +- '[#10]Join Inner, '`==`('id, 'index)
! +- '[#8]UnresolvedRelation [test_table_2], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
! : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
! +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
! +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false
Can not resolve 'id with plan 7
```
`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one
```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
   +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
```
### Does this PR introduce _any_ user-facing change?
yes, bug fix
### How was this patch tested?
added ut
### Was this patch authored or co-authored using generative AI tooling?
ci
Closes apache#46291 from zhengruifeng/connect_fix_read_join_35.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
EnricoMi pushed a commit that referenced this pull request on Jan 29, 2025
This is a trivial change that replaces the loop index type from `int` to `long`. Surprisingly, microbenchmarks show a more than 2x performance uplift.
Analysis
--------
The hot loop of the `arrayEquals` method is simplified below. Loop index `i` is defined as an `int`; it's compared with `length`, which is a `long`, to determine whether the loop should end.
```
public static boolean arrayEquals(
Object leftBase, long leftOffset, Object rightBase, long rightOffset, final long length) {
......
int i = 0;
while (i <= length - 8) {
if (Platform.getLong(leftBase, leftOffset + i) !=
Platform.getLong(rightBase, rightOffset + i)) {
return false;
}
i += 8;
}
......
}
```
Strictly speaking, there's a code bug here. If `length` is greater than 2^31 + 8, this loop will never end because `i`, as a 32-bit integer, is at most 2^31 - 1. But the compiler must treat this behaviour as intentional and generate code that strictly matches the logic, which prevents it from generating optimal code.
Defining loop index `i` as `long` corrects this issue. Besides making the code logic more accurate, it allows the JIT to optimize this code much more aggressively. In the microbenchmark, this trivial change improves performance significantly on both Arm and x86 platforms.
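A hedged Scala sketch of the corrected loop (the actual patch changes the Java `arrayEquals` shown above): with a `Long` index the comparison against `length` is exact 64-bit arithmetic and the JIT is free to unroll.

```scala
import org.apache.spark.unsafe.Platform

def arrayEqualsLongIndex(
    leftBase: AnyRef, leftOffset: Long,
    rightBase: AnyRef, rightOffset: Long,
    length: Long): Boolean = {
  var i = 0L  // Long loop index: no 32-bit overflow, exact comparison with `length`
  while (i <= length - 8) {
    if (Platform.getLong(leftBase, leftOffset + i) !=
        Platform.getLong(rightBase, rightOffset + i)) {
      return false
    }
    i += 8
  }
  true  // comparison of the remaining tail bytes is omitted in this sketch
}
```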
Benchmark
---------
Source code:
https://gist.github.com/cyb70289/258e261f388e22f47e4d961431786d1a
Result on Arm Neoverse N2:
```
Benchmark Mode Cnt Score Error Units
ArrayEqualsBenchmark.arrayEqualsInt avgt 10 674.313 ± 0.213 ns/op
ArrayEqualsBenchmark.arrayEqualsLong avgt 10 313.563 ± 2.338 ns/op
```
Result on Intel Cascade Lake:
```
Benchmark Mode Cnt Score Error Units
ArrayEqualsBenchmark.arrayEqualsInt avgt 10 1130.695 ± 0.168 ns/op
ArrayEqualsBenchmark.arrayEqualsLong avgt 10 461.979 ± 0.097 ns/op
```
Deep dive
---------
Diving deep to the machine code level, we can see why the gap is so big. Listed below is the arm64 assembly generated by the OpenJDK 17 C2 compiler.
For `int i`, the machine code is close to the source code, with no deep optimization. Safepoint polling is expensive in this short loop.
```
// jit c2 machine code snippet
0x0000ffff81ba8904: mov w15, wzr // int i = 0
0x0000ffff81ba8908: nop
0x0000ffff81ba890c: nop
loop:
0x0000ffff81ba8910: ldr x10, [x13, w15, sxtw] // Platform.getLong(leftBase, leftOffset + i)
0x0000ffff81ba8914: ldr x14, [x12, w15, sxtw] // Platform.getLong(rightBase, rightOffset + i)
0x0000ffff81ba8918: cmp x10, x14
0x0000ffff81ba891c: b.ne 0x0000ffff81ba899c // return false if not equal
0x0000ffff81ba8920: ldr x14, [x28, #848] // x14 -> safepoint
0x0000ffff81ba8924: add w15, w15, #0x8 // i += 8
0x0000ffff81ba8928: ldr wzr, [x14] // safepoint polling
0x0000ffff81ba892c: sxtw x10, w15 // extend i to long
0x0000ffff81ba8930: cmp x10, x11
0x0000ffff81ba8934: b.le 0x0000ffff81ba8910 // if (i <= length - 8) goto loop
```
For `long i`, the JIT is able to do much more aggressive optimization. For example, the code snippet below unrolls the loop by four.
```
// jit c2 machine code snippet
unrolled_loop:
0x0000ffff91de6fe0: sxtw x10, w7
0x0000ffff91de6fe4: add x23, x22, x10
0x0000ffff91de6fe8: add x24, x21, x10
0x0000ffff91de6fec: ldr x13, [x23] // unroll-1
0x0000ffff91de6ff0: ldr x14, [x24]
0x0000ffff91de6ff4: cmp x13, x14
0x0000ffff91de6ff8: b.ne 0x0000ffff91de70a8
0x0000ffff91de6ffc: ldr x13, [x23, #8] // unroll-2
0x0000ffff91de7000: ldr x14, [x24, #8]
0x0000ffff91de7004: cmp x13, x14
0x0000ffff91de7008: b.ne 0x0000ffff91de70b4
0x0000ffff91de700c: ldr x13, [x23, #16] // unroll-3
0x0000ffff91de7010: ldr x14, [x24, #16]
0x0000ffff91de7014: cmp x13, x14
0x0000ffff91de7018: b.ne 0x0000ffff91de70a4
0x0000ffff91de701c: ldr x13, [x23, #24] // unroll-4
0x0000ffff91de7020: ldr x14, [x24, #24]
0x0000ffff91de7024: cmp x13, x14
0x0000ffff91de7028: b.ne 0x0000ffff91de70b0
0x0000ffff91de702c: add w7, w7, #0x20
0x0000ffff91de7030: cmp w7, w11
0x0000ffff91de7034: b.lt 0x0000ffff91de6fe0
```
### What changes were proposed in this pull request?
A trivial change to replace loop index `i` of method `arrayEquals` from `int` to `long`.
### Why are the changes needed?
To improve performance and fix a possible bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#49568 from cyb70289/arrayEquals.
Authored-by: Yibo Cai <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
EnricoMi pushed a commit that referenced this pull request on Sep 2, 2025
…onicalized expressions
### What changes were proposed in this pull request?
Make PullOutNonDeterministic use canonicalized expressions to dedup group and aggregate expressions. This affects pyspark udfs in particular. Example:
```
from pyspark.sql.functions import col, avg, udf
pythonUDF = udf(lambda x: x).asNondeterministic()
spark.range(10)\
.selectExpr("id", "id % 3 as value")\
.groupBy(pythonUDF(col("value")))\
.agg(avg("id"), pythonUDF(col("value")))\
.explain(extended=True)
```
Currently results in a plan like this:
```
Aggregate [_nondeterministic#15], [_nondeterministic#15 AS dummyNondeterministicUDF(value)#12, avg(id#0L) AS avg(id)#13, dummyNondeterministicUDF(value#6L)#8 AS dummyNondeterministicUDF(value)#14]
+- Project [id#0L, value#6L, dummyNondeterministicUDF(value#6L)#7 AS _nondeterministic#15]
   +- Project [id#0L, (id#0L % cast(3 as bigint)) AS value#6L]
      +- Range (0, 10, step=1, splits=Some(2))
```
and then it throws:
```
[MISSING_AGGREGATION] The non-aggregating expression "value" is based on columns which are not participating in the GROUP BY clause. Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value(value)" if you do not care which of the values within a group is returned. SQLSTATE: 42803
```
- how canonicalization fixes this (see the sketch after this list):
  - nondeterministic PythonUDF expressions always have distinct resultIds per udf
  - the fix is to canonicalize the expressions when matching; canonicalized means that we're setting the resultIds to -1, allowing us to dedup the PythonUDF expressions
  - for deterministic UDFs, this rule does not apply and the "Post Analysis" batch extracts and deduplicates the expressions, as expected
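A hedged sketch of the dedup idea (not the rule's actual implementation): keying the extracted aliases by the canonicalized expression lets two `PythonUDF` instances that differ only in `resultId` share one `_nondeterministic` alias.

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, NamedExpression}

// Build one alias per *canonicalized* non-deterministic expression; duplicates that differ
// only by result ids collapse onto the same map key and are therefore deduplicated.
def extractNondeterministic(exprs: Seq[Expression]): Map[Expression, NamedExpression] =
  exprs.filterNot(_.deterministic)
    .map(e => e.canonicalized -> Alias(e, "_nondeterministic")())
    .toMap
```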
### Why are the changes needed?
- the output of the query with the fix applied still makes sense - the nondeterministic UDF is invoked only once, in the project.
### Does this PR introduce _any_ user-facing change?
Yes, it's additive: it enables queries to run that previously threw errors.
### How was this patch tested?
- added unit test
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#52061 from benrobby/adhoc-fix-pull-out-nondeterministic.
Authored-by: Ben Hurdelhey <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Testing.