Conversation

@pull pull bot commented Jan 11, 2024

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

dongjoon-hyun and others added 8 commits January 10, 2024 12:27
### What changes were proposed in this pull request?

This PR aims to install `lxml` in Python 3.12.

### Why are the changes needed?

- https://github.com/apache/spark/actions/runs/7476792796/job/20348097114
```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/__w/spark/spark/python/pyspark/sql/tests/test_session.py", line 22, in <module>
    from lxml import etree
ModuleNotFoundError: No module named 'lxml'
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs; the daily `Python CI` should then pass with Python 3.12.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44666 from dongjoon-hyun/SPARK-46657.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This PR aims to upgrade the upload-artifact action from v3 to v4.
It is also a follow-up improvement after the reversal of #44438.

### Why are the changes needed?
- v4.0.0 release notes: https://github.com/actions/upload-artifact/releases/tag/v4.0.0
They have numerous performance and behavioral improvements.

- v3 VS v4: actions/upload-artifact@v3...v4.0.0

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44662 from panbingkun/SPARK-46474_FIX.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

As [promised here][1], this change loosens our Ruby dependency specification so that Bundler can update transitive dependencies more easily.

Other changes included:
- Remove the direct dependency on webrick, because Jekyll [fixed the problem][2] that caused us to add it in the first place.
- Add explanatory comments to parts of the document generation process that are not obvious.

[1]: #44628 (comment)
[2]: jekyll/jekyll#8524

We can still build our docs using Ruby 2.7, but we should push devs to install Ruby 3 since Ruby 2 is [EOL][3] and we are unable to upgrade some of our doc dependencies until we're running Ruby 3.

[3]: https://www.ruby-lang.org/en/news/2022/04/12/ruby-2-7-6-released/

### Why are the changes needed?

Make the document building process more robust to future updates coming from the Ruby ecosystem.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I built and reviewed the docs on both Ruby 2.7.8 and Ruby 3.3.0 using the following command:

```sh
SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 bundle exec jekyll build
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44667 from nchammas/SPARK-46658-ruby-deps.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

Update doc for `ignoreSurroundingSpaces`

### Why are the changes needed?

To align the documentation with the implementation.
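
For context, a minimal sketch of how the option is typically used (assuming this refers to the XML data source option of the same name; the file path and row tag below are placeholders, not from this PR):

```python
# Illustrative only: "books.xml" and the "book" row tag are placeholders.
df = (
    spark.read.format("xml")
    .option("rowTag", "book")
    .option("ignoreSurroundingSpaces", "true")  # trim whitespace around values
    .load("books.xml")
)
df.show()
```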

### Does this PR introduce _any_ user-facing change?

Yes

### How was this patch tested?

This is a doc-only change

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44671 from shujingyang-db/ignore-surronding-space-doc.

Authored-by: Shujing Yang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Upgrade `kubernetes-client` from 6.9.1 to 6.10.0
[Release notes 6.10.0](https://github.com/fabric8io/kubernetes-client/releases/tag/v6.10.0)
[Release notes 6.9.2](https://github.com/fabric8io/kubernetes-client/releases/tag/v6.9.2)

### Why are the changes needed?

[Updated okio to version 1.17.6 to avoid CVE-2023-3635](fabric8io/kubernetes-client#5587)
[Upgrade Kubernetes Model to Kubernetes v1.29.0](fabric8io/kubernetes-client#5686)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44672 from bjornjorgensen/kubclient6.10.

Authored-by: Bjørn Jørgensen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ro workers and apps

### What changes were proposed in this pull request?

This PR aims to improve `Master` to recover quickly in case of zero workers and apps.

### Why are the changes needed?

This case happens during initial cluster creation or during the cluster restart procedure.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44673 from dongjoon-hyun/SPARK-46664.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
Split `GroupbyParitySplitApplyTests`

### Why are the changes needed?
To improve testing parallelism.

this test normally takes 10 mins:
```
Starting test(python3.9): pyspark.pandas.tests.connect.groupby.test_parity_split_apply (temp output: /__w/spark/spark/python/target/fb71133e-7d03-4c9b-8a64-10e1d02d6bb6/python3.9__pyspark.pandas.tests.connect.groupby.test_parity_split_apply__6wojkexo.log)
Finished test(python3.9): pyspark.pandas.tests.connect.groupby.test_parity_split_apply (598s)
```

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44664 from zhengruifeng/ps_test_split_apply.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…lopApi retriable and fix flakiness of ThriftServerWithSparkContextInHttpSuite

### What changes were proposed in this pull request?

This PR adds a new param to `HiveThriftServer2.startWithContext` to tell the `ThriftCLIService`s whether or not to call `System.exit` when encountering errors. Currently, when developers call `HiveThriftServer2.startWithContext` and an error occurs, `System.exit` is performed, which stops the existing `SQLContext`/`SparkContext` and crashes the user app.

There is also such a use case in our tests: we intend to retry starting a Thrift server up to three times in total, but an early exit might stop the underlying SparkContext and fail the remaining attempts.

For example
https://github.com/apache/spark/actions/runs/7271496487/job/19812142981

```java
06:21:12.854 ERROR org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite: Error start hive server with Context
org.scalatest.exceptions.TestFailedException: SharedThriftServer.this.tempScratchDir.exists() was true
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
	at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.startThriftServer(SharedThriftServer.scala:151)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$beforeAll$1(SharedThriftServer.scala:59)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
06:21:12.854 ERROR org.apache.hive.service.cli.thrift.ThriftCLIService: Error starting HiveServer2: could not start ThriftBinaryCLIService
java.lang.NullPointerException: Cannot invoke "org.apache.thrift.server.TServer.serve()" because "this.server" is null
	at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:135)
	at java.base/java.lang.Thread.run(Thread.java:840)
06:21:12.941 ERROR org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite: Error start hive server with Context
java.lang.IllegalStateException: LiveListenerBus is stopped.
	at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:92)
	at org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:75)
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.createListenerAndUI(HiveThriftServer2.scala:74)
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.startWithContext(HiveThriftServer2.scala:66)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.startThriftServer(SharedThriftServer.scala:141)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$beforeAll$4(SharedThriftServer.scala:60)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
06:21:12.958 WARN org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite:

[info] org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite *** ABORTED *** (151 milliseconds)
[info]   java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
[info] This stopped SparkContext was created at:
[info]
[info] org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite.beforeAll(ThriftServerWithSparkContextSuite.scala:279)
[info] org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
[info] org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info] org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info] org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:69)
[info] org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info] org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info] sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info] java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info] java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[info] java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[info] java.base/java.lang.Thread.run(Thread.java:840)
[info]
[info] The currently active SparkContext was created at:
[info]
[info] (No active SparkContext.)
[info]   at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:122)
[info]   at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:115)
[info]   at org.apache.spark.sql.SparkSession.newSession(SparkSession.scala:274)
[info]   at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.startThriftServer(SharedThriftServer.scala:130)
```

### Why are the changes needed?

- Improve the programmability of `HiveThriftServer2.startWithContext`
- Fix flakiness in tests

### Does this PR introduce _any_ user-facing change?

No. This is a developer API change and the default behavior is unchanged.

### How was this patch tested?

Verified ThriftServerWithSparkContextInHttpSuite locally
```
18:20:02.840 ERROR org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite: A previous Hive's SessionState is leaked, aborting this retry
18:20:02.840 ERROR org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite: Error start hive server with Context
java.lang.IllegalStateException: HiveThriftServer2 started in binary mode while the test case is expecting HTTP mode
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$startThriftServer$2(SharedThriftServer.scala:149)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$startThriftServer$2$adapted(SharedThriftServer.scala:144)
	at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
	at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.startThriftServer(SharedThriftServer.scala:144)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$beforeAll$1(SharedThriftServer.scala:60)
18:20:04.114 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
18:20:04.114 WARN org.apache.hadoop.hive.metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore hzyaoqin127.0.0.1
18:20:04.119 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
[info] - the scratch dir will not be exist (1 millisecond)
[info] - SPARK-29911: Uncache cached tables when session closed (376 milliseconds)
```
### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44575 from yaooqinn/SPARK-46575.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
…ncy in test_session

### What changes were proposed in this pull request?

This PR proposes to make `lxml` an optional testing dependency in the `test_session` test.

### Why are the changes needed?

To make the tests pass without optional testing dependencies, see https://github.com/apache/spark/actions/runs/7476792796/job/20348097114

```
Starting test(python3.12): pyspark.sql.tests.test_session (temp output: /__w/spark/spark/python/target/5136bb6a-ba9d-4533-925c-adf338021fa0/python3.12__pyspark.sql.tests.test_session__yahhd4j3.log)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/__w/spark/spark/python/pyspark/sql/tests/test_session.py", line 22, in <module>
    from lxml import etree
ModuleNotFoundError: No module named 'lxml'

```
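
A minimal sketch of the optional-dependency pattern used in PySpark tests (the guard and class names here are illustrative, not the exact ones in `test_session.py`):

```python
import unittest

# Probe for the optional dependency instead of importing it unconditionally.
have_lxml = True
try:
    from lxml import etree  # noqa: F401
except ImportError:
    have_lxml = False


@unittest.skipIf(not have_lxml, "lxml is not installed")
class HtmlReprTests(unittest.TestCase):
    def test_etree_available(self):
        from lxml import etree  # only runs when lxml is installed

        self.assertIsNotNone(etree.fromstring("<root/>"))
```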

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44676 from HyukjinKwon/SPARK-46666.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
dongjoon-hyun and others added 27 commits January 12, 2024 12:18
### What changes were proposed in this pull request?

This PR aims to disable `fail-fast` in Python CI.

### Why are the changes needed?

We need to disable `fail-fast` because each Python version test pipeline is independent of the others.

Currently, one failure cancels all other running pipelines.
- https://github.com/apache/spark/actions/runs/7503864095
<img width="302" alt="Screenshot 2024-01-12 at 11 42 18 AM" src="https://github.com/apache/spark/assets/9700541/7cfa97b0-e224-468b-b6fc-d5d8d49d4489">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This should be monitored after merging.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44710 from dongjoon-hyun/SPARK-46703.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ble by `Duration` column correctly

### What changes were proposed in this pull request?

This PR aims to fix `MasterPage` to sort `Running Drivers` table by `Duration` column correctly.

### Why are the changes needed?

Since Apache Spark 3.0.0, `MasterPage` shows `Duration` column of `Running Drivers`.

**BEFORE**
<img width="111" src="https://github.com/apache/spark/assets/9700541/50276e34-01be-4474-803d-79066e06cb2c">

**AFTER**
<img width="111" src="https://github.com/apache/spark/assets/9700541/a427b2e6-eab0-4d73-9114-1d8ff9d052c2">

### Does this PR introduce _any_ user-facing change?

Yes, this is a bug fix of UI.

### How was this patch tested?

Manual.

Run a Spark standalone cluster.
```
$ SPARK_MASTER_OPTS="-Dspark.master.rest.enabled=true -Dspark.deploy.maxDrivers=2" sbin/start-master.sh
$ sbin/start-worker.sh spark://$(hostname):7077
```

Submit multiple jobs via REST API.
```
$ curl -s -k -XPOST http://localhost:6066/v1/submissions/create \
    --header "Content-Type:application/json;charset=UTF-8" \
    --data '{
      "appResource": "",
      "sparkProperties": {
        "spark.master": "spark://localhost:7077",
        "spark.app.name": "Test 1",
        "spark.submit.deployMode": "cluster",
        "spark.jars": "/Users/dongjoon/APACHE/spark-merge/examples/target/scala-2.13/jars/spark-examples_2.13-4.0.0-SNAPSHOT.jar"
      },
      "clientSparkVersion": "",
      "mainClass": "org.apache.spark.examples.SparkPi",
      "environmentVariables": {},
      "action": "CreateSubmissionRequest",
      "appArgs": [ "10000" ]
    }'
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44711 from dongjoon-hyun/SPARK-46704.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ling bytes metric

### What changes were proposed in this pull request?

This PR fixes a long-standing bug in `ShuffleExternalSorter` around the "spilled disk bytes" metric. When we close the sorter, we spill the remaining data in the buffer with the flag `isLastFile = true`. This flag means the spill will not increase the "spilled disk bytes" metric. This makes sense if the sorter has never spilled before: the final spill file will be used as the final shuffle output file, and we should keep the "spilled disk bytes" metric at 0. However, if spilling did happen before, then today we simply fail to count the final spill file in the "spilled disk bytes" metric.

This PR fixes the issue by setting that flag when closing the sorter only if this is the first spill.

### Why are the changes needed?

make metrics accurate

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44709 from cloud-fan/shuffle.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
Split `ArithmeticTests`

### Why are the changes needed?
its parity test is slow

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44708 from zhengruifeng/ps_test_split_num_arithmetic.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…logics out

### What changes were proposed in this pull request?

This PR factors Connect-specific and non-Connect-specific logic out into dedicated test classes. It is a follow-up of #40785.

### Why are the changes needed?

To avoid test failures such as #44698 (comment) caused by missing dependencies.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

CI in this PR should verify it.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44715 from HyukjinKwon/SPARK-42960-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…conversion

### What changes were proposed in this pull request?

Eliminate `HiveUtils.formatTimeVarsForHiveClient`.

### Why are the changes needed?

`HiveUtils.formatTimeVarsForHiveClient` was introduced to handle compatibility with Hive versions prior to 0.14.0. Since Spark 4.0 only supports Hive 2.0+, it is now unnecessary.

> Hive 0.14.0 introduces timeout operations in HiveConf, and changes default values of a bunch
> of time `ConfVar`s by adding time suffixes (`s`, `ms`, and `d` etc.).  This breaks backwards-
> compatibility when users are trying to connecting to a Hive metastore of lower version,
> because these options are expected to be integral values in lower versions of Hive.
>
> Here we enumerate all time `ConfVar`s and convert their values to numeric strings according
> to their output time units.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44707 from pan3793/SPARK-46697.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… or LimitAndOffset

### What changes were proposed in this pull request?

- Add LocalLimitExec to SparkStrategies in Limit + Offset cases
- Add UT

### Why are the changes needed?

Originally, the `OffsetAndLimit` and `LimitAndOffset` match cases matched and then dropped the `LocalLimit` node. This PR adds the corresponding `LocalLimitExec` node to the physical plan to improve efficiency. Note that this was not a correctness bug, since not applying `LocalLimit` only leads to larger intermediate shuffles/nodes.
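
As a rough illustration (not the exact test added here), a query shape that exercises limit + offset and where the per-partition local limit can now be seen in the physical plan:

```python
spark.range(1000).createOrReplaceTempView("t")

# With the fix, each partition is trimmed by a local limit before the
# single-partition exchange that applies the global offset and limit,
# so less data is shuffled. Inspect the physical plan to see the operators.
spark.sql("SELECT id FROM t LIMIT 10 OFFSET 5").explain()
```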

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44699 from n-young-db/limit-offset-drops-local-limit.

Authored-by: Nick Young <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
Expose the `partition_id` column of the state data source, which was previously hidden by default.

### Why are the changes needed?
The `partition_id` column is useful to users.
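
A small usage sketch (assuming the state data source's short name `statestore`; the checkpoint path is a placeholder):

```python
# "/tmp/checkpoint" is a placeholder for a real streaming checkpoint location.
state_df = spark.read.format("statestore").load("/tmp/checkpoint")

# partition_id is now a regular, visible column alongside key and value.
state_df.select("partition_id", "key", "value").show()
```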

### Does this PR introduce _any_ user-facing change?
Yes. The `partition_id` column of the state data source, which was previously hidden by default, is now exposed, and the doc is modified accordingly.

### How was this patch tested?
Modify existing integration test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44717 from chaoqin-li1123/unhide_partition_id.

Authored-by: Chaoqin Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Fix indent for streaming aggregation operator

### Why are the changes needed?
Indent/style change

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44723 from anishshri-db/task/SPARK-46712.

Authored-by: Anish Shrigondekar <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
…llback

### What changes were proposed in this pull request?
Fix RocksDB state provider race condition during rollback

### Why are the changes needed?
The rollback() method in RocksDB is not properly synchronized, thus a race condition can be introduced during rollback when there are tasks trying to commit.

The symptom of the race condition is the following exception being thrown:
```
Caused by: java.io.FileNotFoundException: No such file or directory: ...state/0/54/10369.changelog
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:4069)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3907)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3801)
	at com.databricks.common.filesystem.LokiS3FS.getFileStatusNoCache(LokiS3FS.scala:91)
	at com.databricks.common.filesystem.LokiS3FS.getFileStatus(LokiS3FS.scala:86)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1525)
```

This race condition can happen with the following sequence of events:
1. Task A gets cancelled after releasing the lock for RocksDB.
2. Task B starts and loads version 10368.
3. Task A performs a RocksDB rollback to -1.
4. Task B reads data from RocksDB and commits.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44722 from anishshri-db/task/SPARK-46711.

Authored-by: Anish Shrigondekar <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
### What changes were proposed in this pull request?

The original docstring is here: https://github.com/apache/spark/blob/5db7beb59a673f05e8f39aa9653cb0497a6c97cf/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L847-L848

This docstring was copied incompletely to various RDD methods in the PySpark API.
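
For reference, the parameter that docstring documents, in a minimal PySpark example:

```python
pairs = sc.parallelize(range(100)).map(lambda x: (x % 10, x)).partitionBy(10)

# preservesPartitioning=True tells Spark that the function keeps keys (and thus
# the partitioning) intact, so the existing partitioner is retained and a
# downstream shuffle (e.g. for reduceByKey) can be avoided.
doubled = pairs.mapPartitions(
    lambda it: ((k, v * 2) for k, v in it),
    preservesPartitioning=True,
)
```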

### Why are the changes needed?

So the docstring is complete and coherent.

### Does this PR introduce _any_ user-facing change?

Yes, it changes the public API docstring.

### How was this patch tested?

Not tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44726 from nchammas/preserves-partitioning-doc.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Pin

- `sphinxcontrib-applehelp==1.0.4`
- `sphinxcontrib-devhelp==1.0.2`
- `sphinxcontrib-htmlhelp==2.0.1`
- `sphinxcontrib-qthelp==1.0.3`
- `sphinxcontrib-serializinghtml==1.1.5`

Previously, `Install Python linter dependencies` installed `sphinxcontrib-applehelp-1.0.7`, and then `Install dependencies for documentation generation` reinstalled it with `sphinxcontrib-applehelp-1.0.4`.

Now, `Install Python linter dependencies` installs `sphinxcontrib-applehelp-1.0.8`, and `Install dependencies for documentation generation` keeps this installation:
`Requirement already satisfied: sphinxcontrib-applehelp in /usr/local/lib/python3.9/dist-packages (from sphinx==4.5.0) (1.0.8)`

### Why are the changes needed?
doc build is failing with:
```
Sphinx version error:
The sphinxcontrib.applehelp extension used by this project needs at least Sphinx v5.0; it therefore cannot be built with this version.
make: *** [Makefile:35: html] Error 2
                    ------------------------------------------------
      Jekyll 4.3.3   Please append `--trace` to the `build` command
                     for any additional information or backtrace.
                    ------------------------------------------------
/__w/spark/spark/docs/_plugins/copy_api_dirs.rb:131:in `<top (required)>': Python doc generation failed (RuntimeError)
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.3.3/lib/jekyll/external.rb:57:in `require'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.3.3/lib/jekyll/external.rb:57:in `block in require_with_graceful_fail'
```

```
Sphinx version error:
The sphinxcontrib.devhelp extension used by this project needs at least Sphinx v5.0; it therefore cannot be built with this version.
make: *** [Makefile:35: html] Error 2
                    ------------------------------------------------
      Jekyll 4.3.3   Please append `--trace` to the `build` command
                     for any additional information or backtrace.
                    ------------------------------------------------
/__w/spark/spark/docs/_plugins/copy_api_dirs.rb:131:in `<top (required)>': Python doc generation failed (RuntimeError)
```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44727 from zhengruifeng/infra_pin_sphinxcontrib-applehelp.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…ty check for Scala StreamingQueryListener

### What changes were proposed in this pull request?

This PR proposes to add functionality to perform a backward compatibility check for StreamingQueryListener in Scala, specifically for whether `onQueryIdle` is implemented or not.

### Why are the changes needed?

We missed adding a backward compatibility test when introducing onQueryIdle, which led to an issue in PySpark (https://issues.apache.org/jira/browse/SPARK-45631). We added the compatibility test in PySpark but didn't add it in Scala.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified UT.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44730 from HeartSaVioR/SPARK-46716.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…ent snapshot

### What changes were proposed in this pull request?

Some users may want to use checkpoint to get a consistent snapshot of the Dataset / RDD. This PR warns that this is not the case with lazy checkpoint, because the checkpoint is computed only at the end of the first action, and the data used during the first action may differ because of non-determinism and retries.

`doCheckpoint` is only called at the end of [SparkContext.runJob](https://github.com/apache/spark/blob/5446f548bbc8a93414f1c773a8daf714b57b7d1a/core/src/main/scala/org/apache/spark/SparkContext.scala#L2426). This may cause recomputation both of data of [local checkpoint data](https://github.com/apache/spark/blob/5446f548bbc8a93414f1c773a8daf714b57b7d1a/core/src/main/scala/org/apache/spark/rdd/LocalRDDCheckpointData.scala#L54) and [reliable checkpoint data](https://github.com/apache/spark/blob/5446f548bbc8a93414f1c773a8daf714b57b7d1a/core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala#L166) before it is finalized.
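
A small sketch of the pitfall being documented (`nondeterministic_fn` is a stand-in for any non-deterministic computation, not part of this PR):

```python
import random

def nondeterministic_fn(x):
    return (x, random.random())  # stand-in for any non-deterministic computation

rdd = sc.parallelize(range(1000)).map(nondeterministic_fn)
rdd.localCheckpoint()  # lazy: nothing is materialized yet

# The checkpoint is only finalized at the end of the first action. If tasks are
# retried during that action, the data written to the checkpoint may differ
# from what the action itself observed, so this is not a consistent snapshot.
rdd.count()
```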

### Why are the changes needed?

Document a gnarly edge case.

### Does this PR introduce _any_ user-facing change?

Yes, change to documentation of public APIs.

### How was this patch tested?

Doc only change.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43247 from juliuszsompolski/SPARK-45435-doc.

Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…as_slow`

### What changes were proposed in this pull request?
Rebalance `pyspark_pandas` and `pyspark_pandas_slow`

### Why are the changes needed?
before:
`pyspark_pandas`: `Tests passed in 1849 seconds`
`pyspark_pandas-slow`: `Tests passed in 3538 seconds`

after:
`pyspark_pandas`: `Tests passed in 2733 seconds`
`pyspark_pandas-slow`: `Tests passed in 2804 seconds`

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci, https://github.com/zhengruifeng/spark/actions/runs/7524159324/job/20478674209

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44731 from zhengruifeng/infra_rebalance_ps_test.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…map_entries`

### What changes were proposed in this pull request?
This PR refines the docstring of `map_keys`/`map_values`/`map_entries` and adds some new examples.
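
For example, the three functions on a simple map column (illustrative, not necessarily the exact examples added to the docstrings):

```python
from pyspark.sql import functions as sf

df = spark.sql("SELECT map(1, 'a', 2, 'b') AS data")
df.select(
    sf.map_keys("data"),     # [1, 2]
    sf.map_values("data"),   # [a, b]
    sf.map_entries("data"),  # [{1, a}, {2, b}]
).show(truncate=False)
```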

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass Github Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44724 from LuciferYang/SPARK-46713.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
…SessionHolder

### What changes were proposed in this pull request?

This PR makes `SparkConnectReattachExecuteHandler` fetch the `ExecuteHolder` via the `SessionHolder`, which in turn refreshes its aliveness. Further, this makes it consistent with `SparkConnectReleaseExecuteHandler`.

### Why are the changes needed?

Currently in ReattachExecute, we fetch the ExecuteHolder directly without going through the SessionHolder and hence the "aliveness" of the SessionHolder is not refreshed.

This would result in long-running queries (which do not send `ReleaseExecute` requests in specific) failing because the `SessionHolder` would expire from the cache during an active query execution.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the bug where long-running queries may fail when their corresponding `SessionHolder` expires during active query execution.

### How was this patch tested?

New unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44670 from vicennial/SPARK-46660.

Authored-by: vicennial <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Fix two typos

`deteminism` -> `determinism`

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44733 from zhengruifeng/SPARK-45435-typo.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…other DSv2 built-in Data Sources

### What changes were proposed in this pull request?

This PR refactors Python Data Source to align with other DSv2 built-in Data Sources such as CSV, Parquet, ORC, JDBC, etc.

### Why are the changes needed?

For better readability, maintenance, and consistency.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test cases should cover them.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44734 from HyukjinKwon/SPARK-46720.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Rebalance `pyspark_pandas_connect_part?`

### Why are the changes needed?
for testing parallelism

before: https://github.com/apache/spark/actions/runs/7527560858/job/20487999563
`pyspark_pandas_connect_part0`: `Tests passed in 3979 seconds`
`pyspark_pandas_connect_part1`: `Tests passed in 3585 seconds`
`pyspark_pandas_connect_part2`: `Tests passed in 2724 seconds`
`pyspark_pandas_connect_part3`: `Tests passed in 3276 seconds`

the difference is about 20 min

after:
`pyspark_pandas_connect_part0`: `Tests passed in 3516 seconds`
`pyspark_pandas_connect_part1`: `Tests passed in 3228 seconds`
`pyspark_pandas_connect_part2`: `Tests passed in 3760 seconds`
`pyspark_pandas_connect_part3`: `Tests passed in 3195 seconds`

the difference is about 5 min

### Does this PR introduce _any_ user-facing change?
no. test-only

### How was this patch tested?
ci, https://github.com/zhengruifeng/spark/actions/runs/7527236548/job/20488637410

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44741 from zhengruifeng/ps_test_rebalance_pandas_connect.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?
This adds support for the async profiler to Spark

### Why are the changes needed?

Profiling JVM applications on a cluster is cumbersome, and it can be complicated to save the profiler output, especially if the cluster is on K8s, where the executor pods are removed and any files saved to the local file system become inaccessible. This feature makes it simple to turn profiling on/off, includes the jar/binaries needed for profiling, and makes it simple to save output to an HDFS location.

### Does this PR introduce _any_ user-facing change?
This PR introduces three new configuration parameters. These are described in the documentation.

### How was this patch tested?
Tested manually on EKS

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44021 from parthchandra/async-profiler-apache-PR.

Lead-authored-by: Parth Chandra <[email protected]>
Co-authored-by: Parth Chandra <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

SPARK-45315 drops JDK 8/11 and makes 17 the default, which also makes G1 the default garbage collector; this PR updates tuning.md to use Java 17 doc links and fixes some wording about G1.

### Why are the changes needed?

doc improvements

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

doc build

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44737 from yaooqinn/SPARK-46724.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…operation only depend on interrupt thread

### What changes were proposed in this pull request?
This PR proposes to simplify the `ReloadingX509TrustManager`.

### Why are the changes needed?
Currently, closing or destroying `ReloadingX509TrustManager` depends on interrupting its thread and on the volatile variable `running`.
In fact, we can change `running` to a local variable on the stack and let the close operation of `ReloadingX509TrustManager` depend only on interrupting the thread.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA tests.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44720 from beliefer/simplify-ReloadingX509TrustManager.

Authored-by: beliefer <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

Make the addArtifact API retry on errors.

Note this is a safe operation since addArtifact is idempotent (#43314).
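
Usage stays the same on the client side; retries are handled internally. A hedged sketch (assumes a Spark Connect session; the artifact path is a placeholder):

```python
# "/tmp/extra_dep.whl" is a placeholder path. addArtifacts is idempotent, so an
# upload interrupted by a transient error can simply be retried by the client.
spark.addArtifacts("/tmp/extra_dep.whl", pyfile=True)
```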

### Why are the changes needed?

For the same reasons as we make other API retryable.

### Does this PR introduce _any_ user-facing change?

Yes

### How was this patch tested?

Added test.

Testing by hand against custom spark server.

### Was this patch authored or co-authored using generative AI tooling?

No & never

Closes #44740 from cdkrot/SPARK-46723-addartifact.

Authored-by: Alice Sayutina <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
On Spark Connect, `df.col("*")` should be resolved against the target plan

### Why are the changes needed?
```
In [6]: df1 = spark.createDataFrame([{"id": 1}])

In [7]: df2 = spark.createDataFrame([{"id": 1, "val": "v"}])

In [8]: df1.join(df2)
Out[8]: DataFrame[id: bigint, id: bigint, val: string]

In [9]: df1.join(df2).select(df1["*"])
Out[9]: DataFrame[id: bigint, id: bigint, val: string]
```

it should be
```
In [3]: df1.join(df2).select(df1["*"])
Out[3]: DataFrame[id: bigint]
```

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44689 from zhengruifeng/py_df_star.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…Sources around when cloning DataSourceManager

### What changes were proposed in this pull request?

This PR is a follow-up of #44681 that proposes to remove the logic of passing static Python Data Sources around when cloning `DataSourceManager`. They are static Data Sources, so we don't actually have to pass them around.

### Why are the changes needed?

For better readability.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Existing test cases should cover.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44743 from HyukjinKwon/SPARK-46670-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
… check for StreamingQueryListener in Spark Connect (Scala/PySpark)

### What changes were proposed in this pull request?

This PR proposes to add functionality to perform a backward compatibility check for StreamingQueryListener in Spark Connect (both Scala and PySpark), specifically for whether `onQueryIdle` is implemented or not.

### Why are the changes needed?

We missed adding a backward compatibility test when introducing onQueryIdle, which led to an issue in PySpark (https://issues.apache.org/jira/browse/SPARK-45631). We added the compatibility test in PySpark but didn't add it in Spark Connect.
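
For illustration, the kind of listener this check targets on the PySpark side: one written before `onQueryIdle` existed, which must keep working (a minimal sketch, not the actual test code):

```python
from pyspark.sql.streaming import StreamingQueryListener

class LegacyListener(StreamingQueryListener):
    # Deliberately does NOT override onQueryIdle; the compatibility check
    # ensures such pre-existing listeners still work.
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        pass

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(LegacyListener())
```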

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified UTs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44736 from HeartSaVioR/SPARK-46722.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
@pull pull bot merged commit a58362e into huangxiaopingRD:master Jan 16, 2024
huangxiaopingRD pushed a commit that referenced this pull request Feb 2, 2024
…HAVING

### What changes were proposed in this pull request?

This PR enhances the analyzer to handle the following pattern properly.

```
Sort
 - Filter
   - Aggregate
```

### Why are the changes needed?

```
spark-sql (default)> CREATE TABLE t1 (flag BOOLEAN, dt STRING);

spark-sql (default)>   SELECT LENGTH(dt),
                   >          COUNT(t1.flag)
                   >     FROM t1
                   > GROUP BY LENGTH(dt)
                   >   HAVING COUNT(t1.flag) > 1
                   > ORDER BY LENGTH(dt);
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `dt` cannot be resolved. Did you mean one of the following? [`length(dt)`, `count(flag)`].; line 6 pos 16;
'Sort ['LENGTH('dt) ASC NULLS FIRST], true
+- Filter (count(flag)#60L > cast(1 as bigint))
   +- Aggregate [length(dt#9)], [length(dt#9) AS length(dt)#59, count(flag#8) AS count(flag)#60L]
      +- SubqueryAlias spark_catalog.default.t1
         +- Relation spark_catalog.default.t1[flag#8,dt#9] parquet
```

The above code demonstrates the failure case: the query fails during the analysis phase when both `HAVING` and `ORDER BY` clauses are present, but succeeds if only one of them is present.

### Does this PR introduce _any_ user-facing change?

Yes, maybe we can call it a bugfix.

### How was this patch tested?

New UTs are added

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#44352 from pan3793/SPARK-28386.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
pull bot pushed a commit that referenced this pull request Nov 22, 2024
…ead pool

### What changes were proposed in this pull request?

This PR aims to use a meaningful class name prefix for the REST Submission API thread pool instead of the default value of Jetty QueuedThreadPool, `"qtp"+super.hashCode()`.

https://github.com/dekellum/jetty/blob/3dc0120d573816de7d6a83e2d6a97035288bdd4a/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L64

### Why are the changes needed?

This is helpful during JVM investigation.

**BEFORE (4.0.0-preview2)**

```
$ SPARK_MASTER_OPTS='-Dspark.master.rest.enabled=true' sbin/start-master.sh
$ jstack 28217 | grep qtp
"qtp1925630411-52" #52 daemon prio=5 os_prio=31 cpu=0.07ms elapsed=19.06s tid=0x0000000134906c10 nid=0xde03 runnable  [0x0000000314592000]
"qtp1925630411-53" #53 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=19.06s tid=0x0000000134ac6810 nid=0xc603 runnable  [0x000000031479e000]
"qtp1925630411-54" #54 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=19.06s tid=0x000000013491ae10 nid=0xdc03 runnable  [0x00000003149aa000]
"qtp1925630411-55" #55 daemon prio=5 os_prio=31 cpu=0.08ms elapsed=19.06s tid=0x0000000134ac9810 nid=0xc803 runnable  [0x0000000314bb6000]
"qtp1925630411-56" #56 daemon prio=5 os_prio=31 cpu=0.04ms elapsed=19.06s tid=0x0000000134ac9e10 nid=0xda03 runnable  [0x0000000314dc2000]
"qtp1925630411-57" #57 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=19.06s tid=0x0000000134aca410 nid=0xca03 runnable  [0x0000000314fce000]
"qtp1925630411-58" #58 daemon prio=5 os_prio=31 cpu=0.04ms elapsed=19.06s tid=0x0000000134acaa10 nid=0xcb03 runnable  [0x00000003151da000]
"qtp1925630411-59" #59 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=19.06s tid=0x0000000134acb010 nid=0xcc03 runnable  [0x00000003153e6000]
"qtp1925630411-60-acceptor-0108e9815-ServerConnector1e497474{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #60 daemon prio=3 os_prio=31 cpu=0.11ms elapsed=19.06s tid=0x00000001317ffa10 nid=0xcd03 runnable  [0x00000003155f2000]
"qtp1925630411-61-acceptor-11d90f2aa-ServerConnector1e497474{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #61 daemon prio=3 os_prio=31 cpu=0.10ms elapsed=19.06s tid=0x00000001314ed610 nid=0xcf03 waiting on condition  [0x00000003157fe000]
```

**AFTER**
```
$ SPARK_MASTER_OPTS='-Dspark.master.rest.enabled=true' sbin/start-master.sh
$ jstack 28317 | grep StandaloneRestServer
"StandaloneRestServer-52" #52 daemon prio=5 os_prio=31 cpu=0.09ms elapsed=60.06s tid=0x00000001284a8e10 nid=0xdb03 runnable  [0x000000032cfce000]
"StandaloneRestServer-53" #53 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284acc10 nid=0xda03 runnable  [0x000000032d1da000]
"StandaloneRestServer-54" #54 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284ae610 nid=0xd803 runnable  [0x000000032d3e6000]
"StandaloneRestServer-55" #55 daemon prio=5 os_prio=31 cpu=0.09ms elapsed=60.06s tid=0x00000001284aec10 nid=0xd703 runnable  [0x000000032d5f2000]
"StandaloneRestServer-56" #56 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284af210 nid=0xc803 runnable  [0x000000032d7fe000]
"StandaloneRestServer-57" #57 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284af810 nid=0xc903 runnable  [0x000000032da0a000]
"StandaloneRestServer-58" #58 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284afe10 nid=0xcb03 runnable  [0x000000032dc16000]
"StandaloneRestServer-59" #59 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284b0410 nid=0xcc03 runnable  [0x000000032de22000]
"StandaloneRestServer-60-acceptor-04aefbaa8-ServerConnector44284d85{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #60 daemon prio=3 os_prio=31 cpu=0.13ms elapsed=60.05s tid=0x000000015cda1a10 nid=0xcd03 runnable  [0x000000032e02e000]
"StandaloneRestServer-61-acceptor-148976251-ServerConnector44284d85{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #61 daemon prio=3 os_prio=31 cpu=0.12ms elapsed=60.05s tid=0x000000015cd1c810 nid=0xce03 waiting on condition  [0x000000032e23a000]
```

### Does this PR introduce _any_ user-facing change?

No, the thread names are only visible during debugging.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#48924 from dongjoon-hyun/SPARK-50385.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: panbingkun <[email protected]>