
Conversation


@pull pull bot commented Dec 28, 2023

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

dongjoon-hyun and others added 9 commits December 28, 2023 09:23
### What changes were proposed in this pull request?

This PR aims to update `zstd-jni` to 1.5.5-11.

### Why are the changes needed?

This version has a few bug fixes and the following improvement.
- luben/zstd-jni#287

Full commit list:
- luben/zstd-jni@v1.5.5-10...v1.5.5-11

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44515 from dongjoon-hyun/SPARK-46528.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…istinct/array_compact`

### What changes were proposed in this pull request?
This PR refines the docstrings of `array_remove/array_distinct/array_compact` and adds some new examples.
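
For reference, a minimal usage example of these functions (illustrative only, not part of this PR's diff):

```python
from pyspark.sql import functions as sf

df = spark.createDataFrame([([1, 2, 2, None, 3],)], ["data"])
df.select(
    sf.array_remove("data", 2).alias("removed"),   # drop all occurrences of 2
    sf.array_distinct("data").alias("distinct"),   # de-duplicate elements
    sf.array_compact("data").alias("compacted"),   # drop null elements
).show(truncate=False)
```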

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44506 from LuciferYang/SPARK-46521.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

This PR enables the assertion in `HiveMetastoreLazyInitializationSuite`.

### Why are the changes needed?

Fix the test's intention.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Pass `HiveMetastoreLazyInitializationSuite`.

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44500 from yaooqinn/SPARK-46514.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
…ce on startup

### What changes were proposed in this pull request?

This PR proposes to add support for automatic Python Data Source registration.

**End user perspective:**

```bash
# Assume that `customsource` defines `custom` as its short name
pip install pyspark_customsource
```

Users can then directly use the Python Data Source:

```python
df = spark.read.format("custom").load()
```

**Developer perspective:**

The package should follow the structure below:
- The package name should start with the `pyspark_` prefix
- A `pyspark_*.DefaultSource` class has to be defined that inherits from `pyspark.sql.datasource.DataSource`

For example:

```
pyspark_customsource
├── __init__.py
 ...
```

`__init__.py`:

```python
from pyspark.sql.datasource import DataSource

class DefaultSource(DataSource):
    pass
```

### Why are the changes needed?

This allows developers to release and maintain their third-party Python Data Sources separately (e.g., on PyPI), and end users can install a Python Data Source without doing anything other than `pip install pyspark_their_source`.

### Does this PR introduce _any_ user-facing change?

Yes, this allows users to `pip install pyspark_custom_source` and have it automatically registered as a Data Source available in Spark.

### How was this patch tested?

Unittests were added.

Also manually tested as below:

```bash
rm -fr pyspark_mysource
mkdir pyspark_mysource
cd pyspark_mysource
echo '
from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition

class TestDataSourceReader(DataSourceReader):
    def __init__(self, options):
        self.options = options
    def partitions(self):
        return [InputPartition(i) for i in range(3)]
    def read(self, partition):
        yield partition.value, str(partition.value)

class DefaultSource(DataSource):
    @classmethod
    def name(cls):
        return "mysource"
    def schema(self):
        return "x INT, y STRING"
    def reader(self, schema) -> "DataSourceReader":
        return TestDataSourceReader(self.options)
' > __init__.py
cd ..
./bin/pyspark
```

```python
spark.read.format("mysource").load().show()
```

```
+---+---+
|  x|  y|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
+---+---+
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44504 from HyukjinKwon/SPARK-45917.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
… out `test_loc2d*`

### What changes were proposed in this pull request?
1, factor out `test_loc2d*`;
2, add the missing parity tests;

### Why are the changes needed?
test parity and testing parallelism

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44518 from zhengruifeng/ps_test_indexing_loc2d.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…gration-tests`

### What changes were proposed in this pull request?

This PR upgrades guava from 18.0 to 19.0 for docker-integration-tests as preparation for #44509

### Why are the changes needed?

Requested by dongjoon-hyun as a separate PR at https://github.com/apache/spark/pull/44509/files#r1437220177

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing docker integration tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44517 from yaooqinn/SPARK-46529.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
…expressions

### What changes were proposed in this pull request?

Prior to this change, the `e BETWEEN lower AND upper` expression was transformed into `lower <= e && e <= upper`. This means that `e` would be evaluated twice, which is problematic from both correctness and performance perspectives.

The suggested fix is to use the `WITH` expression that was introduced in [this](01c294b) change.

### Why are the changes needed?

The current implementation is not correct for non-deterministic expressions, since the two evaluations might return different results.
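
For illustration only (not from this PR's diff), a hedged example where the double evaluation matters because `rand()` is non-deterministic:

```python
# With the old rewrite, `rand() BETWEEN 0.2 AND 0.8` becomes
# `0.2 <= rand() AND rand() <= 0.8`, so the two comparisons may observe
# different values; a WITH-style rewrite evaluates rand() only once.
df = spark.range(10).selectExpr("id", "rand() BETWEEN 0.2 AND 0.8 AS in_range")
df.explain(extended=True)  # inspect how BETWEEN is expanded in the plan
```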

### Does this PR introduce _any_ user-facing change?

With this change, the generated plan for a BETWEEN expression will be different. An example of the generated plan is provided in the tests.

### How was this patch tested?

Existing tests plus new test in PlanGenerationTestSuite.

### Was this patch authored or co-authored using generative AI tooling?

Yes.

Closes #44299 from dbatomic/between_expression_v2.

Lead-authored-by: Aleksandar Tomic <[email protected]>
Co-authored-by: Aleksandar Tomic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…n` file

### What changes were proposed in this pull request?
The pr aims to:
- Clear unused error classes from `error-classes.json`.
- Delete the unused method `dataSourceAlreadyExists` in `QueryCompilationErrors.scala`
- Fix an outdated comment.

### Why are the changes needed?
Make the code cleaner.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44503 from panbingkun/SPARK-46519.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…nfo`

### What changes were proposed in this pull request?
In this PR, I propose to put the message parameters together with the error class in the `messageParameter` field in the metadata of `ErrorInfo`.

### Why are the changes needed?
To be able to create an error from an error class and message parameters. Before the changes, it is not possible to re-construct an error having only an error class.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the modified test:
```
$ build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44468 from MaxGekk/messageParameters-in-metadata.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…o `pyspark.pandas.tests.indexes.*` and add the parity test

### What changes were proposed in this pull request?
This is the last PR to reorganize `IndexingTest`:
1, move it to `pyspark.pandas.tests.indexes.*`;
2, add the missing parity test

### Why are the changes needed?
test parity and testing parallelism

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44520 from zhengruifeng/ps_test_xxx.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…errors

### What changes were proposed in this pull request?
In the PR, I propose to handle NPE and asserts from eagerly executed commands, and convert them to internal errors.

### Why are the changes needed?
To unify the approach for errors raised by Spark SQL.

### Does this PR introduce _any_ user-facing change?
Yes.

Before the changes:
```
Cannot invoke "org.apache.spark.sql.connector.read.colstats.ColumnStatistics.min()" because the return value of "scala.Option.get()" is null
java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.connector.read.colstats.ColumnStatistics.min()" because the return value of "scala.Option.get()" is null
	at org.apache.spark.sql.execution.datasources.v2.DescribeColumnExec.run(DescribeColumnExec.scala:63)
```

After:
```
org.apache.spark.SparkException: [INTERNAL_ERROR] Eagerly executed command failed. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace. SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:107)
...
Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.connector.read.colstats.ColumnStatistics.min()" because the return value of "scala.Option.get()" is null
	at org.apache.spark.sql.execution.datasources.v2.DescribeColumnExec.run(DescribeColumnExec.scala:63)
```

### How was this patch tested?
Manually, by running the test from another PR: #44524

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44525 from MaxGekk/internal-error-eagerlyExecuteCommands.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
Zouxxyy and others added 4 commits December 28, 2023 19:57
…l stats

### What changes were proposed in this pull request?

### Why are the changes needed?

Currently, executing `DESCRIBE TABLE EXTENDED` on a column without column stats against a v2 table throws a null pointer exception.

```text
Cannot invoke "org.apache.spark.sql.connector.read.colstats.ColumnStatistics.min()" because the return value of "scala.Option.get()" is null
java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.connector.read.colstats.ColumnStatistics.min()" because the return value of "scala.Option.get()" is null
	at org.apache.spark.sql.execution.datasources.v2.DescribeColumnExec.run(DescribeColumnExec.scala:63)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:118)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:150)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:241)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:116)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:918)
```

This PR fixes it.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Add a new test `describe extended (formatted) a column without col stats`

### Was this patch authored or co-authored using generative AI tooling?

Closes #44524 from Zouxxyy/dev/fix-stats.

Lead-authored-by: zouxxyy <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
### What changes were proposed in this pull request?

In XML, elements typically consist of a name and a value, with the value enclosed between the opening and closing tags. But XML also allows arbitrary values to be interspersed between these elements. To address this, we provide an option named `valueTags`, which is enabled by default, to capture these values. Consider the following example:

```
<ROW>
    <a>1</a>
  value1
  <b>
    value2
    <c>2</c>
    value3
  </b>
</ROW>
```
In this example, `<a>`, `<b>`, and `<c>` are named elements with their respective values enclosed within tags. There are arbitrary values `value1`, `value2`, and `value3` interspersed between the elements. Please note that there can be multiple occurrences of values in a single element (i.e. `value2` and `value3` appear in the element `<b>`).

We should parse the values between tags into the valueTags field. If there are multiple occurrences of value tags, the value tag field will be converted to an array type.
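
A hedged sketch of how this might look from the user side (not part of this PR's diff; the exact name of the captured value field follows the `valueTags` description above and may differ in the implementation):

```python
# Write the sample XML to a temporary file and read it back with the
# built-in XML data source; interspersed values are captured by default.
sample = """<ROOT>
  <ROW>
    <a>1</a>
  value1
  <b>
    value2
    <c>2</c>
    value3
  </b>
  </ROW>
</ROOT>"""

with open("/tmp/value_tags_sample.xml", "w") as f:
    f.write(sample)

df = spark.read.format("xml").option("rowTag", "ROW").load("/tmp/value_tags_sample.xml")
df.printSchema()  # value2/value3 under <b> should surface as an array-typed value field
```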

### Why are the changes needed?

We should parse these values; otherwise there would be data loss.

### Does this PR introduce _any_ user-facing change?

Yes

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44318 from shujingyang-db/capture-values.

Lead-authored-by: Shujing Yang <[email protected]>
Co-authored-by: Shujing Yang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
… keep the plan id

### What changes were proposed in this pull request?
1, make the following helper functions keep the plan id during transformation:

- `resolveOperatorsDownWithPruning`
- `resolveOperatorsUpWithNewOutput`

2, change the way to keep plan id in `ResolveNaturalAndUsingJoin`:
before:
```
Project <- tag hiddenOutputTag
  - Join <- tag PLAN_ID_TAG
```

after:
```
Project <- tag hiddenOutputTag & PLAN_ID_TAG
  - Join
```

3, to verify this fix, this PR also reverts the previous tag-copying changes in the rules

### Why are the changes needed?
We had made the following rules keep the plan id:
1, `ResolveNaturalAndUsingJoin` in 167bbca
- using `resolveOperatorsUpWithPruning`; it sets the tag `Project.hiddenOutputTag` internally, so `copyTagsFrom` (which only works if `tags.isEmpty`) in `resolveOperatorsUpWithPruning` has no effect

2, `ExtractWindowExpressions` in 185a0a5
- using `resolveOperatorsDownWithPruning`, which doesn't copy tags

3, `WidenSetOperationTypes` in 17c206f
- using `resolveOperatorsUpWithNewOutput -> transformUpWithNewOutput`, which doesn't copy tags

4, `ResolvePivot` in 1a89bdc
- using `resolveOperatorsWithPruning -> resolveOperatorsDownWithPruning`, which doesn't copy tags

5, `CTESubstitution` in 79d1cde
- using both `resolveOperatorsDownWithPruning` and `resolveOperatorsUp -> resolveOperatorsUpWithPruning`, the former doesn't copy tags

But the missing plan id issue still keeps popping up (see #44454), so this PR attempts to cover more cases by fixing the helper functions that are used to build the rules

6, `ResolveUnpivot`
- using `resolveOperatorsWithPruning -> resolveOperatorsDownWithPruning`, which doesn't copy tags

7, `UnpivotCoercion`
- using `resolveOperators -> resolveOperatorsWithPruning -> resolveOperatorsDownWithPruning`, which doesn't copy tags

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44462 from zhengruifeng/sql_res_op_keep.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…el.transform`

### What changes were proposed in this pull request?
The column references in `ALSModel.transform` may be ambiguous in some cases.
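
For context, a minimal `ALSModel.transform` usage (illustrative only; it does not necessarily reproduce the ambiguity, which depends on how the input columns overlap with the internal factor tables):

```python
from pyspark.ml.recommendation import ALS

ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0)],
    ["user", "item", "rating"],
)
model = ALS(userCol="user", itemCol="item", ratingCol="rating", rank=2, maxIter=2).fit(ratings)
# transform() internally joins the input against the learned user/item factor
# columns (named "features"), which is where ambiguous references can arise.
model.transform(ratings.select("user", "item")).show()
```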

### Why are the changes needed?
to fix a bug

before this fix, the test fails with:
```
JVM stacktrace:
org.apache.spark.sql.AnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] Resolved attribute(s) "features", "features" missing from "user", "item", "id", "features", "id", "features" in operator !Project [user#60, item#63, UDF(features#50, features#54) AS prediction#94]. Attribute(s) with the same name appear in the operation: "features", "features".
Please check if the right attribute(s) are used. SQLSTATE: XX000;
```

and

```

pyspark.errors.exceptions.captured.AnalysisException: Column features#50, features#46 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.

JVM stacktrace:
org.apache.spark.sql.AnalysisException: Column features#50, features#46 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
```

### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44526 from zhengruifeng/ml_als_reference.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
@github-actions github-actions bot added the ML label Dec 29, 2023
LuciferYang and others added 6 commits December 29, 2023 15:02
…array_size/array_repeat`

### What changes were proposed in this pull request?
This PR refines the docstrings of `array_min/array_max/array_size/array_repeat` and adds some new examples.
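
For reference, a minimal usage example of these functions (illustrative only, not part of this PR's diff):

```python
from pyspark.sql import functions as sf

df = spark.createDataFrame([([3, 1, 2],)], ["data"])
df.select(
    sf.array_min("data").alias("min"),                   # 1
    sf.array_max("data").alias("max"),                   # 3
    sf.array_size("data").alias("size"),                 # 3
    sf.array_repeat(sf.lit("ab"), 2).alias("repeated"),  # ["ab", "ab"]
).show()
```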

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44522 from LuciferYang/SPARK-46533.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
…alueError` for invalid `numBits`

### What changes were proposed in this pull request?
The function `sha2` should raise `PySparkValueError` for an invalid `numBits`.

### Why are the changes needed?
Vanilla PySpark invokes the Scala side and raises an `IllegalArgumentException`:
https://github.com/apache/spark/blob/fa4096eb6aba4c66f0d9c5dcbabdfc0804064fff/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3212-L3217

while the Python (Spark Connect) client won't do this check and raises an `AnalysisException`.

They should both raise a `PySparkValueError` for this case.

### Does this PR introduce _any_ user-facing change?
yes

```
In [1]: from pyspark.sql import functions as sf
   ...: spark.range(1).select(sf.sha2(sf.col("id"), 1024)).collect()
---------------------------------------------------------------------------
PySparkValueError                         Traceback (most recent call last)
<ipython-input-1-1ae9879dcc31> in ?()
      1 from pyspark.sql import functions as sf
----> 2 spark.range(1).select(sf.sha2(sf.col("id"), 1024)).collect()

~/Dev/spark/python/pyspark/sql/utils.py in ?(*args, **kwargs)
    190             from pyspark.sql.connect import functions
    191
    192             return getattr(functions, f.__name__)(*args, **kwargs)
    193         else:
--> 194             return f(*args, **kwargs)

~/Dev/spark/python/pyspark/sql/functions/builtin.py in ?(col, numBits)
   9112     |Bob  |cd9fb1e148ccd8442e5aa74904cc73bf6fb54d1d54d333bd596aa9bb4bb4e961|
   9113     +-----+----------------------------------------------------------------+
   9114     """
   9115     if numBits not in [0, 224, 256, 384, 512]:
-> 9116         raise PySparkValueError(
   9117             error_class="VALUE_NOT_ALLOWED",
   9118             message_parameters={
   9119                 "arg_name": "numBits",

PySparkValueError: [VALUE_NOT_ALLOWED] Value for `numBits` has to be amongst the following values: [0, 224, 256, 384, 512].
```

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44529 from zhengruifeng/py_connect_sha2_check.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…urce write

### What changes were proposed in this pull request?

This PR introduces support for the commit and abort APIs for Python data source write. After this PR, users can customize their implementations for committing and aborting write operations.
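
As a rough sketch of what a writer using these hooks might look like (assumed shapes based on `pyspark.sql.datasource`; this is not code from the PR):

```python
from pyspark.sql.datasource import DataSourceWriter, WriterCommitMessage

class MyDataSourceWriter(DataSourceWriter):
    def write(self, iterator):
        # Runs on executors: persist the rows of one partition, then return
        # a commit message describing what was written.
        _ = sum(1 for _ in iterator)
        return WriterCommitMessage()

    def commit(self, messages):
        # Runs once on the driver after all partitions succeed:
        # finalize the output (e.g., publish staged files).
        pass

    def abort(self, messages):
        # Runs on the driver if any task fails: clean up partial output.
        pass
```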

### Why are the changes needed?

To support Python data source.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44497 from allisonwang-db/spark-45914-commit-abort.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…Python Spark Connect client

### What changes were proposed in this pull request?

This PR is a followup of #44468 that addresses the additional metadata in Python Spark Connect client.

### Why are the changes needed?

For feature parity.

### Does this PR introduce _any_ user-facing change?

Yes, when `spark.sql.connect.enrichError.enabled` is disabled, users are still able to get the message parameters.

### How was this patch tested?

Unittest was added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44528 from HyukjinKwon/SPARK-46532-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…ailable Data Sources

### What changes were proposed in this pull request?

This PR is a sort of followup of #44504 but addresses a separate issue. This PR proposes to check:
- if Python executable exists when looking up available Python Data Sources.
- if PySpark source and Py4J files exist - for the case users don't have them in their machine (and don't use PySpark).

### Why are the changes needed?

On some OSes such as Windows, or in minimized Docker containers, Python is not installed, and the lookup would just fail even when users want to use Scala only. We should check for the Python executable and skip the lookup if it does not exist.

### Does this PR introduce _any_ user-facing change?

No because the main change has not been released out yet.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44519 from HyukjinKwon/SPARK-46530.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?
This PR aims to replace `.reverse.find` with `.findLast` in the Spark code.

### Why are the changes needed?
1. `.findLast` seems more concise.
2. `.reverse` involves a copy of the collection, while `StringOps/ArrayOps.findLast` wraps the input data in a `ReverseIterator` to avoid copying the collection.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44495 from LuciferYang/find-last.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
LuciferYang and others added 2 commits December 29, 2023 17:18
…tils#needsEscaping` as it is always true

### What changes were proposed in this pull request?
This PR just removes the check for `c >= 0` from `ExternalCatalogUtils#needsEscaping`, since it is always true: the numerical range of Scala's `Char` type is that of an unsigned integer.
```
scala> Char.char2long(Char.MinValue)
res2: Long = 0

scala> Char.char2long(Char.MaxValue)
res1: Long = 65535
```

### Why are the changes needed?
Remove constant condition

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44533 from LuciferYang/SPARK-46542.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
Pin 'lxml==4.9.4'

### Why are the changes needed?
It seems the newly released lxml 5.0.0 breaks the CI (the `Install Python packages (Python 3.9)` step for the Spark SQL tests):

```
Collecting lxml (from unittest-xml-reporting)
  Downloading lxml-5.0.0.tar.gz (3.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.8/3.8 MB 17.4 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [4 lines of output]
      <string>:67: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
      Building lxml version 5.0.0.
      Building without Cython.
      Error: Please make sure the libxml2 and libxslt development packages are installed.
      [end of output]
```

in the latest successful run, the version is 4.9.4:
```
Package                  Version
------------------------ ------------
googleapis-common-protos 1.62.0
grpcio                   1.59.3
grpcio-status            1.59.3
lxml                     4.9.4
numpy                    1.26.2
pandas                   2.1.4
pip                      23.0.1
protobuf                 4.25.1
pyarrow                  14.0.2
python-dateutil          2.8.2
pytz                     2023.3.post1
scipy                    1.11.4
setuptools               58.1.0
six                      1.16.0
tzdata                   2023.3
unittest-xml-reporting   3.2.0
```

`unittest-xml-reporting` requires `lxml`, but without a pinned version:

```
name                    summary
----------------------  ------------------------------------------------------------------------------------------------
unittest-xml-reporting  unittest-based test runner with Ant/JUnit like XML reporting.
└── lxml                Powerful and Pythonic XML processing library combining libxml2/libxslt with the ElementTree API.
```

### Does this PR introduce _any_ user-facing change?
no, infra only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44539 from zhengruifeng/infra_pin_lxml.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@github-actions github-actions bot added the INFRA label Dec 30, 2023
…java` from `catalyst` to `parent`

### What changes were proposed in this pull request?
This PR moves the dependency management of `datasketches-java` from `catalyst/pom.xml` to `parent/pom.xml` and defines a new property to manage its version.

### Why are the changes needed?
The management of dependencies with non-special versions should be placed in the parent `pom.xml`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44521 from LuciferYang/SPARK-46531.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@pull pull bot merged commit af3a225 into huangxiaopingRD:master Dec 30, 2023
pull bot pushed a commit that referenced this pull request Nov 22, 2024
…ead pool

### What changes were proposed in this pull request?

This PR aims to use a meaningful class name prefix for REST Submission API thread pool instead of the default value of Jetty QueuedThreadPool, `"qtp"+super.hashCode()`.

https://github.com/dekellum/jetty/blob/3dc0120d573816de7d6a83e2d6a97035288bdd4a/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L64

### Why are the changes needed?

This is helpful during JVM investigation.

**BEFORE (4.0.0-preview2)**

```
$ SPARK_MASTER_OPTS='-Dspark.master.rest.enabled=true' sbin/start-master.sh
$ jstack 28217 | grep qtp
"qtp1925630411-52" #52 daemon prio=5 os_prio=31 cpu=0.07ms elapsed=19.06s tid=0x0000000134906c10 nid=0xde03 runnable  [0x0000000314592000]
"qtp1925630411-53" #53 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=19.06s tid=0x0000000134ac6810 nid=0xc603 runnable  [0x000000031479e000]
"qtp1925630411-54" #54 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=19.06s tid=0x000000013491ae10 nid=0xdc03 runnable  [0x00000003149aa000]
"qtp1925630411-55" #55 daemon prio=5 os_prio=31 cpu=0.08ms elapsed=19.06s tid=0x0000000134ac9810 nid=0xc803 runnable  [0x0000000314bb6000]
"qtp1925630411-56" #56 daemon prio=5 os_prio=31 cpu=0.04ms elapsed=19.06s tid=0x0000000134ac9e10 nid=0xda03 runnable  [0x0000000314dc2000]
"qtp1925630411-57" #57 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=19.06s tid=0x0000000134aca410 nid=0xca03 runnable  [0x0000000314fce000]
"qtp1925630411-58" #58 daemon prio=5 os_prio=31 cpu=0.04ms elapsed=19.06s tid=0x0000000134acaa10 nid=0xcb03 runnable  [0x00000003151da000]
"qtp1925630411-59" #59 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=19.06s tid=0x0000000134acb010 nid=0xcc03 runnable  [0x00000003153e6000]
"qtp1925630411-60-acceptor-0108e9815-ServerConnector1e497474{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #60 daemon prio=3 os_prio=31 cpu=0.11ms elapsed=19.06s tid=0x00000001317ffa10 nid=0xcd03 runnable  [0x00000003155f2000]
"qtp1925630411-61-acceptor-11d90f2aa-ServerConnector1e497474{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #61 daemon prio=3 os_prio=31 cpu=0.10ms elapsed=19.06s tid=0x00000001314ed610 nid=0xcf03 waiting on condition  [0x00000003157fe000]
```

**AFTER**
```
$ SPARK_MASTER_OPTS='-Dspark.master.rest.enabled=true' sbin/start-master.sh
$ jstack 28317 | grep StandaloneRestServer
"StandaloneRestServer-52" #52 daemon prio=5 os_prio=31 cpu=0.09ms elapsed=60.06s tid=0x00000001284a8e10 nid=0xdb03 runnable  [0x000000032cfce000]
"StandaloneRestServer-53" #53 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284acc10 nid=0xda03 runnable  [0x000000032d1da000]
"StandaloneRestServer-54" #54 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284ae610 nid=0xd803 runnable  [0x000000032d3e6000]
"StandaloneRestServer-55" #55 daemon prio=5 os_prio=31 cpu=0.09ms elapsed=60.06s tid=0x00000001284aec10 nid=0xd703 runnable  [0x000000032d5f2000]
"StandaloneRestServer-56" #56 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284af210 nid=0xc803 runnable  [0x000000032d7fe000]
"StandaloneRestServer-57" #57 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284af810 nid=0xc903 runnable  [0x000000032da0a000]
"StandaloneRestServer-58" #58 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284afe10 nid=0xcb03 runnable  [0x000000032dc16000]
"StandaloneRestServer-59" #59 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284b0410 nid=0xcc03 runnable  [0x000000032de22000]
"StandaloneRestServer-60-acceptor-04aefbaa8-ServerConnector44284d85{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #60 daemon prio=3 os_prio=31 cpu=0.13ms elapsed=60.05s tid=0x000000015cda1a10 nid=0xcd03 runnable  [0x000000032e02e000]
"StandaloneRestServer-61-acceptor-148976251-ServerConnector44284d85{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #61 daemon prio=3 os_prio=31 cpu=0.12ms elapsed=60.05s tid=0x000000015cd1c810 nid=0xce03 waiting on condition  [0x000000032e23a000]
```

### Does this PR introduce _any_ user-facing change?

No, the thread names are only seen during debugging.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#48924 from dongjoon-hyun/SPARK-50385.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: panbingkun <[email protected]>