Conversation

@zhengruifeng (Contributor) commented Sep 14, 2023:

### What changes were proposed in this pull request?

1. In PyArrow 13.0.0, the behavior of `Table#to_pandas` and `ChunkedArray#to_pandas` changed; this PR sets `coerce_temporal_nanoseconds=True` to keep the previous nanosecond-based conversion (see the sketch after the dtype comparison below).

2. There is another undocumented breaking change in the data type conversion `TimestampType#to_pandas_dtype`:

12.0.1:

```py
In [1]: import pyarrow as pa

In [2]: pa.timestamp("us", tz=None).to_pandas_dtype()
Out[2]: dtype('<M8[ns]')

In [3]: pa.timestamp("ns", tz=None).to_pandas_dtype()
Out[3]: dtype('<M8[ns]')

In [4]: pa.timestamp("us", tz="UTC").to_pandas_dtype()
Out[4]: datetime64[ns, UTC]

In [5]: pa.timestamp("ns", tz="UTC").to_pandas_dtype()
Out[5]: datetime64[ns, UTC]
```

13.0.0:

```py
In [1]: import pyarrow as pa

In [2]: pa.timestamp("us", tz=None).to_pandas_dtype()
Out[2]: dtype('<M8[us]')

In [3]: pa.timestamp("ns", tz=None).to_pandas_dtype()
Out[3]: dtype('<M8[ns]')

In [4]: pa.timestamp("us", tz="UTC").to_pandas_dtype()
Out[4]: datetime64[us, UTC]

In [5]: pa.timestamp("ns", tz="UTC").to_pandas_dtype()
Out[5]: datetime64[ns, UTC]
```
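
For change 1, a minimal sketch of the flag (an illustration, not this PR's code; it assumes PyArrow 13.x, where `coerce_temporal_nanoseconds` was introduced, and pandas 2.x, which supports non-nanosecond datetimes):

```py
import pyarrow as pa

# a toy table with microsecond-precision timestamps
tbl = pa.table({"ts": pa.array([1, 2, 3], type=pa.timestamp("us"))})

# PyArrow 13 default: the microsecond unit is preserved -> datetime64[us]
print(tbl.to_pandas().dtypes)

# with the flag, PyArrow 13 coerces back to the pre-13 result -> datetime64[ns]
print(tbl.to_pandas(coerce_temporal_nanoseconds=True).dtypes)
```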

### Why are the changes needed?

Make PySpark compatible with PyArrow 13.0.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI

### Was this patch authored or co-authored using generative AI tooling?

No.

@zhengruifeng (Contributor, Author) commented:

The behavior of `to_arrow_type(spark_type).to_pandas_dtype()` changed, e.g.:

`to_arrow_type(DayTimeIntervalType)` -> `pa.timestamp("us", tz="UTC")` -> `datetime64[us, UTC]` in 13.0.0, but `datetime64[ns, UTC]` in 12.0.1.
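
A hedged repro of that chain (assuming PySpark's internal helper `pyspark.sql.pandas.types.to_arrow_type` with its default arguments; it is not a public API):

```py
from pyspark.sql.pandas.types import to_arrow_type
from pyspark.sql.types import TimestampType

# TimestampType maps to a microsecond Arrow timestamp; to_pandas_dtype()
# yields datetime64[ns, ...] under PyArrow 12.0.1 but datetime64[us, ...] under 13.0.0
print(to_arrow_type(TimestampType()).to_pandas_dtype())
```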

Commits in this PR:
- fix
- fix
- test with pyarrow 13
- fix connect
- fix koalas
- fix x
@zhengruifeng zhengruifeng changed the title [WIP][SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0 [SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0 Sep 14, 2023
@zhengruifeng zhengruifeng marked this pull request as ready for review September 14, 2023 12:09
@dongjoon-hyun (Member) left a comment:

It's great, @zhengruifeng and @HyukjinKwon.

BTW, how can we solve the mlflow issue?

mlflow 2.6.0 requires pyarrow<13,>=4.0.0, but you have pyarrow 13.0.0 which is incompatible.


RUN pypy3 -m pip install numpy 'pandas<=2.0.3' scipy coverage matplotlib
RUN python3.9 -m pip install numpy 'pyarrow==12.0.1' 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
A reviewer (Member) commented:

Let's also upgrade the mlflow version.

@zhengruifeng (Contributor, Author):

@dongjoon-hyun I don't see a failure in the Docker build:

#21 [15/19] RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
#21 43.17 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#21 DONE 46.2s

We don't pin the mlflow version; we only set a lower bound, `mlflow>=2.3.1`. It looks like mlflow 2.7.0 can work with PyArrow 13.0.0:

pyarrow [required: >=4.0.0,<14, installed: 13.0.0]
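
For reference, a stdlib-only sketch of the same check (the line above came from a dependency-tree tool; `importlib.metadata` is a standard-library alternative):

```py
from importlib.metadata import requires, version

print(version("pyarrow"))  # e.g. 13.0.0
# mlflow's declared pyarrow requirement, e.g. ['pyarrow<14,>=4.0.0'] for mlflow 2.7.0
print([r for r in requires("mlflow") if r.startswith("pyarrow")])
```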

@dongjoon-hyun (Member):

Oh, it's great to have mlflow 2.7.0 on time! Thank you.

@dongjoon-hyun (Member) left a comment:

+1, LGTM.

@dongjoon-hyun (Member):

Merged to master for Apache Spark 4.0.0. Thank you, @zhengruifeng and @HyukjinKwon.

@zhengruifeng (Contributor, Author):

Thank you, @dongjoon-hyun and @HyukjinKwon.

@zhengruifeng zhengruifeng deleted the py_pyarrow_13 branch September 15, 2023 05:16
dongjoon-hyun added a commit that referenced this pull request Mar 17, 2024
[SPARK-47432] Add `pyarrow` upper bound requirement, `<13.0.0`

### What changes were proposed in this pull request?

This PR aims to add `pyarrow` upper bound requirement, `<13.0.0`, to Apache Spark 3.5.x.
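
As a sketch of what the bound admits (purely illustrative, using the third-party `packaging` library; the `>=4.0.0` lower bound here is an assumption mirroring the mlflow output quoted earlier, not necessarily Spark's setup files):

```py
from packaging.specifiers import SpecifierSet
from packaging.version import Version

spec = SpecifierSet(">=4.0.0,<13.0.0")  # the key part added by this PR is "<13.0.0"
print(Version("12.0.1") in spec)  # True  -> still allowed on branch-3.5
print(Version("13.0.0") in spec)  # False -> excluded by the new upper bound
```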

### Why are the changes needed?

PyArrow 13.0.0 has breaking changes, mentioned in #42920, which is part of Apache Spark 4.0.0.

### Does this PR introduce _any_ user-facing change?

No, this only clarifies the upper bound.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45553 from dongjoon-hyun/SPARK-47432.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
zhengruifeng added a commit that referenced this pull request May 21, 2025
### What changes were proposed in this pull request?
Implement Scalar Arrow UDF

### Why are the changes needed?

Pandas UDFs (and Pandas Functions like MapInPandas) have a pandas <> arrow conversion, but:
- This conversion is not stable, and gets broken from time to time:
  - [The Arrow 13 upgrade](#42920): pandas UDFs with date/time types were all broken;
  - [Weird behavior](9d88020) when the dataset is empty;
- Pandas <> Arrow conversion is pretty expensive, and zero-copy is only possible [in certain narrow cases](https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions), e.g. StringType is not supported (see the sketch after this list);
- Support for complex types is not good, e.g. to support `StructType` series we need to use `pd.DataFrame` as a workaround.

Arrow UDF is designed to resolve the above issues.
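
A hedged illustration of the zero-copy limitation mentioned in the list above, using PyArrow's `zero_copy_only` flag:

```py
import pyarrow as pa

# primitive arrays without nulls can back a pandas Series without copying
print(pa.array([1, 2, 3]).to_pandas(zero_copy_only=True))

# string arrays cannot be converted zero-copy; PyArrow raises ArrowInvalid
try:
    pa.array(["a", "b"]).to_pandas(zero_copy_only=True)
except pa.ArrowInvalid as e:
    print("zero-copy not possible:", e)
```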

### Does this PR introduce _any_ user-facing change?

No. This PR only adds the underlying implementation, under `pyspark.sql.pandas.functions`, which is not a public module.

So the new feature is not exposed to end users for now; we will need to decide on the API later.

```py
import pyarrow as pa

from pyspark.sql import functions as sf
from pyspark.sql.pandas.functions import arrow_udf  # note: `pyspark.sql.pandas.functions` is not public

df = spark.range(10).withColumn("v", sf.col("id") + 1)

arrow_udf("long")
def multiply_arrow_func(a: pa.Array, b: pa.Array) -> pa.Array:
    assert isinstance(a, pa.Array)
    assert isinstance(b, pa.Array)
    return pa.compute.multiply(a, b)

df.select("id", "v", multiply_arrow_func("id", "v").alias("m")).show()

+---+---+---+
| id|  v|  m|
+---+---+---+
|  0|  1|  0|
|  1|  2|  2|
|  2|  3|  6|
|  3|  4| 12|
|  4|  5| 20|
|  5|  6| 30|
|  6|  7| 42|
|  7|  8| 56|
|  8|  9| 72|
|  9| 10| 90|
+---+---+---+
```

### How was this patch tested?
new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #50759 from zhengruifeng/py_arrow_udf.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
yhuang-db pushed a commit to yhuang-db/spark that referenced this pull request Jun 9, 2025