[SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0 #42920
Conversation
the behavior of `to_arrow_type(spark_type).to_pandas_dtype()` changed, e.g.:
`to_arrow_type(DayTimeIntervalType)` -> `pa.timestamp("us", tz="UTC")` -> `datetime64[us, UTC]` in 13.0.0, but `datetime64[ns, UTC]` in 12.0.1
dongjoon-hyun
left a comment
It's great, @zhengruifeng and @HyukjinKwon .
BTW, how can we solve the mlflow issue?
mlflow 2.6.0 requires pyarrow<13,>=4.0.0, but you have pyarrow 13.0.0 which is incompatible.
RUN pypy3 -m pip install numpy 'pandas<=2.0.3' scipy coverage matplotlib
- RUN python3.9 -m pip install numpy 'pyarrow==12.0.1' 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
+ RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
Let's also upgrade mlflow version.
@dongjoon-hyun I don't see a failure in the docker build; we don't pin the mlflow version, instead we only set a lower bound
Oh, it's great to have mlflow 2.7.0 on time! Thank you.
dongjoon-hyun
left a comment
+1, LGTM.
Merged to master for Apache Spark 4.0.0. Thank you, @zhengruifeng and @HyukjinKwon.
thank you @dongjoon-hyun and @HyukjinKwon
…equirement, `<13.0.0`

### What changes were proposed in this pull request?
This PR aims to add `pyarrow` upper bound requirement, `<13.0.0`, to Apache Spark 3.5.x.

### Why are the changes needed?
PyArrow 13.0.0 has breaking changes mentioned by #42920 which is a part of Apache Spark 4.0.0.

### Does this PR introduce _any_ user-facing change?
No, this only clarifies the upper bound.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45553 from dongjoon-hyun/SPARK-47432.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
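For illustration only, a minimal sketch of what such a version range means at runtime; the actual Spark 3.5.x change lives in the packaging metadata, and the `>=4.0.0` lower bound below is an assumption for the example, not quoted from the patch:

```py
import pyarrow as pa
from packaging.version import Version

# Hypothetical guard mirroring a pyarrow>=4.0.0,<13.0.0 requirement;
# pip enforces the same range at install time via the package metadata.
installed = Version(pa.__version__)
if not (Version("4.0.0") <= installed < Version("13.0.0")):
    raise RuntimeError(f"expected pyarrow>=4.0.0,<13.0.0, found {pa.__version__}")
```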
### What changes were proposed in this pull request?
Implement Scalar Arrow UDF

### Why are the changes needed?
Pandas UDFs (and Pandas Functions like MapInPandas) have a pandas <> arrow conversion, but:
- This conversion is not stable, and gets broken from time to time:
  - [The Arrow 13 upgrade](#42920): pandas UDFs with date/time types are all broken;
  - [Weird behavior](9d88020) when the dataset is empty;
- Pandas <> Arrow conversion is pretty expensive. Zero-copy is only possible [in certain narrow cases](https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions), e.g. StringType is not supported;
- The support of complex types is not good, e.g. to support a `StructType` series, we need to use `pd.DataFrame` as a workaround.

Arrow UDF is designed to resolve the above issues.

### Does this PR introduce _any_ user-facing change?
No, this PR only adds the underlying implementation, under `pyspark.sql.pandas.functions`, which is not a public module. So the new feature is not exposed to end users for now; we will need to decide the API later.

```py
import pyarrow as pa
from pyspark.sql import functions as sf
from pyspark.sql.pandas.functions import arrow_udf  # `pyspark.sql.pandas.functions` is not public

df = spark.range(10).withColumn("v", sf.col("id") + 1)

@arrow_udf("long")
def multiply_arrow_func(a: pa.Array, b: pa.Array) -> pa.Array:
    assert isinstance(a, pa.Array)
    assert isinstance(b, pa.Array)
    return pa.compute.multiply(a, b)

df.select("id", "v", multiply_arrow_func("id", "v").alias("m")).show()
+---+---+---+
| id|  v|  m|
+---+---+---+
|  0|  1|  0|
|  1|  2|  2|
|  2|  3|  6|
|  3|  4| 12|
|  4|  5| 20|
|  5|  6| 30|
|  6|  7| 42|
|  7|  8| 56|
|  8|  9| 72|
|  9| 10| 90|
+---+---+---+
```

### How was this patch tested?
new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #50759 from zhengruifeng/py_arrow_udf.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
What changes were proposed in this pull request?
1. In PyArrow 13.0.0, the behavior of `Table#to_pandas` and `ChunkedArray#to_pandas` changed: temporal columns are no longer coerced to nanoseconds by default, so we set `coerce_temporal_nanoseconds=True` to keep the existing behavior.
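A minimal sketch of that keyword, assuming PyArrow 13.0.0 and pandas >= 2.0 (with older pandas the result is always nanoseconds); the column name `ts` is just an example:

```py
import pyarrow as pa

# A microsecond-precision timestamp column.
tbl = pa.table({"ts": pa.array([0, 1_000_000], type=pa.timestamp("us"))})

# PyArrow 13.0.0 keeps the original unit by default; passing
# coerce_temporal_nanoseconds=True restores the pre-13 behavior.
print(tbl.to_pandas().dtypes)                                  # ts    datetime64[us]
print(tbl.to_pandas(coerce_temporal_nanoseconds=True).dtypes)  # ts    datetime64[ns]
```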
2. There is another undocumented breaking change in the data type conversion `TimestampType#to_pandas_dtype`: 12.0.1 returns a nanosecond pandas dtype (`datetime64[ns, ...]`), while 13.0.0 preserves the microsecond unit (`datetime64[us, ...]`).
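The underlying PyArrow change can be seen directly on the arrow type; a small sketch (the version-specific outputs are noted in the comments):

```py
import pyarrow as pa

# DataType.to_pandas_dtype() for a microsecond timestamp:
# - PyArrow 12.0.1 returns datetime64[ns, UTC] (coerced to nanoseconds)
# - PyArrow 13.0.0 returns datetime64[us, UTC] (unit preserved)
print(pa.timestamp("us", tz="UTC").to_pandas_dtype())
```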
Why are the changes needed?
Make PySpark compatible with PyArrow 13.0.0
Does this PR introduce any user-facing change?
NO
How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
NO