[SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0 #42920
Conversation
the behavior of `to_arrow_type(spark_type).to_pandas_dtype()` changed, e.g.:
`to_arrow_type(DayTimeIntervalType)` -> `pa.timestamp("us", tz="UTC")` -> `datetime64[us, UTC]` in 13.0.0, but `datetime64[ns, UTC]` in 12.0.1
dongjoon-hyun
left a comment
It's great, @zhengruifeng and @HyukjinKwon .
BTW, how can we solve the mlflow issue?
mlflow 2.6.0 requires pyarrow<13,>=4.0.0, but you have pyarrow 13.0.0 which is incompatible.
RUN pypy3 -m pip install numpy 'pandas<=2.0.3' scipy coverage matplotlib
- RUN python3.9 -m pip install numpy 'pyarrow==12.0.1' 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
+ RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
Let's also upgrade mlflow version.
@dongjoon-hyun I don't see a failure in the docker build; we don't pin the mlflow version, instead we only set a lower bound
Oh, it's great to have mlflow 2.7.0 on time! Thank you.
dongjoon-hyun
left a comment
+1, LGTM.
Merged to master for Apache Spark 4.0.0. Thank you, @zhengruifeng and @HyukjinKwon.
thank you @dongjoon-hyun and @HyukjinKwon
…equirement, `<13.0.0`

### What changes were proposed in this pull request?
This PR aims to add `pyarrow` upper bound requirement, `<13.0.0`, to Apache Spark 3.5.x.

### Why are the changes needed?
PyArrow 13.0.0 has breaking changes mentioned by #42920 which is a part of Apache Spark 4.0.0.

### Does this PR introduce _any_ user-facing change?
No, this only clarifies the upper bound.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45553 from dongjoon-hyun/SPARK-47432.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
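For illustration only, a minimal sketch of what such a version range means at runtime; the actual Spark 3.5.x change lives in the packaging metadata, and the `>=4.0.0` lower bound below is an assumption for the example, not quoted from the patch:

```py
import pyarrow as pa
from packaging.version import Version

# Hypothetical guard mirroring a pyarrow>=4.0.0,<13.0.0 requirement;
# pip enforces the same range at install time via the package metadata.
installed = Version(pa.__version__)
if not (Version("4.0.0") <= installed < Version("13.0.0")):
    raise RuntimeError(f"expected pyarrow>=4.0.0,<13.0.0, found {pa.__version__}")
```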
### What changes were proposed in this pull request?
Implement Scalar Arrow UDF

### Why are the changes needed?
Pandas UDFs (and Pandas Functions like MapInPandas) have a pandas <> arrow conversion, but:
- This conversion is not stable, and gets broken from time to time:
  - [The Arrow 13 upgrade](#42920): pandas UDFs with date/time types are all broken;
  - [Weird behavior](9d88020) when the dataset is empty;
- Pandas <> Arrow conversion is pretty expensive. Zero-copy is only possible [in certain narrow cases](https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions), e.g. StringType is not supported;
- The support of complex types is not good, e.g. to support a `StructType` series, we need to use `pd.DataFrame` as a workaround.

Arrow UDF is designed to resolve the above issues.

### Does this PR introduce _any_ user-facing change?
No, this PR only adds the underlying implementation, under `pyspark.sql.pandas.functions`, which is not a public module. So the new feature is not exposed to end users for now; we will need to decide the API later.

```py
import pyarrow as pa
from pyspark.sql import functions as sf
from pyspark.sql.pandas.functions import arrow_udf  # `pyspark.sql.pandas.functions` is not public

df = spark.range(10).withColumn("v", sf.col("id") + 1)

@arrow_udf("long")
def multiply_arrow_func(a: pa.Array, b: pa.Array) -> pa.Array:
    assert isinstance(a, pa.Array)
    assert isinstance(b, pa.Array)
    return pa.compute.multiply(a, b)

df.select("id", "v", multiply_arrow_func("id", "v").alias("m")).show()
+---+---+---+
| id|  v|  m|
+---+---+---+
|  0|  1|  0|
|  1|  2|  2|
|  2|  3|  6|
|  3|  4| 12|
|  4|  5| 20|
|  5|  6| 30|
|  6|  7| 42|
|  7|  8| 56|
|  8|  9| 72|
|  9| 10| 90|
+---+---+---+
```

### How was this patch tested?
new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #50759 from zhengruifeng/py_arrow_udf.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
What changes were proposed in this pull request?
1. In PyArrow 13.0.0, the behavior of `Table#to_pandas` and `ChunkedArray#to_pandas` changed: temporal columns are no longer coerced to nanoseconds by default, so we set `coerce_temporal_nanoseconds=True` to keep the existing behavior.
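A minimal sketch of that keyword, assuming PyArrow 13.0.0 and pandas >= 2.0 (with older pandas the result is always nanoseconds); the column name `ts` is just an example:

```py
import pyarrow as pa

# A microsecond-precision timestamp column.
tbl = pa.table({"ts": pa.array([0, 1_000_000], type=pa.timestamp("us"))})

# PyArrow 13.0.0 keeps the original unit by default; passing
# coerce_temporal_nanoseconds=True restores the pre-13 behavior.
print(tbl.to_pandas().dtypes)                                  # ts    datetime64[us]
print(tbl.to_pandas(coerce_temporal_nanoseconds=True).dtypes)  # ts    datetime64[ns]
```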
2. There is another undocumented breaking change in the data type conversion `TimestampType#to_pandas_dtype`: 12.0.1 returns a nanosecond pandas dtype (`datetime64[ns, ...]`), while 13.0.0 preserves the microsecond unit (`datetime64[us, ...]`).
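The underlying PyArrow change can be seen directly on the arrow type; a small sketch (the version-specific outputs are noted in the comments):

```py
import pyarrow as pa

# DataType.to_pandas_dtype() for a microsecond timestamp:
# - PyArrow 12.0.1 returns datetime64[ns, UTC] (coerced to nanoseconds)
# - PyArrow 13.0.0 returns datetime64[us, UTC] (unit preserved)
print(pa.timestamp("us", tz="UTC").to_pandas_dtype())
```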
Why are the changes needed?
Make PySpark compatible with PyArrow 13.0.0
Does this PR introduce any user-facing change?
NO
How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
NO