[SPARK-45065][PYTHON][PS] Support Pandas 2.1.0 #42793
Conversation
Since many features are deprecated in Pandas 2.1.0, let me investigate whether there are any corresponding features in the Pandas API on Spark while we're here.
- psdf = psdf.reset_index(level=should_drop_index, drop=True)
+ drop = not any(
+     [
+         isinstance(func_or_funcs[gkey.name], list)
+         for gkey in self._groupkeys
+         if gkey.name in func_or_funcs
+     ]
+ )
+ psdf = psdf.reset_index(level=should_drop_index, drop=drop)
Bug fixed in Pandas: pandas-dev/pandas#52849.
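For context, a minimal sketch (with hypothetical data) of the pandas-side fix this diff tracks: pandas-dev/pandas#52849 made list aggregations respect as_index=False, so the grouping column is no longer silently dropped.

    import pandas as pd

    df = pd.DataFrame({"A": [1, 1, 2], "B": [3, 4, 5]})

    # pandas < 2.1 dropped the grouping column "A" from the result despite
    # as_index=False when aggregating with a list; pandas >= 2.1 keeps it.
    print(df.groupby("A", as_index=False).agg(["min"]))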
- pdf = makeMissingDataframe(0.3, 42)
+ pdf = pd.DataFrame(
+     index=[
+         "".join(
+             np.random.choice(
+                 list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"), 10
+             )
+         )
+         for _ in range(30)
+     ],
+     columns=list("ABCD"),
+     dtype="float64",
+ )
The testing util makeMissingDataframe has been removed from pandas.
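If a similar fixture is still needed, it can be written by hand. A minimal sketch, assuming the old util's semantics of masking a random fraction of cells to NaN given a density and seed (the function name and exact masking rule here are assumptions, not the pandas implementation):

    import numpy as np
    import pandas as pd

    def make_missing_dataframe(density=0.3, seed=42, shape=(30, 4)):
        # Hypothetical stand-in for the removed pandas testing util:
        # random floats with some cells masked to NaN based on the density.
        rng = np.random.default_rng(seed)
        data = rng.standard_normal(shape)
        data[rng.random(shape) > density] = np.nan
        return pd.DataFrame(data, columns=list("ABCD"), dtype="float64")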
  >>> inferred = infer_return_type(func)
  >>> inferred.dtypes
- [dtype('int64'), CategoricalDtype(categories=[3, 4, 5], ordered=False)]
+ [dtype('int64'), CategoricalDtype(categories=[3, 4, 5], ordered=False, categories_dtype=int64)]
The dtype of the categories is now included in CategoricalDtype.__repr__: pandas-dev/pandas#52179.
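A quick way to see the change (the commented output is what pandas >= 2.1 prints; older versions stop at ordered=False):

    import pandas as pd

    # pandas >= 2.1 appends the dtype of the categories to the repr.
    print(repr(pd.CategoricalDtype(categories=[3, 4, 5])))
    # CategoricalDtype(categories=[3, 4, 5], ordered=False, categories_dtype=int64)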
  m 2.0 NaN
  dog kg NaN 3.0
  m 4.0 NaN
  >>> df_multi_level_cols2.stack().sort_index()
The column-ordering bug was fixed in Pandas: pandas-dev/pandas#53786.
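A rough reconstruction of the doctest's setup, borrowing the df_multi_level_cols2 frame from the stack() example in the pandas docs (the exact frame is an assumption); sort_index() pins a deterministic row order across pandas versions:

    import pandas as pd

    cols = pd.MultiIndex.from_tuples([("weight", "kg"), ("height", "m")])
    df_multi_level_cols2 = pd.DataFrame(
        [[1.0, 2.0], [3.0, 4.0]], index=["cat", "dog"], columns=cols
    )
    # Stacking moves the second column level into the row index; NaN fills
    # combinations that don't exist, e.g. a weight measured in metres.
    print(df_multi_level_cols2.stack().sort_index())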
Not related to this PR itself, but what is the policy for upgrading the minimum versions of the dependencies listed here?

@zhengruifeng AFAIK, there is no separate policy for minimum versions. We may change the minimum version of a particular package if an older version no longer works properly with Spark, or if the community for that package no longer maintains a particular older version, etc.

Let's probably upgrade them since we're going ahead with the 4.0.0 major version bump.

Could you resolve the conflict, @itholic?
dongjoon-hyun left a comment
+1, LGTM (Pending CIs)
python/pyspark/pandas/frame.py (Outdated)
  0 1.000000 4.494400
  1 11.262736 20.857489
  """
  return self.applymap(func=func)
Won't this call show a deprecation warning from applymap?
I guess we should call return self._apply_series_op(lambda psser: psser.apply(func)) here, and applymap should call map instead?
Oh yeah, we shouldn't call applymap here.
Just applied the suggestion. Thanks!
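For reference, a minimal sketch of the pandas deprecation being avoided here (plain pandas, not the Spark API):

    import pandas as pd

    df = pd.DataFrame({"A": [1.0, 2.0]})

    # pandas 2.1 renamed DataFrame.applymap to DataFrame.map; applymap
    # still works but emits a FutureWarning, so new code should use map.
    squared = df.map(lambda x: x ** 2)        # pandas >= 2.1 spelling
    legacy = df.applymap(lambda x: x ** 2)    # FutureWarning on >= 2.1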
* In Spark 4.0, the resulting name from ``value_counts`` for all objects is set to ``'count'`` (or ``'proportion'`` if ``normalize=True`` was passed) from pandas API on Spark, and the index will be named after the original object.
* In Spark 4.0, the ``squeeze`` parameter from ``ps.read_csv`` and ``ps.read_excel`` has been removed from pandas API on Spark.
* In Spark 4.0, the ``null_counts`` parameter from ``DataFrame.info`` has been removed from pandas API on Spark; use ``show_counts`` instead.
* In Spark 4.0, the result of ``MultiIndex.append`` does not keep the index names from pandas API on Spark.
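As a quick illustration of the ``value_counts`` note above (shown with plain pandas; the pandas API on Spark follows the same naming):

    import pandas as pd

    s = pd.Series(["a", "a", "b"])
    print(s.value_counts().name)                # count
    print(s.value_counts(normalize=True).name)  # proportion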
Can we add a line here telling users to have pandas 2.1.0 installed for Spark 4.0?
The only way now to find which pandas version to install is to check the Dockerfile in dev/infra.
Good idea. Related information has been added to the top of the migration guide. Thanks!
dongjoon-hyun left a comment
+1, LGTM again
The StreamingQueryListenerSuite failure is unrelated to this PR.
Merged to master for Apache Spark 4.0.0.
Thank you, @itholic and all!

Thanks all!
What changes were proposed in this pull request?
This PR proposes to support pandas 2.1.0 for PySpark. See What's new in 2.1.0 for more detail.
Why are the changes needed?
We should follow the latest version of pandas.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
The existing CI should pass with Pandas 2.1.0.
Was this patch authored or co-authored using generative AI tooling?
No.