[SPARK-45065][PYTHON][PS] Support Pandas 2.1.0 #42793
Conversation
Since many features are deprecated in Pandas 2.1.0, let me investigate whether there are any corresponding features in the Pandas API on Spark while we're here.
- psdf = psdf.reset_index(level=should_drop_index, drop=True)
+ drop = not any(
+     [
+         isinstance(func_or_funcs[gkey.name], list)
+         for gkey in self._groupkeys
+         if gkey.name in func_or_funcs
+     ]
+ )
+ psdf = psdf.reset_index(level=should_drop_index, drop=drop)
Bug fixed in Pandas: pandas-dev/pandas#52849.
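For context, a minimal sketch (with hypothetical data) of the pandas-side fix this diff tracks: pandas-dev/pandas#52849 made list aggregations respect as_index=False, so the grouping column is no longer silently dropped.

    import pandas as pd

    df = pd.DataFrame({"A": [1, 1, 2], "B": [3, 4, 5]})

    # pandas < 2.1 dropped the grouping column "A" from the result despite
    # as_index=False when aggregating with a list; pandas >= 2.1 keeps it.
    print(df.groupby("A", as_index=False).agg(["min"]))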
- pdf = makeMissingDataframe(0.3, 42)
+ pdf = pd.DataFrame(
+     index=[
+         "".join(
+             np.random.choice(
+                 list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"), 10
+             )
+         )
+         for _ in range(30)
+     ],
+     columns=list("ABCD"),
+     dtype="float64",
+ )
The testing util makeMissingDataframe has been removed from pandas.
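If a similar fixture is still needed, it can be written by hand. A minimal sketch, assuming the old util's semantics of masking a random fraction of cells to NaN given a density and seed (the function name and exact masking rule here are assumptions, not the pandas implementation):

    import numpy as np
    import pandas as pd

    def make_missing_dataframe(density=0.3, seed=42, shape=(30, 4)):
        # Hypothetical stand-in for the removed pandas testing util:
        # random floats with some cells masked to NaN based on the density.
        rng = np.random.default_rng(seed)
        data = rng.standard_normal(shape)
        data[rng.random(shape) > density] = np.nan
        return pd.DataFrame(data, columns=list("ABCD"), dtype="float64")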
  >>> inferred = infer_return_type(func)
  >>> inferred.dtypes
- [dtype('int64'), CategoricalDtype(categories=[3, 4, 5], ordered=False)]
+ [dtype('int64'), CategoricalDtype(categories=[3, 4, 5], ordered=False, categories_dtype=int64)]
The dtype of the categories is now included in CategoricalDtype.__repr__: pandas-dev/pandas#52179.
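A quick way to see the change (the commented output is what pandas >= 2.1 prints; older versions stop at ordered=False):

    import pandas as pd

    # pandas >= 2.1 appends the dtype of the categories to the repr.
    print(repr(pd.CategoricalDtype(categories=[3, 4, 5])))
    # CategoricalDtype(categories=[3, 4, 5], ordered=False, categories_dtype=int64)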
  m 2.0 NaN
  dog kg NaN 3.0
  m 4.0 NaN
  >>> df_multi_level_cols2.stack().sort_index()
The column-ordering bug was fixed in Pandas: pandas-dev/pandas#53786.
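A rough reconstruction of the doctest's setup, borrowing the df_multi_level_cols2 frame from the stack() example in the pandas docs (the exact frame is an assumption); sort_index() pins a deterministic row order across pandas versions:

    import pandas as pd

    cols = pd.MultiIndex.from_tuples([("weight", "kg"), ("height", "m")])
    df_multi_level_cols2 = pd.DataFrame(
        [[1.0, 2.0], [3.0, 4.0]], index=["cat", "dog"], columns=cols
    )
    # Stacking moves the second column level into the row index; NaN fills
    # combinations that don't exist, e.g. a weight measured in metres.
    print(df_multi_level_cols2.stack().sort_index())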
Not related to this PR itself, but what is the policy for upgrading the minimum versions of the dependencies listed here?

@zhengruifeng AFAIK, there is no separate policy for minimum versions. We may change the minimum version of a particular package if an older version no longer works properly with Spark, or if the community for that package no longer maintains a particular older version, etc.

Let's probably upgrade them since we're going ahead with the 4.0.0 major version bump.

Could you resolve the conflict, @itholic?
dongjoon-hyun left a comment
+1, LGTM (Pending CIs)
python/pyspark/pandas/frame.py (Outdated)
  0 1.000000 4.494400
  1 11.262736 20.857489
  """
  return self.applymap(func=func)
Won't this call show a deprecation warning from applymap?
I guess we should call return self._apply_series_op(lambda psser: psser.apply(func)) here, and applymap should call map instead?
Oh yeah, we shouldn't call applymap here.
Just applied the suggestion. Thanks!
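For reference, a minimal sketch of the pandas deprecation being avoided here (plain pandas, not the Spark API):

    import pandas as pd

    df = pd.DataFrame({"A": [1.0, 2.0]})

    # pandas 2.1 renamed DataFrame.applymap to DataFrame.map; applymap
    # still works but emits a FutureWarning, so new code should use map.
    squared = df.map(lambda x: x ** 2)        # pandas >= 2.1 spelling
    legacy = df.applymap(lambda x: x ** 2)    # FutureWarning on >= 2.1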
* In Spark 4.0, the resulting name from ``value_counts`` for all objects is set to ``'count'`` (or ``'proportion'`` if ``normalize=True`` was passed) from pandas API on Spark, and the index will be named after the original object.
* In Spark 4.0, the ``squeeze`` parameter from ``ps.read_csv`` and ``ps.read_excel`` has been removed from pandas API on Spark.
* In Spark 4.0, the ``null_counts`` parameter from ``DataFrame.info`` has been removed from pandas API on Spark; use ``show_counts`` instead.
* In Spark 4.0, the result of ``MultiIndex.append`` does not keep the index names from pandas API on Spark.
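As a quick illustration of the ``value_counts`` note above (shown with plain pandas; the pandas API on Spark follows the same naming):

    import pandas as pd

    s = pd.Series(["a", "a", "b"])
    print(s.value_counts().name)                # count
    print(s.value_counts(normalize=True).name)  # proportion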
Can we add a line here telling users to have pandas 2.1.0 installed for Spark 4.0?
The only way now to find which pandas version to install is to check the Dockerfile in dev/infra.
Good idea. Related information has been added to the top of the migration guide. Thanks!
dongjoon-hyun left a comment
+1, LGTM again
The StreamingQueryListenerSuite failure is unrelated to this PR.
Merged to master for Apache Spark 4.0.0.
Thank you, @itholic and all!

Thanks all!
What changes were proposed in this pull request?
This PR proposes to support pandas 2.1.0 for PySpark. See What's new in 2.1.0 for more detail.
Why are the changes needed?
We should follow the latest version of pandas.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
The existing CI should pass with Pandas 2.1.0.
Was this patch authored or co-authored using generative AI tooling?
No.