[SPARK-32138] Drop Python 2.7, 3.4 and 3.5 #28957

HyukjinKwon · 2020-06-30T15:16:39Z

What changes were proposed in this pull request?

This PR aims to drop Python 2.7, 3.4 and 3.5.

Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as sys.version comparisons and __future__ imports. It also removes Python 2 dedicated code such as ArrayConstructor in Spark.
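
As a rough illustration of the kind of workaround being removed, here is a minimal sketch of a version-gated string check and its Python 3-only replacement. The names `string_types`, `is_text_legacy` and `is_text` are hypothetical and do not come from the Spark code base; the actual changes are in the PR diff.

```python
import sys

# Before: a typical Python 2/3 compatibility shim (hypothetical example).
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    # Only reachable on Python 2, where `basestring` exists.
    string_types = (basestring,)  # noqa: F821


def is_text_legacy(value):
    """Version-agnostic string check that relies on the shim above."""
    return isinstance(value, string_types)


# After: with Python 2 dropped, the shim is unnecessary.
def is_text(value):
    """Python 3-only string check."""
    return isinstance(value, str)
```

On Python 3 both functions behave the same, which is why the shim can be deleted once Python 2 is no longer supported.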

Why are the changes needed?

Drop support for EOL Python versions.
Reduce maintenance overhead and remove some legacy code and hacks for Python 2.
PyPy2 has a critical bug that causes a flaky test (SPARK-28358), according to my testing and investigation.
Users can use Python type hints with Pandas UDFs without thinking about the Python version.
Users can rely on a single, latest cloudpickle ([SPARK-32094][PYTHON] Update cloudpickle to v1.4.1 #28950). With Python 3.8+ it can also leverage C pickle.

Does this PR introduce any user-facing change?

Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.

How was this patch tested?

Manually tested and also tested in Jenkins.

dongjoon-hyun · 2020-06-30T15:59:57Z

I support @HyukjinKwon's idea, but cc @gatorsmile and @marmbrus since they wanted to keep deprecated features forever if there is no big burden. We had better get confirmation from them.

dongjoon-hyun · 2020-06-30T16:00:08Z

Also, cc @rxin

SparkQA · 2020-06-30T16:19:51Z

Test build #124645 has finished for PR 28957 at commit 3efa521.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

Fokko

I'm in favor of dropping Python <3.6 as it would simplify the code. This also allows us to use type hinting to make pyspark easier to use.

I think we need to update the classifiers as well:

spark/python/setup.py

Lines 214 to 226 in 3efa521

    
        classifiers=[
            'Development Status :: 5 - Production/Stable',
            'License :: OSI Approved :: Apache Software License',
            'Programming Language :: Python :: 2.7',
            'Programming Language :: Python :: 3',
            'Programming Language :: Python :: 3.4',
            'Programming Language :: Python :: 3.5',
            'Programming Language :: Python :: 3.6',
            'Programming Language :: Python :: 3.7',
            'Programming Language :: Python :: 3.8',
            'Programming Language :: Python :: Implementation :: CPython',
            'Programming Language :: Python :: Implementation :: PyPy']
    )

Also, I would recommend setting python_requires to >=3.6 so people get a notification if they install it using <=3.5.
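
A minimal sketch of what that suggestion could look like, assuming the keyword arguments are eventually passed to `setuptools.setup()`. `SETUP_KWARGS`, `MIN_PYTHON` and `is_supported` are hypothetical names for illustration only; the real `python/setup.py` contains many more fields and may be structured differently.

```python
import sys

# Hypothetical subset of the arguments handed to setuptools.setup().
SETUP_KWARGS = dict(
    # Installers that honor Requires-Python metadata (such as recent pip)
    # will not select this release on older interpreters.
    python_requires=">=3.6",
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'License :: OSI Approved :: Apache Software License',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
        'Programming Language :: Python :: Implementation :: CPython',
        'Programming Language :: Python :: Implementation :: PyPy',
    ],
)

# Assumed minimum version, mirroring python_requires above.
MIN_PYTHON = (3, 6)


def is_supported(version_info=sys.version_info):
    """Return True if (major, minor) meets the assumed minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON
```

The `is_supported` helper is only there to make the version boundary explicit; `python_requires` alone is what installers read.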

holdenk · 2020-06-30T19:31:03Z

Thanks for working on this! @Fokko do we have a rough idea what % of general users are running Python 3.6+?

Fokko · 2020-06-30T20:22:17Z

My pleasure @holdenk

I ran a query against Google's public dataset that contains all the public PyPI downloads:

SELECT 
  EXTRACT(YEAR FROM timestamp) AS year,
  EXTRACT(MONTH FROM timestamp) AS month,
  SAFE.SUBSTR(details.python, 0, 3) AS python_version,
  COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pyspark'
AND SAFE.SUBSTR(details.python, 0, 3) IS NOT NULL
GROUP BY 
  EXTRACT(YEAR FROM timestamp),
  EXTRACT(MONTH FROM timestamp),
  SAFE.SUBSTR(details.python, 0, 3)

This gives us the following per month:

We can see that the majority uses 3.7 and 3.6. However, there is still a share of 3.5 and 2.7.

If we look at the proportional share of people who are using a compatible version:

SELECT 
  EXTRACT(YEAR FROM timestamp) AS year,
  EXTRACT(MONTH FROM timestamp) AS month,
  if(SAFE.SUBSTR(details.python, 0, 3) >= '3.6', 'ok', 'not_ok') as OK,
  COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pyspark'
AND SAFE.SUBSTR(details.python, 0, 3) IS NOT NULL
GROUP BY 
  EXTRACT(YEAR FROM timestamp),
  EXTRACT(MONTH FROM timestamp),
  if(SAFE.SUBSTR(details.python, 0, 3) >= '3.6', 'ok', 'not_ok')

Then the majority is ok:

The next question would be if Python <3.6 users are on 3.0 or on 2.x. My guess would be the latter, so we're (mostly) safe deprecating the old versions of Python.
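
One caveat on the `>= '3.6'` test in the second query: it compares the first three characters of the version as a string. That is fine for the versions present in this data, but a two-digit minor version such as 3.10 would truncate to `'3.1'` and sort below `'3.6'`. A minimal sketch of a numeric comparison in Python, assuming version strings of the form `major.minor[.patch]`; `major_minor` and `is_ok` are hypothetical helper names, not part of the query or of PySpark:

```python
def major_minor(version):
    """Parse a string like '3.10.1' into (3, 10); return None if unparseable."""
    parts = version.split(".")
    try:
        return int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return None


def is_ok(version, minimum=(3, 6)):
    """Return True if the version is at least `minimum`, compared numerically."""
    parsed = major_minor(version)
    return parsed is not None and parsed >= minimum
```

Comparing integer tuples avoids the lexicographic ordering problem without changing the result for single-digit minor versions.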

HyukjinKwon · 2020-07-01T00:49:03Z

Thanks @Fokko for the investigation here.

dongjoon-hyun · 2020-07-01T06:01:06Z

Wow, nice investigation, @Fokko ! Thanks.

SparkQA · 2020-07-01T07:17:40Z

Test build #124750 has finished for PR 28957 at commit dd288ca.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-07-01T15:21:48Z

@HyukjinKwon, did you get a chance to get feedback from @marmbrus and @gatorsmile?

gatorsmile · 2020-07-01T15:45:41Z

cc @mateiz and @rxin @mengxr

HyukjinKwon · 2020-07-01T15:55:46Z

Nope, I am waiting for more feedback. I usually share everything on the OSS side. I was just assuming it was good enough to go given @Fokko's investigation at #28957 (comment).

HyukjinKwon · 2020-07-02T01:59:31Z

I will send an email to dev list to confirm. I think that's faster.

HyukjinKwon · 2020-07-11T09:47:54Z

I'll merge the GitHub Actions one first, and then fix the conflicts here. I need the change at https://github.com/apache/spark/pull/29057/files#diff-0590ca852e0e565bc489272aee36167fR729, and I will remove the Python 2 specific code from #29057 here.

dongjoon-hyun · 2020-07-11T21:45:46Z

I merged the GitHub Actions PR. Please rebase this onto master. Thanks, @HyukjinKwon.

HyukjinKwon · 2020-07-13T00:14:43Z

Sure, thanks @dongjoon-hyun

BryanCutler

LGTM, lots of cleanup here, thanks for doing it. This is great!

python/pyspark/sql/tests/test_pandas_grouped_map.py

BryanCutler · 2020-07-13T02:47:06Z

python/pyspark/sql/types.py

Yup, this looks good. I noticed you already fixed up the test cases that it affects, so that's great!

BryanCutler · 2020-07-13T02:53:51Z

python/pyspark/sql/pandas/serializers.py

                    arrs_names = [(pa.array([], type=field.type), field.name) for field in t]
                # Assign result columns by schema name if user labeled with strings
-                elif self._assign_cols_by_name and any(isinstance(name, basestring)
+                elif self._assign_cols_by_name and any(isinstance(name, str)


We might want to think about removing this as an option as a followup. It was mostly added because dataframe constructed with python < 3.6 could not guarantee the order of columns, but now it should match the given schema.

Ah, right. Sounds good!

And yes, it would probably be better in a separate PR.
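
For context on the ordering point above: since Python 3.6, keyword arguments keep the order in which the caller wrote them (PEP 468), and dict insertion order is a language guarantee from Python 3.7 (an implementation detail of CPython 3.6). The sketch below only illustrates that language behavior; `column_names` is a hypothetical helper and not part of PySpark or pandas.

```python
def column_names(**columns):
    """Return keyword names in the order the caller wrote them."""
    return list(columns)


# Keyword order is preserved on Python 3.6+.
names = column_names(id=1, value=2.0, category="a")

# Dict literals also keep insertion order on supported versions.
ordered_keys = list({"z": 1, "a": 2, "m": 3})
```

With ordering preserved, columns built from keyword arguments or dicts can line up positionally with a declared schema, which is presumably why assigning by name becomes less necessary.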

dev/lint-python

HyukjinKwon · 2020-07-13T14:33:53Z

retest this please

SparkQA · 2020-07-13T17:33:01Z

Test build #125774 has finished for PR 28957 at commit f2356c8.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2020-07-13T18:56:44Z

retest this please

SparkQA · 2020-07-13T23:39:56Z

Test build #125786 has finished for PR 28957 at commit f2356c8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-07-14T02:21:44Z

I am merging it to master. Thank you guys for reviewing this.

HyukjinKwon · 2020-07-14T02:21:52Z

Merged to master.

holdenk · 2020-07-14T02:23:08Z

Thanks for doing this, awesome work :)

Fokko · 2020-07-14T10:58:26Z

Cool stuff, thanks for the work @HyukjinKwon

dongjoon-hyun · 2020-07-14T16:25:05Z

Great! Thank you always for leading the PySpark part (in addition to all the other Spark modules), @HyukjinKwon!

…on 2 and work with Python 3

### What changes were proposed in this pull request?

This PR proposes to make the scripts work by:

- Recovering credit related scripts that were broken from #29563: `raw_input` does not exist in `releaseutils` but only in Python 2
- Dropping Python 2 in these scripts because we dropped Python 2 in #28957
- Making these scripts work with Python 3

### Why are the changes needed?

To unblock the release.

### Does this PR introduce _any_ user-facing change?

No, it's a dev-only change.

### How was this patch tested?

I manually tested against Spark 3.1.1 RC3.

Closes #31660 from HyukjinKwon/SPARK-34551.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 5b92531)
Signed-off-by: HyukjinKwon <[email protected]>

probot-autolabeler bot added BUILD CORE DOCS INFRA ML MLLIB PYTHON labels Jun 30, 2020

Fokko reviewed Jun 30, 2020

HyukjinKwon force-pushed the SPARK-32138 branch from 3efa521 to dd288ca Compare July 1, 2020 07:12

HyukjinKwon force-pushed the SPARK-32138 branch from dd288ca to d74aa53 Compare July 2, 2020 02:22

HyukjinKwon mentioned this pull request Jul 2, 2020

[SPARK-32094][PYTHON] Update cloudpickle to v1.4.1 #28950

Closed

HyukjinKwon force-pushed the SPARK-32138 branch from ec42492 to 9e8100f Compare July 13, 2020 00:37

BryanCutler approved these changes Jul 13, 2020

dongjoon-hyun reviewed Jul 13, 2020

Drop Python 2.7, 3.4 and 3.5

5c27ea8

HyukjinKwon force-pushed the SPARK-32138 branch from 18f598e to 5c27ea8 Compare July 13, 2020 08:20

Install Python in Yarn test cases too, and some corrections

f2356c8

HyukjinKwon closed this in 4ad9bfd Jul 14, 2020

Fokko mentioned this pull request Jul 17, 2020

[SPARK-32320][PYSPARK] Remove mutable default arguments #29122

Closed

ankit-db mentioned this pull request Jul 17, 2020

Remove f-strings because we still support Python 3.5 mlflow/mlflow#3121

Merged

24 tasks

zero323 mentioned this pull request Jul 18, 2020

[SPARK-32138] Drop Python 2.7, 3.4 and 3.5 zero323/pyspark-stubs#439

Closed

HyukjinKwon deleted the SPARK-32138 branch July 27, 2020 07:43

HyukjinKwon mentioned this pull request Feb 26, 2021

[SPARK-34551][INFRA] Fix credit related scripts to recover, drop Python 2 and work with Python 3 #31660

Closed
