Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Jun 30, 2020

What changes were proposed in this pull request?

This PR aims to drop Python 2.7, 3.4 and 3.5.

Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as sys.version comparison, __future__. Also, it removes the Python 2 dedicated codes such as ArrayConstructor in Spark.

Why are the changes needed?

  1. Unsupport EOL Python versions
  2. Reduce maintenance overhead and remove a bit of legacy codes and hacks for Python 2.
  3. PyPy2 has a critical bug that causes a flaky test, SPARK-28358 given my testing and investigation.
  4. Users can use Python type hints with Pandas UDFs without thinking about Python version
  5. Users can leverage one latest cloudpickle, [SPARK-32094][PYTHON] Update cloudpickle to v1.4.1 #28950. With Python 3.8+ it can also leverage C pickle.

Does this PR introduce any user-facing change?

Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.

How was this patch tested?

Manually tested and also tested in Jenkins.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jun 30, 2020

I'm supporting @HyukjinKwon 's idea, but cc @gatorsmile and @marmbrus since they wanted to keep the deprecated features forever if there is no big burden. We had better get confirmations from them.

@dongjoon-hyun
Copy link
Member

Also, cc @rxin

@SparkQA
Copy link

SparkQA commented Jun 30, 2020

Test build #124645 has finished for PR 28957 at commit 3efa521.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of dropping Python <3.6 as it would simplify the code. This also allows us to use type hinting to make pyspark easier too use.

I think, we need to update the classifiers as well:

spark/python/setup.py

Lines 214 to 226 in 3efa521

classifiers=[
'Development Status :: 5 - Production/Stable',
'License :: OSI Approved :: Apache Software License',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.4',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: Implementation :: CPython',
'Programming Language :: Python :: Implementation :: PyPy']
)

Also, I would recommend setting python_requires to >=3.6 so people get a notification if they install it using <=3.5.

@holdenk
Copy link
Contributor

holdenk commented Jun 30, 2020

Thanks for working on this! @Fokko do we have a rough idea what % of general users are running Python 3.6+?

@Fokko
Copy link
Contributor

Fokko commented Jun 30, 2020

My pleasure @holdenk

I ran a query against the public dataset of Google. They have a dataset that contains all the public pypi downloads:

SELECT 
  EXTRACT(YEAR FROM timestamp) AS year,
  EXTRACT(MONTH FROM timestamp) AS month,
  SAFE.SUBSTR(details.python, 0, 3) AS python_version,
  COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pyspark'
AND SAFE.SUBSTR(details.python, 0, 3) IS NOT NULL
GROUP BY 
  EXTRACT(YEAR FROM timestamp),
  EXTRACT(MONTH FROM timestamp),
  SAFE.SUBSTR(details.python, 0, 3)

This gives us the following per month:
image

We can see that the majority uses 3.7 and 3.6. However, there is still a share of 3.5 and 2.7.

If we look at the proportional share of people who'm using a compatible version:

SELECT 
  EXTRACT(YEAR FROM timestamp) AS year,
  EXTRACT(MONTH FROM timestamp) AS month,
  if(SAFE.SUBSTR(details.python, 0, 3) >= '3.6', 'ok', 'not_ok') as OK,
  COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pyspark'
AND SAFE.SUBSTR(details.python, 0, 3) IS NOT NULL
GROUP BY 
  EXTRACT(YEAR FROM timestamp),
  EXTRACT(MONTH FROM timestamp),
  if(SAFE.SUBSTR(details.python, 0, 3) >= '3.6', 'ok', 'not_ok')

Then the majority is ok:
image

The next question would be if Python <3.6 users are on 3.0 or on 2.x. My guess would be the latter, so we're (mostly) safe deprecating the old versions of Python.

@HyukjinKwon
Copy link
Member Author

Thanks @Fokko for the investigation here.

@dongjoon-hyun
Copy link
Member

Wow, nice investigation, @Fokko ! Thanks.

@SparkQA
Copy link

SparkQA commented Jul 1, 2020

Test build #124750 has finished for PR 28957 at commit dd288ca.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

@HyukjinKwon . Did you get a chance to get a feedback from @marmbrus and @gatorsmile ?

@gatorsmile
Copy link
Member

gatorsmile commented Jul 1, 2020

cc @mateiz and @rxin @mengxr

@HyukjinKwon
Copy link
Member Author

Nope, I am waiting for more feedback - I usually share anything up in OSS side. I was just assuming it's good enough to go given @Fokko's investigation at #28957 (comment).

@HyukjinKwon
Copy link
Member Author

I will send an email to dev list to confirm. I think that's faster.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Jul 11, 2020

I'll merge GitHub actions one first, and fix the conflicts here. I need https://github.com/apache/spark/pull/29057/files#diff-0590ca852e0e565bc489272aee36167fR729 change, and remove Python 2 specific codes at #29057 here.

@dongjoon-hyun
Copy link
Member

I merged GitHub Action PR. Please rebase this to the master. Thanks, @HyukjinKwon .

@HyukjinKwon
Copy link
Member Author

Sure, thanks @dongjoon-hyun

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, lot's of cleanup here thanks for doing it. This is great!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this looks good. I noticed you already fixed up the test cases that if affects, so that's great!

arrs_names = [(pa.array([], type=field.type), field.name) for field in t]
# Assign result columns by schema name if user labeled with strings
elif self._assign_cols_by_name and any(isinstance(name, basestring)
elif self._assign_cols_by_name and any(isinstance(name, str)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to think about removing this as an option as a followup. It was mostly added because dataframe constructed with python < 3.6 could not guarantee the order of columns, but now it should match the given schema.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. sounds good!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and yes, maybe it should better be in a separate PR.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@HyukjinKwon
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Jul 13, 2020

Test build #125774 has finished for PR 28957 at commit f2356c8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Jul 13, 2020

Test build #125786 has finished for PR 28957 at commit f2356c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

I am merging it to master. Thank you guys for reviewing this.

@HyukjinKwon
Copy link
Member Author

Merged to master.

@holdenk
Copy link
Contributor

holdenk commented Jul 14, 2020

Thanks for doing this, awesome work :)

@Fokko
Copy link
Contributor

Fokko commented Jul 14, 2020

Cool stuff, thanks for the work @HyukjinKwon

@dongjoon-hyun
Copy link
Member

Great! Thank you always for leading PySpark part (in addition to all the other Spark module), @HyukjinKwon !

@HyukjinKwon HyukjinKwon deleted the SPARK-32138 branch July 27, 2020 07:43
HyukjinKwon added a commit that referenced this pull request Feb 26, 2021
…on 2 and work with Python 3

### What changes were proposed in this pull request?

This PR proposes to make the scripts working by:
- Recovering credit related scripts that were broken from #29563
    `raw_input` does not exist in `releaseutils` but only in Python 2
- Dropping Python 2 in these scripts because we dropped Python 2 in #28957
- Making these scripts workin with Python 3

### Why are the changes needed?

To unblock the release.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only change.

### How was this patch tested?

I manually tested against Spark 3.1.1 RC3.

Closes #31660 from HyukjinKwon/SPARK-34551.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 5b92531)
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Feb 26, 2021
…on 2 and work with Python 3

### What changes were proposed in this pull request?

This PR proposes to make the scripts working by:
- Recovering credit related scripts that were broken from #29563
    `raw_input` does not exist in `releaseutils` but only in Python 2
- Dropping Python 2 in these scripts because we dropped Python 2 in #28957
- Making these scripts workin with Python 3

### Why are the changes needed?

To unblock the release.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only change.

### How was this patch tested?

I manually tested against Spark 3.1.1 RC3.

Closes #31660 from HyukjinKwon/SPARK-34551.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.