Conversation

@dongjoon-hyun
Member

dongjoon-hyun commented Nov 8, 2021

What changes were proposed in this pull request?

This PR aims to support building and running tests on Python 3.10.

Python 3.10 added many new features and breaking changes.
- https://docs.python.org/3/whatsnew/3.10.html

For example, the following breaking change blocks building and testing PySpark on Python 3.10.

PYTHON 3.9.7

Python 3.9.7 (default, Oct 22 2021, 13:24:00)
[Clang 13.0.0 (clang-1300.0.29.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from collections import Callable
<stdin>:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working

PYTHON 3.10.0

Python 3.10.0 (default, Oct 29 2021, 14:35:18) [Clang 13.0.0 (clang-1300.0.29.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from collections import Callable
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'Callable' from 'collections' (/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/collections/__init__.py)
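
For reference, a minimal sketch of the usual fix for this kind of breakage (not necessarily the exact patch in this PR): import the ABC from `collections.abc`, where it has lived since Python 3.3, keeping a fallback only for ancient interpreters.

```python
# 'collections.abc' has provided the ABCs since Python 3.3; Python 3.10
# removed the deprecated aliases from 'collections' itself.
try:
    from collections.abc import Callable
except ImportError:  # only reachable on interpreters older than 3.3
    from collections import Callable

assert isinstance(print, Callable)  # built-in functions are Callable
```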

Why are the changes needed?

BEFORE

$ build/sbt -Phadoop-cloud -Phadoop-3.2 test:package
$ python/run-tests
Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
Will test against the following Python executables: ['/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-pandas', 'pyspark-pandas-slow', 'pyspark-resource', 'pyspark-sql', 'pyspark-streaming']
/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3 python_implementation is CPython
/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3 version is: Python 3.10.0
Starting test(/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3): pyspark.ml.tests.test_algorithms (temp output: /var/folders/mq/c32xpgtj4tj19vt8b10wp8rc0000gn/T/Users_dongjoon_.pyenv_versions_3.10.0_bin_python3__pyspark.ml.tests.test_algorithms__nfcl9j4y.log)
Starting test(/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3): pyspark.ml.tests.test_base (temp output: /var/folders/mq/c32xpgtj4tj19vt8b10wp8rc0000gn/T/Users_dongjoon_.pyenv_versions_3.10.0_bin_python3__pyspark.ml.tests.test_base__143gcgep.log)
Starting test(/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3): pyspark.ml.tests.test_evaluation (temp output: /var/folders/mq/c32xpgtj4tj19vt8b10wp8rc0000gn/T/Users_dongjoon_.pyenv_versions_3.10.0_bin_python3__pyspark.ml.tests.test_evaluation__jbhwc3cs.log)
Starting test(/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3): pyspark.ml.tests.test_feature (temp output: /var/folders/mq/c32xpgtj4tj19vt8b10wp8rc0000gn/T/Users_dongjoon_.pyenv_versions_3.10.0_bin_python3__pyspark.ml.tests.test_feature__0vx175eo.log)
Traceback (most recent call last):
  File "/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/__init__.py", line 53, in <module>
    from pyspark.rdd import RDD, RDDBarrier
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/rdd.py", line 34, in <module>
    from pyspark.java_gateway import local_connect_and_auth
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/java_gateway.py", line 32, in <module>
    from pyspark.serializers import read_int, write_with_length, UTF8Deserializer
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/serializers.py", line 68, in <module>
    from pyspark.util import print_exec  # type: ignore
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/util.py", line 28, in <module>
    from collections import Callable
ImportError: cannot import name 'Callable' from 'collections' (/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/collections/__init__.py)

Had test failures in pyspark.ml.tests.test_feature with /Users/dongjoon/.pyenv/versions/3.10.0/bin/python3; see logs.

AFTER

  • All tests passed except PyArrow-related tests, which are beyond the scope of this PR.
$ build/sbt -Phadoop-cloud -Phadoop-3.2 test:package
$ python/run-tests
...

Does this PR introduce any user-facing change?

Yes, this adds official support for Python 3.10.

How was this patch tested?

Pass the CIs and manually run Python tests on Python 3.10.

@SparkQA

SparkQA commented Nov 8, 2021

Test build #145011 has finished for PR 34526 at commit 05d187b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Could you review this PR, @HyukjinKwon ?

@SparkQA

SparkQA commented Nov 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49484/

@SparkQA

SparkQA commented Nov 8, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49484/

dongjoon-hyun marked this pull request as draft November 9, 2021 00:49
@HyukjinKwon
Member

Looks good. @xinrong-databricks mind double-checking this please, since you're investigating Python 3.10 support?

@dongjoon-hyun
Member Author

Thank you, @HyukjinKwon .
BTW, is there a workaround to pass test_memory_limit on Mac?

@HyukjinKwon
Member

HyukjinKwon commented Nov 9, 2021

No .. for some reason, the memory setting doesn't work on Mac. I think the tests previously passed because a fake memory-limit number was returned. I think we should run these tests only on Linux.

BTW the limitation was documented at #23664

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 9, 2021

Thank you for the confirmation. I'll make a PR to ignore it on Mac.
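
A minimal sketch of one way to skip such a test outside Linux (the names here are illustrative, not the actual Spark test code):

```python
import platform
import unittest


class WorkerMemoryTest(unittest.TestCase):  # hypothetical test class
    @unittest.skipIf(
        platform.system() != "Linux",
        "memory limits set via resource.setrlimit are not enforced on macOS",
    )
    def test_memory_limit(self):
        # Placeholder body; the real test would assert that a worker
        # respects the configured memory limit.
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```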

@xinrong-meng
Member

Thank you @dongjoon-hyun ! LGTM

@dongjoon-hyun
Member Author

Thank you for your reviews and testing, @xinrong-databricks . I'll update the PR description about the missing PyArrow tests.

dongjoon-hyun marked this pull request as ready for review November 9, 2021 04:05
dongjoon-hyun changed the title from [SPARK-37244][PYTHON] Build and test on Python 3.10 to [SPARK-37244][PYTHON] Build and run tests on Python 3.10 on Nov 9, 2021
@dongjoon-hyun
Member Author

Thank you, @HyukjinKwon and @xinrong-databricks .
I'll merge this because it will enable further testing.

dongjoon-hyun deleted the SPARK-37244 branch November 9, 2021 04:10
@HyukjinKwon
Member

Sure, LGTM2

HyukjinKwon pushed a commit that referenced this pull request Nov 9, 2021
### What changes were proposed in this pull request?

This PR is a follow-up of #34526 that additionally adjusts one `pyspark.rdd` doctest.

```python
- >>> b''.join(result).decode('utf-8')
+ >>> ''.join([r.decode('utf-8') if isinstance(r, bytes) else r for r in result])
```
### Why are the changes needed?

**Python 3.8/3.9**
```python
Using Python version 3.8.12 (default, Nov  8 2021 17:15:19)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1636432954207).
SparkSession available as 'spark'.
>>> from tempfile import NamedTemporaryFile
>>> tempFile3 = NamedTemporaryFile(delete=True)
>>> tempFile3.close()
>>> codec = "org.apache.hadoop.io.compress.GzipCodec"
>>> sc.parallelize(['foo', 'bar']).saveAsTextFile(tempFile3.name, codec)
>>> from fileinput import input, hook_compressed
>>> from glob import glob
>>> result = sorted(input(glob(tempFile3.name + "/part*.gz"), openhook=hook_compressed))
>>> result
[b'bar\n', b'foo\n']
```

**Python 3.10**
```python
Using Python version 3.10.0 (default, Oct 29 2021 14:35:18)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1636433378727).
SparkSession available as 'spark'.
>>> from tempfile import NamedTemporaryFile
>>> tempFile3 = NamedTemporaryFile(delete=True)
>>> tempFile3.close()
>>> codec = "org.apache.hadoop.io.compress.GzipCodec"
>>> sc.parallelize(['foo', 'bar']).saveAsTextFile(tempFile3.name, codec)
>>> from fileinput import input, hook_compressed
>>> from glob import glob
>>> result = sorted(input(glob(tempFile3.name + "/part*.gz"), openhook=hook_compressed))
>>> result
['bar\n', 'foo\n']
```
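
The difference comes from `fileinput.hook_compressed`: through Python 3.9 it opened compressed files in binary mode and yielded `bytes`, while Python 3.10 decodes them to `str` by default. The adjusted doctest therefore joins the lines in a version-agnostic way. A small self-contained sketch of that idiom (the sample data below is hypothetical, mirroring the two outputs above):

```python
def join_lines(lines):
    # Decode bytes (the Python <= 3.9 shape) but pass str
    # (the Python >= 3.10 shape) through unchanged.
    return ''.join(r.decode('utf-8') if isinstance(r, bytes) else r for r in lines)

assert join_lines([b'bar\n', b'foo\n']) == 'bar\nfoo\n'  # Python 3.8/3.9
assert join_lines(['bar\n', 'foo\n']) == 'bar\nfoo\n'    # Python 3.10
```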

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
$ python/run-tests --testnames pyspark.rdd
```

Closes #34529 from dongjoon-hyun/SPARK-37244-2.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Nov 9, 2021
### What changes were proposed in this pull request?

This PR fixes `setup.py` to note that PySpark works with Python 3.10.
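
A minimal, hypothetical sketch of the kind of `setup.py` change involved: advertising 3.10 support through a trove classifier (the surrounding fields are illustrative, not Spark's actual `setup.py`):

```python
from setuptools import setup

setup(
    name="example-package",   # illustrative metadata, not PySpark's
    version="0.0.1",
    python_requires=">=3.7",
    classifiers=[
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: 3.10",  # the newly advertised version
    ],
)
```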

### Why are the changes needed?

To officially support Python 3.10.

### Does this PR introduce _any_ user-facing change?

Yes, it officially supports Python 3.10.

### How was this patch tested?

It has been tested in #34526.
Arrow-related features are technically optional.

Closes #34533 from HyukjinKwon/SPARK-37257.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
sunchao pushed a commit to sunchao/spark that referenced this pull request Dec 8, 2021
This PR aims to support building and running tests on Python 3.10.

Python 3.10 added many new features and breaking changes.
- https://docs.python.org/3/whatsnew/3.10.html

This PR is a follow-up of apache#34526 that additionally adjusts one `pyspark.rdd` doctest.

Closes apache#34529 from dongjoon-hyun/SPARK-37244-2.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 47ceae4)
Signed-off-by: Dongjoon Hyun <[email protected]>