Conversation

@dongjoon-hyun
Member

dongjoon-hyun commented Nov 8, 2021

What changes were proposed in this pull request?

This PR aims to support building and running tests on Python 3.10.

Python 3.10 added many new features and breaking changes.
- https://docs.python.org/3/whatsnew/3.10.html

For example, the following breaking change blocks building and testing PySpark on Python 3.10.

PYTHON 3.9.7

Python 3.9.7 (default, Oct 22 2021, 13:24:00)
[Clang 13.0.0 (clang-1300.0.29.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from collections import Callable
<stdin>:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working

PYTHON 3.10.0

Python 3.10.0 (default, Oct 29 2021, 14:35:18) [Clang 13.0.0 (clang-1300.0.29.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from collections import Callable
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'Callable' from 'collections' (/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/collections/__init__.py)
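
For reference, a minimal sketch of the usual fix for this kind of breakage (not necessarily the exact patch in this PR): import the ABC from `collections.abc`, where it has lived since Python 3.3, keeping a fallback only for ancient interpreters.

```python
# 'collections.abc' has provided the ABCs since Python 3.3; Python 3.10
# removed the deprecated aliases from 'collections' itself.
try:
    from collections.abc import Callable
except ImportError:  # only reachable on interpreters older than 3.3
    from collections import Callable

assert isinstance(print, Callable)  # built-in functions are Callable
```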

Why are the changes needed?

BEFORE

$ build/sbt -Phadoop-cloud -Phadoop-3.2 test:package
$ python/run-tests
Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
Will test against the following Python executables: ['/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-pandas', 'pyspark-pandas-slow', 'pyspark-resource', 'pyspark-sql', 'pyspark-streaming']
/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3 python_implementation is CPython
/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3 version is: Python 3.10.0
Starting test(/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3): pyspark.ml.tests.test_algorithms (temp output: /var/folders/mq/c32xpgtj4tj19vt8b10wp8rc0000gn/T/Users_dongjoon_.pyenv_versions_3.10.0_bin_python3__pyspark.ml.tests.test_algorithms__nfcl9j4y.log)
Starting test(/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3): pyspark.ml.tests.test_base (temp output: /var/folders/mq/c32xpgtj4tj19vt8b10wp8rc0000gn/T/Users_dongjoon_.pyenv_versions_3.10.0_bin_python3__pyspark.ml.tests.test_base__143gcgep.log)
Starting test(/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3): pyspark.ml.tests.test_evaluation (temp output: /var/folders/mq/c32xpgtj4tj19vt8b10wp8rc0000gn/T/Users_dongjoon_.pyenv_versions_3.10.0_bin_python3__pyspark.ml.tests.test_evaluation__jbhwc3cs.log)
Starting test(/Users/dongjoon/.pyenv/versions/3.10.0/bin/python3): pyspark.ml.tests.test_feature (temp output: /var/folders/mq/c32xpgtj4tj19vt8b10wp8rc0000gn/T/Users_dongjoon_.pyenv_versions_3.10.0_bin_python3__pyspark.ml.tests.test_feature__0vx175eo.log)
Traceback (most recent call last):
  File "/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/__init__.py", line 53, in <module>
    from pyspark.rdd import RDD, RDDBarrier
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/rdd.py", line 34, in <module>
    from pyspark.java_gateway import local_connect_and_auth
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/java_gateway.py", line 32, in <module>
    from pyspark.serializers import read_int, write_with_length, UTF8Deserializer
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/serializers.py", line 68, in <module>
    from pyspark.util import print_exec  # type: ignore
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/util.py", line 28, in <module>
    from collections import Callable
ImportError: cannot import name 'Callable' from 'collections' (/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/collections/__init__.py)

Had test failures in pyspark.ml.tests.test_feature with /Users/dongjoon/.pyenv/versions/3.10.0/bin/python3; see logs.

AFTER

  • All tests passed except PyArrow-related tests, which are beyond the scope of this PR.
$ build/sbt -Phadoop-cloud -Phadoop-3.2 test:package
$ python/run-tests
...

Does this PR introduce any user-facing change?

Yes, this adds official support for Python 3.10.

How was this patch tested?

Pass the CIs and manually run Python tests on Python 3.10.

@SparkQA

SparkQA commented Nov 8, 2021

Test build #145011 has finished for PR 34526 at commit 05d187b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Could you review this PR, @HyukjinKwon ?

@SparkQA

SparkQA commented Nov 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49484/

@SparkQA

SparkQA commented Nov 8, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49484/

dongjoon-hyun marked this pull request as draft November 9, 2021 00:49
@HyukjinKwon
Member

Looks good. @xinrong-databricks mind double-checking this please, since you're investigating Python 3.10 support?

@dongjoon-hyun
Member Author

Thank you, @HyukjinKwon .
BTW, is there a workaround to pass test_memory_limit on Mac?

@HyukjinKwon
Member

HyukjinKwon commented Nov 9, 2021

No .. for some reason, the memory setting doesn't work on Mac. I think the tests previously passed because a fake memory-limit number was returned. I think we should run these tests only on Linux.

BTW the limitation was documented at #23664

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 9, 2021

Thank you for the confirmation. I'll make a PR to ignore it on Mac.
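
A minimal sketch of one way to skip such a test outside Linux (the names here are illustrative, not the actual Spark test code):

```python
import platform
import unittest


class WorkerMemoryTest(unittest.TestCase):  # hypothetical test class
    @unittest.skipIf(
        platform.system() != "Linux",
        "memory limits set via resource.setrlimit are not enforced on macOS",
    )
    def test_memory_limit(self):
        # Placeholder body; the real test would assert that a worker
        # respects the configured memory limit.
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```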

@xinrong-meng
Member

Thank you @dongjoon-hyun ! LGTM

@dongjoon-hyun
Member Author

Thank you for your reviews and testing, @xinrong-databricks . I'll update the PR description about the missing PyArrow tests.

dongjoon-hyun marked this pull request as ready for review November 9, 2021 04:05
dongjoon-hyun changed the title from [SPARK-37244][PYTHON] Build and test on Python 3.10 to [SPARK-37244][PYTHON] Build and run tests on Python 3.10 on Nov 9, 2021
@dongjoon-hyun
Member Author

Thank you, @HyukjinKwon and @xinrong-databricks .
I'll merge this because it will enable further testing.

dongjoon-hyun deleted the SPARK-37244 branch November 9, 2021 04:10
@HyukjinKwon
Member

Sure, LGTM2

HyukjinKwon pushed a commit that referenced this pull request Nov 9, 2021
### What changes were proposed in this pull request?

This PR is a follow-up of #34526 that additionally adjusts one `pyspark.rdd` doctest.

```python
- >>> b''.join(result).decode('utf-8')
+ >>> ''.join([r.decode('utf-8') if isinstance(r, bytes) else r for r in result])
```
### Why are the changes needed?

**Python 3.8/3.9**
```python
Using Python version 3.8.12 (default, Nov  8 2021 17:15:19)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1636432954207).
SparkSession available as 'spark'.
>>> from tempfile import NamedTemporaryFile
>>> tempFile3 = NamedTemporaryFile(delete=True)
>>> tempFile3.close()
>>> codec = "org.apache.hadoop.io.compress.GzipCodec"
>>> sc.parallelize(['foo', 'bar']).saveAsTextFile(tempFile3.name, codec)
>>> from fileinput import input, hook_compressed
>>> from glob import glob
>>> result = sorted(input(glob(tempFile3.name + "/part*.gz"), openhook=hook_compressed))
>>> result
[b'bar\n', b'foo\n']
```

**Python 3.10**
```python
Using Python version 3.10.0 (default, Oct 29 2021 14:35:18)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1636433378727).
SparkSession available as 'spark'.
>>> from tempfile import NamedTemporaryFile
>>> tempFile3 = NamedTemporaryFile(delete=True)
>>> tempFile3.close()
>>> codec = "org.apache.hadoop.io.compress.GzipCodec"
>>> sc.parallelize(['foo', 'bar']).saveAsTextFile(tempFile3.name, codec)
>>> from fileinput import input, hook_compressed
>>> from glob import glob
>>> result = sorted(input(glob(tempFile3.name + "/part*.gz"), openhook=hook_compressed))
>>> result
['bar\n', 'foo\n']
```
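
The difference comes from `fileinput.hook_compressed`: through Python 3.9 it opened compressed files in binary mode and yielded `bytes`, while Python 3.10 decodes them to `str` by default. The adjusted doctest therefore joins the lines in a version-agnostic way. A small self-contained sketch of that idiom (the sample data below is hypothetical, mirroring the two outputs above):

```python
def join_lines(lines):
    # Decode bytes (the Python <= 3.9 shape) but pass str
    # (the Python >= 3.10 shape) through unchanged.
    return ''.join(r.decode('utf-8') if isinstance(r, bytes) else r for r in lines)

assert join_lines([b'bar\n', b'foo\n']) == 'bar\nfoo\n'  # Python 3.8/3.9
assert join_lines(['bar\n', 'foo\n']) == 'bar\nfoo\n'    # Python 3.10
```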

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
$ python/run-tests --testnames pyspark.rdd
```

Closes #34529 from dongjoon-hyun/SPARK-37244-2.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Nov 9, 2021
### What changes were proposed in this pull request?

This PR fixes `setup.py` to note that PySpark works with Python 3.10.
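
A minimal, hypothetical sketch of the kind of `setup.py` change involved: advertising 3.10 support through a trove classifier (the surrounding fields are illustrative, not Spark's actual `setup.py`):

```python
from setuptools import setup

setup(
    name="example-package",   # illustrative metadata, not PySpark's
    version="0.0.1",
    python_requires=">=3.7",
    classifiers=[
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: 3.10",  # the newly advertised version
    ],
)
```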

### Why are the changes needed?

To officially support Python 3.10.

### Does this PR introduce _any_ user-facing change?

Yes, it officially supports Python 3.10.

### How was this patch tested?

It has been tested in #34526.
Arrow-related features are technically optional.

Closes #34533 from HyukjinKwon/SPARK-37257.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
sunchao pushed a commit to sunchao/spark that referenced this pull request Dec 8, 2021
This PR aims to support building and running tests on Python 3.10.

Python 3.10 added many new features and breaking changes.
- https://docs.python.org/3/whatsnew/3.10.html

This PR is a follow-up of apache#34526 that additionally adjusts one `pyspark.rdd` doctest.

Closes apache#34529 from dongjoon-hyun/SPARK-37244-2.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 47ceae4)
Signed-off-by: Dongjoon Hyun <[email protected]>