@HyukjinKwon HyukjinKwon commented Feb 7, 2024

What changes were proposed in this pull request?

This PR proposes to release a separate pyspark-connect package. See also SPIP: Pure Python Package in PyPI (Spark Connect).

Today's PySpark package is roughly as follows:

pyspark
├── *.py               # *Core / No Spark Connect support*
├── mllib              # MLlib / No Spark Connect support
├── resource           # Resource profile API / No Spark Connect support
├── streaming          # DStream (deprecated) / No Spark Connect support
├── ml                 # ML 
│   └── connect            # Spark Connect for ML
├── pandas             # API on Spark with/without Spark Connect support
└── sql                # SQL
    └── connect            # Spark Connect for SQL
        └── streaming      # Spark Connect for Structured Streaming

There will be two packages available, pyspark and pyspark-connect.

pyspark

Same as today’s PySpark, except that the Core module is factored out to pyspark.core.*. The user-facing interface stays the same at pyspark.*.

pyspark
├── core               # *Core / No Spark Connect support*
├── mllib              # MLlib / No Spark Connect support
├── resource           # Resource profile API / No Spark Connect support
├── streaming          # DStream (deprecated) / No Spark Connect support
├── ml                 # ML 
│   └── connect            # Spark Connect for ML
├── pandas             # API on Spark with/without Spark Connect support
└── sql                # SQL
    └── connect            # Spark Connect for SQL
        └── streaming      # Spark Connect for Structured Streaming
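
One way to keep a user-facing path stable while moving the implementation is to re-export names from the new subpackage. The following is a minimal sketch of that pattern only. The package name `demo_pkg` and the class `Context` are hypothetical stand-ins, built in memory so the example is self-contained. It is not the actual PySpark code.

```python
import sys
import types

# Hypothetical stand-in for an implementation subpackage (think `pyspark.core`).
core = types.ModuleType("demo_pkg.core")


class Context:
    """Placeholder for a class whose implementation lives in the core subpackage."""

    def __init__(self, app_name):
        self.app_name = app_name


Context.__module__ = "demo_pkg.core"
core.Context = Context

# Hypothetical top-level package that re-exports the class under its old path.
pkg = types.ModuleType("demo_pkg")
pkg.core = core
pkg.Context = core.Context

# Register both modules so that ordinary import statements can find them.
sys.modules["demo_pkg"] = pkg
sys.modules["demo_pkg.core"] = core

# The old path and the new path resolve to the same object.
from demo_pkg import Context as TopLevel
from demo_pkg.core import Context as FromCore
```

With this arrangement, existing user code that imports from the top-level package keeps working, while the implementation can be packaged or excluded as a unit.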

pyspark-connect

The package that remains after excluding the modules that do not support Spark Connect and the jars, that is, ml without jars:

pyspark
├── ml
│   └── connect
├── pandas
└── sql
    └── connect
        └── streaming
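
The selection above can be illustrated as a filter over dotted package names. This is a sketch under assumptions: the helper `connect_packages` and the constant `NO_CONNECT_SUPPORT` are hypothetical names, the excluded list is taken from the trees in this description, and the real packaging script may work differently.

```python
# Top-level modules marked "No Spark Connect support" in the layout above.
NO_CONNECT_SUPPORT = (
    "pyspark.core",
    "pyspark.mllib",
    "pyspark.resource",
    "pyspark.streaming",
)


def connect_packages(all_packages):
    """Keep only the packages that are not under an excluded module."""

    def excluded(name):
        # Match the module itself or any of its subpackages, by dotted prefix.
        return any(name == p or name.startswith(p + ".") for p in NO_CONNECT_SUPPORT)

    return [name for name in all_packages if not excluded(name)]


all_packages = [
    "pyspark",
    "pyspark.core",
    "pyspark.mllib",
    "pyspark.resource",
    "pyspark.streaming",
    "pyspark.ml",
    "pyspark.ml.connect",
    "pyspark.pandas",
    "pyspark.sql",
    "pyspark.sql.connect",
    "pyspark.sql.connect.streaming",
]
selected = connect_packages(all_packages)
```

Matching on a dotted prefix keeps `pyspark.sql.connect.streaming` while dropping the top-level `pyspark.streaming`.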

Why are the changes needed?

To provide a pure Python library that does not depend on the JVM.

See also SPIP: Pure Python Package in PyPI (Spark Connect).

Does this PR introduce any user-facing change?

Yes, users can install the pure Python library via pip install pyspark-connect.

How was this patch tested?

Manually tested the basic set of tests.

./sbin/start-connect-server.sh --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar`
cd python
python packaging/connect/setup.py sdist
cd dist
conda create -y -n clean-py-3.11 python=3.11
conda activate clean-py-3.11
pip install pyspark-connect-4.0.0.dev0.tar.gz
python
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
>>> spark.range(10).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

The tests will be added separately and set up as a scheduled job in CI.
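
The manual check in the session above could also be written as a small function. This is a sketch only: `check_range` is a hypothetical helper, not part of this PR, and it takes any session-like object that provides `range(n).collect()`, so it can be tried without a running server.

```python
def check_range(spark, n=10):
    """Return True if spark.range(n) yields the ids 0..n-1, as in the session above."""
    # Sort the ids because row order is not asserted here.
    ids = sorted(row["id"] for row in spark.range(n).collect())
    return ids == list(range(n))
```

With a Spark Connect server running as in the commands above, it could be called as `check_range(SparkSession.builder.remote("sc://localhost").getOrCreate())`.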

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-47683][PYTHON][BUILD] Decouple PySpark core API to pyspark.core package [SPARK-47683][PYTHON][BUILD] Decouple PySpark core API to pyspark.core package Apr 2, 2024
@HyukjinKwon

cc @zhengruifeng @grundprinzip @ueshin @hvanhovell @itholic @WeichenXu123 @mengxr @allisonwang-db @xinrong-meng @gatorsmile @cloud-fan This is ready for a look (though we should wait one more day for the SPIP to pass before merging).

@HyukjinKwon

I restored the references for our internal API. Explicitly private attributes starting with _ are not restored.

@HyukjinKwon

Merged to master.

HyukjinKwon added a commit that referenced this pull request May 2, 2024
…spark-connect` package

### What changes were proposed in this pull request?

This PR is a followup of #45053, which currently includes `lib/py4j*zip` in the package. The file is picked up by https://github.com/apache/spark/blob/master/python/MANIFEST.in#L26. We don't create the `deps` directory in `setup.py` for `pyspark-connect`, so the other files are not included, but `lib` still is.

### Why are the changes needed?

To exclude unrelated files.

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released yet.

### How was this patch tested?

Manually packaged, and checked the contents via `vi`.
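
A scripted alternative to inspecting the archive by hand might look like the sketch below. The helper `find_members` and its default glob are assumptions for illustration, and the member layout inside the real sdist may differ.

```python
import fnmatch
import tarfile


def find_members(sdist_path, pattern="*lib/py4j*zip"):
    """Return the names of archive members that match the glob pattern."""
    with tarfile.open(sdist_path) as tar:
        # fnmatch's `*` also matches `/`, so the pattern spans directory levels.
        return [name for name in tar.getnames() if fnmatch.fnmatch(name, pattern)]
```

An empty result for the packaged `pyspark-connect` sdist would indicate that no matching py4j zip was picked up.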

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46331 from HyukjinKwon/SPARK-47683-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>