2 changes: 1 addition & 1 deletion .github/workflows/build_and_test.yml
@@ -611,7 +611,7 @@ jobs:
- name: Python linter
run: PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
- name: Python code generation check
run: if test -f ./dev/check-codegen-python.py; then PATH=$PATH:$HOME/buf/bin PYTHON_EXECUTABLE=python3.9 ./dev/check-codegen-python.py; fi
run: if test -f ./dev/connect-check-protos.py; then PATH=$PATH:$HOME/buf/bin PYTHON_EXECUTABLE=python3.9 ./dev/connect-check-protos.py; fi
- name: R linter
run: ./dev/lint-r
- name: JS linter
74 changes: 10 additions & 64 deletions connector/connect/README.md
@@ -1,29 +1,28 @@
# Spark Connect - Developer Documentation
# Spark Connect

**Spark Connect is a strictly experimental feature and is under heavy development.
All APIs should be considered volatile and should not be used in production.**

This module contains the implementation of Spark Connect, which is a logical plan
facade for the implementation in Spark. Spark Connect is directly integrated into the build
of Spark. To enable it, you only need to activate the driver plugin for Spark Connect.
of Spark.

The documentation linked here is specifically for developers of Spark Connect and not
directly intended to be end-user documentation.

## Development Topics

## Getting Started
### Guidelines for new clients

### Build
When contributing a new client, please be aware that we strive to have a common
user experience across all languages. Please follow the guidelines below:

```bash
./build/mvn -Phive clean package
```
* [Connection string configuration](docs/client-connection-string.md)
* [Adding new messages](docs/adding-proto-messages.md) in the Spark Connect protocol.

or
### Python client development

```bash
./build/sbt -Phive clean package
```
Python-specific development guidelines are located in [python/docs/source/development/testing.rst](https://github.com/apache/spark/blob/master/python/docs/source/development/testing.rst), which is published under the [Development tab](https://spark.apache.org/docs/latest/api/python/development/index.html) in the PySpark documentation.

### Build with user-defined `protoc` and `protoc-gen-grpc-java`

@@ -48,56 +47,3 @@ export CONNECT_PLUGIN_EXEC_PATH=/path-to-protoc-gen-grpc-java-exe
The user-defined `protoc` and `protoc-gen-grpc-java` binaries can be produced in the user's environment by compiling them from source;
for the compilation steps, please refer to [protobuf](https://github.com/protocolbuffers/protobuf) and [grpc-java](https://github.com/grpc/grpc-java).


### Run Spark Shell
[Review comment, Member Author: Move to python/docs/source/development/testing.rst]


To run Spark Connect from a local build:

```bash
# Scala shell
./bin/spark-shell \
--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar | paste -sd ',' -` \
--conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin

# PySpark shell
./bin/pyspark \
--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar | paste -sd ',' -` \
--conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
```

To use the release version of Spark Connect:

```bash
./bin/spark-shell \
--packages org.apache.spark:spark-connect_2.12:3.4.0 \
--conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
```

### Run Tests

```bash
# Run a single Python class.
./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic'
```

```bash
# Run all Spark Connect Python tests as a module.
./python/run-tests --module pyspark-connect --parallelism 1
```


## Development Topics

### Generate the proto-generated files for the Python client
1. Install `buf version 1.11.0`: https://docs.buf.build/installation
[Review comment, Member Author: Moved to Environment Setup]

2. Run `pip install grpcio==1.48.1 protobuf==3.19.5 mypy-protobuf==3.3.0 googleapis-common-protos==1.56.4 grpcio-status==1.48.1`
3. Run `./connector/connect/dev/generate_protos.sh`
4. Optionally, check with `./dev/check-codegen-python.py`

### Guidelines for new clients

When contributing a new client, please be aware that we strive to have a common
user experience across all languages. Please follow the guidelines below:

* [Connection string configuration](docs/client-connection-string.md)
* [Adding new messages](docs/adding-proto-messages.md) in the Spark Connect protocol.
4 changes: 2 additions & 2 deletions dev/check-codegen-python.py → dev/connect-check-protos.py
@@ -46,7 +46,7 @@ def run_cmd(cmd):
def check_connect_protos():
print("Start checking the generated codes in pyspark-connect.")
with tempfile.TemporaryDirectory() as tmp:
run_cmd(f"{SPARK_HOME}/connector/connect/dev/generate_protos.sh {tmp}")
run_cmd(f"{SPARK_HOME}/dev/connect-gen-protos.sh {tmp}")
result = filecmp.dircmp(
f"{SPARK_HOME}/python/pyspark/sql/connect/proto/",
tmp,
@@ -76,7 +76,7 @@ def check_connect_protos():
fail(
"Generated files for pyspark-connect are out of sync! "
"If you have touched files under connector/connect/src/main/protobuf, "
"please run ./connector/connect/dev/generate_protos.sh. "
"please run ./dev/connect-gen-protos.sh. "
"If you haven't touched any file above, please rebase your PR against main branch."
)

connector/connect/dev/generate_protos.sh → dev/connect-gen-protos.sh
@@ -20,12 +20,12 @@ set -ex

if [[ $# -gt 1 ]]; then
echo "Illegal number of parameters."
echo "Usage: ./connector/connect/dev/generate_protos.sh [path]"
echo "Usage: ./dev/connect-gen-protos.sh [path]"
exit -1
fi


SPARK_HOME="$(cd "`dirname $0`"/../../..; pwd)"
SPARK_HOME="$(cd "`dirname $0`"/..; pwd)"
cd "$SPARK_HOME"


2 changes: 2 additions & 0 deletions python/docs/source/development/contributing.rst
@@ -120,6 +120,8 @@ Prerequisite

PySpark development requires building Spark, which needs a proper JDK installed, etc. See `Building Spark <https://spark.apache.org/docs/latest/building-spark.html>`_ for more details.

Note that if you intend to contribute to Spark Connect in Python, ``buf`` version ``1.11.0`` is required; see `Buf Installation <https://docs.buf.build/installation>`_ for more details.
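
One way to install it, following buf's documented release-download pattern (the target directory ``$HOME/buf/bin`` mirrors the ``PATH`` entry used by the CI workflow; adjust paths and OS/arch as needed):

.. code-block:: bash

    # Download the buf 1.11.0 binary for the current OS/architecture (illustrative paths)
    mkdir -p "$HOME/buf/bin"
    curl -sSL "https://github.com/bufbuild/buf/releases/download/v1.11.0/buf-$(uname -s)-$(uname -m)" \
        -o "$HOME/buf/bin/buf"
    chmod +x "$HOME/buf/bin/buf"
    export PATH="$PATH:$HOME/buf/bin"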

Conda
~~~~~

52 changes: 51 additions & 1 deletion python/docs/source/development/testing.rst
@@ -25,6 +25,11 @@ In order to run PySpark tests, you should build Spark itself first via Maven or

    build/mvn -DskipTests clean package

or:

.. code-block:: bash

    build/sbt -Phive clean package


After that, the PySpark test cases can be run using ``python/run-tests``. For example,

.. code-block:: bash
@@ -49,9 +54,54 @@ You can run a specific test via using ``python/run-tests``, for example, as belo
Please refer to `Testing PySpark <https://spark.apache.org/developer-tools.html>`_ for more details.


Running tests using GitHub Actions
Running Tests using GitHub Actions
----------------------------------

You can run the full PySpark tests by using GitHub Actions in your own forked GitHub
repository with a few clicks. Please refer to
`Running tests in your forked repository using GitHub Actions <https://spark.apache.org/developer-tools.html>`_ for more details.


Running Tests for Spark Connect
-------------------------------

Running Tests for Python Client
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to run the tests for Spark Connect in Python, you should pass the ``--parallelism 1`` option, for example, as below:

.. code-block:: bash

    python/run-tests --module pyspark-connect --parallelism 1

Note that if you make changes to the Protobuf definitions, for example, at
`spark/connector/connect/common/src/main/protobuf/spark/connect <https://github.com/apache/spark/tree/master/connector/connect/common/src/main/protobuf/spark/connect>`_,
you should regenerate the Python Protobuf client by running ``dev/connect-gen-protos.sh``.
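
For example (``dev/connect-check-protos.py`` runs the same sync check as CI, and the ``PYTHON_EXECUTABLE`` variable follows its usage in the CI workflow):

.. code-block:: bash

    # Regenerate the Python Protobuf client after editing the .proto files
    dev/connect-gen-protos.sh

    # Optionally verify that the generated files are in sync, as CI does
    PYTHON_EXECUTABLE=python3 dev/connect-check-protos.py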


Running PySpark Shell with Python Client
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To run the Spark Connect server that you built locally:

.. code-block:: bash

    bin/spark-shell \
      --jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar | paste -sd ',' -` \
      --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin

To run the Spark Connect server from the Apache Spark release:

.. code-block:: bash

    bin/spark-shell \
      --packages org.apache.spark:spark-connect_2.12:3.4.0 \
      --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin


To run the PySpark Shell with the client for the Spark Connect server:

.. code-block:: bash

    bin/pyspark --remote sc://localhost
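
Inside the shell, the ``spark`` session object is backed by the Spark Connect client, so ordinary DataFrame code, for example ``spark.range(10).show()``, runs against the Spark Connect server started above.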