diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 443fbf47942c2..17c4f06dc28d2 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -611,7 +611,7 @@ jobs:
     - name: Python linter
       run: PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
     - name: Python code generation check
-      run: if test -f ./dev/check-codegen-python.py; then PATH=$PATH:$HOME/buf/bin PYTHON_EXECUTABLE=python3.9 ./dev/check-codegen-python.py; fi
+      run: if test -f ./dev/connect-check-protos.py; then PATH=$PATH:$HOME/buf/bin PYTHON_EXECUTABLE=python3.9 ./dev/connect-check-protos.py; fi
     - name: R linter
       run: ./dev/lint-r
     - name: JS linter
diff --git a/connector/connect/README.md b/connector/connect/README.md
index d5cc767c7445a..4f2e06678ddd3 100644
--- a/connector/connect/README.md
+++ b/connector/connect/README.md
@@ -1,29 +1,28 @@
-# Spark Connect - Developer Documentation
+# Spark Connect
 
 **Spark Connect is a strictly experimental feature and under heavy development.
 All APIs should be considered volatile and should not be used in production.**
 
 This module contains the implementation of Spark Connect which is a logical plan
 facade for the implementation in Spark. Spark Connect is directly integrated into the build
-of Spark. To enable it, you only need to activate the driver plugin for Spark Connect.
+of Spark.
 
 The documentation linked here is specifically for developers of Spark Connect and not
 directly intended to be end-user documentation.
 
+## Development Topics
 
-## Getting Started
+### Guidelines for new clients
 
-### Build
+When contributing a new client please be aware that we strive to have a common
+user experience across all languages. Please follow the below guidelines:
 
-```bash
-./build/mvn -Phive clean package
-```
+* [Connection string configuration](docs/client-connection-string.md)
+* [Adding new messages](docs/adding-proto-messages.md) in the Spark Connect protocol.
-or
+### Python client development
 
-```bash
-./build/sbt -Phive clean package
-```
+Python-specific development guidelines are located in [python/docs/source/development/testing.rst](https://github.com/apache/spark/blob/master/python/docs/source/development/testing.rst), which is published under the [Development tab](https://spark.apache.org/docs/latest/api/python/development/index.html) in the PySpark documentation.
 
 ### Build with user-defined `protoc` and `protoc-gen-grpc-java`
 
@@ -48,56 +47,3 @@ export CONNECT_PLUGIN_EXEC_PATH=/path-to-protoc-gen-grpc-java-exe
 
 The user-defined `protoc` and `protoc-gen-grpc-java` binary files can be produced in the user's
 compilation environment by source code compilation, for compilation steps, please refer to
 [protobuf](https://github.com/protocolbuffers/protobuf) and [grpc-java](https://github.com/grpc/grpc-java).
-
-### Run Spark Shell
-
-To run Spark Connect you locally built:
-
-```bash
-# Scala shell
-./bin/spark-shell \
-  --jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar | paste -sd ',' -` \
-  --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
-
-# PySpark shell
-./bin/pyspark \
-  --jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar | paste -sd ',' -` \
-  --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
-```
-
-To use the release version of Spark Connect:
-
-```bash
-./bin/spark-shell \
-  --packages org.apache.spark:spark-connect_2.12:3.4.0 \
-  --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
-```
-
-### Run Tests
-
-```bash
-# Run a single Python class.
-./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic'
-```
-
-```bash
-# Run all Spark Connect Python tests as a module.
-./python/run-tests --module pyspark-connect --parallelism 1
-```
-
-
-## Development Topics
-
-### Generate proto generated files for the Python client
-1. Install `buf version 1.11.0`: https://docs.buf.build/installation
-2. Run `pip install grpcio==1.48.1 protobuf==3.19.5 mypy-protobuf==3.3.0 googleapis-common-protos==1.56.4 grpcio-status==1.48.1`
-3. Run `./connector/connect/dev/generate_protos.sh`
-4. Optional Check `./dev/check-codegen-python.py`
-
-### Guidelines for new clients
-
-When contributing a new client please be aware that we strive to have a common
-user experience across all languages. Please follow the below guidelines:
-
-* [Connection string configuration](docs/client-connection-string.md)
-* [Adding new messages](docs/adding-proto-messages.md) in the Spark Connect protocol.
diff --git a/dev/check-codegen-python.py b/dev/connect-check-protos.py
similarity index 94%
rename from dev/check-codegen-python.py
rename to dev/connect-check-protos.py
index bcb2b0341da90..b902274b1f456 100755
--- a/dev/check-codegen-python.py
+++ b/dev/connect-check-protos.py
@@ -46,7 +46,7 @@ def run_cmd(cmd):
 def check_connect_protos():
     print("Start checking the generated codes in pyspark-connect.")
     with tempfile.TemporaryDirectory() as tmp:
-        run_cmd(f"{SPARK_HOME}/connector/connect/dev/generate_protos.sh {tmp}")
+        run_cmd(f"{SPARK_HOME}/dev/connect-gen-protos.sh {tmp}")
         result = filecmp.dircmp(
             f"{SPARK_HOME}/python/pyspark/sql/connect/proto/",
             tmp,
@@ -76,7 +76,7 @@ def check_connect_protos():
         fail(
             "Generated files for pyspark-connect are out of sync! "
             "If you have touched files under connector/connect/src/main/protobuf, "
-            "please run ./connector/connect/dev/generate_protos.sh. "
+            "please run ./dev/connect-gen-protos.sh. "
             "If you haven't touched any file above, please rebase your PR against main branch."
         )
diff --git a/connector/connect/dev/generate_protos.sh b/dev/connect-gen-protos.sh
similarity index 96%
rename from connector/connect/dev/generate_protos.sh
rename to dev/connect-gen-protos.sh
index 38cb821a47c53..cb5b66379b2fa 100755
--- a/connector/connect/dev/generate_protos.sh
+++ b/dev/connect-gen-protos.sh
@@ -20,12 +20,12 @@ set -ex
 
 if [[ $# -gt 1 ]]; then
   echo "Illegal number of parameters."
-  echo "Usage: ./connector/connect/dev/generate_protos.sh [path]"
+  echo "Usage: ./dev/connect-gen-protos.sh [path]"
   exit -1
 fi
 
-SPARK_HOME="$(cd "`dirname $0`"/../../..; pwd)"
+SPARK_HOME="$(cd "`dirname $0`"/..; pwd)"
 cd "$SPARK_HOME"
diff --git a/python/docs/source/development/contributing.rst b/python/docs/source/development/contributing.rst
index 88f7b3a7b436b..385e7db035de5 100644
--- a/python/docs/source/development/contributing.rst
+++ b/python/docs/source/development/contributing.rst
@@ -120,6 +120,8 @@ Prerequisite
 PySpark development requires to build Spark that needs a proper JDK installed, etc. See
 `Building Spark `_ for more details.
 
+Note that if you intend to contribute to Spark Connect in Python, ``buf`` version ``1.11.0`` is required; see `Buf Installation `_ for more details.
+
 Conda
 ~~~~~
diff --git a/python/docs/source/development/testing.rst b/python/docs/source/development/testing.rst
index 3eab8d04511d6..0262c318cd6f1 100644
--- a/python/docs/source/development/testing.rst
+++ b/python/docs/source/development/testing.rst
@@ -25,6 +25,11 @@ In order to run PySpark tests, you should build Spark itself first via Maven or
 
     build/mvn -DskipTests clean package
 
+.. code-block:: bash
+
+    build/sbt -Phive clean package
+
+
 After that, the PySpark test cases can be run via using ``python/run-tests``. For example,
 
 .. code-block:: bash
@@ -49,9 +54,54 @@ You can run a specific test via using ``python/run-tests``, for example, as belo
 
 Please refer to `Testing PySpark `_ for more details.
-Running tests using GitHub Actions
+Running Tests using GitHub Actions
 ----------------------------------
 
 You can run the full PySpark tests by using GitHub Actions in your own forked GitHub
 repository with a few clicks. Please refer to
 `Running tests in your forked repository using GitHub Actions `_ for more details.
+
+
+Running Tests for Spark Connect
+-------------------------------
+
+Running Tests for Python Client
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to run the tests for Spark Connect in Python, you should also pass the ``--parallelism 1`` option, for example, as below:
+
+.. code-block:: bash
+
+    python/run-tests --module pyspark-connect --parallelism 1
+
+Note that if you made some changes in Protobuf definitions, for example, at
+`spark/connector/connect/common/src/main/protobuf/spark/connect `_,
+you should regenerate the Python Protobuf client by running ``dev/connect-gen-protos.sh``.
+
+
+Running PySpark Shell with Python Client
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To run the Spark Connect server you built locally:
+
+.. code-block:: bash
+
+    bin/spark-shell \
+      --jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar | paste -sd ',' -` \
+      --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
+
+To run the Spark Connect server from the Apache Spark release:
+
+.. code-block:: bash
+
+    bin/spark-shell \
+      --packages org.apache.spark:spark-connect_2.12:3.4.0 \
+      --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
+
+
+To run the PySpark Shell with the client for the Spark Connect server:
+
+.. code-block:: bash
+
+    bin/pyspark --remote sc://localhost
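A note on the check script touched by this patch: `dev/connect-check-protos.py` regenerates the protos into a temporary directory and compares the result against the checked-in `python/pyspark/sql/connect/proto/` tree with `filecmp.dircmp`, failing if anything drifts. Below is a minimal, self-contained sketch of that comparison pattern; the directory contents and the `dirs_in_sync` helper are hypothetical illustrations, not Spark's actual code.

```python
import filecmp
import os
import tempfile

def dirs_in_sync(checked_in: str, regenerated: str) -> bool:
    """Return True when the two directories hold identical files.

    Mirrors the dircmp-based check: any differing common file, or any file
    present on only one side, means the generated code is out of sync.
    """
    result = filecmp.dircmp(checked_in, regenerated)
    return not (result.diff_files or result.left_only or result.right_only)

with tempfile.TemporaryDirectory() as checked_in, tempfile.TemporaryDirectory() as fresh:
    # Identical generated file on both sides: the trees are in sync.
    for d in (checked_in, fresh):
        with open(os.path.join(d, "base_pb2.py"), "w") as f:
            f.write("# generated code\n")
    assert dirs_in_sync(checked_in, fresh)

    # Simulate a stale checkout: the freshly regenerated file now differs.
    with open(os.path.join(fresh, "base_pb2.py"), "w") as f:
        f.write("# regenerated code, now different\n")
    assert not dirs_in_sync(checked_in, fresh)
```

The real script additionally prints a diff and an actionable message ("please run ./dev/connect-gen-protos.sh"), but the pass/fail decision reduces to the three `dircmp` attributes shown here.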