diff --git a/python/pyspark/sql/connect/README.md b/connector/connect/README.md
similarity index 72%
rename from python/pyspark/sql/connect/README.md
rename to connector/connect/README.md
index 9fd4d1a4596cb..d7c96209fbec5 100644
--- a/python/pyspark/sql/connect/README.md
+++ b/connector/connect/README.md
@@ -1,4 +1,4 @@
-# Spark Connect
+# Spark Connect - Developer Documentation

**Spark Connect is a strictly experimental feature and under heavy development.
All APIs should be considered volatile and should not be used in production.**

@@ -7,7 +7,13 @@ This module contains the implementation of Spark Connect which is a logical plan
facade for the implementation in Spark. Spark Connect is directly integrated into the build of Spark.
To enable it, you only need to activate the driver plugin for Spark Connect.

-## Build
+The documentation linked here is specifically for developers of Spark Connect and not
+directly intended to be end-user documentation.
+
+## Getting Started
+
+### Build

```bash
./build/mvn -Phive clean package
@@ -19,7 +25,7 @@ or
./build/sbt -Phive clean package
```

-## Run Spark Shell
+### Run Spark Shell

To run the version of Spark Connect you built locally:
@@ -43,14 +49,24 @@ To use the release version of Spark Connect:
    --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
```

-## Run Tests
+### Run Tests

```bash
./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic'
```

-## Generate proto generated files for the Python client
+
+## Development Topics
+
+### Generate proto generated files for the Python client

1. Install `buf version 1.8.0`: https://docs.buf.build/installation
2. Run `pip install grpcio==1.48.1 protobuf==4.21.6 mypy-protobuf==3.3.0`
3. Run `./connector/connect/dev/generate_protos.sh`
4. Optionally, check `./dev/check-codegen-python.py`
+
+### Guidelines for new clients
+
+When contributing a new client, please be aware that we strive to have a common
+user experience across all languages.
Please follow the below guidelines:

* [Connection string configuration](docs/client-connection-string.md)

diff --git a/connector/connect/docs/client-connection-string.md b/connector/connect/docs/client-connection-string.md
new file mode 100644
index 0000000000000..99cc4df4658f6
--- /dev/null
+++ b/connector/connect/docs/client-connection-string.md
@@ -0,0 +1,116 @@

# Connecting to Spark Connect using Clients

From the client perspective, Spark Connect mostly behaves like any other GRPC
client and can be configured as such. However, to make it easy to use from
different programming languages and to provide a homogeneous connection surface,
this document proposes the user surface for connecting to a Spark Connect
endpoint.

## Background

Similar to JDBC or other database connections, Spark Connect leverages a
connection string that contains the relevant parameters that are interpreted
to connect to the Spark Connect endpoint.

## Connection String

Generally, the connection string follows standard URI definitions. The URI
scheme is fixed and set to `sc://`. The full URI has to be a
[valid URI](http://www.faqs.org/rfcs/rfc2396.html) and must
be parsed properly by most systems. For example, hostnames have to be valid and
cannot contain arbitrary characters. Configuration parameters are passed in the
style of the HTTP URL path parameter syntax, similar to JDBC connection
strings. The path component must be empty.

```shell
sc://hostname:port/;param1=value;param2=value
```
| Parameter | Type | Description | Examples |
|-----------|------|-------------|----------|
| `hostname` | String | The hostname of the endpoint for Spark Connect. Since the endpoint has to be a fully GRPC-compatible endpoint, a particular path cannot be specified. The hostname must be fully qualified, or can be an IP address. | `myexample.com`, `127.0.0.1` |
| `port` | Numeric | The port to use when connecting to the GRPC endpoint. The default value is `15002`. Any valid port number can be used. | `15002`, `443` |
| `token` | String | When this parameter is set in the URL, it enables standard bearer token authentication using GRPC. By default this value is not set. | `token=ABCDEFGH` |
| `use_ssl` | Boolean | When this flag is set to `true`, the client connects to the endpoint using TLS. The assumption is that the certificates needed to verify the server certificate are available on the system. The default value is `false`. | `use_ssl=true`, `use_ssl=false` |
| `user_id` | String | User ID to automatically set in the Spark Connect `UserContext` message. This is necessary for appropriate Spark Session management. This is an *optional* parameter, and depending on the deployment it might be injected automatically by other means. | `user_id=Martin` |
## Examples

### Valid Examples

Below we capture valid configuration examples, explaining how the connection string
will be used when configuring the Spark Connect client.

The below example connects to the default port **`15002`** on **`myhost.com`**.

```python
server_url = "sc://myhost.com/"
```

The next example configures the connection to use a different port with SSL.

```python
server_url = "sc://myhost.com:443/;use_ssl=true"
```

The last valid example additionally passes a bearer token.

```python
server_url = "sc://myhost.com:443/;use_ssl=true;token=ABCDEFG"
```

### Invalid Examples

As mentioned above, Spark Connect uses a regular GRPC client, and the server path
cannot be configured in order to remain compatible with the GRPC standard and HTTP. For
example, the following examples are invalid.

```python
server_url = "sc://myhost.com:443/mypathprefix/;token=AAAAAAA"
```