# Spark Connect - Developer Documentation

**Spark Connect is a strictly experimental feature and under heavy development.
All APIs should be considered volatile and should not be used in production.**
This module contains the implementation of Spark Connect, which is a logical plan
facade for the implementation in Spark. Spark Connect is directly integrated into the build
of Spark. To enable it, you only need to activate the driver plugin for Spark Connect.

The documentation linked here is intended for developers of Spark Connect and is not
meant to be end-user documentation.


## Getting Started

### Build

```bash
./build/mvn -Phive clean package
```

or

```bash
./build/sbt -Phive clean package
```

### Run Spark Shell

To run the Spark Connect version you built locally:

```bash
./bin/spark-shell \
  --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
```

To use the release version of Spark Connect:

```bash
./bin/spark-shell \
  --packages org.apache.spark:spark-connect_2.12:<spark-version> \
--conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
```

### Run Tests

```bash
./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic'
```


## Development Topics

### Generate protobuf files for the Python client
1. Install `buf` version 1.8.0: https://docs.buf.build/installation
2. Run `pip install grpcio==1.48.1 protobuf==4.21.6 mypy-protobuf==3.3.0`
3. Run `./connector/connect/dev/generate_protos.sh`
4. Optionally, check the generated code with `./dev/check-codegen-python.py`

### Guidelines for new clients

When contributing a new client, please be aware that we strive for a common
user experience across all languages. Please follow the guidelines below:

* [Connection string configuration](docs/client_connection_string.md)
---

New file: `connector/connect/docs/coient-connection-string.md`
# Connecting to Spark Connect using Clients

From the client perspective, Spark Connect mostly behaves like any other GRPC
client and can be configured as such. However, to make it easy to use from
different programming languages and to provide a homogeneous connection surface,
this document proposes the user-facing surface for connecting to a
Spark Connect endpoint.

## Background
Similar to JDBC and other database connections, Spark Connect uses a
connection string that contains the relevant parameters that are interpreted
to connect to the Spark Connect endpoint.


## Connection String

Generally, the connection string follows the standard URI definitions. The URI
scheme is fixed and set to `sc://`. The full URI has to be a
[valid URI](http://www.faqs.org/rfcs/rfc2396.html) so that it can be parsed
properly by most systems. For example, hostnames have to be valid and
cannot contain arbitrary characters. Configuration parameters are passed in the
style of the HTTP URL path parameter syntax, similar to JDBC connection
strings. The path component must be empty.

```shell
sc://hostname:port/;param1=value;param2=value
```

<table>
<tr>
<td>Parameter</td>
<td>Type</td>
<td>Description</td>
<td>Examples</td>
</tr>
<tr>
<td>hostname</td>
<td>String</td>
<td>
The hostname of the endpoint for Spark Connect. Since the endpoint
has to be a fully GRPC compatible endpoint a particular path cannot
be specified. The hostname must be fully qualified or can be an IP
address as well.
</td>
<td>
<pre>myexample.com</pre>
<pre>127.0.0.1</pre>
</td>
</tr>
<tr>
<td>port</td>
<td>Numeric</td>
<td>The port to be used when connecting to the GRPC endpoint. The
default value is <b>15002</b>. Any valid port number can be used.</td>
<td><pre>15002</pre><pre>443</pre></td>
</tr>
<tr>
<td>token</td>
<td>String</td>
<td>When this parameter is set in the URL, it enables standard
bearer token authentication using GRPC. By default this value is not set.</td>
<td><pre>token=ABCDEFGH</pre></td>
</tr>
<tr>
<td>use_ssl</td>
<td>Boolean</td>
<td>When this flag is set, the client connects to the endpoint
using TLS. It is assumed that the certificates needed to verify
the server certificate are available on the system. The default
value is <b>false</b>.</td>
<td><pre>use_ssl=true</pre><pre>use_ssl=false</pre></td>
</tr>
<tr>
<td>user_id</td>
<td>String</td>
<td>User ID to automatically set in the Spark Connect UserContext message.
This is necessary for proper Spark Session management. This is an
<i>optional</i> parameter, and depending on the deployment it might
be injected automatically by other means.</td>
<td>
<pre>user_id=Martin</pre>
</td>
</tr>
</table>
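Following the table above, the raw string parameters could be normalized into typed settings with the documented defaults. The `ClientConfig` class and `config_from_params` helper below are illustrative sketches, not part of the actual client:

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_PORT = 15002  # default GRPC port per the table above

@dataclass
class ClientConfig:
    """Hypothetical typed view of the connection parameters."""
    hostname: str
    port: int = DEFAULT_PORT
    token: Optional[str] = None      # not set by default
    use_ssl: bool = False            # default per the table
    user_id: Optional[str] = None    # optional, may be injected elsewhere

def config_from_params(hostname, port, params):
    """Apply the documented defaults and types to raw string parameters."""
    return ClientConfig(
        hostname=hostname,
        port=port if port is not None else DEFAULT_PORT,
        token=params.get("token"),
        use_ssl=params.get("use_ssl", "false").lower() == "true",
        user_id=params.get("user_id"),
    )
```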

## Examples

### Valid Examples
Below we capture valid configuration examples, explaining how the connection string
will be used when configuring the Spark Connect client.

The example below connects to **myhost.com** on the default port **`15002`**.
```python
server_url = "sc://myhost.com/"
```

The next example configures the connection to use a different port with SSL.

```python
server_url = "sc://myhost.com:443/;use_ssl=true"
```

The next example additionally configures a bearer token:

```python
server_url = "sc://myhost.com:443/;use_ssl=true;token=ABCDEFG"
```

### Invalid Examples

As mentioned above, Spark Connect uses a regular GRPC client, and the server path
cannot be configured, in order to remain compatible with the GRPC standard and HTTP.
For example, the following URL is invalid:

```python
server_url = "sc://myhost.com:443/mypathprefix/;token=AAAAAAA"
```