@grundprinzip
Contributor

What changes were proposed in this pull request?

This patch adds documentation that describes how clients should handle
connecting to the Spark Connect endpoint. gRPC as a protocol is well documented and has
many options. However, that alone does not make it easy for users to work out how to
configure gRPC correctly for their use cases.

To address this, the document defines a client connection string that each language
client must parse and that can then be used to configure the gRPC client properly.
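
As a rough illustration of what "parsing the connection string" could look like, here is a minimal Python sketch. It assumes the shape `sc://host[:port]/;key=value;...`, inferred only from the example string quoted later in this thread, and it takes the default port 15002 from the `__init__` signature quoted in the review. The names `parse_connection_string` and `ConnectionInfo` are hypothetical and not part of any Spark API; the authoritative grammar is the one in the document this PR adds.

```python
from typing import Dict, NamedTuple

# Assumption: default port taken from the __init__ signature quoted in this thread.
DEFAULT_PORT = 15002


class ConnectionInfo(NamedTuple):
    """Hypothetical container for the parsed pieces of a connection string."""
    host: str
    port: int
    params: Dict[str, str]


def parse_connection_string(url: str) -> ConnectionInfo:
    """Split 'sc://host[:port]/;key=value;...' into host, port, and parameters."""
    prefix = "sc://"
    if not url.startswith(prefix):
        raise ValueError(f"expected a URL starting with {prefix!r}: {url!r}")
    rest = url[len(prefix):]
    # Everything before the first '/' is the endpoint; the remainder holds the parameters.
    endpoint, _, tail = rest.partition("/")
    host, sep, port_text = endpoint.partition(":")
    if not host:
        raise ValueError(f"missing host in {url!r}")
    port = int(port_text) if sep else DEFAULT_PORT
    params: Dict[str, str] = {}
    for item in tail.split(";"):
        if not item:
            continue  # skip empty pieces, e.g. the one before the first ';'
        key, eq, value = item.partition("=")
        if not eq or not key:
            raise ValueError(f"malformed parameter {item!r} in {url!r}")
        params[key] = value
    return ConnectionInfo(host, port, params)
```

This sketch does not handle IPv6 literals or percent-encoded values; a real client would need to follow whatever the specification says about those cases.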

Why are the changes needed?

Documentation and design specification for clients implementing the Spark Connect protocol.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Doc only.

@amaliujia
Contributor

Overall LGTM

Would the user_id (or the user session token) be relevant to this doc?

def __init__(self, user_id: str, host: Optional[str] = None, port: int = 15002):

@grundprinzip
Contributor Author

Good point, I will incorporate that into the doc.

@@ -0,0 +1,110 @@
# Connecting to Spark Connect using Clients
Member

The usage documentation would be better placed in https://github.com/apache/spark/tree/master/docs as an md file so it can be properly documented.

@HyukjinKwon
Member

Maybe it's better to have a JIRA. BTW, I wonder if we have an e2e example that users can copy and paste to try (e.g., like most of the docs in https://spark.apache.org/docs/latest/index.html). Another decision to make is whether we should document it in the PySpark docs (https://spark.apache.org/docs/latest/api/python/getting_started/index.html) or on the Spark main page (https://spark.apache.org/docs/latest/index.html).

@grundprinzip
Contributor Author

@HyukjinKwon I will add a Jira. This is just the starting point to align on where we want to go.

My idea is that once this is merged, I will create a PR for the Python client that follows this proposal, and then we can update this doc with an e2e example.

But the main goal here is developer documentation, not user docs. For example, Scala must then implement the same model.

@HyukjinKwon
Member

@grundprinzip
Contributor Author

How about we link to it from the top-level README in the component?

The reason it's not in the code is that it's client-language agnostic.

@HyukjinKwon
Member

Maybe putting it at the top module level (connector/connect/README.md) for now could be a good idea (?). I just wanted to avoid a structure that differs from other components (connector/connect/doc).

@grundprinzip
Contributor Author

@HyukjinKwon so my proposal is the following:

  • I moved the README from the pyspark folder into the top-level connector/connect directory and made it clear that these are developer docs.
  • I renamed the doc directory to docs to be consistent with the rest of the project.
  • I linked from the top-level README to the connection string configuration doc.

WDYT?

@github-actions github-actions bot added the DOCS label Nov 2, 2022
@grundprinzip grundprinzip changed the title [CONNECT] [DOC] Defining Spark Connect Client Connection String [SPARK-40995] [CONNECT] [DOC] Defining Spark Connect Client Connection String Nov 2, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

Merged to master.

HyukjinKwon pushed a commit that referenced this pull request Nov 4, 2022
…hon Client

### What changes were proposed in this pull request?

This PR implements the connection string for Spark Connect clients according to the documentation added in #38470.

With this patch it becomes possible to connect to a Spark Connect endpoint using

```
spark = SparkRemoteSession(user_id="martin", connection_string="sc://hostname/;use_ssl=true;token=abcd")
spark.read.table("test").limit(10).toPandas()
```

The connection string is parsed and filtered. This makes it possible to configure SSL and bearer token authentication dynamically. All remaining parameters are converted into gRPC metadata pairs and submitted as part of the request.
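
To make the "parsed and filtered" step concrete, here is a small sketch of how already-parsed parameters might be split into channel settings and metadata pairs. It is an illustration only: the reserved names `use_ssl` and `token` come from the example string above, while `ChannelSettings`, `split_params`, and the rule that only the literal value `true` enables SSL are assumptions, not the actual client code.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

# Assumption: only the two parameter names shown in this thread's example are reserved.
RESERVED_PARAMS = {"use_ssl", "token"}


class ChannelSettings(NamedTuple):
    """Hypothetical result of filtering the connection string parameters."""
    use_ssl: bool
    token: Optional[str]
    metadata: List[Tuple[str, str]]


def split_params(params: Dict[str, str]) -> ChannelSettings:
    """Separate reserved parameters from those forwarded as metadata pairs."""
    # Assumption: SSL is enabled only when the value is the string 'true' (case-insensitive).
    use_ssl = params.get("use_ssl", "false").lower() == "true"
    token = params.get("token")
    # Everything that is not reserved is kept, in order, as (key, value) pairs.
    metadata = [(key, value) for key, value in params.items()
                if key not in RESERVED_PARAMS]
    return ChannelSettings(use_ssl, token, metadata)
```

For example, `split_params({"use_ssl": "true", "token": "abcd", "x-tenant": "t1"})` keeps `x-tenant` as the only metadata pair; the key `x-tenant` is made up for this illustration.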

### Why are the changes needed?
User experience.

### Does this PR introduce _any_ user-facing change?
No, experimental API.

### How was this patch tested?
UT

Closes #38485 from grundprinzip/SPARK-41001.

Lead-authored-by: Martin Grund <[email protected]>
Co-authored-by: Martin Grund <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…String

Closes apache#38470 from grundprinzip/client-connection.

Lead-authored-by: Martin Grund <[email protected]>
Co-authored-by: Martin Grund <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…hon Client
