@pengzhon-db pengzhon-db commented May 11, 2023

What changes were proposed in this pull request?

This change adds a new Spark Connect relation type CachedDataFrame, which can represent a DataFrame that has been cached on the server side.

On the server side, each (userId, sessionId) pair has a map that caches DataFrames. A DataFrame is removed from the cache when the corresponding session expires. (The caller can also evict the DataFrame from the cache earlier, depending on the logic.)

On the client side, a new relation type and function are added. The new function creates a DataFrame reference given a key. The key is the ID of a cached DataFrame, which is usually passed from the server to the client. When transforming the DataFrame reference, the server looks up the actual DataFrame in the cache and substitutes it.

One use case of this function is streaming foreachBatch(): the server needs to call the user function for every batch, and that function takes a DataFrame as an argument. With the new function, we can cache the DataFrame on the server and pass its ID back to the client, which can then create the DataFrame reference.
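
To make the flow concrete, here is a minimal Scala sketch of that round trip. All names and function parameters below are assumptions for illustration only; they are not APIs added by this PR.

import java.util.UUID
import org.apache.spark.sql.DataFrame

// Hypothetical server-side driver for one microbatch of foreachBatch().
def runUserForeachBatch(
    batchDf: DataFrame,
    batchId: Long,
    cachePut: (String, DataFrame) => Unit,      // assumed: puts the DataFrame into the server-side cache
    cacheRemove: String => Unit,                // assumed: evicts it from the cache
    callClientWithRef: (String, Long) => Unit   // assumed: asks the client to run the user function on a reference
): Unit = {
  val dfId = UUID.randomUUID().toString
  cachePut(dfId, batchDf)              // 1. cache the microbatch DataFrame on the server
  try {
    callClientWithRef(dfId, batchId)   // 2. the client builds a DataFrame reference from dfId and runs the
                                       //    user function; when that plan reaches the server, the reference
                                       //    is resolved back to the cached DataFrame
  } finally {
    cacheRemove(dfId)                  // 3. evict early instead of waiting for session expiry
  }
}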

Why are the changes needed?

This change is needed to support streaming foreachBatch() in Spark Connect.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Scala unit tests.
Manual testing.
(More end-to-end tests will be added when foreachBatch() is supported. Currently there is no way to add a DataFrame to the server cache using Python.)

@pengzhon-db pengzhon-db changed the title from "Spark connect function to create dataframe ref" to "[SPARK-43474] [SS] [CONNECT] Add a spark connect function to create DataFrame reference" on May 12, 2023
// Represents a DataFrame that has been cached on server.
message CachedDataFrame {
// (Required) An identifier of the user which cached the dataframe
string userId = 1;
Contributor Author

We can also just get userId and sessionId from the server via the request, instead of passing them from here.
But that would require updating transformRelation() to take two more parameters, which means all those transform...() methods would need to be updated to take two more parameters.

@pengzhon-db
Contributor Author

@rangadi can you review this PR?

@rangadi rangadi left a comment

Overall LGTM. Made a few comments.
@grundprinzip, @amaliujia, @hvanhovell could one of you take a quick look? This is used to represent a runtime DataFrame that is the result of a microbatch (needed for foreachBatch()).

@pengzhon-db pengzhon-db force-pushed the spark_connect_create_df_ref branch from fefed84 to 4924f91 on June 8, 2023 23:30
@rangadi rangadi left a comment

@HyukjinKwon could you take a look? This looks good to me.

val SessionHolder(userId, sessionId, session) = notification.getValue
val blockManager = session.sparkContext.env.blockManager
blockManager.removeCache(userId, sessionId)
cachedDataFrameManager.remove(userId, sessionId)

Note to self: this reference should be removed from the streaming engine once the foreach batch completes.

* accessed from the same user within the same session. The DataFrame will be removed from the
* cache when the session expires.
*/
private[connect] class SparkConnectCachedDataFrameManager extends Logging {
Contributor

nit: do we need Logging? It doesn't seem to be used.


Removed. See the continuation of this PR: #41580

Comment on lines +38 to +46
private val dataFrameCache = mutable.Map[(String, String), mutable.Map[String, DataFrame]]()

def put(userId: String, sessionId: String, dataFrameId: String, value: DataFrame): Unit =
synchronized {
val sessionKey = (userId, sessionId)
val sessionDataFrameMap = dataFrameCache
.getOrElseUpdate(sessionKey, mutable.Map[String, DataFrame]())
sessionDataFrameMap.put(dataFrameId, value)
}
Contributor
@zhenlineo zhenlineo Jun 12, 2023

How about using two concurrent hash maps + compute to avoid synchronized? For example:

    dataFrameCache.compute(sessionKey, (_, sessionDataFrameMap) => {
      val newMap =
        if (sessionDataFrameMap == null) new ConcurrentHashMap[String, DataFrame]()
        else sessionDataFrameMap
      newMap.put(dataFrameId, value)
      newMap // compute keeps whatever the function returns, so return the map itself
    })

Similar logic applies for remove.
For get, you can just read directly without explicit locking.
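
For illustration, the remove case this comment alludes to could look roughly like the following, under the same assumption that both levels are ConcurrentHashMaps (a sketch, not code from the PR):

dataFrameCache.computeIfPresent(sessionKey, (_, sessionDataFrameMap) => {
  sessionDataFrameMap.remove(dataFrameId)
  // Returning null drops the outer entry once the inner map is empty.
  if (sessionDataFrameMap.isEmpty) null else sessionDataFrameMap
})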

Contributor

Only my personal taste:

I feel like synchronized is easier to reason about than ConcurrentHashMap for code readers, unless switching to a concurrent data structure somehow brings a significant performance gain.


Agree, we could use ConcurrentHashMap, but I often end up preferring synchronized as well, since this is not perf critical (used only for certain DataFrames), though I am not sure if there is any perf difference.
Added a @GuardedBy annotation.
See the continuation of this PR here: https://github.com/apache/spark/pull/41580/files#diff-1a8933e9723f5497c3991441c7ff21fe43db63d483354af9a0113043ea600b3eR42
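
For context, a minimal sketch of what a @GuardedBy-annotated cache can look like (the class name here is hypothetical; see the linked follow-up PR for the real code):

import javax.annotation.concurrent.GuardedBy
import scala.collection.mutable
import org.apache.spark.sql.DataFrame

class CachedDataFrameManagerSketch {
  // Documents that every access must hold this object's monitor.
  @GuardedBy("this")
  private val dataFrameCache = mutable.Map[(String, String), mutable.Map[String, DataFrame]]()

  def put(userId: String, sessionId: String, dataFrameId: String, value: DataFrame): Unit =
    synchronized {
      dataFrameCache
        .getOrElseUpdate((userId, sessionId), mutable.Map[String, DataFrame]())
        .put(dataFrameId, value)
    }
}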

Member
@HyukjinKwon HyukjinKwon left a comment

Looks fine, but it would be great if @grundprinzip finds some time to take a look.

Contributor
@grundprinzip grundprinzip left a comment

I believe the better design would be to embed the relation cache into the SessionHolder, because that is already keyed on the user ID and session ID.
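
As a rough illustration of that suggestion (class and method names are assumptions, not the actual SessionHolder API), the per-session cache could live directly on the holder, making user/session scoping implicit:

import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.sql.{DataFrame, SparkSession}

case class SessionHolderSketch(userId: String, sessionId: String, session: SparkSession) {
  // Cached DataFrames for this session only; no (userId, sessionId) key is needed.
  private val dataFrameCache = new ConcurrentHashMap[String, DataFrame]()

  def cacheDataFrame(dataFrameId: String, df: DataFrame): Unit =
    dataFrameCache.put(dataFrameId, df)

  def getCachedDataFrame(dataFrameId: String): Option[DataFrame] =
    Option(dataFrameCache.get(dataFrameId))

  // Called when the session expires so cached DataFrames do not leak.
  def clearCachedDataFrames(): Unit = dataFrameCache.clear()
}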

Comment on lines +402 to +405
string userId = 1;

// (Required) An identifier of the Spark session in which the relation is cached
string sessionId = 2;
Contributor

The user and session IDs can't be trusted coming from the proto. The cached relation must only carry the unique relation ID; the rest is resolved from the context of the query.


Agree. This is important. Changed the implementation to use SparkSession as the key (it has a sessionUUID).
[continue the discussion here]

Comment on lines +792 to +795
SparkConnectService.cachedDataFrameManager
.get(rel.getUserId, rel.getSessionId, rel.getRelationId)
.logicalPlan
}
Contributor

Conceptually, the cached data should come from the session holder, which could be passed to the planner instead.


Agree. For now, proposing to keep it in a separate class. Continue the discussion here.
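
A rough sketch of what resolving the relation through a session holder could look like on the planner side (this reuses the hypothetical SessionHolderSketch above and is not the PR's actual wiring):

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical: resolve a CachedDataFrame relation against the session holder that is
// already associated with the current request, instead of a service-level global map.
def transformCachedDataFrame(sessionHolder: SessionHolderSketch, relationId: String): LogicalPlan = {
  sessionHolder
    .getCachedDataFrame(relationId)
    .getOrElse(throw new NoSuchElementException(s"No cached DataFrame for id $relationId"))
    .queryExecution
    .logical
}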

* accessed from the same user within the same session. The DataFrame will be removed from the
* cache when the session expires.
*/
private[connect] class SparkConnectCachedDataFrameManager extends Logging {
Contributor

Can we add this class to the session holder to make sure it is properly associated with the right user ID and session?


Discussed above. SessionHolder is not accessible yet. Also removed session_id and user_id from this cache, instead keying it on the actual SparkSession (user_id and session_id are implicit in that).

Comment on lines +40 to +46
def put(userId: String, sessionId: String, dataFrameId: String, value: DataFrame): Unit =
synchronized {
val sessionKey = (userId, sessionId)
val sessionDataFrameMap = dataFrameCache
.getOrElseUpdate(sessionKey, mutable.Map[String, DataFrame]())
sessionDataFrameMap.put(dataFrameId, value)
}
Contributor

This will also make this easier because you only need one concurrent map.


createDataFrame.__doc__ = PySparkSession.createDataFrame.__doc__

def _createCachedDataFrame(self, relationId: str) -> "DataFrame":
Contributor

this seems to be unused here?


Removed. It will be used in the foreachBatch implementation (in follow-up PRs).

@rangadi rangadi left a comment

Addressed the feedback in the continuation PR: #41580.
Please see the replies here, but comment in the above PR.

rangadi commented Jun 14, 2023

Please note that updates to this PR are in another PR: #41580

@hvanhovell hvanhovell closed this Jun 14, 2023
HyukjinKwon pushed a commit that referenced this pull request Jun 29, 2023
…frames by ID

[This is a continuation of #41146, to change the author of the PR. Retains the description.]

### What changes were proposed in this pull request?

This change adds a new Spark Connect relation type `CachedRemoteRelation`, which can represent a DataFrame that has been cached on the server side.

On the server side, each `SessionHolder` has a cache that maintains a mapping from DataFrame ID to the actual DataFrame.

On the client side, a new relation type and function are added. The new function creates a DataFrame reference given a key. The key is the ID of a cached DataFrame, which is usually passed from the server to the client. When transforming the DataFrame reference, the server looks up the actual DataFrame in the cache and substitutes it.

One use case of this function is streaming foreachBatch(): the server needs to call the user function for every batch, and that function takes a DataFrame as an argument. With the new function, we can cache the DataFrame on the server and pass its ID back to the client, which can then create the DataFrame reference.

### Why are the changes needed?

This change is needed to support streaming foreachBatch() in Spark Connect.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Scala unit tests.
Manual testing.
(More end-to-end tests will be added when foreachBatch() is supported. Currently there is no way to add a DataFrame to the server cache using Python.)

Closes #41580 from rangadi/df-ref.

Authored-by: Raghu Angadi <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>