[SPARK-49249][SPARK-49320] Add new tag-related APIs in Connect back to Spark Core #47815

xupefei · 2024-08-20T12:26:23Z

What changes were proposed in this pull request?

This PR adds several new tag-related APIs in Connect back to Spark Core. Following the isolation practice in the original Connect API, the newly introduced APIs also support isolation:

- interrupt{Tag,All,Operation} can only cancel jobs created by this Spark session.
- {add,remove}Tag and {get,clear}Tags only apply to jobs created by this Spark session.

Instead of returning query IDs as in Spark Connect, these methods return SQL execution root IDs here in Spark SQL, because "query IDs" exist only in Connect.
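To make the isolation rule above concrete, here is a minimal toy model in plain Python. It is only a sketch of the described semantics, not Spark code: `ToyScheduler`, `ToySession`, `ToyJob`, and all method names are hypothetical stand-ins, and the returned integers merely play the role of the SQL execution root IDs mentioned above.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class ToyJob:
    """Hypothetical job record; not a Spark class."""
    execution_id: int
    session_id: int
    tags: frozenset
    cancelled: bool = False


class ToyScheduler:
    """Hypothetical scheduler shared by all sessions (assumption for illustration)."""

    def __init__(self):
        self._ids = count()
        self.jobs = []

    def submit(self, session_id, tags):
        job = ToyJob(next(self._ids), session_id, frozenset(tags))
        self.jobs.append(job)
        return job


class ToySession:
    """Hypothetical session that owns its own tag set."""

    _session_ids = count()

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.session_id = next(ToySession._session_ids)
        self._tags = set()

    def add_tag(self, tag):
        self._tags.add(tag)

    def get_tags(self):
        return set(self._tags)

    def run(self):
        # A new job carries a snapshot of this session's current tags.
        return self.scheduler.submit(self.session_id, self._tags)

    def interrupt_tag(self, tag):
        # Cancel only jobs that were created by this session and carry the tag,
        # and return the IDs of the executions that were cancelled.
        cancelled = []
        for job in self.scheduler.jobs:
            if job.session_id == self.session_id and tag in job.tags and not job.cancelled:
                job.cancelled = True
                cancelled.append(job.execution_id)
        return cancelled


scheduler = ToyScheduler()
a, b = ToySession(scheduler), ToySession(scheduler)
a.add_tag("etl")
b.add_tag("etl")
job_a, job_b = a.run(), b.run()

# Both jobs carry the tag "etl", but only the job created by session `a` is cancelled.
cancelled_ids = a.interrupt_tag("etl")
```

In this model the tag name alone never identifies a job; the owning session is always part of the match, which is the property the description asks for.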

Why are the changes needed?

To close the API gap between Connect and Core.

Does this PR introduce any user-facing change?

Yes, Core users can use some new APIs.

How was this patch tested?

New test added.

Was this patch authored or co-authored using generative AI tooling?

No.

sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala

HyukjinKwon · 2024-08-21T04:00:24Z

Could we file a JIRA for the Python API set too? Just to make sure we don't miss it.

xupefei · 2024-08-21T14:44:51Z

> Could we file a JIRA for the Python API set too? Just to make sure we don't miss it.

Done! https://issues.apache.org/jira/browse/SPARK-49337

xupefei · 2024-08-23T14:06:22Z

@HyukjinKwon @hvanhovell This PR is now ready for review. Could you take a look? Thanks!

HyukjinKwon · 2024-08-26T09:58:57Z

I feel like what you're doing here is similar to JobArtifactSet. It has to do with SparkContext, but we separated it out into JobArtifactSet with a state so we can decouple Spark Core from Spark SQL.

xupefei · 2024-08-26T10:25:22Z

> I feel like what you're doing here is similar to JobArtifactSet. It has to do with SparkContext, but we separated it out into JobArtifactSet with a state so we can decouple Spark Core from Spark SQL.

Yes exactly. Basically the equivalent of JobArtifactSet.withActiveJobArtifactState is SparkSession.withActive.

hvanhovell · 2024-09-05T12:54:07Z

sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala

      extensions,
-      Map.empty)
+      Map.empty,
+      managedJobTags.asScala.toMap)


qq, does this produce an immutable map (I think it should)? If so, then you don't have to force materialization.

Unfortunately, it is a mutable concurrent map - and is backed by the same map (so changes propagate back to managedJobTags).
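The difference between a live wrapper and a materialized copy can be shown with a small analogy in Python. This is not the PR's code: `managed_job_tags` and its values are made-up placeholders, `MappingProxyType` stands in for a wrapper that is backed by the original map, and `dict(...)` stands in for the forced materialization.

```python
from types import MappingProxyType

# Hypothetical mutable tag map owned by the parent session (values are made up).
managed_job_tags = {"tag1": "internal-tag1"}

# A live, read-only view: it is backed by the same underlying dict.
live_view = MappingProxyType(managed_job_tags)

# A snapshot: an independent copy taken at this moment.
snapshot = dict(managed_job_tags)

# A later change to the original map...
managed_job_tags["tag2"] = "internal-tag2"

# ...is visible through the view but not in the snapshot.
view_sees_change = "tag2" in live_view
snapshot_sees_change = "tag2" in snapshot
```

If the cloned session only held the view, tag changes in either session would leak into the other, which is why a copy is taken at clone time.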

sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala

hvanhovell

LGTM. I left a few minor comments. Let me know whether you want to address them now or in a follow-up. Two follow-ups here: we need to add this to PySpark, and we need to homogenize this with the Connect implementation.

HyukjinKwon · 2024-09-06T06:18:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala

  private def nextExecutionId: Long = _nextExecutionId.getAndIncrement

-  private val executionIdToQueryExecution = new ConcurrentHashMap[Long, QueryExecution]()
+  private[sql] val executionIdToQueryExecution = new ConcurrentHashMap[Long, QueryExecution]()


Hm I like the current implementation/separation but doing this in SQLExecution might have some corner cases, e.g., df.rdd.collect() won't be cancelled. One way is to explicitly document that the cancellations only work with SQL/DataFrame API.

BTW, does this work for streaming queries too? I am fine with doing it in a follow-up, but would like to make sure the Spark Connect and Classic versions behave the same.

Thanks for the comment. I'll add it to the documentation.
I'll check streaming and follow up in a later PR.

> does this work for streaming queries too?

No, it doesn't. I have to change

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

Line 313 in 8023504

sparkSessionForStream.withActive {

to make it work. Will do as a follow-up.

xupefei · 2024-09-06T18:07:15Z

> LGTM. I left a few minor comments. Let me know whether you want to address them now or in a follow-up. Two follow-ups here: we need to add this to PySpark, and we need to homogenize this with the Connect implementation.

I'll address most comments in this PR. I'm currently tied up with something else, but will come back to this very soon.

# Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala

hvanhovell · 2024-09-18T03:05:46Z

Merging to master.

dongjoon-hyun

Hi, @xupefei , @hvanhovell , @HyukjinKwon , @mridulm .

This PR seems to have introduced a new flaky test. I filed a JIRA. Please take a look, because the failure showed up within 12 hours. It might be very flaky.

https://issues.apache.org/jira/browse/SPARK-49696
- https://github.com/apache/spark/actions/runs/10915451051/job/30295259985

dongjoon-hyun · 2024-09-18T14:40:39Z

FYI, there are two open JIRA issues in the interrupt and cancellation area.

dongjoon-hyun · 2024-10-17T15:24:03Z

Ping once more, @xupefei and @hvanhovell . Could you fix the flakiness or disable it (if you are busy), please?

https://github.com/apache/spark/actions/runs/11353570330/job/31654735430

SparkSessionJobTaggingAndCancellationSuite:
...
- Cancellation APIs in SparkSession are isolated *** FAILED ***

HyukjinKwon · 2024-10-23T10:20:01Z

@xupefei mind taking a look please?

xupefei · 2024-10-23T14:34:24Z

On it.

xupefei · 2024-10-23T15:19:02Z

Trying out a fix at #48622.

itholic · 2024-11-19T01:29:04Z

sql/core/src/test/scala/org/apache/spark/sql/SparkSessionJobTaggingAndCancellationSuite.scala

+    } finally {
+      fpool.shutdownNow()
+    }
+  }


Hi @xupefei, could you add the test case below? It does not seem to work properly in a multithreaded environment.

test("Tags are isolated in multithreaded environment") {
  // Custom thread pool for multi-threaded testing
  val threadPool = Executors.newFixedThreadPool(2)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(threadPool)
  val session = SparkSession.builder().master("local").getOrCreate()
  @volatile var output1: Set[String] = null
  @volatile var output2: Set[String] = null

  def tag1(): Unit = {
    session.addTag("tag1")
    output1 = session.getTags()
  }

  def tag2(): Unit = {
    session.addTag("tag2")
    output2 = session.getTags()
  }

  try {
    // Run tasks in separate threads
    val future1 = Future { tag1() }
    val future2 = Future { tag2() }

    // Wait for threads to complete
    ThreadUtils.awaitResult(Future.sequence(Seq(future1, future2)), 1.minute)

    // Assert outputs
    assert(output1 != null)
    assert(output1 == Set("tag1"))
    assert(output2 != null)
    assert(output2 == Set("tag2"))
  } finally {
    threadPool.shutdownNow()
  }
}

FYI: you might need to import java.util.concurrent.Executors for the test case above.

FYI: there is also a corresponding test in the Spark Connect Python client:

spark/python/pyspark/sql/tests/connect/test_session.py

Lines 122 to 148 in b61411d

    def test_tags_multithread(self):
        output1 = None
        output2 = None

        def tag1():
            nonlocal output1

            self.spark.addTag("tag1")
            output1 = self.spark.getTags()

        def tag2():
            nonlocal output2

            self.spark.addTag("tag2")
            output2 = self.spark.getTags()

        t1 = threading.Thread(target=tag1)
        t1.start()
        t1.join()
        t2 = threading.Thread(target=tag2)
        t2.start()
        t2.join()

        self.assertIsNotNone(output1)
        self.assertEquals(output1, {"tag1"})
        self.assertIsNotNone(output2)
        self.assertEquals(output2, {"tag2"})

This PR isolates tags at the session level, not at the thread level.
Two threads can use the same Spark session, and they will then share the same set of tags.

But that has different semantics from the existing Spark Connect API. Is it WIP?
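The two semantics under discussion can be contrasted in a stdlib-only Python sketch. Both classes are hypothetical models, not Spark or Spark Connect code: `SessionScopedTags` keeps one tag set per object, as described for this PR, while `ThreadScopedTags` uses `threading.local` to model the per-thread behavior that the tests quoted above expect. The threads run one after another, mirroring the Python test.

```python
import threading


class SessionScopedTags:
    """Hypothetical model: one tag set shared by every thread using the object."""

    def __init__(self):
        self._tags = set()

    def add_tag(self, tag):
        self._tags.add(tag)

    def get_tags(self):
        return set(self._tags)


class ThreadScopedTags:
    """Hypothetical model: each thread sees only the tags it added itself."""

    def __init__(self):
        self._local = threading.local()

    def _tags(self):
        # Lazily create a separate tag set for the calling thread.
        if not hasattr(self._local, "tags"):
            self._local.tags = set()
        return self._local.tags

    def add_tag(self, tag):
        self._tags().add(tag)

    def get_tags(self):
        return set(self._tags())


def tags_seen_by_two_threads(store):
    """Add "tag1" and "tag2" from two threads, run sequentially, and record what each sees."""
    outputs = {}

    def worker(tag):
        store.add_tag(tag)
        outputs[tag] = store.get_tags()

    for tag in ("tag1", "tag2"):
        t = threading.Thread(target=worker, args=(tag,))
        t.start()
        t.join()
    return outputs


session_result = tags_seen_by_two_threads(SessionScopedTags())
thread_result = tags_seen_by_two_threads(ThreadScopedTags())
```

With the session-scoped model the second thread also sees `tag1`, so an assertion like `output2 == {"tag2"}` fails; with the thread-scoped model it holds.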

xupefei added 3 commits August 20, 2024 14:06

tags api

e93cedd

Merge branch 'master' of github.com:apache/spark into reverse-api-tag

694db1b

rename

40610a7

hvanhovell reviewed Aug 20, 2024

sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala

hvanhovell reviewed Aug 20, 2024

sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala

github-actions bot added SQL CORE CONNECT labels Aug 20, 2024

xupefei added 2 commits August 21, 2024 16:25

address comments

a70d7d2

.

0656e25

xupefei added 2 commits August 21, 2024 16:59

.

6b6ca7f

new approach

d3cd5f5

xupefei requested a review from hvanhovell August 23, 2024 13:42

xupefei marked this pull request as ready for review August 23, 2024 13:43