@xupefei
Contributor

@xupefei xupefei commented Aug 20, 2024

What changes were proposed in this pull request?

This PR ports several tag-related APIs from Spark Connect back to Spark Core. Following the isolation practice of the original Connect API, the newly introduced APIs also support isolation:

  • interrupt{Tag,All,Operation} can only cancel jobs created by this Spark session.
  • {add,remove}Tag and {get,clear}Tags only apply to jobs created by this Spark session.

Unlike Spark Connect, which returns query IDs, these methods in Spark SQL return SQL execution root IDs, because "query IDs" exist only in Connect.
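
For illustration, a minimal sketch of how these session-scoped calls might be used from Scala. It assumes a Spark build that includes this change and a local master; the object name `SessionTagSketch`, the `run` helper, and the tag name "etl" are hypothetical and not part of the PR.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; only the SparkSession calls come from this PR.
object SessionTagSketch {
  /** Returns (tags after addTag, IDs returned by interruptTag, tags after clearTags). */
  def run(): (Set[String], Seq[String], Set[String]) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("session-tag-sketch")
      .getOrCreate()
    try {
      // Tag jobs created by this session.
      spark.addTag("etl")
      val tagged = spark.getTags()

      // Cancels only this session's jobs carrying the tag and returns the
      // SQL execution root IDs of whatever was cancelled.
      // No job is running in this sketch, so there is nothing to cancel.
      val cancelled = spark.interruptTag("etl")

      spark.removeTag("etl")
      spark.clearTags()
      (tagged, cancelled, spark.getTags())
    } finally {
      spark.stop()
    }
  }
}
```

In a real application the tagged work would run on another thread while `interruptTag` is called; that part is left out here to keep the sketch free of timing assumptions.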

Why are the changes needed?

To close the API gap between Connect and Core.

Does this PR introduce any user-facing change?

Yes, Core users can use some new APIs.

How was this patch tested?

New test added.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon
Member

Could we file a JIRA for the Python API set too? Just to make sure we don't miss it.

@xupefei
Contributor Author

xupefei commented Aug 21, 2024

> Could we file a JIRA for the Python API set too? Just to make sure we don't miss it.

Done! https://issues.apache.org/jira/browse/SPARK-49337

@xupefei xupefei requested a review from hvanhovell August 23, 2024 13:42
@xupefei xupefei marked this pull request as ready for review August 23, 2024 13:43
@xupefei
Contributor Author

xupefei commented Aug 23, 2024

@HyukjinKwon @hvanhovell This PR is now ready for review. Could you take a look? Thanks!

@HyukjinKwon
Member

I feel like what you're doing here is similar to JobArtifactSet. It has things to do with SparkContext, but we separated them into JobArtifactSet with a state so we can decouple Spark Core from Spark SQL.

@xupefei
Contributor Author

xupefei commented Aug 26, 2024

> I feel like what you're doing here is similar to JobArtifactSet. It has things to do with SparkContext, but we separated them into JobArtifactSet with a state so we can decouple Spark Core from Spark SQL.

Yes, exactly. The equivalent of JobArtifactSet.withActiveJobArtifactState is basically SparkSession.withActive.

  extensions,
- Map.empty)
+ Map.empty,
+ managedJobTags.asScala.toMap)
Contributor

qq, does this produce an immutable map (I think it should)? If so, then you don't have to force materialization.

Contributor

Unfortunately, it is a mutable concurrent map backed by the same underlying map, so changes propagate back to managedJobTags.
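
To make the point concrete, here is a small sketch of the difference between the `asScala` wrapper and a `toMap` snapshot. It assumes Scala 2.13 (`scala.jdk.CollectionConverters`); the `String -> String` typing of `managedJobTags`, the stored values, and the object name `SnapshotSketch` are assumptions for illustration only.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._

// Hypothetical stand-alone demo; not code from the PR.
object SnapshotSketch {
  /** Returns (live view sees the later write, snapshot sees the later write). */
  def run(): (Boolean, Boolean) = {
    // Stand-in for managedJobTags; key/value types are assumed.
    val managedJobTags = new ConcurrentHashMap[String, String]()
    managedJobTags.put("tag1", "internal-tag1")

    val liveView = managedJobTags.asScala        // mutable wrapper over the same map
    val snapshot = managedJobTags.asScala.toMap  // immutable copy taken at this point

    // A later mutation, e.g. another addTag on the session.
    managedJobTags.put("tag2", "internal-tag2")

    (liveView.contains("tag2"), snapshot.contains("tag2"))
  }
}
```

The wrapper observes the later write while the copy does not, which is why the explicit `toMap` is needed when a stable snapshot is passed along.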

Contributor

@hvanhovell hvanhovell left a comment

LGTM. I left a few minor comments. Let me know if you want to address them now or in a follow-up. Two follow-ups here: we need to add this to PySpark, and we need to homogenize this with the Connect implementation.

  private def nextExecutionId: Long = _nextExecutionId.getAndIncrement

- private val executionIdToQueryExecution = new ConcurrentHashMap[Long, QueryExecution]()
+ private[sql] val executionIdToQueryExecution = new ConcurrentHashMap[Long, QueryExecution]()
Member

Hm, I like the current implementation/separation, but doing this in SQLExecution might have some corner cases; e.g., df.rdd.collect() won't be cancelled. One option is to explicitly document that cancellation only works with the SQL/DataFrame API.

Member

BTW, does this work with streaming queries too? I am fine with doing it in a follow-up but would like to make sure the Spark Connect and Classic versions behave the same.

Contributor Author

@xupefei xupefei Sep 6, 2024

Thanks for the comment. I'll add it to the documentation.
I'll check streaming and follow up in a later PR.

Contributor Author

> does this work with streaming queries too?

No, it doesn't. I have to change

to make it work. Will do as a follow-up.

@xupefei
Contributor Author

xupefei commented Sep 6, 2024

> LGTM. I left a few minor comments. Let me know if you want to address them now or in a follow-up. Two follow-ups here: we need to add this to PySpark, and we need to homogenize this with the Connect implementation.

I'll address most comments in this PR. Currently, I am being distracted by something else, but will come back very soon.

@hvanhovell
Contributor

Merging to master.

@asfgit asfgit closed this in fd8e99b Sep 18, 2024
@xupefei xupefei deleted the reverse-api-tag branch September 18, 2024 07:20
Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @xupefei , @hvanhovell , @HyukjinKwon , @mridulm .

This PR seems to introduce a new flaky test. I filed a JIRA. Please take a look, because the failure showed up within 12 hours. It might be very flaky.

@dongjoon-hyun
Member

FYI, there are two open JIRA issues in the interrupt and cancellation area.

@dongjoon-hyun
Member

Ping once more, @xupefei and @hvanhovell . Could you fix the flakiness or disable it (if you are busy), please?

SparkSessionJobTaggingAndCancellationSuite:
...
- Cancellation APIs in SparkSession are isolated *** FAILED ***

@HyukjinKwon
Member

@xupefei mind taking a look please?

@xupefei
Contributor Author

xupefei commented Oct 23, 2024

On it.

@xupefei
Contributor Author

xupefei commented Oct 23, 2024

Trying out a fix at #48622.

} finally {
fpool.shutdownNow()
}
}
Contributor

Hi @xupefei, could you add the test case below? It does not seem to work properly in a multithreaded environment.

  test("Tags are isolated in multithreaded environment") {
    // Custom thread pool for multi-threaded testing
    val threadPool = Executors.newFixedThreadPool(2)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(threadPool)

    val session = SparkSession.builder().master("local").getOrCreate()
    @volatile var output1: Set[String] = null
    @volatile var output2: Set[String] = null

    def tag1(): Unit = {
      session.addTag("tag1")
      output1 = session.getTags()
    }

    def tag2(): Unit = {
      session.addTag("tag2")
      output2 = session.getTags()
    }

    try {
      // Run tasks in separate threads
      val future1 = Future {
        tag1()
      }
      val future2 = Future {
        tag2()
      }

      // Wait for threads to complete
      ThreadUtils.awaitResult(Future.sequence(Seq(future1, future2)), 1.minute)

      // Assert outputs
      assert(output1 != null)
      assert(output1 == Set("tag1"))
      assert(output2 != null)
      assert(output2 == Set("tag2"))
    } finally {
      threadPool.shutdownNow()
    }
  }

Contributor

FYI: you might need to import java.util.concurrent.Executors for the test case above.

Contributor

FYI: there is also a corresponding test in the Spark Connect Python client:

def test_tags_multithread(self):
    output1 = None
    output2 = None

    def tag1():
        nonlocal output1
        self.spark.addTag("tag1")
        output1 = self.spark.getTags()

    def tag2():
        nonlocal output2
        self.spark.addTag("tag2")
        output2 = self.spark.getTags()

    t1 = threading.Thread(target=tag1)
    t1.start()
    t1.join()
    t2 = threading.Thread(target=tag2)
    t2.start()
    t2.join()
    self.assertIsNotNone(output1)
    self.assertEquals(output1, {"tag1"})
    self.assertIsNotNone(output2)
    self.assertEquals(output2, {"tag2"})

Contributor Author

This PR isolates tags at the session level, not at the thread level.
Two threads that use the same Spark session share the same set of tags.
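
The two semantics can be contrasted with a toy model. This is only a sketch: `SessionScopedTags`, `ThreadScopedTags`, and `TagScopeSketch` are hypothetical classes written for this comparison, not Spark code, and the thread-scoped variant uses a plain `ThreadLocal` as an assumed stand-in for what the multithreaded tests above expect.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._

// Toy model: one tag set shared by every thread that uses the object.
final class SessionScopedTags {
  private val tags = ConcurrentHashMap.newKeySet[String]()
  def addTag(tag: String): Unit = { tags.add(tag); () }
  def getTags(): Set[String] = tags.asScala.toSet
}

// Toy model: each thread sees only the tags it added itself.
final class ThreadScopedTags {
  private val tags = new ThreadLocal[Set[String]] {
    override def initialValue(): Set[String] = Set.empty
  }
  def addTag(tag: String): Unit = tags.set(tags.get() + tag)
  def getTags(): Set[String] = tags.get()
}

object TagScopeSketch {
  /** Runs `body` on a fresh thread, waits for it, and returns its result. */
  def onNewThread[A](body: => A): A = {
    var result: Option[A] = None
    val t = new Thread(() => { result = Some(body) })
    t.start()
    t.join()
    result.get
  }

  /** Returns ((session-scoped results), (thread-scoped results)) for two threads. */
  def run(): ((Set[String], Set[String]), (Set[String], Set[String])) = {
    val shared = new SessionScopedTags
    val s1 = onNewThread { shared.addTag("tag1"); shared.getTags() }
    val s2 = onNewThread { shared.addTag("tag2"); shared.getTags() }

    val perThread = new ThreadScopedTags
    val t1 = onNewThread { perThread.addTag("tag1"); perThread.getTags() }
    val t2 = onNewThread { perThread.addTag("tag2"); perThread.getTags() }

    ((s1, s2), (t1, t2))
  }
}
```

With the session-scoped model the second thread also sees "tag1", so an assertion like `output2 == Set("tag2")` fails; with the thread-scoped model each thread sees only its own tag.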

Member

But that has different semantics from the existing Spark Connect API. Is it WIP?

Contributor Author

PR: #48906
