
Conversation

@yaooqinn
Member

@yaooqinn yaooqinn commented Jan 3, 2024

What changes were proposed in this pull request?

This PR adds a new parameter to `HiveThriftServer2.startWithContext` to tell the `ThriftCLIService`s whether to call `System.exit` or not when encountering errors. Currently, when developers call `HiveThriftServer2.startWithContext` and an error occurs, `System.exit` is performed, which stops the existing `SQLContext`/`SparkContext` and crashes the user app.
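
As a sketch only (not code from the PR), an embedding application could opt out of the exit-on-error behavior like this; the object and helper names and the Option-based handling are illustrative:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Illustrative helper, not part of the PR: shows the new parameter in use.
object ThriftServerLauncher {
  def tryStart(sqlContext: SQLContext): Option[HiveThriftServer2] = {
    try {
      // With exitOnError = false, a startup failure is surfaced to the caller
      // instead of System.exit stopping the SparkContext and the whole app.
      // Omitting the parameter keeps the previous exit-on-error behavior.
      Some(HiveThriftServer2.startWithContext(sqlContext, exitOnError = false))
    } catch {
      case e: Exception =>
        // The SparkContext is still alive here; the caller can log, retry, or fall back.
        System.err.println(s"Thrift server failed to start: ${e.getMessage}")
        None
    }
  }
}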

There is also such a use case in our tests: we intend to retry starting a thrift server three times in total, but a failed attempt might stop the underlying SparkContext early and fail the remaining retries.

For example
https://github.com/apache/spark/actions/runs/7271496487/job/19812142981

06:21:12.854 ERROR org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite: Error start hive server with Context 
org.scalatest.exceptions.TestFailedException: SharedThriftServer.this.tempScratchDir.exists() was true
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
	at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.startThriftServer(SharedThriftServer.scala:151)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$beforeAll$1(SharedThriftServer.scala:59)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
06:21:12.854 ERROR org.apache.hive.service.cli.thrift.ThriftCLIService: Error starting HiveServer2: could not start ThriftBinaryCLIService
java.lang.NullPointerException: Cannot invoke "org.apache.thrift.server.TServer.serve()" because "this.server" is null
	at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:135)
	at java.base/java.lang.Thread.run(Thread.java:840)
06:21:12.941 ERROR org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite: Error start hive server with Context 
java.lang.IllegalStateException: LiveListenerBus is stopped.
	at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:92)
	at org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:75)
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.createListenerAndUI(HiveThriftServer2.scala:74)
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.startWithContext(HiveThriftServer2.scala:66)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.startThriftServer(SharedThriftServer.scala:141)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$beforeAll$4(SharedThriftServer.scala:60)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
06:21:12.958 WARN org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite: 



[info] org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite *** ABORTED *** (151 milliseconds)
[info]   java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
[info] This stopped SparkContext was created at:
[info] 
[info] org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite.beforeAll(ThriftServerWithSparkContextSuite.scala:279)
[info] org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
[info] org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info] org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info] org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:69)
[info] org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info] org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info] sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info] java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info] java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[info] java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[info] java.base/java.lang.Thread.run(Thread.java:840)
[info] 
[info] The currently active SparkContext was created at:
[info] 
[info] (No active SparkContext.)
[info]   at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:122)
[info]   at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:115)
[info]   at org.apache.spark.sql.SparkSession.newSession(SparkSession.scala:274)
[info]   at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.startThriftServer(SharedThriftServer.scala:130)

Why are the changes needed?

  • Improve the programmability of HiveThriftServer2.startWithContext
  • Fix flakiness in tests

Does this PR introduce any user-facing change?

No. This is a developer API change only, and the default behavior is unchanged.

How was this patch tested?

Verified ThriftServerWithSparkContextInHttpSuite locally

18:20:02.840 ERROR org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite: A previous Hive's SessionState is leaked, aborting this retry
18:20:02.840 ERROR org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite: Error start hive server with Context
java.lang.IllegalStateException: HiveThriftServer2 started in binary mode while the test case is expecting HTTP mode
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$startThriftServer$2(SharedThriftServer.scala:149)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$startThriftServer$2$adapted(SharedThriftServer.scala:144)
	at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
	at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.startThriftServer(SharedThriftServer.scala:144)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$beforeAll$1(SharedThriftServer.scala:60)
18:20:04.114 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
18:20:04.114 WARN org.apache.hadoop.hive.metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore [email protected]
18:20:04.119 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
[info] - the scratch dir will not be exist (1 millisecond)
[info] - SPARK-29911: Uncache cached tables when session closed (376 milliseconds)

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Jan 3, 2024
@yaooqinn
Member Author

yaooqinn commented Jan 3, 2024

@dongjoon-hyun @LuciferYang @cloud-fan, PTAL, thanks

Member

Isn't it a little weird to have a configuration to fix flakiness?

Member Author

Thank you for pointing this out, @dongjoon-hyun. I completely agree that creating a new configuration just for a test fix would be excessive. In this case, however, we can address both the developer API issue and the test-side issue (a typical use case of it) at the same time, so the solution is not limited to just fixing the test.

Contributor

When would we want it to be false?

Member Author

For instance, when we retry starting the thrift server on the same sc. If this is true, the sc will be stopped on failure, and weird things follow in the next retries:


[info] org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite *** ABORTED *** (151 milliseconds)
[info]   java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
[info] This stopped SparkContext was created at:
[info] 
[info] org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite.beforeAll(ThriftServerWithSparkContextSuite.scala:279)
[info] org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
[info] org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info] org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info] org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:69)
[info] org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info] org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info] sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info] java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info] java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[info] java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[info] java.base/java.lang.Thread.run(Thread.java:840)
[info] 
[info] The currently active SparkContext was created at:
[info] 
[info] (No active SparkContext.)
[info]   at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:122)
[info]   at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:115)
[info]   at org.apache.spark.sql.SparkSession.newSession(SparkSession.scala:274)
[info]   at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.startThriftServer(SharedThriftServer.scala:130)

Contributor

How about non-thriftserver? Maybe we should clarify what "retry" means, so people can understand when to set this config.

Member Author

@yaooqinn yaooqinn Jan 9, 2024

How about non-thriftserver?

Since SparkContext registers a shutdown hook that performs self-termination, when a user's self-contained app calls System.exit, unexpectedly or intentionally, the context is torn down anyway.

The configuration only controls the thrift server's binding to the SparkContext, which shouldn't be affected by the shutdown hook. In other words, it does not affect non-thriftserver use cases.
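
To make the scope of the flag concrete, here is a rough Scala illustration of the behavior it toggles in the CLI service's error path; this is a simplified sketch of the idea, with made-up names, not the actual Hive or Spark thrift-server code:

// Simplified sketch only; not the actual ThriftCLIService error handling.
object ExitOnErrorSketch {
  def onStartupError(cause: Throwable, exitOnError: Boolean): Unit = {
    if (exitOnError) {
      // Previous (and default) behavior: tear down the whole JVM, which also
      // stops the SparkContext that an embedding application may still need.
      System.exit(-1)
    } else {
      // Behavior for embedded use: report the failure to the caller of
      // startWithContext and leave the SparkContext untouched. Spark's own
      // shutdown hook still covers the normal application-exit path.
      throw new IllegalStateException("Failed to start HiveThriftServer2", cause)
    }
  }
}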

Contributor

Will anyone set this config in production? It looks fishy to me to add a new config to fix flaky tests.

Member Author

Will anyone set this config in production?

This question is difficult to answer. But it can be used in production wherever startWithContext is used.

It looks fishy to me to add a new config to fix flaky tests.

Alternatively, we can modify startWithContext directly to add a new boolean parameter that works the same as the new config. But because the Spark thrift server is initialized from a HiveConf instance, we still need a new key to store this boolean value in order to pass it into the ThriftCLIService. Otherwise, we would need extra work to refactor the thrift-server bootstrapping.
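
A sketch of the approach described above, with an illustrative key name and helper methods (the actual key and code in the PR may differ): the boolean parameter is recorded in the HiveConf that bootstraps the server, so the Thrift service layer, which only sees that conf, can read it back.

import org.apache.hadoop.hive.conf.HiveConf

// Illustrative key name and helpers only; the real key in the PR may differ.
object ExitOnErrorConf {
  val ExitOnErrorKey = "spark.sql.thriftServer.internal.exitOnError"

  // Caller side: stash the boolean parameter in the HiveConf used to bootstrap the server.
  def withExitOnError(hiveConf: HiveConf, exitOnError: Boolean): HiveConf = {
    hiveConf.setBoolean(ExitOnErrorKey, exitOnError)
    hiveConf
  }

  // Service side: the Thrift CLI layer only sees the conf, so it reads the flag back,
  // defaulting to the old exit-on-error behavior.
  def shouldExitOnError(hiveConf: HiveConf): Boolean =
    hiveConf.getBoolean(ExitOnErrorKey, true)
}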

Contributor

we still need a new key

I think a new local key that works like a function parameter is better.

Member Author

The flu hit me. I'm sorry for not getting back to you sooner. The comments are addressed.

…lopApi retriable and fix flakiness of ThriftServerWithSparkContextInHttpSuite

  try {
-   hiveServer2 = HiveThriftServer2.startWithContext(sqlContext)
+   hiveServer2 = HiveThriftServer2.startWithContext(sqlContext, exitOnError = false)
Contributor

shall we add some comments to explain why we don't want to exit on error here?

Member Author

OK. Please check whether the comments added are informative or not.
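
For reference, a hedged sketch of how the call and the explanatory comment could sit inside the suite's retry pattern; the loop scaffolding and wording are approximations, not the merged SharedThriftServer code:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Approximation of the suite's retry pattern; not the merged SharedThriftServer code.
object RetryStartSketch {
  def startWithRetries(sqlContext: SQLContext, maxAttempts: Int = 3): Option[HiveThriftServer2] = {
    var server: Option[HiveThriftServer2] = None
    var attempt = 0
    while (server.isEmpty && attempt < maxAttempts) {
      attempt += 1
      try {
        // exitOnError = false: the suite retries on the same shared SparkContext,
        // so a failed attempt must not call System.exit and stop that context,
        // which would make every later attempt fail against a stopped SparkContext.
        server = Some(HiveThriftServer2.startWithContext(sqlContext, exitOnError = false))
      } catch {
        case e: Exception =>
          System.err.println(s"Attempt $attempt failed to start the thrift server: $e")
      }
    }
    server
  }
}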

serverPort = t.getPortNumber
if (t.isInstanceOf[ThriftBinaryCLIService] && mode == ServerMode.http) {
  logError("A previous Hive's SessionState is leaked, aborting this retry")
  throw new IllegalStateException("HiveThriftServer2 started in binary mode " +
Contributor

shall we use SparkException.internalError?

Member Author

addressed

@yaooqinn
Member Author

Thank you for the review, @cloud-fan & @dongjoon-hyun.

Merged to master.

@yaooqinn yaooqinn closed this in 5c3b36a Jan 11, 2024
@dongjoon-hyun
Member

Thank you, @yaooqinn and all.

@yaooqinn yaooqinn deleted the SPARK-46575 branch January 11, 2024 10:37
cloud-fan added a commit that referenced this pull request Mar 27, 2024
…text(SQLContext)` method for compatibility

### What changes were proposed in this pull request?
#44575 added a default parameter to the `HiveThriftServer2.startWithContext` API. Although this is source-compatible, it is not binary-compatible; for example, a binary built against the new code won't be able to run with the old code. In this PR, we maintain forward and backward compatibility by keeping the API the same and introducing a separate API for the additional parameter.

### Why are the changes needed?
See above

### Does this PR introduce _any_ user-facing change?
No, only a developer API change.

### How was this patch tested?
Existing tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45727 from dragqueen95/thriftserver-back-compat.

Lead-authored-by: Saksham Garg <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
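
As a sketch of the compatibility shape described in the commit message above, assuming an overload-style API (names and exact signatures in #45727 may differ): the original single-argument method keeps its compiled signature, and the extra parameter moves to a separate method.

import org.apache.spark.sql.SQLContext

// Sketch of the binary-compatibility shape only; actual names/signatures may differ.
trait ThriftServerApiSketch {
  // The original API keeps its exact signature, so binaries built against older Spark still link.
  def startWithContext(sqlContext: SQLContext): Unit =
    startWithContext(sqlContext, exitOnError = true)

  // The extra behavior is exposed through a separate overload rather than by adding a
  // default parameter, which would change the compiled signature of the original method.
  def startWithContext(sqlContext: SQLContext, exitOnError: Boolean): Unit
}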