[SPARK-32242][SQL] CliSuite flakiness fix via differentiating cli driver bootup timeout and query execution timeout #29036
|
Rationale: I've been looking into the recent failures of CliSuite and realized that it took around 40 seconds for the boot-up message to appear. This PR is a POC to validate whether it helps to differentiate the CLI driver boot-up timeout (with enough time to avoid failing on a slow environment) from the query execution timeout. If this PR passes the build (at least for CliSuite) 5 times in a row, it is considerably better than the current state. You'll be surprised if you look back at how often the failures occurred, and I think you'll agree that 5 consecutive passing builds can be the end condition. |
|
retest this, please |
|
retest this, please |
|
Test build #125339 has finished for PR 29036 at commit
|
|
I’ll temporarily disable gendoc here soon to unblock experiments. |
|
Test build #125338 has finished for PR 29036 at commit
|
|
Test build #125341 has finished for PR 29036 at commit
|
…and query execution timeout
ab1acf6 to
3f166b8
Compare
|
retest this, please |
|
retest this, please |
|
Test build #125350 has started for PR 29036 at commit |
|
retest this, please |
|
Test build #125351 has started for PR 29036 at commit |
|
retest this, please |
|
Test build #125353 has started for PR 29036 at commit |
|
Test build #125347 has finished for PR 29036 at commit
|
|
Test build #125348 has finished for PR 29036 at commit
|
|
Summary of the 5 builds:
All tests in CliSuite passed.
A bunch of tests in CliSuite failed, though it looks like the build was super slow; a bunch of other suites failed as well.
Only one test in CliSuite failed, and it wasn't about timeout -
All tests in CliSuite passed.
Except the failures of Looks like this fix helps mitigating the flakiness. Let me update the patch to not fail the query for |
|
retest this, please |
|
retest this, please |
|
retest this, please |
|
retest this, please |
|
Test build #125404 has finished for PR 29036 at commit
|
|
retest this, please |
|
Test build #125406 has finished for PR 29036 at commit
|
|
Test build #125407 has finished for PR 29036 at commit
|
|
Test build #125409 has finished for PR 29036 at commit
|
|
Test build #125417 has finished for PR 29036 at commit
|
|
Another summary for next set of builds
All suites passed
All suites passed
All suites passed
Only one suite failed - #29039 is to fix flakiness, so please cross-check this and #29039 together
All suites passed |
This reverts commit 3f166b8.
srowen
left a comment
If it passes, looks good.
|
Test build #125471 has finished for PR 29036 at commit
|
|
@yaooqinn can you take a look? |
|
Thanks for pinging me, @cloud-fan. "It took around 40 seconds for boot-up" for a local-mode backend with the UI disabled sounds a bit weird to me.
Spark SQL CLI command line: ../../bin/spark-sql --master local \
--driver-java-options -Dderby.system.durability=test \
--driver-class-path /home/jenkins/workspace/SparkPullRequestBuilder@3/sql/hive-thriftserver/src/test/noclasspath \
--conf spark.ui.enabled=false \
--hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-64c2cb57-c790-4fa1-a682-92f54805fbe3;create=true \
--hiveconf hive.exec.scratchdir=/home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-46cc494d-0aa3-455f-aeb8-7232c7634a1a \
--hiveconf conf1=conftest \
--hiveconf conf2=1
Exception: java.util.concurrent.TimeoutException: Futures timed out after [1 minute]
But I am also +1 on this approach. |
|
Thanks for the feedback! Merging to master. As with #29039, we can backport this whenever we find the flakiness in other branches, so it should be OK to start with only the master branch. |
What changes were proposed in this pull request?
This patch tries to mitigate the flakiness of CliSuite via the changes below:
CLI driver boot-up is determined by the master and app ID message. Given that spark-sql doesn't print the message if the -e option is specified, the patch simply adds 2 minutes to the timeout in that case to cover the boot-up time.
Don't fail the test even if spark-sql doesn't shut down gracefully within 1 minute.
Extend the timeout for the "path command" test in CliSuite.
Why are the changes needed?
It took around 40 seconds for the boot-up message (master: ... Application Id: ...) to be printed to stderr, while the overall timeout is 1 minute in many tests. In this case the actual timeout for query execution is just 20 seconds, which may not be enough.
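The arithmetic behind that 20-second figure can be written out as a small sketch. The numbers come from the description above; the function and constant names are hypothetical and only illustrate how a shared timeout is eaten by boot-up time, not how CliSuite itself computes anything.

```python
# Illustrative numbers taken from the description above.
OVERALL_TIMEOUT_S = 60.0  # single timeout shared by boot-up and queries
BOOTUP_S = 40.0           # observed time until the boot-up message appears


def remaining_query_budget(overall_s: float, bootup_s: float) -> float:
    """Time left for query execution when boot-up shares the same timeout.

    Hypothetical helper: clamps at zero when boot-up alone exceeds the timeout.
    """
    return max(0.0, float(overall_s) - float(bootup_s))


# With one shared timeout, only 20 seconds remain for the queries.
shared_budget = remaining_query_budget(OVERALL_TIMEOUT_S, BOOTUP_S)
```

With a separate boot-up timeout, the query budget no longer depends on how slow the environment is at startup.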
Some of the tests also failed with org.scalatest.exceptions.TestFailedException: spark-sql did not exit gracefully, which I don't think should fail the test.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Verified with multiple triggers of Jenkins builds
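The idea of separating the boot-up timeout from the query execution timeout can be sketched roughly as follows. This is a minimal illustration in Python, not the actual CliSuite change (which is Scala): the queue of output lines stands in for the captured spark-sql output, and wait_for, run_with_two_timeouts, and the returned status strings are hypothetical names. The "Application Id" marker is assumed from the boot-up message quoted in the description.

```python
import queue
import time


def wait_for(lines: "queue.Queue[str]", marker: str, timeout_s: float) -> bool:
    """Return True once a line containing `marker` arrives within timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        try:
            line = lines.get(timeout=remaining)
        except queue.Empty:
            return False
        if marker in line:
            return True


def run_with_two_timeouts(
    lines: "queue.Queue[str]",
    bootup_marker: str,
    expected_answer: str,
    bootup_timeout_s: float,
    query_timeout_s: float,
) -> str:
    """Wait for boot-up and for the query answer with independent budgets."""
    # Phase 1: a generous timeout that only covers driver boot-up.
    if not wait_for(lines, bootup_marker, bootup_timeout_s):
        return "bootup-timeout"
    # Phase 2: the query timeout starts only after boot-up has been observed.
    if not wait_for(lines, expected_answer, query_timeout_s):
        return "query-timeout"
    return "ok"
```

Because the second deadline is computed only after the first phase succeeds, a slow boot-up no longer shortens the time available to the queries.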