Skip to content

Conversation

@HeartSaVioR
Copy link
Contributor

@HeartSaVioR HeartSaVioR commented Jul 8, 2020

What changes were proposed in this pull request?

This patch tries to mitigate the flakiness of CliSuite, via below changes:

  1. differentiate cli driver boot-up timeout (2 mins) and query execution timeout (parameter)

Cli driver boot-up is determined by master and app ID message. Given spark-sql doesn't print the message if -e option is specified, the patch simply add 2 mins on timeout for the case to cover the boot-up timeout.

  1. don't fail the test even spark-sql doesn't gracefully shut down in 1 min.

  2. extend timeout for path command test in CliSuite

Why are the changes needed?

It took around 40 seconds for boot-up message (master: ... Application Id: ...) to be printed in stderr, while the overall timeout is 1 minute in many tests. This case the actual timeout for query execution is just 20 seconds, which may not be enough.

Some of the tests also failed with org.scalatest.exceptions.TestFailedException: spark-sql did not exit gracefully, which I don't feel the test has to be failed.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Verified with multiple triggers of Jenkins builds

@HeartSaVioR
Copy link
Contributor Author

HeartSaVioR commented Jul 8, 2020

Rationalization: I've been looking into the recent failures of CliSuite, and realized it took around 40 seconds for bootup message (master: ... Application Id: ...) to be printed in stderr, while the overall timeout is 1 minute in many tests.

This PR is a POC to validate whether it helps if we can differentiate cli driver bootup timeout (with enough time to avoid failing on slow env) and query execution timeout. If this PR succeeds to pass the build (at least for CliSuite) 5 times sequentially, it becomes pretty much better than current.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125152/testReport/junit/org.apache.spark.sql.hive.thriftserver/CliSuite/history

You'll be surprised if you go backward and see the occurrence of failures, and agree that 5 sequential builds passing can be the end condition.

@HeartSaVioR
Copy link
Contributor Author

retest this, please

1 similar comment
@HeartSaVioR
Copy link
Contributor Author

retest this, please

@SparkQA
Copy link

SparkQA commented Jul 8, 2020

Test build #125339 has finished for PR 29036 at commit ab1acf6.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor Author

I’ll temporarily disable gendoc here soon to unblock experiments.

@HeartSaVioR HeartSaVioR marked this pull request as draft July 8, 2020 09:14
@SparkQA
Copy link

SparkQA commented Jul 8, 2020

Test build #125338 has finished for PR 29036 at commit ab1acf6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 8, 2020

Test build #125341 has finished for PR 29036 at commit ab1acf6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR HeartSaVioR force-pushed the clisuite-flakiness-fix branch from ab1acf6 to 3f166b8 Compare July 8, 2020 10:41
@HeartSaVioR
Copy link
Contributor Author

retest this, please

1 similar comment
@HeartSaVioR
Copy link
Contributor Author

retest this, please

@SparkQA
Copy link

SparkQA commented Jul 8, 2020

Test build #125350 has started for PR 29036 at commit 3f166b8.

@HeartSaVioR
Copy link
Contributor Author

retest this, please

@SparkQA
Copy link

SparkQA commented Jul 8, 2020

Test build #125351 has started for PR 29036 at commit 3f166b8.

@HeartSaVioR
Copy link
Contributor Author

retest this, please

@SparkQA
Copy link

SparkQA commented Jul 8, 2020

Test build #125353 has started for PR 29036 at commit 3f166b8.

@SparkQA
Copy link

SparkQA commented Jul 8, 2020

Test build #125347 has finished for PR 29036 at commit 3f166b8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 8, 2020

Test build #125348 has finished for PR 29036 at commit 3f166b8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor Author

Summary of the 5 builds:

125347 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125347/testReport/)

All tests in CliSuite passed.

125348 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125348/testReport/)

A bunch of tests in CliSuite failed, though looks like the build was super slow, bunch of other suites also failed as well.

125350 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125350/testReport/)

Only one test in CliSuite failed, and it wasn't about timeout - org.scalatest.exceptions.TestFailedException: spark-sql did not exit gracefully.. I wonder we really have to fail the test for the case.

125351 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125351/testReport/)

All tests in CliSuite passed.

125353 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125353/testReport/)

Except the failures of org.scalatest.exceptions.TestFailedException: spark-sql did not exit gracefully., only one test failed (path command), and boot timeout actually helped to run the query. Just because query execution timeout was short.

Looks like this fix helps mitigating the flakiness. Let me update the patch to not fail the query for spark-sql did not exit gracefully, and give more time on path command.

@HeartSaVioR
Copy link
Contributor Author

retest this, please

3 similar comments
@HeartSaVioR
Copy link
Contributor Author

retest this, please

@HeartSaVioR
Copy link
Contributor Author

retest this, please

@HeartSaVioR
Copy link
Contributor Author

retest this, please

@SparkQA
Copy link

SparkQA commented Jul 8, 2020

Test build #125404 has finished for PR 29036 at commit 0f3384a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor Author

retest this, please

@SparkQA
Copy link

SparkQA commented Jul 8, 2020

Test build #125406 has finished for PR 29036 at commit 0f3384a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 9, 2020

Test build #125407 has finished for PR 29036 at commit 0f3384a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 9, 2020

Test build #125409 has finished for PR 29036 at commit 0f3384a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 9, 2020

Test build #125417 has finished for PR 29036 at commit 0f3384a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor Author

Another summary for next set of builds

125404 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125404/testReport/)

All suites passed

125406 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125406/testReport/)

All suites passed

125407 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125407/testReport/)

All suites passed

125409 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125409/testReport/)

Only one suite failed - org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)

#29039 is to fix flakiness, so please cross-check this and #29039 together

125417 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125417/testReport/)

All suites passed

@HeartSaVioR HeartSaVioR changed the title [WIP][SPARK-XXXXX][SQL] CliSuite flakiness fix via differentiating cli driver bootup timeout and query execution timeout [SPARK-32242][SQL] CliSuite flakiness fix via differentiating cli driver bootup timeout and query execution timeout Jul 9, 2020
@HeartSaVioR HeartSaVioR marked this pull request as ready for review July 9, 2020 12:57
@HeartSaVioR
Copy link
Contributor Author

cc. @cloud-fan @HyukjinKwon

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it passes, looks good.

@SparkQA
Copy link

SparkQA commented Jul 9, 2020

Test build #125471 has finished for PR 29036 at commit 1802d60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

@yaooqinn can you take a look?

@yaooqinn
Copy link
Member

yaooqinn commented Jul 10, 2020

Thanks for ping me @cloud-fan,

"It took around 40 seconds for boot-up" a local mode backend with UI disabled sounds a bit weird to me.

Spark SQL CLI command line: ../../bin/spark-sql --master local \
--driver-java-options -Dderby.system.durability=test \
--driver-class-path /home/jenkins/workspace/SparkPullRequestBuilder@3/sql/hive-thriftserver/src/test/noclasspath \
--conf spark.ui.enabled=false \
--hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-64c2cb57-c790-4fa1-a682-92f54805fbe3;create=true \
--hiveconf hive.exec.scratchdir=/home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-46cc494d-0aa3-455f-aeb8-7232c7634a1a \
--hiveconf conf1=conftest \
--hiveconf conf2=1
Exception: java.util.concurrent.TimeoutException: Futures timed out after [1 minute]

But I am also +1 with this approach.

@HeartSaVioR
Copy link
Contributor Author

Thanks for the feedbacks! Merging to master. As same as #29039, we can port back anytime when we find the flakiness in other branches, so it should be OK to start with only master branch.

@HeartSaVioR HeartSaVioR deleted the clisuite-flakiness-fix branch July 10, 2020 04:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants