[SPARK-31921][CORE] Fix the wrong warning: "App app-xxx requires more resource than any of Workers could have" #28742
Conversation
ping @tgravescs @jiangxb1987 Please take a look, thanks!

Test build #123590 has finished for PR 28742 at commit
  .filter(canLaunchExecutor(_, app.desc))
  .sortBy(_.coresFree).reverse
- if (waitingApps.length == 1 && usableWorkers.isEmpty) {
+ val appMayHang = waitingApps.length == 1 &&
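For context, here is a minimal, self-contained sketch of the logic the new `appMayHang` flag expresses, based on the diff above and the PR description. The case classes and the `maybeWarnAboutHang` helper are hypothetical stand-ins for the Master's internal state, not Spark's actual classes:

```scala
// Hypothetical, simplified model of the warning check in Master.startExecutorsOnWorkers.
case class WorkerInfo(coresFree: Int, memoryFree: Int)
case class AppDesc(coresPerExecutor: Int, memoryPerExecutorMb: Int)
case class AppInfo(id: String, desc: AppDesc, executors: Map[Int, String])

object HangWarningSketch {
  // A worker is usable only if it could still host one more executor of this app.
  def canLaunchExecutor(worker: WorkerInfo, desc: AppDesc): Boolean =
    worker.coresFree >= desc.coresPerExecutor &&
      worker.memoryFree >= desc.memoryPerExecutorMb

  def maybeWarnAboutHang(waitingApps: Seq[AppInfo], workers: Seq[WorkerInfo]): Unit =
    waitingApps.foreach { app =>
      val usableWorkers = workers
        .filter(canLaunchExecutor(_, app.desc))
        .sortBy(_.coresFree).reverse
      // The fix: only a single waiting app that has launched *no* executors at all,
      // with no worker able to host one, is treated as possibly hanging. An app that
      // already holds executors is merely waiting out this scheduling round.
      val appMayHang = waitingApps.length == 1 &&
        app.executors.isEmpty && usableWorkers.isEmpty
      if (appMayHang) {
        println(s"WARN: App ${app.id} requires more resource than any of Workers could have.")
      }
    }
}
```

In the `local-cluster[2, 1, 1024]` example from the description, the app has already launched executors by the time a later scheduling round finds no usable workers, so the extra empty-executors condition keeps the spurious warning from firing.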
Why do you need to verify waitingApps.length == 1 ?
Hmm... I was also thinking about this question while working on the fix. I think we can extend it to multiple apps. Let me try it.
I'm not sure how useful this warning message is; it just reflects the current status within one round of scheduling, and we can tolerate not launching a new executor on the currently available workers.
It's from #25047 (comment) originally. The warning is mainly used to detect the possible hang of an application.
I don't like this check, but it seems to be the simplest solution for now. Let's keep it then.
I think the idea of == 1 is that there is only one active application; if there is more than one, other applications could be using resources, so you can't tell much without doing a lot more checking. We could definitely make it a more comprehensive check in the future. I believe this was supposed to be a simple sanity check for people playing around who may have started the cluster with not enough resources. The standalone mode does not tell you there aren't enough resources and will just hang forever.
LGTM

retest this please

Test build #123643 has finished for PR 28742 at commit
dongjoon-hyun left a comment
+1, LGTM. Thank you all. Merged to master/3.0.
… resource than any of Workers could have"

Closes #28742 from Ngone51/fix_warning.

Authored-by: yi.wu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 38873d5)
Signed-off-by: Dongjoon Hyun <[email protected]>
thanks all!
What changes were proposed in this pull request?
This PR adds a check to see whether the set of allocated executors for the waiting application is empty before flagging it as a possibly hanging application.
Why are the changes needed?
It's a bugfix. The warning means there are not enough resources for the application to launch even a single executor. But we can still successfully run a job while this warning is shown, which means the application has in fact launched executors.
Does this PR introduce any user-facing change?
Yes. Before this PR, when using local cluster mode to start spark-shell, e.g. ./bin/spark-shell --master "local-cluster[2, 1, 1024]", the user would always see the warning:

```
20/06/06 22:21:02 WARN Utils: Your hostname, C02ZQ051LVDR resolves to a loopback address: 127.0.0.1; using 192.168.1.6 instead (on interface en0)
20/06/06 22:21:02 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/06/06 22:21:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
Spark context Web UI available at http://192.168.1.6:4040
Spark context available as 'sc' (master = local-cluster[2, 1, 1024], app id = app-20200606222107-0000).
Spark session available as 'spark'.
20/06/06 22:21:07 WARN Master: App app-20200606222107-0000 requires more resource than any of Workers could have.
20/06/06 22:21:07 WARN Master: App app-20200606222107-0000 requires more resource than any of Workers could have.
```

After this PR, the warning is gone.
How was this patch tested?
Tested manually.
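No automated test was added in the PR; purely as an illustration of the intended behavior (using a hypothetical re-statement of the condition, not Spark's test suite), the two interesting cases can be checked like this:

```scala
object HangWarningConditionCheck extends App {
  // Hypothetical re-statement of the fixed condition; the three parameters stand in
  // for Master's waitingApps.length, app.executors.nonEmpty, and usableWorkers.isEmpty.
  def appMayHang(waitingApps: Int, hasExecutors: Boolean, noUsableWorkers: Boolean): Boolean =
    waitingApps == 1 && !hasExecutors && noUsableWorkers

  // Case fixed by this PR: the app has already launched executors, so no warning.
  assert(!appMayHang(waitingApps = 1, hasExecutors = true, noUsableWorkers = true))
  // Genuine hang case: no executor ever launched and no worker can host one, so warn.
  assert(appMayHang(waitingApps = 1, hasExecutors = false, noUsableWorkers = true))
  println("warning condition matches the behavior described in the PR")
}
```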