[SPARK-31921][CORE] Fix the wrong warning: "App app-xxx requires more resource than any of Workers could have" #28742
Conversation
ping @tgravescs @jiangxb1987 Please take a look, thanks!

Test build #123590 has finished for PR 28742 at commit
  .filter(canLaunchExecutor(_, app.desc))
  .sortBy(_.coresFree).reverse
- if (waitingApps.length == 1 && usableWorkers.isEmpty) {
+ val appMayHang = waitingApps.length == 1 &&
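For context, here is a minimal, self-contained sketch of the logic the new `appMayHang` flag expresses, based on the diff above and the PR description. The case classes and the `maybeWarnAboutHang` helper are hypothetical stand-ins for the Master's internal state, not Spark's actual classes:

```scala
// Hypothetical, simplified model of the warning check in Master.startExecutorsOnWorkers.
case class WorkerInfo(coresFree: Int, memoryFree: Int)
case class AppDesc(coresPerExecutor: Int, memoryPerExecutorMb: Int)
case class AppInfo(id: String, desc: AppDesc, executors: Map[Int, String])

object HangWarningSketch {
  // A worker is usable only if it could still host one more executor of this app.
  def canLaunchExecutor(worker: WorkerInfo, desc: AppDesc): Boolean =
    worker.coresFree >= desc.coresPerExecutor &&
      worker.memoryFree >= desc.memoryPerExecutorMb

  def maybeWarnAboutHang(waitingApps: Seq[AppInfo], workers: Seq[WorkerInfo]): Unit =
    waitingApps.foreach { app =>
      val usableWorkers = workers
        .filter(canLaunchExecutor(_, app.desc))
        .sortBy(_.coresFree).reverse
      // The fix: only a single waiting app that has launched *no* executors at all,
      // with no worker able to host one, is treated as possibly hanging. An app that
      // already holds executors is merely waiting out this scheduling round.
      val appMayHang = waitingApps.length == 1 &&
        app.executors.isEmpty && usableWorkers.isEmpty
      if (appMayHang) {
        println(s"WARN: App ${app.id} requires more resource than any of Workers could have.")
      }
    }
}
```

In the `local-cluster[2, 1, 1024]` example from the description, the app has already launched executors by the time a later scheduling round finds no usable workers, so the extra empty-executors condition keeps the spurious warning from firing.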
Why do you need to verify waitingApps.length == 1 ?
Hmm... I was also thinking about this question while working on the fix. I think we can extend it to multiple apps. Let me try it.
I'm not sure how useful this warning message is; it just reflects the current status within one round of scheduling, and we can tolerate not launching a new executor on the currently available workers.
It's from #25047 (comment) originally. The warning is mainly used to detect the possible hang of an application.
I don't like this check, but it seems to be the simplest solution for now. Let's keep it then.
I think the idea of == 1 is that there is only one active application; if there is more than one, other applications could be using resources, so you can't tell much without doing a lot more checking. We could definitely make it a more comprehensive check in the future. I believe this was supposed to be a simple sanity check for people playing around who may have started the cluster with not enough resources. The standalone mode does not tell you there aren't enough resources and will just hang forever.
LGTM

retest this please

Test build #123643 has finished for PR 28742 at commit
dongjoon-hyun left a comment
+1, LGTM. Thank you all. Merged to master/3.0.
… resource than any of Workers could have"

Closes #28742 from Ngone51/fix_warning.

Authored-by: yi.wu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 38873d5)
Signed-off-by: Dongjoon Hyun <[email protected]>
thanks all!
What changes were proposed in this pull request?
This PR adds a check to see whether the set of allocated executors for the waiting application is empty before flagging it as a possibly hanging application.
Why are the changes needed?
It's a bugfix. The warning means there are not enough resources for the application to launch even a single executor. But we can still successfully run a job while this warning is shown, which means the application has in fact launched executors.
Does this PR introduce any user-facing change?
Yes. Before this PR, when using local cluster mode to start spark-shell, e.g. ./bin/spark-shell --master "local-cluster[2, 1, 1024]", the user would always see the warning:

```
20/06/06 22:21:02 WARN Utils: Your hostname, C02ZQ051LVDR resolves to a loopback address: 127.0.0.1; using 192.168.1.6 instead (on interface en0)
20/06/06 22:21:02 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/06/06 22:21:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
Spark context Web UI available at http://192.168.1.6:4040
Spark context available as 'sc' (master = local-cluster[2, 1, 1024], app id = app-20200606222107-0000).
Spark session available as 'spark'.
20/06/06 22:21:07 WARN Master: App app-20200606222107-0000 requires more resource than any of Workers could have.
20/06/06 22:21:07 WARN Master: App app-20200606222107-0000 requires more resource than any of Workers could have.
```

After this PR, the warning is gone.
How was this patch tested?
Tested manually.
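No automated test was added in the PR; purely as an illustration of the intended behavior (using a hypothetical re-statement of the condition, not Spark's test suite), the two interesting cases can be checked like this:

```scala
object HangWarningConditionCheck extends App {
  // Hypothetical re-statement of the fixed condition; the three parameters stand in
  // for Master's waitingApps.length, app.executors.nonEmpty, and usableWorkers.isEmpty.
  def appMayHang(waitingApps: Int, hasExecutors: Boolean, noUsableWorkers: Boolean): Boolean =
    waitingApps == 1 && !hasExecutors && noUsableWorkers

  // Case fixed by this PR: the app has already launched executors, so no warning.
  assert(!appMayHang(waitingApps = 1, hasExecutors = true, noUsableWorkers = true))
  // Genuine hang case: no executor ever launched and no worker can host one, so warn.
  assert(appMayHang(waitingApps = 1, hasExecutors = false, noUsableWorkers = true))
  println("warning condition matches the behavior described in the PR")
}
```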