[SPARK-15917][CORE] Added support for number of executors in Standalone [WIP] #15405
Conversation
…can't be satisfied
add to whitelist
```scala
val numExecutorsLaunched = app.executors.size
// Check to see if we managed to launch the requested number of executors
if(numUsable != 0 && numExecutorsLaunched != app.executorLimit &&
    numExecutorsScheduled != app.executorLimit) {
```
How are `numExecutorsLaunched` and `numExecutorsScheduled` related to each other? Also, here we probably want an inequality check just in case.
Also, style: there needs to be a space after `if`.
Another thing is, how noisy is this? Do we log this if dynamic allocation is turned on (we shouldn't)?
`numExecutorsLaunched` corresponds to the actual number of executors that have been launched so far (literally those registered in the executors list of the `ApplicationInfo`), whereas `numExecutorsScheduled` corresponds to the number of executors that have been scheduled/allocated by `scheduleExecutorsOnWorkers`. The distinction is needed because `scheduleExecutorsOnWorkers` is called multiple times while the executors are being set up; without this condition we would repeatedly log the same message with incorrect information (such as "0 executors launched" even though the executors had already been launched).
Tell me if that doesn't make sense; I went through a lot of trial and error before arriving at this condition.
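For illustration, here is a minimal, self-contained sketch of the check being discussed (not the actual `Master.scala` code), written with the inequality form suggested in the review; all names are illustrative stand-ins for the fields in the diff:

```scala
// Sketch of the "couldn't launch the requested executors" warning condition.
object ExecutorWarningCheck {
  def shouldWarn(
      numUsableWorkers: Int,      // workers that still have usable resources
      numExecutorsLaunched: Int,  // executors already registered in ApplicationInfo
      numExecutorsScheduled: Int, // executors assigned by the current scheduling pass
      executorLimit: Int): Boolean = {
    // Warn only when there are usable workers, yet neither the executors already
    // launched nor the ones scheduled in this pass reach the requested limit.
    // Checking both counts avoids re-logging "0 executors launched" on later
    // scheduling passes, after the executors were in fact launched earlier.
    numUsableWorkers != 0 &&
      numExecutorsLaunched < executorLimit &&
      numExecutorsScheduled < executorLimit
  }

  def main(args: Array[String]): Unit = {
    // First scheduling pass: nothing launched yet and only 2 of 4 scheduled -> warn.
    println(shouldWarn(numUsableWorkers = 3, numExecutorsLaunched = 0,
      numExecutorsScheduled = 2, executorLimit = 4)) // true
    // Later pass: all 4 executors already launched -> no warning.
    println(shouldWarn(numUsableWorkers = 3, numExecutorsLaunched = 4,
      numExecutorsScheduled = 0, executorLimit = 4)) // false
  }
}
```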
Regarding the noise produced, it should be quite minimal. When it's not possible to launch the requested number of executors, just one warning is logged.
With dynamic allocation on, a message is logged when an initial number of executors is specified and can't be satisfied. I don't think that's much of a problem, as there isn't any warning for this currently, but I can add a check to suppress the warning when dynamic allocation is enabled if you prefer.
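If we do add that guard, a hypothetical sketch of it could look like the following (this is not part of the current patch; `dynamicAllocationEnabled` stands for the value of `spark.dynamicAllocation.enabled`, and the message text is only an example):

```scala
// Sketch of suppressing the under-provisioning warning under dynamic allocation.
object UnderProvisionWarning {
  def warningFor(
      dynamicAllocationEnabled: Boolean,
      launched: Int,
      requested: Int): Option[String] = {
    if (!dynamicAllocationEnabled && launched < requested) {
      Some(s"Only $launched executor(s) could be launched out of the $requested requested.")
    } else {
      None // suppressed under dynamic allocation, or when the request was satisfied
    }
  }

  def main(args: Array[String]): Unit = {
    println(warningFor(dynamicAllocationEnabled = false, launched = 2, requested = 4)) // Some(...)
    println(warningFor(dynamicAllocationEnabled = true, launched = 2, requested = 4))  // None
  }
}
```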
Thanks for working on this. It's great to see how small the patch turned out to be!
Test build #66675 has finished for PR 15405 at commit
Test build #66681 has finished for PR 15405 at commit
Test build #3323 has finished for PR 15405 at commit
Are you still working on this? @JonathanTaws
Hi Jiang,
I've put this on hold as I wasn't getting updates from the admins on the next steps for this. I'd definitely like to move forward with this and contribute it to the codebase, as I believe it's still relevant nowadays.
Let me know!
On 13 Jun 2017 at 04:54, "Jiang Xingbo" <[email protected]> wrote:
Are you still working on this? @JonathanTaws
I see this is WIP; when do you think it will be ready for review? Thanks!
My bad, I should have removed it. I'll check that it's working as expected this weekend and we can move forward on it!
On 14 Jun 2017 at 03:33, "Jiang Xingbo" <[email protected]> wrote:
I see this is WIP, when do you think it will be ready for review? Thanks!
ping @JonathanTaws Please let me know once this PR is ready for review, thanks!
@jiang Quite busy at the moment; I will take care of it as soon as possible. I'll ping you once it's done.
On 25 Jun 2017 at 16:33, "Jiang Xingbo" <[email protected]> wrote:
ping @JonathanTaws Please let me know once this PR is ready for review, thanks!
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017.

Closes apache#14085 - [SPARK-16408][SQL] SparkSQL Added file get Exception: is a directory …
Closes apache#14239 - [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism to accelerate shuffle stage.
Closes apache#14567 - [SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation
Closes apache#14579 - [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() should return Python context managers
Closes apache#14601 - [SPARK-13979][Core] Killed executor is re spawned without AWS key…
Closes apache#14830 - [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples
Closes apache#14963 - [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in lint-python
Closes apache#15227 - [SPARK-17655][SQL]Remove unused variables declarations and definations in a WholeStageCodeGened stage
Closes apache#15240 - [SPARK-17556] [CORE] [SQL] Executor side broadcast for broadcast joins
Closes apache#15405 - [SPARK-15917][CORE] Added support for number of executors in Standalone [WIP]
Closes apache#16099 - [SPARK-18665][SQL] set statement state to "ERROR" after user cancel job
Closes apache#16445 - [SPARK-19043][SQL]Make SparkSQLSessionManager more configurable
Closes apache#16618 - [SPARK-14409][ML][WIP] Add RankingEvaluator
Closes apache#16766 - [SPARK-19426][SQL] Custom coalesce for Dataset
Closes apache#16832 - [SPARK-19490][SQL] ignore case sensitivity when filtering hive partition columns
Closes apache#17052 - [SPARK-19690][SS] Join a streaming DataFrame with a batch DataFrame which has an aggregation may not work
Closes apache#17267 - [SPARK-19926][PYSPARK] Make pyspark exception more user-friendly
Closes apache#17371 - [SPARK-19903][PYSPARK][SS] window operator miss the `watermark` metadata of time column
Closes apache#17401 - [SPARK-18364][YARN] Expose metrics for YarnShuffleService
Closes apache#17519 - [SPARK-15352][Doc] follow-up: add configuration docs for topology-aware block replication
Closes apache#17530 - [SPARK-5158] Access kerberized HDFS from Spark standalone
Closes apache#17854 - [SPARK-20564][Deploy] Reduce massive executor failures when executor count is large (>2000)
Closes apache#17979 - [SPARK-19320][MESOS][WIP]allow specifying a hard limit on number of gpus required in each spark executor when running on mesos
Closes apache#18127 - [SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when executing sql statement 'insert into' on hbase table
Closes apache#18236 - [SPARK-21015] Check field name is not null and empty in GenericRowWit…
Closes apache#18269 - [SPARK-21056][SQL] Use at most one spark job to list files in InMemoryFileIndex
Closes apache#18328 - [SPARK-21121][SQL] Support changing storage level via the spark.sql.inMemoryColumnarStorage.level variable
Closes apache#18354 - [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting
Closes apache#18383 - [SPARK-21167][SS] Set kafka clientId while fetch messages
Closes apache#18414 - [SPARK-21169] [core] Make sure to update application status to RUNNING if executors are accepted and RUNNING after recovery
Closes apache#18432 - resolve com.esotericsoftware.kryo.KryoException
Closes apache#18490 - [SPARK-21269][Core][WIP] Fix FetchFailedException when enable maxReqSizeShuffleToMem and KryoSerializer
Closes apache#18585 - SPARK-21359
Closes apache#18609 - Spark SQL merge small files to big files Update InsertIntoHiveTable.scala

Added: Closes apache#18308 - [SPARK-21099][Spark Core] INFO Log Message Using Incorrect Executor I…
Closes apache#18599 - [SPARK-21372] spark writes one log file even I set the number of spark_rotate_log to 0
Closes apache#18619 - [SPARK-21397][BUILD]Maven shade plugin adding dependency-reduced-pom.xml to …
Closes apache#18667 - Fix the simpleString used in error messages
Closes apache#18782 - Branch 2.1

Added: Closes apache#17694 - [SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcasts when using multiple threads

Added: Closes apache#16456 - [SPARK-18994] clean up the local directories for application in future by annother thread
Closes apache#18683 - [SPARK-21474][CORE] Make number of parallel fetches from a reducer configurable
Closes apache#18690 - [SPARK-21334][CORE] Add metrics reporting service to External Shuffle Server

Added: Closes apache#18827 - Merge pull request 1 from apache/master

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18780 from HyukjinKwon/close-prs.
## What changes were proposed in this pull request?

Currently, in standalone mode it is not possible to set the number of executors by using the `--num-executors` or `spark.executor.instances` property. Instead, as many executors as possible are spawned based on the available resources and the properties set. This patch corrects that to support the number-of-executors property.

Here's the new behavior:

- If the `executor.cores` property isn't set, we try to spawn one executor on each worker, taking all of the cores available (the default behavior), while the number of workers is less than the number of executors requested. If we can't launch the specified number of executors, a warning is logged.
- If the `executor.cores` property is set (the same logic applies to `executor.memory`):
  - if `executor.instances * executor.cores <= cores.max`, then `executor.instances` executors are spawned;
  - if `executor.instances * executor.cores > cores.max`, then as many executors as possible are spawned (basically the previous behavior when only `executor.cores` was set), but we also log a warning saying we couldn't spawn the requested number of executors.

In the case where `executor.memory` is set, all constraints are taken into account based on the number of cores and the memory assigned per worker (same logic as with the cores).
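To make the rules above concrete, here is a minimal, self-contained Scala sketch of the allocation decision (not the actual `Master` scheduling code); `executorInstances`, `coresPerExecutor`, `maxCores` and `numWorkers` are illustrative stand-ins for `spark.executor.instances`, `spark.executor.cores`, `spark.cores.max` and the number of usable workers:

```scala
// Sketch of how many executors can be granted versus how many were requested.
object NumExecutorsPlan {
  /** Returns (executors that can be granted, whether the request is fully satisfied). */
  def plan(
      executorInstances: Int,
      coresPerExecutor: Option[Int],
      maxCores: Int,
      numWorkers: Int): (Int, Boolean) = {
    coresPerExecutor match {
      case None =>
        // Without executor.cores, each executor takes a whole worker, so at most
        // one executor per worker can be launched.
        val granted = math.min(executorInstances, numWorkers)
        (granted, granted == executorInstances)
      case Some(cores) =>
        // With executor.cores set, the cap is how many executors fit within cores.max.
        val fitByCores = maxCores / cores
        val granted = math.min(executorInstances, fitByCores)
        (granted, granted == executorInstances)
    }
  }

  def main(args: Array[String]): Unit = {
    // 4 executors of 2 cores requested but only 6 cores allowed: 3 fit, a warning is logged.
    println(plan(executorInstances = 4, coresPerExecutor = Some(2), maxCores = 6, numWorkers = 5)) // (3,false)
    // No executor.cores: one executor per worker, so 4 workers cover the 4 requested.
    println(plan(executorInstances = 4, coresPerExecutor = None, maxCores = 16, numWorkers = 4))   // (4,true)
  }
}
```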
## How was this patch tested?

I tested this patch by running a simple Spark app in standalone mode, specifying the `--num-executors` or `spark.executor.instances` property and checking that the number of executors launched was coherent with the available resources and the requested number of executors.

I plan to test this patch further by adding tests in `MasterSuite` and running the usual `./dev/run-tests`.
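For reference, a minimal sketch of the kind of app used for that manual check, assuming the patch is applied and a standalone master is reachable at the placeholder URL `spark://master:7077`; the requested values (4 executors, 2 cores each, 8 max cores) are arbitrary:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NumExecutorsManualTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("num-executors-standalone-test")
      .setMaster("spark://master:7077")            // placeholder standalone master URL
      .set("spark.executor.instances", "4")        // same effect as --num-executors 4
      .set("spark.executor.cores", "2")
      .set("spark.cores.max", "8")
    val sc = new SparkContext(conf)
    // Run a trivial job so executors actually register, then check the Master UI
    // or logs for how many executors were granted and whether a warning was logged.
    println(sc.parallelize(1 to 1000, 8).sum())
    sc.stop()
  }
}
```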