[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n… #10794
```diff
@@ -78,6 +78,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
   // Executors that have been lost, but for which we don't yet know the real exit reason.
   protected val executorsPendingLossReason = new HashSet[String]

+  // The num of current max ExecutorId used to re-register appMaster
+  protected var currentExecutorIdCounter = 0
+
   class DriverEndpoint(override val rpcEnv: RpcEnv, sparkProperties: Seq[(String, String)])
     extends ThreadSafeRpcEndpoint with Logging {

@@ -155,6 +158,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
         // in this block are read when requesting executors
         CoarseGrainedSchedulerBackend.this.synchronized {
           executorDataMap.put(executorId, data)
+          if (currentExecutorIdCounter < Integer.parseInt(executorId)) {
+            currentExecutorIdCounter = Integer.parseInt(executorId)
+          }
```
Contributor
This is kind of awkward. You don't need to keep track of another variable; just compute the max executor ID when the AM asks for it. You already have all the information you need in executorDataMap.

Contributor / Author
@andrewor14 Thanks for reviewing it. As I understand it, we can't get the max executor ID from executorDataMap: when the AM fails, all the executors are disconnected and removed, so by the time the new AM asks, the map no longer contains them.

Contributor
Yes; with dynamic allocation, for example, the executor with the max known id may be gone already.

Contributor / Author
@vanzin Thanks for your comments. I will optimize it.
```diff
          if (numPendingExecutors > 0) {
            numPendingExecutors -= 1
            logDebug(s"Decremented number of pending executors ($numPendingExecutors left)")
```
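To make the driver side of this exchange concrete, here is a minimal sketch of how `DriverEndpoint` could answer the AM's query using the counter tracked above. The `RetrieveMaxExecutorId` message is the one this patch introduces, but the `receiveAndReply` wiring shown is illustrative, not the exact code in this PR:

```scala
// Sketch only: inside DriverEndpoint, answer the AM's query with the largest
// executor id registered so far. currentExecutorIdCounter is updated when an
// executor registers (see the diff above).
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case RetrieveMaxExecutorId =>
    context.reply(currentExecutorIdCounter)
}
```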
```diff
@@ -37,6 +37,7 @@ import org.apache.spark.deploy.yarn.YarnSparkHadoopUtil._
 import org.apache.spark.rpc.{RpcCallContext, RpcEndpointRef}
 import org.apache.spark.scheduler.{ExecutorExited, ExecutorLossReason}
 import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages.RemoveExecutor
+import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages.RetrieveMaxExecutorId
 import org.apache.spark.util.ThreadUtils

 /**

@@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
     new ConcurrentHashMap[ContainerId, java.lang.Boolean])

   @volatile private var numExecutorsRunning = 0
-  // Used to generate a unique ID per executor
-  private var executorIdCounter = 0
+  /**
+   * Used to generate a unique ID per executor
+   *
+   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
+   * the id of new executor will start from 1, this will conflict with the executor has
```
Contributor
I think we need to clarify this to say it is required for client mode, when the driver isn't running on YARN; this isn't an issue in cluster mode.

Contributor / Author
@tgravescs I think we can clarify this in the JIRA.

Contributor
I would prefer to do it here in this comment, to better describe the situation where this is needed. It should be a line or two, and I personally much prefer that to pointing at JIRAs, unless a big discussion or background is required; then the JIRA makes more sense.

Contributor / Author
@tgravescs OK, I will do it soon. Thanks a lot.
```diff
+   * already created before. So, we should initialize the `executorIdCounter` by getting
+   * the max executorId from driver.
+   *
+   * @see SPARK-12864
+   */
+  private var executorIdCounter: Int = {
+    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
+  }
   @volatile private var numExecutorsFailed = 0

   @volatile private var targetNumExecutors =
```
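For context on why the `+ 1` in the initializer matters, here is a simplified paraphrase of how the allocator consumes this counter when it hands out executor ids. The `nextExecutorId` helper is hypothetical; the real `YarnAllocator` does this inline while launching containers:

```scala
// Hypothetical helper paraphrasing YarnAllocator's id assignment. With the
// initializer above, the first id handed out after an AM restart is one past
// the driver's max known id, so it cannot collide with an executor that a
// previous AM attempt already registered.
private def nextExecutorId(): String = {
  val executorId = executorIdCounter.toString
  executorIdCounter += 1
  executorId
}
```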
Contributor
We seem to be intermixing "current" and "max". "Max" makes me think this is some limit, so can we change things to be consistent, and perhaps use CurrentExecutorId or LargestAllocated?

Contributor / Author
@tgravescs Thanks for reviewing. This variable is the max executorId from the previous AM, and it is only read when initializing a new AM. Our intention is to get the max executorId across all executors from the previous AM, so I think maxExecutorId is ok.
@andrewor14 What's your opinion?

Contributor
I understand what the variable is, but readers looking at just this message without other context wouldn't necessarily know. When I look at the name, I think it's giving me some limit for the max executor id; for instance, we have many configs that set a max of things (spark.reducer.maxSizeInFlight, spark.shuffle.io.maxRetries, etc.). That is why I would like the name clarified.
If we change the calling context, then it's not just the max executor id for the last AM, it's the last executor id that was allocated. So perhaps rename it to RetrieveLastAllocatedExecutorId.

Contributor
How about MaxKnownExecutorId?

Contributor / Author
@tgravescs OK, I get your meaning. Thanks a lot. Is the name RetrieveLastAllocatedExecutorId ok? @vanzin, what's your opinion?
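If the rename settles on `RetrieveLastAllocatedExecutorId`, the message itself remains a one-liner. A sketch of what the definition in `CoarseGrainedClusterMessages` could look like (hypothetical final form, not code from these commits):

```scala
// Hypothetical final form after the rename discussed above: sent by the AM on
// (re)start to ask the driver for the last executor id it handed out.
case object RetrieveLastAllocatedExecutorId extends CoarseGrainedClusterMessage
```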