[SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI #29015
Conversation
Force-pushed from 24b32f8 to 112ac42
@holdenk, @jiangxb1987, @cloud-fan, @Ngone51 -- This PR is ready for your review. Thanks!
Force-pushed from cc70cf2 to 93f2d52
jiangxb1987 left a comment:
Looks good, only nits.
Will multiple requests block each other, since we use askSync here?
Yeah: we would wait for each one to be processed iteratively by the master's message handling thread. Having said that, the decommissioning does not block on actually sending/acking the messages to the executors. It's merely updating some (potentially persistent) state in the Master, so it shouldn't be that slow.
Would this be a problem, though? I am assuming that the JettyHandler that the MasterWebUI is built atop can indeed handle multiple requests in flight, even when some of them are blocking.
The use case for making this handler synchronous is so that the external agent doing the decommissioning of the hosts can know whether the cleanup succeeded or not. While this information is scrapeable from the MasterPage (which returns the status of the Workers), that would require some brittle scraping on the external end. So I figured it would be better for this call to return the number of workers it was actually able to decommission.
I am happy to switch this logic to async if you see any red flags.
I changed it so that the actual decommissioning is done asynchronously in the Master. The DecommissionHostPorts call should now be very quick, and it is okay for it to be synchronous: DecommissionHostPorts simply enqueues multiple WorkerDecommission messages, decommissioning one worker at a time.
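For illustration, here is a rough sketch of how the synchronous reply plus asynchronous enqueueing described above could look inside the Master's receiveAndReply handler. This is a hypothetical simplification: the `workers` collection, the hostPort matching, and the use of `self.send` are assumptions made for this sketch, not code taken from the PR.

```scala
case DecommissionHostPorts(hostPorts) =>
  // Find the live workers that match the requested host:port addresses
  // (collection and field names are assumptions for this sketch).
  val toDecommission = workers.filter { w =>
    hostPorts.contains(w.hostPort) && w.state != WorkerState.DECOMMISSIONED
  }
  // Reply immediately with the number of matched workers, so the HTTP caller
  // gets a quick, synchronous answer ...
  context.reply(toDecommission.size)
  // ... while the actual decommissioning is enqueued back onto the Master's
  // own message loop, one WorkerDecommission message per worker.
  toDecommission.foreach { w => self.send(WorkerDecommission(w.id, w.endpoint)) }
```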
Force-pushed from fb662f9 to d8e241f
ok to test
case class Heartbeat(workerId: String, worker: RpcEndpointRef) extends DeployMessage

// Out of band commands to Master
case class DecommissionHostPorts(hostPorts: Seq[String])
nit: maybe DecommissionWorkers? In the comment, we can say that the worker is identified by host and port (optional).
It would be confusing to say DecommissionWorkers but pass in a sequence of hostPorts...
But DecommissionHostPorts is more confusing, as you don't even know what it does unless you look at the comment.
Let's change it to DecommissionWorkers then, since WorkerStateResponse also passes in host and port.
I am narrowing the scope to simply decommissioning a set of hostnames, so that all workers on a given host go away. This is the only production use case I have in mind, and there is no need to design for the flexibility of decommissioning an individual worker on a node.
As such, I have renamed the API to DecommissionHosts, and it takes a list of hostnames.
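For reference, a minimal sketch of what the narrowed-down message could look like in DeployMessage.scala. Only the name DecommissionHosts and the hostname-only parameter come from the discussion above; the comment wording and the extends clause are assumptions.

```scala
// Out of band command to the Master: decommission every worker running on any
// of the given hosts. Workers are matched by hostname only; there is no
// per-worker (host:port) granularity.
case class DecommissionHosts(hostnames: Seq[String]) extends DeployMessage
```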
Force-pushed from 9ea178b to 08a4c9b
Test build #125911 has finished for PR 29015 at commit
Force-pushed from 08a4c9b to 3ee87f3
Test build #125922 has finished for PR 29015 at commit
jenkins retest this please
Retest this please.
jiangxb1987 left a comment:
LGTM
Test build #125927 has finished for PR 29015 at commit
Test build #125935 has finished for PR 29015 at commit
Force-pushed from 3ee87f3 to c6a6a90
Test build #125998 has finished for PR 29015 at commit
This allows an external agent to inform the Master that certain hosts are being decommissioned. This alternative is suitable for environments that cannot trigger, on the Worker, the cleanup hook that is needed today to inform the Master. The new API also allows the Master to be informed of all the hosts being decommissioned in bulk, by specifying a list of hostnames. This API is merely a new entry point into the existing decommissioning logic; it does not change how the decommissioning request is handled at its core.
Added unit tests.
Force-pushed from c6a6a90 to 31b231e
Test build #126012 has finished for PR 29015 at commit
thanks, merging to master!
What changes were proposed in this pull request?
This PR allows an external agent to inform the Master that certain hosts
are being decommissioned.
Why are the changes needed?
The current decommissioning is triggered by the Worker getting a SIGPWR
(out of band, possibly from some cleanup hook), which then informs the Master
about it. This approach may not be feasible in environments that cannot
trigger a cleanup hook on the Worker. In addition, when a large number of
worker nodes are being decommissioned, the Master will get a flood of
messages.
So we add a new POST endpoint /workers/kill on the MasterWebUI that allows an
external agent to inform the Master about all the nodes being decommissioned in
bulk. The list of nodes is specified by providing a list of hostnames. All workers on those
hosts will be decommissioned.
This API is merely a new entry point into the existing decommissioning
logic. It does not change how the decommissioning request is handled at
its core.
Does this PR introduce any user-facing change?
Yes, a new endpoint /workers/kill is added to the MasterWebUI. By default, only
requests originating from an IP address local to the MasterWebUI are allowed.
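For concreteness, here is a hedged sketch of how such a protected POST handler could look on the MasterWebUI side. The form-parameter name ("host"), the servlet and helper names, and the exact local-address check are illustrative assumptions based on the description above, not the PR's actual code.

```scala
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

// Hypothetical sketch of a /workers/kill POST handler. `decommissionHosts`
// stands in for the ask to the Master and returns how many workers it matched.
class WorkersKillServlet(
    decommissionHosts: Seq[String] => Int,
    allowRemoteRequests: Boolean = false) extends HttpServlet {

  override def doPost(req: HttpServletRequest, resp: HttpServletResponse): Unit = {
    // By default, only accept requests originating from the MasterWebUI's own host.
    val isLocalRequest = req.getRemoteAddr == req.getLocalAddr
    if (!allowRemoteRequests && !isLocalRequest) {
      resp.sendError(HttpServletResponse.SC_FORBIDDEN)
      return
    }
    // Hostnames are assumed to arrive as repeated "host" form parameters.
    val hostnames = Option(req.getParameterValues("host")).map(_.toSeq).getOrElse(Seq.empty)
    val removed = decommissionHosts(hostnames)
    // Report back whether any workers were actually decommissioned.
    if (removed > 0) resp.setStatus(HttpServletResponse.SC_OK)
    else resp.sendError(HttpServletResponse.SC_NOT_FOUND)
  }
}
```

An external agent would then POST the hostnames of the nodes being taken down to the Master web UI's /workers/kill endpoint and use the HTTP status to decide whether the cleanup succeeded.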
How was this patch tested?
Added unit tests.