Conversation

@agrawaldevesh commented Jul 6, 2020

What changes were proposed in this pull request?

This PR allows an external agent to inform the Master that certain hosts
are being decommissioned.

Why are the changes needed?

The current decommissioning is triggered by the Worker getting a SIGPWR
(out of band, possibly from some cleanup hook), which then informs the Master
about it. This approach may not be feasible in some environments that cannot
trigger a cleanup hook on the Worker. In addition, when a large number of
worker nodes are being decommissioned, the Master will get a flood of
messages.

So we add a new POST endpoint /workers/kill on the MasterWebUI that allows an
external agent to inform the Master about all the nodes being decommissioned in
bulk. The list of nodes is specified by providing a list of hostnames; all
workers on those hosts will be decommissioned.

This API is merely a new entry point into the existing decommissioning
logic. It does not change how the decommissioning request is handled in
its core.

Does this PR introduce any user-facing change?

Yes, a new endpoint /workers/kill is added to the MasterWebUI. By default only
requests originating from an IP address local to the MasterWebUI are allowed.
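
For illustration, an external agent could invoke the endpoint roughly as in the sketch below (Scala, using only JDK HTTP classes). The host form-parameter name, the master address, and the response format are assumptions made for this example, not taken from this PR.

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import scala.io.Source

object DecommissionHostsClient {
  def main(args: Array[String]): Unit = {
    // Assumed request shape: POST the hostnames as repeated "host" form
    // parameters to the MasterWebUI (8080 is the default UI port).
    // Note: by default only requests from an IP local to the MasterWebUI
    // are accepted.
    val form = Seq("host1.example.com", "host2.example.com")
      .map(h => s"host=$h")
      .mkString("&")

    val conn = new URL("http://spark-master:8080/workers/kill/")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded")
    val out = conn.getOutputStream
    out.write(form.getBytes(StandardCharsets.UTF_8))
    out.close()

    // The handler is synchronous, so the response can report the outcome
    // (for example, how many workers were decommissioned).
    println(s"HTTP ${conn.getResponseCode}")
    println(Source.fromInputStream(conn.getInputStream).mkString)
    conn.disconnect()
  }
}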

How was this patch tested?

Added unit tests

@agrawaldevesh force-pushed the master_decom_endpoint branch 2 times, most recently from 24b32f8 to 112ac42, on July 6, 2020 at 19:29
@agrawaldevesh changed the title from "[WIP] Expose a (protected) /workers/kill endpoint on the MasterWebUI" to "[WIP][SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI" on Jul 7, 2020
@agrawaldevesh changed the title from "[WIP][SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI" to "[SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI" on Jul 8, 2020
@agrawaldevesh (Author)

@holdenk, @jiangxb1987, @cloud-fan, @Ngone51 -- This PR is ready for your review. Thanks!

@agrawaldevesh force-pushed the master_decom_endpoint branch 2 times, most recently from cc70cf2 to 93f2d52, on July 13, 2020 at 23:10
@jiangxb1987 (Contributor) left a comment:

Looks good, only nits.

Contributor:

will multiple requests block each other, since we use askSync here?

Author:

Yeah: we would wait for each one to be processed iteratively by the Master's message-handling thread. Having said that, decommissioning does not block on actually sending/acking the messages to the executors. It's merely updating some (potentially persistent) state in the Master, so it shouldn't be that slow.

That said, would this be a problem? I am assuming that the Jetty handler that the MasterWebUI is built atop can indeed handle multiple requests in flight, where some of them are blocking.

The use case for making this handler synchronous is so that the external agent doing the decommissioning of the hosts can know whether the cleanup succeeded or not. While this information is scrapeable from the MasterPage (which returns the status of the Workers), it would require some brittle scraping by the external agent. So I figured it would be better for this call to return the number of workers it was actually able to decommission.

I am happy to switch this logic to async if you see any red flags.
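
As a self-contained illustration of that assumption (plain Scala, not Spark's RPC code; every name below is made up): synchronous asks against a single message-handling loop are answered one at a time, so concurrent callers are serialized but do not block each other indefinitely.

import java.util.concurrent.{Executors, LinkedBlockingQueue}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Toy model: one dispatcher thread answers synchronous "asks" one at a
// time, the way a single message-handling loop would.
object AskSyncSerializationSketch {
  final case class Ask(payload: String, reply: Promise[String])

  def main(args: Array[String]): Unit = {
    val inbox = new LinkedBlockingQueue[Ask]()

    val dispatcher = new Thread(() => {
      while (true) {
        val ask = inbox.take()
        Thread.sleep(100) // simulate per-message handling cost
        ask.reply.success(s"handled ${ask.payload}")
      }
    })
    dispatcher.setDaemon(true)
    dispatcher.start()

    // Several "servlet" threads each block on their own reply, like a
    // synchronous ask. They are processed one at a time, but all complete.
    val pool = Executors.newFixedThreadPool(4)
    (1 to 4).foreach { i =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          val p = Promise[String]()
          inbox.put(Ask(s"request-$i", p))
          println(Await.result(p.future, 10.seconds))
        }
      })
    }
    pool.shutdown()
  }
}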

Author:

I changed it such that the actual decommissioning is done asynchronously in the Master: the DecommissionHostPorts call should now be very quick, so it is okay for it to be synchronous. That is, DecommissionHostPorts simply enqueues multiple WorkerDecommission messages, one per worker.
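
Roughly the shape described above, as a minimal self-contained sketch (hypothetical names, not the actual Master code): the synchronous entry point only matches hosts against known workers, enqueues one decommission message per worker, and returns the count right away.

import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical stand-ins for the Master's worker registry and its inbox.
final case class WorkerRef(id: String, host: String)
final case class WorkerDecommission(workerId: String)

class MasterLikeHandler(workers: Seq[WorkerRef]) {
  val inbox = new ConcurrentLinkedQueue[WorkerDecommission]()

  // Synchronous entry point: cheap bookkeeping only. The per-worker
  // decommission work is merely enqueued for later, asynchronous handling.
  def decommissionHosts(hostnames: Seq[String]): Int = {
    val targets = workers.filter(w => hostnames.contains(w.host))
    targets.foreach(w => inbox.add(WorkerDecommission(w.id)))
    targets.size
  }
}

object DecommissionSketch {
  def main(args: Array[String]): Unit = {
    val handler = new MasterLikeHandler(Seq(
      WorkerRef("worker-1", "host1"),
      WorkerRef("worker-2", "host1"),
      WorkerRef("worker-3", "host2")))
    val count = handler.decommissionHosts(Seq("host1"))
    println(s"decommissioning $count workers; ${handler.inbox.size} messages enqueued")
  }
}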

@agrawaldevesh force-pushed the master_decom_endpoint branch 2 times, most recently from fb662f9 to d8e241f, on July 14, 2020 at 22:41
@cloud-fan (Contributor)

ok to test

case class Heartbeat(workerId: String, worker: RpcEndpointRef) extends DeployMessage

// Out of band commands to Master
case class DecommissionHostPorts(hostPorts: Seq[String])
Contributor:

nit: maybe DecommissionWorkers? In the comment, we can say that the worker is identified by host and port (optional).

Contributor:

It would be confusing to say DecommissionWorkers but pass in a sequence of hostPorts...

Contributor:

But DecommissionHostPorts is more confusing, as you don't even know what it does unless you look at the comment.

Contributor:

Let's change it to DecommissionWorkers then, since WorkerStateResponse also passes in host and port.

Author:

I am narrowing the scope to simply decommissioning a set of hostnames; all workers on a given host will go away. This is the only production use case I have in mind, and there is no need to design for the flexibility of decommissioning an individual worker on a node.

As such, I have renamed the API to DecommissionHosts and it takes a list of host names.

@agrawaldevesh force-pushed the master_decom_endpoint branch 2 times, most recently from 9ea178b to 08a4c9b, on July 15, 2020 at 23:17

SparkQA commented Jul 16, 2020

Test build #125911 has finished for PR 29015 at commit d8e241f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DecommissionHostPorts(hostPorts: Seq[String])

@agrawaldevesh force-pushed the master_decom_endpoint branch from 08a4c9b to 3ee87f3, on July 16, 2020 at 01:31

SparkQA commented Jul 16, 2020

Test build #125922 has finished for PR 29015 at commit 9ea178b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@agrawaldevesh (Author)

jenkins retest this please

@agrawaldevesh (Author)

Retest this please.

@jiangxb1987 (Contributor) left a comment:

LGTM


SparkQA commented Jul 16, 2020

Test build #125927 has finished for PR 29015 at commit 08a4c9b.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 16, 2020

Test build #125935 has finished for PR 29015 at commit 3ee87f3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DecommissionHosts(hostnames: Seq[String])

@agrawaldevesh force-pushed the master_decom_endpoint branch from 3ee87f3 to c6a6a90, on July 16, 2020 at 19:19

SparkQA commented Jul 16, 2020

Test build #125998 has finished for PR 29015 at commit c6a6a90.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DecommissionWorkersOnHosts(hostnames: Seq[String])

This allows an external agent to inform the Master that certain
hosts are being decommissioned.

This alternative is suitable for environments that cannot trigger the
cleanup hook on the Worker that is needed today to inform the Master.

This new API also allows the Master to be informed of all hosts being
decommissioned in bulk by specifying a list of hostnames.

This API is merely a new entry point into the existing decommissioning
logic. It does not change how the decommissioning request is handled in
its core.

Added unit tests

@agrawaldevesh force-pushed the master_decom_endpoint branch from c6a6a90 to 31b231e, on July 17, 2020 at 01:20

SparkQA commented Jul 17, 2020

Test build #126012 has finished for PR 29015 at commit 31b231e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DecommissionWorkersOnHosts(hostnames: Seq[String])

@cloud-fan closed this in ffdbbae on Jul 17, 2020
@cloud-fan (Contributor)

thanks, merging to master!
