
Conversation


Ngone51 (Member) commented Jul 4, 2019

What changes were proposed in this pull request?

In this PR, we implement a complete process for GPU-aware resource scheduling
in Standalone. The whole process looks like this: a Worker sets up isolated resources
when it starts up and registers with the Master along with its resources. The Master
picks usable Workers according to the driver/executor's resource requirements and
launches the driver/executor on them. The Worker then launches the driver/executor after
preparing a resources file, created under the driver/executor's working directory,
with the specified resource addresses (assigned by the Master). When the driver/executor
finishes, its resources are recycled back to the Worker. Finally, if a Worker stops, it
always releases its resources first.

For the case where Workers and Drivers in client mode run on the same host, we introduce
a config option named spark.resources.coordinate.enable (default: true) to indicate
whether Spark should coordinate resources for the user. If spark.resources.coordinate.enable=false, the user is responsible for configuring non-overlapping resources for Workers and Drivers when using a resourcesFile or discovery script. If true, Spark assigns different resources to Workers and Drivers on the user's behalf.
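For illustration, the opt-out case might look like this in spark-defaults.conf (only the coordinate flag is defined by this PR; treat the snippet as a sketch, not documentation):

# The user takes over coordination; Workers and Drivers must then be given
# non-overlapping resources via their resourcesFile or discovery script.
spark.resources.coordinate.enable    false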

The solution for Spark to coordinate resources among Workers and Drivers is:

Generally, use a shared file named allocated_resources.json to sync allocated
resources info among Workers and Drivers on the same host.

After a Worker or Driver finds all resources via the configured resourcesFile and/or
discovery script during launch, it filters out the available resources by excluding those already allocated in allocated_resources.json, and acquires resources from the available ones according to its own requirements. After that, it writes its allocated resources, along with its process ID (pid), into allocated_resources.json. The pid (proposed by @tgravescs) is used to check whether the allocated resources are still valid, in case a Worker or Driver crashes and doesn't release its resources properly. When a Worker or Driver finishes normally, it always cleans up its own allocated resources in allocated_resources.json.
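As a rough sketch of that pid validity check (assuming Java 9+'s ProcessHandle; the PR may well use a different mechanism), a stale entry could be detected like this:

def isAllocationStale(pid: Long): Boolean = {
  val handle = ProcessHandle.of(pid)
  // If the owning process is gone, its entry in allocated_resources.json
  // no longer protects those addresses and can be reclaimed.
  !(handle.isPresent && handle.get.isAlive)
}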

Note that we always acquire a file lock before any access to allocated_resources.json
and release the lock afterwards.
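A minimal sketch of that lock-protected access pattern (the helper name withResourcesLock is invented for illustration; the PR's real code may differ):

import java.io.{File, RandomAccessFile}

def withResourcesLock[T](lockFile: File)(body: => T): T = {
  val raf = new RandomAccessFile(lockFile, "rw")
  // Blocks until this process holds an exclusive OS-level lock on the file.
  val lock = raf.getChannel.lock()
  try {
    // Read allocated_resources.json, acquire/release addresses, write back.
    body
  } finally {
    lock.release()
    raf.close()
  }
}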

Furthermore, we appended resource info to WorkerSchedulerStateResponse to work
around the Master change behaviour in HA mode.

How was this patch tested?

Added unit tests in WorkerSuite, MasterSuite, SparkContextSuite.

Manually tested in client/cluster mode (e.g. with multiple workers) on a single-node Standalone cluster.


Ngone51 commented Jul 4, 2019

val requests = parseAllResourceRequests(_conf, SPARK_DRIVER_PREFIX).map { req =>
  req.id.resourceName -> req.amount
}.toMap
// TODO(wuyi) log driver's acquired resources separately ?
@dongjoon-hyun (Member)

Hi, @Ngone51. Please don't use user-id TODOs in the patch. As you know, the Apache Spark repository already has a few ancient user-id TODOs like this which haven't been fixed to this day. :)
Since we don't know the future, let's use a JIRA-IDed TODO like TODO(SPARK-XXX).

@Ngone51 (Member Author)

@dongjoon-hyun Thank you for the reminder. I'll fix those TODOs in the following commits.

}

def hasEnoughResources(resourcesFree: Map[String, Int], resourceReqs: Map[String, Int])
: Boolean = {
Member

nit. indentation.
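Given that signature, a plausible body (a sketch, not necessarily the PR's exact implementation) is:

def hasEnoughResources(
    resourcesFree: Map[String, Int],
    resourceReqs: Map[String, Int]): Boolean = {
  // Every requested resource must have at least the requested amount free.
  resourceReqs.forall { case (rName, amount) =>
    resourcesFree.getOrElse(rName, 0) >= amount
  }
}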

val delegate = brothers.head._2
delegate.endpoint.send(ReleaseResources(worker.resourcesCanBeReleased))
} else {
// TODO(wuyi5) cases here are hard to handle:
Member

ditto.

* @return
*/
def acquireResources(resourceReqs: Map[String, Int])
: Map[String, Seq[String]] = {
Member

nit. indentation.

def acquireResources(resourceReqs: Map[String, Int])
: Map[String, Seq[String]] = {
resourceReqs.map { case (rName, amount) =>
// TODO (wuyi) rName does not exists ?
Member

ditto for (wuyi).

Contributor

what does the comment mean? Why wouldn't rName exist?
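For context, a sketch of what acquireResources plausibly does; the mutable free-address map is an assumed helper for this illustration, not the PR's code:

def acquireResources(
    freeAddresses: scala.collection.mutable.Map[String, Seq[String]],
    resourceReqs: Map[String, Int]): Map[String, Seq[String]] = {
  resourceReqs.map { case (rName, amount) =>
    // If rName was never discovered on this Worker, treat it as zero free
    // (presumably the situation the TODO above is asking about).
    val free = freeAddresses.getOrElse(rName, Seq.empty)
    require(free.size >= amount, s"Not enough free addresses for $rName")
    val (taken, rest) = free.splitAt(amount)
    freeAddresses(rName) = rest
    rName -> taken
  }
}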

case e: Exception =>
logError("Failed to create work directory " + workDir, e)
System.exit(1)
if (!Utils.createDirectory(workDir)) {
@dongjoon-hyun (Member) commented Jul 4, 2019

Ur, is it the same? Utils.createDirectory seems to work differently from workDir.mkdirs().
Or, do we need to change the current behavior of createWorkDir for this PR?
Oops. I realized that this is a newly added function in this PR.
I was confused because it overloads the existing one; only the parameters are different.

* @param addresses Resource addresses provided by the executor/worker
*/
class ResourceAllocator(name: String, addresses: Seq[String]) extends Serializable {
/**
Member

indentation.
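For orientation, a ResourceAllocator along these lines plausibly tracks free versus assigned addresses (a sketch under assumptions, not the PR's actual body):

class ResourceAllocator(name: String, addresses: Seq[String]) extends Serializable {
  // Addresses not yet handed out to a driver/executor.
  private val free = scala.collection.mutable.Queue(addresses: _*)

  def acquire(amount: Int): Seq[String] = {
    require(free.size >= amount, s"Not enough $name addresses left")
    Seq.fill(amount)(free.dequeue())
  }

  // Returning addresses makes them available for the next launch.
  def release(addrs: Seq[String]): Unit = free ++= addrs
}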


SparkQA commented Jul 4, 2019

Test build #107211 has finished for PR 25047 at commit a9160e4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ReleaseResources(toRelease: Map[String, ResourceInformation]) extends DeployMessage


SparkQA commented Jul 4, 2019

Test build #107229 has finished for PR 25047 at commit 73027b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 4, 2019

Test build #107243 has finished for PR 25047 at commit 3a54e16.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 5, 2019

Test build #107249 has finished for PR 25047 at commit d47c8db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 5, 2019

Test build #107276 has finished for PR 25047 at commit 0bf0d7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 5, 2019

Test build #107285 has finished for PR 25047 at commit d06985d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)

I have a few general questions; note I haven't looked at all of the code yet.

I'm not an expert in standalone mode, but it supports both a client mode and a cluster mode. In your description, are you saying even client mode will use the resource file and lock it? How do you know the client is running on a node with GPUs, or on a worker? I guess as long as the location is the same it doesn't matter. This is one thing we never handled in YARN; in client mode the user is on their own for resource coordination.

It seems unreliable to assume you have multiple workers per node (for the case where a worker crashes). When the worker dies it automatically kills any executors, correct? Is there a chance it doesn't?
It feels like you really want something like Worker restart and recovery, meaning if a worker crashes and you restart it, it should have the same id and discover what it had reserved before, and possibly any executors still running. But that is probably a much bigger change. Standalone uses a PID dir to tell which workers are running, correct? It could use this to track and check allocated resources. If you track the pid with the assignments, then when a new worker/driver starts it could check whether there are enough resources for it to allocate, or go by what the Master says, or both really: when a new Worker comes up, ask the Master who is supposed to be there and then check process existence.


SparkQA commented Aug 6, 2019

Test build #108714 has finished for PR 25047 at commit b46b243.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private[spark] val SPARK_TASK_PREFIX = "spark.task"

private[spark] val SPARK_RESOURCES_COORDINATE =
ConfigBuilder("spark.resources.coordinate.enable")
Contributor

Did you mean to turn this to false? If so, we need to update configuration.md to match. I'm OK either way.

@Ngone51 (Member Author)

Oops, I misunderstood your comment. I'd prefer it to be true. And the test failures are probably related to this change.

@tgravescs (Contributor)

test this please

}
}

test("Workers should avoid resources conflict when launch from the same host") {
Contributor

It would be nice to add a test with SPARK_RESOURCES_COORDINATE off to make sure all the resources from the file/discovery script are returned properly.

@Ngone51 (Member Author)

Good idea.

<td>(none)</td>
<td>
Path to resources file which is used to find various resources while worker starting up.
The content of resources file should be formatted like the <code>ResourceAllocation</code> class.
Contributor

I just realized this is supposed to be an array of ResourceAllocations

@Ngone51 (Member Author)

exactly.
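To make the agreed format concrete, an array-of-ResourceAllocation resources file might look like the following (field names inferred from the ResourceAllocation(id: ResourceID, addresses: Seq[String]) definition quoted later in this thread; an illustrative sketch only):

[
  { "id": { "componentName": "spark.worker", "resourceName": "gpu" },
    "addresses": ["0", "1"] }
]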


SparkQA commented Aug 6, 2019

Test build #108721 has finished for PR 25047 at commit b46b243.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 6, 2019

Test build #108725 has finished for PR 25047 at commit 16523bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)

@Ngone51 can you remove the [WIP] from the description? I think this is really close; I was going to make one more pass through, but things look good.

@Ngone51 Ngone51 changed the title [WIP][SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone [SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone Aug 7, 2019

Ngone51 commented Aug 7, 2019

@tgravescs Thanks for the reminder. I have updated the title and description.

.filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
worker.coresFree >= coresPerExecutor)
.filter(canLaunchExecutor(_, app.desc))
.sortBy(_.coresFree).reverse
Contributor

We have an issue here: if no Workers have the resources required by the executor/task requirements, then it doesn't warn/error and it doesn't retry. Basically, I started a Worker without GPUs, then said I need GPUs for my executor task, and it ended up hanging. I suppose one could argue this is OK since someone could start another Worker that has the resources, but I think we at least need to warn about it.

@Ngone51 (Member Author)

Actually, it does retry when other executors or drivers finish. But we can warn if an executor or driver requires more resources than any of the Workers could have. BTW, I'm wondering whether we have the same issue for memory and cores? For example, a Worker has 10 cores at most while an executor asks for 20 cores?

Contributor

Right, if something changes (another app, other workers, etc.) it retries, but if I'm the only app on the cluster it's not clear why the app isn't launching. The one thing I don't want is for it to be too noisy either. I thought about that before making the comment, because like you said, if it's just out of resources because other apps are running, we don't really want to print anything. I think for now we should limit it to resources and perhaps just say no Workers are configured with the resources you requested. If we can do that without much performance impact, let's do it. If not, maybe we just file a separate JIRA for it and look at it there.

@Ngone51 (Member Author) commented Aug 8, 2019

How about this way?

for (app <- waitingApps) {
  ...
  val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
    .filter(canLaunchExecutor(_, app.desc))
    .sortBy(_.coresFree).reverse
  if (waitingApps.size == 1 && usableWorkers.isEmpty) {
    logWarning("The app requires more resources (mem, core, accelerator) than any of the Workers could have.")
  }
  ...
}

Telling "the Workers are not configured with the resources(I mean accelerator) as an app requested" may require more changes. For example, you may need to traversal workers again to judge whether it's due to resources(I mean accelerator) or memory or cores. Or, you need to refactor canLaunchExecutor to tell more details.

Contributor

That sounds good for now. Let's also leave it called "resource" since that is what it's called everywhere right now, and just leave off the "(mem, core, accelerator)" part.

if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
if (canLaunchDriver(worker, driver.desc)) {
val allocated = worker.acquireResources(driver.desc.resourceReqs)
driver.withResources(allocated)
@tgravescs (Contributor) commented Aug 7, 2019

I'm assuming the driver could also fail to launch if the worker doesn't have any GPUs and the driver requests them; we may want to warn here as well.

*/
private[spark] case class ResourceAllocation(id: ResourceID, addresses: Seq[String]) {
@Evolving
case class ResourceAllocation(id: ResourceID, addresses: Seq[String]) {
Contributor

Sorry, I just realized we made this public but ResourceID is still private. Let's make this private again and just put the format in the docs like you had it before. Again, sorry for switching on you here. If we end up thinking users will find this useful, we can open it up to be public later.

@Ngone51 (Member Author)

never mind.


SparkQA commented Aug 9, 2019

Test build #108858 has finished for PR 25047 at commit 15a9897.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) left a comment

looks good, thanks @Ngone51

@asfgit asfgit closed this in cbad616 Aug 9, 2019

Ngone51 commented Aug 10, 2019

Thank you for your help @tgravescs
