
Conversation

wbo4958 (Contributor) commented Oct 24, 2023

What changes were proposed in this pull request?

This PR introduces the use of fractions instead of slots, similar to the CPU strategy,
for determining whether a worker offer can provide the necessary resources to tasks.

For instance, when an executor reports to the driver with ["gpu", ["1", "2"]], the driver constructs an executor data map.
The keys in the map represent the GPU addresses, and their default values are set to 1.0, indicating one whole GPU.

Consequently, the available resource amounts for the executor are as follows: { "1" -> 1.0f, "2" -> 1.0f }.

When offering resources to a task that requires 1 CPU and 0.08 GPU, the worker offer examines the available resource amounts.
It identifies that the capacity of GPU address "1" is at least the task's GPU requirement (1.0 >= 0.08).
Therefore, Spark assigns GPU address "1" to this task. After the assignment, the available resource amounts
for this executor are updated to { "1" -> 0.92, "2" -> 1.0 }, so the remaining capacity can be allocated to other tasks.
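A minimal sketch of that bookkeeping (illustrative names only, not the PR's actual classes): each address starts at 1.0 and the task amount is subtracted on assignment.

``` scala
import scala.collection.mutable

// Illustrative sketch of the fraction-based accounting described above.
// The executor reported ["gpu", ["1", "2"]]; each address starts at 1.0.
object FractionBookkeepingSketch extends App {
  val available = mutable.LinkedHashMap("1" -> 1.0, "2" -> 1.0)
  val taskGpuAmount = 0.08

  // Pick the first address whose remaining capacity covers the request.
  available.find { case (_, amount) => amount >= taskGpuAmount } match {
    case Some((addr, amount)) =>
      available(addr) = amount - taskGpuAmount
      // Prints something like: assigned gpu "1"; remaining: Map(1 -> 0.92, 2 -> 1.0)
      println(s"""assigned gpu "$addr"; remaining: $available""")
    case None =>
      println("no single gpu address can satisfy this task")
  }
}
```

Note that plain Doubles can accumulate floating-point drift under repeated subtraction; the merged change stores amounts as a scaled Long instead (see the RESOURCE_TOTAL_AMOUNT discussion later in this thread).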

In scenarios where other tasks, using different task resource profiles, request varying GPU amounts
while dynamic allocation is disabled, Spark applies the same comparison approach: it compares the task's GPU requirement with
the available resource amounts to determine whether the resources can be assigned to the task.

Why are the changes needed?

The existing resource offering (including GPU and FPGA) is based on "slots per address", which is defined by the default resource profile,
and it is a fixed number for all resource profiles when dynamic allocation is disabled.

Consider the test case below:

``` scala
  withTempDir { dir =>
    val scriptPath = createTempScriptWithExpectedOutput(dir, "gpuDiscoveryScript",
      """{"name": "gpu","addresses":["0"]}""")

    val conf = new SparkConf()
      .setAppName("test")
      .setMaster("local-cluster[1, 12, 1024]")
      .set("spark.executor.cores", "12")
      
    conf.set("spark.worker.resource.gpu.amount", "1")
    conf.set("spark.worker.resource.gpu.discoveryScript", scriptPath)
    conf.set("spark.executor.resource.gpu.amount", "1")
    conf.set("spark.task.resource.gpu.amount", "0.08")
    
    sc = new SparkContext(conf)
    val rdd = sc.range(0, 100, 1, 4)
    var rdd1 = rdd.repartition(3)
    val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)
    val rp = new ResourceProfileBuilder().require(treqs).build
    rdd1 = rdd1.withResources(rp)
    assert(rdd1.collect().size === 100)
  }
```

During the initial stages, Spark generates a default resource profile based on the configurations. The slots per GPU address
are calculated as "spark.executor.resource.gpu.amount / spark.task.resource.gpu.amount",
resulting in a value of 12 (floor(1 / 0.08) = floor(12.5) = 12). This means Spark can accommodate up to 12 tasks running on each GPU address simultaneously.
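The same arithmetic, spelled out (illustrative only; the values match the configuration in the test above):

``` scala
val executorGpuAmount = 1.0   // spark.executor.resource.gpu.amount
val taskGpuAmount     = 0.08  // spark.task.resource.gpu.amount
val slotsPerAddress   = (executorGpuAmount / taskGpuAmount).toInt  // floor(12.5) = 12
```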

The job is then divided into two stages. The first stage, which consists of 4 tasks, runs concurrently based on
the default resource profile. The second stage, comprising 3 tasks, is expected to run sequentially under a new task
resource profile specifying that each task requires 1 CPU and 1.0 full GPU.

In reality, the tasks in the second stage run in parallel, which is the underlying issue.

The problem lies in the line `new TaskResourceRequests().cpus(1).resource("gpu", 1.0)`. A GPU value of 1.0,
or any value in (0, 0.5] (which is rounded up to one slot; Spark throws an exception for values in (0.5, 1)),
merely requests a number of slots. In this case, it requests only 1 slot. Consequently, each task
needs 1 CPU core and 1 GPU slot, and all tasks run simultaneously.
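A sketch of that old slot-based interpretation (a hypothetical helper, not the scheduler's actual code):

``` scala
// Amounts in (0, 0.5] round up to one slot, amounts in (0.5, 1) are
// rejected, and whole numbers request that many full addresses.
def requestedSlots(taskAmount: Double): Int = {
  require(taskAmount > 0.0, s"invalid task amount $taskAmount")
  if (taskAmount >= 1.0) taskAmount.toInt            // whole addresses
  else if (taskAmount > 0.5) {
    throw new IllegalArgumentException(
      s"task amount $taskAmount in (0.5, 1) is rejected")
  } else 1                                           // rounded up to one slot
}

// requestedSlots(1.0) == 1 and requestedSlots(0.08) == 1: both profiles in
// the test above end up asking for one slot out of the 12 per address.
```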

Does this PR introduce any user-facing change?

No

How was this patch tested?

Ensured that all existing and new tests pass.

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the CORE label Oct 24, 2023
wbo4958 (Contributor, Author) commented Oct 24, 2023

The "Run docker integration test" and "Linters, License, dependencies" pipelines keep failing, but the code-related pipelines seem to pass. I will check it.

wbo4958 (Contributor, Author) commented Oct 24, 2023

Hi @tgravescs @Ngone51 @WeichenXu123, could you help review this PR? Thanks very much.

tgravescs (Contributor): I'll take a look, might not get to it today.

tgravescs (Contributor) left a comment:

So I started going through this, but in a bunch of places I'm worried this changes the behavior of the dynamic-allocation-on case, where you can't change task requirements within a resource profile. I need to spend more time on the bigger picture of this. I also want to make sure this doesn't add a lot of extra time in a critical scheduler path.

Contributor:

Why isn't this just using taskSetProf.getSchedulerTaskResourceAmount(rName), with that function updated to do the right thing?

Also, there is a comment in ResourceProfile.scala, in getSchedulerTaskResourceAmount, that assumes addresses are handled the old way.

I'm again worried about this changing the behavior with dynamic allocation on.

Contributor Author:

The new way uses the original taskAmount directly for the resource calculation, so we no longer need taskSetProf.getSchedulerTaskResourceAmount(rName).

> I'm again worried about this changing the behavior with dynamic allocation on.

Yeah, I've added checks for the specific ResourceProfile types, such as TaskResourceProfile and ResourceProfile.

tgravescs (Contributor) left a comment:

Need to check the barrier scheduling algorithm, which checks for max slots, to make sure this doesn't break it.

I'm also curious whether you get warnings from the warnOnWastedResources function when setting things like this. Maybe it's bypassing those checks, but getNumSlotsPerAddress, whose usage you removed from the code, is still used for those warnings.


override def receive: PartialFunction[Any, Unit] = {
case StatusUpdate(executorId, taskId, state, data, taskCpus, resources) =>
case StatusUpdate(executorId, taskId, state, data, taskCpus, resources, resourcesAmounts) =>
Contributor:

resources here is no longer used, and it seems like a lot of duplicate information now. We should figure out a better way to do this. This also likely means the way it's stored on the executor side needs to change.

Contributor Author:

Yeah, I've noticed that. The reason I didn't change this part is that I'd like to keep this PR's changes minimal. Is that OK?

Contributor:

No, it doesn't make sense to leave something unused here, and we should be as efficient as we can about storage and passing things around.

Contributor Author:

Done. Removed all the duplicated, unused resources.

``` scala
}

override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer): Unit = {
  val resources = taskResources.getOrDefault(taskId, Map.empty[String, ResourceInformation])
```
Contributor:

I think resources here is essentially unused now, which means taskResources is likely unused too.

Contributor Author:

Yeah, you're right. I'd like to keep this PR's changes minimal; could we clean them up in a follow-up?

Contributor:

No

Contributor Author:

Got it. I will fix them in this PR.

Contributor Author:

Done.

``` scala
    val resources: immutable.Map[String, ResourceInformation],
    // resourcesAmounts is the total resources assigned to the task
    // E.g., Map("gpu" -> Map("0" -> 0.7)): assign 0.7 of the gpu address "0" to this task
    val resourcesAmounts: immutable.Map[String, immutable.Map[String, Double]],
```
Contributor:

This also still seems like we're keeping duplicate information: we have the resources and then the resource amounts, which carry the same info. We may need something like a ResourceInformationWithAmount that combines these. The resources on the executor side do get into the TaskContext, so we need to keep that information.
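A hypothetical shape for such a combined class (illustrative only; as the replies below note, the PR ultimately kept a single amounts map and reconstructed ResourceInformation in the TaskContext instead):

``` scala
// One structure carrying both the assigned addresses and the fraction of
// each address held by the task.
case class ResourceInformationWithAmount(
    name: String,                            // e.g. "gpu"
    addressAmounts: Map[String, Double]) {   // e.g. Map("0" -> 0.7)
  // The plain address list that ResourceInformation exposes today.
  def addresses: Array[String] = addressAmounts.keys.toArray.sorted
}
```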

Contributor Author:

My original thinking was that we can remove val resources: immutable.Map[String, ResourceInformation] and reconstruct it in the TaskContext from resourcesAmounts.

Contributor Author:

Done. Removed resources and reconstructed it in the TaskContext.

``` scala
 * )
 */
private[spark] class ExecutorResourcesAmounts(
    private val resources: Map[String, Map[String, Double]]) extends Serializable {
```
Contributor:

Is there a reason we're taking these as Double vs. just using the Long representation (double * RESOURCE_TOTAL_AMOUNT)? It seems more efficient not to convert back and forth.

Contributor Author:

Good suggestion. Let me just use the Long representation.

Contributor Author:

Done.

``` scala
 * Get the resources and their amounts.
 * @return the resources amounts
 */
def resourcesAmounts: Map[String, Double] = addressAvailabilityMap.map {
```
Contributor:

Leave these in the Long form. I think the only place this is used is ExecutorResourcesAmounts, which could store them the same way. This is a global comment: if we can store the value in Long format, pass that everywhere, and skip converting, I'd rather do that. Only convert back to double to display to the user and possibly in logs.

I guess we do need to be careful to make sure these are 1.0 or less, though. If we get into requests where a user could ask for 250000 resources, we could hit overflow issues, so if we pass those requests around we might need to keep them in double format. Hopefully those are limited to the requests in the resource profiles, and what we pass around is the GPU index -> amount, which should be 1.0 or less.

Contributor Author:

Yes, leaving them in Long form should be more efficient.

Contributor Author:

Hi Tom, regarding "if we start getting into the requests where user could ask for 250000 resources then we could hit overflow issues":

I don't understand why we would hit an overflow issue.

Contributor:

I'm just saying that if we store things in the Long format (10000000000000000L * number of resources requested), be sure that isn't going to overflow. Since I said it's a generic comment all over, just make sure it can't happen. If the values we store are always < (10000000000000000L * 1), it's not a problem.

Overflow means you have a value larger than 2^63 - 1 (the maximum signed Long), which then isn't positive anymore:

``` scala
scala> 10000000000000000L * 2500l
res7: Long = 6553255926290448384

scala> 10000000000000000L * 25000l
res8: Long = -8254417031933722624
```

Contributor Author:

Thanks for the explanation. Yeah, the values we store are always < (10000000000000000L * 1), so overflow is not going to happen.

``` diff
     taskResourceAssignments: Map[String, ResourceInformation],
-    launchTime: Long): TaskDescription = {
+    launchTime: Long,
+    resourcesAmounts: Map[String, Map[String, Double]]): TaskDescription = {
```
Contributor:

Same thing in these classes as mentioned earlier: it seems we're duplicating a lot of information between this and taskResourceAssignments. I would like to see a new class that tracks both. If there's some reason you think that won't work, let me know.

Contributor Author:

I think we can remove taskResourceAssignments and reconstruct it from resourcesAmounts, but I'd like to keep this PR's changes minimal and fix it in a follow-up.

Contributor Author:

Done.

``` scala
internalResources.get(rName) match {
  case Some(addressesAmountMap) =>

    var internalTaskAmount = (taskAmount * RESOURCE_TOTAL_AMOUNT).toLong
```
Contributor:

This might have overflow issues if taskAmount is large. You might need to handle amounts > 1.0 differently than amounts < 1.0.

Contributor Author:

Yes, you're right. Thanks for catching that. I will fix it.

Contributor Author:

Done.

beliefer changed the title from "[SPARK-45527][core] Use fraction to do the resource calculation" to "[SPARK-45527][CORE] Use fraction to do the resource calculation" Nov 6, 2023
``` scala
// Convert resources amounts into ResourceInformation
val resources = taskDesc.resources.map { case (rName, addressesAmounts) =>
  rName -> new ResourceInformation(rName, addressesAmounts.keys.toSeq.sorted.toArray)}
taskResources.put(taskDesc.taskId, resources)
```
Contributor:

I don't think taskResources is needed at all anymore. Let's remove it unless you see it being used for something I'm missing. It was used in the statusUpdate call below, which you removed. I actually think it wasn't needed even before that (it changed in Spark 3.4), since taskDescription and runningTasks hold the same information and are now accessible.

Contributor:

So taskResources was originally required and provided useful functionality, because the extra resources from taskDesc.resources weren't exposed publicly here. Since a previous change, taskResources is only used by the tests. But the tests were validating with taskResources only because that is how it used to work functionally; the fact that it isn't actually used anymore means we are testing something that could be functionally wrong.

We should update the tests to stop using taskResources (and remove taskResources) and instead use backend.executor.runningTasks.get(taskId).taskDescription.resources, which is what we really use to track the extra resources.

Contributor Author:

Sounds good. The new commits have removed taskResources.

wbo4958 (Contributor, Author) commented Nov 13, 2023

Hi Tom, the resource amount info will be displayed in the log like below:

```
23/11/13 09:40:22 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (10.51.70.102, executor 0, partition 1,
PROCESS_LOCAL, 7823 bytes) taskResourceAssignments Map(gpu -> Map(0 -> 2000000000000000))
```

compared to the original log info:

```
23/11/13 09:38:47 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 5) (10.51.70.102, executor 0, partition 0,
NODE_LOCAL, 7713 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
```
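For reference, the large value in the new log line is the internal fixed-point amount; assuming the 10^16 scale factor discussed above, it decodes back to a fraction like this:

``` scala
val internal = 2000000000000000L         // value from the log line above
val scale = 10000000000000000L           // RESOURCE_TOTAL_AMOUNT, i.e. 10^16
val fraction = internal.toDouble / scale // 0.2 of GPU address "0"
```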

Should I change this behavior to align with the original one?

wbo4958 (Contributor, Author) commented Nov 14, 2023

> Need to check the barrier scheduling algorithm, which checks for max slots, to make sure this doesn't break it.
>
> I'm also curious whether you get warnings from the warnOnWastedResources function when setting things like this.

Yeah. I added tests for barrier scheduling; please refer to here and here and here and here,

and tests for warnOnWastedResources; please refer to here and here and here.

tgravescs (Contributor) commented Nov 15, 2023

> 23/11/13 09:40:22 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (10.51.70.102, executor 0, partition 1,
> PROCESS_LOCAL, 7823 bytes) taskResourceAssignments Map(gpu -> Map(0 -> 2000000000000000))
>
> Should I change this behavior to align with the original one?

I think the log message as you have it is fine for now. I don't want to add extra logic here unless we really need it.

``` scala
 * One example is GPUs, where the addresses would be the indices of the GPUs
 *
 * @param resources The executor available resources and amount. eg,
 *                  Map("gpu" -> Map("0" -> 0.2*RESOURCE_TOTAL_AMOUNT,
```
Contributor:

Nit: put spaces around the *.

Contributor Author:

Done

* "b" -> 0.5 * RESOURCE_TOTAL_AMOUNT))
* the resourceAmount will be Map("gpu" -> 3, "fpga" -> 2)
*/
lazy val resourceAmount: Map[String, Int] = internalResources.map { case (rName, addressMap) =>
Contributor:

Rename this to resourceAddressAmount.

Contributor Author:

Done.

``` scala
 * Double can display up to 16 decimal places, so we set the factor to
 * 10,000,000,000,000,000L.
 */
final val RESOURCE_TOTAL_AMOUNT: Long = 10000000000000000L
```
Contributor:

I think we should rename this ONE_ENTIRE_RESOURCE, or something that indicates it is the entire amount of a single resource.

Contributor Author:

Really good suggestion. Done.

``` scala
    }
  }
} else if (taskAmount > 0.0) { // 0 < task.amount < 1.0
  val internalTaskAmount = (taskAmount * RESOURCE_TOTAL_AMOUNT).toLong
```
Contributor:

Can you make a utility function that converts to the internal task amount, i.e., does the * RESOURCE_TOTAL_AMOUNT?

Contributor Author:

Really good suggestion. Done.
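A sketch of what that utility can look like (the merged code exposes ResourceAmountUtils.toInternalResource, quoted later in this thread; the body here is an illustration):

``` scala
object ResourceAmountUtilsSketch {
  // One whole unit of a single resource in the internal fixed-point scale.
  final val ONE_ENTIRE_RESOURCE: Long = 10000000000000000L

  // User-facing fraction (e.g. 0.08 of a GPU) -> internal Long amount.
  def toInternalResource(amount: Double): Long =
    (amount * ONE_ENTIRE_RESOURCE).toLong

  // Internal Long amount -> user-facing fraction, for display and logs.
  def toFractionalResource(amount: Long): Double =
    amount.toDouble / ONE_ENTIRE_RESOURCE
}
```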

``` scala
 * @return the optional resources amounts
 */
def assignResources(taskSetProf: ResourceProfile): Option[Map[String, Map[String, Long]]] = {

```
Contributor:

Nit: remove the extra newline.

Contributor Author:

Done


tgravescs (Contributor): Sorry for the delay on this. Overall it looks good; a few minor comments. Can you confirm you tested the dynamic allocation path and that everything works as expected there?

wbo4958 (Contributor, Author) commented Jan 2, 2024

Manual test on a Spark Standalone cluster

Environment

The Spark Standalone cluster consists of a single worker node equipped with 8 CPU cores and no physical GPUs. However, Spark can manage GPU resources by using GPU IDs instead of actual GPUs. To achieve this, configure the spark.worker.resource.gpu.discoveryScript setting with a script that returns the GPU IDs. For instance:

```
#!/bin/bash
# /tmp/gpu_discovery.sh: report fake GPU addresses in Spark's discovery JSON format
cat <<EOF
{"name": "gpu","addresses":["0", "1"]}
EOF
```

We can change the script above to report 1 GPU, 2 GPUs, or any set of GPUs.

with dynamic allocation off

1 GPU

  • configurations

Add the below configurations to SPARK_HOME/conf/spark-defaults.conf:

```
spark.worker.resource.gpu.amount 1
spark.worker.resource.gpu.discoveryScript /tmp/gpu_discovery.sh
```

  • spark-submit configurations

```
spark-shell --master spark://192.168.0.103:7077 --conf spark.executor.cores=8 --conf spark.task.cpus=1 \
   --conf spark.executor.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=0.125 \
   --conf spark.dynamicAllocation.enabled=false
```

The above spark-submit configuration launches a single executor with 8 CPU cores and 1 GPU. Each task requires 1 CPU core and 0.125 GPUs, allowing the concurrent execution of 8 tasks.

  • test code

``` scala
import org.apache.spark.TaskContext
import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

val rdd = sc.range(0, 100, 1, 12).mapPartitions { iter => {
  val tc = TaskContext.get()
  val tid = tc.partitionId()
  assert(tc.resources()("gpu").addresses sameElements Array("0"))
  iter
}}

val rdd1 = rdd.repartition(2)
val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 0.6)
val rp = new ResourceProfileBuilder().require(treqs).build
val rdd2 = rdd1.withResources(rp).mapPartitions { iter => {
  val tc = TaskContext.get()
  val tid = tc.partitionId()
  assert(tc.resources()("gpu").addresses sameElements Array("0"))
  iter
}
}
rdd2.collect()
```

The provided Spark job is split into two stages. The first stage comprises 12 tasks, each requiring 1 CPU core and 0.125 GPUs, so the first 8 tasks run concurrently, followed by the remaining 4 tasks.
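The concurrency bound comes from taking the minimum across resources; a quick sketch of the arithmetic for this configuration (illustrative only):

``` scala
// Each resource imposes its own bound on task parallelism within the
// executor; the effective concurrency is the minimum of the bounds.
val cpuSlots = 8 / 1                                   // executor cores / task cpus = 8
val gpuSlots = (1.0 / 0.125).toInt                     // executor gpu / task gpu = 8
val maxConcurrentTasks = math.min(cpuSlots, gpuSlots)  // 8 tasks at a time
```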

[screenshot: index_0-dyn-off-1-gpu-shuffle-stages]

In contrast, the second stage consists of 2 tasks, each requiring 1 CPU core and 0.6 GPUs. Consequently, only one task can run at any given time, and the 2 tasks execute sequentially.

[screenshot: index_1-dyn-off-1-gpu-result-stages]

2 GPUs

  • configurations

Add the below configurations to SPARK_HOME/conf/spark-defaults.conf:

```
spark.worker.resource.gpu.amount 2
spark.worker.resource.gpu.discoveryScript /tmp/gpu_discovery.sh
```

  • spark-submit configurations

```
spark-shell --master spark://192.168.0.103:7077 --conf spark.executor.cores=8 --conf spark.task.cpus=1 \
   --conf spark.executor.resource.gpu.amount=2 --conf spark.task.resource.gpu.amount=0.25 \
   --conf spark.dynamicAllocation.enabled=false
```

The above spark-submit configuration launches a single executor with 8 CPU cores and 2 GPUs. Each task requires 1 CPU core and 0.25 GPUs, allowing the concurrent execution of 8 tasks: the first 4 tasks grab GPU 0, while the remaining 4 tasks grab GPU 1 due to the round-robin manner of offering resources.

  • test code

``` scala
import org.apache.spark.TaskContext
import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

val rdd = sc.range(0, 100, 1, 8).mapPartitions { iter => {
  val tc = TaskContext.get()
  val tid = tc.partitionId()
  if (tid >= 4) {
    assert(tc.resources()("gpu").addresses sameElements Array("1"))
  } else {
    assert(tc.resources()("gpu").addresses sameElements Array("0"))
  }
  iter
}}

val rdd1 = rdd.repartition(2)
val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 0.6)
val rp = new ResourceProfileBuilder().require(treqs).build
val rdd2 = rdd1.withResources(rp).mapPartitions { iter => {
  val tc = TaskContext.get()
  val tid = tc.partitionId()
  if (tid > 0) {
    assert(tc.resources()("gpu").addresses sameElements Array("1"))
  } else {
    assert(tc.resources()("gpu").addresses sameElements Array("0"))
  }
  iter
}
}
rdd2.collect()
```

The provided Spark job is split into two stages. The first stage comprises 8 tasks, each requiring 1 CPU core and 0.25 GPUs, so all 8 tasks run concurrently. The first 4 tasks grab GPU 0, while the remaining 4 tasks grab GPU 1 due to the round-robin manner of offering resources; the assert lines verify this policy.

[screenshot: index_2-dyn-off-2-gpus-shuffle-stage]

In contrast, the second stage consists of 2 tasks, each requiring 1 CPU core and 0.6 GPUs. Since there are 2 GPUs available, both tasks run concurrently, each grabbing a different GPU; the assert lines verify that.

[screenshot: index_3-dyn-off-2-gpus-result-stage]

concurrent spark jobs

This test case ensures that another Spark job can still grab the remaining GPU resources and run alongside the first Spark job.

  • configurations

Add the below configurations to SPARK_HOME/conf/spark-defaults.conf:

```
spark.worker.resource.gpu.amount 2
spark.worker.resource.gpu.discoveryScript /tmp/gpu_discovery.sh
```

  • spark-submit configurations

```
spark-shell --master spark://192.168.0.103:7077 --conf spark.executor.cores=8 --conf spark.task.cpus=1 \
   --conf spark.executor.resource.gpu.amount=2 --conf spark.task.resource.gpu.amount=0.25 \
   --conf spark.dynamicAllocation.enabled=false
```

The above spark-submit configuration launches a single executor with 8 CPU cores and 2 GPUs. Each task requires 1 CPU core and 0.25 GPUs, allowing the concurrent execution of 8 tasks: the first 4 tasks grab GPU 0, while the remaining 4 tasks grab GPU 1 due to the round-robin manner of offering resources.

  • test code

``` scala
    import org.apache.spark.TaskContext
    import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

    // Submit Spark Job 0 in thread1.
    val thread1 = new Thread(() => {
      val rdd = sc.range(0, 8, 1, 8).mapPartitions { iter => {
        val tc = TaskContext.get()
        val tid = tc.partitionId()
        if (tid >= 4) {
          assert(tc.resources()("gpu").addresses sameElements Array("1"))
        } else {
          assert(tc.resources()("gpu").addresses sameElements Array("0"))
        }
        iter
      }
      }

      val rdd1 = rdd.repartition(2)

      val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 0.6)
      val rp = new ResourceProfileBuilder().require(treqs).build

      val rdd2 = rdd1.withResources(rp).mapPartitions { iter => {
        val tc = TaskContext.get()
        val tid = tc.partitionId()
        assert(tc.resources()("gpu").addresses sameElements Array(tid.toString))
        println("sleeping 20s")
        Thread.sleep(20000)
        iter
      }
      }
      rdd2.collect()
    })

    thread1.start()
    // sleep 5s in main thread to make sure the spark result tasks launched in thread1 are running
    Thread.sleep(5000)

    // Submit Spark Job 1 in main thread.
    // Each spark result task in thread1 takes 0.6 gpus, so there is only 0.4 gpus (for each gpu) left.
    // since the default task gpu amount = 0.25, the concurrent spark tasks in Spark Job 1
    // will be 1(0.4/0.25) * 2 (2 gpus)
    val rdd = sc.range(0, 4, 1, 2).mapPartitions(iter => {
      Thread.sleep(10000)
      val tc = TaskContext.get()
      val tid = tc.partitionId()
      if (tid % 2 == 1) {
        assert(tc.resources()("gpu").addresses sameElements Array("1"))
      } else {
        assert(tc.resources()("gpu").addresses sameElements Array("0"))
      }
      iter
    })
    rdd.collect()

    thread1.join()
```

The given Spark application consists of two Spark jobs. Spark job 0 is submitted in thread1, while Spark job 1 is submitted in the main thread. To guarantee that Spark job 0 runs before Spark job 1, a 5-second sleep is included in the main thread. The result tasks in Spark job 0 then pause for 20 seconds, giving Spark job 1 a window to run; this tests whether Spark job 1 can utilize the remaining GPU capacity and execute concurrently with Spark job 0.

Event timeline

[screenshot: index_6-dyn-off-concurrent-total-events]

From the picture we can see that Spark job 1 ran alongside Spark job 0 and finished before Spark job 0.

Spark job 0

[screenshot: index_7-dyn-off-concurrent-job-0-shuffle]
[screenshot: index_8-dyn-off-concurrent-job-0-result]

Spark job 1

[screenshot: index_9-dyn-off-concurrent-job1-result]


If we change val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1) to require 1 whole GPU for each task, then Spark job 1 will not grab any GPUs, because no GPU capacity is left while Spark job 0 is running.

``` scala
    import org.apache.spark.TaskContext
    import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

    // Submit Spark Job 0 in thread1.
    val thread1 = new Thread(() => {
      val rdd = sc.range(0, 8, 1, 8).mapPartitions { iter => {
        val tc = TaskContext.get()
        val tid = tc.partitionId()
        if (tid >= 4) {
          assert(tc.resources()("gpu").addresses sameElements Array("1"))
        } else {
          assert(tc.resources()("gpu").addresses sameElements Array("0"))
        }
        iter
      }
      }

      val rdd1 = rdd.repartition(2)

      val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1)
      val rp = new ResourceProfileBuilder().require(treqs).build

      val rdd2 = rdd1.withResources(rp).mapPartitions { iter => {
        val tc = TaskContext.get()
        val tid = tc.partitionId()
        assert(tc.resources()("gpu").addresses sameElements Array(tid.toString))
        println("sleeping 20s")
        Thread.sleep(20000)
        iter
      }
      }
      rdd2.collect()
    })

    thread1.start()
    // sleep 5s in main thread to make sure the spark result tasks launched in thread1 are running
    Thread.sleep(5000)

    // Submit Spark Job 1 in main thread.
    // Each spark result task in thread1 takes 1 gpus, so there is no available gpus left for spark job 1.
    // The spark job 1 will run after spark job 0 finished, but we can't ensure which gpu the task will grab. 
    val rdd = sc.range(0, 4, 1, 2).mapPartitions(iter => {
      Thread.sleep(10000)
      val tc = TaskContext.get()
      assert(tc.resources().contains("gpu"))
      iter
    })
    rdd.collect()

    thread1.join()
```

[screenshot: index_zzz-dyn-off-concurrent-job-sequentially]

From the picture we can see that Spark job 1 was submitted while Spark job 0 was running, but the tasks of Spark job 1 didn't run because of the lack of GPU resources. After Spark job 0 finished and released the GPUs, the tasks of Spark job 1 could grab them and run.

wbo4958 (Contributor, Author) commented Jan 2, 2024

with dynamic allocation on

1 GPU

  • configurations

Add the below configurations to SPARK_HOME/conf/spark-defaults.conf:

```
spark.worker.resource.gpu.amount 1
spark.worker.resource.gpu.discoveryScript /tmp/gpu_discovery.sh
```

  • spark-submit configurations

```
spark-shell --master spark://192.168.0.103:7077 --conf spark.executor.instances=1 --conf spark.executor.cores=4 --conf spark.task.cpus=1 \
 --conf spark.dynamicAllocation.enabled=true
```

With the above spark-submit configuration, dynamic allocation is enabled and an initial executor with 4 CPU cores is launched. As each task requires 1 CPU core, this setup allows the simultaneous execution of 4 tasks.

  • test code

``` scala
    import org.apache.spark.TaskContext
    import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

    val rdd = sc.range(0, 100, 1, 6).mapPartitions { iter => {
      val tc = TaskContext.get()
      assert(!tc.resources().contains("gpu"))
      iter
    }
    }

    val ereqs = new ExecutorResourceRequests().cores(8).resource("gpu", 1)
    val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 0.125)
    val rp = new ResourceProfileBuilder().require(ereqs).require(treqs).build

    val rdd1 = rdd.repartition(12).withResources(rp).mapPartitions { iter => {
      Thread.sleep(1000)
      val tc = TaskContext.get()
      assert(tc.resources()("gpu").addresses sameElements Array("0"))
      iter
    }
    }
    rdd1.collect()
```

The provided Spark job is split into two stages. The first stage comprises 6 tasks, each requiring 1 CPU core under the default resource profile, so the first 4 tasks run concurrently, followed by the remaining 2 tasks.

[screenshot: index_0-dyn-on-1-gpu-shuffle-stage]

The second stage consists of 12 tasks that demand a distinct resource profile: executors equipped with 8 cores and 1 GPU, with each task requiring 1 CPU core and 0.125 GPUs. As a result, the first 8 tasks execute concurrently, followed by the remaining 4 tasks.

[screenshot: index_1-dyn-on-1-gpu-result-stage]

2 GPUs

  • configurations

Add the below configurations to SPARK_HOME/conf/spark-defaults.conf:

```
spark.worker.resource.gpu.amount 2
spark.worker.resource.gpu.discoveryScript /tmp/gpu_discovery.sh
```

  • spark-submit configurations

```
spark-shell --master spark://192.168.0.103:7077 --conf spark.executor.instances=1 --conf spark.executor.cores=4 --conf spark.task.cpus=1 \
   --conf spark.dynamicAllocation.enabled=true
```

With the above spark-submit configuration, dynamic allocation is enabled and an initial executor with 4 CPU cores is launched. As each task requires 1 CPU core, this setup allows the simultaneous execution of 4 tasks.

  • test code

``` scala
    import org.apache.spark.TaskContext
    import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

    val rdd = sc.range(0, 100, 1, 6).mapPartitions { iter => {
      val tc = TaskContext.get()
      assert(!tc.resources().contains("gpu"))
      iter
    }
    }

    val ereqs = new ExecutorResourceRequests().cores(8).resource("gpu", 2)
    val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 0.25)
    val rp = new ResourceProfileBuilder().require(ereqs).require(treqs).build

    val rdd1 = rdd.repartition(8).withResources(rp).mapPartitions { iter => {
      Thread.sleep(1000)
      val tc = TaskContext.get()
      val tid = tc.partitionId()
      if (tid >= 4) {
        assert(tc.resources()("gpu").addresses sameElements Array("1"))
      } else {
        assert(tc.resources()("gpu").addresses sameElements Array("0"))
      }
      iter
    }
    }
    rdd1.collect()
```

The provided Spark job is split into two stages. The first stage comprises 6 tasks, each requiring 1 CPU core under the default resource profile, so the first 4 tasks run concurrently, followed by the remaining 2 tasks.

[screenshot: index_5-dyn-on-2-gpus-shuffle]

The second stage consists of 8 tasks that demand a distinct resource profile: executors with 8 cores and 2 GPUs, with each task requiring 1 CPU core and 0.25 GPUs. As a result, all 8 tasks execute concurrently; the first 4 tasks grab GPU 0 while the remaining 4 grab GPU 1, and the assert lines verify that.

[screenshot: index_6-dyn-on-2-gpus-result-stage]

wbo4958 (Contributor, Author) commented Jan 2, 2024

Hi @tgravescs, I did the manual tests with dynamic allocation on and off on the Spark Standalone cluster. Please check the comments above for dynamic allocation off and dynamic allocation on.

If you think I still need to run other tests, please feel free to tell me. Thanks very much.

wbo4958 requested a review from tgravescs January 3, 2024 22:53
tgravescs (Contributor): Thanks for the detailed test results and all the work! This looks good.

asfgit closed this in ae2e00e Jan 4, 2024
tgravescs (Contributor): Merged to master branch.

wbo4958 added a commit to wbo4958/spark that referenced this pull request Jan 9, 2024
Closes apache#43494 from wbo4958/SPARK-45527.

Authored-by: Bobby Wang <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
wbo4958 deleted the SPARK-45527 branch January 15, 2024 08:48
Ngone51 pushed a commit that referenced this pull request Jan 29, 2024
### What changes were proposed in this pull request?

When cherry-picking #43494 back to branch-3.5 (#44690),
I ran into an issue where some tests failed on Scala 2.12 when comparing two maps. It turned out that the function [compareMaps](https://github.com/apache/spark/pull/43494/files#diff-f205431247dd9446f4ce941e5a4620af438c242b9bdff6e7faa7df0194db49acR129) does not behave consistently across Scala 2.12 and Scala 2.13.

- scala 2.13

``` scala
Welcome to Scala 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.9).
Type in expressions for evaluation. Or try :help.

scala> def compareMaps(lhs: Map[String, Double], rhs: Map[String, Double],
     |                   eps: Double = 0.00000001): Boolean = {
     |     lhs.size == rhs.size &&
     |       lhs.zip(rhs).forall { case ((lName, lAmount), (rName, rAmount)) =>
     |         lName == rName && (lAmount - rAmount).abs < eps
     |       }
     | }
     |
     | import scala.collection.mutable.HashMap
     | val resources = Map("gpu" -> Map("a" -> 1.0, "b" -> 2.0, "c" -> 3.0, "d"-> 4.0))
     | val mapped = resources.map { case (rName, addressAmounts) =>
     |  rName -> HashMap(addressAmounts.toSeq.sorted: _*)
     | }
     |
     | compareMaps(resources("gpu"), mapped("gpu").toMap)
def compareMaps(lhs: Map[String,Double], rhs: Map[String,Double], eps: Double): Boolean
import scala.collection.mutable.HashMap
val resources: scala.collection.immutable.Map[String,scala.collection.immutable.Map[String,Double]] = Map(gpu -> Map(a -> 1.0, b -> 2.0, c -> 3.0, d -> 4.0))
val mapped: scala.collection.immutable.Map[String,scala.collection.mutable.HashMap[String,Double]] = Map(gpu -> HashMap(a -> 1.0, b -> 2.0, c -> 3.0, d -> 4.0))
val res0: Boolean = true
```

- scala 2.12

``` scala
Welcome to Scala 2.12.14 (OpenJDK 64-Bit Server VM, Java 17.0.9).
Type in expressions for evaluation. Or try :help.

scala> def compareMaps(lhs: Map[String, Double], rhs: Map[String, Double],
     |                   eps: Double = 0.00000001): Boolean = {
     |     lhs.size == rhs.size &&
     |       lhs.zip(rhs).forall { case ((lName, lAmount), (rName, rAmount)) =>
     |         lName == rName && (lAmount - rAmount).abs < eps
     |       }
     | }
compareMaps: (lhs: Map[String,Double], rhs: Map[String,Double], eps: Double)Boolean

scala> import scala.collection.mutable.HashMap
import scala.collection.mutable.HashMap

scala> val resources = Map("gpu" -> Map("a" -> 1.0, "b" -> 2.0, "c" -> 3.0, "d"-> 4.0))
resources: scala.collection.immutable.Map[String,scala.collection.immutable.Map[String,Double]] = Map(gpu -> Map(a -> 1.0, b -> 2.0, c -> 3.0, d -> 4.0))

scala> val mapped = resources.map { case (rName, addressAmounts) =>
     |   rName -> HashMap(addressAmounts.toSeq.sorted: _*)
     | }
mapped: scala.collection.immutable.Map[String,scala.collection.mutable.HashMap[String,Double]] = Map(gpu -> Map(b -> 2.0, d -> 4.0, a -> 1.0, c -> 3.0))

scala> compareMaps(resources("gpu"), mapped("gpu").toMap)
res0: Boolean = false
```

The same buggy code produced different results on Scala 2.12 and Scala 2.13: compareMaps zips the two maps and compares entries pairwise, which relies on the maps' iteration order, and that order is not guaranteed to agree between the two versions' collection implementations. This PR reworks compareMaps so the tests pass on both Scala 2.12 and Scala 2.13.
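One order-insensitive way to write the comparison (a sketch; the exact reworked helper in the PR may differ) is to look each key up in the other map instead of zipping:

``` scala
def compareMaps(lhs: Map[String, Double], rhs: Map[String, Double],
                eps: Double = 1e-8): Boolean = {
  // Equal sizes plus key-by-key lookup make the result independent of
  // either map's iteration order.
  lhs.size == rhs.size && lhs.forall { case (name, lAmount) =>
    rhs.get(name).exists(rAmount => (lAmount - rAmount).abs < eps)
  }
}
```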

### Why are the changes needed?

Some users may back-port #43494 to an older branch built with Scala 2.12 and run into the same issue. It is trivial work to make the GPU fraction tests compatible with both Scala 2.12 and Scala 2.13.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Make sure all the CI pipelines pass

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44735 from wbo4958/gpu-fraction-tests.

Authored-by: Bobby Wang <[email protected]>
Signed-off-by: Yi Wu <[email protected]>
wbo4958 added a commit to wbo4958/spark that referenced this pull request Jan 30, 2024
Closes apache#44735 from wbo4958/gpu-fraction-tests.

Authored-by: Bobby Wang <[email protected]>
Signed-off-by: Yi Wu <[email protected]>
``` scala
// resources is the total resources assigned to the task
// Eg, Map("gpu" -> Map("0" -> ResourceAmountUtils.toInternalResource(0.7))):
// assign 0.7 of the gpu address "0" to this task
val resources: immutable.Map[String, immutable.Map[String, Long]],
```
Member:

Nit: shall we create a named type for this map and have its key and value better clarified? @wbo4958

Contributor Author:

Do you think we should open a new PR against the master branch for your comment? @Ngone51

``` scala
val taskCpus = 1
val taskGpus = 0.3
val executorGpus = 4
val executorCpus = 1000
```
Member:

1000 threads seems too large for some CI systems with limited resources:

```
Warning: [766.327s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 4096k, guardsize: 16k, detached.
Warning: [766.327s][warning][os,thread] Failed to start the native thread for java.lang.Thread "dispatcher-event-loop-840"
*** RUN ABORTED ***
An exception or error caused a run to abort: unable to create native thread: possibly out of memory or process/resource limits reached
  java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
```

I made a test-case follow-up.

Contributor Author:

Thanks for the improvement.

dongjoon-hyun added a commit that referenced this pull request Feb 26, 2024
…m 1k to 100 in `TaskSchedulerImplSuite`

### What changes were proposed in this pull request?

This PR is a follow-up of #43494 that reduces the number of SparkContext threads from 1k to 100 in the test environment.

### Why are the changes needed?

To reduce the test resource requirements. 1000 threads is too large for some CI systems with limited resources.
- https://github.com/apache/spark/actions/workflows/build_maven_java21_macos14.yml
  - https://github.com/apache/spark/actions/runs/8054862135/job/22000403549
```
Warning: [766.327s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 4096k, guardsize: 16k, detached.
Warning: [766.327s][warning][os,thread] Failed to start the native thread for java.lang.Thread "dispatcher-event-loop-840"
*** RUN ABORTED ***
An exception or error caused a run to abort: unable to create native thread: possibly out of memory or process/resource limits reached
  java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
```

### Does this PR introduce _any_ user-facing change?

No, this is a test-case update.

### How was this patch tested?

Pass the CIs and monitor Daily Apple Silicon test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45264 from dongjoon-hyun/SPARK-45527.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
HyukjinKwon added a commit that referenced this pull request Feb 27, 2024
… in fraction resource calculation

### What changes were proposed in this pull request?

This PR is a follow-up of #43494; it reduces the test cases by shuffling `taskNum` and taking 5.

### Why are the changes needed?

It runs too many test cases (https://github.com/apache/spark/actions/runs/8054862135/job/22000403549), which consumes the limited CI resources:

```
- SPARK-45527 default rp with task.gpu.amount=1.0 can restrict 1 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.5 can restrict 2 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.3333333333333333 can restrict 3 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.25 can restrict 4 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.2 can restrict 5 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.1666666666666666 can restrict 6 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.1428571428571428 can restrict 7 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.125 can restrict 8 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.1111111111111111 can restrict 9 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.1 can restrict 10 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0909090909090909 can restrict 11 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0833333333333333 can restrict 12 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0769230769230769 can restrict 13 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0714285714285714 can restrict 14 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0666666666666666 can restrict 15 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0625 can restrict 16 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0588235294117647 can restrict 17 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0555555555555555 can restrict 18 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0526315789473684 can restrict 19 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.05 can restrict 20 barrier tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=1.0 can restrict 1  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.5 can restrict 2  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.3333333333333333 can restrict 3  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.25 can restrict 4  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.2 can restrict 5  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.1666666666666666 can restrict 6  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.1428571428571428 can restrict 7  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.125 can restrict 8  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.1111111111111111 can restrict 9  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.1 can restrict 10  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0909090909090909 can restrict 11  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0833333333333333 can restrict 12  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0769230769230769 can restrict 13  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0714285714285714 can restrict 14  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0666666666666666 can restrict 15  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0625 can restrict 16  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0588235294117647 can restrict 17  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0555555555555555 can restrict 18  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.0526315789473684 can restrict 19  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=0.05 can restrict 20  tasks run in the same executor
- SPARK-45527 default rp with task.gpu.amount=1.0 can restrict 1 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.5 can restrict 2 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.3333333333333333 can restrict 3 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.25 can restrict 4 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.2 can restrict 5 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.1666666666666666 can restrict 6 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.1428571428571428 can restrict 7 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.125 can restrict 8 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.1111111111111111 can restrict 9 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.1 can restrict 10 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.0909090909090909 can restrict 11 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.0833333333333333 can restrict 12 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.0769230769230769 can restrict 13 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.0714285714285714 can restrict 14 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.0666666666666666 can restrict 15 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.0625 can restrict 16 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.0588235294117647 can restrict 17 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.0555555555555555 can restrict 18 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.0526315789473684 can restrict 19 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.05 can restrict 20 barrier tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=1.0 can restrict 1  tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.5 can restrict 2  tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.3333333333333333 can restrict 3  tasks run on the different executor
- SPARK-45527 default rp with task.gpu.amount=0.25 can restrict 4  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0666666666666666 can restrict 15  tasks run in the same executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0625 can restrict 16  tasks run in the same executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0588235294117647 can restrict 17  tasks run in the same executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0555555555555555 can restrict 18  tasks run in the same executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0526315789473684 can restrict 19  tasks run in the same executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.05 can restrict 20  tasks run in the same executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=1.0 can restrict 1 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.5 can restrict 2 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.3333333333333333 can restrict 3 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.25 can restrict 4 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.2 can restrict 5 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.1666666666666666 can restrict 6 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.1428571428571428 can restrict 7 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.125 can restrict 8 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.1111111111111111 can restrict 9 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.1 can restrict 10 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0909090909090909 can restrict 11 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0833333333333333 can restrict 12 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0769230769230769 can restrict 13 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0714285714285714 can restrict 14 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0666666666666666 can restrict 15 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0625 can restrict 16 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0588235294117647 can restrict 17 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0555555555555555 can restrict 18 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0526315789473684 can restrict 19 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.05 can restrict 20 barrier tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=1.0 can restrict 1  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.5 can restrict 2  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.3333333333333333 can restrict 3  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.25 can restrict 4  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.2 can restrict 5  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.1666666666666666 can restrict 6  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.1428571428571428 can restrict 7  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.125 can restrict 8  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.1111111111111111 can restrict 9  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.1 can restrict 10  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0909090909090909 can restrict 11  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0833333333333333 can restrict 12  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0769230769230769 can restrict 13  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0714285714285714 can restrict 14  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0666666666666666 can restrict 15  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0625 can restrict 16  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0588235294117647 can restrict 17  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0555555555555555 can restrict 18  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.0526315789473684 can restrict 19  tasks run on the different executor
- SPARK-45527 TaskResourceProfile with task.gpu.amount=0.05 can restrict 20  tasks run on the different executor
Warning: [766.327s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attribute
```
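
The pattern in the truncated list above follows from the fraction-based accounting this PR builds on: with `task.gpu.amount = 1/N`, at most N tasks can share one GPU address. Below is a minimal sketch of that bound, assuming amounts are tracked as scaled integers to avoid floating-point drift (the factor 10000 is illustrative, not necessarily Spark's internal constant):

```scala
// Sketch of the bound behind the test names: with task.gpu.amount = 1/N,
// one whole GPU address admits at most N concurrent tasks.
def maxConcurrentTasksPerGpu(taskGpuAmount: Double): Int = {
  require(taskGpuAmount > 0.0 && taskGpuAmount <= 1.0)
  val scale = 10000L                          // illustrative precision factor
  val perTask = (taskGpuAmount * scale).toLong
  (scale / perTask).toInt                     // whole address / per-task share
}

assert(maxConcurrentTasksPerGpu(1.0) == 1)
assert(maxConcurrentTasksPerGpu(0.05) == 20)
assert(maxConcurrentTasksPerGpu(0.0526315789473684) == 19)
```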

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Tested manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45268 from HyukjinKwon/SPARK-45527-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
TakawaAkirayo pushed a commit to TakawaAkirayo/spark that referenced this pull request Mar 4, 2024
…m 1k to 100 in `TaskSchedulerImplSuite`

TakawaAkirayo pushed a commit to TakawaAkirayo/spark that referenced this pull request Mar 4, 2024
… in fraction resource calculation

ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024
…m 1k to 100 in `TaskSchedulerImplSuite`

ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024
… in fraction resource calculation
