[jobs] limit the max number of jobs that can run/launch

cg505 · cg505 · commit afe6dc6da831 · 2025-10-05T21:56:52.000-07:00
Previously, after skypilot-org#7051, the number of jobs that could launch or run scaled purely with the controller's resources. This change caps both numbers: at most 512 jobs can be launching at once, and at most 2000 jobs can be running in total, because some bottlenecks (e.g. the DB) do not necessarily scale with the instance resources.
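To illustrate how the new caps interact with the memory-based scaling, here is a minimal standalone sketch. It is not the code in `sky/jobs/scheduler.py`: the constants are copied from this diff, but the controller count uses the simplified `floor((memory - 2GiB) / 3.59GiB)` formula from the docs, and the helper names (`number_of_controllers`, `launch_limit`, `running_limit`) and the two `*_GIB` constants are hypothetical.

```python
import math

# Constants as they appear in this change.
LAUNCHES_PER_WORKER = 8
MAX_JOBS_PER_WORKER = 200
MAX_CONTROLLERS = 512 // LAUNCHES_PER_WORKER  # 64
MAX_TOTAL_JOBS = 2000

# Assumption: simplified memory model taken from the docs formula,
# not from the actual scheduler implementation.
RESERVED_GIB = 2.0
GIB_PER_CONTROLLER = 3.59


def number_of_controllers(memory_gib: float) -> int:
    """Controller count: memory-based, at least 1, capped at MAX_CONTROLLERS."""
    uncapped = math.floor((memory_gib - RESERVED_GIB) / GIB_PER_CONTROLLER)
    return min(MAX_CONTROLLERS, max(1, uncapped))


def launch_limit(memory_gib: float) -> int:
    """Jobs that can be launching at once across all controllers."""
    return LAUNCHES_PER_WORKER * number_of_controllers(memory_gib)


def running_limit(memory_gib: float) -> int:
    """Jobs that can be running at once across all controllers."""
    n = number_of_controllers(memory_gib)
    # Mirrors the per-controller check added to monitor_loop().
    per_controller = min(MAX_JOBS_PER_WORKER, MAX_TOTAL_JOBS // n)
    return n * per_controller


if __name__ == '__main__':
    for memory_gib in (16, 768):
        print(memory_gib, launch_limit(memory_gib), running_limit(memory_gib))
```

Under these assumptions, the default 16 GiB controller is unaffected (24 launching, 600 running), while a 768 GiB controller is capped at 64 controllers and 512 launches. Because the per-controller limit uses integer division (`2000 // 64 == 31`), the running total in this model comes to 1984, slightly under the 2000 cap.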
diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
@@ -958,12 +958,12 @@ Best practices for scaling up the jobs controller
 
 The number of active jobs that the controller supports is based on the controller size. There are two limits that apply:
 
-- **Actively launching job count**: maxes out at ``8 * floor((memory - 2GiB) / 3.59GiB)``.
+- **Actively launching job count**: limit is ``8 * floor((memory - 2GiB) / 3.59GiB)``, with a maximum of 512 jobs.
   A job counts towards this limit when it is first starting, launching instances, or recovering.
 
   - The default controller size has 16 GiB memory, meaning **24 jobs** can be actively launching at once.
 
-- **Running job count**: maxes out at ``200 * floor((memory - 2GiB) / 3.59GiB)``.
+- **Running job count**: limit is ``200 * floor((memory - 2GiB) / 3.59GiB)``, with a maximum of 2000 jobs.
 
   - The default controller size supports up to **600 jobs** running in parallel.
 
@@ -1015,25 +1015,4 @@ For absolute maximum parallelism, the following per-cloud configurations are rec
 .. note::
   Remember to tear down your controller to apply these changes, as described above.
 
-With this configuration, you'll get the following performance:
-
-.. list-table::
-   :widths: 1 2 2 2
-   :header-rows: 1
-
-   * - Cloud
-     - Instance type
-     - Launches at once
-     - Running jobs
-   * - AWS
-     - m7i.48xlarge (~768GiB RAM)
-     - **~1,704**
-     - **~42,600**
-   * - GCP
-     - n2-standard-128 (~512GiB RAM)
-     - **~1,136**
-     - **~28,400**
-   * - Azure
-     - Standard_D96s_v5 (~384GiB RAM)
-     - **~848**
-     - **~21,200**
+With this configuration, you can launch up to 512 jobs at once. Once the jobs are launched, up to 2000 jobs can be running in parallel.
diff --git a/sky/jobs/controller.py b/sky/jobs/controller.py
@@ -1144,7 +1144,15 @@ async def monitor_loop(self):
                 await asyncio.sleep(30)
                 continue
 
-            if len(running_tasks) >= scheduler.JOBS_PER_WORKER:
+            # Normally, 200 jobs can run on each controller. But if we have a
+            # ton of controllers, we need to limit the number of jobs that can
+            # run on each controller, to achieve a total of 2000 jobs across all
+            # controllers.
+            max_jobs = min(scheduler.MAX_JOBS_PER_WORKER,
+                           (scheduler.MAX_TOTAL_JOBS //
+                            scheduler.get_number_of_controllers()))
+
+            if len(running_tasks) >= max_jobs:
                 await asyncio.sleep(60)
                 continue
 
diff --git a/sky/jobs/scheduler.py b/sky/jobs/scheduler.py
@@ -91,19 +91,25 @@
 LAUNCHES_PER_WORKER = 8
 # this can probably be increased to around 300-400 but keeping it lower just
 # to be safe
-JOBS_PER_WORKER = 200
-
-# keep 1GB reserved after the controllers
-MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB = 2048
-
-CURRENT_HASH = os.path.expanduser('~/.sky/wheels/current_sky_wheel_hash')
-
+MAX_JOBS_PER_WORKER = 200
+# Maximum number of controllers that can be running. Hard to handle more than
+# 512 launches at once.
+MAX_CONTROLLERS = 512 // LAUNCHES_PER_WORKER
+# Limit the number of jobs that can be running at once on the entire jobs
+# controller cluster. It's hard to handle cancellation of more than 2000 jobs at
+# once.
+MAX_TOTAL_JOBS = 2000
 # Maximum values for above constants. There will start to be lagging issues
 # at these numbers already.
 # JOB_MEMORY_MB = 200
 # LAUNCHES_PER_WORKER = 16
 # JOBS_PER_WORKER = 400
 
+# keep 2GB reserved after the controllers
+MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB = 2048
+
+CURRENT_HASH = os.path.expanduser('~/.sky/wheels/current_sky_wheel_hash')
+
 
 def get_number_of_controllers() -> int:
     """Returns the number of controllers that should be running.
@@ -136,13 +142,16 @@ def get_number_of_controllers() -> int:
                     config.short_worker_config.burstable_parallelism) * \
             server_config.SHORT_WORKER_MEM_GB * 1024
 
-        return max(1, int((total_memory_mb - used) // JOB_MEMORY_MB))
+        return min(MAX_CONTROLLERS,
+                   max(1, int((total_memory_mb - used) // JOB_MEMORY_MB)))
     else:
-        return max(
-            1,
-            int((total_memory_mb - MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB) /
-                ((LAUNCHES_PER_WORKER * server_config.LONG_WORKER_MEM_GB) * 1024
-                 + JOB_MEMORY_MB)))
+        return min(
+            MAX_CONTROLLERS,
+            max(
+                1,
+                int((total_memory_mb - MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB) /
+                    ((LAUNCHES_PER_WORKER * server_config.LONG_WORKER_MEM_GB) *
+                     1024 + JOB_MEMORY_MB))))
 
 
 def start_controller() -> None: