Commit afe6dc6

[jobs] limit the max number of jobs that can run/launch
Previously, after skypilot-org#7051, we allowed the number of jobs launching/running to scale purely based on the controller resources. After this change, we set maximum values for the number that can run and launch, since there are some bottlenecks (e.g. DB) that will not necessarily scale with the instance resources.
1 parent d1c61b3 commit afe6dc6

3 files changed: +34 -38 lines changed

docs/source/examples/managed-jobs.rst

Lines changed: 3 additions & 24 deletions
@@ -958,12 +958,12 @@ Best practices for scaling up the jobs controller

 The number of active jobs that the controller supports is based on the controller size. There are two limits that apply:

-- **Actively launching job count**: maxes out at ``8 * floor((memory - 2GiB) / 3.59GiB)``.
+- **Actively launching job count**: limit is ``8 * floor((memory - 2GiB) / 3.59GiB)``, with a maximum of 512 jobs.
   A job counts towards this limit when it is first starting, launching instances, or recovering.

   - The default controller size has 16 GiB memory, meaning **24 jobs** can be actively launching at once.

-- **Running job count**: maxes out at ``200 * floor((memory - 2GiB) / 3.59GiB)``.
+- **Running job count**: limit is ``200 * floor((memory - 2GiB) / 3.59GiB)``, with a maximum of 2000 jobs.

   - The default controller size supports up to **600 jobs** running in parallel.

@@ -1015,25 +1015,4 @@ For absolute maximum parallelism, the following per-cloud configurations are rec
 .. note::
    Remember to tear down your controller to apply these changes, as described above.

-With this configuration, you'll get the following performance:
-
-.. list-table::
-   :widths: 1 2 2 2
-   :header-rows: 1
-
-   * - Cloud
-     - Instance type
-     - Launches at once
-     - Running jobs
-   * - AWS
-     - m7i.48xlarge (~768GiB RAM)
-     - **~1,704**
-     - **~42,600**
-   * - GCP
-     - n2-standard-128 (~512GiB RAM)
-     - **~1,136**
-     - **~28,400**
-   * - Azure
-     - Standard_D96s_v5 (~384GiB RAM)
-     - **~848**
-     - **~21,200**
+With this configuration, you can launch up to 512 jobs at once. Once the jobs are launched, up to 2000 jobs can be running in parallel.
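
The documented limits can be sanity-checked with a few lines of Python. This is a minimal sketch of the formulas quoted in the docs diff, not the scheduler's actual code: the function name `documented_limits` is hypothetical, the 2 GiB and 3.59 GiB figures are taken from the docs text rather than from the real constants, and the 2000-job cap is applied as a simple `min` for illustration.

```python
import math
from typing import Tuple

# Figures quoted in the docs diff (assumed here, not read from the code base).
RESERVED_GIB = 2.0          # memory held back before sizing controllers
PER_CONTROLLER_GIB = 3.59   # memory budget per controller process
LAUNCHES_PER_WORKER = 8
MAX_JOBS_PER_WORKER = 200
MAX_LAUNCHES = 512          # documented cap on concurrent launches
MAX_TOTAL_JOBS = 2000       # documented cap on running jobs


def documented_limits(memory_gib: float) -> Tuple[int, int]:
    """Return (launch_limit, running_limit) per the documented formulas."""
    # At least one controller process is always assumed to run.
    workers = max(
        1, math.floor((memory_gib - RESERVED_GIB) / PER_CONTROLLER_GIB))
    launch_limit = min(LAUNCHES_PER_WORKER * workers, MAX_LAUNCHES)
    running_limit = min(MAX_JOBS_PER_WORKER * workers, MAX_TOTAL_JOBS)
    return launch_limit, running_limit


for memory_gib in (16, 64, 384, 768):
    launches, running = documented_limits(memory_gib)
    print(f"{memory_gib:>4} GiB -> {launches} launching, {running} running")
```

For the default 16 GiB controller this gives 24 launching and 600 running jobs, matching the docs. For the 384 GiB and 768 GiB instances from the removed table, both values sit at the new caps of 512 and 2000.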

sky/jobs/controller.py

Lines changed: 9 additions & 1 deletion
@@ -1144,7 +1144,15 @@ async def monitor_loop(self):
                 await asyncio.sleep(30)
                 continue

-            if len(running_tasks) >= scheduler.JOBS_PER_WORKER:
+            # Normally, 200 jobs can run on each controller. But if we have a
+            # ton of controllers, we need to limit the number of jobs that can
+            # run on each controller, to achieve a total of 2000 jobs across all
+            # controllers.
+            max_jobs = min(scheduler.MAX_JOBS_PER_WORKER,
+                           (scheduler.MAX_TOTAL_JOBS //
+                            scheduler.get_number_of_controllers()))
+
+            if len(running_tasks) >= max_jobs:
                 await asyncio.sleep(60)
                 continue
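
To see how the per-controller cap in this hunk behaves, the `min(...)` expression can be evaluated for a few controller counts. This is an illustrative sketch only: `per_controller_cap` is a hypothetical helper, the two constants are copied from this commit's `sky/jobs/scheduler.py`, and the controller count is assumed to be at least 1 (which `get_number_of_controllers()` guarantees via `max(1, ...)`).

```python
MAX_JOBS_PER_WORKER = 200  # value from sky/jobs/scheduler.py in this commit
MAX_TOTAL_JOBS = 2000      # value from sky/jobs/scheduler.py in this commit


def per_controller_cap(num_controllers: int) -> int:
    """Mirror the min(...) expression added to monitor_loop."""
    return min(MAX_JOBS_PER_WORKER, MAX_TOTAL_JOBS // num_controllers)


for n in (1, 3, 10, 11, 64):
    cap = per_controller_cap(n)
    print(f"{n:>2} controllers: {cap} jobs each, {cap * n} total")
```

Because the cap uses integer division, the cluster-wide total can land slightly under 2000 once there are more than 10 controllers: 11 controllers get 181 jobs each (1991 total), and 64 controllers get 31 each (1984 total).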

sky/jobs/scheduler.py

Lines changed: 22 additions & 13 deletions
@@ -91,19 +91,25 @@
 LAUNCHES_PER_WORKER = 8
 # this can probably be increased to around 300-400 but keeping it lower to just
 # to be safe
-JOBS_PER_WORKER = 200
-
-# keep 1GB reserved after the controllers
-MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB = 2048
-
-CURRENT_HASH = os.path.expanduser('~/.sky/wheels/current_sky_wheel_hash')
-
+MAX_JOBS_PER_WORKER = 200
+# Maximum number of controllers that can be running. Hard to handle more than
+# 512 launches at once.
+MAX_CONTROLLERS = 512 // LAUNCHES_PER_WORKER
+# Limit the number of jobs that can be running at once on the entire jobs
+# controller cluster. It's hard to handle cancellation of more than 2000 jobs at
+# once.
+MAX_TOTAL_JOBS = 2000
 # Maximum values for above constants. There will start to be lagging issues
 # at these numbers already.
 # JOB_MEMORY_MB = 200
 # LAUNCHES_PER_WORKER = 16
 # JOBS_PER_WORKER = 400

+# keep 1GB reserved after the controllers
+MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB = 2048
+
+CURRENT_HASH = os.path.expanduser('~/.sky/wheels/current_sky_wheel_hash')
+

 def get_number_of_controllers() -> int:
     """Returns the number of controllers that should be running.
@@ -136,13 +142,16 @@ def get_number_of_controllers() -> int:
             config.short_worker_config.burstable_parallelism) * \
             server_config.SHORT_WORKER_MEM_GB * 1024

-        return max(1, int((total_memory_mb - used) // JOB_MEMORY_MB))
+        return min(MAX_CONTROLLERS,
+                   max(1, int((total_memory_mb - used) // JOB_MEMORY_MB)))
     else:
-        return max(
-            1,
-            int((total_memory_mb - MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB) /
-                ((LAUNCHES_PER_WORKER * server_config.LONG_WORKER_MEM_GB) * 1024
-                 + JOB_MEMORY_MB)))
+        return min(
+            MAX_CONTROLLERS,
+            max(
+                1,
+                int((total_memory_mb - MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB) /
+                    ((LAUNCHES_PER_WORKER * server_config.LONG_WORKER_MEM_GB) *
+                     1024 + JOB_MEMORY_MB))))


 def start_controller() -> None:
