chore: make pod spec choice consistent #9531
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## carolinac/rm-255 #9531 +/- ##
====================================================
+ Coverage 49.30% 51.36% +2.06%
====================================================
Files 1242 750 -492
Lines 161471 112195 -49276
Branches 2868 2867 -1
====================================================
- Hits 79606 57631 -21975
+ Misses 81693 54392 -27301
Partials 172 172
- WebUI: Allow resource pool slot counts to reflect the state of the entire cluster. Allow slot
counts and scheduling to respect node selectors and affinities. This impacts Determined clusters
deployed on Kubernetes with multiple resource pools defined in terms of node selectors and/or
affinities.
:orphan:
New Features
- Kubernetes Configuration: Allow cluster administrators to define Determined resource pools on Kubernetes clusters using node selectors and/or affinities. This feature allows a single cluster to be divided into multiple resource pools using node labels. To configure resource pools, modify the default pod spec settings under task_container_defaults.cpu_pod_spec or task_container_defaults.gpu_pod_spec.
- WebUI: Reflect the resource pool slot counts for the entire Kubernetes cluster. This ensures that slot counts and scheduling decisions respect the defined node selectors and affinities. This impacts Determined clusters deployed on Kubernetes with multiple resource pools defined using node selectors and/or affinities.
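To make the slot-count behavior in the note above concrete, here is a minimal sketch of how slots could be attributed to a pool defined by a node selector. The `Node` type, `matchesSelector`, and `poolSlots` are hypothetical names for illustration and are not taken from the Determined codebase. The matching rule follows Kubernetes `nodeSelector` semantics (every key/value pair must be present on the node); affinities are not modeled here.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a Kubernetes node.
type Node struct {
	Name   string
	Labels map[string]string
	Slots  int
}

// matchesSelector reports whether every key/value pair in selector is
// present in labels. An empty or nil selector matches every node.
func matchesSelector(labels, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// poolSlots sums the slots of all nodes that match the pool's selector.
func poolSlots(nodes []Node, selector map[string]string) int {
	total := 0
	for _, n := range nodes {
		if matchesSelector(n.Labels, selector) {
			total += n.Slots
		}
	}
	return total
}

func main() {
	nodes := []Node{
		{Name: "node-1", Labels: map[string]string{"pool": "a"}, Slots: 4},
		{Name: "node-2", Labels: map[string]string{"pool": "b"}, Slots: 8},
		{Name: "node-3", Labels: map[string]string{"pool": "a"}, Slots: 2},
	}
	fmt.Println("pool a slots:", poolSlots(nodes, map[string]string{"pool": "a"}))
	fmt.Println("whole cluster slots:", poolSlots(nodes, nil))
}
```

With this rule, two pools whose selectors pick disjoint labels split the cluster's slots between them, while a pool with no selector sees every node.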
@carolinaecalderon what's up with this pr? just found it randomly.
Ticket
Description
We choose what pod spec to apply to a resource pool/experiment in two places, and two different ways. Let's try to harmonize this choice such that we will (more) consistently choose the same pod spec to define a pool and for experiments requested to run on it.
For example, if you're running a CPU cluster, you would intuitively want to define a cpuPodSpec for pods that experiments run on, and pods included in resource pools. Right now, GetAgents chooses which pod spec to use based on which one is defined AND device type (CPU vs CUDA). However, MergeExpConf chooses which pod spec to use based on how many slots an experiment wants (<=1 slot defaults to GPU pod spec). This makes it impossible to run experiments requiring more than 1 slot on a CPU cluster with a CPU podSpec defined.
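The mismatch described above can be sketched in a few lines. This is a simplified illustration, not the actual `GetAgents` or `MergeExpConf` code: the types and function names are hypothetical, and the slot threshold (more than one slot selects the GPU pod spec) is an assumption chosen to reproduce the consequence described in the paragraph; the real rule may differ. `chooseConsistent` shows one possible way to harmonize the two choices, not necessarily what this PR implements.

```go
package main

import "fmt"

// PodSpecKind names which default pod spec gets applied (hypothetical type).
type PodSpecKind string

const (
	CPUPodSpec PodSpecKind = "cpu_pod_spec"
	GPUPodSpec PodSpecKind = "gpu_pod_spec"
	NoPodSpec  PodSpecKind = "none"
)

// Defaults records which default pod specs the admin has defined.
type Defaults struct {
	HasCPUPodSpec bool
	HasGPUPodSpec bool
}

// chooseByDevice models the pool-side rule: pick by device type,
// but only if the corresponding pod spec is defined.
func chooseByDevice(d Defaults, deviceIsCPU bool) PodSpecKind {
	if deviceIsCPU && d.HasCPUPodSpec {
		return CPUPodSpec
	}
	if !deviceIsCPU && d.HasGPUPodSpec {
		return GPUPodSpec
	}
	return NoPodSpec
}

// chooseBySlots models the experiment-side rule: pick by requested slots.
// ASSUMPTION: more than one slot selects the GPU pod spec.
func chooseBySlots(slots int) PodSpecKind {
	if slots > 1 {
		return GPUPodSpec
	}
	return CPUPodSpec
}

// chooseConsistent is one possible harmonized rule: both the pool and the
// experiment defer to the same device-based choice.
func chooseConsistent(d Defaults, deviceIsCPU bool, slots int) PodSpecKind {
	_ = slots // slot count no longer influences the choice
	return chooseByDevice(d, deviceIsCPU)
}

func main() {
	cpuCluster := Defaults{HasCPUPodSpec: true}
	fmt.Println("pool uses:      ", chooseByDevice(cpuCluster, true))
	fmt.Println("experiment uses:", chooseBySlots(4))
	fmt.Println("harmonized:     ", chooseConsistent(cpuCluster, true, 4))
}
```

On a CPU cluster with only a CPU pod spec defined, the first two calls disagree for a 4-slot experiment, which is the inconsistency this PR targets.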
Right now I'm merging this into my node selectors feature branch, but I can merge this into main after some rebasing. I will wait to merge #9428 until this PR is merged.
Test Plan
Pass CircleCI + I'll write a test (in progress)
Checklist
docs/release-notes/. See Release Note for details.