diff --git a/keps/prod-readiness/sig-scheduling/4671.yaml b/keps/prod-readiness/sig-scheduling/4671.yaml
index 17a4b734bff8..3257880a90d5 100644
--- a/keps/prod-readiness/sig-scheduling/4671.yaml
+++ b/keps/prod-readiness/sig-scheduling/4671.yaml
@@ -1,3 +1,5 @@
 kep-number: 4671
 alpha:
   approver: "@soltysh"
+beta:
+  approver: "@soltysh"
diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md
index 5119a6b45c8f..a969e32cf3ae 100644
--- a/keps/sig-scheduling/4671-gang-scheduling/README.md
+++ b/keps/sig-scheduling/4671-gang-scheduling/README.md
@@ -766,30 +766,52 @@ This section must be completed when targeting beta to a release.

###### How can a rollout or rollback fail? Can it impact already running workloads?

-
+The worst-case scenario is a critical bug in the new Workload Scheduling Cycle code causing a scheduler crash loop.
+This would stop all scheduling, but it would not impact already running workloads, and rolling back is a sufficient
+mitigation.

###### What specific metrics should inform a rollback?

-
+- `scheduler_schedule_attempts_total{result="error"}`: A sudden spike indicates internal errors or panics within
+  the scheduling loop, possibly caused by the new logic.
+- `process_start_time_seconds`: Frequent resets of this metric indicate that the scheduler process is crashing and
+  restarting (a crash loop).
+- `scheduler_pod_scheduling_duration_seconds`: A significant regression in P99 latency for standard (non-gang) pods
+  would indicate that the overhead of the new logic is unacceptable.
+- `scheduler_pod_group_scheduling_attempts_total` (new metric; exact name still to be confirmed): A consistently
+  high failure rate for valid gangs compared to successful attempts.
+- `scheduler_preemption_attempts_total`, `scheduler_preemption_victims`: A sudden increase might indicate that the
+  new "delayed preemption" logic is malfunctioning (e.g., triggering unnecessary preemptions).
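+
+These signals can be watched with queries along the following lines. This is illustrative PromQL only; it assumes
+Prometheus scrapes the kube-scheduler under a `kube-scheduler` job label, which may differ per installation:
+
+```promql
+# Rate of error-result scheduling attempts; a sudden spike suggests failures in the new logic.
+sum(rate(scheduler_schedule_attempts_total{result="error"}[5m]))
+
+# Number of scheduler restarts in the last 30 minutes; repeated non-zero values indicate a crash loop.
+changes(process_start_time_seconds{job="kube-scheduler"}[30m])
+
+# P99 pod scheduling latency, to compare against the pre-rollout baseline.
+histogram_quantile(0.99,
+  sum(rate(scheduler_pod_scheduling_duration_seconds_bucket[5m])) by (le))
+```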
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

+We'll perform manual testing of the upgrade -> downgrade -> upgrade path using the following sequence:
+
+1. Start a local Kubernetes v1.35 cluster with the GenericWorkload and GangScheduling feature gates disabled (the
+   default behavior).
+2. Attempt to create a Pod with `spec.workloadRef` set.
+3. The `spec.workloadRef` field is dropped by the API server. The Pod is created successfully but without the
+   workload reference, resulting in immediate standard scheduling (one by one).
+4. Restart/upgrade the API server and scheduler to v1.36 (with the feature gates enabled).
+5. Create two Workload objects: `gang-test-A` and `gang-test-B` (both with `minCount=2`).
+6. Create a Pod `test-pod-1` with `spec.workloadRef` pointing to `gang-test-A`.
+7. The Pod stays in the `Pending` state (waiting for the gang). Verify that the
+   `scheduler_pod_group_scheduling_attempts_total` metric is incremented.
+8. Create a Pod `test-pod-2` pointing to the same Workload.
+9. Both Pods are scheduled successfully in the same cycle (gang scheduling works).
+10. Downgrade the API server and scheduler to v1.35 (with the feature gates disabled).
+11. Create `test-pod-3` pointing to `gang-test-B`. Note: we reuse a Workload created in step 5 because creating new
+    Workload objects is disabled after the downgrade.
+12. The Pod is scheduled immediately (the Workload logic is ignored because the `workloadRef` field is dropped by
+    the v1.35 API server). If gang scheduling were active, this Pod would remain `Pending`, waiting for a second
+    member.
+13. Upgrade the API server and scheduler back to v1.36 (feature gates enabled).
+14. Create `test-pod-4` and `test-pod-5` pointing to `gang-test-B`, and verify that gang scheduling functionality
+    is restored (these Pods wait for `minCount=2` before being scheduled).

-

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
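+
+For reference, steps 5 and 6 of the upgrade test above assume Workload and Pod manifests along the following lines.
+This is only a sketch: the `apiVersion` and the exact field layout are assumptions made for illustration, while
+`minCount` and `workloadRef` are the fields described by this KEP.
+
+```yaml
+# Hypothetical Workload manifest for step 5 (schema details assumed).
+apiVersion: scheduling.k8s.io/v1alpha1   # assumed group/version
+kind: Workload
+metadata:
+  name: gang-test-A
+spec:
+  minCount: 2   # the gang is scheduled only once 2 member pods exist
+---
+# Pod for step 6, referencing the Workload via the new spec.workloadRef field.
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pod-1
+spec:
+  workloadRef:
+    name: gang-test-A   # assumed shape of the reference
+  containers:
+  - name: main
+    image: registry.k8s.io/pause:3.9
+```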
@@ -806,11 +828,11 @@ previous answers based on experience in the field.

###### How can an operator determine if the feature is in use by workloads?

-
+Operators can check the new `scheduler_pod_group_scheduling_attempts_total` metric. A value greater than zero
+indicates that the scheduler is processing Workload Scheduling Cycles.
+
+Alternatively, checking for the existence of Workload objects via `kubectl get workloads`, or inspecting the
+`pod.spec.workloadRef` field, confirms that users are actively using the feature.

###### How can someone using this feature know that it is working for their instance?

@@ -823,50 +845,38 @@ and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->

-- [ ] Events
-  - Event Reason:
-- [ ] API .status
-  - Condition name:
-  - Other field:
-- [ ] Other (treat as last resort)
-  - Details:
+- [x] API .spec
+  - Other field: `workloadRef` is set on the Pods.
+- [x] Events
+  - Event Type: Warning
+  - Event Reason: FailedScheduling
+  - Event Message: The message includes details when scheduling fails due to gang constraints (e.g., "pod group
+    minCount requirement not met").

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

-
+Since there are no formal SLOs for the kube-scheduler apart from the scalability SLOs, we define the objectives for
+this feature primarily in terms of non-regression, ensuring that workload scheduling does not degrade the
+performance of the standard scheduling loop.

-###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

+- Scheduling Throughput: There should be no significant regression in the system-wide scheduling throughput (pods/s)
+  when scheduling pods attached to a Workload compared to scheduling an equivalent number of individual pods.
-
+- Scheduling Latency: There should be no significant regression in pod scheduling latency
+  (`scheduler_pod_scheduling_duration_seconds`) for both workload and non-workload pods compared to the baseline.

-- [ ] Metrics
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+- [x] Metrics
   - Metric name:
-  - [Optional] Aggregation method:
-  - Components exposing the metric:
-- [ ] Other (treat as last resort)
-  - Details:
+    - `scheduler_pod_group_scheduling_attempts_total`
+    - `scheduler_pod_group_scheduling_duration_seconds`
+    - `scheduler_pod_group_scheduling_algorithm_duration_seconds`
+  - Components exposing the metric: kube-scheduler

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-
+No.

### Dependencies

@@ -945,23 +955,49 @@ details). For now, we leave it here.

###### How does this feature react if the API server and/or etcd is unavailable?

+The behavior is consistent with the status quo. Since the scheduler cannot bind pods or update statuses without the
+API server, any in-flight workload scheduling will eventually fail at the binding/update stage. These attempts will
+be retried with the standard exponential backoff once connectivity is restored.
+
###### What are other known failure modes?

-
+- Pods Pending Indefinitely - Waiting for Gang Assembly (PreEnqueue)
+  - Detection:
+    - Check the Pod events/status. Expected reason: a message indicating that the pod is waiting for more gang
+      members.
+    - The number of pending pods belonging to the group is less than `minCount`.
+  - Mitigations:
+    - Ensure the controller created all required pods.
+    - If acceptable, delete the Workload object to disable gang scheduling (falling back to best-effort scheduling).
+  - Diagnostics:
+    - Scheduler logs at V=4, searching for "workload" to trace the decision flow.
+    - Verify that `minCount` in the Workload matches the number of pods created by the Job/controller.
+  - Testing:
+    - Covered by integration tests submitting partial gangs.
+- Pods Pending Indefinitely - Gang Cannot Fit (Resource Constraints)
+  - Detection:
+    - Check the Pod events/status. Expected reason: a message indicating that `minCount` pods could not be
+      scheduled.
+    - Metrics: `scheduler_pod_group_scheduling_attempts_total` with the result `unschedulable`.
+  - Mitigations:
+    - Scale up the cluster (add nodes) or delete other workloads to free up space.
+    - If acceptable, delete the Workload object to disable gang scheduling (falling back to best-effort scheduling).
+  - Diagnostics:
+    - Scheduler logs at V=4, searching for "workload" to see the detailed reasons why the placement failed.
+  - Testing:
+    - Covered by integration tests submitting gangs larger than the cluster capacity.

###### What steps should be taken if SLOs are not being met to determine the problem?

+1. Analyze Latency Metrics: Check `scheduler_pod_group_scheduling_duration_seconds` and
+   `scheduler_pod_group_scheduling_algorithm_duration_seconds`. High values here indicate that the Workload
+   Scheduling Cycle logic itself is computationally expensive and causing the regression.
+2. Inspect Logs: Enable scheduler logging at V=4 to trace the execution time of individual Workload Scheduling
+   Cycles and identify whether specific large gangs are blocking the queue.
+3. Disable the Feature: If the regression is critical and impacting cluster health, disable the GangScheduling
+   feature gate. This reverts the scheduler to the standard pod-by-pod logic, restoring baseline performance (at
+   the cost of losing gang semantics).
+
## Implementation History