Commit b57fb8a

Merge remote-tracking branch 'upstream/main'
2 parents d5530dc + 4c319af commit b57fb8a

File tree

6 files changed: +227 additions, −70 deletions


docs/proposals/0845-scheduler-architecture-proposal/README.md

Lines changed: 34 additions & 18 deletions
```diff
@@ -9,18 +9,27 @@ The Scheduling Subsystem is a framework used to implement scheduling algorithms.
 
 ## Design Principles
 - The scheduler framework should act as an independent library; there should be no dependency on EPP packages defined outside of the scheduler
-- The *framework* should be agnostic to web protocols (such as HTTP), endpoint types (such as model servers), and K8s concepts.
+- The *framework* should be agnostic to endpoint types (such as model servers) and K8s concepts.
 - Opinions should be held by the plugins, not the framework
 - The entry & exit points should be defined by the framework, acting as the API surface of the system
 - Multiple scheduling 'profiles' should be able to be run for a single request.
   - They can be conditionally dependent on previous runs, or run in parallel
-- Plugin state is managed by the plugin itself
+- State management
+  - State per request: managed by what we are calling CycleState, whose lifecycle is tied to the request.
+    CycleState is created internally by the Scheduler per request and its pointer is passed as an argument.
+  - State managed by the plugin struct itself: the lifecycle of this state is tied to the plugin, and since plugins are instantiated once,
+    it is state that plugins can use across requests (like a prefix-cache index).
+  - State managed by the data layer: each endpoint is associated with state (currently metrics) that a data layer plugin can add to it.
+    A data layer plugin could, for example, be one that scrapes v1/models from the endpoint.
 
 ## Definitions
 - **Scheduling Framework** - The system created to allow for a pluggable scheduling algorithm.
-- **Scheduling Profile** - A named, specific set of Filter(s), Scorer(s), & Picker used to select endpoints.
-- **Scheduler** - An extensible implementation of a scheduling algorithm. Including logic to select Scheduling Profiles, the Scheduling Profiles themselves, & logic to interpret the result.
-- **Scheduling Cycle** - A single run of a Scheduler through the Scheduling Framework.
+- **Scheduler Profile** - A named, specific set of Filter(s), Scorer(s), & Picker used to select endpoints.
+- **Scheduler Profile Run** - A one-time run of the Scheduler Profile's filters, scorers, and picker for a given request.
+- **Scheduler** - An extensible implementation of a scheduling algorithm, including logic to select Scheduler Profiles iteratively,
+  the Scheduler Profiles themselves, & logic to interpret the result.
+- **Scheduling Cycle** - A single run of a Scheduler through the Scheduling Framework. A scheduling cycle includes one or
+  more Scheduler Profile runs (at least one).
 - **Plugin** - Implementation of framework-defined interface(s) to add or extend logic across the framework.
 
 ## Proposal
```
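The three state scopes added above can be sketched in Go. This is an illustrative sketch only: `cycleState` and `prefixCachePlugin` are simplified stand-ins, not the proposal's actual CycleState or plugin types.

```go
package main

import "fmt"

// cycleState is a per-request scratch space; its lifetime is one scheduling
// cycle, so no locking is needed (simplified stand-in for CycleState).
type cycleState struct {
	storage map[string]any
}

// prefixCachePlugin keeps state in the plugin struct itself: the plugin is
// instantiated once, so this index survives across requests.
type prefixCachePlugin struct {
	index map[string]int // hypothetical prompt -> times seen
}

// Score returns 1.0 for prompts seen before (likely prefix-cached), else 0.0.
func (p *prefixCachePlugin) Score(state *cycleState, prompt string) float64 {
	p.index[prompt]++                              // cross-request plugin state
	state.storage["prefix-hits"] = p.index[prompt] // per-request cycle state
	if p.index[prompt] > 1 {
		return 1.0
	}
	return 0.0
}

func main() {
	plugin := &prefixCachePlugin{index: map[string]int{}}
	// Each request gets a fresh cycle state; the plugin's index persists.
	s1 := &cycleState{storage: map[string]any{}}
	fmt.Println(plugin.Score(s1, "hello")) // first time seen: 0
	s2 := &cycleState{storage: map[string]any{}}
	fmt.Println(plugin.Score(s2, "hello")) // repeated prompt: 1
}
```

The point of the split is that the framework can discard `cycleState` when the request completes, while the plugin's own index keeps accumulating.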
```diff
@@ -33,23 +42,32 @@ The Scheduling System can loosely be defined into 3 sections:
 - A *configuration API* to define the Scheduler, Profile(s), & the plugins used within those profiles
 
 A sketch of the System, with extension points, is here:
-<img src="./images/scheduler_subsystem.svg" alt="Scheduling Algorithm" width="1000" />
+<img src="./images/scheduler_cycle.png" alt="Scheduling Algorithm" width="1000" />
 
 Describing the interface extension points & flow is the simplest way to convey the intent of what the framework should enable:
 
-### PreSchedule
+### ProfileHandler
 
-PreSchedule is the entry point into the scheduling cycle (called by the framework). PreSchedule selects profiles conditionally based on:
+ProfileHandler is a scheduler plugin with two extension points: ProfilePick and ProcessProfilesResults.
+Below is a detailed explanation of these extension points.
+Only a single ProfileHandler plugin may be defined per scheduler.
+
+### ProfilePick
+
+ProfilePick is the entry point into the scheduling cycle (called by the framework).
+It selects profiles conditionally based on:
 
 - Request data
-- Results
+- Results of previously executed SchedulerProfiles
 - Cycle State
 
-PreSchedule will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call. Only a single PreSchedule function may be defined per scheduler.
+ProfilePick will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call.
+The ProfilePick extension point is configured as part of a ProfileHandler plugin.
+Since there is only a single ProfileHandler plugin, there is only a single ProfilePick function.
 
-### Profile Cycle
+### Scheduler Profile Run
 
-The profile cycle consists of 3 defined functions `Filter`, `Score`, & `Pick`
+The SchedulerProfile run consists of 3 defined phases: `Filter`, `Score`, & `Pick`
 
 *Profile Constraints*
 - A profile can have any number of `Filter` plugins registered (including zero)
@@ -61,17 +79,15 @@ The SchedulerProfile run consists of 3 defined phases `Filter`, `Score`, & `Pick`
 Filter runs before any scoring, and removes endpoints that are not fit for selection. The framework will return an error to the client if the endpoints are filtered to zero.
 
 #### Score
-Score applies a score to each remaining endpoint provided. Scorers SHOULD keep their score values in a normalized range: [0-1]. Any weighting should be added at the SchedulingProfile configuration level.
+Score applies a score to each remaining endpoint provided. Scorers SHOULD keep their score values in a normalized range: [0-1]. Any weighting should be added at the SchedulerProfile configuration level.
 
 #### Pick
 Picker selects the endpoint(s) from the provided list of scored endpoints. Picker MUST return one endpoint at minimum.
 
-### PostSchedule
-PostSchedule receives the output of the result(s) of the scheduling cycle(s) and makes sense of the data to be consumed by the calling system.
-
-### PostResponse
-PostResponse is a special case extension that can optionally be implemented by a plugin that needs to augment its state based on response or request data. This should only be implemented for plugins that need to update state outside of the scheduling cycle. PostResponse is run at the time of processing a response.
+### ProcessProfilesResults
+ProcessProfilesResults receives the result(s) of the scheduler profile run(s) and makes sense of the data to be consumed by the calling system.
+Since there is only a single ProfileHandler plugin, there is only a single ProcessProfilesResults function.
 
 ## ConfigurationAPI
 TODO
```
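The Filter → Score → Pick flow of a single profile run can be sketched as follows. Everything here is an illustrative assumption: the endpoint fields, the queue-length filter, and the thresholds are mine, not part of the proposal.

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified stand-ins for the proposal's types.
type endpoint struct {
	name     string
	queueLen int
}

type scoredEndpoint struct {
	endpoint
	score float64
}

// Filter phase: drop endpoints unfit for selection (here: overloaded ones).
// If this empties the list, the framework would return an error to the client.
func filterOverloaded(eps []endpoint, maxQueue int) []endpoint {
	out := []endpoint{}
	for _, e := range eps {
		if e.queueLen <= maxQueue {
			out = append(out, e)
		}
	}
	return out
}

// Score phase: normalized [0-1] score; a shorter queue scores higher.
func scoreByQueue(eps []endpoint, maxQueue int) []scoredEndpoint {
	scored := make([]scoredEndpoint, 0, len(eps))
	for _, e := range eps {
		scored = append(scored, scoredEndpoint{e, 1.0 - float64(e.queueLen)/float64(maxQueue)})
	}
	return scored
}

// Pick phase: MUST return at least one endpoint; here, the top scorer.
func pickBest(scored []scoredEndpoint) scoredEndpoint {
	sort.Slice(scored, func(i, j int) bool { return scored[i].score > scored[j].score })
	return scored[0]
}

// runProfile chains the three phases of one Scheduler Profile run.
func runProfile(eps []endpoint) scoredEndpoint {
	const maxQueue = 50
	return pickBest(scoreByQueue(filterOverloaded(eps, maxQueue), maxQueue))
}

func main() {
	eps := []endpoint{{"pod-a", 40}, {"pod-b", 5}, {"pod-c", 90}} // pod-c is filtered out
	fmt.Println(runProfile(eps).name)                             // pod-b has the shortest queue
}
```

Weighting, per the proposal, would live at the SchedulerProfile configuration level rather than inside the scorer itself.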

docs/proposals/0845-scheduler-architecture-proposal/interfaces/interface.go

Lines changed: 88 additions & 49 deletions
```diff
@@ -22,82 +22,121 @@ import (
 	scheduling "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types"
 )
 
-// READER NOTE: Currently CycleState is assumed to have appropriate request data rather than making a new object.
-
-// Plugin is the parent type for all the scheduling framework plugins.
-type Plugin interface {
-	Name() string
-}
-
 type Endpoint struct {
 	State EndpointState
-	Score float64
 }
 
 type EndpointState struct {
 	// storage is per Scheduling Cycle, and so has no thread-safe concerns.
 	storage map[string]any //nolint:unused
 }
 
-type SchedulingResult struct {
-	results map[string][]Endpoint //nolint:unused
+// Request is a structured representation of the fields we parse out of the request body.
+type Request struct {
+	// RequestId is the Envoy-generated ID for the request being processed.
+	RequestId string
+	// TargetModel is the final target model after traffic split.
+	TargetModel string
+	// Prompt is the prompt that was sent in the request body.
+	Prompt string
+	// Headers is a map of the request headers.
+	Headers map[string]string
 }
 
-// Scheduler is the implementation of a... scheduler.
-// The scheduler object is created at startup using the provided configuration.
-type Scheduler interface {
-	// PreSchedule selects scheduling profiles through the implemented
-	// logic, and returns:
-	// - profiles - A subset of the registered scheduling profiles to be run
-	PreSchedule(request map[string]any, data scheduling.CycleState, results map[string][]Endpoint) map[string]SchedulingProfile
-
-	// PostSchedule receives the output of the result(s) of the scheduling cycle(s)
-	// and makes sense of the data to be consumed by the calling system.
-	// For example: suppose you have 2 profiles, a ShadowBoxing Profile & a Production Profile.
-	// PostSchedule would know to simply log the result of the ShadowBoxing
-	// profile, and do nothing else with it.
-	PostSchedule(profileResults map[string][]Endpoint) SchedulingResult
+// ScoredEndpoint encapsulates an Endpoint with its Score.
+// The lifecycle of an endpoint is typically different from the lifecycle of a request.
+// This is intended to be used only internally by Scheduler logic and/or scheduler plugins within the lifecycle of the request.
+// When returning the selected Endpoint(s) out of the Scheduler, an Endpoint is returned without the score.
+type ScoredEndpoint struct {
+	Endpoint
+	Score float64
+}
+
+type Scheduler struct {
+	SchedulerConfig
 }
 
-// SchedulingProfile is used to describe a profile that will
+// SchedulerConfig is the struct that maps to the configuration file, which should be further discussed.
+// The configuration file should include the ProfileHandler plugin as well as the profiles with their plugins.
+type SchedulerConfig struct {
+	// Exactly one ProfileHandler instance is required.
+	profileHandler ProfileHandler //nolint:unused
+	// Map from profile name to its set of plugins.
+	profiles map[string]*SchedulerProfile //nolint:unused
+}
+
+// SchedulerProfile is used to describe a profile that will
 // run for a given scheduling cycle.
-type SchedulingProfile struct {
-	// Name of the profile.
-	Name string
-	// Filters lists all Filter plugins associated with this Profile. Filters
-	// are optional.
-	Filters []Filter
-	// Scorers lists all Score plugins associated with this Profile. Scorers
-	// are optional.
-	Scorers map[Scorer]int
+type SchedulerProfile struct {
+	// filters lists all Filter plugins associated with this Profile.
+	// Filters are optional.
+	filters []Filter //nolint:unused
+	// scorers lists all Score plugins associated with this Profile.
+	// Scorers are optional.
+	scorers []*WeightedScorer //nolint:unused
 	// Picker returns the function that picks the endpoint(s). Picker is required.
-	Picker Picker
+	picker Picker //nolint:unused
 }
 
-// Filter runs before any scoring, and removes endpoints that are not fit for
-// selection. The framework will return an error to the client if the endpoints
-// are filtered to zero.
+type SchedulingResult struct {
+	ProfileResults     map[string][]*Endpoint // a map from profile name to its scheduling result
+	PrimaryProfileName string                 // key of the primary profile; its selected endpoints are used by default as the destination
+}
+
+// Plugin is the parent type for all the scheduling framework plugins.
+type Plugin interface {
+	Name() string
+}
+
+// ProfileHandler defines the interface for handling multiple SchedulerProfile instances.
+// More specifically, this interface defines two extension points: 'Pick',
+// which runs iteratively, and 'ProcessResults', which runs after all profile runs complete
+// and processes the results of all profiles.
+type ProfileHandler interface {
+	Plugin
+	// Pick picks the SchedulerProfile objects to run from a list of candidate profiles,
+	// while taking into consideration the request properties
+	// and the previously executed SchedulerProfile runs along with their results.
+	// Returns:
+	// - profiles - A subset of the registered scheduling profiles to be run in the next iteration.
+	Pick(request *Request, profiles map[string]*SchedulerProfile, executionResults map[string][]*ScoredEndpoint) map[string]*SchedulerProfile
+
+	// ProcessResults handles the outcome of each profile run.
+	// It may aggregate results, log test profile outputs, or apply custom logic. It specifies in the SchedulingResult the
+	// key of the primary profile that should be used to get the request's selected destination.
+	// Example: suppose you have 2 profiles, a ShadowBoxing Profile & a Production Profile.
+	// ProcessResults would know to simply log the result of the ShadowBoxing
+	// profile, and do nothing else with it.
+	ProcessResults(request *Request, profileResults map[string][]*ScoredEndpoint) *SchedulingResult
+}
+
+// Filter runs before any scoring, and removes endpoints that are not fit for selection.
+// The framework will return an error to the client if the endpoints are filtered to zero.
 type Filter interface {
 	Plugin
-	Filter(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
+	Filter(ctx context.Context, request *Request, state *scheduling.CycleState, endpoints []*Endpoint) []*Endpoint
 }
 
-// Scorer applies a score to each remaining endpoint provided. Scorers SHOULD
-// keep their score values in a normalized range: [0-1]. Any weighting should
-// be added at the SchedulingProfile configuration level.
+// Scorer applies a score to each remaining endpoint provided.
+// Scorers SHOULD keep their score values in a normalized range: [0-1].
+// Any weighting should be added at the SchedulerProfile configuration level.
 type Scorer interface {
 	Plugin
-	Score(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
+	Score(ctx context.Context, request *Request, state *scheduling.CycleState, endpoints []*Endpoint) []*ScoredEndpoint
+}
+
+// WeightedScorer encapsulates a scorer with its weight.
+// We need this struct in order to keep scorers in a profile as a slice instead of a map.
+// This is very useful for having a generic AddPlugin function that registers a plugin to all its extension points;
+// using a map is much less convenient for this purpose.
+type WeightedScorer struct {
+	Scorer
+	weight int //nolint:unused
 }
 
 // Picker selects the endpoint(s) from the provided list of scored endpoints.
 // Picker MUST return one endpoint at minimum.
 type Picker interface {
 	Plugin
-	Pick(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
-}
-
-type PostResponse interface {
-	Plugin
-	PostResponse(ctx context.Context, request map[string]any, response map[string]any)
+	Pick(ctx context.Context, state *scheduling.CycleState, endpoints []*ScoredEndpoint) []*ScoredEndpoint
 }
```
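To make the proposed interfaces concrete, here is a sketch of a ProfileHandler along the lines of the ShadowBoxing/Production example in the comments. The types are re-declared locally (and simplified, without CycleState or context) so the snippet is self-contained; the profile names and handler logic are illustrative, not part of the proposal.

```go
package main

import "fmt"

// Local, simplified re-declarations of the proposal's types.
type Endpoint struct{ Name string }
type ScoredEndpoint struct {
	Endpoint
	Score float64
}
type SchedulerProfile struct{}
type Request struct{ TargetModel string }
type SchedulingResult struct {
	ProfileResults     map[string][]*Endpoint
	PrimaryProfileName string
}

// shadowHandler runs a "production" profile alongside a "shadow" profile,
// then keeps only the production result as the request destination.
type shadowHandler struct{}

func (h *shadowHandler) Name() string { return "shadow-handler" }

// Pick returns all registered profiles on the first call and nothing afterwards,
// ending the scheduling cycle after one iteration.
func (h *shadowHandler) Pick(req *Request, profiles map[string]*SchedulerProfile, executed map[string][]*ScoredEndpoint) map[string]*SchedulerProfile {
	if len(executed) > 0 {
		return nil // all profiles already ran
	}
	return profiles
}

// ProcessResults logs the shadow result, strips scores, and marks "production" as primary.
func (h *shadowHandler) ProcessResults(req *Request, results map[string][]*ScoredEndpoint) *SchedulingResult {
	out := &SchedulingResult{ProfileResults: map[string][]*Endpoint{}, PrimaryProfileName: "production"}
	for name, scored := range results {
		if name == "shadow" {
			fmt.Printf("shadow profile picked %d endpoint(s); ignoring\n", len(scored))
		}
		for _, se := range scored {
			ep := se.Endpoint // endpoints leave the scheduler without their score
			out.ProfileResults[name] = append(out.ProfileResults[name], &ep)
		}
	}
	return out
}

func main() {
	h := &shadowHandler{}
	results := map[string][]*ScoredEndpoint{
		"production": {{Endpoint{"pod-a"}, 0.9}},
		"shadow":     {{Endpoint{"pod-b"}, 0.7}},
	}
	res := h.ProcessResults(&Request{TargetModel: "llama"}, results)
	fmt.Println(res.ProfileResults[res.PrimaryProfileName][0].Name) // the production pick
}
```

Note how `ProcessResults` converts `ScoredEndpoint` back to `Endpoint`, matching the interface comment that scores are internal to the request's lifecycle.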

site-src/contributing/index.md

Lines changed: 1 addition & 2 deletions
```diff
@@ -45,5 +45,4 @@ doc. Feel free to add topics for discussion at an upcoming meeting.
 
 All meetings are recorded and automatically uploaded to the [WG-Serving meetings
 YouTube
-playlist][https://www.youtube.com/playlist?list=PL69nYSiGNLP30qNanabU75ayPK7OPNAAS].
-
+playlist](https://www.youtube.com/playlist?list=PL69nYSiGNLP30qNanabU75ayPK7OPNAAS).
```

site-src/guides/metrics.md

Lines changed: 66 additions & 1 deletion
````diff
@@ -93,4 +93,69 @@ TOKEN=$(kubectl -n default get secret inference-gateway-sa-metrics-reader-secret
 kubectl -n default port-forward inference-gateway-ext-proc-pod-name 9090
 
 curl -H "Authorization: Bearer $TOKEN" localhost:9090/metrics
 ```
+
+## Prometheus Alerts
+
+This section describes how to configure Prometheus alerts using the collected metrics.
+
+### Configure alerts
+
+You can follow this [blog post](https://grafana.com/blog/2020/02/25/step-by-step-guide-to-setting-up-prometheus-alertmanager-with-slack-pagerduty-and-gmail/) for instructions on setting up alerts in your monitoring stack with Prometheus.
+
+A template alert rule is available at [alert.yaml](../../tools/alerts/alert.yaml). You can modify and append these rules to your existing Prometheus deployment.
+
+#### High Inference Request Latency P99
+
+```
+alert: HighInferenceRequestLatencyP99
+expr: histogram_quantile(0.99, rate(inference_model_request_duration_seconds_bucket[5m])) > 10.0 # Adjust threshold as needed (e.g., 10.0 seconds)
+for: 5m
+annotations:
+  title: 'High latency (P99) for model {{ $labels.model_name }}'
+  description: 'The 99th percentile request duration for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 10.0 seconds for 5 minutes.'
+labels:
+  severity: 'warning'
+```
+
+#### High Inference Error Rate
+
+```
+alert: HighInferenceErrorRate
+expr: sum by (model_name) (rate(inference_model_request_error_total[5m])) / sum by (model_name) (rate(inference_model_request_total[5m])) > 0.05 # Adjust threshold as needed (e.g., 5% error rate)
+for: 5m
+annotations:
+  title: 'High error rate for model {{ $labels.model_name }}'
+  description: 'The error rate for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 5% for 5 minutes.'
+labels:
+  severity: 'critical'
+  impact: 'availability'
+```
+
+#### High Inference Pool Queue Average Size
+
+```
+alert: HighInferencePoolAvgQueueSize
+expr: inference_pool_average_queue_size > 50 # Adjust threshold based on expected queue size
+for: 5m
+annotations:
+  title: 'High average queue size for inference pool {{ $labels.name }}'
+  description: 'The average number of requests pending in the queue for inference pool {{ $labels.name }} has been consistently above 50 for 5 minutes.'
+labels:
+  severity: 'critical'
+  impact: 'performance'
+```
+
+#### High Inference Pool Average KV Cache
+
+```
+alert: HighInferencePoolAvgKVCacheUtilization
+expr: inference_pool_average_kv_cache_utilization > 0.9 # 90% utilization
+for: 5m
+annotations:
+  title: 'High KV cache utilization for inference pool {{ $labels.name }}'
+  description: 'The average KV cache utilization for inference pool {{ $labels.name }} has been consistently above 90% for 5 minutes, indicating potential resource exhaustion.'
+labels:
+  severity: 'critical'
+  impact: 'resource_exhaustion'
+```
````
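The HighInferenceErrorRate rule divides the per-model error rate by the total request rate and fires above a 5% threshold. The same arithmetic, sketched in Go for clarity; the function name and the zero-traffic guard are my additions, not part of the metrics API (Prometheus itself simply yields no sample when the denominator is absent).

```go
package main

import "fmt"

// highErrorRate mirrors the HighInferenceErrorRate expression:
// rate(errors) / rate(requests) > threshold.
func highErrorRate(errorsPerSec, requestsPerSec, threshold float64) bool {
	if requestsPerSec == 0 {
		return false // no traffic, no alert
	}
	return errorsPerSec/requestsPerSec > threshold
}

func main() {
	fmt.Println(highErrorRate(0.4, 10, 0.05)) // 4% error rate: below threshold
	fmt.Println(highErrorRate(1.2, 10, 0.05)) // 12% error rate: alert fires
}
```

The `for: 5m` clause in the rule additionally requires the condition to hold continuously for five minutes before the alert fires.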

tools/alerts/alert.yaml

Lines changed: 38 additions & 0 deletions
```diff
@@ -0,0 +1,38 @@
+groups:
+- name: gateway-api-inference-extension
+  rules:
+  - alert: HighInferenceRequestLatencyP99
+    annotations:
+      title: 'High latency (P99) for model {{ $labels.model_name }}'
+      description: 'The 99th percentile request duration for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 10.0 seconds for 5 minutes.'
+    expr: histogram_quantile(0.99, rate(inference_model_request_duration_seconds_bucket[5m])) > 10.0
+    for: 5m
+    labels:
+      severity: 'warning'
+  - alert: HighInferenceErrorRate
+    annotations:
+      title: 'High error rate for model {{ $labels.model_name }}'
+      description: 'The error rate for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 5% for 5 minutes.'
+    expr: sum by (model_name) (rate(inference_model_request_error_total[5m])) / sum by (model_name) (rate(inference_model_request_total[5m])) > 0.05
+    for: 5m
+    labels:
+      severity: 'critical'
+      impact: 'availability'
+  - alert: HighInferencePoolAvgQueueSize
+    annotations:
+      title: 'High average queue size for inference pool {{ $labels.name }}'
+      description: 'The average number of requests pending in the queue for inference pool {{ $labels.name }} has been consistently above 50 for 5 minutes.'
+    expr: inference_pool_average_queue_size > 50
+    for: 5m
+    labels:
+      severity: 'critical'
+      impact: 'performance'
+  - alert: HighInferencePoolAvgKVCacheUtilization
+    annotations:
+      title: 'High KV cache utilization for inference pool {{ $labels.name }}'
+      description: 'The average KV cache utilization for inference pool {{ $labels.name }} has been consistently above 90% for 5 minutes, indicating potential resource exhaustion.'
+    expr: inference_pool_average_kv_cache_utilization > 0.9
+    for: 5m
+    labels:
+      severity: 'critical'
+      impact: 'resource_exhaustion'
```
