InftyAI · InftyAI-Agent · May 14, 2025 · May 8, 2025 · May 8, 2025 · May 13, 2025
diff --git a/docs/proposals/376-metric-aggregagor/README.md b/docs/proposals/376-metric-aggregagor/README.md
@@ -85,10 +85,9 @@ List the specific goals of the Proposal. What is it trying to achieve? How will
 know that this has succeeded?
 -->
 
-- A simple implementation with least busy scheduling algorithm
+- A simple implementation with least-latency scheduling algorithm
 - Extensible with different consumers in the cluster, like the Lora autoscaler or the ai gateway
 - Metrics visualization support, like Grafana
-- Metric management support, especially the gc policy
 
 ### Non-Goals
 
@@ -97,7 +96,7 @@ What is out of scope for this Proposal? Listing non-goals helps to focus discuss
 and make progress.
 -->
 
-- Different scheduling algorithm implementations in ai gateway
+- Different scheduling algorithm implementations in ai gateway, like prefix-cache aware
 - LoRA aware scheduling implementation, will be left to another KEP
 - Performance consideration in big clusters should be left to the Beta level
 
@@ -123,7 +122,7 @@ bogged down.
 
 #### Story 1
 
-As a user, I hope my LLM request could be routed to the least-busy instance, so that I can get the result as soon as possible.
+As a user, I hope my LLM request could be routed to the least-latency instance, so that I can get the result as soon as possible.
 
 #### Story 2
 
@@ -138,7 +137,8 @@ Go in to as much detail as necessary here.
 This might be a good place to talk about core concepts and how they relate.
 -->
 
-Metrics-based routing should meet the baseline requirements: even the metrics are unavailable or outdated, the system should still be able to work, despite the fact that the request response may be slower. For example, metrics-based lora scheduling is unfit here because once the metric indicates the wrong instance, we may hit 500 server error, it's unacceptable.
+- Metrics-based routing should meet the baseline requirements: even the metrics are unavailable or outdated, the system should still be able to work, despite the fact that the request response may be slower. For example, metrics-based lora scheduling maybe unfit here because once the metric indicates the wrong instance, we may hit 500 server error, it's unacceptable. Unless the inference engine will fetch the models dynamically.
+- Once the gateway picks the Pod for scheduling, it could happen that the Pod suddenly becomes unavailable, we should support fallback mechanism to default service routing.
 
 ### Risks and Mitigations
 
@@ -174,25 +174,40 @@ The overall flow looks like:
 
 Let's break down the flow into several steps:
 
-- Step 1: we'll collect the metrics from the inference workloads, we choose `PUSH` mode here just to put less pressure on the gateway side, or the gateway will have iterate all the Pods which obviously will lead to performance issues.
-- Step 2: the gateway plugin will parse the metrics and store them in the redis, this is for HA consideration and cache sharing. Once the instance is down, we can still retrieve the metrics from redis. And if we have multiple instances, we can share the metrics with each other via redis. Considering Envoy AI gateway already uses Redis for limit rating, we'll reuse the Redis here.
-- Step 3 & 4: Traffic comes, and the Router will retrieve the metrics from Redis and make routing decisions based on different algorithms, like queue size aware scheduling.
+- Step 1: we'll collect the metrics from the inference workloads in metrics aggregator.
+- Step 2: the aggregator will parse the metrics and store them in the redis, this is for HA consideration and cache sharing. Once the instance is down, we can still retrieve the metrics from redis. And if we have multiple instances, we can share the metrics with each other via redis. Considering Envoy AI gateway already uses Redis for limit rating, we'll reuse the Redis here.
+- Step 3 & 4: Traffic comes, the gateway plugin (we'll call it router later) will retrieve the metrics from Redis and make routing decisions based on different algorithms, like queue size aware scheduling.
 - Step 5: The router will send the request to the selected instance, and the instance will return the result to the router, return to the user finally.
 
-
 ### Additional components introduced:
 
-- Pod Sidecar: a sidecar container is necessary for each inference workload, which was introduced in Kubernetes 1.28 as alpha feature, and enabled by default in 1.29, see [details](https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/). The sidecar will be responsible for collecting the metrics and pushing them to the AI gateway. Let's set the interval time to 100ms at first.
-- Redis: a Redis instance is necessary for the metrics storage and sharing, we can use the existing Redis instance in the cluster, or deploy a new one if not available.
-- Gateway Plugin: a new plugin or [DynamicLoadBalancingBackend](https://github.com/envoyproxy/ai-gateway/blob/be2b479b04bc7a219b0c8239143bfbabebdcd615/filterapi/filterconfig.go#L199-L208) specifically in Envoy AI gateway to pick the best-fit Pod endpoints. However, we may block by the upstream issue [here](https://github.com/envoyproxy/ai-gateway/issues/604), we'll work with the Envoy AI Gateway team to resolve it ASAP. Maybe the final design will impact our implementation a bit but not much I think.
+- Metrics Aggregator (MA): MA is working as the controller plane to sync the metrics, this is also one of the reason why we want to decouple it from the router, which working as a data plane. MA has several components:
+  - A Pod controller to manage the Pod lifecycle, for example, once a Pod is ready, it will add it to the internal store, and each Pod will fork a background goroutine to sync the metrics continuously, 50ms interval by default. Once the Pod is deleted, the goroutine will be stopped and removed from the store.
+  - A internal store to parse the metric results, and store it in the backend storage, like Redis.
+- Redis: a Redis instance is necessary for the metrics storage and sharing, we can use the existing Redis instance in the cluster, or deploy a new one if necessary. We should have storage interface to support different backends in the future.
+- Router: a new router or [DynamicLoadBalancingBackend](https://github.com/envoyproxy/ai-gateway/blob/be2b479b04bc7a219b0c8239143bfbabebdcd615/filterapi/filterconfig.go#L199-L208) specifically in Envoy AI gateway to pick the best-fit Pod endpoints. However, we may block by the upstream issue [here](https://github.com/envoyproxy/ai-gateway/issues/604), we'll work with the Envoy AI Gateway team to resolve it ASAP. Maybe the final design will impact our implementation a bit but not much I think.
 
 ### Data Structure
 
 The data structure could be varied based on the metrics we want to collect, let's take the queue size as an example:
 
-Because redis is a kv store, we'll use the ZSET to store the results, `LeastBusy::ModelName` as the key, Pod name as the member and the (runningQueueSize * 0.3 + waitingQueueSize * 0.7) as the score, the factor of waitingQueueSize is higher because metric is a delayed indicator. RunningQueueSize and WaitingQueueSize are two metrics most of the inference engines support.
+Because redis is a kv store, we'll use the ZSET to store the results, `LeastLatency::ModelName` as the key, Pod name as the member and the (runningQueueSize * 0.3 + waitingQueueSize * 0.7) as the score, the factor of waitingQueueSize is higher because metric is a delayed indicator. RunningQueueSize and WaitingQueueSize are two metrics most of the inference engines support.
+
+We'll also have another key to record the update timestamp. For example, a Pod named "default/fake-pod" with the score = 0.5, the set commands look like:
+
+```bash
+# set the update timestamp
+SET default/fake-pod "2025-05-12T06:16:27Z"
+
+# set the score
+ZADD LeastLatency::ModelName 0.5 default/fake-pod
+```
 
-Also set the expiration time to 500ms just in case the metric reporting is delayed and lead to the hotspot issue.
+When collecting, we'll update the timestamp and score together. Setting the top 5 is enough for us to help reduce the storage pressure since it's a memory-based database. We don't use the expiration key here is just because most of the time, the metrics should be updated at a regular interval.
+
+When retrieving, we'll first query the ZSET to get the top 5 records, and iterate them one by one to verify that `currentTimestamp - recordTimestamp < 5s`, if not, skipping to the next one. This is to avoid outdated metrics. Once picked the exact endpoint, we'll reset the score with waitingQueueSize + 1 to avoid hotspot issues, especially when metrics update is blocked by some reasons.
+
+If all metrics are outdated, we'll fallback to the default service.
 
 Note: the algorithm is not the final one, we'll have more discussions with the community to find the best one.
 
@@ -293,8 +308,10 @@ milestones with these graduation criteria:
 
 Beta:
 
-- No performance issues in big clusters, we may user daemonset to report metrics.
-- Other storages rather than KV store who supports only key-value pairs which might be not enough for more complex scenarios.
+- Other storages rather than KV store who supports only key-value pairs which might be not enough for more complex scenarios, like the prefix-cache aware scenario.
+- HA support, once the metrics aggregator is down, the system should still work.
+- No performance issues in big clusters, we may use daemonsets to report metrics.
+- Once the picked Pod is down after the routing decision, router will fallback to the default service. Fallback mode is already supported in Envoy AI gateway.
 
 ## Implementation History
 
@@ -324,3 +341,5 @@ What other approaches did you consider, and why did you rule them out? These do
 not need to be as detailed as the proposal, but should include enough
 information to express the idea and why it was not acceptable.
 -->
+
+- When collecting metrics from the inference workloads, `PUSH` mode will put less pressure on the gateway side, or the gateway will have iterate all the Pods which obviously will lead to performance issues. We didn't pick the approach because it will either add additional load to the inference workload and introduces more complexity to the system. The current approach will fork as much goroutines as the number of inference workloads to sync the metrics in parallel, this is feasible because goroutine is lightweight. Once the metrics aggregator becomes the bottleneck, we can consider to use `PUSH` mode at node level.
diff --git a/docs/proposals/376-metric-aggregagor/flow.png b/docs/proposals/376-metric-aggregagor/flow.png