
Bug: Response model field does not match routing decision #430

@yossiovadia

Description


Summary

The semantic router correctly classifies prompts and routes requests to the appropriate model endpoint (Model-A or Model-B), but the model field in the response JSON does not reflect the router's decision. Instead, it contains whatever model name the vLLM endpoint that served the request reported.

Impact

  • Severity: Medium
  • Component: ExtProc response handling
  • User Impact: API consumers cannot tell from the standard model field which model the semantic router actually selected; they must inspect the custom x-vsr-selected-model header or the logs instead

Steps to Reproduce

  1. Deploy semantic router with Model-A and Model-B configured
  2. Configure routing with categories that should route to Model-A (e.g., economics with score 1.0)
  3. Send a request that should route to Model-A:
curl -X POST "http://<envoy-url>/v1/chat/completions" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Explain marginal utility in economics"}],
    "max_tokens": 20
  }'
  4. Check the response - the model field shows Model-B (incorrect)
  5. Check the logs - routing logs show "selected_model":"Model-A" (correct)

Expected Behavior

The response JSON should contain:

{
  "model": "Model-A",
  "choices": [...],
  "usage": {...}
}

The model field should match what the semantic router decided, as shown in:

  • Routing decision logs: "selected_model":"Model-A"
  • Custom header: x-vsr-selected-model: Model-A
  • Context: ctx.RequestModel = "Model-A"

Actual Behavior

The response JSON contains:

{
  "model": "Model-B",
  "choices": [...],
  "usage": {...}
}

The model field comes from the vLLM endpoint's response and does not reflect the router's decision.

Evidence from Logs

// Classification (correct)
{"msg":"Classified as category: economics (mmlu=economics)"}

// Model selection (correct)
{"msg":"Selected model Model-A for category economics with score 1.0000"}

// Routing decision (correct)
{"msg":"Routing to model: Model-A"}
{"msg":"routing_decision","selected_model":"Model-A","category":"economics","selected_endpoint":"127.0.0.1:8000"}

Yet the API response returns "model": "Model-B".

Root Cause Analysis

File: src/semantic-router/pkg/extproc/response_handler.go
Function: handleResponseBody() (lines 182-296)

Current Code Flow

  1. Response body is received from vLLM endpoint (line 186)
  2. Response is parsed into an openai.ChatCompletion struct (lines 215-216)
  3. Usage statistics are extracted for metrics (lines 220-270)
  4. Cache is updated with original response (lines 273-282)
  5. Original response is returned unchanged (lines 284-293)
// Line 284-293: Current code
response := &ext_proc.ProcessingResponse{
    Response: &ext_proc.ProcessingResponse_ResponseBody{
        ResponseBody: &ext_proc.BodyResponse{
            Response: &ext_proc.CommonResponse{
                Status: ext_proc.CommonResponse_CONTINUE,
            },
        },
    },
}
return response, nil

The Problem

The parsed.Model field from the vLLM endpoint response is never updated to match ctx.RequestModel (which contains the router's decision).

Proposed Fix

After parsing the response (line 216), update the model field and re-marshal:

// Parse tokens from the response JSON using OpenAI SDK types
var parsed openai.ChatCompletion
parseErr := json.Unmarshal(responseBody, &parsed)
if parseErr != nil {
    observability.Errorf("Error parsing tokens from response: %v", parseErr)
    metrics.RecordRequestError(ctx.RequestModel, "parse_error")
}

// FIX: Update model field to match routing decision.
// Guarding on parseErr ensures a failed parse never overwrites the
// original body with a re-marshaled zero-valued struct.
if parseErr == nil && ctx.RequestModel != "" && parsed.Model != ctx.RequestModel {
    observability.Infof("Updating response model field from '%s' to '%s'", parsed.Model, ctx.RequestModel)
    parsed.Model = ctx.RequestModel

    // Re-marshal with updated model field
    modifiedBody, err := json.Marshal(parsed)
    if err != nil {
        observability.Errorf("Error re-marshaling response with updated model: %v", err)
        // Fall back to the original, unmodified response body
    } else {
        responseBody = modifiedBody
    }
}

// Continue with existing token extraction...
promptTokens := int(parsed.Usage.PromptTokens)
completionTokens := int(parsed.Usage.CompletionTokens)
// ... rest of the code

Then at the end, return the modified response body:

// Return the modified response body
response := &ext_proc.ProcessingResponse{
    Response: &ext_proc.ProcessingResponse_ResponseBody{
        ResponseBody: &ext_proc.BodyResponse{
            Response: &ext_proc.CommonResponse{
                Status: ext_proc.CommonResponse_CONTINUE,
                BodyMutation: &ext_proc.BodyMutation{
                    Mutation: &ext_proc.BodyMutation_Body{
                        Body: responseBody,
                    },
                },
            },
        },
    },
}
return response, nil

Testing Strategy

Unit Tests

  1. Test response model field rewriting when the routing decision differs from the endpoint response (see the sketch after this list)
  2. Test fallback behavior when JSON unmarshaling/marshaling fails
  3. Test that non-JSON responses are handled gracefully
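
A minimal sketch of test 1, exercising the rewrite logic in isolation. The chatResponse struct, the package name, and the literals are illustrative stand-ins, not the project's actual types:

package extproc_test

import (
    "encoding/json"
    "testing"
)

// chatResponse is a minimal stand-in for openai.ChatCompletion,
// modeling only the field under test.
type chatResponse struct {
    Model string `json:"model"`
}

func TestModelFieldRewrittenToRoutingDecision(t *testing.T) {
    original := []byte(`{"model":"Model-B"}`)
    requestModel := "Model-A" // stands in for ctx.RequestModel

    var parsed chatResponse
    if err := json.Unmarshal(original, &parsed); err != nil {
        t.Fatalf("parse error: %v", err)
    }

    // Same condition as the proposed fix: rewrite only on a mismatch.
    if requestModel != "" && parsed.Model != requestModel {
        parsed.Model = requestModel
    }

    modified, err := json.Marshal(parsed)
    if err != nil {
        t.Fatalf("re-marshal failed: %v", err)
    }
    if string(modified) != `{"model":"Model-A"}` {
        t.Errorf("got %s, want model rewritten to Model-A", modified)
    }
}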

Integration Tests

  1. Send request that routes to Model-A, verify response contains "model": "Model-A"
  2. Send request that routes to Model-B, verify response contains "model": "Model-B"
  3. Verify custom header x-vsr-selected-model matches response model field (see the sketch after this list)
  4. Test with streaming responses (should not modify SSE chunks)
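
A sketch of test 3, assuming a running Envoy + router deployment; the envoyURL constant (host and port) is a test-environment assumption, and the header name is the one documented above:

package integration_test

import (
    "encoding/json"
    "io"
    "net/http"
    "strings"
    "testing"
)

// Assumed test endpoint; adjust to the deployment under test.
const envoyURL = "http://localhost:8801/v1/chat/completions"

func TestModelFieldMatchesSelectedModelHeader(t *testing.T) {
    body := `{"model":"auto","messages":[{"role":"user","content":"Explain marginal utility in economics"}],"max_tokens":20}`
    resp, err := http.Post(envoyURL, "application/json", strings.NewReader(body))
    if err != nil {
        t.Fatalf("request failed: %v", err)
    }
    defer resp.Body.Close()

    raw, err := io.ReadAll(resp.Body)
    if err != nil {
        t.Fatalf("reading body failed: %v", err)
    }
    var parsed struct {
        Model string `json:"model"`
    }
    if err := json.Unmarshal(raw, &parsed); err != nil {
        t.Fatalf("invalid JSON response: %v", err)
    }

    // The body's model field should agree with the router's own header.
    selected := resp.Header.Get("x-vsr-selected-model")
    if parsed.Model != selected {
        t.Errorf("model field %q does not match x-vsr-selected-model %q", parsed.Model, selected)
    }
}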

E2E Tests

  1. Deploy with real vLLM endpoints
  2. Test all categories route to correct models
  3. Verify response model field matches routing logs
  4. Verify cached responses also have correct model field

Additional Considerations

  1. Streaming responses: The fix should only apply to non-streaming responses (already handled by the ctx.IsStreamingResponse check at line 190; see the guard sketch after this list)
  2. Cache consistency: Because the fix rewrites responseBody right after parsing (step 2), before the cache update (step 4 in the code flow above), cached responses will store the corrected model field as well
  3. Performance: JSON re-marshaling adds minimal overhead compared to model inference time
  4. Backwards compatibility: This is a bug fix that makes the API more correct, not a breaking change
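
For completeness, a minimal sketch of how the rewrite composes with the streaming guard from consideration 1. The early return at line 190 should already short-circuit streaming responses before the body is parsed, so this explicit check is belt-and-braces; variable names follow the proposed fix above:

// Apply the model rewrite only to buffered, non-streaming responses.
// parseErr, parsed, ctx, and responseBody are as in the proposed fix.
if !ctx.IsStreamingResponse && parseErr == nil &&
    ctx.RequestModel != "" && parsed.Model != ctx.RequestModel {
    parsed.Model = ctx.RequestModel
    if modifiedBody, err := json.Marshal(parsed); err == nil {
        responseBody = modifiedBody
    }
}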

Related Code

  • Response headers already include x-vsr-selected-model (lines 80-88)
  • Request context tracks ctx.RequestModel throughout routing (set at line 952 in request_handler.go)
  • Metrics already use ctx.RequestModel for tracking (lines 224-270)

Verification

After the fix:

# Send request
curl -X POST "http://<envoy-url>/v1/chat/completions" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Explain marginal utility in economics"}],
    "max_tokens": 20
  }' | jq '.model'

# Expected output: "Model-A"
# Header should also show: x-vsr-selected-model: Model-A

Assignee: @yovadia
Labels: bug, extproc, response-handling
Priority: Medium
Milestone: Next Release
