RAG-Optimized Routing with On-Demand Retrieval #155

@Xunzhuo

Description

Title: Implement RAG-Optimized Category Routing with On-Demand Knowledge Retrieval

Overview

Extend the vLLM Semantic Router to include RAG (Retrieval-Augmented Generation) optimization capabilities, enabling intelligent routing decisions based on knowledge retrieval needs and implementing on-demand RAG for queries that would benefit from external knowledge augmentation.

From the latest research via: https://arxiv.org/pdf/2505.10570

When evaluated with several long-context models, the paper observes:

  • a performance drop of 7% to 85% as the number of tools increases
  • a 7% to 91% degradation in answer retrieval as tool-response length increases
  • degradations of 13% and 40% as multi-turn conversations get longer

Current State

The semantic router currently:

  • Routes based on 14 predefined categories using ModernBERT classification
  • Uses semantic caching for response reuse
  • Supports tool selection based on query similarity
  • Has no integrated knowledge retrieval capabilities

Proposed Enhancement

1. RAG-Aware Categories

Add new categories specifically designed for knowledge-intensive tasks:

```yaml
categories:
  - name: factual_lookup
    use_reasoning: false
    use_rag: true
    rag_strategy: "always"
    rag_description: "Factual queries requiring up-to-date information"
    knowledge_domains: ["current_events", "statistics", "definitions"]

  - name: research_synthesis
    use_reasoning: true
    use_rag: true
    rag_strategy: "adaptive"
    rag_description: "Complex research requiring multiple sources"
    knowledge_domains: ["academic", "technical", "scientific"]

  - name: domain_expertise
    use_reasoning: true
    use_rag: true
    rag_strategy: "conditional"
    rag_description: "Specialized domain knowledge queries"
    knowledge_domains: ["medical", "legal", "financial", "technical"]
```
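One way the three `rag_strategy` values could map to routing behavior is sketched below; the `CategoryRAGConfig` type, field names, and the 0.8 confidence threshold are illustrative assumptions, not existing router code.

```go
package main

import "fmt"

// CategoryRAGConfig mirrors the per-category fields from the YAML above
// (illustrative names, not the router's actual types).
type CategoryRAGConfig struct {
    Name        string
    RAGStrategy string // "always", "adaptive", "conditional"
}

// shouldRetrieve sketches how each strategy might gate retrieval.
// confidence is the classifier's confidence that the model can answer
// without external knowledge; domainMatch reports whether the query
// hit one of the category's knowledge_domains.
func shouldRetrieve(cfg CategoryRAGConfig, confidence float64, domainMatch bool) bool {
    switch cfg.RAGStrategy {
    case "always":
        return true
    case "adaptive":
        // Retrieve only when the model is unlikely to know the answer.
        return confidence < 0.8
    case "conditional":
        // Retrieve only for queries matching a configured domain.
        return domainMatch
    default:
        return false
    }
}

func main() {
    fmt.Println(shouldRetrieve(CategoryRAGConfig{Name: "factual_lookup", RAGStrategy: "always"}, 0.99, false))
    fmt.Println(shouldRetrieve(CategoryRAGConfig{Name: "research_synthesis", RAGStrategy: "adaptive"}, 0.95, false))
}
```

The key design choice is that "adaptive" and "conditional" are cheap pre-checks: they can veto retrieval before any vector-store round trip happens.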

2. On-Demand RAG Pipeline

Implement intelligent retrieval decision-making:

```go
type RAGDecision struct {
    ShouldRetrieve    bool     `json:"should_retrieve"`
    RetrievalStrategy string   `json:"retrieval_strategy"` // "always", "adaptive", "conditional"
    KnowledgeDomains  []string `json:"knowledge_domains"`
    RetrievalQuery    string   `json:"retrieval_query"`
    Confidence        float64  `json:"confidence"`
    Reasoning         string   `json:"reasoning"`
}

type RAGClassificationRequest struct {
    Text                string                 `json:"text"`
    EnableRAG           bool                   `json:"enable_rag"`
    KnowledgePreference []string               `json:"knowledge_preference,omitempty"`
    Options             *ClassificationOptions `json:"options,omitempty"`
}
```

3. Knowledge Retrieval Integration

Vector Database Integration:

  • Support for multiple vector databases (Milvus, Pinecone, Weaviate, Chroma)
  • Configurable knowledge bases per domain
  • Hybrid search (semantic + keyword) capabilities

Retrieval Configuration:

```yaml
rag:
  enabled: true
  default_strategy: "adaptive"
  knowledge_bases:
    - name: "general_knowledge"
      type: "milvus"
      endpoint: "milvus.knowledge.svc.cluster.local:19530"
      collection: "general_kb"
      embedding_model: "sentence-transformers/all-MiniLM-L6-v2"

    - name: "technical_docs"
      type: "pinecone"
      api_key_env: "PINECONE_API_KEY"
      index: "tech-docs-index"

  retrieval_params:
    top_k: 5
    similarity_threshold: 0.7
    max_context_length: 4000
    rerank_enabled: true
```
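The `retrieval_params` above imply a post-retrieval filtering step: drop hits below `similarity_threshold`, then keep the best `top_k`. A minimal sketch (the `Hit` type and scores are illustrative):

```go
package main

import (
    "fmt"
    "sort"
)

// Hit is an illustrative retrieval result with a similarity score.
type Hit struct {
    ID    string
    Score float64
}

// filterHits applies the similarity_threshold and top_k settings:
// discard low-similarity hits, sort the rest by score descending,
// and truncate to the k best.
func filterHits(hits []Hit, threshold float64, topK int) []Hit {
    kept := make([]Hit, 0, len(hits))
    for _, h := range hits {
        if h.Score >= threshold {
            kept = append(kept, h)
        }
    }
    sort.Slice(kept, func(i, j int) bool { return kept[i].Score > kept[j].Score })
    if len(kept) > topK {
        kept = kept[:topK]
    }
    return kept
}

func main() {
    hits := []Hit{{"a", 0.91}, {"b", 0.65}, {"c", 0.74}, {"d", 0.88}}
    // threshold 0.7 and top_k 2, matching the config defaults' shape
    fmt.Println(filterHits(hits, 0.7, 2))
}
```

When `rerank_enabled` is true, a reranking pass would run between the threshold filter and the final truncation.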

4. Adaptive RAG Decision Engine

Intelligence Layer:

  • Analyze query complexity and knowledge requirements
  • Determine if existing model knowledge is sufficient
  • Decide optimal retrieval strategy based on query type

Decision Factors:

  • Query specificity and recency requirements
  • Domain expertise needed
  • Model knowledge cutoff considerations
  • Cost vs. accuracy trade-offs

```go
type RAGDecisionEngine struct {
    knowledgeBases map[string]KnowledgeBase
    decisionModel  *RAGDecisionClassifier
    retrievalCache *RetrievalCache
    config         *RAGConfig
}

func (engine *RAGDecisionEngine) ShouldRetrieve(query string, category string) (*RAGDecision, error) {
    // Analyze query characteristics
    queryFeatures := engine.extractQueryFeatures(query)

    // Check category RAG configuration
    categoryConfig := engine.config.GetCategoryRAGConfig(category)

    // Make retrieval decision
    decision := engine.decisionModel.Predict(queryFeatures, categoryConfig)

    return decision, nil
}
```
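`extractQueryFeatures` is left unspecified above. One plausible feature, given the "recency requirements" and "knowledge cutoff" decision factors, is a recency signal; the marker list and function below are purely illustrative.

```go
package main

import (
    "fmt"
    "strings"
)

// recencyMarkers are illustrative keywords suggesting a query needs
// post-training-cutoff information (one of the decision factors above).
var recencyMarkers = []string{"latest", "today", "current", "recent", "this year"}

// needsFreshKnowledge sketches one feature the decision model might
// consume: does the query appear to ask about recent events?
func needsFreshKnowledge(query string) bool {
    q := strings.ToLower(query)
    for _, m := range recencyMarkers {
        if strings.Contains(q, m) {
            return true
        }
    }
    return false
}

func main() {
    fmt.Println(needsFreshKnowledge("What is the latest vLLM release?"))
    fmt.Println(needsFreshKnowledge("Explain binary search"))
}
```

A trained `RAGDecisionClassifier` would combine several such signals rather than keyword matching alone.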

5. Enhanced Routing Logic

RAG-Aware Model Selection:

  • Prioritize models with better RAG integration capabilities
  • Consider context window size for retrieved content
  • Balance retrieval cost with model capability

Context Injection:

  • Seamlessly inject retrieved context into model prompts
  • Handle context length limitations intelligently
  • Implement context summarization when needed
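The context-injection step above can be sketched as a prompt builder with a hard context budget; this version counts characters for simplicity (a real implementation would count tokens against `max_context_length`), and the prompt template is an assumption.

```go
package main

import "fmt"

// injectContext prepends retrieved chunks to the user prompt, dropping
// chunks once the context budget would be exceeded. Budget is measured
// in characters here purely to keep the sketch simple.
func injectContext(prompt string, chunks []string, maxContextLen int) string {
    ctx := ""
    for _, c := range chunks {
        if len(ctx)+len(c)+1 > maxContextLen {
            break // this chunk would blow the budget; stop here
        }
        ctx += c + "\n"
    }
    if ctx == "" {
        return prompt // nothing retrieved or nothing fits: pass through
    }
    return "Context:\n" + ctx + "\nQuestion: " + prompt
}

func main() {
    out := injectContext("Who maintains the router?", []string{"chunk one", "chunk two"}, 30)
    fmt.Println(out)
}
```

Summarization would slot in as a fallback when even the highest-ranked chunk exceeds the budget, instead of silently dropping it.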

6. Implementation Components

New Services:

```go
type RAGService struct {
    knowledgeBases  map[string]KnowledgeBase
    retrievalEngine *RetrievalEngine
    decisionEngine  *RAGDecisionEngine
    contextManager  *ContextManager
}

type RetrievalEngine struct {
    vectorStores     map[string]VectorStore
    embeddingService *EmbeddingService
    rerankingService *RerankingService
}
```

Router Integration:

  • Extend OpenAIRouter with RAG capabilities
  • Add retrieval step before model routing
  • Implement context-aware caching

API Extensions:

  • New /classify/rag endpoint for RAG-aware classification
  • Enhanced /chat/completions with automatic RAG integration
  • RAG decision explanation endpoints for debugging

7. Performance Optimizations

Caching Strategy:

  • Cache retrieval results with semantic similarity
  • Implement multi-level caching (query → retrieval → response)
  • Smart cache invalidation based on knowledge freshness

Retrieval Optimization:

  • Parallel retrieval from multiple knowledge bases
  • Asynchronous retrieval with streaming responses
  • Query expansion and refinement
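Parallel retrieval across knowledge bases is a natural goroutine fan-out; a minimal sketch, where `retrieveFn` stands in for a per-knowledge-base client call:

```go
package main

import (
    "fmt"
    "sync"
)

// retrieveFn stands in for a per-knowledge-base retrieval call.
type retrieveFn func(query string) []string

// parallelRetrieve fans a query out to every knowledge base
// concurrently and merges the results.
func parallelRetrieve(query string, kbs map[string]retrieveFn) []string {
    var (
        mu     sync.Mutex
        wg     sync.WaitGroup
        merged []string
    )
    for _, fn := range kbs {
        wg.Add(1)
        go func(fn retrieveFn) {
            defer wg.Done()
            chunks := fn(query)
            mu.Lock()
            merged = append(merged, chunks...)
            mu.Unlock()
        }(fn)
    }
    wg.Wait()
    return merged
}

func main() {
    // Hypothetical stand-ins for the configured knowledge bases.
    kbs := map[string]retrieveFn{
        "general_knowledge": func(q string) []string { return []string{"general: " + q} },
        "technical_docs":    func(q string) []string { return []string{"tech: " + q} },
    }
    fmt.Println(len(parallelRetrieve("rag routing", kbs)))
}
```

A production version would add per-backend timeouts (e.g. via `context.Context`) so one slow vector store cannot stall the whole routing decision.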

Acceptance Criteria

  • RAG-aware categories can be configured with retrieval strategies
  • Decision engine intelligently determines when to retrieve knowledge
  • Multiple vector database backends are supported
  • Retrieved context is seamlessly integrated into model prompts
  • RAG decisions are cached and reused appropriately
  • Performance impact is minimized through intelligent caching
  • API maintains OpenAI compatibility while adding RAG capabilities
  • Comprehensive metrics and observability for RAG operations
