Description
Title: Implement RAG-Optimized Category Routing with On-Demand Knowledge Retrieval
Overview
Extend the vLLM Semantic Router to include RAG (Retrieval-Augmented Generation) optimization capabilities, enabling intelligent routing decisions based on knowledge retrieval needs and implementing on-demand RAG for queries that would benefit from external knowledge augmentation.
Motivated by recent research (https://arxiv.org/pdf/2505.10570): when evaluated with several long-context models, the authors observe
- a performance drop of 7% to 85% as the number of tools increases,
- a 7% to 91% degradation in answer retrieval as tool-response length increases, and
- degradations of 13% and 40% as multi-turn conversations get longer.
Current State
The semantic router currently:
- Routes based on 14 predefined categories using ModernBERT classification
- Uses semantic caching for response reuse
- Supports tool selection based on query similarity
- Has no integrated knowledge retrieval capabilities
Proposed Enhancement
1. RAG-Aware Categories
Add new categories specifically designed for knowledge-intensive tasks:
```yaml
categories:
  - name: factual_lookup
    use_reasoning: false
    use_rag: true
    rag_strategy: "always"
    rag_description: "Factual queries requiring up-to-date information"
    knowledge_domains: ["current_events", "statistics", "definitions"]
  - name: research_synthesis
    use_reasoning: true
    use_rag: true
    rag_strategy: "adaptive"
    rag_description: "Complex research requiring multiple sources"
    knowledge_domains: ["academic", "technical", "scientific"]
  - name: domain_expertise
    use_reasoning: true
    use_rag: true
    rag_strategy: "conditional"
    rag_description: "Specialized domain knowledge queries"
    knowledge_domains: ["medical", "legal", "financial", "technical"]
```
2. On-Demand RAG Pipeline
Implement intelligent retrieval decision-making:
```go
type RAGDecision struct {
	ShouldRetrieve    bool     `json:"should_retrieve"`
	RetrievalStrategy string   `json:"retrieval_strategy"` // "always", "adaptive", "conditional"
	KnowledgeDomains  []string `json:"knowledge_domains"`
	RetrievalQuery    string   `json:"retrieval_query"`
	Confidence        float64  `json:"confidence"`
	Reasoning         string   `json:"reasoning"`
}
```
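The three `rag_strategy` values can be read as a simple dispatch. A minimal sketch, assuming a hypothetical confidence cutoff and a domain-overlap rule that are illustrative, not part of this proposal:

```go
package main

import "fmt"

// decideRetrieval sketches how the three strategies might map to behavior.
// The 0.6 threshold and the overlap rule are assumptions for illustration.
func decideRetrieval(strategy string, confidence float64, queryDomains, categoryDomains []string) bool {
	switch strategy {
	case "always":
		return true
	case "adaptive":
		// Retrieve only when the classifier is unsure the model's
		// parametric knowledge suffices.
		return confidence < 0.6
	case "conditional":
		// Retrieve only when the query touches a configured domain.
		for _, qd := range queryDomains {
			for _, cd := range categoryDomains {
				if qd == cd {
					return true
				}
			}
		}
		return false
	}
	return false
}

func main() {
	fmt.Println(decideRetrieval("always", 0.9, nil, nil))   // true
	fmt.Println(decideRetrieval("adaptive", 0.9, nil, nil)) // false
	fmt.Println(decideRetrieval("conditional", 0.5, []string{"legal"}, []string{"medical", "legal"})) // true
}
```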
```go
type RAGClassificationRequest struct {
	Text                string                 `json:"text"`
	EnableRAG           bool                   `json:"enable_rag"`
	KnowledgePreference []string               `json:"knowledge_preference,omitempty"`
	Options             *ClassificationOptions `json:"options,omitempty"`
}
```
3. Knowledge Retrieval Integration
Vector Database Integration:
- Support for multiple vector databases (Milvus, Pinecone, Weaviate, Chroma)
- Configurable knowledge bases per domain
- Hybrid search (semantic + keyword) capabilities
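One common way to merge the semantic and keyword result lists is Reciprocal Rank Fusion. A sketch, with illustrative doc IDs and the k=60 constant from the original RRF formulation:

```go
package main

import "fmt"

// rrfFuse merges ranked result lists by summing 1/(k + rank) per document,
// so a document that appears near the top of several lists wins.
func rrfFuse(rankings ...[]string) map[string]float64 {
	const k = 60.0
	scores := make(map[string]float64)
	for _, ranking := range rankings {
		for rank, docID := range ranking {
			scores[docID] += 1.0 / (k + float64(rank+1))
		}
	}
	return scores
}

func main() {
	semantic := []string{"doc-a", "doc-b", "doc-c"}
	keyword := []string{"doc-b", "doc-d", "doc-a"}
	fused := rrfFuse(semantic, keyword)
	// doc-b ranks high in both lists, so it outscores doc-a.
	fmt.Println(fused["doc-b"] > fused["doc-a"]) // true
}
```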
Retrieval Configuration:
```yaml
rag:
  enabled: true
  default_strategy: "adaptive"
  knowledge_bases:
    - name: "general_knowledge"
      type: "milvus"
      endpoint: "milvus.knowledge.svc.cluster.local:19530"
      collection: "general_kb"
      embedding_model: "sentence-transformers/all-MiniLM-L6-v2"
    - name: "technical_docs"
      type: "pinecone"
      api_key_env: "PINECONE_API_KEY"
      index: "tech-docs-index"
  retrieval_params:
    top_k: 5
    similarity_threshold: 0.7
    max_context_length: 4000
    rerank_enabled: true
```
4. Adaptive RAG Decision Engine
Intelligence Layer:
- Analyze query complexity and knowledge requirements
- Determine if existing model knowledge is sufficient
- Decide optimal retrieval strategy based on query type
Decision Factors:
- Query specificity and recency requirements
- Domain expertise needed
- Model knowledge cutoff considerations
- Cost vs. accuracy trade-offs
```go
type RAGDecisionEngine struct {
	knowledgeBases map[string]KnowledgeBase
	decisionModel  *RAGDecisionClassifier
	retrievalCache *RetrievalCache
	config         *RAGConfig
}

func (engine *RAGDecisionEngine) ShouldRetrieve(query string, category string) (*RAGDecision, error) {
	// Analyze query characteristics
	queryFeatures := engine.extractQueryFeatures(query)
	// Check category RAG configuration
	categoryConfig := engine.config.GetCategoryRAGConfig(category)
	// Make retrieval decision
	decision := engine.decisionModel.Predict(queryFeatures, categoryConfig)
	return decision, nil
}
```
5. Enhanced Routing Logic
RAG-Aware Model Selection:
- Prioritize models with better RAG integration capabilities
- Consider context window size for retrieved content
- Balance retrieval cost with model capability
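The filter-then-rank selection described above can be sketched as follows; model names, window sizes, and the 0-1 RAG scores are made up for illustration:

```go
package main

import "fmt"

// candidate pairs a model with the attributes the selection step cares about.
type candidate struct {
	Name          string
	ContextWindow int     // tokens
	RAGScore      float64 // assumed rating of RAG integration quality
}

// selectModel drops models whose context window cannot fit the prompt plus
// the retrieved context, then takes the best remaining RAG score.
func selectModel(models []candidate, promptTokens, contextTokens int) (string, bool) {
	best, found := "", false
	bestScore := -1.0
	for _, m := range models {
		if m.ContextWindow < promptTokens+contextTokens {
			continue
		}
		if m.RAGScore > bestScore {
			best, bestScore, found = m.Name, m.RAGScore, true
		}
	}
	return best, found
}

func main() {
	models := []candidate{
		{"small-8k", 8192, 0.9},
		{"large-32k", 32768, 0.7},
	}
	// Retrieved context overflows the 8k model, so the router falls back.
	name, _ := selectModel(models, 2000, 10000)
	fmt.Println(name) // large-32k
}
```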
Context Injection:
- Seamlessly inject retrieved context into model prompts
- Handle context length limitations intelligently
- Implement context summarization when needed
6. Implementation Components
New Services:
```go
type RAGService struct {
	knowledgeBases  map[string]KnowledgeBase
	retrievalEngine *RetrievalEngine
	decisionEngine  *RAGDecisionEngine
	contextManager  *ContextManager
}

type RetrievalEngine struct {
	vectorStores     map[string]VectorStore
	embeddingService *EmbeddingService
	rerankingService *RerankingService
}
```
Router Integration:
- Extend OpenAIRouter with RAG capabilities
- Add retrieval step before model routing
- Implement context-aware caching
API Extensions:
- New /classify/rag endpoint for RAG-aware classification
- Enhanced /chat/completions with automatic RAG integration
- RAG decision explanation endpoints for debugging
7. Performance Optimizations
Caching Strategy:
- Cache retrieval results with semantic similarity
- Implement multi-level caching (query → retrieval → response)
- Smart cache invalidation based on knowledge freshness
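The query → retrieval → response hierarchy can be sketched as below, with exact-match maps standing in for the semantic-similarity lookups the proposal describes:

```go
package main

import "fmt"

// MultiLevelCache checks the cheapest level first and falls through:
// a response hit skips both retrieval and generation; a retrieval hit
// skips only retrieval.
type MultiLevelCache struct {
	responses  map[string]string   // query → final response
	retrievals map[string][]string // query → retrieved chunks
}

func (c *MultiLevelCache) Lookup(query string) (response string, chunks []string, level string) {
	if r, ok := c.responses[query]; ok {
		return r, nil, "response"
	}
	if ch, ok := c.retrievals[query]; ok {
		return "", ch, "retrieval"
	}
	return "", nil, "miss"
}

func main() {
	c := &MultiLevelCache{
		responses:  map[string]string{"q1": "cached answer"},
		retrievals: map[string][]string{"q2": {"cached chunk"}},
	}
	_, _, level := c.Lookup("q2")
	fmt.Println(level) // retrieval
}
```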
Retrieval Optimization:
- Parallel retrieval from multiple knowledge bases
- Asynchronous retrieval with streaming responses
- Query expansion and refinement
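Parallel retrieval across knowledge bases is a natural fit for goroutines; in this sketch, fetch is a stand-in for a real vector-store client call:

```go
package main

import (
	"fmt"
	"sync"
)

// retrieveAll queries every knowledge base concurrently and merges the
// results under a mutex; result order is not guaranteed.
func retrieveAll(bases []string, fetch func(base string) []string) []string {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out []string
	)
	for _, b := range bases {
		wg.Add(1)
		go func(base string) {
			defer wg.Done()
			docs := fetch(base)
			mu.Lock()
			out = append(out, docs...)
			mu.Unlock()
		}(b)
	}
	wg.Wait()
	return out
}

func main() {
	docs := retrieveAll([]string{"general_knowledge", "technical_docs"},
		func(base string) []string { return []string{base + "/doc1"} })
	fmt.Println(len(docs)) // 2
}
```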
Acceptance Criteria
- RAG-aware categories can be configured with retrieval strategies
- Decision engine intelligently determines when to retrieve knowledge
- Multiple vector database backends are supported
- Retrieved context is seamlessly integrated into model prompts
- RAG decisions are cached and reused appropriately
- Performance impact is minimized through intelligent caching
- API maintains OpenAI compatibility while adding RAG capabilities
- Comprehensive metrics and observability for RAG operations