@@ -19,26 +19,26 @@ vLLM-SR solves a critical problem: **how to route LLM requests to the right mode
 
 A quick overview of all MoM models:
 
-| Category | Model | Size | Base Model | Latency | Purpose |
-| -------- | ----- | ---- | ---------- | ------- | ------- |
-| **🧠 Intelligent Routing** | mom-brain-flash | Flash | ModernBERT | <10ms | Ultra-fast intent classification |
-| | mom-brain-pro | Pro | Qwen 0.6B | ~30-50ms | Balanced routing with reasoning |
-| | mom-brain-max | Max | Qwen 1.7B | ~50-100ms | Maximum accuracy for complex decisions |
-| **🔍 Similarity Search** | mom-similarity-flash | Flash | ModernBERT | <10ms | Semantic similarity matching |
-| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | ModernBERT | <10ms | Jailbreak/attack detection |
-| | mom-pii-flash | Flash | ModernBERT | <10ms | PII detection & privacy protection |
-| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Qwen 0.6B | ~30-50ms | Backend math problem solver |
-| | mom-expert-science-flash | Flash | Qwen 0.6B | ~30-50ms | Backend science problem solver |
-| | mom-expert-social-flash | Flash | Qwen 0.6B | ~30-50ms | Backend social sciences solver |
-| | mom-expert-humanities-flash | Flash | Qwen 0.6B | ~30-50ms | Backend humanities solver |
-| | mom-expert-law-flash | Flash | Qwen 0.6B | ~30-50ms | Backend law problem solver |
-| | mom-expert-generalist-flash | Flash | Qwen 0.6B | ~30-50ms | Backend generalist solver |
+| Category | Model | Size | Architecture | Base Model | Purpose |
+| -------- | ----- | ---- | ------------ | ---------- | ------- |
+| **🧠 Intelligent Routing** | mom-brain-flash | Flash | Encoder | ModernBERT | Ultra-fast intent classification |
+| | mom-brain-pro | Pro | Decoder | Qwen3 0.6B | Balanced routing with reasoning |
+| | mom-brain-max | Max | Decoder | Qwen3 1.7B | Maximum accuracy for complex decisions |
+| **🔍 Similarity Search** | mom-similarity-flash | Flash | Encoder | ModernBERT | Semantic similarity matching |
+| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | Encoder | ModernBERT | Jailbreak/attack detection |
+| | mom-pii-flash | Flash | Encoder | ModernBERT | PII detection & privacy protection |
+| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Decoder | Qwen3 0.6B | Backend math problem solver |
+| | mom-expert-science-flash | Flash | Decoder | Qwen3 0.6B | Backend science problem solver |
+| | mom-expert-social-flash | Flash | Decoder | Qwen3 0.6B | Backend social sciences solver |
+| | mom-expert-humanities-flash | Flash | Decoder | Qwen3 0.6B | Backend humanities solver |
+| | mom-expert-law-flash | Flash | Decoder | Qwen3 0.6B | Backend law problem solver |
+| | mom-expert-generalist-flash | Flash | Decoder | Qwen3 0.6B | Backend generalist solver |
 
 **Key Insights:**
 
 - **4 Categories**: 3 for routing (Intelligent Routing, Similarity Search, Prompt Guardian) + 1 for backend problem solving (SLM Experts)
 - **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput routing
-- **Qwen** (decoder-only) → Explainable routing decisions + domain-specific problem solving
+- **Qwen3** (decoder-only) → Explainable routing decisions + domain-specific problem solving
 - **Flash** models achieve 10,000+ QPS on commodity hardware
 - **SLM Experts** are not routers—they are specialized backend models that solve domain-specific problems
 
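Seen from the client side, vLLM-SR fronts an OpenAI-compatible endpoint, so routing stays transparent to callers. A minimal sketch, assuming a local gateway address and an `auto` model alias (both illustrative, not confirmed defaults):

```python
# Illustrative client call against a locally running vLLM-SR gateway.
# The URL and the "auto" model alias are assumptions for this sketch.
import requests

resp = requests.post(
    "http://localhost:8801/v1/chat/completions",  # assumed gateway address
    json={
        "model": "auto",  # assumed alias that defers model choice to the router
        "messages": [{"role": "user", "content": "Solve 2x + 3 = 11 for x"}],
    },
    timeout=30,
)
print(resp.json()["choices"][0]["message"]["content"])
```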
@@ -83,9 +83,9 @@ As vLLM-SR adoption grew, we encountered more diverse scenarios and requirements
 
 Our MoM approach combines encoder and decoder strengths:
 
-- ⚡ **Encoders** — Fast classification (sub-10ms latency) for high-throughput scenarios
-- 🧠 **Decoders** — Explainable decisions with reasoning for transparency
-- 🎯 **Domain Agents** — Expert routing with specialized knowledge
+- ⚡ **Encoders (ModernBERT)** — Fast classification (sub-10ms latency) for high-throughput scenarios
+- 🧠 **Decoders (Qwen3)** — Explainable decisions with reasoning for transparency
+- 🎯 **Domain Agents (Qwen3)** — Expert problem solving with specialized knowledge
 
 This hybrid architecture lets you choose the right tool for each job: speed when you need it, reasoning when it matters.
 
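The sketch below renders that choice as a toy dispatcher; the thresholds and the rule itself are illustrative assumptions, not project policy:

```python
# Illustrative only: choose a routing model by latency budget.
# Thresholds are invented for the example; tune against real measurements.
def pick_routing_model(latency_budget_ms: float, need_explanation: bool) -> str:
    if latency_budget_ms < 30 and not need_explanation:
        return "mom-brain-flash"   # encoder: fastest, classification only
    if latency_budget_ms < 80:
        return "mom-brain-pro"     # small decoder: routing with reasoning
    return "mom-brain-max"         # larger decoder: maximum accuracy

print(pick_routing_model(10, False))  # -> mom-brain-flash
print(pick_routing_model(60, True))   # -> mom-brain-pro
```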
@@ -102,10 +102,10 @@ Smart routing models with three size variants:
 | Model | Size | Base Model | Purpose |
 | ----- | ---- | ---------- | ------- |
 | **mom-brain-flash** | Flash | ModernBERT | Ultra-fast intent classification (sub-10ms latency) |
-| **mom-brain-pro** | Pro | Qwen 0.6B | Balanced performance with reasoning capabilities |
-| **mom-brain-max** | Max | Qwen 1.7B | Maximum accuracy for complex routing decisions |
+| **mom-brain-pro** | Pro | Qwen3 0.6B | Balanced performance with reasoning capabilities |
+| **mom-brain-max** | Max | Qwen3 1.7B | Maximum accuracy for complex routing decisions |
 
-**Architecture**: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen 0.6B and 1.7B (decoder-only) models.
+**Architecture**: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen3 0.6B and 1.7B (decoder-only) models.
 
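Because the Flash variant is a plain encoder classifier, it can be exercised like any sequence-classification checkpoint. A minimal sketch, assuming a hypothetical Hugging Face repo id and label set:

```python
# Hedged sketch: run an encoder routing model as a text classifier.
# "llm-semantic-router/mom-brain-flash" is a hypothetical repo id.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "llm-semantic-router/mom-brain-flash"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)

inputs = tokenizer("Prove that sqrt(2) is irrational", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])  # e.g. "math"
```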
 ### 🔍 Similarity Search
 
@@ -134,14 +134,14 @@ Specialized small language models deployed as **backend problem solvers**:
 
 | Model | Size | Base Model | Domain | Training Data |
 | ----- | ---- | ---------- | ------ | ------------- |
-| **mom-expert-math-flash** | Flash | Qwen 0.6B | Mathematics | GSM8K, MATH |
-| **mom-expert-science-flash** | Flash | Qwen 0.6B | Science | ARC-Challenge, OpenBookQA, SciQ |
-| **mom-expert-social-flash** | Flash | Qwen 0.6B | Social Sciences | CommonsenseQA, StrategyQA |
-| **mom-expert-humanities-flash** | Flash | Qwen 0.6B | Humanities | TruthfulQA, MMLU-train subset |
-| **mom-expert-law-flash** | Flash | Qwen 0.6B | Law | MMLU-train law subset + specialized sources |
-| **mom-expert-generalist-flash** | Flash | Qwen 0.6B | Generalist | Mixed from above domains |
+| **mom-expert-math-flash** | Flash | Qwen3 0.6B | Mathematics | GSM8K, MATH |
+| **mom-expert-science-flash** | Flash | Qwen3 0.6B | Science | ARC-Challenge, OpenBookQA, SciQ |
+| **mom-expert-social-flash** | Flash | Qwen3 0.6B | Social Sciences | CommonsenseQA, StrategyQA |
+| **mom-expert-humanities-flash** | Flash | Qwen3 0.6B | Humanities | TruthfulQA, MMLU-train subset |
+| **mom-expert-law-flash** | Flash | Qwen3 0.6B | Law | MMLU-train law subset + specialized sources |
+| **mom-expert-generalist-flash** | Flash | Qwen3 0.6B | Generalist | Mixed from above domains |
 
-**Architecture**: All based on Qwen 0.6B (decoder-only) for domain-specific problem solving. Currently only Flash variants are available.
+**Architecture**: All based on Qwen3 0.6B (decoder-only) for domain-specific problem solving. Currently only Flash variants are available.
 
 **Purpose**: These models are **not routers**—they are deployed as backend LLMs to solve domain-specific problems. They form part of the Mixture-of-Models backend architecture that vLLM-SR routes to.
 
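Since the experts are ordinary decoder checkpoints, any vLLM deployment path works for hosting them. A minimal sketch using vLLM's offline API, with a hypothetical repo id:

```python
# Hedged sketch: host an SLM expert as a backend with vLLM's offline API.
# The repo id is a hypothetical placeholder for the published checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="llm-semantic-router/mom-expert-math-flash")  # hypothetical id
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["Solve: if 3x - 7 = 11, what is x?"], params)
print(outputs[0].outputs[0].text)
```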
@@ -173,17 +173,17 @@ The router then directs requests to the optimal backend LLM from a mixture of mo
 
 **General-Purpose LLMs**:
 
 - **Simple queries** → Lightweight models (Llama 3.2, Qwen 2.5)
 - **Complex queries** → Premium models (GPT-4, Claude 3.5)
 
 **Domain-Specific SLM Experts** (`mom-expert-*`):
 
-- **Math problems** → `mom-expert-math-flash` (trained on GSM8K, MATH)
-- **Science questions** → `mom-expert-science-flash` (trained on ARC, SciQ)
-- **Social sciences** → `mom-expert-social-flash` (CommonsenseQA, StrategyQA)
-- **Humanities** → `mom-expert-humanities-flash` (TruthfulQA, MMLU)
-- **Legal queries** → `mom-expert-law-flash` (MMLU law + specialized sources)
-- **General tasks** → `mom-expert-generalist-flash` (mixed training)
+- **Math problems** → `mom-expert-math-flash` (Qwen3 0.6B trained on GSM8K, MATH)
+- **Science questions** → `mom-expert-science-flash` (Qwen3 0.6B trained on ARC, SciQ)
+- **Social sciences** → `mom-expert-social-flash` (Qwen3 0.6B on CommonsenseQA, StrategyQA)
+- **Humanities** → `mom-expert-humanities-flash` (Qwen3 0.6B on TruthfulQA, MMLU)
+- **Legal queries** → `mom-expert-law-flash` (Qwen3 0.6B on MMLU law + specialized sources)
+- **General tasks** → `mom-expert-generalist-flash` (Qwen3 0.6B on mixed training)
 
 This dual-level MoM architecture achieves **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).
 
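A toy rendering of that two-level decision; the domain table, fallback models, and complexity flag are illustrative assumptions, not the router's actual policy:

```python
# Toy two-level MoM routing: pick a domain expert when one exists,
# otherwise fall back to a general model sized to query complexity.
# All mappings here are illustrative, not the production policy.
DOMAIN_EXPERTS = {
    "math": "mom-expert-math-flash",
    "science": "mom-expert-science-flash",
    "law": "mom-expert-law-flash",
}

def route(domain: str, is_complex: bool) -> str:
    if domain in DOMAIN_EXPERTS:
        return DOMAIN_EXPERTS[domain]
    return "gpt-4" if is_complex else "llama-3.2-3b"  # placeholder ids

print(route("math", False))   # -> mom-expert-math-flash
print(route("poetry", True))  # -> gpt-4
```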
@@ -243,16 +243,16 @@ mom-expert-{domain}-{size}
 
 ### Three Size Variants
 
-- **flash**: ModernBERT-based (for brain/similarity/guardian) or Qwen 0.6B (for experts) — fastest, sub-10ms latency
-- **pro**: Qwen 0.6B (for brain) — balanced performance with reasoning
-- **max**: Qwen 1.7B (for brain) — maximum accuracy and capabilities
+- **flash**: ModernBERT-based (for brain/similarity/guardian) or Qwen3 0.6B (for experts) — fastest, sub-10ms latency
+- **pro**: Qwen3 0.6B (for brain) — balanced performance with reasoning
+- **max**: Qwen3 1.7B (for brain) — maximum accuracy and capabilities
 
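The naming scheme is regular enough to parse mechanically. A small sketch (the helper is written for this post, not part of the vLLM-SR codebase):

```python
# Parse MoM model ids of the form mom-{category}[-{domain}]-{size}.
# Helper written for illustration; not part of the vLLM-SR codebase.
import re

PATTERN = re.compile(
    r"^mom-(?P<category>brain|similarity|jailbreak|pii|expert)"
    r"(?:-(?P<domain>[a-z]+))?-(?P<size>flash|pro|max)$"
)

def parse_mom_id(name: str) -> dict:
    m = PATTERN.match(name)
    if not m:
        raise ValueError(f"not a MoM model id: {name}")
    return m.groupdict()

print(parse_mom_id("mom-brain-max"))         # category=brain, size=max
print(parse_mom_id("mom-expert-law-flash"))  # category=expert, domain=law
```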
 ### Architecture Summary
 
-- **Intelligent Routing**: Flash (ModernBERT) + Pro/Max (Qwen 0.6B/1.7B)
+- **Intelligent Routing**: Flash (ModernBERT) + Pro/Max (Qwen3 0.6B/1.7B)
 - **Similarity Search**: Flash (ModernBERT)
 - **Prompt Guardian**: Flash (ModernBERT)
-- **SLM Experts**: Flash only (Qwen 0.6B) — 6 domain specialists
+- **SLM Experts**: Flash only (Qwen3 0.6B) — 6 domain specialists
 
 ## Get Started
 