
Commit 0b323ed (parent: cb15748)

more

Signed-off-by: bitliu <[email protected]>

website/blog/2025-10-16-mom-family.md

Lines changed: 40 additions & 40 deletions
```diff
@@ -19,26 +19,26 @@ vLLM-SR solves a critical problem: **how to route LLM requests to the right model**
 
 A quick overview of all MoM models:
 
-| Category | Model | Size | Base Model | Latency | Purpose |
-|----------|-------|------|------------|---------|---------|
-| **🧠 Intelligent Routing** | mom-brain-flash | Flash | ModernBERT | <10ms | Ultra-fast intent classification |
-| | mom-brain-pro | Pro | Qwen 0.6B | ~30-50ms | Balanced routing with reasoning |
-| | mom-brain-max | Max | Qwen 1.7B | ~50-100ms | Maximum accuracy for complex decisions |
-| **🔍 Similarity Search** | mom-similarity-flash | Flash | ModernBERT | <10ms | Semantic similarity matching |
-| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | ModernBERT | <10ms | Jailbreak/attack detection |
-| | mom-pii-flash | Flash | ModernBERT | <10ms | PII detection & privacy protection |
-| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Qwen 0.6B | ~30-50ms | Backend math problem solver |
-| | mom-expert-science-flash | Flash | Qwen 0.6B | ~30-50ms | Backend science problem solver |
-| | mom-expert-social-flash | Flash | Qwen 0.6B | ~30-50ms | Backend social sciences solver |
-| | mom-expert-humanities-flash | Flash | Qwen 0.6B | ~30-50ms | Backend humanities solver |
-| | mom-expert-law-flash | Flash | Qwen 0.6B | ~30-50ms | Backend law problem solver |
-| | mom-expert-generalist-flash | Flash | Qwen 0.6B | ~30-50ms | Backend generalist solver |
+| Category | Model | Size | Architecture | Base Model | Purpose |
+|----------|-------|------|--------------|------------|---------|
+| **🧠 Intelligent Routing** | mom-brain-flash | Flash | Encoder | ModernBERT | Ultra-fast intent classification |
+| | mom-brain-pro | Pro | Decoder | Qwen3 0.6B | Balanced routing with reasoning |
+| | mom-brain-max | Max | Decoder | Qwen3 1.7B | Maximum accuracy for complex decisions |
+| **🔍 Similarity Search** | mom-similarity-flash | Flash | Encoder | ModernBERT | Semantic similarity matching |
+| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | Encoder | ModernBERT | Jailbreak/attack detection |
+| | mom-pii-flash | Flash | Encoder | ModernBERT | PII detection & privacy protection |
+| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Decoder | Qwen3 0.6B | Backend math problem solver |
+| | mom-expert-science-flash | Flash | Decoder | Qwen3 0.6B | Backend science problem solver |
+| | mom-expert-social-flash | Flash | Decoder | Qwen3 0.6B | Backend social sciences solver |
+| | mom-expert-humanities-flash | Flash | Decoder | Qwen3 0.6B | Backend humanities solver |
+| | mom-expert-law-flash | Flash | Decoder | Qwen3 0.6B | Backend law problem solver |
+| | mom-expert-generalist-flash | Flash | Decoder | Qwen3 0.6B | Backend generalist solver |
 
 **Key Insights:**
 
 - **4 Categories**: 3 for routing (Intelligent Routing, Similarity Search, Prompt Guardian) + 1 for backend problem solving (SLM Experts)
 - **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput routing
-- **Qwen** (decoder-only) → Explainable routing decisions + domain-specific problem solving
+- **Qwen3** (decoder-only) → Explainable routing decisions + domain-specific problem solving
 - **Flash** models achieve 10,000+ QPS on commodity hardware
 - **SLM Experts** are not routers—they are specialized backend models that solve domain-specific problems
 
```
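The encoder/decoder split in this table is the heart of the design: ModernBERT variants answer "where should this request go?" in under 10ms, while Qwen3 variants trade latency for an explainable decision. As a minimal sketch, here is how a caller might pick among the brain variants; the helper function and its latency cutoffs are illustrative assumptions, not logic from vLLM-SR itself:

```python
# Illustrative helper only: the model names come from the table above, but
# the selection function and latency cutoffs are assumptions made for this
# sketch, not code from vLLM-SR.

def pick_brain_model(latency_budget_ms: float, need_reasoning: bool) -> str:
    """Choose a mom-brain variant for an intent-routing decision."""
    if latency_budget_ms < 30:
        return "mom-brain-flash"  # encoder-only ModernBERT, sub-10ms
    if need_reasoning and latency_budget_ms >= 100:
        return "mom-brain-max"    # Qwen3 1.7B decoder, highest accuracy
    return "mom-brain-pro"        # Qwen3 0.6B decoder, balanced

print(pick_brain_model(5, False))   # mom-brain-flash
print(pick_brain_model(150, True))  # mom-brain-max
```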

```diff
@@ -83,9 +83,9 @@ As vLLM-SR adoption grew, we encountered more diverse scenarios and requirements
 
 Our MoM approach combines encoder and decoder strengths:
 
-- ⚡ **Encoders** — Fast classification (sub-10ms latency) for high-throughput scenarios
-- 🧠 **Decoders** — Explainable decisions with reasoning for transparency
-- 🎯 **Domain Agents** — Expert routing with specialized knowledge
+- ⚡ **Encoders (ModernBERT)** — Fast classification (sub-10ms latency) for high-throughput scenarios
+- 🧠 **Decoders (Qwen3)** — Explainable decisions with reasoning for transparency
+- 🎯 **Domain Agents (Qwen3)** — Expert problem solving with specialized knowledge
 
 This hybrid architecture lets you choose the right tool for each job: speed when you need it, reasoning when it matters.
 
```
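From the client's side the hybrid is invisible: you send one request and the router decides which path serves it. A hedged usage sketch, assuming a vLLM-SR deployment that exposes an OpenAI-compatible endpoint on localhost:8801 and routes when the model is set to `auto`; the endpoint, port, and alias are assumptions to check against your deployment's config:

```python
from openai import OpenAI

# Assumed endpoint and model alias; substitute your deployment's values.
client = OpenAI(base_url="http://localhost:8801/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="auto",  # let the router pick: encoder fast path or decoder reasoning path
    messages=[{"role": "user", "content": "Prove that the sum of two even numbers is even."}],
)
print(resp.choices[0].message.content)
```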

```diff
@@ -102,10 +102,10 @@ Smart routing models with three size variants:
 | Model | Size | Base Model | Purpose |
 |-------|------|------------|---------|
 | **mom-brain-flash** | Flash | ModernBERT | Ultra-fast intent classification (sub-10ms latency) |
-| **mom-brain-pro** | Pro | Qwen 0.6B | Balanced performance with reasoning capabilities |
-| **mom-brain-max** | Max | Qwen 1.7B | Maximum accuracy for complex routing decisions |
+| **mom-brain-pro** | Pro | Qwen3 0.6B | Balanced performance with reasoning capabilities |
+| **mom-brain-max** | Max | Qwen3 1.7B | Maximum accuracy for complex routing decisions |
 
-**Architecture**: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen 0.6B and 1.7B (decoder-only) models.
+**Architecture**: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen3 0.6B and 1.7B (decoder-only) models.
 
 ### 🔍 Similarity Search
 
```

```diff
@@ -134,14 +134,14 @@ Specialized small language models deployed as **backend problem solvers**:
 
 | Model | Size | Base Model | Domain | Training Data |
 |-------|------|------------|--------|---------------|
-| **mom-expert-math-flash** | Flash | Qwen 0.6B | Mathematics | GSM8K, MATH |
-| **mom-expert-science-flash** | Flash | Qwen 0.6B | Science | ARC-Challenge, OpenBookQA, SciQ |
-| **mom-expert-social-flash** | Flash | Qwen 0.6B | Social Sciences | CommonsenseQA, StrategyQA |
-| **mom-expert-humanities-flash** | Flash | Qwen 0.6B | Humanities | TruthfulQA, MMLU-train subset |
-| **mom-expert-law-flash** | Flash | Qwen 0.6B | Law | MMLU-train law subset + specialized sources |
-| **mom-expert-generalist-flash** | Flash | Qwen 0.6B | Generalist | Mixed from above domains |
+| **mom-expert-math-flash** | Flash | Qwen3 0.6B | Mathematics | GSM8K, MATH |
+| **mom-expert-science-flash** | Flash | Qwen3 0.6B | Science | ARC-Challenge, OpenBookQA, SciQ |
+| **mom-expert-social-flash** | Flash | Qwen3 0.6B | Social Sciences | CommonsenseQA, StrategyQA |
+| **mom-expert-humanities-flash** | Flash | Qwen3 0.6B | Humanities | TruthfulQA, MMLU-train subset |
+| **mom-expert-law-flash** | Flash | Qwen3 0.6B | Law | MMLU-train law subset + specialized sources |
+| **mom-expert-generalist-flash** | Flash | Qwen3 0.6B | Generalist | Mixed from above domains |
 
-**Architecture**: All based on Qwen 0.6B (decoder-only) for domain-specific problem solving. Currently only Flash variants are available.
+**Architecture**: All based on Qwen3 0.6B (decoder-only) for domain-specific problem solving. Currently only Flash variants are available.
 
 **Purpose**: These models are **not routers**—they are deployed as backend LLMs to solve domain-specific problems. They form part of the Mixture-of-Models backend architecture that vLLM-SR routes to.
 
```
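Because the experts are ordinary decoder checkpoints rather than router components, calling one looks exactly like calling any other backend LLM. A sketch assuming `mom-expert-math-flash` is already served behind an OpenAI-compatible endpoint (for example by vLLM) on port 8000; the port and served-model name are illustrative:

```python
from openai import OpenAI

# Assumed: the expert is already served OpenAI-compatibly on :8000.
math_expert = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = math_expert.chat.completions.create(
    model="mom-expert-math-flash",  # Qwen3 0.6B tuned on GSM8K/MATH per the table above
    messages=[{"role": "user", "content": "A train covers 120 km in 2 hours. What is its average speed?"}],
)
print(resp.choices[0].message.content)
```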

```diff
@@ -173,17 +173,17 @@ The router then directs requests to the optimal backend LLM from a mixture of models
 
 **General-Purpose LLMs**:
 
-- **Simple queries** → Lightweight models (Llama 3.2, Qwen 2.5)
+- **Simple queries** → Lightweight models (Llama 3.2, Qwen3 2.5)
 - **Complex queries** → Premium models (GPT-4, Claude 3.5)
 
 **Domain-Specific SLM Experts** (`mom-expert-*`):
 
-- **Math problems** → `mom-expert-math-flash` (trained on GSM8K, MATH)
-- **Science questions** → `mom-expert-science-flash` (trained on ARC, SciQ)
-- **Social sciences** → `mom-expert-social-flash` (CommonsenseQA, StrategyQA)
-- **Humanities** → `mom-expert-humanities-flash` (TruthfulQA, MMLU)
-- **Legal queries** → `mom-expert-law-flash` (MMLU law + specialized sources)
-- **General tasks** → `mom-expert-generalist-flash` (mixed training)
+- **Math problems** → `mom-expert-math-flash` (Qwen3 0.6B trained on GSM8K, MATH)
+- **Science questions** → `mom-expert-science-flash` (Qwen3 0.6B trained on ARC, SciQ)
+- **Social sciences** → `mom-expert-social-flash` (Qwen3 0.6B on CommonsenseQA, StrategyQA)
+- **Humanities** → `mom-expert-humanities-flash` (Qwen3 0.6B on TruthfulQA, MMLU)
+- **Legal queries** → `mom-expert-law-flash` (Qwen3 0.6B on MMLU law + specialized sources)
+- **General tasks** → `mom-expert-generalist-flash` (Qwen3 0.6B on mixed training)
 
 This dual-level MoM architecture achieves **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).
 
```
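The dual-level dispatch above reduces to a small decision table: level 1 separates simple from complex general-purpose queries, and level 2 short-circuits to a domain expert whenever the detected intent matches one. A toy rendering of that table; the domain labels and the placeholder backend names are assumptions for illustration:

```python
# Toy decision table for the dual-level MoM dispatch described above.
# Expert names come from the diff; "lightweight-llm"/"premium-llm" are
# placeholders for whatever general-purpose backends a deployment configures.

EXPERTS = {
    "math": "mom-expert-math-flash",
    "science": "mom-expert-science-flash",
    "social": "mom-expert-social-flash",
    "humanities": "mom-expert-humanities-flash",
    "law": "mom-expert-law-flash",
}

def choose_backend(domain: str | None, is_complex: bool) -> str:
    if domain in EXPERTS:
        return EXPERTS[domain]  # level 2: route to a domain expert
    if is_complex:
        return "premium-llm"    # level 1: e.g. GPT-4 / Claude 3.5
    return "lightweight-llm"    # level 1: e.g. Llama 3.2

print(choose_backend("law", False))  # mom-expert-law-flash
print(choose_backend(None, True))    # premium-llm
```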

```diff
@@ -243,16 +243,16 @@ mom-expert-{domain}-{size}
 
 ### Three Size Variants
 
-- **flash**: ModernBERT-based (for brain/similarity/guardian) or Qwen 0.6B (for experts) — fastest, sub-10ms latency
-- **pro**: Qwen 0.6B (for brain) — balanced performance with reasoning
-- **max**: Qwen 1.7B (for brain) — maximum accuracy and capabilities
+- **flash**: ModernBERT-based (for brain/similarity/guardian) or Qwen3 0.6B (for experts) — fastest, sub-10ms latency
+- **pro**: Qwen3 0.6B (for brain) — balanced performance with reasoning
+- **max**: Qwen3 1.7B (for brain) — maximum accuracy and capabilities
 
 ### Architecture Summary
 
-- **Intelligent Routing**: Flash (ModernBERT) + Pro/Max (Qwen 0.6B/1.7B)
+- **Intelligent Routing**: Flash (ModernBERT) + Pro/Max (Qwen3 0.6B/1.7B)
 - **Similarity Search**: Flash (ModernBERT)
 - **Prompt Guardian**: Flash (ModernBERT)
-- **SLM Experts**: Flash only (Qwen 0.6B) — 6 domain specialists
+- **SLM Experts**: Flash only (Qwen3 0.6B) — 6 domain specialists
 
 ## Get Started
 
```
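The naming convention this hunk documents (`mom-expert-{domain}-{size}` for experts, and by the same pattern `mom-{function}-{size}` for the routing models) is regular enough to parse mechanically. A small illustrative helper, not part of the vLLM-SR codebase:

```python
# Parses MoM model names under the assumption that the scheme above holds
# for every model listed in the diff; purely illustrative.
import re

NAME_RE = re.compile(
    r"^mom-(?:expert-(?P<domain>[a-z]+)|(?P<function>[a-z]+))-(?P<size>flash|pro|max)$"
)

def parse_mom_name(name: str) -> dict:
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"not a MoM model name: {name}")
    return {k: v for k, v in m.groupdict().items() if v}

print(parse_mom_name("mom-brain-max"))         # {'function': 'brain', 'size': 'max'}
print(parse_mom_name("mom-expert-law-flash"))  # {'domain': 'law', 'size': 'flash'}
```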
