
Commit cb15748

more
Signed-off-by: bitliu <[email protected]>
1 parent 4eb3d28 commit cb15748


website/blog/2025-10-16-mom-family.md

Lines changed: 40 additions & 20 deletions
@@ -27,15 +27,20 @@ A quick overview of all MoM models:
 | **🔍 Similarity Search** | mom-similarity-flash | Flash | ModernBERT | &lt;10ms | Semantic similarity matching |
 | **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | ModernBERT | &lt;10ms | Jailbreak/attack detection |
 | | mom-pii-flash | Flash | ModernBERT | &lt;10ms | PII detection & privacy protection |
-| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Qwen 0.6B | ~30-50ms | Mathematics routing |
-| | mom-expert-math-pro | Pro | Qwen 1.7B | ~50-100ms | Advanced math with reasoning |
+| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Qwen 0.6B | ~30-50ms | Backend math problem solver |
+| | mom-expert-science-flash | Flash | Qwen 0.6B | ~30-50ms | Backend science problem solver |
+| | mom-expert-social-flash | Flash | Qwen 0.6B | ~30-50ms | Backend social sciences solver |
+| | mom-expert-humanities-flash | Flash | Qwen 0.6B | ~30-50ms | Backend humanities solver |
+| | mom-expert-law-flash | Flash | Qwen 0.6B | ~30-50ms | Backend law problem solver |
+| | mom-expert-generalist-flash | Flash | Qwen 0.6B | ~30-50ms | Backend generalist solver |

 **Key Insights:**

-- **4 Categories** × **3 Size Variants** = Flexible routing architecture
-- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput scenarios
-- **Qwen** (decoder-only) → Explainable decisions with reasoning capabilities
+- **4 Categories**: 3 for routing (Intelligent Routing, Similarity Search, Prompt Guardian) + 1 for backend problem solving (SLM Experts)
+- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput routing
+- **Qwen** (decoder-only) → Explainable routing decisions + domain-specific problem solving
 - **Flash** models achieve 10,000+ QPS on commodity hardware
+- **SLM Experts** are not routers—they are specialized backend models that solve domain-specific problems
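The category split above is easy to mirror as plain data. The sketch below is illustrative only (the `MOM_CATALOG` dictionary is a hypothetical layout, not an actual vLLM-SR artifact); the model IDs come from the overview table.

```python
# Hypothetical catalog of the MoM family as plain data; the dict layout is
# illustrative only, but the model IDs match the overview table above.
MOM_CATALOG = {
    "intelligent_routing": ["mom-brain-flash", "mom-brain-pro", "mom-brain-max"],
    "similarity_search": ["mom-similarity-flash"],
    "prompt_guardian": ["mom-jailbreak-flash", "mom-pii-flash"],
    # SLM Experts are backend problem solvers, not routers.
    "slm_experts": [
        "mom-expert-math-flash",
        "mom-expert-science-flash",
        "mom-expert-social-flash",
        "mom-expert-humanities-flash",
        "mom-expert-law-flash",
        "mom-expert-generalist-flash",
    ],
}

assert len(MOM_CATALOG) == 4  # 3 routing categories + 1 backend category
```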

 ## The Evolution: From Encoder-Only to Mixture-of-Models

@@ -125,22 +130,28 @@ Security and safety checks before routing:

 ### 🎯 SLM Experts

-Specialized small language models for domain-specific routing:
+Specialized small language models deployed as **backend problem solvers**:

-| Model | Size | Base Model | Domain |
-|-------|------|------------|--------|
-| **mom-expert-math-flash** | Flash | Qwen 0.6B | Mathematics (algebra, calculus, statistics) |
-| **mom-expert-math-pro** | Pro | Qwen 1.7B | Advanced mathematics with reasoning |
+| Model | Size | Base Model | Domain | Training Data |
+|-------|------|------------|--------|---------------|
+| **mom-expert-math-flash** | Flash | Qwen 0.6B | Mathematics | GSM8K, MATH |
+| **mom-expert-science-flash** | Flash | Qwen 0.6B | Science | ARC-Challenge, OpenBookQA, SciQ |
+| **mom-expert-social-flash** | Flash | Qwen 0.6B | Social Sciences | CommonsenseQA, StrategyQA |
+| **mom-expert-humanities-flash** | Flash | Qwen 0.6B | Humanities | TruthfulQA, MMLU-train subset |
+| **mom-expert-law-flash** | Flash | Qwen 0.6B | Law | MMLU-train law subset + specialized sources |
+| **mom-expert-generalist-flash** | Flash | Qwen 0.6B | Generalist | Mixed from above domains |

-**Architecture**: Based on Qwen models (decoder-only) for domain-specific reasoning capabilities.
+**Architecture**: All based on Qwen 0.6B (decoder-only) for domain-specific problem solving. Currently only Flash variants are available.
+
+**Purpose**: These models are **not routers**—they are deployed as backend LLMs to solve domain-specific problems. They form part of the Mixture-of-Models backend architecture that vLLM-SR routes to.
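Because the experts are ordinary backend LLMs, calling one looks like calling any served model. A minimal sketch, assuming the expert is deployed behind an OpenAI-compatible endpoint (as a vLLM server provides); the base URL, API key, and deployment details are placeholders, not documented values.

```python
# Minimal sketch: query an SLM Expert served as a backend model via an
# OpenAI-compatible endpoint (e.g. a vLLM server). The base_url and
# api_key below are placeholders, not real deployment values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="mom-expert-math-flash",  # a backend problem solver, not a router
    messages=[{"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}],
)
print(resp.choices[0].message.content)
```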

 ## Design Principles

 **Safety-First**: Prompt Guardian models (PII, jailbreak detection) run before routing—security at the edge.

 **Speed ↔ Capability**: Choose Flash for sub-10ms latency, Pro for balanced performance, or Max for maximum accuracy. Different sizes, different SLAs.

-**Domain Expertise**: SLM Expert models achieve 15-25% better accuracy on domain-specific tasks vs. generalist routing. Math queries go to math experts.
+**Domain Expertise**: SLM Expert models are deployed as backend problem solvers, achieving 15-25% better accuracy on domain-specific tasks vs. generalist LLMs. Math problems are solved by math experts, science questions by science experts, etc.
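To make the Speed ↔ Capability principle concrete, here is a hypothetical size-variant picker; the helper and its latency thresholds are illustrative only, loosely following the latency figures quoted in the tables above.

```python
# Hypothetical helper, not part of vLLM-SR: choose a brain-model size
# variant from a latency budget, following the Flash/Pro/Max trade-off.
def pick_brain_variant(latency_budget_ms: float) -> str:
    if latency_budget_ms < 10:
        return "mom-brain-flash"  # ModernBERT encoder, sub-10ms
    if latency_budget_ms < 100:
        return "mom-brain-pro"    # Qwen 0.6B, balanced with reasoning
    return "mom-brain-max"        # Qwen 1.7B, maximum accuracy

print(pick_brain_variant(5))    # mom-brain-flash
print(pick_brain_variant(500))  # mom-brain-max
```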

 ## How vLLM-SR Uses MoM

@@ -153,21 +164,30 @@ The router itself is a mixture of specialized models working together in a pipel
 1. **Security Check** → `mom-jailbreak-flash` and `mom-pii-flash` filter malicious/sensitive requests
 2. **Intent Classification** → `mom-brain-*` models (flash/pro/max) determine query type and routing decisions
 3. **Similarity Search** → `mom-similarity-flash` finds semantically similar routes
-4. **Domain Routing** → `mom-expert-*` models route specialized queries to optimal downstream models

-Each stage uses the **right model for the right task**: fast encoders for security checks, reasoning decoders for complex decisions, domain experts for specialized queries.
+Each stage uses the **right model for the right task**: fast encoders for security checks, reasoning decoders for complex decisions.
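A minimal runnable sketch of this Level-1 pipeline follows; every function is a hypothetical stand-in for the corresponding mom-* model call, not vLLM-SR's actual interface.

```python
# Sketch of the Level-1 pipeline; the stage functions are hypothetical
# stand-ins for the mom-* model calls, stubbed so the flow is runnable.

def passes_security(text: str) -> bool:
    """Stand-in for mom-jailbreak-flash + mom-pii-flash."""
    return "ignore previous instructions" not in text.lower()

def classify_intent(text: str) -> str:
    """Stand-in for mom-brain-{flash|pro|max} intent classification."""
    return "math" if any(ch.isdigit() for ch in text) else "general"

def nearest_route(text: str) -> str:
    """Stand-in for mom-similarity-flash semantic matching."""
    return "default-route"

def route(request: str) -> str:
    if not passes_security(request):      # 1. Security check
        return "rejected"
    intent = classify_intent(request)     # 2. Intent classification
    matched = nearest_route(request)      # 3. Similarity search
    return f"{matched} (intent={intent})"

print(route("What is 17 * 24?"))  # default-route (intent=math)
```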

 ### Level 2: Backend LLM Orchestration (MoM Outside)

-The router then directs requests to the optimal backend LLM:
+The router then directs requests to the optimal backend LLM from a mixture of models:
+
+**General-Purpose LLMs**:

 - **Simple queries** → Lightweight models (Llama 3.2, Qwen 2.5)
 - **Complex queries** → Premium models (GPT-4, Claude 3.5)
-- **Domain-specific** → Specialized models (Code Llama, Mistral Math)
+
+**Domain-Specific SLM Experts** (`mom-expert-*`):
+
+- **Math problems** → `mom-expert-math-flash` (trained on GSM8K, MATH)
+- **Science questions** → `mom-expert-science-flash` (trained on ARC, SciQ)
+- **Social sciences** → `mom-expert-social-flash` (CommonsenseQA, StrategyQA)
+- **Humanities** → `mom-expert-humanities-flash` (TruthfulQA, MMLU)
+- **Legal queries** → `mom-expert-law-flash` (MMLU law + specialized sources)
+- **General tasks** → `mom-expert-generalist-flash` (mixed training)

 This dual-level MoM architecture achieves **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).

-**The Philosophy**: Mixture-of-Models all the way down—from the router's internal decision-making to the backend LLM selection.
+**The Philosophy**: Mixture-of-Models all the way down—from the router's internal decision-making to the backend LLM pool (including both general-purpose LLMs and specialized SLM experts).
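A hypothetical Level-2 dispatch table, assuming the Level-1 intent labels line up with the six expert domains; the mapping, fallback names, and complexity flag are illustrative, mirroring the bullets above rather than quoting vLLM-SR's configuration.

```python
# Illustrative Level-2 dispatch: domain intents go to SLM Experts,
# everything else falls back to general-purpose LLMs by complexity.
EXPERT_BACKENDS = {
    "math": "mom-expert-math-flash",
    "science": "mom-expert-science-flash",
    "social": "mom-expert-social-flash",
    "humanities": "mom-expert-humanities-flash",
    "law": "mom-expert-law-flash",
}

def pick_backend(intent: str, complexity: str) -> str:
    if intent in EXPERT_BACKENDS:
        return EXPERT_BACKENDS[intent]          # domain-specific solver
    if complexity == "simple":
        return "llama-3.2"                      # lightweight general model
    if complexity == "complex":
        return "gpt-4"                          # premium general model
    return "mom-expert-generalist-flash"        # mixed-domain fallback

print(pick_backend("law", "simple"))       # mom-expert-law-flash
print(pick_backend("general", "complex"))  # gpt-4
```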

 ## What's Next: Exploring Frontier Techniques

@@ -219,20 +239,20 @@ mom-expert-{domain}-{size}
 1. **Intelligent Routing**: `mom-brain-{flash|pro|max}`
 2. **Similarity Search**: `mom-similarity-{flash}`
 3. **Prompt Guardian**: `mom-{jailbreak|pii}-{flash}`
-4. **SLM Experts**: `mom-expert-{domain}-{flash|pro}`
+4. **SLM Experts**: `mom-expert-{domain}-{flash}` where domain = `{math|science|social|humanities|law|generalist}`

 ### Three Size Variants

 - **flash**: ModernBERT-based (for brain/similarity/guardian) or Qwen 0.6B (for experts) — fastest tier; sub-10ms latency for the ModernBERT-based models
-- **pro**: Qwen 0.6B (for brain) or Qwen 1.7B (for experts) — balanced performance with reasoning
+- **pro**: Qwen 0.6B (for brain) — balanced performance with reasoning
 - **max**: Qwen 1.7B (for brain) — maximum accuracy and capabilities

 ### Architecture Summary

 - **Intelligent Routing**: Flash (ModernBERT) + Pro/Max (Qwen 0.6B/1.7B)
 - **Similarity Search**: Flash (ModernBERT)
 - **Prompt Guardian**: Flash (ModernBERT)
-- **SLM Experts**: Flash/Pro (Qwen 0.6B/1.7B)
+- **SLM Experts**: Flash only (Qwen 0.6B) — 6 domain specialists
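The naming scheme is regular enough to parse mechanically. Here is a hypothetical parser for `mom-*` model IDs, illustrative only and not part of any vLLM-SR tooling.

```python
# Hypothetical parser for the mom-* naming convention summarized above:
# mom-{brain|similarity|jailbreak|pii}-{size} or mom-expert-{domain}-{size}.
def parse_mom_name(name: str) -> dict:
    parts = name.split("-")
    if parts[0] != "mom":
        raise ValueError(f"not a MoM model name: {name}")
    if parts[1] == "expert":
        return {"category": "slm_expert", "domain": parts[2], "size": parts[3]}
    return {"category": parts[1], "size": parts[2]}

print(parse_mom_name("mom-expert-law-flash"))
# {'category': 'slm_expert', 'domain': 'law', 'size': 'flash'}
print(parse_mom_name("mom-brain-max"))
# {'category': 'brain', 'size': 'max'}
```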

 ## Get Started
