**Architecture**: All based on Qwen 0.6B (decoder-only) for domain-specific problem solving. Currently only Flash variants are available.

**Purpose**: These models are **not routers**—they are deployed as backend LLMs to solve domain-specific problems. They form part of the Mixture-of-Models backend architecture that vLLM-SR routes to.
## Design Principles
**Safety-First**: Prompt Guardian models (PII, jailbreak detection) run before routing—security at the edge.

**Speed ↔ Capability**: Choose Flash for sub-10ms latency, Pro for balanced performance, or Max for maximum accuracy. Different sizes, different SLAs.

**Domain Expertise**: SLM Expert models are deployed as backend problem solvers, achieving 15-25% better accuracy on domain-specific tasks vs. generalist LLMs. Math problems are solved by math experts, science questions by science experts, etc.
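
The dispatch described above can be sketched in a few lines. This is a hypothetical illustration, not vLLM-SR source code: the expert model names (`mom-math-flash`, `mom-science-flash`) and the generalist fallback are assumptions for the example, not part of the published MoM collection.

```python
# Hypothetical sketch of backend expert selection: once a query's
# domain is known, forward it to the matching SLM expert instead of
# a generalist LLM. Names below are illustrative assumptions.

EXPERTS = {
    "math": "mom-math-flash",
    "science": "mom-science-flash",
}
DEFAULT_BACKEND = "general-purpose-llm"  # fallback when no expert matches

def select_backend(domain: str) -> str:
    """Map a classified domain to its SLM expert, else the generalist."""
    return EXPERTS.get(domain, DEFAULT_BACKEND)
```

The design point is the fallback: domains without a trained expert still get served, just by the general-purpose model.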
## How vLLM-SR Uses MoM
The router itself is a mixture of specialized models working together in a pipeline:
1. **Security Check** → `mom-jailbreak-flash` and `mom-pii-flash` filter malicious/sensitive requests
2. **Intent Classification** → `mom-brain-*` models (flash/pro/max) determine query type and routing decisions
3. **Similarity Search** → `mom-similarity-flash` finds semantically similar routes

Each stage uses the **right model for the right task**: fast encoders for security checks, reasoning decoders for complex decisions.

This dual-level MoM architecture achieves **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).
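
The three stages above can be sketched as a short pipeline. This is an illustrative stand-in, not vLLM-SR source code: each function is a placeholder for a dedicated model (`mom-jailbreak-flash`/`mom-pii-flash`, `mom-brain-*`, `mom-similarity-flash`), and the heuristics and route names inside are assumptions for the example.

```python
# Illustrative sketch of the three-stage router pipeline. Real stages
# are backed by dedicated models; these bodies are toy placeholders.

def security_check(query: str) -> bool:
    # Placeholder for jailbreak/PII screening (mom-jailbreak-flash, mom-pii-flash).
    return "ignore previous instructions" not in query.lower()

def classify_intent(query: str) -> str:
    # Placeholder for intent classification (mom-brain-*); a real model
    # returns a learned category, not a keyword heuristic.
    return "math" if any(ch.isdigit() for ch in query) else "general"

def find_route(intent: str) -> str:
    # Placeholder for semantic similarity search (mom-similarity-flash).
    return {"math": "math-expert-route"}.get(intent, "default-route")

def route(query: str) -> str:
    # Security runs first, before any routing decision is made.
    if not security_check(query):
        return "blocked"
    return find_route(classify_intent(query))
```

Note the ordering: the security stage can short-circuit the pipeline before any routing work happens, which is the "security at the edge" principle above.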
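
To make the cost claim concrete, here is a back-of-the-envelope calculation. The per-token prices and traffic split are illustrative assumptions, not real quotes or measured vLLM-SR numbers.

```python
# Back-of-the-envelope cost model with made-up prices: assume 70% of
# traffic can be handled by a cheap SLM expert and the rest still
# goes to a large generalist model.

large_cost = 10.0    # $ per 1M tokens on the large model (hypothetical)
small_cost = 0.5     # $ per 1M tokens on the SLM expert (hypothetical)
slm_share = 0.7      # fraction of queries routed to SLM experts

baseline = large_cost                                        # no routing
blended = slm_share * small_cost + (1 - slm_share) * large_cost

reduction = baseline / blended
print(f"blended cost: ${blended:.2f}/1M tokens, {reduction:.1f}x cheaper")
```

With these illustrative numbers the blended cost is $3.35 per 1M tokens, roughly a 3x reduction versus sending everything to the large model, consistent with the 2x+ claim.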

**The Philosophy**: Mixture-of-Models all the way down—from the router's internal decision-making to the backend LLM pool (including both general-purpose LLMs and specialized SLM experts).