[docs][serve][llm] Add comprehensive architecture documentation for Ray Serve LLM #57830
Conversation
Convert the Ray Serve LLM quickstart documentation from RST to MyST Markdown format.

Changes:
- Convert quick-start.rst to quick-start.md using MyST syntax
- Apply Ray docs style guide (contractions, active voice, sentence case)
- Convert all RST directives to MyST equivalents (tab-sets, code blocks, cross-refs)
- No content changes, pure format conversion
- Verified: builds successfully with no errors or warnings

This is Phase 0 of the LLM docs reorganization plan.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Create the directory structure for Ray Serve LLM documentation reorganization.

Changes:
- Add user-guides/ directory for user guide content
- Add architecture/ directory for architecture docs
- Add api/ directory for API reference

This is Phase 1 of the LLM docs reorganization plan.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Create a simple examples landing page that lists all available LLM tutorials.

Changes:
- Add examples.md with flat list of tutorial links
- Include all 7 existing deployment tutorials
- Follow style guide (sentence case, active voice)

This is part of Phase 2 of the LLM docs reorganization plan.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Extract FAQ section from quick-start and create dedicated troubleshooting page.

Changes:
- Create troubleshooting.md with 3 FAQs:
  - How to use gated Huggingface models
  - Why is downloading the model so slow
  - How to configure tokenizer pool size
- Add 'Get help' section with community links
- Follow style guide (sentence case, contractions, active voice)

This is part of Phase 2 of the LLM docs reorganization plan.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Extract Multi-LoRA documentation from quick-start into dedicated user guide.

Changes:
- Create user-guides/multi-lora.md with:
  - Overview of multi-LoRA deployment and when to use it
  - How request routing works with adapter caching
  - Configuration sections (LoRA config and engine arguments)
  - Complete server and client examples
  - Example queries for base model and adapters
- Follow Ray docs style guide (sentence case, contractions, active voice)

This is part of Phase 3 of the LLM docs reorganization plan.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
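To ground the overview, here's a hedged sketch of the deployment shape such a guide describes. The `lora_config` field names (`dynamic_lora_loading_path`, `max_num_adapters_per_replica`) and the bucket path are assumptions based on this commit's description, not excerpts from the final guide.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="meta-llama/Llama-3.1-8B-Instruct",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    lora_config=dict(
        # Adapters load on demand from <path>/<adapter-name> and are cached
        # per replica; the bucket below is hypothetical.
        dynamic_lora_loading_path="s3://my-bucket/lora-adapters",
        max_num_adapters_per_replica=16,
    ),
    accelerator_type="A10G",
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```

Clients typically address an adapter by appending its name to the model ID, for example `model="meta-llama/Llama-3.1-8B-Instruct:my-adapter"`, while the bare model ID hits the base model.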
Remove Multi-LoRA deployment section from quick-start as it's now a dedicated user guide.

Changes:
- Remove Multi-LoRA deployment section (server + client examples)
- Content moved to user-guides/multi-lora.md

This reduces quick-start.md from 917 to ~850 lines.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Create user guide and architecture docs for request routing policies.

Changes:
- Add user-guides/prefix-aware-routing.md:
  - Overview and when to use prefix-aware routing
  - How it works (load balance, prefix matching, fallback)
  - Deployment example with corrected import path
  - Configuration parameters and best practices
- Add architecture/routing-policies.md:
  - Routing vs ingress distinction
  - Request routing architecture
  - Available policies (Power of Two, Prefix-aware)
  - Design patterns (centralized store, broadcast metrics)
  - Custom routing policies with proper links to Ray Serve docs
  - Utility mixins documentation
  - Implementation details (async operations, state management)
- Add routing policy diagrams (request_router_1.png, request_router_2.png)
- Follow Ray docs style guide (sentence case, contractions, active voice)

This is part of Phase 3 of the LLM docs reorganization plan.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
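As a rough sketch of how a routing policy plugs into a deployment: Ray Serve lets you name a custom request router class in the deployment options. The router class path below is a placeholder (the commit only says the import path was corrected), and the `RequestRouterConfig` wiring is an assumption here, so treat this as illustrative rather than the guide's exact example.

```python
from ray.serve.config import RequestRouterConfig
from ray.serve.llm import LLMConfig, build_openai_app

# Placeholder path; check the user guide for the corrected import.
PREFIX_ROUTER = "ray.serve.llm.request_router.PrefixCacheAffinityRouter"

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        # Send requests that share a prompt prefix to the same replica so
        # they can reuse that replica's KV cache; fall back to load
        # balancing when no good match exists.
        request_router_config=RequestRouterConfig(
            request_router_class=PREFIX_ROUTER,
        ),
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
```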
Add index files and update main index to include new documentation sections.

Changes:
- Update serve/llm/index.md toctree to include:
  - Examples
  - User Guides (with index)
  - Architecture (with index)
  - Troubleshooting
- Create user-guides/index.md with links to:
  - Multi-LoRA deployment
  - Prefix-aware routing
- Create architecture/index.md with link to:
  - Request routing

This resolves the 'document isn't included in any toctree' warnings.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Apply Ray docs style guide and improve clarity in routing-policies.md.

Style improvements:
- Use active voice and direct address (you/your)
- Add code backticks for technical terms (model_id, OpenAiIngress, etc.)
- Improve conversational tone
- Change 'Choosing' to 'Choose' for imperative mood

Image improvements:
- Rename request_router_1.png → routing_centralized_store.png
- Rename request_router_2.png → routing_broadcast_metrics.png
- Update all image references in documentation

These changes improve readability and align with Ray documentation standards.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Replace implementation-focused prefix-aware-request-router.md with the user-focused prefix-aware-routing.md in user guides.

Changes:
- Delete prefix-aware-request-router.md (implementation details)
- Remove toctree entry from index.md
- User guide (user-guides/prefix-aware-routing.md) is now the primary documentation for prefix-aware routing

The new user guide focuses on practical usage (when to use, how to deploy, configuration) rather than implementation details, which aligns better with user needs.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Create user-guides/prefill-decode.md to replace pd-dissagregation.md.

Changes:
- Create user-guides/prefill-decode.md with improved structure
- Apply Ray docs style guide (active voice, contractions, you/your)
- Remove VLLM_USE_V1 mention (v1 is now default)
- Improve section organization and headings (sentence case, imperative)
- Reorganize structure: Deploy with NIXLConnector → Deploy with LMCacheConnectorV1
- Add 'When to use' and 'Best practices' sections
- Enhance troubleshooting section with specific guidance
- Update user-guides/index.md to include new guide
- Remove pd-dissagregation.md from main index.md toctree
- Delete old pd-dissagregation.md file

The new guide focuses on practical deployment steps while maintaining all technical content from the original.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
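For orientation, a condensed sketch of the NIXLConnector deployment shape. The `build_pd_openai_app` helper is referenced later in this PR; its argument names and the `kv_transfer_config` fields (borrowed from vLLM's `KVTransferConfig`) are assumptions here, so treat this as a sketch rather than the guide's exact example.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_pd_openai_app  # import path assumed

# Prefill and decode run as separate deployments of the same model; the
# KV cache moves between them over NIXL.
kv_transfer = dict(kv_connector="NixlConnector", kv_role="kv_both")

prefill_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-7b",
        model_source="Qwen/Qwen2.5-7B-Instruct",
    ),
    engine_kwargs=dict(kv_transfer_config=kv_transfer),
)
decode_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-7b",
        model_source="Qwen/Qwen2.5-7B-Instruct",
    ),
    engine_kwargs=dict(kv_transfer_config=kv_transfer),
)

app = build_pd_openai_app(
    dict(prefill_config=prefill_config, decode_config=decode_config)
)
serve.run(app)
```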
Create user-guides/vllm-compatibility.md to document vLLM feature support.

Changes:
- Create user-guides/vllm-compatibility.md with comprehensive coverage
- Document embeddings, structured output, vision models, and reasoning models
- Highlight Ray Serve LLM alignment with vLLM OpenAI-compatible server
- Apply Ray docs style guide (active voice, you/your, sentence case)
- Include practical examples for each feature with server/client code
- Update user-guides/index.md to include new guide

The guide emphasizes that users can switch between vllm serve and Ray Serve LLM with no code changes while gaining Ray Serve's production features (autoscaling, multi-model serving, advanced routing).

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
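The "no code changes" claim is concrete on the client side: because both `vllm serve` and Ray Serve LLM expose the OpenAI-compatible API, the same client code works against either. A minimal example; the URL, key, and model ID are placeholders for your deployment:

```python
from openai import OpenAI

# Point the standard OpenAI client at the Ray Serve (or vllm serve) endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

response = client.chat.completions.create(
    model="qwen-0.5b",  # whatever model_id you deployed
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(response.choices[0].message.content)
```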
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Create user-guides/model-loading.md for configuring model sources.

Changes:
- Create user-guides/model-loading.md with comprehensive coverage
- Document loading from Hugging Face Hub with fast download
- Document gated model authentication with HF_TOKEN
- Document loading from S3/GCS remote storage
- Include AWS credentials configuration (env vars and IAM roles)
- Add cluster-wide environment variable configuration with ray.init
- Include best practices for security, performance, and source selection
- Add troubleshooting for slow downloads and access errors
- Apply Ray docs style guide (active voice, contractions, 'such as')
- Update user-guides/index.md to include new guide

The guide consolidates model loading patterns extracted from quick-start FAQs and remote storage sections into a focused, practical resource.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
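The patterns the guide consolidates mostly reduce to where `model_source` points and which credentials reach the replicas. A sketch, assuming `LLMConfig` forwards `runtime_env` environment variables to workers; the bucket and model IDs are placeholders:

```python
import os

from ray.serve.llm import LLMConfig

# Gated Hugging Face model: forward the token to every replica.
gated_hf = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    runtime_env=dict(env_vars={"HF_TOKEN": os.environ["HF_TOKEN"]}),
)

# Remote storage: point model_source at an S3 (or GCS) prefix instead of a
# Hub repo; AWS credentials come from env vars or IAM roles as usual.
from_s3 = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="s3://my-bucket/llama-3.1-8b/",
    ),
)
```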
Remove sections moved to dedicated user guides, reducing file by 53%.

Removed sections:
- Embeddings → moved to vllm-compatibility.md
- Structured output → moved to vllm-compatibility.md
- Vision language models → moved to vllm-compatibility.md
- Fractional GPU serving → moved to fractional-gpu.md
- Remote storage → moved to model-loading.md
- Gated models FAQ → moved to model-loading.md
- Fast download FAQ → moved to model-loading.md

Changes:
- Reduced from 845 to 398 lines (447 lines removed, 53% reduction)
- Kept tokenizer pool size FAQ (not moved to user guides)
- Maintained clean flow from multi-model section to FAQs
- No linter errors

The quick-start now focuses on core deployment patterns with links to dedicated user guides for advanced topics.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Remove placeholder files in preparation for PR.

Changes:
- Remove .gitkeep from user-guides/
- Remove .gitkeep from architecture/
- Remove api/.gitkeep and empty api/ directory

The api/ directory will be added in a future PR focused on API reference documentation. This PR focuses on user guides.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Create comprehensive architecture documentation for Ray Serve LLM.

Changes:
- Create architecture/overview.md - High-level architecture overview:
  - What Ray Serve LLM provides (horizontal scaling, distributed strategies)
  - Ray Serve primitives (deployment, replica, handle)
  - Core components (LLMServer, OpenAiIngress)
  - Physical placement and engine management
  - Network topology and scaling considerations
  - Architecture patterns overview (data parallel, prefill-decode, routing)
  - Design principles
- Create architecture/core.md - Technical implementation details:
  - Core abstractions (LLMEngine protocol, LLMConfig, protocols)
  - Builder pattern for flexible deployment configuration
  - Async constructor pattern for engine initialization
  - Component relationships diagram
  - Extension points (custom engines, servers, ingress, builders)
- Update architecture/index.md to include new docs in toctree

Content extracted from kh/arch-docs branch design/core.md and adapted to Ray docs style guide. Split into two focused documents for better discoverability and different audiences (overview for understanding, core for extending).

Full Ray docs style guide compliance:
- Active voice throughout
- Contractions used appropriately
- Complete sentences for code lead-ins
- Proper list punctuation
- Consistent capitalization

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
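The async constructor pattern mentioned here deserves a quick illustration. This is a generic sketch of the idea rather than Ray's actual `LLMServer` code: keep `__init__` cheap and synchronous, and move awaitable engine startup into an async factory method.

```python
import asyncio


class FakeEngine:
    """Stand-in for a slow-to-start inference engine."""

    async def start(self) -> "FakeEngine":
        await asyncio.sleep(0.1)  # simulate engine/model startup
        return self


class ServerSketch:
    """__init__ only stores ready dependencies; `create` does the async work."""

    def __init__(self, engine: FakeEngine):
        self.engine = engine

    @classmethod
    async def create(cls) -> "ServerSketch":
        engine = await FakeEngine().start()
        return cls(engine)


async def main():
    server = await ServerSketch.create()
    print("engine ready:", type(server.engine).__name__)


asyncio.run(main())
```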
Create serving patterns documentation and copy images from arch-docs branch.

Changes:
- Copy architecture images from kh/arch-docs branch:
  - llmserver.png - LLMServer component diagram
  - placement.png - Physical placement strategy
  - llmserver-ingress-rpc.png - Network topology
  - pd.png - Prefill-decode architecture
  - dp.png - Data parallel architecture
  - dp_flow.png - Data parallel request flow
- Create architecture/serving-patterns/ directory structure:
  - index.md - Overview of serving patterns
  - prefill-decode.md - PD disaggregation architecture
- Update architecture/index.md to include serving patterns

Architecture doc for prefill-decode includes:
- Resource characteristics comparison table
- PDProxyServer component details
- KV cache transfer backends (NIXLConnector, LMCacheConnectorV1)
- Request flow with diagrams
- Combining with data parallelism
- Performance characteristics and design considerations
- Full Ray docs style guide compliance

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Apply Ray documentation style guide improvements to prefill-decode.md.

Changes:
- Use active voice: 'The correct ratio...improves' instead of 'could improve'
- Use contractions: 'it's' instead of 'it is', 'isn't' instead of 'are not'
- Replace informal abbreviations: 'prefill instance' instead of 'P'
- Improve clarity: 'when decode-limited' instead of long fragment
- User-focused language: 'You should inspect' instead of 'this needs'
- Better word choice: 'versus' instead of 'vs.'

All changes maintain technical accuracy while improving readability and following Ray docs voice and grammar guidelines.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Refine data-parallel architecture documentation to focus on its primary use case: serving large sparse MoE models with expert parallelism.

Changes:
- Clarify that DP is most useful when combined with expert parallelism
- Emphasize that replicas work together, not in isolation
- Focus on MLA (Multi-head Latent Attention) benefits
- Explain when DP helps: saturating experts with larger batch sizes
- Remove unnecessary details about fault tolerance (to be designed)
- Simplify scaling section to reflect static support only
- Remove comparison table and custom routing details
- Update pseudocode to show rank passing to engine

Key insight: DP allows reaching larger batch sizes by utilizing sparsity of experts more efficiently in MoE + MLA architectures.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Apply Ray documentation style guide improvements to both files.

data-parallel.md fixes:
- Use contractions: 'aren't', 'don't', 'can't', 'isn't'
- Fix typos: 'or' -> 'of', 'benefitional' -> 'beneficial', 'effecitve' -> 'effective', 'Psudocode' -> 'Pseudocode'
- Improve clarity: 'qkv' -> 'attention (QKV) layers'
- Define MLA (Multi-head Latent Attention) on first use

deepseek-v3.md fixes:
- Fix typo: 'builde_pd_openai_app' -> 'build_pd_openai_app'
- Clarify eager mode comment
- Update imports to use build_pd_openai_app helper
- Update diagram to show 32 ingress replicas and 32 PDProxyServer replicas
- Add note about ~2 PDProxyServer replicas per node
- Update all examples to use P:32, D:96 scaling
- Add request routing config to deployment_config
- Simplify code by removing manual PDProxyServer construction

All changes maintain technical accuracy while improving readability and following Ray docs voice and grammar guidelines.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Complete final missing pieces of documentation reorganization.

Changes:
- Create user-guides/observability.md:
  - Extract observability content from quick-start.md
  - Service-level metrics and engine metrics
  - Grafana dashboard documentation
  - Configuration examples for log_engine_metrics
  - Usage data collection information
  - Best practices for monitoring
- Update quick-start.md:
  - Remove observability section (moved to user guide)
  - Remove gen_config CLI section (deprecated feature)
  - Add link to observability user guide
  - Streamline quick-start content
- Update index.md with improved overview:
  - Better positioning: 'high-performance, scalable framework'
  - Add 'Why Ray Serve LLM?' section highlighting distributed workloads
  - Emphasize key differentiators (parallelism, PD disaggregation, routing)
  - Update features list to include new capabilities
  - Remove outdated 'Key Components' and 'Configuration' sections
- Update user-guides/index.md to include observability guide

Full Ray docs style guide compliance maintained throughout.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
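For reference, the engine-metrics switch this commit documents reduces to one config flag. A minimal sketch, assuming `log_engine_metrics` is a top-level `LLMConfig` field as the commit implies; the model IDs are placeholders:

```python
from ray.serve.llm import LLMConfig

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    # Export vLLM engine metrics (e.g., for the Grafana dashboards the
    # guide documents) alongside Serve's service-level metrics.
    log_engine_metrics=True,
)
```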
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Co-authored-by: angelinalg <[email protected]> Signed-off-by: kourosh hakhamaneshi <[email protected]>
Co-authored-by: angelinalg <[email protected]> Signed-off-by: kourosh hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
…e the corresponding file. Update architecture overview to generalize serving patterns without specific mention of DeepSeek-V3. Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
/gemini review
Code Review
This pull request introduces extensive and well-structured architecture documentation for Ray Serve LLM, covering core components, abstractions, and serving patterns like data parallelism and prefill-decode disaggregation. My review focuses on improving the clarity and correctness of this new documentation. I've identified several areas for improvement, including fixing typos in documentation directives, correcting code examples to ensure they are valid, and highlighting placeholder content that needs to be completed. These suggestions aim to make the documentation more accurate and user-friendly for developers.
Resolved review threads (outdated) on doc/source/serve/llm/architecture/serving-patterns/data-parallel.md and doc/source/serve/llm/architecture/serving-patterns/prefill-decode.md.
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
/gemini review
Code Review
This pull request adds extensive and valuable architecture documentation for Ray Serve LLM. The new documents are well-structured and provide a great overview of the core components, serving patterns, and design principles. My review focuses on improving the correctness and clarity of the code examples provided in the documentation. I've identified a few instances of undefined variables, missing imports, and inconsistent method signatures in the code snippets, which could confuse users. Addressing these points will make the documentation more accurate and easier to follow.
Resolved review thread (outdated) on doc/source/serve/llm/architecture/serving-patterns/data-parallel.md.
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
/gemini review
Code Review
This pull request adds extensive and valuable architecture documentation for Ray Serve LLM. The new documents are well-structured, detailed, and cover core components, serving patterns, and design principles, which will be a great resource for developers. My review focuses on minor improvements to enhance clarity, consistency, and correctness in the documentation and code examples. Overall, this is an excellent contribution.
Resolved review threads (outdated): three on doc/source/serve/llm/architecture/serving-patterns/data-parallel.md and one on doc/source/serve/llm/architecture/serving-patterns/prefill-decode.md.
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
ruisearch42 left a comment:

Nice docs, overall looks good.
> (serve-llm-architecture-data-parallel)=
> # Data parallelism

Call this "Data parallel attention" to be more precise? Or clarify in the text.

done
> ## Architecture overview
> ```{figure} ../../images/dp.png

Is the MP executor currently used instead of the Ray executor?

The Ray executor is still used. The MPClient is used, not RayActorClient.
> - **Large sparse MoE with MLA**: Allows reaching larger batch sizes by utilizing the sparsity of the experts more efficiently. MLA (Multi-head Latent Attention) reduces KV cache memory requirements.
> - **High throughput required**: You need to serve many concurrent requests.
> - **KV-cache limited**: Adding more KV cache capacity increases throughput, so that parallelization of experts could effectively increase the capacity of KV-cache for handling concurrent requests.

Regarding "so that parallelization of experts": is this a benefit of EP?

It's not meant to frame that we do EP because it increases KV-cache size (TP does that too). It's just a note that if we do EP, the KV-cache size increases. The reader should get an idea of the general thinking behind DP + EP.
> Data parallel serving works best when:
>
> - **Large sparse MoE with MLA**: Allows reaching larger batch sizes by utilizing the sparsity of the experts more efficiently. MLA (Multi-head Latent Attention) reduces KV cache memory requirements.

Is the main motivation that attention weights and KV-cache can't be partitioned along the head dimension for MLA, and DP allows partitioning along the request dimension?

Yeah. You can do TP up to the degree that you can keep sharding your KV cache; after that, to avoid duplication, you can do DP. So something like DP2TP4EP8 for Qwen-235B makes sense, instead of DP8EP8.

Added some clarification around this.
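For readers following this thread, a hedged sketch of what a DP2 x TP4 layout with expert parallelism could look like through `engine_kwargs`. The argument names follow vLLM's engine flags (`data_parallel_size`, `tensor_parallel_size`, `enable_expert_parallel`) and the model ID is a stand-in; the merged docs may configure this differently.

```python
from ray.serve.llm import LLMConfig

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen3-235b",
        model_source="Qwen/Qwen3-235B-A22B",
    ),
    engine_kwargs=dict(
        data_parallel_size=2,         # DP2: shard requests once KV sharding runs out
        tensor_parallel_size=4,       # TP4: shard attention while KV heads allow it
        enable_expert_parallel=True,  # spread experts across all DP x TP ranks
    ),
)
```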
> Consider alternatives when:
>
> - **Low to medium throughput**: If you can't saturate the MoE layers, don't use DP.
> - **Non-MLA Attention**: DP is beneficial with MLA. Without DP (using TP instead), you need to replicate the KV cache, which isn't beneficial because you want to maximize batch size. As long as the KV cache can be sharded, using TP might be sufficient.

Can DP benefit GQA as well, when TP_size > num_kv_heads?

Yeah. Do you think these points aren't coming across? Maybe I should be more explicit :D

Added some clarification around this.
In the figure, replicas are numbered 1..N but ranks are 0..N. Should both be unified to 0..N-1?

Yeah, it's minor. We'll do it when we create Mermaid versions of these figures.
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
…ay Serve LLM (ray-project#57830) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: xgui <[email protected]>
…ay Serve LLM (#57830) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: elliot-barn <[email protected]>
…ay Serve LLM (ray-project#57830) Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
…ay Serve LLM (ray-project#57830) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>
Summary
This PR adds complete architecture documentation for Ray Serve LLM, providing technical implementation details, serving patterns, and design documentation for developers building on and extending the framework.
Builds on: #57787 (Phase 1: User guides and structure)
Motivation
Ray Serve LLM is a sophisticated distributed serving framework, but lacked comprehensive architecture documentation explaining:
This documentation is essential for:
Changes
New Architecture Documentation
Core Architecture (2 documents)
- architecture/overview.md
- architecture/core.md

Serving Patterns (4 documents)
- architecture/serving-patterns/index.md
- architecture/serving-patterns/prefill-decode.md
- architecture/serving-patterns/data-parallel.md
- architecture/serving-patterns/deepseek-v3.md

Request Routing Architecture
- architecture/routing-policies.md

Followups to this PR: