[docs][serve][llm] Reorganize Ray Serve LLM documentation with user guides and improved navigation #57787

kouroshHakha · 2025-10-16T06:11:41Z

Summary

This PR reorganizes the Ray Serve LLM documentation to improve discoverability and usability. The main changes include:

Created 6 focused user guides covering practical deployment scenarios
Improved landing page navigation with clear "Get started" section
Added architecture documentation for request routing policies (more architecture docs will follow in later PRs)

Motivation

The previous documentation structure had several issues:

Quick-start was too long (845 lines) covering too many topics
Advanced features were buried in the quick-start guide
No clear navigation path for different use cases
Content duplication between files
Difficult to find specific topics

Changes

New User Guides (6 files)

Created focused, practical guides for common deployment scenarios:

user-guides/model-loading.md - Loading models from HuggingFace, S3/GCS, gated repositories
user-guides/vllm-compatibility.md - Using vLLM features (embeddings, structured output, vision, reasoning models)
user-guides/fractional-gpu.md - Cost-efficient serving on fractional GPUs
user-guides/prefill-decode.md - Prefill/decode disaggregation with NIXLConnector and LMCache
user-guides/prefix-aware-routing.md - Cache-aware request routing
user-guides/multi-lora.md - Multi-LoRA deployment

Architecture Documentation

architecture/routing-policies.md - Comprehensive request routing architecture including centralized store and broadcast metrics patterns

Supporting Files

examples.md - Flat list of tutorial links
troubleshooting.md - FAQs extracted from quick-start

Files Deleted

pd-dissagregation.md → replaced by user-guides/prefill-decode.md
prefix-aware-request-router.md → replaced by user-guides/prefix-aware-routing.md

Before/After Comparison

Before

serve/llm/
├── index.md (mixed concerns)
├── quick-start.rst (845 lines, everything)
├── pd-dissagregation.md
├── prefix-aware-request-router.md
└── troubleshooting (embedded in quick-start)

After

serve/llm/
├── index.md (clear navigation)
├── quick-start.md (373 lines, focused)
├── examples.md
├── troubleshooting.md
├── user-guides/ (6 focused guides)
│   ├── model-loading.md
│   ├── vllm-compatibility.md
│   ├── fractional-gpu.md
│   ├── prefill-decode.md
│   ├── prefix-aware-routing.md
│   └── multi-lora.md
└── architecture/
    └── routing-policies.md

Not Included (Future PRs)

This PR focuses on user guides. Follow-up PRs will add:

Architecture docs (overview.md, core.md, serving-patterns.md)
API reference documentation
Data parallel user guide
Final index.md cleanup (move Key Components & Configuration sections)

Screenshots

Side bar: before and after screenshots (images not preserved)

Convert the Ray Serve LLM quickstart documentation from RST to MyST Markdown format.

Changes:
- Convert quick-start.rst to quick-start.md using MyST syntax
- Apply Ray docs style guide (contractions, active voice, sentence case)
- Convert all RST directives to MyST equivalents (tab-sets, code blocks, cross-refs)
- No content changes, pure format conversion
- Verified: builds successfully with no errors or warnings

This is Phase 0 of the LLM docs reorganization plan.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Create the directory structure for Ray Serve LLM documentation reorganization.

Changes:
- Add user-guides/ directory for user guide content
- Add architecture/ directory for architecture docs
- Add api/ directory for API reference

This is Phase 1 of the LLM docs reorganization plan.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Create a simple examples landing page that lists all available LLM tutorials.

Changes:
- Add examples.md with flat list of tutorial links
- Include all 7 existing deployment tutorials
- Follow style guide (sentence case, active voice)

This is part of Phase 2 of the LLM docs reorganization plan.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Extract FAQ section from quick-start and create dedicated troubleshooting page.

Changes:
- Create troubleshooting.md with 3 FAQs:
  - How to use gated Huggingface models
  - Why is downloading the model so slow
  - How to configure tokenizer pool size
- Add 'Get help' section with community links
- Follow style guide (sentence case, contractions, active voice)

This is part of Phase 2 of the LLM docs reorganization plan.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Extract Multi-LoRA documentation from quick-start into dedicated user guide.

Changes:
- Create user-guides/multi-lora.md with:
  - Overview of multi-LoRA deployment and when to use it
  - How request routing works with adapter caching
  - Configuration sections (LoRA config and engine arguments)
  - Complete server and client examples
  - Example queries for base model and adapters
- Follow Ray docs style guide (sentence case, contractions, active voice)

This is part of Phase 3 of the LLM docs reorganization plan.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Remove Multi-LoRA deployment section from quick-start as it's now a dedicated user guide.

Changes:
- Remove Multi-LoRA deployment section (server + client examples)
- Content moved to user-guides/multi-lora.md

This reduces quick-start.md from 917 to ~850 lines.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Create user guide and architecture docs for request routing policies.

Changes:
- Add user-guides/prefix-aware-routing.md:
  - Overview and when to use prefix-aware routing
  - How it works (load balance, prefix matching, fallback)
  - Deployment example with corrected import path
  - Configuration parameters and best practices
- Add architecture/routing-policies.md:
  - Routing vs ingress distinction
  - Request routing architecture
  - Available policies (Power of Two, Prefix-aware)
  - Design patterns (centralized store, broadcast metrics)
  - Custom routing policies with proper links to Ray Serve docs
  - Utility mixins documentation
  - Implementation details (async operations, state management)
- Add routing policy diagrams (request_router_1.png, request_router_2.png)
- Follow Ray docs style guide (sentence case, contractions, active voice)

This is part of Phase 3 of the LLM docs reorganization plan.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
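The three routing stages named in this commit (load balance, prefix matching, fallback) can be illustrated with a toy sketch. This is not the Ray Serve LLM implementation: the `Replica` class, the `route` function, and the `imbalance_threshold` and `min_match_ratio` parameters are hypothetical names chosen for illustration, and the fallback assumes the Power of Two policy means "sample two replicas, keep the less loaded one".

```python
import random
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy stand-in for a serving replica (hypothetical, not a Ray Serve class)."""
    name: str
    queue_len: int = 0
    seen_prompts: list = field(default_factory=list)


def common_prefix_len(a: str, b: str) -> int:
    """Return the number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def power_of_two(replicas, rng=random):
    """Sample two replicas at random and keep the less loaded one."""
    if len(replicas) == 1:
        return replicas[0]
    a, b = rng.sample(replicas, 2)
    return a if a.queue_len <= b.queue_len else b


def route(prompt, replicas, imbalance_threshold=10, min_match_ratio=0.1, rng=random):
    """Pick a replica for `prompt` using the three stages described above."""
    loads = [r.queue_len for r in replicas]

    # Stage 1, load balance: if queues have diverged too far, ignore cache
    # locality and send the request to the least loaded replica.
    if max(loads) - min(loads) > imbalance_threshold:
        return min(replicas, key=lambda r: r.queue_len)

    # Stage 2, prefix matching: prefer the replica that has already served
    # the prompt sharing the longest prefix with this one.
    best, best_len = None, 0
    for replica in replicas:
        for seen in replica.seen_prompts:
            n = common_prefix_len(prompt, seen)
            if n > best_len:
                best, best_len = replica, n
    if best is not None and best_len >= min_match_ratio * len(prompt):
        return best

    # Stage 3, fallback: no useful match, so use power of two choices.
    return power_of_two(replicas, rng)
```

In this sketch a request that shares a system prompt with earlier traffic lands on the replica that served it, unless that replica's queue is more than `imbalance_threshold` requests longer than the shortest queue.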

Add index files and update main index to include new documentation sections.

Changes:
- Update serve/llm/index.md toctree to include:
  - Examples
  - User Guides (with index)
  - Architecture (with index)
  - Troubleshooting
- Create user-guides/index.md with links to:
  - Multi-LoRA deployment
  - Prefix-aware routing
- Create architecture/index.md with link to:
  - Request routing

This resolves the 'document isn't included in any toctree' warnings.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
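A quick way to catch the "document isn't included in any toctree" warning before running a full Sphinx build is to scan the Markdown sources for MyST `{toctree}` directives and list the files that no toctree references. The sketch below is a simplified illustration, not part of the Ray build tooling: `toctree_entries` and `orphan_docs` are hypothetical helpers, only `.md` files and document-relative entries are handled, and `index.md` at the root is assumed to be the top-level document.

```python
import re
from pathlib import Path

# Matches the opening line of a MyST toctree directive, for example ```{toctree}
FENCE_OPEN = re.compile(r"^(`{3,}|:{3,})\{toctree\}")


def toctree_entries(text):
    """Yield document names listed in MyST toctree directives in `text`."""
    fence = None
    for line in text.splitlines():
        stripped = line.strip()
        if fence is None:
            m = FENCE_OPEN.match(stripped)
            if m:
                fence = m.group(1)
            continue
        if stripped == fence:
            fence = None  # closing fence ends the directive
            continue
        if not stripped or stripped.startswith(":"):
            continue  # blank line or directive option such as :maxdepth:
        # Entries are either "path" or "Title <path>".
        m = re.match(r"^.*<(.+)>$", stripped)
        yield m.group(1) if m else stripped


def orphan_docs(root):
    """Return .md files under `root` that no toctree references."""
    root = Path(root).resolve()
    docs = {p.resolve() for p in root.rglob("*.md")}
    referenced = {root / "index.md"}  # assumed top-level document
    for doc in docs:
        for entry in toctree_entries(doc.read_text(encoding="utf-8")):
            name = entry if entry.endswith(".md") else entry + ".md"
            referenced.add((doc.parent / name).resolve())
    return sorted(p.relative_to(root).as_posix() for p in docs - referenced)
```

Calling `orphan_docs("doc/source/serve/llm")` on a checkout would list any guide that was created but never wired into an index.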

Apply Ray docs style guide and improve clarity in routing-policies.md.

Style improvements:
- Use active voice and direct address (you/your)
- Add code backticks for technical terms (model_id, OpenAiIngress, etc.)
- Improve conversational tone
- Change 'Choosing' to 'Choose' for imperative mood

Image improvements:
- Rename request_router_1.png → routing_centralized_store.png
- Rename request_router_2.png → routing_broadcast_metrics.png
- Update all image references in documentation

These changes improve readability and align with Ray documentation standards.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Replace implementation-focused prefix-aware-request-router.md with the user-focused prefix-aware-routing.md in user guides.

Changes:
- Delete prefix-aware-request-router.md (implementation details)
- Remove toctree entry from index.md
- User guide (user-guides/prefix-aware-routing.md) is now the primary documentation for prefix-aware routing

The new user guide focuses on practical usage (when to use, how to deploy, configuration) rather than implementation details, which aligns better with user needs.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Create user-guides/prefill-decode.md to replace pd-dissagregation.md.

Changes:
- Create user-guides/prefill-decode.md with improved structure
- Apply Ray docs style guide (active voice, contractions, you/your)
- Remove VLLM_USE_V1 mention (v1 is now default)
- Improve section organization and headings (sentence case, imperative)
- Reorganize structure: Deploy with NIXLConnector → Deploy with LMCacheConnectorV1
- Add 'When to use' and 'Best practices' sections
- Enhance troubleshooting section with specific guidance
- Update user-guides/index.md to include new guide
- Remove pd-dissagregation.md from main index.md toctree
- Delete old pd-dissagregation.md file

The new guide focuses on practical deployment steps while maintaining all technical content from the original.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Create user-guides/vllm-compatibility.md to document vLLM feature support.

Changes:
- Create user-guides/vllm-compatibility.md with comprehensive coverage
- Document embeddings, structured output, vision models, and reasoning models
- Highlight Ray Serve LLM alignment with vLLM OpenAI-compatible server
- Apply Ray docs style guide (active voice, you/your, sentence case)
- Include practical examples for each feature with server/client code
- Update user-guides/index.md to include new guide

The guide emphasizes that users can switch between vllm serve and Ray Serve LLM with no code changes while gaining Ray Serve's production features (autoscaling, multi-model serving, advanced routing).

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Create user-guides/model-loading.md for configuring model sources.

Changes:
- Create user-guides/model-loading.md with comprehensive coverage
- Document loading from Hugging Face Hub with fast download
- Document gated model authentication with HF_TOKEN
- Document loading from S3/GCS remote storage
- Include AWS credentials configuration (env vars and IAM roles)
- Add cluster-wide environment variable configuration with ray.init
- Include best practices for security, performance, and source selection
- Add troubleshooting for slow downloads and access errors
- Apply Ray docs style guide (active voice, contractions, 'such as')
- Update user-guides/index.md to include new guide

The guide consolidates model loading patterns extracted from quick-start FAQs and remote storage sections into a focused, practical resource.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
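As a rough illustration of the cluster-wide environment variable configuration mentioned above, the sketch below builds a `runtime_env` dictionary that forwards `HF_TOKEN` from the driver's environment instead of hardcoding it. The `build_runtime_env` helper is hypothetical and not taken from the guide; the `ray.init` call is shown only as a comment so the snippet runs without Ray installed.

```python
import os


def build_runtime_env(names=("HF_TOKEN",), environ=None):
    """Collect selected variables from the driver's environment for the cluster.

    Raises KeyError if a requested variable is not set, so a missing token
    fails at startup rather than at model download time.
    """
    environ = os.environ if environ is None else environ
    missing = [n for n in names if n not in environ]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {"env_vars": {n: environ[n] for n in names}}


# With Ray installed, the dictionary can be passed to ray.init, which accepts
# a runtime_env argument with an "env_vars" field:
#
#   import ray
#   ray.init(runtime_env=build_runtime_env())
```

Reading the token from the environment keeps it out of source control, which fits the security best practices the guide lists.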

Remove sections moved to dedicated user guides, reducing file by 53%.

Removed sections:
- Embeddings → moved to vllm-compatibility.md
- Structured output → moved to vllm-compatibility.md
- Vision language models → moved to vllm-compatibility.md
- Fractional GPU serving → moved to fractional-gpu.md
- Remote storage → moved to model-loading.md
- Gated models FAQ → moved to model-loading.md
- Fast download FAQ → moved to model-loading.md

Changes:
- Reduced from 845 to 398 lines (447 lines removed, 53% reduction)
- Kept tokenizer pool size FAQ (not moved to user guides)
- Maintained clean flow from multi-model section to FAQs
- No linter errors

The quick-start now focuses on core deployment patterns with links to dedicated user guides for advanced topics.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Remove placeholder files in preparation for PR.

Changes:
- Remove .gitkeep from user-guides/
- Remove .gitkeep from architecture/
- Remove api/.gitkeep and empty api/ directory

The api/ directory will be added in a future PR focused on API reference documentation. This PR focuses on user guides.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Complete final missing pieces of documentation reorganization.

Changes:
- Create user-guides/observability.md:
  - Extract observability content from quick-start.md
  - Service-level metrics and engine metrics
  - Grafana dashboard documentation
  - Configuration examples for log_engine_metrics
  - Usage data collection information
  - Best practices for monitoring
- Update quick-start.md:
  - Remove observability section (moved to user guide)
  - Remove gen_config CLI section (deprecated feature)
  - Add link to observability user guide
  - Streamline quick-start content
- Update index.md with improved overview:
  - Better positioning: 'high-performance, scalable framework'
  - Add 'Why Ray Serve LLM?' section highlighting distributed workloads
  - Emphasize key differentiators (parallelism, PD disaggregation, routing)
  - Update features list to include new capabilities
  - Remove outdated 'Key Components' and 'Configuration' sections
- Update user-guides/index.md to include observability guide

Full Ray docs style guide compliance maintained throughout.

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

doc/source/serve/llm/architecture/routing-policies.md

doc/source/serve/llm/index.md

doc/source/serve/llm/troubleshooting.md

doc/source/serve/llm/user-guides/fractional-gpu.md

doc/source/serve/llm/user-guides/observability.md

doc/source/serve/llm/user-guides/prefill-decode.md

doc/source/serve/llm/user-guides/prefix-aware-routing.md

Co-authored-by: angelinalg <[email protected]> Signed-off-by: kourosh hakhamaneshi <[email protected]>

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

…uides and improved navigation (ray-project#57787) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: angelinalg <[email protected]>

…uides and improved navigation (ray-project#57787) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: angelinalg <[email protected]> Signed-off-by: xgui <[email protected]>

…uides and improved navigation (#57787) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: angelinalg <[email protected]> Signed-off-by: elliot-barn <[email protected]>

…uides and improved navigation (ray-project#57787) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: angelinalg <[email protected]>

…uides and improved navigation (ray-project#57787) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: angelinalg <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>

kouroshHakha added 19 commits October 15, 2025 20:57

wip

31bad71

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

wip

7d09a33

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

wip

f408ba7

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

wip

0be378e

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

kouroshHakha changed the title ~~[wip] Kh/llm docs phase 1~~ [docs] Reorganize Ray Serve LLM documentation with user guides and improved navigation Oct 16, 2025

Merge branch 'master' into kh/llm-docs-phase-1

9c21149

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

kouroshHakha added the go add ONLY when ready to merge, run all tests label Oct 16, 2025

kouroshHakha marked this pull request as ready for review October 17, 2025 01:40

kouroshHakha requested review from a team as code owners October 17, 2025 01:40

kouroshHakha changed the title ~~[docs] Reorganize Ray Serve LLM documentation with user guides and improved navigation~~ [docs][serve][llm] Reorganize Ray Serve LLM documentation with user guides and improved navigation Oct 17, 2025

kouroshHakha mentioned this pull request Oct 17, 2025

[docs][serve][llm] Add comprehensive architecture documentation for Ray Serve LLM #57830

Merged

1 task

ray-gardener bot added the serve Ray Serve Related Issue label Oct 17, 2025

ray-gardener bot added docs An issue or change related to documentation llm labels Oct 17, 2025

angelinalg reviewed Oct 17, 2025

angelinalg approved these changes Oct 17, 2025

kouroshHakha and others added 3 commits October 17, 2025 16:36

Apply suggestions from code review

31f4ea7

Co-authored-by: angelinalg <[email protected]> Signed-off-by: kourosh hakhamaneshi <[email protected]>

Apply suggestions from code review

c622ec6

Co-authored-by: angelinalg <[email protected]> Signed-off-by: kourosh hakhamaneshi <[email protected]>

fix comments

6c35085

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

kouroshHakha enabled auto-merge (squash) October 18, 2025 01:58

kouroshHakha merged commit 7806bf2 into ray-project:master Oct 18, 2025
7 checks passed
