Merged
3 changes: 0 additions & 3 deletions .markdownlint.yaml
@@ -4,9 +4,6 @@ MD013: false
MD024:
siblings_only: true
MD033: false
MD045: false
MD046: false
MD051: false
MD052: false
MD053: false
MD059: false
Comment on lines 4 to 9
**P1:** Avoid enabling MD045 while empty alt text remains elsewhere

The change re-enables markdownlint rule MD045 by removing `MD045: false`, but many existing docs still embed images with empty alt text via syntax like `![](…)` (for example in `docs/design/arch_overview.md` and `docs/design/paged_attention.md`). Once MD045 is active, these files will fail linting and block the docs pipeline. Either add descriptive alt text across the repo or keep the rule disabled until the remaining violations are addressed.
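A quick repository audit can confirm whether the rule is safe to re-enable. A minimal sketch (the `docs/` root and the helper name are illustrative, not part of this PR):

```python
import re
from pathlib import Path

# Matches Markdown images with empty alt text, e.g. ![](path/to/img.png),
# which is the pattern markdownlint rule MD045 flags.
EMPTY_ALT = re.compile(r"!\[\]\(")

def find_md045_violations(root: str) -> list[tuple[str, int]]:
    """Return (file, line number) pairs for images with empty alt text."""
    hits = []
    for md_file in Path(root).rglob("*.md"):
        text = md_file.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if EMPTY_ALT.search(line):
                hits.append((str(md_file), lineno))
    return hits
```

Running this over `docs/` before merging would list exactly the files the comment warns about; an empty result means MD045 can be re-enabled safely.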


2 changes: 0 additions & 2 deletions docs/contributing/benchmarks.md
@@ -10,8 +10,6 @@ vLLM provides comprehensive benchmarking tools for performance testing and evalu
- **[Parameter sweeps](#parameter-sweeps)**: Automate `vllm bench` runs for multiple configurations
- **[Performance benchmarks](#performance-benchmarks)**: Automated CI benchmarks for development

[Benchmark CLI]: #benchmark-cli

## Benchmark CLI

This section guides you through running benchmark tests with the extensive
2 changes: 1 addition & 1 deletion docs/contributing/ci/update_pytorch_version.md
@@ -95,7 +95,7 @@ when manually triggering a build on Buildkite. This branch accomplishes two thin
to warm it up so that future builds are faster.

<p align="center" width="100%">
<img width="60%" src="https://github.com/user-attachments/assets/a8ff0fcd-76e0-4e91-b72f-014e3fdb6b94">
<img width="60%" alt="Buildkite new build popup" src="https://github.com/user-attachments/assets/a8ff0fcd-76e0-4e91-b72f-014e3fdb6b94">
</p>

## Update dependencies
4 changes: 2 additions & 2 deletions docs/deployment/frameworks/chatbox.md
@@ -29,8 +29,8 @@ pip install vllm
- API Path: `/chat/completions`
- Model: `qwen/Qwen1.5-0.5B-Chat`

![](../../assets/deployment/chatbox-settings.png)
![Chatbox settings screen](../../assets/deployment/chatbox-settings.png)

1. Go to `Just chat`, and start to chat:

![](../../assets/deployment/chatbox-chat.png)
![Chatbox chat screen](../../assets/deployment/chatbox-chat.png)
6 changes: 3 additions & 3 deletions docs/deployment/frameworks/dify.md
@@ -46,12 +46,12 @@ And install [Docker](https://docs.docker.com/engine/install/) and [Docker Compos
- **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat`
- **Completion Mode**: `Completion`

![](../../assets/deployment/dify-settings.png)
![Dify settings screen](../../assets/deployment/dify-settings.png)

1. To create a test chatbot, go to `Studio → Chatbot → Create from Blank`, then select Chatbot as the type:

![](../../assets/deployment/dify-create-chatbot.png)
![Dify create chatbot screen](../../assets/deployment/dify-create-chatbot.png)

1. Click the chatbot you just created to open the chat interface and start interacting with the model:

![](../../assets/deployment/dify-chat.png)
![Dify chat screen](../../assets/deployment/dify-chat.png)
8 changes: 4 additions & 4 deletions docs/design/fused_moe_modular_kernel.md
@@ -19,9 +19,9 @@ The input activation format completely depends on the All2All Dispatch being use

The FusedMoE operation is generally made up of multiple operations, in both the Contiguous and Batched variants, as described in the diagrams below.

![](../assets/design/fused_moe_modular_kernel/fused_moe_non_batched.png "FusedMoE Non-Batched")
![FusedMoE Non-Batched](../assets/design/fused_moe_modular_kernel/fused_moe_non_batched.png)

![](../assets/design/fused_moe_modular_kernel/fused_moe_batched.png "FusedMoE Batched")
![FusedMoE Batched](../assets/design/fused_moe_modular_kernel/fused_moe_batched.png)

!!! note
The main difference, in terms of operations, between the Batched and Non-Batched cases is the Permute / Unpermute operations. All other operations remain.
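The Permute / Unpermute steps in the note above can be made concrete with a minimal NumPy sketch (shapes and names are illustrative; the real kernels operate on GPU tensors): Permute sorts token rows by their assigned expert so each expert sees a contiguous slab, and Unpermute inverts that ordering.

```python
import numpy as np

def permute(tokens: np.ndarray, expert_ids: np.ndarray):
    """Group token rows so each expert's tokens are contiguous.

    Returns the permuted tokens and the sort order needed to undo it.
    """
    order = np.argsort(expert_ids, kind="stable")  # slot -> original token index
    return tokens[order], order

def unpermute(permuted: np.ndarray, order: np.ndarray) -> np.ndarray:
    """Invert permute(): scatter rows back to their original positions."""
    out = np.empty_like(permuted)
    out[order] = permuted
    return out
```

In the Batched variant the dispatched activations already arrive grouped per expert, which is why these two steps drop out of the pipeline.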
@@ -57,7 +57,7 @@ The `FusedMoEModularKernel` acts as a bridge between the `FusedMoEPermuteExperts
The `FusedMoEPrepareAndFinalize` abstract class exposes `prepare`, `prepare_no_receive` and `finalize` functions.
The `prepare` function is responsible for input activation Quantization and All2All Dispatch. If implemented, `prepare_no_receive` is like `prepare` except that it does not wait to receive results from the other workers; instead it returns a "receiver" callback that must be invoked to wait for the final results. Not every `FusedMoEPrepareAndFinalize` class is required to support this method, but when it is available it can be used to interleave work with the initial all-to-all communication, e.g. interleaving shared experts with fused experts. The `finalize` function is responsible for invoking the All2All Combine. Additionally, the `finalize` function may or may not perform the TopK weight application and reduction (please refer to the TopKWeightAndReduce section).
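The contract described above can be sketched as a simplified abstract interface (a sketch only; the real vLLM signatures carry quantization and routing arguments omitted here):

```python
from abc import ABC, abstractmethod
from typing import Callable

class FusedMoEPrepareAndFinalizeSketch(ABC):
    """Simplified sketch of the prepare/finalize contract."""

    @abstractmethod
    def prepare(self, hidden_states):
        """Quantize the input activations and run the All2All Dispatch,
        blocking until results from all workers have arrived."""

    def prepare_no_receive(self, hidden_states) -> Callable:
        """Optional: start the dispatch without waiting, returning a
        receiver callback that blocks for the results when invoked.
        The caller can interleave other work (e.g. shared experts)
        between this call and invoking the callback."""
        raise NotImplementedError

    @abstractmethod
    def finalize(self, expert_output):
        """Run the All2All Combine; may also apply the TopK weights and
        reduce across the TopK dimension."""
```

Implementations that cannot overlap communication simply leave `prepare_no_receive` unimplemented and callers fall back to the blocking `prepare` path.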

![](../assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png "FusedMoEPrepareAndFinalize Blocks")
![FusedMoEPrepareAndFinalize Blocks](../assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png)

### FusedMoEPermuteExpertsUnpermute

@@ -88,7 +88,7 @@ The core FusedMoE implementation performs a series of operations. It would be in
It is sometimes more efficient to perform the TopK weight application and reduction inside `FusedMoEPermuteExpertsUnpermute::apply()`; find an example [here](https://github.com/vllm-project/vllm/pull/20228). We have a `TopKWeightAndReduce` abstract class to facilitate such implementations. Please refer to the TopKWeightAndReduce section.
`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` returns the `TopKWeightAndReduce` object that the implementation wants the `FusedMoEPrepareAndFinalize::finalize()` to use.
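The weight-application-and-reduce step itself is simple to state. A hypothetical standalone sketch (NumPy, with invented names and shapes for illustration): each token's TopK expert outputs are scaled by their routing weights and summed.

```python
import numpy as np

def topk_weight_and_reduce(expert_out: np.ndarray,
                           topk_weights: np.ndarray) -> np.ndarray:
    """Apply per-token TopK routing weights and reduce over the TopK dim.

    expert_out:   [num_tokens, topk, hidden] outputs of the selected experts
    topk_weights: [num_tokens, topk] routing weights
    Returns:      [num_tokens, hidden] combined output
    """
    return (expert_out * topk_weights[..., None]).sum(axis=1)
```

Whether this runs inside `apply()` or inside `finalize()` is exactly the choice the `TopKWeightAndReduce` object communicates between the two components.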

![](../assets/design/fused_moe_modular_kernel/fused_experts_blocks.png "FusedMoEPermuteExpertsUnpermute Blocks")
![FusedMoEPermuteExpertsUnpermute Blocks](../assets/design/fused_moe_modular_kernel/fused_experts_blocks.png)

### FusedMoEModularKernel
