Conversation
Signed-off-by: wangyxbh <wangyxbh@qq.com>
Code Review
This pull request introduces a CPU offload mechanism for Mixture-of-Experts (MoE) models, which is a significant and complex feature. The changes span across the build system (CMake), C++/CUDA kernels, and Python configuration and execution logic. The implementation includes optimizations like AVX/AMX for CPU kernels and asynchronous GPU-CPU interaction. My review focuses on critical issues related to process stability, resource utilization, and portability. I've identified a critical issue where memory allocation failures can terminate the entire process, and several high-severity issues related to inefficient busy-waiting in both CPU and GPU code, as well as hardcoded system configurations that limit portability. Addressing these points will improve the robustness and efficiency of the new MoE offload feature.
```cpp
if (buffer == nullptr) {
  std::cerr << "Memory allocation failed for buffer: " << name
            << " size: " << size << std::endl;
  exit(-1);
}
```
Using exit(-1) on memory allocation failure is problematic for a library, as it terminates the entire process abruptly. This prevents the calling application from handling the error gracefully, such as by releasing resources or attempting recovery. It's better to throw an exception, like std::bad_alloc, to allow the caller to catch and manage the allocation failure. You will need to include the <new> header.
```cpp
if (buffer == nullptr) {
  throw std::bad_alloc();
}
```

```cpp
while (thread_state_[i].status->load(std::memory_order_acquire) == ThreadStatus::WORKING) {
  // Busy-wait for completion (consider condition variables for production)
}
```
This busy-wait loop on the main thread can consume 100% of a CPU core while waiting for worker threads to complete. This is highly inefficient. While this might be an intentional choice to minimize latency, it's generally better to use a more power-efficient synchronization mechanism. The comment already suggests condition variables, which would be ideal. A simpler improvement would be to yield the thread's time slice or sleep for a very short duration inside the loop to reduce CPU pressure, similar to what's done in worker_thread.
```cpp
while (thread_state_[i].status->load(std::memory_order_acquire) == ThreadStatus::WORKING) {
  std::this_thread::yield();
}
```

```cuda
while (data->callback_completed == 0) {
  __threadfence_system();  // ensure the latest value is read from the CPU
}
```
This busy-wait loop inside sync_kernel can be problematic. While it's only a single thread due to the <<<1, 1>>> launch configuration, it will still occupy a GPU execution slot and spin, consuming resources while waiting for the CPU. If the CPU task is delayed for any reason (e.g., OS preemption, heavy load), this could lead to wasted GPU cycles and potential device-wide issues if not handled carefully. A more robust approach would be to use CUDA events for synchronization, which are more efficient and don't involve active spinning on the GPU.
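The recommendation above can be illustrated host-side: instead of spinning on a flag, the waiter blocks on an event that the worker signals on completion. A minimal Python sketch with hypothetical names (`threading.Event` stands in for a CUDA event; this is an illustration of the pattern, not the PR's code):

```python
import threading

done = threading.Event()
result = {}

def cpu_task():
    # Simulates the CPU-side expert computation the other side waits on.
    result["value"] = 42
    done.set()  # signal completion; no consumer spins on a flag

t = threading.Thread(target=cpu_task)
t.start()
done.wait()  # blocks without burning a core, unlike a busy-wait loop
t.join()
print(result["value"])  # -> 42
```

On the GPU side, the analogous mechanism would be a CUDA event recorded by the producer stream and waited on by the consumer, rather than a kernel spinning on host memory.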
```python
# the following cpu setting only for DCG's shenzhen AI servers
core_per_worker = 64 // tp_size
```
The CPU core affinity logic is hardcoded for a specific server configuration with 64 cores, as noted in the comment. This will cause issues or suboptimal performance on machines with a different number of CPU cores. This should be generalized to dynamically determine the number of available cores, for example by using psutil.cpu_count(logical=False).
```diff
-# the following cpu setting only for DCG's shenzhen AI servers
-core_per_worker = 64 // tp_size
+# Determine cores per worker dynamically
+total_cores = psutil.cpu_count(logical=False) or psutil.cpu_count(logical=True)
+core_per_worker = total_cores // tp_size
```
Hi @wangyxbh, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged.
Documentation preview: https://vllm--31938.org.readthedocs.build/en/31938/
```cuda
    data->submit_count += 1;
    __threadfence_system();  // ensure the write is visible to the CPU
  }
}
```
Memory fence placement causes GPU-CPU synchronization race condition
High Severity
The __threadfence_system() in submit_kernel is called after writing gpu_signal = 1, not before. This means the CPU polling loop in cpu_polling_loop could observe gpu_signal == 1 before the data fields (layer_idx, batch_idx, num_tokens) are visible, causing the CPU to read stale or uninitialized values. The fence should be placed between writing the data fields and setting gpu_signal = 1 to ensure proper release-acquire semantics.
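The required ordering can be mirrored host-side: publish all payload fields first, then raise the flag, so any consumer that observes the flag is guaranteed to also see the data. A Python sketch with hypothetical field names (`threading.Event` plays the role of the release fence plus signal; it is an analogy for the corrected CUDA ordering, not the actual kernel):

```python
import threading

class SubmitSlot:
    def __init__(self):
        self.layer_idx = -1
        self.batch_idx = -1
        self.num_tokens = 0
        self.gpu_signal = threading.Event()

slot = SubmitSlot()

def submit(layer_idx, batch_idx, num_tokens):
    # Write all data fields BEFORE signaling...
    slot.layer_idx = layer_idx
    slot.batch_idx = batch_idx
    slot.num_tokens = num_tokens
    # ...then signal. Event.set() establishes a happens-before edge,
    # analogous to __threadfence_system() followed by gpu_signal = 1.
    slot.gpu_signal.set()

def poll():
    slot.gpu_signal.wait()
    return (slot.layer_idx, slot.batch_idx, slot.num_tokens)

submit(4, 0, 128)
print(poll())  # -> (4, 0, 128)
```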
```python
activation="silu",  # could be read from the layer config
global_num_experts=config['total_experts'],
apply_router_weight_on_input=False,
expert_map=miss_map,  # uses the layer's miss_map
```
Unused map_modified causes experts computed with invalid weights
High Severity
In forward_prefetch, map_modified is created and sets experts from n_copy onwards to -1, but then miss_map (the unmodified version) is passed to fused_experts as expert_map. Since only experts 0 to n_copy-1 are copied to temp_layer, using miss_map causes fused_experts to attempt computing experts with indices >= n_copy using uninitialized or stale weights in the temp buffer. The expert_map argument should use map_modified when n_copy < 256.
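The intended masking can be sketched with plain Python lists standing in for tensors (names mirror the snippet below; this is an illustration of the fix, not the actual vLLM code):

```python
def build_expert_map(miss_map, n_copy):
    # Experts with IDs 0..n_copy-1 are the only ones copied into the temp
    # buffer; mask out every expert from n_copy onward so the fused kernel
    # never dereferences a slot that was not filled this pass.
    map_modified = list(miss_map)
    for expert_id in range(n_copy, len(map_modified)):
        map_modified[expert_id] = -1
    return map_modified

miss_map = [0, 1, 2, 3]  # expert id -> temp-buffer slot
print(build_expert_map(miss_map, n_copy=2))  # -> [0, 1, -1, -1]
```

Passing the masked map (rather than the unmodified `miss_map`) is what keeps `fused_experts` away from the uninitialized tail of the temp buffer.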
```python
if n_copy < 256:
    map_modified = miss_map.clone()
    map_modified[n_copy:] = -1
    mask = (topk_ids > n_copy) & (miss_map[topk_ids] >= 0)
```
Off-by-one error in prefetch expert threshold comparison
Medium Severity
The mask condition topk_ids > n_copy excludes expert with ID equal to n_copy from being sent to CPU. Since update_expert_cache copies experts with IDs 0 through n_copy-1, the expert at index n_copy is not copied to temp_layer but also won't be sent to CPU for computation. This should be topk_ids >= n_copy to correctly route all non-copied experts to CPU.
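A tiny sketch of the corrected mask (plain Python stand-ins for the tensors; shapes and names are illustrative):

```python
def cpu_route_mask(topk_ids, miss_map, n_copy):
    # Experts with IDs >= n_copy are NOT in the temp buffer, so they must
    # be routed to the CPU path; `>` would silently drop expert n_copy.
    return [(tid >= n_copy) and (miss_map[tid] >= 0) for tid in topk_ids]

miss_map = [0, 1, 2, 3]
print(cpu_route_mask([0, 1, 2, 3], miss_map, n_copy=2))
# -> [False, False, True, True]  (with `>`, expert 2 would be lost)
```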
```python
    Unified entry point: selects DBO or Prefetch mode based on configuration
    """
    if self.tp_rank == 1 and layer_id == 4:
        print(f"dbo_enable():{dbo_enabled()} and ntok = {hidden_states.shape[0]}")
```
Debug print statements left in production code
Low Severity
Multiple debug print statements appear to have been left in the codebase: conditional prints in forward_offload (line 548), _build_attn_metadata (line 1591), _dummy_run (lines 4254, 4350), and warmup loop (line 4911). These will produce noisy output during production inference and degrade performance.
```python
# if the buffer is empty, return a zero tensor
if not temp_layer:
    return torch.zeros_like(hidden_states)
```
Inconsistent return value causes tuple unpacking error
High Severity
In forward_prefetch, when temp_layer is None, the function returns only a single tensor (torch.zeros_like(hidden_states)) instead of the expected tuple of two values. The caller in fp8.py unpacks the result as miss_output, expert_map = layer.cpu_offload_eng.forward_offload(...), which will raise a ValueError when this code path is hit. The return statement needs to include both the tensor and a cache_map value.
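A sketch of the consistent return shape (plain Python lists stand in for tensors; the `None` cache map placeholder is a hypothetical choice, not necessarily what the PR should return):

```python
def forward_prefetch_stub(hidden_states, temp_layer):
    # When the buffer is empty, still return the (output, cache_map) pair
    # the caller unpacks; returning a lone tensor breaks the unpack site.
    if not temp_layer:
        return [0.0] * len(hidden_states), None  # None as a cache_map placeholder
    return temp_layer, {"cached": True}          # placeholder happy path

out, cache_map = forward_prefetch_stub([1.0, 2.0], temp_layer=None)
print(out, cache_map)  # -> [0.0, 0.0] None
```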
```python
schedule.run_on_nodes(cpu_node)
memory.set_membind_nodes(cpu_node)
proc_handle.cpu_affinity(cpu_cores)
```
Hardcoded NUMA configuration fails on non-matching systems
Medium Severity
The NUMA binding code assumes a 64-core, 2-NUMA-node system. When moe_offload is enabled on systems with different configurations (fewer cores, different NUMA topology, single NUMA node), this will cause cpu_affinity to fail with invalid core IDs or set_membind_nodes to fail with non-existent NUMA node 1. There's no validation or graceful fallback for unsupported hardware configurations.
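One possible generalization, sketched in plain Python (the per-rank slicing scheme and function name are assumptions for illustration, not the PR's actual layout):

```python
import os

def pick_cpu_cores(tp_rank, tp_size, total=None):
    # Derive the per-worker core pool from the cores actually visible,
    # instead of assuming a fixed 64-core / 2-NUMA-node machine.
    total = total or os.cpu_count() or 1
    per_worker = max(1, total // tp_size)
    start = (tp_rank * per_worker) % total
    return [(start + i) % total for i in range(per_worker)]

# Deterministic illustration with an assumed 8-core machine:
print(pick_cpu_cores(tp_rank=1, tp_size=2, total=8))  # -> [4, 5, 6, 7]
```

A real implementation would additionally validate the NUMA node list (e.g., via libnuma or psutil) and fall back to no binding when the topology does not match.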
```cpp
thread_num_ = std::min(max_thread_num_, task_num);

const int base = task_num / thread_num_;
const int remain = task_num % thread_num_;
```
Division by zero when task count is zero
Medium Severity
In do_work_stealing_job, when task_num is 0, thread_num_ becomes 0 via std::min(max_thread_num_, task_num). The subsequent division task_num / thread_num_ causes undefined behavior (division by zero). This can occur when num_tokens is 0 or when n_tokens / BM evaluates to 0 for small token counts. The function topk_sort_inplace is called unconditionally from Moe::forward before any size checks, propagating zero task_num to this code.
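The needed guard can be sketched as follows (a Python stand-in for the C++ logic; `split_tasks` is a hypothetical name):

```python
def split_tasks(task_num, max_thread_num):
    # Guard the zero-task case before computing base/remain, mirroring
    # the early-return do_work_stealing_job needs.
    if task_num == 0:
        return []
    thread_num = min(max_thread_num, task_num)
    base, remain = divmod(task_num, thread_num)
    # The first `remain` threads each take one extra task.
    return [base + (1 if i < remain else 0) for i in range(thread_num)]

print(split_tasks(0, 8))   # -> []        (no division by zero)
print(split_tasks(10, 4))  # -> [3, 3, 2, 2]
```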
cc @mgoin, although it would probably be good to get input from @robertgshaw2-redhat @bnellnm @varun-sundar-rabindranath too
Signed-off-by: wangyxbh <wangyxbh@digitalchina.com>
Please create an RFC or change to a draft PR given the large scope of this PR.
PR: CPU Offload for Mixture-of-Experts (MoE) Inference in vLLM
Summary
This PR proposes a CPU Offload Module for MoE inference in vLLM, enabling a large portion of expert weights and computation to be dynamically offloaded to the CPU while keeping only a small, hot subset of experts cached on GPU.
The design supports:
This approach significantly reduces GPU memory footprint and improves scalability for large MoE models with thousands of experts.
Motivation
Problem Statement
Large MoE models (e.g., Deepseek-like architectures) suffer from GPU memory pressure during inference:
Existing approaches typically:
Key Observations
Goals of This RFC
Proposed Change
Architecture Overview
Core Design Philosophy
The core design principle is that the GPU no longer stores all expert weights for each layer, but instead caches only a limited number of hot experts. The CPU maintains the complete set of experts and dynamically determines which experts need to be copied to the GPU and which should be computed directly on the CPU based on actual token routing behavior.
The entire mechanism revolves around:
Key Components
Initialization Phase
During model initialization:
`cache_expert_num` for each layer, storing the most frequently accessed experts

To track the state of experts in the GPU cache, the system maintains per-layer metadata:
Forward Pass Execution Flow
Step 1: Expert Cache Policy Matching
At the start of a forward pass, the model has already obtained `topk_ids` for each token from the router. The system calls `expert_cache_policy` to match these `topk_ids` against the current layer's cache state.

This process outputs two key pieces of information:

Important: `copy_map` does not directly correspond to "experts copied to GPU cache". It is simply the list of experts that need to be copied in this pass, and their final destination depends on the execution mode.
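The matching step can be sketched as follows (a dict stands in for the per-layer cache metadata; all names and shapes here are illustrative, not the PR's implementation):

```python
def expert_cache_policy(topk_ids, gpu_cache):
    # gpu_cache: expert_id -> GPU slot for experts resident on the GPU.
    hits, copy_map = {}, []
    for eid in set(topk_ids):
        if eid in gpu_cache:
            hits[eid] = gpu_cache[eid]   # cache hit: reuse the GPU slot
        else:
            copy_map.append(eid)          # miss: copy to GPU or run on CPU
    return hits, sorted(copy_map)

hits, copy_map = expert_cache_policy([3, 7, 3, 11], gpu_cache={3: 0, 5: 1})
print(hits, copy_map)  # -> {3: 0} [7, 11]
```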
The system operates in two primary execution modes:
DBO Mode (Dual Batch Overlap)
When the system is in DBO mode or in decode/small batch scenarios, the forward pass enters a fully parallel CPU-GPU execution path:
Experts in `copy_map` are asynchronously copied to the GPU Expert Cache for subsequent `fused_experts` computation

Prefetch Mode
In Prefetch mode (typically for larger prefill batches), system behavior adjusts based on the number of tokens in the batch:
As token count increases, more experts are triggered in the forward pass
The system dynamically calculates `n_copy` to limit the maximum number of experts copied in this pass.

If `n_copy` is less than the total number of experts: experts in `copy_map` are not placed in the GPU cache; they are copied into a temporary buffer (`temp_layer`) and computed via `fused_experts`.

When the batch size is extremely large and `n_copy` covers all or nearly all experts: the entire `fused_experts` computation is completed on the GPU side.

Double-Buffered Miss Expert Buffer Management: To prevent miss experts from being overwritten during cross-layer execution, the system globally maintains only two Miss Expert Buffers, using `layer_id % 2` for double-buffering. By coordinating with independent CUDA streams and events:
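The parity-based buffer selection can be sketched as (illustrative only; buffer contents are stand-ins):

```python
class MissExpertBuffers:
    # Two buffers alternated by layer parity, so layer N+1 can fill one
    # while layer N's CPU results are still being consumed from the other.
    def __init__(self):
        self.buffers = [{}, {}]

    def for_layer(self, layer_id):
        return self.buffers[layer_id % 2]

bufs = MissExpertBuffers()
print(bufs.for_layer(4) is bufs.for_layer(6))  # -> True  (same parity reuses a buffer)
print(bufs.for_layer(4) is bufs.for_layer(5))  # -> False (adjacent layers alternate)
```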