[Do not merge][Test] Revert "[Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2" #30715

LucasWilkinson wants to merge 1 commit into main
Conversation
Revert "[Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2"

This reverts commit 3e41992.
Code Review
This pull request reverts the introduction of the sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2. The changes primarily remove the `cp_gather_and_upconvert_fp8_kv_cache` kernel, its associated logic, and the global `WorkspaceManager`, replacing the latter with a more encapsulated `SharedResizableBuffer` inside `FusedMoEModularKernel`. While the revert is largely correct, I've identified a critical issue with the new buffer implementation that could break CUDA graph compatibility due to the lack of a locking mechanism. There is also a performance concern in `deepseek_v2.py`, where repeated memory allocations have been introduced inside a loop.
```python
class SharedResizableBuffer:
    def __init__(self):
        self.buffer = None

    def get(
        self, shape: tuple[int, ...], device: torch.device, dtype: torch.dtype
    ) -> torch.Tensor:
        assert shape != ()
        shape_numel = prod(shape)
        if (
            self.buffer is None
            or self.buffer.numel() < shape_numel
            or self.buffer.device != device
            or self.buffer.dtype != dtype
        ):
            self.buffer = torch.empty(shape_numel, device=device, dtype=dtype)
        return self.buffer[:shape_numel].view(*shape)
```
The new `SharedResizableBuffer` implementation lacks the locking mechanism that the removed `WorkspaceManager` had. The `WorkspaceManager` was locked after profiling/warmup runs to prevent the workspace from resizing during execution, which is crucial for CUDA graph stability. Without a lock, if a request requiring a larger buffer arrives after a CUDA graph has been captured, `SharedResizableBuffer.get()` will reallocate the buffer and invalidate the captured graph, which can lead to runtime errors or incorrect computations. A locking mechanism should be re-introduced to prevent buffer reallocation after CUDA graph capture.
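For illustration, a lock flag could be re-introduced along these lines. This is a hypothetical sketch, not the vLLM implementation: the `lock()` method and the error message are assumptions, layered onto the `get()` logic shown in the diff above.

```python
import torch
from math import prod


class SharedResizableBuffer:
    """Sketch of a lockable variant (hypothetical, not the vLLM code).

    After profiling/warmup, call lock() so that CUDA graph capture can
    never observe a reallocation of the backing storage.
    """

    def __init__(self):
        self.buffer: torch.Tensor | None = None
        self._locked = False

    def lock(self) -> None:
        # Call once after warmup, before CUDA graph capture.
        self._locked = True

    def get(
        self, shape: tuple[int, ...], device: torch.device, dtype: torch.dtype
    ) -> torch.Tensor:
        assert shape != ()
        shape_numel = prod(shape)
        needs_realloc = (
            self.buffer is None
            or self.buffer.numel() < shape_numel
            or self.buffer.device != device
            or self.buffer.dtype != dtype
        )
        if needs_realloc:
            if self._locked:
                # Failing loudly is safer than silently breaking a captured graph.
                raise RuntimeError(
                    "SharedResizableBuffer is locked; cannot reallocate to "
                    f"{shape_numel} elements after CUDA graph capture"
                )
            self.buffer = torch.empty(shape_numel, device=device, dtype=dtype)
        return self.buffer[:shape_numel].view(*shape)
```

Requests that fit inside the already-allocated buffer still succeed after locking; only growth (or a device/dtype change) raises.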
```diff
 for chunk in prefill_metadata.chunks:
-    k_fp8 = k_fp8_full[: chunk.total_seq_lens]
-    k_scale = k_scale_full[: chunk.total_seq_lens]
+    k_fp8 = torch.empty(
+        [chunk.total_seq_lens, head_dim],
+        device=k.device,
+        dtype=fp8_dtype,
+    )
+    k_scale = torch.empty(
+        [chunk.total_seq_lens, 4],
+        device=k.device,
+        dtype=torch.uint8,
+    )
```
The change from using a pre-allocated workspace to calling `torch.empty` inside the loop for each prefill chunk introduces repeated memory allocations. This can lead to performance degradation and memory fragmentation, especially when processing many small chunks. The previous approach, allocating a single large buffer and slicing it per chunk, is generally more efficient. Consider pre-allocating a buffer outside the loop to avoid repeated allocations in this hot path.
resolved by: #30744
Reverts #27532