avoid copying new_tokens to cpu #118
base: feat/ad-2025-07-22
Conversation
Signed-off-by: Suyog Gupta <[email protected]>
Pull Request Overview
This PR optimizes GPU memory usage by avoiding unnecessary CPU transfers when processing new tokens in the auto deploy executor. Instead of copying new tokens from GPU to CPU and back, the changes preserve the new tokens on GPU throughout the processing pipeline.
- Replace CPU conversion of new_tokens with direct GPU tensor handling
- Update sequence information interface to accept GPU tensors directly
- Add NVTX profiling ranges for performance monitoring
- Optimize tensor creation with pinned memory for faster transfers
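The pinned-memory optimization in the last bullet can be illustrated with a minimal sketch (not the PR's actual code; buffer names and sizes are hypothetical). Allocating the host-side staging buffer in page-locked memory lets the host-to-device copy run asynchronously instead of blocking the CPU:

```python
import torch

max_batch_size = 8

# Pinned (page-locked) host buffer for staging token ids; pinning only
# applies when CUDA is available.
host_tokens = torch.empty(
    max_batch_size, dtype=torch.int, pin_memory=torch.cuda.is_available()
)

def stage_to_device(tokens: torch.Tensor, device) -> torch.Tensor:
    # Copy into the pinned buffer, then launch a non-blocking H2D transfer;
    # from pinned memory this maps to an async cudaMemcpy that can overlap
    # with compute on the device.
    n = tokens.numel()
    host_tokens[:n].copy_(tokens)
    return host_tokens[:n].to(device, non_blocking=True)
```

Without pinning, `non_blocking=True` silently degrades to a synchronous copy, which is why the PR pairs the two.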
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py | Removes CPU conversion of new_tokens and passes GPU tensors directly to sequence interface |
| tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py | Updates sequence interface to handle GPU tensors and adds performance optimizations |
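The NVTX profiling ranges mentioned in the overview can be sketched as follows (a hypothetical helper, not the file's actual code): wrapping a region in an NVTX range makes it show up as a named span in Nsight Systems timelines, and guarding on CUDA availability keeps it a no-op on CPU-only builds.

```python
from contextlib import contextmanager

import torch

@contextmanager
def nvtx_range(name: str):
    # Push/pop an NVTX range around the enclosed region; skip when CUDA
    # is unavailable so the helper stays a harmless no-op.
    if torch.cuda.is_available():
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if torch.cuda.is_available():
            torch.cuda.nvtx.range_pop()

def process_new_tokens(new_tokens: torch.Tensor) -> torch.Tensor:
    with nvtx_range("ad_executor/process_new_tokens"):
        # Placeholder for the GPU-side token handling being profiled.
        return new_tokens.clone()
```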
```python
        max_num_tokens=max_num_tokens,
    )

    print(" in seq_info for device: ", torch.cuda.current_device())
```
Copilot AI · Jul 27, 2025
This debug print statement appears to be leftover debugging code and should be removed (or routed through the logger) before merging.
Suggested change:

```diff
- print(" in seq_info for device: ", torch.cuda.current_device())
+ ad_logger.info(f"in seq_info for device: {torch.cuda.current_device()}")
```
```python
self.input_ids[self.input_ids == -1] = new_tokens[0, previous_batch_indices, 0]
```
Copilot AI · Jul 27, 2025
This indexing assumes `new_tokens` has at least 3 dimensions and that every entry of `previous_batch_indices` is a valid index into its second dimension, but there is no bounds checking. If either assumption fails, this line raises a runtime error.
Suggested change:

```diff
- self.input_ids[self.input_ids == -1] = new_tokens[0,previous_batch_indices,0]
+ # Validate new_tokens dimensions
+ if new_tokens.dim() < 3:
+     raise ValueError(f"new_tokens must have at least 3 dimensions, but got {new_tokens.dim()}.")
+ # Validate previous_batch_indices
+ max_index = new_tokens.size(1) - 1
+ if any(idx < 0 or idx > max_index for idx in previous_batch_indices):
+     raise IndexError(f"previous_batch_indices contains out-of-bounds indices for new_tokens' second dimension (valid range: 0 to {max_index}).")
+ self.input_ids[self.input_ids == -1] = new_tokens[0, previous_batch_indices, 0]
```
```python
self.input_ids = torch.ones(self.max_num_tokens, dtype=torch.int, device=self.device)
self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)
self.seq_len = torch.empty(self.max_batch_size, dtype=torch.int)
self.input_pos = torch.empty_like(self.seq_len)
```
Copilot AI · Jul 27, 2025
The first tensor is created with `device=self.device`, but `self.device` is a property that reads `self.input_pos.device`. At this point in `__post_init__`, `input_pos` has not been initialized yet, so accessing the property raises an `AttributeError`. Reordering the assignments so `input_pos` is created first fixes this.
Suggested change:

```diff
- self.input_ids = torch.ones(self.max_num_tokens, dtype=torch.int, device=self.device)
- self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)
- self.seq_len = torch.empty(self.max_batch_size, dtype=torch.int)
- self.input_pos = torch.empty_like(self.seq_len)
+ self.seq_len = torch.empty(self.max_batch_size, dtype=torch.int)
+ self.input_pos = torch.empty_like(self.seq_len)
+ self.input_ids = torch.ones(self.max_num_tokens, dtype=torch.int, device=self.device)
+ self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)
```
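The ordering issue can be reproduced in isolation with a minimal sketch (class and field names are hypothetical, mirroring only the pattern the comment describes): a `device` property backed by `input_pos` must be initialized before any tensor that reads it.

```python
import dataclasses

import torch

@dataclasses.dataclass
class SeqInfo:
    max_batch_size: int = 4
    max_num_tokens: int = 16

    def __post_init__(self):
        # Create the tensor backing the `device` property first, then the
        # tensors that consume it (the reorder suggested above).
        self.seq_len = torch.empty(self.max_batch_size, dtype=torch.int)
        self.input_pos = torch.empty_like(self.seq_len)
        self.input_ids = torch.ones(self.max_num_tokens, dtype=torch.int, device=self.device)
        self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)

    @property
    def device(self) -> torch.device:
        # Raises AttributeError if read before input_pos exists.
        return self.input_pos.device
```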
Labels: pytorch, AutoDeploy