[Model Runner V2] Fix seq_lens_cpu_upper_bound #42202
Code Review
This pull request refactors the management of the CPU mirror for computed tokens to prevent divergence in Multi-Token Prediction (MTP) scenarios. Key changes include moving the optimistic increment of num_computed_tokens_np to prepare_inputs, refreshing these values in update_requests, and consolidating prefill token updates into a new helper method. Feedback was provided regarding a potential performance bottleneck in the update_requests loop, where per-request dictionary lookups and scalar assignments could be optimized through vectorization in the future.
    for req_id, num_computed_tokens, req_new_block_ids in zip(
        reqs.req_ids, reqs.num_computed_tokens, reqs.new_block_ids
    ):
        req_index = self.req_states.req_id_to_index[req_id]
        num_computed_tokens_np[req_index] = num_computed_tokens
The loop in update_requests now performs a dictionary lookup (req_id_to_index) and a scalar assignment to num_computed_tokens_np for every cached request. While this is necessary to refresh the CPU mirror from the scheduler's state, it could be a performance bottleneck if the number of cached requests is very large. Consider if this can be vectorized in the future, although the current implementation is correct for fixing the divergence issue.
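One possible shape for the vectorization mentioned above, as a rough sketch only: the function name `refresh_num_computed_tokens` and its flat argument list are hypothetical and not part of the PR. It assumes `num_computed_tokens_np` is a NumPy integer array and `req_id_to_index` is a plain dict. The per-request dictionary lookups still happen, but the scalar assignments collapse into a single indexed write.

```python
import numpy as np


def refresh_num_computed_tokens(
    num_computed_tokens_np, req_id_to_index, req_ids, num_computed_tokens
):
    """Overwrite the CPU mirror with the scheduler's values in one indexed write.

    Hypothetical helper for illustration; the argument names mirror the
    identifiers in the reviewed loop but this is not the PR's code.
    """
    if not req_ids:
        return
    # The dict lookups remain per-request; only the array write is batched.
    indices = np.fromiter(
        (req_id_to_index[req_id] for req_id in req_ids),
        dtype=np.int64,
        count=len(req_ids),
    )
    num_computed_tokens_np[indices] = num_computed_tokens


# Toy usage: two cached requests living in slots 2 and 0.
mirror = np.zeros(4, dtype=np.int32)
refresh_num_computed_tokens(mirror, {"a": 2, "b": 0}, ["a", "b"], [7, 3])
```

Whether this is faster than the plain loop depends on the batch size; for a handful of cached requests the array construction overhead may outweigh the saving, so it would need measuring before adoption.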
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Follow-on from #40654.

`num_computed_tokens_np` was only ever incremented by `num_scheduled_tokens` each step and so would diverge indefinitely for MTP. It should be refreshed each step with the adjusted value from the scheduler; there is no need to increment on the model runner side.

Also:
- `num_computed_tokens_np` assignment in `add_request`
- `computed_prefill_cpu` update logic; move from postprocess to `update_requests`
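The divergence described above can be illustrated with a toy model. This is a sketch under stated assumptions, not vLLM code: `simulate_mirrors` is a hypothetical function, it assumes a fixed number of tokens is scheduled per step, and it assumes the scheduler's authoritative count excludes the speculative tokens rejected in that step.

```python
def simulate_mirrors(rejected_per_step, num_scheduled_tokens=4):
    """Compare an increment-only mirror with one refreshed from the scheduler.

    Toy model only: each step schedules `num_scheduled_tokens` tokens and the
    (hypothetical) scheduler discounts the rejected speculative tokens.
    """
    scheduler_value = 0   # authoritative count after adjusting for rejections
    increment_only = 0    # mirror that is only ever incremented
    drift_increment_only = []
    drift_refreshed = []
    for rejected in rejected_per_step:
        # The old behaviour: always advance by the scheduled token count.
        increment_only += num_scheduled_tokens
        # The scheduler's view: rejected draft tokens do not count as computed.
        scheduler_value += num_scheduled_tokens - rejected
        # The refreshed mirror is simply overwritten with the scheduler's value.
        refreshed = scheduler_value
        drift_increment_only.append(increment_only - scheduler_value)
        drift_refreshed.append(refreshed - scheduler_value)
    return drift_increment_only, drift_refreshed


drift_old, drift_new = simulate_mirrors([1, 2, 0, 3])
```

In this model the increment-only drift equals the running total of rejected tokens, so it never shrinks, while the refreshed mirror matches the scheduler after every step.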