-
Notifications
You must be signed in to change notification settings - Fork 941
[Core] Support Async & Sync AutoRegressive Scheduling #3306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
d6161c5
3310835
0c6076f
9447e63
96f4d71
370d019
7e17a5d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -562,6 +562,7 @@ def get_kv_transfer_metadata( | |
| *, | ||
| num_computed_tokens: int | None = None, | ||
| ) -> dict[str, Any] | None: | ||
| # NOTE: num_computed_tokens will not include async placeholders | ||
| meta = self._ropes_metadata.pop(req_id, None) | ||
| if meta is None: | ||
| return None | ||
|
|
@@ -854,7 +855,7 @@ def _adjust_positions_for_img2img( | |
| { | ||
| "ropes": [rope], | ||
| "image_shape": [img_H, img_W], | ||
| "prefill_position_count": int(end - start), | ||
| "prefill_position_count": req_len, | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am a bit confused. So your mean is that with this modification, bagel can work correctly under async scheduling?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And if so, we can revert the threshold from 10 to 5 and test it again. Thanks
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yup, we should be able to use 5 now, just pushed to reduce it. This line is just a refactor, since Basically the problem was conflating token counts with positions in RoPE. In this case, the [0, 0, 0, ..., 0, 0, 0, 0, 1, 1, 1, 1, ..., 1, 1, 2, 3, 4, 5, 6, 7, 8, 9] # len is ~7600because visual components share the same position. So then for this example, In the previous code, if an async placeholder was let through, we'd get something like but With the fix the on model side of things, letting an async placeholder instead looks like a decode, so it would overwrite There are some things that still look a little weird to me in the AR part of Bagel, but at least positions now match synchronous so results are fine, so I think this PR should be fine now |
||
| } | ||
| ) | ||
| img2img_idx += 1 | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -313,10 +313,12 @@ def execute_model( | |
| if finished_reqs and hasattr(self.model, "get_kv_transfer_metadata"): | ||
| for req_id, data in finished_reqs.items(): | ||
| try: | ||
| req_idx = self.input_batch.req_id_to_index.get(req_id) | ||
| num_computed = ( | ||
| int(self.input_batch.num_computed_tokens_cpu[req_idx]) if req_idx is not None else None | ||
| ) | ||
| # NOTE: seq_len is the same as num_computed_tokens_cpu in current | ||
| # async scheduling, since both exclude async placeholders. We use | ||
| # seq_len since we control it, just in case upstream async scheduler | ||
| # semantics change in the future. | ||
| num_computed = data.get("seq_len") | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Mostly for safety since this is what is set when we mark the request for transfer |
||
|
|
||
| model_meta = self.model.get_kv_transfer_metadata( | ||
| req_id, | ||
| num_computed_tokens=num_computed, | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed since we can just get this from inheritance https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/async_scheduler.py#L37