Conversation

@fzyzcjy (Collaborator) commented Mar 23, 2025

Motivation

Currently, when submitting requests (e.g. via engine.generate or an HTTP call), we have no control over which requests will be run together in a single batch and which will not, partly because of the intrinsic nondeterminism of IPC. However, in some scenarios it would be useful to have more control. For example:

  • Benchmarking and profiling (e.g. we want to know the behavior when having exactly "1024 token x 8 req per GPU"; this is the primary reason for this PR)
  • Testing (e.g. for two-batch overlap, we may want to test that when one card has 2 requests while another card has 1 request, the feature is disabled)

Thus, this PR adds this feature. Since it is only used for benchmarking or testing, the code is not efficient (e.g. it makes torch.distributed calls that could be reduced to some extent) and may have rough edges.
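
To make the motivation concrete, below is a minimal sketch of the kind of benchmark this feature targets, using the existing sglang offline engine API. The model path is a placeholder, and the PR's actual control mechanism is not shown here; without it, the scheduler is free to split these requests across batches, which is exactly the nondeterminism described above.

```python
# A minimal sketch (not code from this PR) of the benchmark scenario that
# motivates it: we want exactly 8 requests of ~1024 tokens each to run as
# one batch on a GPU ("1024 token x 8 req per GPU").
import sglang as sgl

# Placeholder model path; any served model works for this illustration.
llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

# Eight prompts of roughly 1024 tokens each.
prompts = [" hello" * 1024 for _ in range(8)]

# Submitting them in one call makes co-batching likely but not guaranteed;
# the feature in this PR is meant to make the batch composition deterministic.
outputs = llm.generate(
    prompts,
    sampling_params={"max_new_tokens": 1, "temperature": 0.0},
)
for out in outputs:
    print(out["text"])

llm.shutdown()
```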

Modifications

Checklist

@fzyzcjy requested a review from merrymercy as a code owner on March 23, 2025, 12:17
@fzyzcjy (Collaborator, Author) commented Apr 1, 2025

Ping me when this PR is about to be merged. Currently I have only resolved the conflicts in #4068, and I will port the conflict resolution back here when pinged.

@fzyzcjy mentioned this pull request on Apr 11, 2025
@fzyzcjy closed this on Jul 11, 2025
@fzyzcjy reopened this on Jul 11, 2025
@fzyzcjy closed this on Jul 12, 2025