[Bugfix][P/D] Fix throughput stats in disaggregated setup #27569
Changes from all commits: 45f732f, 96f91ba, d4fc218, d038f1c, a1b82fb, 4d2bacb, f0f16c0
```diff
@@ -121,6 +121,8 @@ class EngineCoreOutput(
     trace_headers: Mapping[str, str] | None = None
     # The number of tokens with prefix cache hits.
     num_cached_tokens: int = 0
+    # The number of tokens that have been computed remotely.
+    num_external_computed_tokens: int = 0

     # The number of NaNs in logits.
     # A value greater than 0 indicates that the output is corrupted.
```

Member: I'd be tempted to refactor these two into a …

Author (Collaborator): I don't have a strong opinion on this tbh; we can probably wait until we have a few more things to bundle before executing the suggestion.
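The reviewer's suggestion is cut off in this capture, but one plausible reading is bundling the two counters into a small struct. A minimal sketch of that idea; `CachedTokensStats` is a hypothetical name for illustration only, not anything in the vLLM codebase:

```python
from dataclasses import dataclass


@dataclass
class CachedTokensStats:
    """Hypothetical container (not in vLLM) bundling the two counters above."""

    # Prompt tokens served from the local prefix cache.
    num_cached_tokens: int = 0
    # Prompt tokens whose KV cache was computed remotely, e.g. on the
    # prefill worker of a P/D disaggregated deployment.
    num_external_computed_tokens: int = 0
```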
```diff
@@ -121,7 +121,7 @@ def _reset(self, now):

     def _track_iteration_stats(self, iteration_stats: IterationStats):
         # Save tracked stats for token counters.
-        self.num_prompt_tokens += iteration_stats.num_prompt_tokens
+        self.num_prompt_tokens += iteration_stats.num_local_prompt_tokens
         self.num_generation_tokens += iteration_stats.num_generation_tokens
         self.num_corrupted_reqs += iteration_stats.num_corrupted_reqs
```

Member: Presumably you want to update the Prometheus metric too?

Author (Collaborator): @markmc which one? I intentionally left …

NickLucche marked this conversation as resolved.
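For context on why this one-line change fixes the reported numbers: the logging path divides the accumulated prompt-token counter by the elapsed interval, so a decode worker that credits itself with the full prompt length also counts tokens that were prefilled elsewhere, inflating its logged throughput. A toy sketch of that pattern, assuming interval-based averaging as in the code above; names and structure are illustrative, not vLLM's:

```python
import time


class ToyStatsLogger:
    """Toy interval logger illustrating the throughput calculation."""

    def __init__(self) -> None:
        self.num_prompt_tokens = 0
        self.last_log_time = time.monotonic()

    def track_iteration(self, prompt_len: int, num_external: int) -> None:
        # The fix: credit this worker only with tokens it computed locally,
        # instead of the full prompt length.
        self.num_prompt_tokens += prompt_len - num_external

    def log(self) -> None:
        now = time.monotonic()
        elapsed = now - self.last_log_time
        rate = self.num_prompt_tokens / elapsed if elapsed > 0 else 0.0
        print(f"Avg prompt throughput: {rate:.1f} tokens/s")
        self.num_prompt_tokens = 0
        self.last_log_time = now
```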
```diff
@@ -220,6 +220,8 @@ def __init__(self):
         self.num_generation_tokens = 0
         self.num_prompt_tokens = 0
         self.num_preempted_reqs = 0
+        # Num of prompt tokens that have been computed locally.
+        self.num_local_prompt_tokens = 0
         self.finished_requests: list[FinishedRequestStats] = []
         self.max_num_generation_tokens_iter: list[int] = []
         self.n_params_iter: list[int] = []
```

Member: Is the naming here a bit confusing? By "computed locally" here we mean both computed and locally cached? If you just tracked …

Author (Collaborator): Yes, the behavior is unchanged: cached tokens would still result in higher throughput, even in a regular aggregated setup. Looking at the diff, I think it's pretty clear that I just want to rule out the remote tokens, i.e. I assume this semantic was the intended one from the beginning; it's just that "local" used to be redundant.
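To make the semantics under discussion concrete, a small illustrative example (my reading of the thread; the numbers are made up): only remotely computed tokens are excluded, so local prefix-cache hits still count toward local throughput, exactly as they did before this PR.

```python
prompt_len = 1000
num_cached_tokens = 200             # local prefix-cache hits
num_external_computed_tokens = 700  # prefilled on the remote worker

# Only the remote tokens are ruled out; cached tokens still count,
# preserving the pre-existing behavior of the aggregated setup.
num_local_prompt_tokens = prompt_len - num_external_computed_tokens
assert num_local_prompt_tokens == 300
```

The next hunk implements exactly this subtraction.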
```diff
@@ -250,6 +252,9 @@ def update_from_output(
         self.num_generation_tokens += num_new_generation_tokens
         if is_prefilling:
             self.num_prompt_tokens += prompt_len
+            self.num_local_prompt_tokens += (
+                prompt_len - output.num_external_computed_tokens
+            )

             first_token_latency = self._time_since(req_stats.arrival_time)
             self.time_to_first_tokens_iter.append(first_token_latency)
```

NickLucche marked this conversation as resolved.
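A toy trace of how the two counters diverge on a decode worker (illustrative values, not code from the PR): `num_prompt_tokens` keeps its old meaning, while the new `num_local_prompt_tokens` becomes the numerator of the logged throughput.

```python
# (prompt_len, num_external_computed_tokens) for two prefilling requests;
# the second one was fully prefilled on the remote worker.
requests = [(512, 480), (2048, 2048)]

num_prompt_tokens = 0
num_local_prompt_tokens = 0
for prompt_len, num_external in requests:
    num_prompt_tokens += prompt_len
    num_local_prompt_tokens += prompt_len - num_external

print(num_prompt_tokens)        # 2560: the old, inflated numerator
print(num_local_prompt_tokens)  # 32: tokens this worker actually computed
```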
Follow-up review thread:

Comment: Yeah, this comment looks incorrect ... assuming "prefix cache" refers to the local cache?

Reply: I am not familiar with it, cc @chaunceyjiang