Conversation

Contributor

@ajcasagrande ajcasagrande commented Oct 2, 2025

Add formulas for each metric on the main README.
Add separate metric docs to explain everything in detail.

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive Metrics Reference covering Streaming, Token-Based, Reasoning, and General metrics, with formulas, notes, and examples.
    • Introduced a new metrics_reference page with classifications (Record, Aggregate, Derived), a Quick Reference, and detailed Metric Flags definitions.
    • Linked the Metrics Reference from the README navigation for easier access.
    • Performed minor README formatting cleanup.


coderabbitai bot commented Oct 2, 2025

Walkthrough

Adds a comprehensive Metrics Reference document and links it into the README. Updates README navigation, inserts the Metrics Reference section twice, and adjusts whitespace around INSTALLATION. Introduces docs/metrics_reference.md detailing metric categories, formulas, flags, and examples.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| README updates (`README.md`) | Added Metrics Reference link in navigation; inserted two Metrics Reference blocks; minor whitespace tweak around INSTALLATION. |
| Metrics documentation (`docs/metrics_reference.md`) | New, detailed AIPerf metrics reference covering categories, formulas, dependencies, notes, quick reference, and metric flags. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I thump my paws—new metrics bloom,
Tables, flags, and formulas zoom.
Two echoes in README—double cheer!
A burrow of docs now crystal-clear.
With whiskers twitching, I hop and see—
Benchmarks aligned, as sweet as tea. 🐇📈

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title succinctly describes the addition of comprehensive metrics documentation and aligns with both the updates to the main README and the new detailed metrics reference file, making the scope and intent of the change clear to reviewers. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |



codecov bot commented Oct 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (4)
docs/metrics_reference.md (3)

173-176: Ensure ITL is in seconds for this inverse relationship.

The formula assumes inter_token_latency_seconds; make sure the prior section defines ITL in seconds (not ns/ms) to keep this correct.
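As a quick illustration of why the unit matters (a hedged sketch with made-up numbers, not project code):

```python
# If ITL is stored in seconds, per-user throughput is simply its reciprocal.
inter_token_latency_seconds = 0.025                       # 25 ms per token
throughput_per_user = 1.0 / inter_token_latency_seconds   # 40.0 tokens/sec/user

# If ITL were kept in milliseconds instead, the conversion must be explicit.
inter_token_latency_ms = 25.0
throughput_per_user = 1000.0 / inter_token_latency_ms     # still 40.0 tokens/sec/user
```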


238-246: Filter to valid records in aggregate sums.

Exclude failed/invalid records from totals to match the description.

Apply this diff:

-total_output_tokens = sum(output_token_count for record in records)
+total_output_tokens = sum(r.output_token_count for r in records if r.valid)
-total_osl = sum(output_sequence_length for record in records)
+total_osl = sum(r.output_sequence_length for r in records if r.valid)
-total_isl = sum(input_sequence_length for record in records)
+total_isl = sum(r.input_sequence_length for r in records if r.valid)

Also applies to: 254-260, 268-275


338-341: Use consistent variable naming with request.start_perf_ns.

Align with earlier formulas to avoid ambiguity.

Apply this diff:

-request_latency = responses[-1].perf_ns - start_perf_ns
+request_latency_ns = responses[-1].perf_ns - request.start_perf_ns
README.md (1)

280-283: Grammar nit: “single values” → “single value”.

Minor text cleanup for clarity.

Apply this diff:

-> [!IMPORTANT]
-> This metric is computed as a single values across all requests, and it includes the TTFT in the equation, so it is **not** directly comparable to the [Output Token Throughput Per User](#output-token-throughput-per-user) metric.
+> [!IMPORTANT]
+> This metric is computed as a single value across all requests and includes TTFT in the equation, so it is **not** directly comparable to the [Output Token Throughput Per User](#output-token-throughput-per-user) metric.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d056b9e and c25567e.

📒 Files selected for processing (2)
  • README.md (2 hunks)
  • docs/metrics_reference.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/metrics_reference.md

54-54: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)


67-67: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)


80-80: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build (ubuntu-latest, 3.10)
🔇 Additional comments (1)
README.md (1)

15-15: Navigation update LGTM.

Good addition of Metrics Reference link.

Comment on lines +54 to +63
#### Example Metrics
`request_latency`, `ttft`, `inter_token_latency`, `output_token_count`, `input_sequence_length`

#### Dependencies
Record Metrics can depend on raw request/response data and other Record Metrics from the same request.

#### Example Scenario
`request_latency` measures the time for each individual request from start to final response. If you send 100 requests, you get 100 latency values that form a distribution showing how latency varies across requests.

## Aggregate Metrics

@coderabbitai coderabbitai bot Oct 2, 2025


⚠️ Potential issue | 🟡 Minor

Fix heading level increments (MD001).

Use h3 (###) under h2 sections to satisfy markdownlint and improve structure.

Apply this diff:

-#### Example Metrics
+### Example Metrics
-#### Dependencies
+### Dependencies
-#### Example Scenario
+### Example Scenario

Repeat the same replacements for the Aggregate Metrics and Derived Metrics sections at the corresponding lines. Based on static analysis hints

Also applies to: 67-75, 80-88

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

54-54: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)

🤖 Prompt for AI Agents
In docs/metrics_reference.md around lines 54-63 (and likewise for the other
affected ranges 67-75 and 80-88), heading levels increment incorrectly (MD001);
change the h4 headings (####) under the h2 section to h3 (###) so subsections
use h3 under the parent h2, and apply the same replacement for the "Aggregate
Metrics" and "Derived Metrics" sections at the indicated line ranges to satisfy
markdownlint and restore proper document hierarchy.

Contributor Author


I felt the sizing was just too big when it was only h3 and not h4


@ajcasagrande, I understand your preference for the smaller visual sizing that h4 provides over h3. Since this is a deliberate design choice for better readability in your documentation, you have a few options:

  1. Keep the h4 headings as you prefer and suppress the markdownlint rule by adding a comment at the top of the file:

    <!-- markdownlint-disable MD001 -->
  2. Configure markdownlint to disable MD001 globally in your .markdownlint.json or .markdownlint.yaml:

    {
      "MD001": false
    }
  3. Use custom CSS (if your documentation renderer supports it) to reduce h3 font size while maintaining proper heading hierarchy.

Option 1 or 2 would let you keep your preferred visual hierarchy while silencing the linter warnings.


✏️ Learnings added
Learnt from: ajcasagrande
PR: ai-dynamo/aiperf#321
File: docs/metrics_reference.md:54-63
Timestamp: 2025-10-02T19:31:57.827Z
Learning: In the aiperf repository's docs/metrics_reference.md file, the maintainer prefers using h4 headings (####) for subsections under h2 headings instead of h3 (###) for better visual sizing and readability, even though this violates markdownlint rule MD001.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines +111 to +115
**Formula:**
```python
ttft = responses[0].perf_ns - request.start_perf_ns
```


⚠️ Potential issue | 🟠 Major

Clarify units and convert from ns to ms/seconds in formulas.

Formulas use perf_ns, but these are commonly displayed as ms and drive per-user throughput in seconds. Make conversions explicit to avoid confusion.

Apply this diff:

-ttft = responses[0].perf_ns - request.start_perf_ns
+# nanoseconds → milliseconds
+ttft_ms = (responses[0].perf_ns - request.start_perf_ns) / 1e6
-ttst = responses[1].perf_ns - responses[0].perf_ns
+# nanoseconds → milliseconds
+ttst_ms = (responses[1].perf_ns - responses[0].perf_ns) / 1e6
-inter_token_latency = (request_latency - ttft) / (output_sequence_length - 1)
+# Assuming request_latency_ns and ttft_ns in nanoseconds:
+# ns → seconds, then divide by tokens for sec/token
+inter_token_latency_seconds = ((request_latency_ns - ttft_ns) / 1e9) / (output_sequence_length - 1)

Also applies to: 125-129, 139-146

🤖 Prompt for AI Agents
In docs/metrics_reference.md around lines 111-115 (and also apply same changes
to 125-129 and 139-146), the formulas use perf_ns (nanoseconds) but do not state
units or perform conversions; update each formula and surrounding text to
explicitly convert perf_ns to milliseconds (ms) and seconds where appropriate
(e.g., divide nanoseconds by 1e6 for ms and by 1e9 for seconds), and annotate
the formula lines and examples to show both the raw perf_ns expression and the
converted values (e.g., ttft_ms = (responses[0].perf_ns - request.start_perf_ns)
/ 1e6 and ttft_s = (responses[0].perf_ns - request.start_perf_ns) / 1e9), making
clear which unit is used for downstream throughput calculations (per-second
rates should use seconds).
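One possible shape for those explicit conversions (helper names here are illustrative, not the actual AIPerf code):

```python
NANOS_PER_MILLISECOND = 1_000_000
NANOS_PER_SECOND = 1_000_000_000

def ns_to_ms(duration_ns: int) -> float:
    """Convert a perf-counter duration from nanoseconds to milliseconds."""
    return duration_ns / NANOS_PER_MILLISECOND

def ns_to_seconds(duration_ns: int) -> float:
    """Convert a perf-counter duration from nanoseconds to seconds."""
    return duration_ns / NANOS_PER_SECOND

# Example: ttft_ms = ns_to_ms(responses[0].perf_ns - request.start_perf_ns)
```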

Comment on lines +199 to +203
**Notes:**
- For streaming requests with multiple responses, the responses are joined together and then tokens are counted.
- For models that support reasoning tokens, this metric counts only the non-reasoning output tokens.
- This **will** count tokens inside of the `<think>` tags, if they are present in the `content` field of the response.


⚠️ Potential issue | 🟡 Minor

Resolve contradiction about reasoning token exclusion vs tokens counted.

Clarify that this metric excludes reasoning tokens only when they are exposed separately; if reasoning appears inside the `content` field (e.g., `<think>` blocks), those tokens will be counted unless filtered.

Apply this diff to Notes:

-- For models that support reasoning tokens, this metric counts only the non-reasoning output tokens.
-- This **will** count tokens inside of the `<think>` tags, if they are present in the `content` field of the response.
+- For models that expose reasoning in a separate `reasoning_content` field, this metric counts only non‑reasoning output tokens.
+- If reasoning appears inside the regular `content` (e.g., `<think>` blocks), those tokens will be counted unless explicitly filtered.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
**Notes:**
- For streaming requests with multiple responses, the responses are joined together and then tokens are counted.
- For models that support reasoning tokens, this metric counts only the non-reasoning output tokens.
- This **will** count tokens inside of the `<think>` tags, if they are present in the `content` field of the response.
**Notes:**
- For streaming requests with multiple responses, the responses are joined together and then tokens are counted.
- For models that expose reasoning in a separate `reasoning_content` field, this metric counts only non-reasoning output tokens.
- If reasoning appears inside the regular `content` (e.g., `<think>` blocks), those tokens will be counted unless explicitly filtered.
🤖 Prompt for AI Agents
In docs/metrics_reference.md around lines 199 to 203, the Notes section contains
a contradictory statement about reasoning token exclusion vs counting tokens
inside <think> tags; update the Notes to clearly state that reasoning tokens are
excluded only when the model exposes them separately (e.g., a distinct reasoning
token stream or field), and that any reasoning text embedded in the response
content (for example inside <think> tags) will be counted toward this metric
unless the content is pre-filtered; replace the existing three bullets with a
concise set that conveys both behaviors and an explicit recommendation to filter
out embedded reasoning if those tokens should not be counted.
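If embedded reasoning should not be counted, one possible filtering approach (a hedged sketch; the helper is hypothetical and assumes a Hugging Face-style tokenizer):

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def count_output_tokens(content: str, tokenizer) -> int:
    """Count non-reasoning output tokens, dropping any <think> blocks embedded in content."""
    filtered = THINK_BLOCK.sub("", content)
    return len(tokenizer.encode(filtered, add_special_tokens=False))
```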

Comment on lines +368 to +372
**Formula:**
```python
request_count = sum(1 for record if record.valid)
```


⚠️ Potential issue | 🟠 Major

Fix generator syntax and scope in formulas.

Missing “in records” makes these invalid; also keep naming consistent.

Apply this diff:

-request_count = sum(1 for record if record.valid)
+request_count = sum(1 for r in records if r.valid)
-error_request_count = sum(1 for record if not record.valid)
+error_request_count = sum(1 for r in records if not r.valid)

Also applies to: 379-383

🤖 Prompt for AI Agents
In docs/metrics_reference.md around lines 368-372 (and also 379-383), the Python
generator expressions are missing the iteration source and have inconsistent
variable naming; update the formulas to include "for record in records" (e.g.,
sum(1 for record in records if record.valid)) and ensure metric variable names
are consistent across examples (use the same name like request_count where
applicable).

Comment on lines +173 to +177
| [**Time to First Token (TTFT)**](docs/metrics_reference.md#time-to-first-token-ttft) | `ttft` | `responses[0].perf_ns - request.start_perf_ns` | `ms` |
| [**Time to Second Token (TTST)**](docs/metrics_reference.md#time-to-second-token-ttst) | `ttst` | `responses[1].perf_ns - responses[0].perf_ns` | `ms` |
| [**Inter Token Latency (ITL)**](docs/metrics_reference.md#inter-token-latency-itl) | `inter_token_latency` | `(request_latency - ttft) / (output_sequence_length - 1)` | `ms` |
| [**Inter Chunk Latency (ICL)**](docs/metrics_reference.md#inter-chunk-latency-icl) | `inter_chunk_latency` | `[responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))]` | `ms` |
| [**Output Token Throughput Per User**](docs/metrics_reference.md#output-token-throughput-per-user) | `output_token_throughput_per_user` | `1.0 / inter_token_latency_seconds` | `tokens/sec/user` |

⚠️ Potential issue | 🟠 Major

Convert perf_ns formulas to ms and clarify ITL units.

Table shows ms, but formulas are in ns. Adjust for correctness.

Apply this diff:

-| [**Time to First Token (TTFT)**](docs/metrics_reference.md#time-to-first-token-ttft) | `ttft` | `responses[0].perf_ns - request.start_perf_ns` | `ms` |
+| [**Time to First Token (TTFT)**](docs/metrics_reference.md#time-to-first-token-ttft) | `ttft_ms` | `(responses[0].perf_ns - request.start_perf_ns) / 1e6` | `ms` |
-| [**Time to Second Token (TTST)**](docs/metrics_reference.md#time-to-second-token-ttst) | `ttst` | `responses[1].perf_ns - responses[0].perf_ns` | `ms` |
+| [**Time to Second Token (TTST)**](docs/metrics_reference.md#time-to-second-token-ttst) | `ttst_ms` | `(responses[1].perf_ns - responses[0].perf_ns) / 1e6` | `ms` |
-| [**Inter Token Latency (ITL)**](docs/metrics_reference.md#inter-token-latency-itl) | `inter_token_latency` | `(request_latency - ttft) / (output_sequence_length - 1)` | `ms` |
+| [**Inter Token Latency (ITL)**](docs/metrics_reference.md#inter-token-latency-itl) | `inter_token_latency_seconds` | `((request_latency_ns - ttft_ns) / 1e9) / (output_sequence_length - 1)` | `sec` |

Note: Keeping ITL in seconds matches the per‑user throughput row. If you prefer ms, multiply by 1e3 and rename accordingly.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| [**Time to First Token (TTFT)**](docs/metrics_reference.md#time-to-first-token-ttft) | `ttft` | `responses[0].perf_ns - request.start_perf_ns` | `ms` |
| [**Time to Second Token (TTST)**](docs/metrics_reference.md#time-to-second-token-ttst) | `ttst` | `responses[1].perf_ns - responses[0].perf_ns` | `ms` |
| [**Inter Token Latency (ITL)**](docs/metrics_reference.md#inter-token-latency-itl) | `inter_token_latency` | `(request_latency - ttft) / (output_sequence_length - 1)` | `ms` |
| [**Inter Chunk Latency (ICL)**](docs/metrics_reference.md#inter-chunk-latency-icl) | `inter_chunk_latency` | `[responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))]` | `ms` |
| [**Output Token Throughput Per User**](docs/metrics_reference.md#output-token-throughput-per-user) | `output_token_throughput_per_user` | `1.0 / inter_token_latency_seconds` | `tokens/sec/user` |
| [**Time to First Token (TTFT)**](docs/metrics_reference.md#time-to-first-token-ttft) | `ttft_ms` | `(responses[0].perf_ns - request.start_perf_ns) / 1e6` | `ms` |
| [**Time to Second Token (TTST)**](docs/metrics_reference.md#time-to-second-token-ttst) | `ttst_ms` | `(responses[1].perf_ns - responses[0].perf_ns) / 1e6` | `ms` |
| [**Inter Token Latency (ITL)**](docs/metrics_reference.md#inter-token-latency-itl) | `inter_token_latency_seconds` | `((request_latency_ns - ttft_ns) / 1e9) / (output_sequence_length - 1)` | `sec` |
| [**Inter Chunk Latency (ICL)**](docs/metrics_reference.md#inter-chunk-latency-icl) | `inter_chunk_latency` | `[responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))]` | `ms` |
| [**Output Token Throughput Per User**](docs/metrics_reference.md#output-token-throughput-per-user) | `output_token_throughput_per_user` | `1.0 / inter_token_latency_seconds` | `tokens/sec/user` |
🤖 Prompt for AI Agents
In README.md lines 173-177 update the formulas to convert perf_ns (nanoseconds)
to milliseconds where the table unit is ms, and clarify the ITL row to use
seconds to match the throughput row: divide any perf_ns differences by 1e6 to
produce ms for TTFT, TTST and the Inter Chunk Latency list, and change the Inter
Token Latency formula to compute seconds by dividing the nanosecond result by
1e9 (or explicitly compute (request_latency_ns -
ttft_ns)/(output_sequence_length-1)/1e9) and update that row's units to s (or
rename the metric to inter_token_latency_seconds) so the units are consistent
with the output_token_throughput_per_user row.

Comment on lines +208 to +215
| [**Request Latency**](docs/metrics_reference.md#request-latency) | `request_latency` | `responses[-1].perf_ns - start_perf_ns` | `ms` |
| [**Request Throughput**](docs/metrics_reference.md#request-throughput) | `request_throughput` | `request_count / benchmark_duration_seconds` | `requests/sec` |
| [**Request Count**](docs/metrics_reference.md#request-count) | `request_count` | `sum(1 for record if record.valid)` | `requests` |
| [**Error Request Count**](docs/metrics_reference.md#error-request-count) | `error_request_count` | `sum(1 for record if not record.valid)` | `requests` |
| [**Minimum Request Timestamp**](docs/metrics_reference.md#minimum-request-timestamp) | `min_request_timestamp` | `min(timestamp_ns for record in records)` | `datetime` |
| [**Maximum Response Timestamp**](docs/metrics_reference.md#maximum-response-timestamp) | `max_response_timestamp` | `max(timestamp_ns + request_latency for record in records)` | `datetime` |
| [**Benchmark Duration**](docs/metrics_reference.md#benchmark-duration) | `benchmark_duration` | `max_response_timestamp - min_request_timestamp` | `sec` |


⚠️ Potential issue | 🟠 Major

Fix Request Latency units; correct timestamp formulas and units.

Ensure ms conversion; avoid mixing clocks; make benchmark duration seconds explicit.

Apply this diff:

-| [**Request Latency**](docs/metrics_reference.md#request-latency) | `request_latency` | `responses[-1].perf_ns - start_perf_ns` | `ms` |
+| [**Request Latency**](docs/metrics_reference.md#request-latency) | `request_latency_ms` | `(responses[-1].perf_ns - request.start_perf_ns) / 1e6` | `ms` |
-| [**Request Count**](docs/metrics_reference.md#request-count) | `request_count` | `sum(1 for record if record.valid)` | `requests` |
+| [**Request Count**](docs/metrics_reference.md#request-count) | `request_count` | `sum(1 for r in records if r.valid)` | `requests` |
-| [**Error Request Count**](docs/metrics_reference.md#error-request-count) | `error_request_count` | `sum(1 for record if not record.valid)` | `requests` |
+| [**Error Request Count**](docs/metrics_reference.md#error-request-count) | `error_request_count` | `sum(1 for r in records if not r.valid)` | `requests` |
-| [**Minimum Request Timestamp**](docs/metrics_reference.md#minimum-request-timestamp) | `min_request_timestamp` | `min(timestamp_ns for record in records)` | `datetime` |
+| [**Minimum Request Timestamp**](docs/metrics_reference.md#minimum-request-timestamp) | `min_request_timestamp_ns` | `min(r.request_timestamp_ns for r in records)` | `ns` |
-| [**Maximum Response Timestamp**](docs/metrics_reference.md#maximum-response-timestamp) | `max_response_timestamp` | `max(timestamp_ns + request_latency for record in records)` | `datetime` |
+| [**Maximum Response Timestamp**](docs/metrics_reference.md#maximum-response-timestamp) | `max_response_timestamp_ns` | `max(r.last_response_timestamp_ns for r in records)` | `ns` |
-| [**Benchmark Duration**](docs/metrics_reference.md#benchmark-duration) | `benchmark_duration` | `max_response_timestamp - min_request_timestamp` | `sec` |
+| [**Benchmark Duration**](docs/metrics_reference.md#benchmark-duration) | `benchmark_duration_seconds` | `(max_response_timestamp_ns - min_request_timestamp_ns) / 1e9` | `sec` |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| [**Request Latency**](docs/metrics_reference.md#request-latency) | `request_latency` | `responses[-1].perf_ns - start_perf_ns` | `ms` |
| [**Request Throughput**](docs/metrics_reference.md#request-throughput) | `request_throughput` | `request_count / benchmark_duration_seconds` | `requests/sec` |
| [**Request Count**](docs/metrics_reference.md#request-count) | `request_count` | `sum(1 for record if record.valid)` | `requests` |
| [**Error Request Count**](docs/metrics_reference.md#error-request-count) | `error_request_count` | `sum(1 for record if not record.valid)` | `requests` |
| [**Minimum Request Timestamp**](docs/metrics_reference.md#minimum-request-timestamp) | `min_request_timestamp` | `min(timestamp_ns for record in records)` | `datetime` |
| [**Maximum Response Timestamp**](docs/metrics_reference.md#maximum-response-timestamp) | `max_response_timestamp` | `max(timestamp_ns + request_latency for record in records)` | `datetime` |
| [**Benchmark Duration**](docs/metrics_reference.md#benchmark-duration) | `benchmark_duration` | `max_response_timestamp - min_request_timestamp` | `sec` |
| [**Request Latency**](docs/metrics_reference.md#request-latency) | `request_latency_ms` | `(responses[-1].perf_ns - request.start_perf_ns) / 1e6` | `ms` |
| [**Request Throughput**](docs/metrics_reference.md#request-throughput) | `request_throughput` | `request_count / benchmark_duration_seconds` | `requests/sec` |
| [**Request Count**](docs/metrics_reference.md#request-count) | `request_count` | `sum(1 for r in records if r.valid)` | `requests` |
| [**Error Request Count**](docs/metrics_reference.md#error-request-count) | `error_request_count` | `sum(1 for r in records if not r.valid)` | `requests` |
| [**Minimum Request Timestamp**](docs/metrics_reference.md#minimum-request-timestamp) | `min_request_timestamp_ns` | `min(r.request_timestamp_ns for r in records)` | `ns` |
| [**Maximum Response Timestamp**](docs/metrics_reference.md#maximum-response-timestamp) | `max_response_timestamp_ns` | `max(r.last_response_timestamp_ns for r in records)` | `ns` |
| [**Benchmark Duration**](docs/metrics_reference.md#benchmark-duration) | `benchmark_duration_seconds` | `(max_response_timestamp_ns - min_request_timestamp_ns) / 1e9` | `sec` |
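A minimal sketch of keeping the two clocks separate, under the field names assumed above (wall-clock `timestamp_ns` bounds the run; perf-counter deltas measure latency):

```python
def benchmark_duration_seconds(records) -> float:
    """Wall-clock duration from the first request sent to the last response received."""
    min_request_timestamp_ns = min(r.timestamp_ns for r in records)
    max_response_timestamp_ns = max(r.timestamp_ns + r.request_latency_ns for r in records)
    return (max_response_timestamp_ns - min_request_timestamp_ns) / 1e9
```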

Contributor

@debermudez debermudez left a comment


Approved, but I would like another set of eyes on it before we publish.
So only commenting for now, since we didn't set it up for multiple approvals.

- [Record Metrics](#record-metrics)
- [Aggregate Metrics](#aggregate-metrics)
- [Derived Metrics](#derived-metrics)
- [Quick Reference](#quick-reference)
Contributor


I think this should be the first section in this list.

Contributor Author

@ajcasagrande ajcasagrande Oct 2, 2025


Can you help clarify? Do you want me to move the Quick Reference content up in the document, or do something different with the ToC?

Contributor


Definitely. I think move this up above understanding metric types.


| Metric | Tag | Formula | Unit |
|--------|-----|---------|------|
| [**Output Token Count**](docs/metrics_reference.md#output-token-count) | `output_token_count` | `len(tokenizer.encode(content))` | `tokens` |


I assume we have add_special_tokens=False

Contributor Author


Yes. Could be good to add a note.
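For example, the table note could show the flag explicitly (same formula style as the doc; assumes a Hugging Face-style tokenizer):

```python
# Special tokens (e.g. BOS/EOS) are excluded from the count.
output_token_count = len(tokenizer.encode(content, add_special_tokens=False))
```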

|--------|-----|---------|------|
| [**Output Token Count**](docs/metrics_reference.md#output-token-count) | `output_token_count` | `len(tokenizer.encode(content))` | `tokens` |
| [**Output Sequence Length (OSL)**](docs/metrics_reference.md#output-sequence-length-osl) | `output_sequence_length` | `(output_token_count or 0) + (reasoning_token_count or 0)` | `tokens` |
| [**Input Sequence Length (ISL)**](docs/metrics_reference.md#input-sequence-length-isl) | `input_sequence_length` | `len(tokenizer.encode(prompt))` | `tokens` |


Same as output_token_count

Contributor Author

@ajcasagrande ajcasagrande Oct 2, 2025


@IzzyPutterman can you explain what is the same as output_token_count? Are you referring to the ISL? Is it the wording on prompt?


I think @IzzyPutterman intends his feedback here to be the same as his feedback in #321 (comment)


| Metric | Tag | Formula | Unit |
|--------|-----|---------|------|
| [**Time to First Token (TTFT)**](docs/metrics_reference.md#time-to-first-token-ttft) | `ttft` | `responses[0].perf_ns - request.start_perf_ns` | `ms` |


Perhaps a mention that responses are "chunks with non-empty content"


**Formula:**
```python
ttft = responses[0].perf_ns - request.start_perf_ns
```

Small nitpick: I get what this is telling me, but I was just wondering why `request` wasn't indexed. It might make it clearer if there were a pointer to the class or structure where this is used? My initial expectation was that the i-th response would map to the i-th request.

This isn't a gating comment, just something that on initial impression was a little confusing.

Contributor


I called out something like that here: #321 (comment), so I think this would be helpful, especially for someone looking to contribute.

Contributor Author


@FrankD412 yeah, it's hard to keep things both easy to understand and true to life when the real formula is longer than a single line.

Technically, everywhere you see `responses[x]` it is really `request.responses[x]`, but that was kinda wordy. One option is to drop the `request.` from `start_perf_ns`, or to add `request.` back in the first part.

@debermudez I agree that the links would be great.
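For illustration only, a hypothetical sketch of the shape the formulas assume (field names follow the doc, not necessarily the real classes):

```python
from dataclasses import dataclass, field

@dataclass
class Response:
    perf_ns: int          # perf-counter time this response chunk arrived

@dataclass
class Request:
    start_perf_ns: int    # perf-counter time the request was sent
    responses: list[Response] = field(default_factory=list)

# `responses[0]` in the formulas is shorthand for `request.responses[0]`:
def ttft_ns(request: Request) -> int:
    return request.responses[0].perf_ns - request.start_perf_ns
```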

Contributor Author


Adding to my first point: most `sum(...)`-style metrics do not actually use `sum` at all; that is just to make them easier for the user to understand. Instead they are computed in two stages, as mentioned in the sections above:

Example Scenario

request_count increments by 1 for each successful request. At the end of a benchmark with 100 successful requests, this metric equals 100 (a single value, not a distribution).

```python
class MinRequestTimestampMetric(BaseAggregateMetric[int]):
    """
    Post-processor for calculating the minimum request time stamp metric from records.

    Formula:
        Minimum Request Timestamp = Min(Request Timestamps)
    """

    tag = "min_request_timestamp"
    header = "Minimum Request Timestamp"
    short_header = "Min Req"
    short_header_hide_unit = True
    unit = MetricTimeUnit.NANOSECONDS
    display_unit = MetricDateTimeUnit.DATE_TIME
    flags = MetricFlags.HIDDEN
    required_metrics = None

    def __init__(self) -> None:
        # Default to a large value, so that any request timestamp will be smaller.
        super().__init__(default_value=sys.maxsize)

    def _parse_record(
        self,
        record: ParsedResponseRecord,
        record_metrics: MetricRecordDict,
    ) -> int:
        """Return the request timestamp."""
        # NOTE: Use the request timestamp_ns, not the start_perf_ns, because we want wall-clock timestamps,
        return record.timestamp_ns

    def _aggregate_value(self, value: int) -> None:
        """Aggregate the metric value. For this metric, we just take the min of the values from the different processes."""
        if value < self._value:
            self._value = value
```


If it's more complicated to try and map to the implementation, a thought: maybe make the first part of this document the formal definition of the metrics. Like, define a number of pseudo variables -- then once the "theory" is laid out you can have a section or a link to another guide that explains the metric implementation.

Sometimes the implementation gets in the way of clear expression.

> [!NOTE]
> Metrics in this section are available for all benchmark runs with no special requirements.
### Request Latency


Isn't this the same as the benchmark duration below? Is this intended to be a per-request metric?

Contributor Author


Same issue as my response at #321 (comment), I guess. `start_perf_ns` is per request.

Contributor Author


Also, initially I had listed the metric type for each one, but I was trying my best not to make the doc too long. I think it may be good to have it, especially since I directly explain what the differences are up above.

The other thing is that I originally grouped the metrics by type (record, aggregate, derived), but felt it flowed better to go by use-case, especially to help people understand why they are or are not seeing certain metrics. (It also removed my need to explain `--streaming` under each streaming metric.)


Hmm -- that's fair. Might it be worth explaining what a per-request statistic is in a central place and then labeling specific metrics as per-request (then linking to the per-request definition)?


**Notes:**
- Error rate can be computed as `error_request_count / (request_count + error_request_count)`.


Non-gating comment, but it might be worth just defining a total request count as the sum of valid + invalid requests?
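Something like this would cover it (hedged sketch, same formula style as the doc):

```python
total_request_count = request_count + error_request_count   # valid + invalid
error_rate = error_request_count / total_request_count
```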
