
Generation time + Input Sequence Length #865

Merged
Kipok merged 14 commits into main from shtoshni/isl
Sep 30, 2025

Conversation

@shtoshni
Contributor

@shtoshni shtoshni commented Sep 29, 2025

Summary by CodeRabbit

  • New Features

    • Optional prompt-token counting via a new configurable flag and a utility that counts tokens for plain text or chat-style prompts; tokenizer initialization and clearer errors when counting is requested.
  • Refactor

    • Generation timing captured around each generation and token-length/timing fields are included only when enabled; server/completions handling aligned with config usage.
  • Tests

    • Test model identifiers updated to use the NVIDIA Nemo model.

Shubham Toshniwal added 3 commits September 29, 2025 15:29
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
@shtoshni shtoshni requested a review from Kipok September 29, 2025 22:44
@coderabbitai
Contributor

coderabbitai bot commented Sep 29, 2025

Walkthrough

Adds prompt-token counting via a new count_prompt_tokens flag, initializes an HF tokenizer when requested, moves per-datapoint generation timing to be captured around generation and included only when add_generation_stats is True, updates dump logic and Megatron server config usage, and updates test model identifiers.

Changes

• Inference generation & config (`nemo_skills/inference/generate.py`):
  Adds `count_prompt_tokens: bool` to GenerateSolutionsConfig; initializes an HF tokenizer when counting is enabled; computes and stores `input_sequence_length` when counting; captures per-datapoint generation timing (start/end/duration) around the generation call and appends timing fields only if `add_generation_stats` is True; removes the prior timing placement; updates `dump_outputs` to drop token/timing fields when stats are disabled; aligns Megatron server handling with the config's `use_completions_api`.
• Prompt utilities: token counting (`nemo_skills/prompt/utils.py`):
  Adds `get_token_count(tokenizer, messages: Union[str, list[dict]]) -> int`, which counts tokens for plain-text prompts via `encode` and for chat-style prompts via `apply_chat_template`.
• Tests: model identifier updates (`tests/test_generation.py`):
  Replaces the model identifier `meta/llama-3.1-8b-instruct` with `nvidia/nvidia-nemotron-nano-9b-v2` across multiple test commands (including the judge model reference). No control-flow changes.
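For orientation, the new config additions described above might look like the following minimal sketch (the real `GenerateSolutionsConfig` carries many more fields, and the `add_generation_stats` default here is assumed for illustration):

```python
from dataclasses import dataclass


@dataclass
class GenerateSolutionsConfig:
    # Many existing fields elided; only the flags discussed in this PR are shown.
    count_prompt_tokens: bool = False   # opt-in: avoids tokenizer overhead by default
    add_generation_stats: bool = False  # gates timing/token fields in outputs (default assumed)
```

Keeping `count_prompt_tokens` off by default means use cases that don't need token counts (e.g. ruler benchmarks) pay no tokenizer-initialization cost.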

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Client
  participant Generator as process_single_datapoint
  participant Model as Model/CompletionsAPI
  participant Tok as HF Tokenizer

  Client->>Generator: datapoint (+ flags: count_prompt_tokens, add_generation_stats)
  alt count_prompt_tokens == True
    Generator->>Tok: compute input_sequence_length
    Tok-->>Generator: token_count
  else
    note right of Generator: skip token counting
  end
  Generator->>Generator: record start_time
  Generator->>Model: generate(outputs)
  Model-->>Generator: outputs (+ num_generated_tokens)
  Generator->>Generator: record end_time, compute generation_time
  alt add_generation_stats == True
    Generator-->>Client: result with generation_start_time, generation_end_time, generation_time, (input_sequence_length if counted)
  else
    Generator-->>Client: result without timing/token fields
  end
```
```mermaid
sequenceDiagram
  autonumber
  participant Caller
  participant get_token_count
  participant Tokenizer

  Caller->>get_token_count: (tokenizer, messages)
  alt tokenizer is None or messages is None
    get_token_count-->>Caller: None
  else messages is string
    get_token_count->>Tokenizer: encode(text, add_special_tokens=False)
    Tokenizer-->>get_token_count: token_ids
    get_token_count-->>Caller: len(token_ids)
  else messages is list[dict]
    get_token_count->>Tokenizer: apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
    Tokenizer-->>get_token_count: token_ids
    get_token_count-->>Caller: len(token_ids)
  else invalid input
    get_token_count-->>Caller: ValueError
  end
```
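The second diagram condenses into a short Python sketch. The `WhitespaceTokenizer` below is a toy stand-in for a real HF tokenizer, included only so the sketch is self-contained; the actual utility in `nemo_skills/prompt/utils.py` may differ in detail.

```python
from typing import Optional, Union


def get_token_count(tokenizer, messages: Union[str, list]) -> Optional[int]:
    """Count prompt tokens for a plain string or a chat-style message list.

    Sketch of the utility summarized in the diagram above, not the exact
    implementation from nemo_skills/prompt/utils.py.
    """
    if tokenizer is None or messages is None:
        return None
    if isinstance(messages, str):
        # Plain text: count raw tokens without special tokens.
        return len(tokenizer.encode(messages, add_special_tokens=False))
    if isinstance(messages, list):
        # Chat messages: render through the chat template, then count.
        try:
            return len(
                tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
            )
        except (ValueError, KeyError, AttributeError) as e:
            raise ValueError(f"Invalid chat message format: {e}") from e
    raise TypeError("messages must be a string or a list of dictionaries")


class WhitespaceTokenizer:
    """Toy stand-in for an HF tokenizer, used only to keep the sketch runnable."""

    def encode(self, text, add_special_tokens=False):
        return text.split()

    def apply_chat_template(self, messages, tokenize=True, add_generation_prompt=True):
        return [tok for m in messages for tok in m["content"].split()]
```

For example, `get_token_count(WhitespaceTokenizer(), "count these tokens")` returns 3 under this toy tokenizer.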

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I count each token, hop from start to end,
Timings tucked or shown depending on the flag I send.
A tokenizer ready when prompts need a clue,
Tests now point to a new model too.
A rabbit cheers — the pipeline hops anew. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title "Generation time + Input Sequence Length" directly reflects the two main features introduced in the pull request (capturing generation timing metrics and tracking prompt token lengths) and does so succinctly without extraneous wording. It is clear, focused on the primary changes, and lets a reviewer immediately grasp the PR's purpose.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 075f94f and 1fd3e73.

📒 Files selected for processing (1)
  • tests/test_generation.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (6)
tests/test_generation.py (6)

29-29: LGTM! Model update with correct spacing.

The model parameter has been correctly updated to nvidia/nvidia-nemotron-nano-9b-v2 with proper trailing space to prevent parameter concatenation issues.
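The concatenation issue the trailing space guards against can be illustrated with a hypothetical command string (the variable names and the command fragments are made up for illustration, not the actual test code):

```python
# Hypothetical: test commands are assembled by concatenating parameter
# fragments, so a missing trailing space fuses two parameters into one.
model_arg = "++model=nvidia/nvidia-nemotron-nano-9b-v2 "  # trailing space kept
next_arg = "++server_type=openai"

good = model_arg + next_arg            # "...-9b-v2 ++server_type=openai"
bad = model_arg.rstrip() + next_arg    # "...-9b-v2++server_type=openai" (one mangled arg)
```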


56-56: LGTM! Model parameters updated correctly.

Both the model and judge_model parameters have been updated to nvidia/nvidia-nemotron-nano-9b-v2 with proper trailing spaces, addressing the previously identified spacing issue.

Also applies to: 60-60


88-88: LGTM! Consistent model update.

Model parameter correctly updated with proper spacing, maintaining consistency with other test functions.


109-109: LGTM! Proper spacing maintained.

Model parameter updated correctly with trailing space to prevent concatenation issues.


138-138: LGTM! Model update complete.

Final test function updated consistently with the rest of the file, maintaining proper spacing.


29-138: Excellent consistency across all test updates.

All test functions have been updated uniformly to use nvidia/nvidia-nemotron-nano-9b-v2, with proper spacing on all model parameters. This aligns with the PR discussion about choosing a model that is well-supported on build.nvidia.com. The previous spacing concern has been addressed throughout the file.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
nemo_skills/prompt/utils.py (1)

315-316: Consider whether returning None is the best API design.

When the tokenizer is not set, the method returns None. This requires callers to handle the None case explicitly. Consider whether it would be better to:

  1. Raise a descriptive error when tokenizer is not available, or
  2. Document this behavior clearly in the docstring

The current docstring mentions "or None if no tokenizer is set" but callers need to be aware of this nullable return type.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 16481b7 and 6cf7890.

📒 Files selected for processing (2)
  • nemo_skills/inference/generate.py (2 hunks)
  • nemo_skills/prompt/utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/inference/generate.py (2)
nemo_skills/prompt/utils.py (1)
  • get_token_count (305-326)
nemo_skills/inference/model/base.py (1)
  • generate_async (203-289)
🪛 Ruff (0.13.1)
nemo_skills/prompt/utils.py

323-323: Do not catch blind exception: Exception

(BLE001)


324-324: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


324-324: Avoid specifying long messages outside the exception class

(TRY003)


326-326: Prefer TypeError exception for invalid type

(TRY004)


326-326: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (3)
nemo_skills/prompt/utils.py (1)

21-21: LGTM!

The addition of Union to the imports is appropriate for the new method signature.

nemo_skills/inference/generate.py (2)

474-478: LGTM with minor observation.

The conditional removal of timing fields is implemented correctly. The code now properly removes generation_start_time, generation_end_time, and generation_time when add_generation_stats is False, which aligns with the PR objective of centralizing generation timing metadata.

Note: input_sequence_length is always included in the output (line 524), even when add_generation_stats is False. This appears intentional based on the PR objectives.


519-523: Verify timing accuracy for parallel thinking mode
The get_parallel_thinking_model wrapper must await all parallel calls in its generate_async implementation—confirm it doesn’t return on the first result so that generation_end_time – generation_start_time reflects the total time for all parallel requests.

Collaborator

@Kipok Kipok left a comment


thanks, a few small changes

Shubham Toshniwal and others added 2 commits September 29, 2025 16:07
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Shubham Toshniwal <shtoshni@gmail.com>
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Shubham Toshniwal and others added 2 commits September 29, 2025 16:32
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Shubham Toshniwal added 3 commits September 29, 2025 17:00
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cf7890 and 117fbba.

📒 Files selected for processing (2)
  • nemo_skills/inference/generate.py (10 hunks)
  • nemo_skills/prompt/utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/inference/generate.py (1)
nemo_skills/prompt/utils.py (2)
  • get_prompt (370-403)
  • get_token_count (309-333)
🪛 Ruff (0.13.1)
nemo_skills/inference/generate.py

304-304: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/prompt/utils.py

330-330: Do not catch blind exception: Exception

(BLE001)


331-331: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


331-331: Avoid specifying long messages outside the exception class

(TRY003)


333-333: Prefer TypeError exception for invalid type

(TRY004)


333-333: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (10)
nemo_skills/prompt/utils.py (2)

21-21: LGTM!

The addition of Union to the typing imports is appropriate for the new get_token_count function signature.


309-333: Improve exception handling to follow best practices.

The exception handling needs refinement:

  1. Line 330: Catching bare Exception is too broad and can mask unexpected errors
  2. Line 331: Missing exception chaining (from e) loses the original traceback
  3. Line 333: Should use TypeError instead of ValueError for invalid type errors

Apply this diff to improve exception handling:

```diff
     elif isinstance(messages, list):
         try:
             return len(tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True))
-        except Exception as e:
-            raise ValueError(f"Invalid chat message format: {e}")
+        except (ValueError, KeyError, AttributeError) as e:
+            raise ValueError(f"Invalid chat message format: {e}") from e
     else:
-        raise ValueError("messages must be a string or a list of dictionaries")
+        raise TypeError("messages must be a string or a list of dictionaries")
```
Based on static analysis hints.

nemo_skills/inference/generate.py (8)

32-32: LGTM!

The AutoTokenizer import is necessary for initializing the HF tokenizer when count_prompt_tokens is enabled.


43-43: LGTM!

The get_token_count import is necessary for the token counting functionality.


117-118: LGTM!

The count_prompt_tokens config field provides a clear opt-in mechanism for token counting, which helps avoid performance overhead when not needed (e.g., for ruler benchmarks).


275-284: LGTM!

The tokenizer setup logic correctly includes count_prompt_tokens as a condition for initializing the tokenizer. This ensures the tokenizer is available when needed for token counting.


295-304: Good handling of tokenizer initialization for both prompt formats.

The logic correctly handles tokenizer initialization for both NS format (using self.prompt.tokenizer) and OpenAI format (initializing AutoTokenizer directly). This addresses the past review comment about calculating token counts even when prompt isn't defined.
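That dual-path initialization can be sketched as follows (assumed structure and names; the real logic lives in generate.py and is written as a method rather than a free function):

```python
def init_hf_tokenizer(cfg, prompt):
    """Sketch of tokenizer setup for both prompt formats described above."""
    if not cfg.count_prompt_tokens:
        # Token counting is opt-in; skip tokenizer setup entirely when off.
        return None
    # NS format: the Prompt object already carries a tokenizer, so reuse it.
    if prompt is not None and getattr(prompt, "tokenizer", None) is not None:
        return prompt.tokenizer
    # openai format: no Prompt object, so build an HF tokenizer directly.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(cfg.model)
    if tokenizer is None:
        # Clear error when counting is requested but cannot be performed.
        raise ValueError("count_prompt_tokens=True requires a tokenizer")
    return tokenizer
```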


494-499: LGTM!

The logic correctly removes all generation statistics (including the new input_sequence_length field) when add_generation_stats is False. This addresses the past review comment about preventing judge jobs from overriding statistics.


534-537: LGTM!

The token counting logic is correctly implemented:

  • Only counts when count_prompt_tokens is enabled
  • Uses the guaranteed-to-be-initialized hf_tokenizer
  • Defensively checks for None before setting the result field

553-560: LGTM!

The timing capture is correctly placed in _process_single_datapoint_with_semaphore, which wraps the call to process_single_datapoint. This ensures timing is captured even when process_single_datapoint is overridden in subclasses, addressing the past review comment.
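The wrapper-level timing described above can be sketched like this (assumed signature; the real method is `_process_single_datapoint_with_semaphore`, and the field names match the first sequence diagram):

```python
import time


async def process_with_timing(process_single_datapoint, datapoint, data, add_generation_stats):
    """Time one datapoint around the (possibly overridden) processing call."""
    start = time.time()
    result = await process_single_datapoint(datapoint, data)
    end = time.time()
    # Timing fields are appended only when stats are enabled, so judge jobs
    # don't overwrite the original generation metadata.
    if add_generation_stats:
        result["generation_start_time"] = start
        result["generation_end_time"] = end
        result["generation_time"] = end - start
    return result
```

Because the timer wraps the call rather than living inside `process_single_datapoint`, subclasses that override the processing method are still measured correctly.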


```python
if self.cfg.count_prompt_tokens:
    input_sequence_length = get_token_count(self.hf_tokenizer, generation_params["prompt"])
    if input_sequence_length is not None:
```
Collaborator


can remove this if, I guess as we always expect this to be not None?

Contributor Author


hmm i added it for precaution if prompt is None for some reason.

Collaborator


but that will just crash the generation, right?

Contributor Author


it should, let me remove it then.

Collaborator

@Kipok Kipok left a comment


one last small comment but otherwise looks good to me! Thanks!

@Kipok
Collaborator

Kipok commented Sep 30, 2025

@shtoshni could you please try to update the model in generation tests from llama to nvidia/nvidia-nemotron-nano-9b-v2? I think the capacity for llama is maybe reduced and we are hitting service errors, but hopefully the nano model is served well there

@shtoshni
Contributor Author

nvidia/nvidia-nemotron-nano-9b-v2

Why not Qwen/Qwen3-1.7B or Qwen/Qwen3-4B? Much smaller and faster

@Kipok
Collaborator

Kipok commented Sep 30, 2025

this is not being hosted locally, the requests are being sent to build.nvidia.com, so we need to pick a model that's well supported there. I'm not even sure they support those small qwens but you can check

@shtoshni
Contributor Author

Actually Qwen small models are not hosted there

Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 117fbba and 075f94f.

📒 Files selected for processing (2)
  • nemo_skills/inference/generate.py (9 hunks)
  • tests/test_generation.py (5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-29T22:54:33.244Z
Learnt from: lizziew
PR: NVIDIA/NeMo-Skills#862
File: nemo_skills/inference/generate.py:198-198
Timestamp: 2025-09-29T22:54:33.244Z
Learning: In the nemo_skills/inference/generate.py file, GenerateSolutionsConfig is a dataclass with use_completions_api as a direct attribute. Within its __post_init__ validation methods, it should be accessed as self.use_completions_api (not self.cfg.use_completions_api). The GenerationTask class receives a GenerateSolutionsConfig instance and stores it as self.cfg, so within GenerationTask it's accessed as self.cfg.use_completions_api.

Applied to files:

  • nemo_skills/inference/generate.py
🧬 Code graph analysis (1)
nemo_skills/inference/generate.py (1)
nemo_skills/prompt/utils.py (1)
  • get_token_count (309-333)
🪛 Ruff (0.13.1)
nemo_skills/inference/generate.py

304-304: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (10)
tests/test_generation.py (4)

29-29: LGTM!

Model change to nvidia/nvidia-nemotron-nano-9b-v2 aligns with the discussion about service availability and capacity limits on build.nvidia.com.


88-88: LGTM!

Consistent model update across test cases.


109-109: LGTM!

Consistent model update.


138-138: LGTM!

Consistent model update across all test functions.

nemo_skills/inference/generate.py (6)

32-32: LGTM!

Required imports for token counting functionality are correctly added.

Also applies to: 43-43


117-118: LGTM!

The count_prompt_tokens flag is appropriately opt-in (defaults to False) to avoid performance overhead for use cases that don't need token counting, such as ruler benchmarks.


275-304: LGTM!

The tokenizer initialization logic correctly handles both scenarios:

  • When self.prompt exists, it reuses self.prompt.tokenizer
  • When self.prompt is None (openai format), it initializes AutoTokenizer directly

The explicit error check on line 303-304 ensures users get a clear message when token counting is requested but cannot be performed.


494-499: LGTM!

The dump logic correctly removes all generation statistics (including the new input_sequence_length) when add_generation_stats is False, which prevents judge jobs from overwriting original generation metadata.


534-536: LGTM!

Token counting is correctly guarded by the count_prompt_tokens flag. Since hf_tokenizer is guaranteed to be non-None when this flag is True (enforced by the error check at lines 303-304), get_token_count should not return None in this code path.


552-559: LGTM!

Timing capture is now correctly positioned outside process_single_datapoint, which ensures accurate measurement even when that method is overridden by subclasses. The conditional inclusion based on add_generation_stats maintains clean outputs for judge jobs.

Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com>
@Kipok
Collaborator

Kipok commented Sep 30, 2025

seems like maybe there is some outage or something, but given that this new logic is only enabled with a parameter and shouldn't be touched in the tests, let's merge

@Kipok Kipok merged commit 8679e5f into main Sep 30, 2025
4 of 7 checks passed
@Kipok Kipok deleted the shtoshni/isl branch September 30, 2025 00:55
wasiahmad pushed a commit that referenced this pull request Oct 1, 2025
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: Shubham Toshniwal <shtoshni@gmail.com>
Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 9, 2025
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: Shubham Toshniwal <shtoshni@gmail.com>
Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
SeanNaren pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: Shubham Toshniwal <shtoshni@gmail.com>
Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: Shubham Toshniwal <shtoshni@gmail.com>
Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>