Conversation
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Walkthrough

Adds prompt-token counting via a new …
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant Generator as process_single_datapoint
    participant Model as Model/CompletionsAPI
    participant Tok as HF Tokenizer
    Client->>Generator: datapoint (+ flags: count_prompt_tokens, add_generation_stats)
    alt count_prompt_tokens == True
        Generator->>Tok: compute input_sequence_length
        Tok-->>Generator: token_count
    else
        note right of Generator: skip token counting
    end
    Generator->>Generator: record start_time
    Generator->>Model: generate(outputs)
    Model-->>Generator: outputs (+ num_generated_tokens)
    Generator->>Generator: record end_time, compute generation_time
    alt add_generation_stats == True
        Generator-->>Client: result with generation_start_time, generation_end_time, generation_time, (input_sequence_length if counted)
    else
        Generator-->>Client: result without timing/token fields
    end
```

```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant get_token_count
    participant Tokenizer
    Caller->>get_token_count: (tokenizer, messages)
    alt tokenizer is None or messages is None
        get_token_count-->>Caller: None
    else messages is string
        get_token_count->>Tokenizer: encode(text, add_special_tokens=False)
        Tokenizer-->>get_token_count: token_ids
        get_token_count-->>Caller: len(token_ids)
    else messages is list[dict]
        get_token_count->>Tokenizer: apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
        Tokenizer-->>get_token_count: token_ids
        get_token_count-->>Caller: len(token_ids)
    else invalid input
        get_token_count-->>Caller: ValueError
    end
```
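The second diagram corresponds to a small helper function. A minimal sketch, assuming a Hugging Face style tokenizer interface (`encode` / `apply_chat_template`); the real implementation lives in `nemo_skills/prompt/utils.py` and may differ in detail:

```python
from typing import Optional, Union

def get_token_count(tokenizer, messages: Union[str, list, None]) -> Optional[int]:
    # Sketch of the helper described in the diagram above.
    if tokenizer is None or messages is None:
        return None
    if isinstance(messages, str):
        # completions-style plain-text prompt
        return len(tokenizer.encode(messages, add_special_tokens=False))
    if isinstance(messages, list):
        # chat-style list of {"role": ..., "content": ...} dicts
        return len(
            tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
        )
    raise ValueError("messages must be a string or a list of dictionaries")
```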
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
🧹 Nitpick comments (1)
nemo_skills/prompt/utils.py (1)
315-316: Consider whether returning None is the best API design. When the tokenizer is not set, the method returns `None`, which requires callers to handle the `None` case explicitly. Consider whether it would be better to:
- Raise a descriptive error when the tokenizer is not available, or
- Document this behavior clearly in the docstring

The current docstring mentions "or None if no tokenizer is set", but callers need to be aware of this nullable return type.
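For illustration, the fail-fast alternative suggested above might look like this; `token_count_strict` is a hypothetical name, not part of the PR:

```python
def token_count_strict(tokenizer, text: str) -> int:
    # Hypothetical fail-fast variant: raise instead of returning None,
    # so callers never have to handle a nullable return type.
    if tokenizer is None:
        raise RuntimeError(
            "Token counting requested but no tokenizer is set; "
            "initialize one before calling."
        )
    return len(tokenizer.encode(text, add_special_tokens=False))
```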
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/inference/generate.py (2 hunks)
- nemo_skills/prompt/utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/inference/generate.py (2)
nemo_skills/prompt/utils.py (1)
get_token_count (305-326)
nemo_skills/inference/model/base.py (1)
generate_async (203-289)
🪛 Ruff (0.13.1)
nemo_skills/prompt/utils.py
323-323: Do not catch blind exception: Exception
(BLE001)
324-324: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
324-324: Avoid specifying long messages outside the exception class
(TRY003)
326-326: Prefer TypeError exception for invalid type
(TRY004)
326-326: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (3)
nemo_skills/prompt/utils.py (1)
21-21: LGTM! The addition of `Union` to the imports is appropriate for the new method signature.

nemo_skills/inference/generate.py (2)

474-478: LGTM with minor observation. The conditional removal of timing fields is implemented correctly. The code now properly removes `generation_start_time`, `generation_end_time`, and `generation_time` when `add_generation_stats` is False, which aligns with the PR objective of centralizing generation timing metadata.

Note: `input_sequence_length` is always included in the output (line 524), even when `add_generation_stats` is False. This appears intentional based on the PR objectives.
519-523: Verify timing accuracy for parallel thinking mode. The `get_parallel_thinking_model` wrapper must await all parallel calls in its `generate_async` implementation; confirm it doesn't return on the first result, so that `generation_end_time - generation_start_time` reflects the total time for all parallel requests.
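The concern can be checked with a toy model: when all parallel calls are awaited together (for example via `asyncio.gather`), the measured window covers the slowest call rather than the first result. A minimal sketch with a stand-in `fake_generate` (illustrative only, not the repo's code):

```python
import asyncio
import time

async def fake_generate(delay: float) -> dict:
    # Stand-in for one parallel thinking request (hypothetical).
    await asyncio.sleep(delay)
    return {"delay": delay}

async def generate_all(delays):
    start = time.monotonic()
    # gather() waits for *all* calls, so the elapsed time below
    # reflects the slowest request, not the first one to finish
    results = await asyncio.gather(*(fake_generate(d) for d in delays))
    return results, time.monotonic() - start

results, elapsed = asyncio.run(generate_all([0.01, 0.05]))
```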
Kipok left a comment:
thanks, a few small changes
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: Shubham Toshniwal <shtoshni@gmail.com>
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/inference/generate.py (10 hunks)
- nemo_skills/prompt/utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/inference/generate.py (1)
nemo_skills/prompt/utils.py (2)
get_prompt (370-403)
get_token_count (309-333)
🪛 Ruff (0.13.1)
nemo_skills/inference/generate.py
304-304: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/prompt/utils.py
330-330: Do not catch blind exception: Exception
(BLE001)
331-331: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
331-331: Avoid specifying long messages outside the exception class
(TRY003)
333-333: Prefer TypeError exception for invalid type
(TRY004)
333-333: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: unit-tests
🔇 Additional comments (10)
nemo_skills/prompt/utils.py (2)
21-21: LGTM! The addition of `Union` to the typing imports is appropriate for the new `get_token_count` function signature.

309-333: Improve exception handling to follow best practices. The exception handling needs refinement:
- Line 330: Catching bare `Exception` is too broad and can mask unexpected errors
- Line 331: Missing exception chaining (`from e`) loses the original traceback
- Line 333: Should use `TypeError` instead of `ValueError` for invalid type errors

Apply this diff to improve exception handling:

```diff
 elif isinstance(messages, list):
     try:
         return len(tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True))
-    except Exception as e:
-        raise ValueError(f"Invalid chat message format: {e}")
+    except (ValueError, KeyError, AttributeError) as e:
+        raise ValueError(f"Invalid chat message format: {e}") from e
 else:
-    raise ValueError("messages must be a string or a list of dictionaries")
+    raise TypeError("messages must be a string or a list of dictionaries")
```

Based on static analysis hints.
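The effect of `from e` can be seen in a tiny standalone example (`parse_chat` is a hypothetical helper, used here only to demonstrate chaining): the original error is preserved as `__cause__` instead of being discarded.

```python
def parse_chat(payload: dict) -> list:
    # Hypothetical helper demonstrating exception chaining.
    try:
        return payload["messages"]
    except KeyError as e:
        # "from e" keeps the original KeyError attached as __cause__,
        # so the full traceback shows where the failure started.
        raise ValueError(f"Invalid chat message format: {e}") from e
```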
nemo_skills/inference/generate.py (8)

32-32: LGTM! The `AutoTokenizer` import is necessary for initializing the HF tokenizer when `count_prompt_tokens` is enabled.

43-43: LGTM! The `get_token_count` import is necessary for the token counting functionality.

117-118: LGTM! The `count_prompt_tokens` config field provides a clear opt-in mechanism for token counting, which helps avoid performance overhead when not needed (e.g., for ruler benchmarks).

275-284: LGTM! The tokenizer setup logic correctly includes `count_prompt_tokens` as a condition for initializing the tokenizer. This ensures the tokenizer is available when needed for token counting.

295-304: Good handling of tokenizer initialization for both prompt formats. The logic correctly handles tokenizer initialization for both NS format (using `self.prompt.tokenizer`) and OpenAI format (initializing `AutoTokenizer` directly). This addresses the past review comment about calculating token counts even when the prompt isn't defined.

494-499: LGTM! The logic correctly removes all generation statistics (including the new `input_sequence_length` field) when `add_generation_stats` is False. This addresses the past review comment about preventing judge jobs from overriding statistics.

534-537: LGTM! The token counting logic is correctly implemented:
- Only counts when `count_prompt_tokens` is enabled
- Uses the guaranteed-to-be-initialized `hf_tokenizer`
- Defensively checks for `None` before setting the result field

553-560: LGTM! The timing capture is correctly placed in `_process_single_datapoint_with_semaphore`, which wraps the call to `process_single_datapoint`. This ensures timing is captured even when `process_single_datapoint` is overridden in subclasses, addressing the past review comment.
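A minimal sketch of that placement, assuming an asyncio-based pipeline; the names mirror the discussion but are illustrative, not the repo's exact code:

```python
import asyncio
import time

async def process_single_datapoint(dp):
    # Stand-in for the real method, which subclasses may override.
    await asyncio.sleep(0.01)
    return {"input": dp}

async def process_with_semaphore(sem, dp, add_generation_stats=True):
    async with sem:
        # Timing wraps the call itself, so overridden implementations
        # are still measured correctly.
        start = time.time()
        result = await process_single_datapoint(dp)
        end = time.time()
        if add_generation_stats:
            result["generation_start_time"] = start
            result["generation_end_time"] = end
            result["generation_time"] = end - start
        return result

async def main():
    sem = asyncio.Semaphore(2)
    return await process_with_semaphore(sem, "x")

out = asyncio.run(main())
```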
nemo_skills/inference/generate.py (outdated)

```python
if self.cfg.count_prompt_tokens:
    input_sequence_length = get_token_count(self.hf_tokenizer, generation_params["prompt"])
    if input_sequence_length is not None:
```
can remove this if, I guess as we always expect this to be not None?

hmm i added it for precaution if prompt is None for some reason.

but that will just crash the generation, right?

it should, let me remove it then.
Kipok left a comment: one last small comment but otherwise looks good to me! Thanks!
@shtoshni could you please try to update the model in generation tests from llama to nvidia/nvidia-nemotron-nano-9b-v2? I think the capacity for llama is maybe reduced and we are hitting service errors, but hopefully the nano model is served well there

Why not Qwen/Qwen3-1.7B or Qwen/Qwen3-4B? Much smaller and faster

this is not being hosted locally, the requests are being sent to build.nvidia.com, so we need to pick a model that's well supported there. I'm not even sure they support those small qwens but you can check

Actually Qwen small models are not hosted there
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/inference/generate.py (9 hunks)
- tests/test_generation.py (5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-29T22:54:33.244Z
Learnt from: lizziew
PR: NVIDIA/NeMo-Skills#862
File: nemo_skills/inference/generate.py:198-198
Timestamp: 2025-09-29T22:54:33.244Z
Learning: In the nemo_skills/inference/generate.py file, GenerateSolutionsConfig is a dataclass with use_completions_api as a direct attribute. Within its __post_init__ validation methods, it should be accessed as self.use_completions_api (not self.cfg.use_completions_api). The GenerationTask class receives a GenerateSolutionsConfig instance and stores it as self.cfg, so within GenerationTask it's accessed as self.cfg.use_completions_api.
Applied to files:
nemo_skills/inference/generate.py
🧬 Code graph analysis (1)
nemo_skills/inference/generate.py (1)
nemo_skills/prompt/utils.py (1)
get_token_count (309-333)
🪛 Ruff (0.13.1)
nemo_skills/inference/generate.py
304-304: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (10)
tests/test_generation.py (4)

29-29: LGTM! Model change to `nvidia/nvidia-nemotron-nano-9b-v2` aligns with the discussion about service availability and capacity limits on build.nvidia.com.

88-88: LGTM! Consistent model update across test cases.

109-109: LGTM! Consistent model update.

138-138: LGTM! Consistent model update across all test functions.
nemo_skills/inference/generate.py (6)

32-32: LGTM! Required imports for token counting functionality are correctly added. Also applies to: 43-43.

117-118: LGTM! The `count_prompt_tokens` flag is appropriately opt-in (defaults to False) to avoid performance overhead for use cases that don't need token counting, such as ruler benchmarks.

275-304: LGTM! The tokenizer initialization logic correctly handles both scenarios:
- When `self.prompt` exists, it reuses `self.prompt.tokenizer`
- When `self.prompt` is `None` (openai format), it initializes `AutoTokenizer` directly

The explicit error check on lines 303-304 ensures users get a clear message when token counting is requested but cannot be performed.
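The two scenarios can be sketched as a small resolver. Here `loader` stands in for `transformers.AutoTokenizer.from_pretrained`, and all names are illustrative rather than the repo's exact code:

```python
def resolve_tokenizer(prompt, model_name, count_prompt_tokens, loader):
    # Pick the tokenizer used for prompt-token counting (sketch).
    if not count_prompt_tokens:
        return None  # counting disabled: no tokenizer needed
    if prompt is not None and getattr(prompt, "tokenizer", None) is not None:
        # NS prompt format: reuse the tokenizer already on the prompt
        return prompt.tokenizer
    # OpenAI message format: load a tokenizer by model name
    tokenizer = loader(model_name)
    if tokenizer is None:
        raise RuntimeError(
            "count_prompt_tokens is set but no tokenizer could be initialized"
        )
    return tokenizer
```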
494-499: LGTM! The dump logic correctly removes all generation statistics (including the new `input_sequence_length`) when `add_generation_stats` is False, which prevents judge jobs from overwriting original generation metadata.

534-536: LGTM! Token counting is correctly guarded by the `count_prompt_tokens` flag. Since `hf_tokenizer` is guaranteed to be non-None when this flag is True (enforced by the error check at lines 303-304), `get_token_count` should not return None in this code path.

552-559: LGTM! Timing capture is now correctly positioned outside `process_single_datapoint`, which ensures accurate measurement even when that method is overridden by subclasses. The conditional inclusion based on `add_generation_stats` maintains clean outputs for judge jobs.
Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com>
seems like maybe there is some outage or something, but given that this new logic is only enabled with a parameter and shouldn't be touched in the tests, let's merge
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com> Signed-off-by: Shubham Toshniwal <shtoshni@gmail.com> Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com> Signed-off-by: Shubham Toshniwal <shtoshni@gmail.com> Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com> Signed-off-by: Shubham Toshniwal <shtoshni@gmail.com> Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit
- New Features
- Refactor
- Tests