Conversation
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
```diff
  server_config = dict(self.cfg.server)
- if needs_audio and server_config.get("server_type") not in ["vllm", "vllm_multimodal"]:
+ if needs_audio and server_config.get("server_type") not in ["vllm", "vllm_multimodal", "api_multimodal"]:
      LOG.warning(
```
Updated the warning, as it was misleading.
@Kipok Please consider for review only the files without my comments.
📝 Walkthrough

This PR introduces multimodal API audio support: it adds an APIMultimodal model class with audio chunking capabilities, registers it in the inference ecosystem, extends the generation configuration with audio format options, makes certain evaluator imports optional, and adds ASR normalization control to the audio evaluator.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant APIMultimodal
    participant AudioProcessor
    participant OpenAIModel
    Client->>APIMultimodal: generate_async(prompt with audio)
    APIMultimodal->>APIMultimodal: _preprocess_messages_for_model()
    APIMultimodal->>APIMultimodal: convert audio refs to base64
    APIMultimodal->>APIMultimodal: _needs_audio_chunking()
    alt Audio Chunking Required
        APIMultimodal->>AudioProcessor: load audio, determine duration
        AudioProcessor-->>APIMultimodal: audio duration
        APIMultimodal->>APIMultimodal: split audio into chunks
        loop For Each Audio Chunk
            APIMultimodal->>OpenAIModel: generate_async(chunk)
            OpenAIModel-->>APIMultimodal: result (text, tokens)
            APIMultimodal->>APIMultimodal: aggregate result
        end
    else No Chunking Needed
        APIMultimodal->>OpenAIModel: generate_async(full message)
        OpenAIModel-->>APIMultimodal: result
    end
    APIMultimodal-->>Client: final aggregated result
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/__init__.py (1)
141-141: ⚠️ Potential issue | 🟡 Minor

Remove debug print statement.
This appears to be leftover debug code that should be removed or converted to a proper log statement.
Proposed fix
```diff
 if eval_type in EVALUATOR_CLASS_MAP:
     evaluator = get_evaluator_class(eval_type, eval_config)
-    print(f"evaluator: {evaluator}")
     return asyncio.run(evaluator.eval_full())
```

Or if logging is desired:

```diff
 if eval_type in EVALUATOR_CLASS_MAP:
     evaluator = get_evaluator_class(eval_type, eval_config)
-    print(f"evaluator: {evaluator}")
+    LOG.debug("evaluator: %s", evaluator)
     return asyncio.run(evaluator.eval_full())
```
🧹 Nitpick comments (4)
nemo_skills/evaluation/evaluator/audio.py (1)
511-531: Consider extracting repeated mode selection logic.

The pattern `mode = config.normalization_mode if config.apply_whisper_normalization else "none"` is duplicated across the ASR-PC, ASR, and ASR_LEADERBOARD branches. This is acceptable given the localized scope, but it could be extracted to a helper or computed once at the start of `evaluate_sample` if more task types need the same logic.

nemo_skills/inference/model/api_multimodal.py (3)
154-154: Use explicit `| None` type annotation for optional parameters.

Per PEP 484, using `= None` without `| None` in the type hint creates an implicit Optional, which is discouraged.

Proposed fix

```diff
- def _needs_audio_chunking(self, messages: list[dict], task_type: str = None) -> tuple[bool, str, float]:
+ def _needs_audio_chunking(self, messages: list[dict], task_type: str | None = None) -> tuple[bool, str, float]:
```

```diff
- task_type: str = None,
+ task_type: str | None = None,
```

Also applies to: 291-291
230-230: Rename unused loop variable.

The `chunk_idx` variable is not used within the loop body. Rename it to `_chunk_idx` or use `_` to indicate it is intentionally unused.

Proposed fix

```diff
- for chunk_idx, audio_chunk in enumerate(chunks):
+ for _chunk_idx, audio_chunk in enumerate(chunks):
```

Or, if the index isn't needed at all:

```diff
- for chunk_idx, audio_chunk in enumerate(chunks):
+ for audio_chunk in chunks:
```
315-321: Eliminate redundant audio preprocessing in the non-chunking path.

When `generate_async` takes the non-chunking path (lines 316-318), it preprocesses messages with `content_text_to_list` and `_preprocess_messages_for_model`, then calls `super().generate_async()`. The parent's `generate_async` calls `self._build_chat_request_params` (line 275 in base.py), triggering the same preprocessing again in the overridden method (lines 328-329).

While safe, since `content_text_to_list` is idempotent after the first call, the redundant deep copies (lines 318 and 328) add unnecessary overhead. Preprocess messages only once before calling the parent, or remove preprocessing from the non-chunking path if `_build_chat_request_params` handles it universally.
Additional Comments (1)
Additional Comments (4)
If the goal is to unit-test
add ability to process audio with inference server
Summary by CodeRabbit
New Features
Improvements