
Api call server #1189

Closed
Jorjeous wants to merge 14 commits into main from api_call_server

Conversation

@Jorjeous Jorjeous commented Jan 27, 2026

Add ability to process audio with an inference server.

Summary by CodeRabbit

  • New Features

    • Added APIMultimodal server type for audio processing with configurable audio chunking
    • Introduced apply_whisper_normalization option to control audio normalization behavior
    • Added audio_format field to generation configuration
  • Improvements

    • Enhanced import robustness for optional evaluators

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
     server_config = dict(self.cfg.server)
-    if needs_audio and server_config.get("server_type") not in ["vllm", "vllm_multimodal"]:
+    if needs_audio and server_config.get("server_type") not in ["vllm", "vllm_multimodal", "api_multimodal"]:
         LOG.warning(
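The guard in the diff above can be sketched as a standalone predicate. Names here (`needs_audio`, `server_config`, the helper itself) are illustrative, taken from the diff's context rather than the actual codebase:

```python
# Server types assumed able to consume audio inputs, per the diff above.
SUPPORTED_AUDIO_SERVERS = ["vllm", "vllm_multimodal", "api_multimodal"]


def should_warn_about_audio(needs_audio: bool, server_config: dict) -> bool:
    """Return True when audio is requested but the server type can't handle it."""
    return needs_audio and server_config.get("server_type") not in SUPPORTED_AUDIO_SERVERS
```

With this shape, the warning fires for unsupported servers but stays silent for the newly added `api_multimodal` type.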
@Jorjeous (Member Author)

Updated the warning, as it was misleading

@Jorjeous Jorjeous requested a review from Kipok January 27, 2026 14:58
Jorjeous commented Jan 27, 2026

@Kipok Please consider for review only the files without my comments;
files with comments will be reverted to the "main" branch state.

Jorjeous and others added 4 commits January 27, 2026 07:23
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@karpnv (Collaborator) left a comment

Minor changes needed

@Jorjeous Jorjeous marked this pull request as ready for review February 3, 2026 13:03
@greptile-apps (bot) left a comment

1 file reviewed, 2 comments

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

This PR introduces multimodal API audio support by adding an APIMultimodal model class with audio chunking capabilities, registers it in the inference ecosystem, extends generation configuration with audio format options, makes certain evaluators optional imports, and adds ASR normalization control to the audio evaluator.

Changes

  • Evaluator Import & Registration (nemo_skills/evaluation/evaluator/__init__.py)
    Wrapped ComputeEvalEvaluator and ruler imports in try/except blocks; deferred map population until import success. Added validation to detect overlaps between class- and function-based evaluator maps.
  • Audio Evaluator Configuration (nemo_skills/evaluation/evaluator/audio.py)
    Added an apply_whisper_normalization: bool = True flag to AudioEvaluatorConfig to conditionally gate Whisper normalization in ASR evaluation paths.
  • Generation Task Configuration (nemo_skills/inference/generate.py)
    Added an audio_format: str = "audio_url" field; updated the drop_content_types default to include both "audio_url" and "input_audio"; extended server type allowances to "api_multimodal" with audio settings passthrough.
  • Model Registry & Server Support (nemo_skills/inference/model/__init__.py, nemo_skills/pipeline/utils/server.py)
    Imported and registered the new APIMultimodal model class in the models mapping; added an api_multimodal = "api_multimodal" enum member to SupportedServers.
  • Multimodal API Model Implementation (nemo_skills/inference/model/api_multimodal.py)
    New module implementing APIMultimodal(OpenAIModel) with audio input encoding to base64, configurable audio chunking by duration threshold and task-type filtering, chunked generation with result aggregation, and message preprocessing that places audio blocks before text content.
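The audio handling described above (base64 encoding plus duration-based chunking) can be sketched roughly as follows. All names and the 30-second threshold are illustrative assumptions, not the actual APIMultimodal implementation:

```python
import base64


def encode_audio_base64(raw_bytes: bytes) -> str:
    """Encode raw audio bytes to a base64 string suitable for an API payload."""
    return base64.b64encode(raw_bytes).decode("utf-8")


def needs_chunking(duration_sec: float, max_chunk_sec: float = 30.0) -> bool:
    """Decide whether the audio exceeds the configured chunk-duration threshold."""
    return duration_sec > max_chunk_sec


def split_into_chunks(duration_sec: float, max_chunk_sec: float = 30.0):
    """Return (start, end) second boundaries covering the full duration."""
    bounds = []
    start = 0.0
    while start < duration_sec:
        end = min(start + max_chunk_sec, duration_sec)
        bounds.append((start, end))
        start = end
    return bounds
```

In the real class, each chunk would be sent through `generate_async` and the per-chunk results aggregated, as the sequence diagram below shows.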

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant APIMultimodal
    participant AudioProcessor
    participant OpenAIModel
    
    Client->>APIMultimodal: generate_async(prompt with audio)
    APIMultimodal->>APIMultimodal: _preprocess_messages_for_model()
    APIMultimodal->>APIMultimodal: convert audio refs to base64
    APIMultimodal->>APIMultimodal: _needs_audio_chunking()
    
    alt Audio Chunking Required
        APIMultimodal->>AudioProcessor: load audio, determine duration
        AudioProcessor-->>APIMultimodal: audio duration
        APIMultimodal->>APIMultimodal: split audio into chunks
        
        loop For Each Audio Chunk
            APIMultimodal->>OpenAIModel: generate_async(chunk)
            OpenAIModel-->>APIMultimodal: result (text, tokens)
            APIMultimodal->>APIMultimodal: aggregate result
        end
    else No Chunking Needed
        APIMultimodal->>OpenAIModel: generate_async(full message)
        OpenAIModel-->>APIMultimodal: result
    end
    
    APIMultimodal-->>Client: final aggregated result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • Kipok
  • gwarmstrong
  • melllinia
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 inconclusive

❌ Failed checks (1 inconclusive)
  • Title check ❓ Inconclusive: The title 'Api call server' is vague and generic, failing to convey the specific nature of the changes, which center on adding multimodal audio support to an inference API client. Resolution: consider a more descriptive title such as 'Add APIMultimodal server with audio chunking support' or 'Support audio processing in API multimodal inference'.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring Coverage ✅ Passed: Docstring coverage is 88.89%, which is sufficient. The required threshold is 80.00%.



@coderabbitai (bot) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/__init__.py (1)

141-141: ⚠️ Potential issue | 🟡 Minor

Remove debug print statement.

This appears to be leftover debug code that should be removed or converted to a proper log statement.

Proposed fix
     if eval_type in EVALUATOR_CLASS_MAP:
         evaluator = get_evaluator_class(eval_type, eval_config)
-        print(f"evaluator: {evaluator}")
         return asyncio.run(evaluator.eval_full())

Or if logging is desired:

     if eval_type in EVALUATOR_CLASS_MAP:
         evaluator = get_evaluator_class(eval_type, eval_config)
-        print(f"evaluator: {evaluator}")
+        LOG.debug("evaluator: %s", evaluator)
         return asyncio.run(evaluator.eval_full())
🧹 Nitpick comments (4)
nemo_skills/evaluation/evaluator/audio.py (1)

511-531: Consider extracting repeated mode selection logic.

The pattern mode = config.normalization_mode if config.apply_whisper_normalization else "none" is duplicated across ASR-PC, ASR, and ASR_LEADERBOARD branches. This is acceptable given the localized scope, but could be extracted to a helper or computed once at the start of evaluate_sample if more task types need the same logic.

nemo_skills/inference/model/api_multimodal.py (3)

154-154: Use explicit | None type annotation for optional parameters.

Per PEP 484, using = None without | None in the type hint creates an implicit Optional which is discouraged.

Proposed fix
-    def _needs_audio_chunking(self, messages: list[dict], task_type: str = None) -> tuple[bool, str, float]:
+    def _needs_audio_chunking(self, messages: list[dict], task_type: str | None = None) -> tuple[bool, str, float]:
-        task_type: str = None,
+        task_type: str | None = None,

Also applies to: 291-291


230-230: Rename unused loop variable.

The chunk_idx variable is not used within the loop body. Rename to _chunk_idx or use _ to indicate it's intentionally unused.

Proposed fix
-        for chunk_idx, audio_chunk in enumerate(chunks):
+        for _chunk_idx, audio_chunk in enumerate(chunks):

Or if the index isn't needed at all:

-        for chunk_idx, audio_chunk in enumerate(chunks):
+        for audio_chunk in chunks:

315-321: Eliminate redundant audio preprocessing in the non-chunking path.

When generate_async takes the non-chunking path (lines 316-318), it preprocesses messages with content_text_to_list and _preprocess_messages_for_model, then calls super().generate_async(). The parent's generate_async calls self._build_chat_request_params (line 275 in base.py), triggering the same preprocessing again in the overridden method (lines 328-329).

While safe—content_text_to_list is idempotent after the first call—the redundant deep copies (lines 318 and 328) add unnecessary overhead. Preprocess messages only once before calling the parent, or remove preprocessing from the non-chunking path if _build_chat_request_params handles it universally.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous Jorjeous requested a review from melllinia February 3, 2026 14:10
@greptile-apps (bot) left a comment

2 files reviewed, 4 comments

greptile-apps bot commented Feb 3, 2026

Additional Comments (1)

nemo_skills/inference/model/vllm_multimodal.py
content_text_to_list mutates the input message dict (modifying message["content"], deleting message["audio"]/message["audios"]). Same mutation issue as in api_multimodal.py - this violates the principle of avoiding silent bugs through mutation.
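A non-mutating variant of the kind the comment asks for could look like this sketch. The field names (`content`, `audio`, `audios`) come from the comment; the function body is an illustrative assumption, not the project's actual content_text_to_list:

```python
import copy


def content_text_to_list(message: dict) -> dict:
    """Return a new message dict with string content wrapped in a list,
    leaving the caller's dict untouched."""
    msg = copy.deepcopy(message)
    if isinstance(msg.get("content"), str):
        msg["content"] = [{"type": "text", "text": msg["content"]}]
    # Drop audio fields from the copy only, never from the input.
    for key in ("audio", "audios"):
        msg.pop(key, None)
    return msg
```

Returning a copy keeps the transformation idempotent and avoids the silent-mutation bug the review flags.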

Jorjeous and others added 2 commits February 4, 2026 01:56
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@greptile-apps (bot) left a comment

5 files reviewed, 5 comments

greptile-apps bot commented Feb 4, 2026

Additional Comments (4)

nemo_skills/evaluation/evaluator/__init__.py
print(f"evaluator: {evaluator}") introduces an unconditional stdout side-effect in the evaluator execution path. This can break callers/pipelines that expect clean stdout (e.g., JSON output) and bypasses the project’s logging patterns.

    # (remove debug print; use logging if needed)

nemo_skills/inference/model/__init__.py
server_type is normalized for the registry lookup (server_type.lower()), but later checks use the original casing (e.g., if server_type == "trtllm" ...). If a caller passes TRTLLM, the model loads but the trtllm-specific validation/behavior is skipped.

if server_type.lower() == "trtllm" and kwargs.get("enable_soft_fail", False):
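One way to apply the fix consistently is to normalize once at the top of the lookup function, so the registry lookup and every later comparison agree on casing. This is a sketch with illustrative names, not the actual nemo_skills code:

```python
def get_model(server_type: str, **kwargs):
    """Normalize casing once so 'TRTLLM' behaves exactly like 'trtllm'."""
    server_type = server_type.lower()
    if server_type == "trtllm" and kwargs.get("enable_soft_fail", False):
        # trtllm-specific behavior now runs regardless of the caller's casing.
        kwargs["soft_fail"] = True
    return server_type, kwargs
```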

nemo_skills/inference/generate.py
This cleanup assumes litellm.cache.cache always has force_save(). If litellm.cache is configured differently (or the cache wrapper shape changes), generation can crash during teardown when enable_litellm_cache=True. Safer to call getattr(litellm.cache, "cache", None) / hasattr(..., "force_save") or rely on the cache implementation’s public API.
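The defensive teardown the comment suggests could be sketched like this, with a SimpleNamespace standing in for the real litellm module so the pattern is self-contained:

```python
from types import SimpleNamespace


def safe_force_save(litellm) -> bool:
    """Flush the cache only when the expected wrapper shape is present."""
    cache = getattr(litellm, "cache", None)
    inner = getattr(cache, "cache", None)
    if inner is not None and hasattr(inner, "force_save"):
        inner.force_save()
        return True
    return False


# Well-shaped cache: force_save is invoked.
saved = []
fake = SimpleNamespace(
    cache=SimpleNamespace(cache=SimpleNamespace(force_save=lambda: saved.append(True)))
)
safe_force_save(fake)
```

A differently configured cache (or none at all) then degrades to a no-op instead of crashing during teardown.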


tests/test_vllm_audio.py
This fixture patches VLLMMultimodalModel.__init__ to lambda: None, so the object never runs VLLMModel/BaseModel initialization. That makes the test brittle (it can pass while real construction fails) and may mask regressions tied to init-time defaults.

If the goal is to unit-test _preprocess_messages_for_model, consider constructing a minimal instance without patching __init__ (or patch only the specific heavyweight parts called by __init__) so the object’s invariants match production.
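The suggested alternative (patch only the heavyweight parts so `__init__` still runs and the object's invariants hold) can be illustrated with a toy class; HeavyClient and _connect are hypothetical stand-ins, not the real VLLMMultimodalModel:

```python
from unittest import mock


class HeavyClient:
    def __init__(self):
        # Heavyweight setup we want to avoid in unit tests.
        self.session = self._connect()
        # Init-time default that tests should still exercise.
        self.prefix = ">> "

    def _connect(self):
        raise RuntimeError("network unavailable in tests")

    def format(self, text: str) -> str:
        return self.prefix + text


# Patch only the expensive call; __init__ still runs and sets defaults.
with mock.patch.object(HeavyClient, "_connect", return_value=None):
    client = HeavyClient()
```

Unlike replacing `__init__` with `lambda: None`, this keeps construction-time attributes like `prefix` in place, so the test object matches production invariants.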

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@greptile-apps (bot) left a comment

2 files reviewed, 2 comments

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@greptile-apps (bot) left a comment

No files reviewed, no comments

@Jorjeous Jorjeous closed this Feb 5, 2026
