
Unified server to work with models not supported by vLLM #1138

Closed
vmendelev wants to merge 1 commit into main from vmendelev/2512_unified_serve

Conversation

@vmendelev
Collaborator

@vmendelev vmendelev commented Dec 20, 2025

This is a simple FastAPI server to which anybody can add a backend that complies with the nemo-skills client-server approach. I already have some backends and will push them later.
If somebody has a better idea for making it easier to run nemo-skills with early research models not supported by vLLM or SGLang, please suggest it.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added unified inference server with OpenAI-compatible API supporting multiple backends (SALM, TTS, S2S)
    • Introduced session management for multi-turn conversations with TTL-based lifecycle
    • Added health check and model listing endpoints
    • Enabled audio input/output processing and configurable batch request handling
    • Support for saving response metadata and audio artifacts


Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
@vmendelev vmendelev force-pushed the vmendelev/2512_unified_serve branch from c98a94a to 606eb81 on December 20, 2025 at 13:52
@coderabbitai
Contributor

coderabbitai bot commented Dec 20, 2025

📝 Walkthrough

Walkthrough

This PR introduces a new unified inference server for NeMo supporting multiple backends (SALM, TTS, S2S) with OpenAI-compatible REST API. It includes a CLI launcher, backend abstraction layer, session management for multi-turn interactions, request batching, and FastAPI-based server implementation.

Changes

Cohort / File(s) Summary
CLI Launcher
nemo_skills/inference/server/serve_unified.py
Adds a vLLM-compatible CLI wrapper to launch the unified server. Handles Python path setup, safetensors patching, environment-variable configuration, backend-specific options, and uvicorn server startup via recipes.multimodal.server.unified_server.
Backend Infrastructure
recipes/multimodal/server/backends/base.py
recipes/multimodal/server/backends/__init__.py
Introduces abstract backend interfaces and registry: Modality enum, BackendConfig/GenerationRequest/GenerationResult dataclasses, InferenceBackend abstract base class with load_model()/generate() contracts, and registry-based backend lazy-loading via get_backend()/list_backends().
Session Management
recipes/multimodal/server/session_manager.py
Adds thread-safe session lifecycle management with TTL-based cleanup and capacity eviction. Tracks session state (LLM cache, audio/text buffers, turn history) and per-turn metadata (TurnData).
Server Package & Core
recipes/multimodal/server/__init__.py
recipes/multimodal/server/unified_server.py
Package init exports backend interfaces. unified_server.py implements FastAPI app with OpenAI-compatible /v1/chat/completions, /health, session endpoints, request batching (RequestBatcher), audio/message parsing utilities, and create_app() factory.
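Taken together, the backend abstraction and registry described in the table above could look roughly like the following. This is a simplified illustration under assumed signatures, not the PR's actual code — the real base.py and registry carry more fields (audio inputs, modality handling, richer configs), and the import paths in the registry are placeholders:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class Modality(str, Enum):
    # (str, Enum) so values serialize cleanly to JSON/OpenAPI
    TEXT = "text"
    AUDIO = "audio"


@dataclass
class BackendConfig:
    model_path: str
    extra: Dict[str, Any] = field(default_factory=dict)  # backend-specific options


@dataclass
class GenerationRequest:
    text: Optional[str] = None


@dataclass
class GenerationResult:
    text: str = ""
    error: Optional[str] = None


class InferenceBackend(ABC):
    """Contract every backend implements: load once, then batch-generate."""

    def __init__(self, config: BackendConfig):
        self.config = config

    @abstractmethod
    def load_model(self) -> None: ...

    @abstractmethod
    def generate(self, requests: List[GenerationRequest]) -> List[GenerationResult]: ...


# Registry maps a name to an import path; the module is only imported when
# get_backend() is called, so optional dependencies stay optional.
_BACKEND_REGISTRY: Dict[str, str] = {
    "salm": "backends.salm",  # placeholder module paths
    "tts": "backends.tts",
}


def list_backends() -> List[str]:
    return sorted(_BACKEND_REGISTRY)


def get_backend(backend_name: str) -> type:
    """Return the backend class (not an instance), importing it lazily."""
    if backend_name not in _BACKEND_REGISTRY:
        raise ValueError(f"Unknown backend '{backend_name}'. Available: {list_backends()}")
    import importlib

    module = importlib.import_module(_BACKEND_REGISTRY[backend_name])
    return module.Backend
```

The lazy-import indirection is what lets `list_backends()` advertise backends whose heavy dependencies (e.g. audio codecs) are not installed, failing only when one is actually requested.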

Sequence Diagram(s)

sequenceDiagram
    participant Client as External Client
    participant CLI as CLI<br/>(serve_unified.py)
    participant App as FastAPI App<br/>(unified_server.py)
    participant Batcher as RequestBatcher
    participant Backend as Backend<br/>(SALM/TTS/S2S)
    participant Session as SessionManager<br/>(S2S only)

    Note over CLI: Startup Phase
    Client->>CLI: python serve_unified.py --backend salm ...
    CLI->>CLI: setup_pythonpath()
    CLI->>CLI: apply_safetensors_patch()
    CLI->>CLI: Populate env vars & build extra_config
    CLI->>App: create_app(backend_type, model_path, ...)
    App->>Backend: __init__(config)
    App->>Backend: load_model()
    App->>App: Initialize RequestBatcher
    alt S2S with sessions
        App->>Session: Initialize SessionManager
    end
    App-->>CLI: FastAPI app ready
    CLI->>CLI: uvicorn.run(app)

    Note over Client,Session: Request Phase
    Client->>App: POST /v1/chat/completions<br/>(messages with text/audio)
    App->>App: extract_text_from_messages()
    App->>App: extract_audio_from_messages()
    App->>App: extract_system_prompt()
    App->>App: Create GenerationRequest
    
    alt S2S with session
        App->>Session: get_or_create_session(session_id)
        Session-->>App: SessionState
        App->>App: Populate session_id in request
    end
    
    App->>Batcher: add_request(GenerationRequest)
    
    alt Batch full or immediate
        Batcher->>Batcher: _process_batch()
        Batcher->>Backend: generate([requests])
        Backend-->>Batcher: [GenerationResults]
    else Batch timeout
        Batcher->>Batcher: _timeout_handler()
        Batcher->>Batcher: _process_batch()
        Batcher->>Backend: generate([requests])
        Backend-->>Batcher: [GenerationResults]
    end
    
    Batcher-->>App: GenerationResult
    
    alt S2S with session
        App->>Session: save_session(session_id, updated_state)
    end
    
    App->>App: Save outputs (JSON/WAV) to disk
    App-->>Client: OpenAI-compatible response<br/>(text + optional audio)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–75 minutes

  • Multiple interconnected subsystems: Backend abstraction, session management, request batching, and async FastAPI server with non-trivial dependencies between them.
  • Concurrency concerns: Thread-safe session manager with locks, async request batching, and concurrent request handling require careful validation.
  • Complex data flow: Audio extraction, message parsing, session state mutation, batch processing, and result assembly across multiple layers.
  • Backend extensibility: Abstract base class design and registry pattern should support future backends without modification.

Areas requiring extra attention:

  • Thread safety in SessionManager (lock management, reentrant lock usage, eviction logic)
  • Async batching logic in RequestBatcher (timeout handling, concurrent request queueing, future resolution)
  • Audio data handling and format conversions (bytes, paths, sample rates, WAV serialization)
  • CLI environment setup correctness (Python path discovery, patch application, safetensors handling)
  • Backend interface contract and error propagation (ImportError handling, invalid backends, missing dependencies)

Suggested labels

run GPU tests

Suggested reviewers

  • activatedgeek
  • Kipok

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately describes the main change: a unified server enabling models not supported by vLLM, which aligns with the PR's core objective to add a FastAPI server for backend integration with early research models.
Docstring Coverage | ✅ Passed | Docstring coverage is 91.30%, which is sufficient. The required threshold is 80.00%.
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (5)
recipes/multimodal/server/__init__.py (1)

32-38: Consider sorting __all__ alphabetically.

Static analysis flagged that __all__ is not sorted. This is a minor style preference.

🔎 Proposed fix
 __all__ = [
-    "InferenceBackend",
+    "BackendConfig",
     "GenerationRequest",
     "GenerationResult",
-    "BackendConfig",
     "get_backend",
+    "InferenceBackend",
 ]
recipes/multimodal/server/backends/__init__.py (1)

28-36: Consider sorting __all__ alphabetically.

Similar to the parent __init__.py, this is a minor style preference flagged by static analysis.

🔎 Proposed fix
 __all__ = [
-    "InferenceBackend",
+    "BackendConfig",
     "GenerationRequest",
     "GenerationResult",
-    "BackendConfig",
+    "get_backend",
+    "InferenceBackend",
+    "list_backends",
     "Modality",
-    "get_backend",
-    "list_backends",
 ]
nemo_skills/inference/server/serve_unified.py (1)

258-259: Prefix unused variable with underscore.

extra_args is captured but never used. Prefix with underscore to indicate it's intentionally unused.

🔎 Proposed fix
     # Parse known args, allowing extra args to be passed through
-    args, extra_args = parser.parse_known_args()
+    args, _extra_args = parser.parse_known_args()
recipes/multimodal/server/unified_server.py (2)

605-611: Use raise ... from e for exception chaining.

When re-raising an exception, use from e to preserve the exception chain for better debugging.

🔎 Proposed fix
         except HTTPException:
             raise
         except Exception as e:
             import traceback

             traceback.print_exc()
-            raise HTTPException(status_code=500, detail=str(e))
+            raise HTTPException(status_code=500, detail=str(e)) from e

144-148: Add strict=True to zip to catch mismatched lengths.

If batch and results have different lengths due to a backend bug, this would silently drop items. Using strict=True (Python 3.10+) would raise an error instead.

             # Complete futures
-            for pending, result in zip(batch, results):
+            for pending, result in zip(batch, results, strict=True):
                 if not pending.future.done():
                     pending.future.set_result(result)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7205c43 and c98a94a.

📒 Files selected for processing (6)
  • nemo_skills/inference/server/serve_unified.py (1 hunks)
  • recipes/multimodal/server/__init__.py (1 hunks)
  • recipes/multimodal/server/backends/__init__.py (1 hunks)
  • recipes/multimodal/server/backends/base.py (1 hunks)
  • recipes/multimodal/server/session_manager.py (1 hunks)
  • recipes/multimodal/server/unified_server.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
recipes/multimodal/server/backends/__init__.py (1)
recipes/multimodal/server/backends/base.py (5)
  • BackendConfig (37-59)
  • GenerationRequest (63-95)
  • GenerationResult (99-126)
  • InferenceBackend (129-251)
  • Modality (28-33)
recipes/multimodal/server/session_manager.py (2)
recipes/multimodal/server/unified_server.py (3)
  • get_session (346-353)
  • delete_session (356-380)
  • list_sessions (335-343)
nemo_skills/inference/chat_interface/core.py (1)
  • get (136-151)
recipes/multimodal/server/__init__.py (1)
recipes/multimodal/server/backends/base.py (4)
  • BackendConfig (37-59)
  • GenerationRequest (63-95)
  • GenerationResult (99-126)
  • InferenceBackend (129-251)
🪛 Ruff (0.14.8)
recipes/multimodal/server/backends/__init__.py

28-36: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)


69-69: Avoid specifying long messages outside the exception class

(TRY003)


79-81: Avoid specifying long messages outside the exception class

(TRY003)

recipes/multimodal/server/__init__.py

32-38: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

nemo_skills/inference/server/serve_unified.py

121-121: Do not catch blind exception: Exception

(BLE001)


149-149: Possible binding to all interfaces

(S104)


259-259: Unpacked variable extra_args is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


388-388: Do not catch blind exception: Exception

(BLE001)

recipes/multimodal/server/unified_server.py

49-49: Possible binding to all interfaces

(S104)


105-105: Store a reference to the return value of asyncio.create_task

(RUF006)


108-108: Store a reference to the return value of asyncio.create_task

(RUF006)


122-122: Store a reference to the return value of asyncio.create_task

(RUF006)


146-146: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


154-154: Do not catch blind exception: Exception

(BLE001)


165-165: Store a reference to the return value of asyncio.create_task

(RUF006)


242-242: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


371-371: Do not catch blind exception: Exception

(BLE001)


434-434: Abstract raise to an inner function

(TRY301)


466-466: Probable use of insecure hash functions in hashlib: md5

(S324)


472-472: Abstract raise to an inner function

(TRY301)


498-498: Abstract raise to an inner function

(TRY301)


501-501: Probable use of insecure hash functions in hashlib: md5

(S324)


535-535: Do not catch blind exception: Exception

(BLE001)


547-547: Do not catch blind exception: Exception

(BLE001)


603-603: Consider moving this statement to an else block

(TRY300)


607-607: Do not catch blind exception: Exception

(BLE001)


611-611: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (17)
recipes/multimodal/server/__init__.py (1)

1-30: LGTM!

The module correctly re-exports the core backend interfaces from the .backends subpackage, providing a clean public API surface for the unified server package. The docstring accurately describes the supported backends.

recipes/multimodal/server/backends/__init__.py (1)

67-81: LGTM - Lazy loading pattern is appropriate.

The lazy loading implementation correctly defers importing backend modules until they're needed, which is ideal for optional dependencies. The error handling provides clear messages for debugging import issues.

nemo_skills/inference/server/serve_unified.py (3)

109-122: The safetensors patching approach is risky and may cause issues.

This function overwrites an installed package file, which:

  1. Could fail on read-only/containerized environments
  2. May cause unexpected behavior in other applications using safetensors
  3. Does not restore the original file

Consider documenting this behavior more prominently or using a safer patching mechanism (e.g., monkey-patching at runtime instead of file replacement).


365-393: LGTM - Server startup with proper error handling.

The import-time error handling correctly distinguishes between missing dependencies (ImportError) and runtime errors, providing helpful messages in both cases. The traceback printing for generic exceptions aids debugging.


125-177: LGTM - CLI argument structure is comprehensive.

The argument parser provides a complete interface for configuring the unified server with sensible defaults and clear help text. Backend-specific options are well-organized.

recipes/multimodal/server/unified_server.py (3)

77-92: LGTM - RequestBatcher design is sound.

The batcher correctly implements configurable batching with timeout and immediate processing modes. The use of asyncio.Lock for thread safety and Future for result propagation is appropriate.
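The batching-with-timeout design referenced here can be illustrated with a minimal standalone sketch (assumed names and signatures, not the PR's actual RequestBatcher; note it already stores task references, which the actual code is flagged for below):

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class PendingRequest:
    request: Any
    future: asyncio.Future
    timestamp: float


class RequestBatcher:
    """Collects requests into batches and resolves each caller's Future."""

    def __init__(self, generate_fn: Callable[[List[Any]], List[Any]],
                 batch_size: int = 4, batch_timeout: float = 0.05):
        self.generate_fn = generate_fn
        self.batch_size = batch_size
        self.batch_timeout = batch_timeout
        self.pending: List[PendingRequest] = []
        self.lock = asyncio.Lock()
        self._tasks: set = set()  # keep references so tasks are not GC'd

    def _spawn(self, coro) -> None:
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def add_request(self, request: Any) -> Any:
        future: asyncio.Future = asyncio.get_running_loop().create_future()
        async with self.lock:
            self.pending.append(PendingRequest(request, future, time.time()))
            if len(self.pending) >= self.batch_size:
                self._spawn(self._process_batch())  # batch full: flush now
            else:
                self._spawn(self._timeout_handler())  # otherwise flush on timeout
        return await future

    async def _timeout_handler(self) -> None:
        await asyncio.sleep(self.batch_timeout)
        await self._process_batch()

    async def _process_batch(self) -> None:
        async with self.lock:
            batch, self.pending = self.pending, []
        if not batch:
            return
        results = self.generate_fn([p.request for p in batch])
        # strict=True surfaces backend bugs that return a wrong-length list
        for pending, result in zip(batch, results, strict=True):
            if not pending.future.done():
                pending.future.set_result(result)
```

Each caller awaits its own Future, so concurrent HTTP handlers transparently share a single backend `generate()` call per batch.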


177-231: LGTM - Message parsing utilities are well-implemented.

The extraction functions correctly handle OpenAI-format messages with both string and list content types. The audio URL parsing with regex is appropriate for data URLs.
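A regex-based data-URL extractor of the kind described might look like this sketch. The exact message schema (the `input_audio` part type and nested `data` key) is an assumption about the OpenAI-style format, not taken from the PR:

```python
import base64
import re
from typing import List, Optional, Tuple

# Matches data URLs like "data:audio/wav;base64,<payload>".
_DATA_URL_RE = re.compile(r"^data:(audio/[\w.+-]+);base64,(.+)$", re.DOTALL)


def extract_audio_from_messages(messages: List[dict]) -> Optional[Tuple[str, bytes]]:
    """Return (mime_type, raw_bytes) for the first audio part found, else None."""
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):  # plain string content carries no audio
            continue
        for part in content:
            if part.get("type") == "input_audio":
                match = _DATA_URL_RE.match(part["input_audio"]["data"])
                if match:
                    mime, payload = match.groups()
                    return mime, base64.b64decode(payload)
    return None
```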


272-318: LGTM - Startup routine is comprehensive.

The startup event correctly initializes the backend, creates the batcher, and optionally enables session management. The configuration logging is helpful for debugging.

recipes/multimodal/server/session_manager.py (4)

31-75: LGTM - Data structures are well-designed.

TurnData and SessionState provide comprehensive state tracking for multi-turn conversations. The use of field(default_factory=...) for mutable defaults is correct, and the touch() method provides clean timestamp updates.


77-121: LGTM - SessionManager initialization and create_session are well-implemented.

The manager correctly uses RLock for reentrant locking, implements TTL-based cleanup, and handles capacity eviction. The session creation flow is robust.
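The TTL-plus-eviction pattern praised here can be sketched in a few lines (a simplified illustration with assumed field names, not the PR's SessionManager, which also tracks LLM cache, audio/text buffers, and TurnData):

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SessionState:
    session_id: str
    created_at: float = field(default_factory=time.time)
    last_accessed: float = field(default_factory=time.time)
    turn_history: List[dict] = field(default_factory=list)

    def touch(self) -> None:
        self.last_accessed = time.time()


class SessionManager:
    def __init__(self, ttl: float = 300.0, max_sessions: int = 100):
        self.ttl = ttl
        self.max_sessions = max_sessions
        self._sessions: Dict[str, SessionState] = {}
        # RLock: public methods can call each other without deadlocking
        self._lock = threading.RLock()

    def get_or_create_session(self, session_id: str) -> SessionState:
        with self._lock:
            self._cleanup_expired()
            state = self._sessions.get(session_id)
            if state is None:
                if len(self._sessions) >= self.max_sessions:
                    # Capacity eviction: drop the least recently used session.
                    oldest = min(self._sessions.values(), key=lambda s: s.last_accessed)
                    del self._sessions[oldest.session_id]
                state = SessionState(session_id=session_id)
                self._sessions[session_id] = state
            state.touch()  # accessing a session extends its lifetime
            return state

    def _cleanup_expired(self) -> None:
        now = time.time()
        for sid in list(self._sessions):  # snapshot: safe to delete while iterating
            if now - self._sessions[sid].last_accessed > self.ttl:
                del self._sessions[sid]
```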


123-162: LGTM - Session retrieval with TTL enforcement.

get_session correctly checks expiry before returning, and get_or_create_session provides a convenient idempotent accessor. The touch() call on access properly extends the session lifetime.


224-249: LGTM - Cleanup and eviction logic is correct.

The expired session cleanup iterates over a snapshot of session IDs, and the eviction correctly finds the oldest by last_accessed. Using min() with a key function is clean.

recipes/multimodal/server/backends/base.py (5)

1-26: LGTM!

License header, module docstring, and imports are well-structured and appropriate for an abstract base interface module.


28-33: LGTM!

Good use of (str, Enum) pattern for serialization compatibility with JSON/OpenAPI.


36-59: LGTM!

Well-designed config class with proper handling of mutable defaults and flexible from_dict factory that cleanly separates known fields from extra configuration.


62-96: LGTM!

Comprehensive request structure supporting multiple input modalities and generation parameter overrides. Mutable default for extra_params is correctly handled with field(default_factory=dict).


98-127: LGTM!

Clean result structure with proper metadata tracking and simple error handling pattern.

Comment on lines +53 to +66
def get_backend(backend_name: str) -> type:
"""
Get backend class by name with lazy loading.

Args:
backend_name: One of 'salm', 'tts', 's2s'

Returns:
Backend class (not instance)

Raises:
ValueError: If backend name is unknown
ImportError: If backend dependencies are not available
"""
Contributor

⚠️ Potential issue | 🟡 Minor

Update docstring to reflect all available backends.

The docstring says backend_name: One of 'salm', 'tts', 's2s' but the registry includes 5 backends: salm, tts, s2s, s2s_incremental, and s2s_session.

🔎 Proposed fix
     """
     Get backend class by name with lazy loading.

     Args:
-        backend_name: One of 'salm', 'tts', 's2s'
+        backend_name: One of 'salm', 'tts', 's2s', 's2s_incremental', 's2s_session'

     Returns:
         Backend class (not instance)

     Raises:
         ValueError: If backend name is unknown
         ImportError: If backend dependencies are not available
     """

Comment on lines +221 to +230
def get_generation_params(self, request: GenerationRequest) -> Dict[str, Any]:
"""
Get effective generation parameters, merging request with config defaults.
"""
return {
"max_new_tokens": request.max_new_tokens or self.config.max_new_tokens,
"temperature": request.temperature or self.config.temperature,
"top_p": request.top_p or self.config.top_p,
"top_k": request.top_k or self.config.top_k,
}
Contributor

@coderabbitai coderabbitai bot Dec 20, 2025


⚠️ Potential issue | 🔴 Critical

Falsy value bug: temperature=0.0 and top_p=0.0 are silently overwritten.

Using or to merge request overrides with config defaults treats 0.0 as falsy. If a caller sets temperature=0.0 for greedy decoding, the config default will be used instead.
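The pitfall is easy to reproduce in isolation (a minimal sketch with hypothetical names, not the PR's actual classes):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    temperature: float = 0.7  # server-side default


@dataclass
class Request:
    temperature: Optional[float] = None  # None means "use the default"


def merge_buggy(req: Request, cfg: Config) -> float:
    return req.temperature or cfg.temperature  # 0.0 is falsy, so default wins


def merge_fixed(req: Request, cfg: Config) -> float:
    return req.temperature if req.temperature is not None else cfg.temperature


greedy = Request(temperature=0.0)  # caller asks for greedy decoding
print(merge_buggy(greedy, Config()))  # 0.7 -- caller's 0.0 silently dropped
print(merge_fixed(greedy, Config()))  # 0.0 -- preserved
```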

🔎 Proposed fix using explicit None checks
     def get_generation_params(self, request: GenerationRequest) -> Dict[str, Any]:
         """
         Get effective generation parameters, merging request with config defaults.
         """
         return {
-            "max_new_tokens": request.max_new_tokens or self.config.max_new_tokens,
-            "temperature": request.temperature or self.config.temperature,
-            "top_p": request.top_p or self.config.top_p,
-            "top_k": request.top_k or self.config.top_k,
+            "max_new_tokens": request.max_new_tokens if request.max_new_tokens is not None else self.config.max_new_tokens,
+            "temperature": request.temperature if request.temperature is not None else self.config.temperature,
+            "top_p": request.top_p if request.top_p is not None else self.config.top_p,
+            "top_k": request.top_k if request.top_k is not None else self.config.top_k,
         }

Collaborator

feels like something we should fix @vmendelev

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +241 to +242
has_text_input = request.text is not None
has_audio_input = request.audio_bytes is not None or request.audio_path is not None
Contributor

⚠️ Potential issue | 🟡 Minor

Validation misses multi-turn audio inputs.

has_audio_input only checks audio_bytes and audio_path, but GenerationRequest also supports audio_bytes_list and audio_paths for multi-turn scenarios. Requests with only multi-turn audio would incorrectly fail validation.

🔎 Proposed fix to include multi-turn audio fields
         has_text_input = request.text is not None
-        has_audio_input = request.audio_bytes is not None or request.audio_path is not None
+        has_audio_input = (
+            request.audio_bytes is not None
+            or request.audio_path is not None
+            or request.audio_bytes_list is not None
+            or request.audio_paths is not None
+        )

Comment on lines +214 to +217
"has_llm_cache": state.llm_cache is not None,
"has_input_embeds_history": state.input_embeds_history is not None
and len(state.input_embeds_history) > 0,
}
Contributor

⚠️ Potential issue | 🟡 Minor

Potential AttributeError if input_embeds_history is not a list.

The code calls len(state.input_embeds_history) but input_embeds_history is typed as Any, which could be any type (e.g., a tensor, None after initial check, etc.). If it's a type without __len__, this will raise an AttributeError.

🔎 Proposed fix
             return {
                 "session_id": state.session_id,
                 "frame_idx": state.frame_idx,
                 "turn_count": state.turn_count,
                 "created_at": state.created_at,
                 "last_accessed": state.last_accessed,
                 "has_llm_cache": state.llm_cache is not None,
-                "has_input_embeds_history": state.input_embeds_history is not None
-                and len(state.input_embeds_history) > 0,
+                "has_input_embeds_history": state.input_embeds_history is not None
+                and hasattr(state.input_embeds_history, "__len__")
+                and len(state.input_embeds_history) > 0,
             }

Comment on lines +105 to +111
asyncio.create_task(self._process_batch())
elif self.batch_timeout == 0:
# No delay mode
asyncio.create_task(self._process_batch())
elif self.timeout_task is None or self.timeout_task.done():
# Schedule timeout
self.timeout_task = asyncio.create_task(self._timeout_handler())
Contributor

⚠️ Potential issue | 🟠 Major

Store asyncio.create_task references to prevent garbage collection.

Tasks created with asyncio.create_task() without storing the reference can be garbage collected before completion. This could cause batch processing to be interrupted unexpectedly.

🔎 Proposed fix

Consider tracking background tasks in a set:

+    # Track background tasks
+    self._background_tasks: set = set()
+
     async def add_request(self, request: GenerationRequest) -> GenerationResult:
         """Add a request and wait for result."""
         future = asyncio.Future()
         pending = PendingRequest(request=request, future=future, timestamp=time.time())

         async with self.lock:
             self.pending_requests.append(pending)

             # Check if we should process immediately
             if len(self.pending_requests) >= self.batch_size:
                 if DEBUG:
                     print(f"[Batcher] Batch full ({self.batch_size}), processing immediately")
-                asyncio.create_task(self._process_batch())
+                task = asyncio.create_task(self._process_batch())
+                self._background_tasks.add(task)
+                task.add_done_callback(self._background_tasks.discard)
             elif self.batch_timeout == 0:
                 # No delay mode
-                asyncio.create_task(self._process_batch())
+                task = asyncio.create_task(self._process_batch())
+                self._background_tasks.add(task)
+                task.add_done_callback(self._background_tasks.discard)
             elif self.timeout_task is None or self.timeout_task.done():
                 # Schedule timeout
                 self.timeout_task = asyncio.create_task(self._timeout_handler())

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In recipes/multimodal/server/unified_server.py around lines 105 to 111, tasks
created via asyncio.create_task() are not stored and can be garbage-collected;
modify the code to keep references (e.g., maintain a self._background_tasks
set), add each created task to that set, attach a done callback to remove the
task when finished (and handle exceptions in the callback), and ensure tasks are
cancelled/awaited on shutdown to avoid leaks or premature cancellation.

Comment on lines +242 to +250
extra_config: Dict[str, Any] = None,
) -> FastAPI:
"""Create and configure the FastAPI app."""
global backend_instance, request_batcher, session_manager, server_config

# Extract server-level config from extra_config
ignore_system_prompt = extra_config.pop("ignore_system_prompt", False) if extra_config else False
session_ttl = extra_config.pop("session_ttl", 300.0) if extra_config else 300.0
max_sessions = extra_config.pop("max_sessions", 100) if extra_config else 100
⚠️ Potential issue | 🟠 Major

Avoid mutating the caller's dict with pop() and fix implicit Optional.

The extra_config.pop() calls mutate the dictionary passed by the caller, which could cause unexpected behavior. Also, the type hint should use Optional[Dict[str, Any]] instead of Dict[str, Any] = None.

🔎 Proposed fix
 def create_app(
     backend_type: str = BACKEND_TYPE,
     model_path: str = MODEL_PATH,
     codec_model_path: str = CODEC_MODEL_PATH,
     batch_size: int = BATCH_SIZE,
     batch_timeout: float = BATCH_TIMEOUT,
     device: str = "cuda",
     dtype: str = "bfloat16",
-    extra_config: Dict[str, Any] = None,
+    extra_config: Optional[Dict[str, Any]] = None,
 ) -> FastAPI:
     """Create and configure the FastAPI app."""
     global backend_instance, request_batcher, session_manager, server_config

+    # Make a copy to avoid mutating the caller's dict
+    extra_config = dict(extra_config) if extra_config else {}
+
     # Extract server-level config from extra_config
-    ignore_system_prompt = extra_config.pop("ignore_system_prompt", False) if extra_config else False
-    session_ttl = extra_config.pop("session_ttl", 300.0) if extra_config else 300.0
-    max_sessions = extra_config.pop("max_sessions", 100) if extra_config else 100
+    ignore_system_prompt = extra_config.pop("ignore_system_prompt", False)
+    session_ttl = extra_config.pop("session_ttl", 300.0)
+    max_sessions = extra_config.pop("max_sessions", 100)
🧰 Tools
🪛 Ruff (0.14.8)

242-242: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

🤖 Prompt for AI Agents
In recipes/multimodal/server/unified_server.py around lines 242 to 250, the
function signature and use of extra_config are wrong: the type hint should be
Optional[Dict[str, Any]] instead of Dict[str, Any] = None, and using
extra_config.pop(...) mutates the caller's dict. Fix by changing the parameter
type to Optional[Dict[str, Any]] and safely read values without mutating the
original (e.g., if extra_config is None set a local empty dict, or make a
shallow copy and use .get(...) for ignore_system_prompt, session_ttl and
max_sessions, or copy = dict(extra_config) then pop from the copy), preserving
defaults if keys are missing.
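The non-mutating variant is easy to verify in isolation; here is a sketch under the assumption that the three keys and their defaults are exactly as shown in the diff:

```python
from typing import Any, Dict, Optional, Tuple

def extract_server_config(
    extra_config: Optional[Dict[str, Any]] = None,
) -> Tuple[bool, float, int, Dict[str, Any]]:
    """Copy before popping so the caller's dict is left untouched;
    leftover keys remain available for backend-specific config."""
    cfg = dict(extra_config) if extra_config else {}
    ignore_system_prompt = cfg.pop("ignore_system_prompt", False)
    session_ttl = cfg.pop("session_ttl", 300.0)
    max_sessions = cfg.pop("max_sessions", 100)
    return ignore_system_prompt, session_ttl, max_sessions, cfg

caller_cfg = {"session_ttl": 60.0, "backend_opt": "x"}
_, ttl, _, leftover = extract_server_config(caller_cfg)
print(ttl, caller_cfg, leftover)  # → 60.0, caller dict intact, {'backend_opt': 'x'}
```

The shallow copy is enough here because the popped values are scalars; nested dicts shared with the caller would still be aliased.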

Comment on lines +511 to +513
save_dir = os.environ.get(
"AUDIO_SAVE_DIR", "/lustre/fsw/portfolios/llmservice/users/vmendelev/tmp/voicebench_test"
)
⚠️ Potential issue | 🟠 Major

Remove hardcoded user-specific path from default.

The default save directory contains a user-specific path (/lustre/fsw/portfolios/llmservice/users/vmendelev/tmp/voicebench_test). This should use a more generic default like /tmp/unified_server_outputs or require explicit configuration.

🔎 Proposed fix
             save_dir = os.environ.get(
-                "AUDIO_SAVE_DIR", "/lustre/fsw/portfolios/llmservice/users/vmendelev/tmp/voicebench_test"
+                "AUDIO_SAVE_DIR", "/tmp/unified_server_outputs"
             )
🤖 Prompt for AI Agents
In recipes/multimodal/server/unified_server.py around lines 511 to 513, the
default AUDIO_SAVE_DIR uses a hardcoded user-specific path; change the default
to a generic location (e.g., "/tmp/unified_server_outputs") or require callers
to set AUDIO_SAVE_DIR by raising a clear error if unset; update the
os.environ.get call to use the generic path (or replace with
os.environ["AUDIO_SAVE_DIR"] and add a clear startup-check that logs/raises if
not provided) and ensure any tests or docs reflecting the old path are updated
accordingly.
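For the stricter alternative the prompt mentions (fail fast when the variable is unset), a sketch follows; the helper name and error message are illustrative:

```python
import os

def resolve_audio_save_dir() -> str:
    """Require AUDIO_SAVE_DIR explicitly instead of baking in a
    user-specific default path."""
    save_dir = os.environ.get("AUDIO_SAVE_DIR")
    if not save_dir:
        raise RuntimeError(
            "AUDIO_SAVE_DIR is not set; export it before starting the server"
        )
    os.makedirs(save_dir, exist_ok=True)
    return save_dir
```

Raising at startup makes misconfiguration visible immediately, rather than silently scattering audio artifacts into a default directory that may not exist on other clusters.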

Collaborator

@Kipok Kipok left a comment
This looks good to me. I don't think there is any other way to support this, unfortunately. I didn't review the low-level details, but as long as this works for you, I think we can merge.

The only thing I'd suggest: let's maybe rename serve_unified to something more explicit that makes it clear this is for audio models.

Comment on lines +221 to +230
def get_generation_params(self, request: GenerationRequest) -> Dict[str, Any]:
"""
Get effective generation parameters, merging request with config defaults.
"""
return {
"max_new_tokens": request.max_new_tokens or self.config.max_new_tokens,
"temperature": request.temperature or self.config.temperature,
"top_p": request.top_p or self.config.top_p,
"top_k": request.top_k or self.config.top_k,
}
Collaborator
feels like something we should fix @vmendelev
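A likely concern with this merge logic: `or` treats explicitly passed falsy values as missing, so a request with temperature=0.0 (greedy decoding) or top_k=0 silently falls back to the config default. A minimal sketch contrasting the two styles (field names mirror the snippet above; the config value is illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GenerationRequest:
    temperature: Optional[float] = None

CONFIG_TEMPERATURE = 0.7  # stand-in for self.config.temperature

def merge_with_or(request: GenerationRequest) -> float:
    # Current behavior: 0.0 is falsy, so it is replaced by the default.
    return request.temperature or CONFIG_TEMPERATURE

def merge_with_is_none(request: GenerationRequest) -> float:
    # Explicit None check preserves a deliberate temperature of 0.0.
    return request.temperature if request.temperature is not None else CONFIG_TEMPERATURE

greedy = GenerationRequest(temperature=0.0)
print(merge_with_or(greedy), merge_with_is_none(greedy))  # → 0.7 0.0
```

The same `is None` treatment would apply to top_p and top_k, since 0 is a plausible explicit value for top_k as well.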

@vmendelev
Collaborator Author

Closed in favor of #1250

@vmendelev vmendelev closed this Feb 27, 2026

