
Unified server to work with models not supported by vLLM #1138

Closed
vmendelev wants to merge 1 commit into main from vmendelev/2512_unified_serve

Conversation

@vmendelev
Collaborator

@vmendelev vmendelev commented Dec 20, 2025

This is a simple FastAPI server to which anybody can add a backend that complies with the nemo-skills client-server approach. I already have some backends and will push them later.
If somebody has a better idea for making it easier to run nemo-skills with early research models not supported by vLLM or SGLang, please suggest it.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added unified inference server with OpenAI-compatible API supporting multiple backends (SALM, TTS, S2S)
    • Introduced session management for multi-turn conversations with TTL-based lifecycle
    • Added health check and model listing endpoints
    • Enabled audio input/output processing and configurable batch request handling
    • Support for saving response metadata and audio artifacts


Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
@vmendelev vmendelev force-pushed the vmendelev/2512_unified_serve branch from c98a94a to 606eb81 on December 20, 2025 at 13:52
@coderabbitai
Contributor

coderabbitai bot commented Dec 20, 2025

📝 Walkthrough

Walkthrough

This PR introduces a new unified inference server for NeMo supporting multiple backends (SALM, TTS, S2S) with OpenAI-compatible REST API. It includes a CLI launcher, backend abstraction layer, session management for multi-turn interactions, request batching, and FastAPI-based server implementation.

Changes

Cohort / File(s) Summary
CLI Launcher
nemo_skills/inference/server/serve_unified.py
Adds a vLLM-compatible CLI wrapper to launch the unified server. Handles Python path setup, safetensors patching, environment-variable configuration, backend-specific options, and uvicorn server startup via recipes.multimodal.server.unified_server.
Backend Infrastructure
recipes/multimodal/server/backends/base.py
recipes/multimodal/server/backends/__init__.py
Introduces abstract backend interfaces and registry: Modality enum, BackendConfig/GenerationRequest/GenerationResult dataclasses, InferenceBackend abstract base class with load_model()/generate() contracts, and registry-based backend lazy-loading via get_backend()/list_backends().
Session Management
recipes/multimodal/server/session_manager.py
Adds thread-safe session lifecycle management with TTL-based cleanup and capacity eviction. Tracks session state (LLM cache, audio/text buffers, turn history) and per-turn metadata (TurnData).
Server Package & Core
recipes/multimodal/server/__init__.py
recipes/multimodal/server/unified_server.py
Package init exports backend interfaces. unified_server.py implements FastAPI app with OpenAI-compatible /v1/chat/completions, /health, session endpoints, request batching (RequestBatcher), audio/message parsing utilities, and create_app() factory.
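Taken together, the backend abstraction and registry described in the table above could look roughly like the following. This is a simplified illustration under assumed signatures, not the PR's actual code — the real base.py and registry carry more fields (audio inputs, modality handling, richer configs), and the import paths in the registry are placeholders:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class Modality(str, Enum):
    # (str, Enum) so values serialize cleanly to JSON/OpenAPI
    TEXT = "text"
    AUDIO = "audio"


@dataclass
class BackendConfig:
    model_path: str
    extra: Dict[str, Any] = field(default_factory=dict)  # backend-specific options


@dataclass
class GenerationRequest:
    text: Optional[str] = None


@dataclass
class GenerationResult:
    text: str = ""
    error: Optional[str] = None


class InferenceBackend(ABC):
    """Contract every backend implements: load once, then batch-generate."""

    def __init__(self, config: BackendConfig):
        self.config = config

    @abstractmethod
    def load_model(self) -> None: ...

    @abstractmethod
    def generate(self, requests: List[GenerationRequest]) -> List[GenerationResult]: ...


# Registry maps a name to an import path; the module is only imported when
# get_backend() is called, so optional dependencies stay optional.
_BACKEND_REGISTRY: Dict[str, str] = {
    "salm": "backends.salm",  # placeholder module paths
    "tts": "backends.tts",
}


def list_backends() -> List[str]:
    return sorted(_BACKEND_REGISTRY)


def get_backend(backend_name: str) -> type:
    """Return the backend class (not an instance), importing it lazily."""
    if backend_name not in _BACKEND_REGISTRY:
        raise ValueError(f"Unknown backend '{backend_name}'. Available: {list_backends()}")
    import importlib

    module = importlib.import_module(_BACKEND_REGISTRY[backend_name])
    return module.Backend
```

The lazy-import indirection is what lets `list_backends()` advertise backends whose heavy dependencies (e.g. audio codecs) are not installed, failing only when one is actually requested.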

Sequence Diagram(s)

sequenceDiagram
    participant Client as External Client
    participant CLI as CLI<br/>(serve_unified.py)
    participant App as FastAPI App<br/>(unified_server.py)
    participant Batcher as RequestBatcher
    participant Backend as Backend<br/>(SALM/TTS/S2S)
    participant Session as SessionManager<br/>(S2S only)

    Note over CLI: Startup Phase
    Client->>CLI: python serve_unified.py --backend salm ...
    CLI->>CLI: setup_pythonpath()
    CLI->>CLI: apply_safetensors_patch()
    CLI->>CLI: Populate env vars & build extra_config
    CLI->>App: create_app(backend_type, model_path, ...)
    App->>Backend: __init__(config)
    App->>Backend: load_model()
    App->>App: Initialize RequestBatcher
    alt S2S with sessions
        App->>Session: Initialize SessionManager
    end
    App-->>CLI: FastAPI app ready
    CLI->>CLI: uvicorn.run(app)

    Note over Client,Session: Request Phase
    Client->>App: POST /v1/chat/completions<br/>(messages with text/audio)
    App->>App: extract_text_from_messages()
    App->>App: extract_audio_from_messages()
    App->>App: extract_system_prompt()
    App->>App: Create GenerationRequest
    
    alt S2S with session
        App->>Session: get_or_create_session(session_id)
        Session-->>App: SessionState
        App->>App: Populate session_id in request
    end
    
    App->>Batcher: add_request(GenerationRequest)
    
    alt Batch full or immediate
        Batcher->>Batcher: _process_batch()
        Batcher->>Backend: generate([requests])
        Backend-->>Batcher: [GenerationResults]
    else Batch timeout
        Batcher->>Batcher: _timeout_handler()
        Batcher->>Batcher: _process_batch()
        Batcher->>Backend: generate([requests])
        Backend-->>Batcher: [GenerationResults]
    end
    
    Batcher-->>App: GenerationResult
    
    alt S2S with session
        App->>Session: save_session(session_id, updated_state)
    end
    
    App->>App: Save outputs (JSON/WAV) to disk
    App-->>Client: OpenAI-compatible response<br/>(text + optional audio)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–75 minutes

  • Multiple interconnected subsystems: Backend abstraction, session management, request batching, and async FastAPI server with non-trivial dependencies between them.
  • Concurrency concerns: Thread-safe session manager with locks, async request batching, and concurrent request handling require careful validation.
  • Complex data flow: Audio extraction, message parsing, session state mutation, batch processing, and result assembly across multiple layers.
  • Backend extensibility: Abstract base class design and registry pattern should support future backends without modification.

Areas requiring extra attention:

  • Thread safety in SessionManager (lock management, reentrant lock usage, eviction logic)
  • Async batching logic in RequestBatcher (timeout handling, concurrent request queueing, future resolution)
  • Audio data handling and format conversions (bytes, paths, sample rates, WAV serialization)
  • CLI environment setup correctness (Python path discovery, patch application, safetensors handling)
  • Backend interface contract and error propagation (ImportError handling, invalid backends, missing dependencies)

Suggested labels

run GPU tests

Suggested reviewers

  • activatedgeek
  • Kipok

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately describes the main change: a unified server enabling models not supported by vLLM, which aligns with the PR's core objective to add a FastAPI server for backend integration with early research models.
Docstring Coverage | ✅ Passed | Docstring coverage is 91.30%, which is sufficient. The required threshold is 80.00%.
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (5)
recipes/multimodal/server/__init__.py (1)

32-38: Consider sorting __all__ alphabetically.

Static analysis flagged that __all__ is not sorted. This is a minor style preference.

🔎 Proposed fix
 __all__ = [
-    "InferenceBackend",
+    "BackendConfig",
     "GenerationRequest",
     "GenerationResult",
-    "BackendConfig",
     "get_backend",
+    "InferenceBackend",
 ]
recipes/multimodal/server/backends/__init__.py (1)

28-36: Consider sorting __all__ alphabetically.

Similar to the parent __init__.py, this is a minor style preference flagged by static analysis.

🔎 Proposed fix
 __all__ = [
-    "InferenceBackend",
+    "BackendConfig",
     "GenerationRequest",
     "GenerationResult",
-    "BackendConfig",
+    "get_backend",
+    "InferenceBackend",
+    "list_backends",
     "Modality",
-    "get_backend",
-    "list_backends",
 ]
nemo_skills/inference/server/serve_unified.py (1)

258-259: Prefix unused variable with underscore.

extra_args is captured but never used. Prefix with underscore to indicate it's intentionally unused.

🔎 Proposed fix
     # Parse known args, allowing extra args to be passed through
-    args, extra_args = parser.parse_known_args()
+    args, _extra_args = parser.parse_known_args()
recipes/multimodal/server/unified_server.py (2)

605-611: Use raise ... from e for exception chaining.

When re-raising an exception, use from e to preserve the exception chain for better debugging.

🔎 Proposed fix
         except HTTPException:
             raise
         except Exception as e:
             import traceback

             traceback.print_exc()
-            raise HTTPException(status_code=500, detail=str(e))
+            raise HTTPException(status_code=500, detail=str(e)) from e

144-148: Add strict=True to zip to catch mismatched lengths.

If batch and results have different lengths due to a backend bug, this would silently drop items. Using strict=True (Python 3.10+) would raise an error instead.

             # Complete futures
-            for pending, result in zip(batch, results):
+            for pending, result in zip(batch, results, strict=True):
                 if not pending.future.done():
                     pending.future.set_result(result)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7205c43 and c98a94a.

📒 Files selected for processing (6)
  • nemo_skills/inference/server/serve_unified.py (1 hunks)
  • recipes/multimodal/server/__init__.py (1 hunks)
  • recipes/multimodal/server/backends/__init__.py (1 hunks)
  • recipes/multimodal/server/backends/base.py (1 hunks)
  • recipes/multimodal/server/session_manager.py (1 hunks)
  • recipes/multimodal/server/unified_server.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
recipes/multimodal/server/backends/__init__.py (1)
recipes/multimodal/server/backends/base.py (5)
  • BackendConfig (37-59)
  • GenerationRequest (63-95)
  • GenerationResult (99-126)
  • InferenceBackend (129-251)
  • Modality (28-33)
recipes/multimodal/server/session_manager.py (2)
recipes/multimodal/server/unified_server.py (3)
  • get_session (346-353)
  • delete_session (356-380)
  • list_sessions (335-343)
nemo_skills/inference/chat_interface/core.py (1)
  • get (136-151)
recipes/multimodal/server/__init__.py (1)
recipes/multimodal/server/backends/base.py (4)
  • BackendConfig (37-59)
  • GenerationRequest (63-95)
  • GenerationResult (99-126)
  • InferenceBackend (129-251)
🪛 Ruff (0.14.8)
recipes/multimodal/server/backends/__init__.py

28-36: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)


69-69: Avoid specifying long messages outside the exception class

(TRY003)


79-81: Avoid specifying long messages outside the exception class

(TRY003)

recipes/multimodal/server/__init__.py

32-38: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

nemo_skills/inference/server/serve_unified.py

121-121: Do not catch blind exception: Exception

(BLE001)


149-149: Possible binding to all interfaces

(S104)


259-259: Unpacked variable extra_args is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


388-388: Do not catch blind exception: Exception

(BLE001)

recipes/multimodal/server/unified_server.py

49-49: Possible binding to all interfaces

(S104)


105-105: Store a reference to the return value of asyncio.create_task

(RUF006)


108-108: Store a reference to the return value of asyncio.create_task

(RUF006)


122-122: Store a reference to the return value of asyncio.create_task

(RUF006)


146-146: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


154-154: Do not catch blind exception: Exception

(BLE001)


165-165: Store a reference to the return value of asyncio.create_task

(RUF006)


242-242: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


371-371: Do not catch blind exception: Exception

(BLE001)


434-434: Abstract raise to an inner function

(TRY301)


466-466: Probable use of insecure hash functions in hashlib: md5

(S324)


472-472: Abstract raise to an inner function

(TRY301)


498-498: Abstract raise to an inner function

(TRY301)


501-501: Probable use of insecure hash functions in hashlib: md5

(S324)


535-535: Do not catch blind exception: Exception

(BLE001)


547-547: Do not catch blind exception: Exception

(BLE001)


603-603: Consider moving this statement to an else block

(TRY300)


607-607: Do not catch blind exception: Exception

(BLE001)


611-611: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (17)
recipes/multimodal/server/__init__.py (1)

1-30: LGTM!

The module correctly re-exports the core backend interfaces from the .backends subpackage, providing a clean public API surface for the unified server package. The docstring accurately describes the supported backends.

recipes/multimodal/server/backends/__init__.py (1)

67-81: LGTM - Lazy loading pattern is appropriate.

The lazy loading implementation correctly defers importing backend modules until they're needed, which is ideal for optional dependencies. The error handling provides clear messages for debugging import issues.

nemo_skills/inference/server/serve_unified.py (3)

109-122: The safetensors patching approach is risky and may cause issues.

This function overwrites an installed package file, which:

  1. Could fail on read-only/containerized environments
  2. May cause unexpected behavior in other applications using safetensors
  3. Does not restore the original file

Consider documenting this behavior more prominently or using a safer patching mechanism (e.g., monkey-patching at runtime instead of file replacement).


365-393: LGTM - Server startup with proper error handling.

The import-time error handling correctly distinguishes between missing dependencies (ImportError) and runtime errors, providing helpful messages in both cases. The traceback printing for generic exceptions aids debugging.


125-177: LGTM - CLI argument structure is comprehensive.

The argument parser provides a complete interface for configuring the unified server with sensible defaults and clear help text. Backend-specific options are well-organized.

recipes/multimodal/server/unified_server.py (3)

77-92: LGTM - RequestBatcher design is sound.

The batcher correctly implements configurable batching with timeout and immediate processing modes. The use of asyncio.Lock for thread safety and Future for result propagation is appropriate.
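The batching-with-timeout design referenced here can be illustrated with a minimal standalone sketch (assumed names and signatures, not the PR's actual RequestBatcher; note it already stores task references, which the actual code is flagged for below):

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class PendingRequest:
    request: Any
    future: asyncio.Future
    timestamp: float


class RequestBatcher:
    """Collects requests into batches and resolves each caller's Future."""

    def __init__(self, generate_fn: Callable[[List[Any]], List[Any]],
                 batch_size: int = 4, batch_timeout: float = 0.05):
        self.generate_fn = generate_fn
        self.batch_size = batch_size
        self.batch_timeout = batch_timeout
        self.pending: List[PendingRequest] = []
        self.lock = asyncio.Lock()
        self._tasks: set = set()  # keep references so tasks are not GC'd

    def _spawn(self, coro) -> None:
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def add_request(self, request: Any) -> Any:
        future: asyncio.Future = asyncio.get_running_loop().create_future()
        async with self.lock:
            self.pending.append(PendingRequest(request, future, time.time()))
            if len(self.pending) >= self.batch_size:
                self._spawn(self._process_batch())  # batch full: flush now
            else:
                self._spawn(self._timeout_handler())  # otherwise flush on timeout
        return await future

    async def _timeout_handler(self) -> None:
        await asyncio.sleep(self.batch_timeout)
        await self._process_batch()

    async def _process_batch(self) -> None:
        async with self.lock:
            batch, self.pending = self.pending, []
        if not batch:
            return
        results = self.generate_fn([p.request for p in batch])
        # strict=True surfaces backend bugs that return a wrong-length list
        for pending, result in zip(batch, results, strict=True):
            if not pending.future.done():
                pending.future.set_result(result)
```

Each caller awaits its own Future, so concurrent HTTP handlers transparently share a single backend `generate()` call per batch.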


177-231: LGTM - Message parsing utilities are well-implemented.

The extraction functions correctly handle OpenAI-format messages with both string and list content types. The audio URL parsing with regex is appropriate for data URLs.
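A regex-based data-URL extractor of the kind described might look like this sketch. The exact message schema (the `input_audio` part type and nested `data` key) is an assumption about the OpenAI-style format, not taken from the PR:

```python
import base64
import re
from typing import List, Optional, Tuple

# Matches data URLs like "data:audio/wav;base64,<payload>".
_DATA_URL_RE = re.compile(r"^data:(audio/[\w.+-]+);base64,(.+)$", re.DOTALL)


def extract_audio_from_messages(messages: List[dict]) -> Optional[Tuple[str, bytes]]:
    """Return (mime_type, raw_bytes) for the first audio part found, else None."""
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):  # plain string content carries no audio
            continue
        for part in content:
            if part.get("type") == "input_audio":
                match = _DATA_URL_RE.match(part["input_audio"]["data"])
                if match:
                    mime, payload = match.groups()
                    return mime, base64.b64decode(payload)
    return None
```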


272-318: LGTM - Startup routine is comprehensive.

The startup event correctly initializes the backend, creates the batcher, and optionally enables session management. The configuration logging is helpful for debugging.

recipes/multimodal/server/session_manager.py (4)

31-75: LGTM - Data structures are well-designed.

TurnData and SessionState provide comprehensive state tracking for multi-turn conversations. The use of field(default_factory=...) for mutable defaults is correct, and the touch() method provides clean timestamp updates.


77-121: LGTM - SessionManager initialization and create_session are well-implemented.

The manager correctly uses RLock for reentrant locking, implements TTL-based cleanup, and handles capacity eviction. The session creation flow is robust.
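The TTL-plus-eviction pattern praised here can be sketched in a few lines (a simplified illustration with assumed field names, not the PR's SessionManager, which also tracks LLM cache, audio/text buffers, and TurnData):

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SessionState:
    session_id: str
    created_at: float = field(default_factory=time.time)
    last_accessed: float = field(default_factory=time.time)
    turn_history: List[dict] = field(default_factory=list)

    def touch(self) -> None:
        self.last_accessed = time.time()


class SessionManager:
    def __init__(self, ttl: float = 300.0, max_sessions: int = 100):
        self.ttl = ttl
        self.max_sessions = max_sessions
        self._sessions: Dict[str, SessionState] = {}
        # RLock: public methods can call each other without deadlocking
        self._lock = threading.RLock()

    def get_or_create_session(self, session_id: str) -> SessionState:
        with self._lock:
            self._cleanup_expired()
            state = self._sessions.get(session_id)
            if state is None:
                if len(self._sessions) >= self.max_sessions:
                    # Capacity eviction: drop the least recently used session.
                    oldest = min(self._sessions.values(), key=lambda s: s.last_accessed)
                    del self._sessions[oldest.session_id]
                state = SessionState(session_id=session_id)
                self._sessions[session_id] = state
            state.touch()  # accessing a session extends its lifetime
            return state

    def _cleanup_expired(self) -> None:
        now = time.time()
        for sid in list(self._sessions):  # snapshot: safe to delete while iterating
            if now - self._sessions[sid].last_accessed > self.ttl:
                del self._sessions[sid]
```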


123-162: LGTM - Session retrieval with TTL enforcement.

get_session correctly checks expiry before returning, and get_or_create_session provides a convenient idempotent accessor. The touch() call on access properly extends the session lifetime.


224-249: LGTM - Cleanup and eviction logic is correct.

The expired session cleanup iterates over a snapshot of session IDs, and the eviction correctly finds the oldest by last_accessed. Using min() with a key function is clean.

recipes/multimodal/server/backends/base.py (5)

1-26: LGTM!

License header, module docstring, and imports are well-structured and appropriate for an abstract base interface module.


28-33: LGTM!

Good use of (str, Enum) pattern for serialization compatibility with JSON/OpenAPI.


36-59: LGTM!

Well-designed config class with proper handling of mutable defaults and flexible from_dict factory that cleanly separates known fields from extra configuration.


62-96: LGTM!

Comprehensive request structure supporting multiple input modalities and generation parameter overrides. Mutable default for extra_params is correctly handled with field(default_factory=dict).


98-127: LGTM!

Clean result structure with proper metadata tracking and simple error handling pattern.

Comment on lines +53 to +66
def get_backend(backend_name: str) -> type:
"""
Get backend class by name with lazy loading.

Args:
backend_name: One of 'salm', 'tts', 's2s'

Returns:
Backend class (not instance)

Raises:
ValueError: If backend name is unknown
ImportError: If backend dependencies are not available
"""
Contributor

⚠️ Potential issue | 🟡 Minor

Update docstring to reflect all available backends.

The docstring says backend_name: One of 'salm', 'tts', 's2s' but the registry includes 5 backends: salm, tts, s2s, s2s_incremental, and s2s_session.

🔎 Proposed fix
     """
     Get backend class by name with lazy loading.

     Args:
-        backend_name: One of 'salm', 'tts', 's2s'
+        backend_name: One of 'salm', 'tts', 's2s', 's2s_incremental', 's2s_session'

     Returns:
         Backend class (not instance)

     Raises:
         ValueError: If backend name is unknown
         ImportError: If backend dependencies are not available
     """

Comment on lines +221 to +230
def get_generation_params(self, request: GenerationRequest) -> Dict[str, Any]:
"""
Get effective generation parameters, merging request with config defaults.
"""
return {
"max_new_tokens": request.max_new_tokens or self.config.max_new_tokens,
"temperature": request.temperature or self.config.temperature,
"top_p": request.top_p or self.config.top_p,
"top_k": request.top_k or self.config.top_k,
}
Contributor

@coderabbitai coderabbitai bot Dec 20, 2025


⚠️ Potential issue | 🔴 Critical

Falsy value bug: temperature=0.0 and top_p=0.0 are silently overwritten.

Using or to merge request overrides with config defaults treats 0.0 as falsy. If a caller sets temperature=0.0 for greedy decoding, the config default will be used instead.
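The pitfall is easy to reproduce in isolation (a minimal sketch with hypothetical names, not the PR's actual classes):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    temperature: float = 0.7  # server-side default


@dataclass
class Request:
    temperature: Optional[float] = None  # None means "use the default"


def merge_buggy(req: Request, cfg: Config) -> float:
    return req.temperature or cfg.temperature  # 0.0 is falsy, so default wins


def merge_fixed(req: Request, cfg: Config) -> float:
    return req.temperature if req.temperature is not None else cfg.temperature


greedy = Request(temperature=0.0)  # caller asks for greedy decoding
print(merge_buggy(greedy, Config()))  # 0.7 -- caller's 0.0 silently dropped
print(merge_fixed(greedy, Config()))  # 0.0 -- preserved
```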

🔎 Proposed fix using explicit None checks
     def get_generation_params(self, request: GenerationRequest) -> Dict[str, Any]:
         """
         Get effective generation parameters, merging request with config defaults.
         """
         return {
-            "max_new_tokens": request.max_new_tokens or self.config.max_new_tokens,
-            "temperature": request.temperature or self.config.temperature,
-            "top_p": request.top_p or self.config.top_p,
-            "top_k": request.top_k or self.config.top_k,
+            "max_new_tokens": request.max_new_tokens if request.max_new_tokens is not None else self.config.max_new_tokens,
+            "temperature": request.temperature if request.temperature is not None else self.config.temperature,
+            "top_p": request.top_p if request.top_p is not None else self.config.top_p,
+            "top_k": request.top_k if request.top_k is not None else self.config.top_k,
         }

Collaborator

feels like something we should fix @vmendelev

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +241 to +242
has_text_input = request.text is not None
has_audio_input = request.audio_bytes is not None or request.audio_path is not None
Contributor

⚠️ Potential issue | 🟡 Minor

Validation misses multi-turn audio inputs.

has_audio_input only checks audio_bytes and audio_path, but GenerationRequest also supports audio_bytes_list and audio_paths for multi-turn scenarios. Requests with only multi-turn audio would incorrectly fail validation.

🔎 Proposed fix to include multi-turn audio fields
         has_text_input = request.text is not None
-        has_audio_input = request.audio_bytes is not None or request.audio_path is not None
+        has_audio_input = (
+            request.audio_bytes is not None
+            or request.audio_path is not None
+            or request.audio_bytes_list is not None
+            or request.audio_paths is not None
+        )

Comment on lines +214 to +217
"has_llm_cache": state.llm_cache is not None,
"has_input_embeds_history": state.input_embeds_history is not None
and len(state.input_embeds_history) > 0,
}
Contributor

⚠️ Potential issue | 🟡 Minor

Potential AttributeError if input_embeds_history is not a list.

The code calls len(state.input_embeds_history) but input_embeds_history is typed as Any, which could be any type (e.g., a tensor, None after initial check, etc.). If it's a type without __len__, this will raise an AttributeError.

🔎 Proposed fix
             return {
                 "session_id": state.session_id,
                 "frame_idx": state.frame_idx,
                 "turn_count": state.turn_count,
                 "created_at": state.created_at,
                 "last_accessed": state.last_accessed,
                 "has_llm_cache": state.llm_cache is not None,
-                "has_input_embeds_history": state.input_embeds_history is not None
-                and len(state.input_embeds_history) > 0,
+                "has_input_embeds_history": state.input_embeds_history is not None
+                and hasattr(state.input_embeds_history, "__len__")
+                and len(state.input_embeds_history) > 0,
             }

Comment on lines +105 to +111
asyncio.create_task(self._process_batch())
elif self.batch_timeout == 0:
# No delay mode
asyncio.create_task(self._process_batch())
elif self.timeout_task is None or self.timeout_task.done():
# Schedule timeout
self.timeout_task = asyncio.create_task(self._timeout_handler())
Contributor

⚠️ Potential issue | 🟠 Major

Store asyncio.create_task references to prevent garbage collection.

Tasks created with asyncio.create_task() without storing the reference can be garbage collected before completion. This could cause batch processing to be interrupted unexpectedly.

🔎 Proposed fix

Consider tracking background tasks in a set:

+    # Track background tasks
+    self._background_tasks: set = set()
+
     async def add_request(self, request: GenerationRequest) -> GenerationResult:
         """Add a request and wait for result."""
         future = asyncio.Future()
         pending = PendingRequest(request=request, future=future, timestamp=time.time())

         async with self.lock:
             self.pending_requests.append(pending)

             # Check if we should process immediately
             if len(self.pending_requests) >= self.batch_size:
                 if DEBUG:
                     print(f"[Batcher] Batch full ({self.batch_size}), processing immediately")
-                asyncio.create_task(self._process_batch())
+                task = asyncio.create_task(self._process_batch())
+                self._background_tasks.add(task)
+                task.add_done_callback(self._background_tasks.discard)
             elif self.batch_timeout == 0:
                 # No delay mode
-                asyncio.create_task(self._process_batch())
+                task = asyncio.create_task(self._process_batch())
+                self._background_tasks.add(task)
+                task.add_done_callback(self._background_tasks.discard)
             elif self.timeout_task is None or self.timeout_task.done():
                 # Schedule timeout
                 self.timeout_task = asyncio.create_task(self._timeout_handler())

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In recipes/multimodal/server/unified_server.py around lines 105 to 111, tasks
created via asyncio.create_task() are not stored and can be garbage-collected;
modify the code to keep references (e.g., maintain a self._background_tasks
set), add each created task to that set, attach a done callback to remove the
task when finished (and handle exceptions in the callback), and ensure tasks are
cancelled/awaited on shutdown to avoid leaks or premature cancellation.

Comment on lines +242 to +250
extra_config: Dict[str, Any] = None,
) -> FastAPI:
"""Create and configure the FastAPI app."""
global backend_instance, request_batcher, session_manager, server_config

# Extract server-level config from extra_config
ignore_system_prompt = extra_config.pop("ignore_system_prompt", False) if extra_config else False
session_ttl = extra_config.pop("session_ttl", 300.0) if extra_config else 300.0
max_sessions = extra_config.pop("max_sessions", 100) if extra_config else 100
⚠️ Potential issue | 🟠 Major

Avoid mutating the caller's dict with pop() and fix implicit Optional.

The extra_config.pop() calls mutate the dictionary passed by the caller, which could cause unexpected behavior. Also, the type hint should use Optional[Dict[str, Any]] instead of Dict[str, Any] = None.

🔎 Proposed fix
 def create_app(
     backend_type: str = BACKEND_TYPE,
     model_path: str = MODEL_PATH,
     codec_model_path: str = CODEC_MODEL_PATH,
     batch_size: int = BATCH_SIZE,
     batch_timeout: float = BATCH_TIMEOUT,
     device: str = "cuda",
     dtype: str = "bfloat16",
-    extra_config: Dict[str, Any] = None,
+    extra_config: Optional[Dict[str, Any]] = None,
 ) -> FastAPI:
     """Create and configure the FastAPI app."""
     global backend_instance, request_batcher, session_manager, server_config

+    # Make a copy to avoid mutating the caller's dict
+    extra_config = dict(extra_config) if extra_config else {}
+
     # Extract server-level config from extra_config
-    ignore_system_prompt = extra_config.pop("ignore_system_prompt", False) if extra_config else False
-    session_ttl = extra_config.pop("session_ttl", 300.0) if extra_config else 300.0
-    max_sessions = extra_config.pop("max_sessions", 100) if extra_config else 100
+    ignore_system_prompt = extra_config.pop("ignore_system_prompt", False)
+    session_ttl = extra_config.pop("session_ttl", 300.0)
+    max_sessions = extra_config.pop("max_sessions", 100)
🧰 Tools
🪛 Ruff (0.14.8)

242-242: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

🤖 Prompt for AI Agents
In recipes/multimodal/server/unified_server.py around lines 242 to 250, the
function signature and use of extra_config are wrong: the type hint should be
Optional[Dict[str, Any]] instead of Dict[str, Any] = None, and using
extra_config.pop(...) mutates the caller's dict. Fix by changing the parameter
type to Optional[Dict[str, Any]] and safely read values without mutating the
original (e.g., if extra_config is None set a local empty dict, or make a
shallow copy and use .get(...) for ignore_system_prompt, session_ttl and
max_sessions, or copy = dict(extra_config) then pop from the copy), preserving
defaults if keys are missing.
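The non-mutating variant is easy to verify in isolation; here is a sketch under the assumption that the three keys and their defaults are exactly as shown in the diff:

```python
from typing import Any, Dict, Optional, Tuple

def extract_server_config(
    extra_config: Optional[Dict[str, Any]] = None,
) -> Tuple[bool, float, int, Dict[str, Any]]:
    """Copy before popping so the caller's dict is left untouched;
    leftover keys remain available for backend-specific config."""
    cfg = dict(extra_config) if extra_config else {}
    ignore_system_prompt = cfg.pop("ignore_system_prompt", False)
    session_ttl = cfg.pop("session_ttl", 300.0)
    max_sessions = cfg.pop("max_sessions", 100)
    return ignore_system_prompt, session_ttl, max_sessions, cfg

caller_cfg = {"session_ttl": 60.0, "backend_opt": "x"}
_, ttl, _, leftover = extract_server_config(caller_cfg)
print(ttl, caller_cfg, leftover)  # → 60.0, caller dict intact, {'backend_opt': 'x'}
```

The shallow copy is enough here because the popped values are scalars; nested dicts shared with the caller would still be aliased.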

Comment on lines +511 to +513
save_dir = os.environ.get(
"AUDIO_SAVE_DIR", "/lustre/fsw/portfolios/llmservice/users/vmendelev/tmp/voicebench_test"
)
⚠️ Potential issue | 🟠 Major

Remove hardcoded user-specific path from default.

The default save directory contains a user-specific path (/lustre/fsw/portfolios/llmservice/users/vmendelev/tmp/voicebench_test). This should use a more generic default like /tmp/unified_server_outputs or require explicit configuration.

🔎 Proposed fix
             save_dir = os.environ.get(
-                "AUDIO_SAVE_DIR", "/lustre/fsw/portfolios/llmservice/users/vmendelev/tmp/voicebench_test"
+                "AUDIO_SAVE_DIR", "/tmp/unified_server_outputs"
             )
🤖 Prompt for AI Agents
In recipes/multimodal/server/unified_server.py around lines 511 to 513, the
default AUDIO_SAVE_DIR uses a hardcoded user-specific path; change the default
to a generic location (e.g., "/tmp/unified_server_outputs") or require callers
to set AUDIO_SAVE_DIR by raising a clear error if unset; update the
os.environ.get call to use the generic path (or replace with
os.environ["AUDIO_SAVE_DIR"] and add a clear startup-check that logs/raises if
not provided) and ensure any tests or docs reflecting the old path are updated
accordingly.
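For the stricter alternative the prompt mentions (fail fast when the variable is unset), a sketch follows; the helper name and error message are illustrative:

```python
import os

def resolve_audio_save_dir() -> str:
    """Require AUDIO_SAVE_DIR explicitly instead of baking in a
    user-specific default path."""
    save_dir = os.environ.get("AUDIO_SAVE_DIR")
    if not save_dir:
        raise RuntimeError(
            "AUDIO_SAVE_DIR is not set; export it before starting the server"
        )
    os.makedirs(save_dir, exist_ok=True)
    return save_dir
```

Raising at startup makes misconfiguration visible immediately, rather than silently scattering audio artifacts into a default directory that may not exist on other clusters.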

Collaborator

@Kipok Kipok left a comment
This looks good to me. I don't think there is any other way to support this, unfortunately. I didn't review the low-level details, but as long as this works for you, I think we can merge.

The only thing I'd suggest: let's maybe rename serve_unified to something more explicit that makes it clear this is for audio models.

Comment on lines +221 to +230
def get_generation_params(self, request: GenerationRequest) -> Dict[str, Any]:
"""
Get effective generation parameters, merging request with config defaults.
"""
return {
"max_new_tokens": request.max_new_tokens or self.config.max_new_tokens,
"temperature": request.temperature or self.config.temperature,
"top_p": request.top_p or self.config.top_p,
"top_k": request.top_k or self.config.top_k,
}
Collaborator
feels like something we should fix @vmendelev
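A likely concern with this merge logic: `or` treats explicitly passed falsy values as missing, so a request with temperature=0.0 (greedy decoding) or top_k=0 silently falls back to the config default. A minimal sketch contrasting the two styles (field names mirror the snippet above; the config value is illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GenerationRequest:
    temperature: Optional[float] = None

CONFIG_TEMPERATURE = 0.7  # stand-in for self.config.temperature

def merge_with_or(request: GenerationRequest) -> float:
    # Current behavior: 0.0 is falsy, so it is replaced by the default.
    return request.temperature or CONFIG_TEMPERATURE

def merge_with_is_none(request: GenerationRequest) -> float:
    # Explicit None check preserves a deliberate temperature of 0.0.
    return request.temperature if request.temperature is not None else CONFIG_TEMPERATURE

greedy = GenerationRequest(temperature=0.0)
print(merge_with_or(greedy), merge_with_is_none(greedy))  # → 0.7 0.0
```

The same `is None` treatment would apply to top_p and top_k, since 0 is a plausible explicit value for top_k as well.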

@vmendelev
Collaborator Author

Closed in favor of #1250

@vmendelev vmendelev closed this Feb 27, 2026

