Unified server to work with models not supported by vLLM#1138
Conversation
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Force-pushed c98a94a to 606eb81
📝 Walkthrough

This PR introduces a new unified inference server for NeMo supporting multiple backends (SALM, TTS, S2S) with an OpenAI-compatible REST API. It includes a CLI launcher, a backend abstraction layer, session management for multi-turn interactions, request batching, and a FastAPI-based server implementation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as External Client
    participant CLI as CLI<br/>(serve_unified.py)
    participant App as FastAPI App<br/>(unified_server.py)
    participant Batcher as RequestBatcher
    participant Backend as Backend<br/>(SALM/TTS/S2S)
    participant Session as SessionManager<br/>(S2S only)

    Note over CLI: Startup Phase
    Client->>CLI: python serve_unified.py --backend salm ...
    CLI->>CLI: setup_pythonpath()
    CLI->>CLI: apply_safetensors_patch()
    CLI->>CLI: Populate env vars & build extra_config
    CLI->>App: create_app(backend_type, model_path, ...)
    App->>Backend: __init__(config)
    App->>Backend: load_model()
    App->>App: Initialize RequestBatcher
    alt S2S with sessions
        App->>Session: Initialize SessionManager
    end
    App-->>CLI: FastAPI app ready
    CLI->>CLI: uvicorn.run(app)

    Note over Client,Session: Request Phase
    Client->>App: POST /v1/chat/completions<br/>(messages with text/audio)
    App->>App: extract_text_from_messages()
    App->>App: extract_audio_from_messages()
    App->>App: extract_system_prompt()
    App->>App: Create GenerationRequest
    alt S2S with session
        App->>Session: get_or_create_session(session_id)
        Session-->>App: SessionState
        App->>App: Populate session_id in request
    end
    App->>Batcher: add_request(GenerationRequest)
    alt Batch full or immediate
        Batcher->>Batcher: _process_batch()
        Batcher->>Backend: generate([requests])
        Backend-->>Batcher: [GenerationResults]
    else Batch timeout
        Batcher->>Batcher: _timeout_handler()
        Batcher->>Batcher: _process_batch()
        Batcher->>Backend: generate([requests])
        Backend-->>Batcher: [GenerationResults]
    end
    Batcher-->>App: GenerationResult
    alt S2S with session
        App->>Session: save_session(session_id, updated_state)
    end
    App->>App: Save outputs (JSON/WAV) to disk
    App-->>Client: OpenAI-compatible response<br/>(text + optional audio)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–75 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 7
🧹 Nitpick comments (5)
recipes/multimodal/server/__init__.py (1)

32-38: Consider sorting `__all__` alphabetically. Static analysis flagged that `__all__` is not sorted. This is a minor style preference.

🔎 Proposed fix

```diff
 __all__ = [
-    "InferenceBackend",
+    "BackendConfig",
     "GenerationRequest",
     "GenerationResult",
-    "BackendConfig",
     "get_backend",
+    "InferenceBackend",
 ]
```

recipes/multimodal/server/backends/__init__.py (1)
28-36: Consider sorting `__all__` alphabetically. Similar to the parent `__init__.py`, this is a minor style preference flagged by static analysis.

🔎 Proposed fix

```diff
 __all__ = [
-    "InferenceBackend",
+    "BackendConfig",
     "GenerationRequest",
     "GenerationResult",
-    "BackendConfig",
+    "get_backend",
+    "InferenceBackend",
+    "list_backends",
     "Modality",
-    "get_backend",
-    "list_backends",
 ]
```

nemo_skills/inference/server/serve_unified.py (1)
258-259: Prefix unused variable with underscore. `extra_args` is captured but never used. Prefix it with an underscore to indicate it's intentionally unused.

🔎 Proposed fix

```diff
 # Parse known args, allowing extra args to be passed through
-args, extra_args = parser.parse_known_args()
+args, _extra_args = parser.parse_known_args()
```

recipes/multimodal/server/unified_server.py (2)
605-611: Use `raise ... from e` for exception chaining. When re-raising an exception, use `from e` to preserve the exception chain for better debugging.

🔎 Proposed fix

```diff
 except HTTPException:
     raise
 except Exception as e:
     import traceback

     traceback.print_exc()
-    raise HTTPException(status_code=500, detail=str(e))
+    raise HTTPException(status_code=500, detail=str(e)) from e
```
144-148: Add `strict=True` to zip to catch mismatched lengths. If `batch` and `results` have different lengths due to a backend bug, this would silently drop items. Using `strict=True` (Python 3.10+) raises an error instead.

```diff
 # Complete futures
-for pending, result in zip(batch, results):
+for pending, result in zip(batch, results, strict=True):
     if not pending.future.done():
         pending.future.set_result(result)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- nemo_skills/inference/server/serve_unified.py (1 hunks)
- recipes/multimodal/server/__init__.py (1 hunks)
- recipes/multimodal/server/backends/__init__.py (1 hunks)
- recipes/multimodal/server/backends/base.py (1 hunks)
- recipes/multimodal/server/session_manager.py (1 hunks)
- recipes/multimodal/server/unified_server.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
recipes/multimodal/server/backends/__init__.py (1)
recipes/multimodal/server/backends/base.py (5)
`BackendConfig` (37-59), `GenerationRequest` (63-95), `GenerationResult` (99-126), `InferenceBackend` (129-251), `Modality` (28-33)
recipes/multimodal/server/session_manager.py (2)
recipes/multimodal/server/unified_server.py (3)
`get_session` (346-353), `delete_session` (356-380), `list_sessions` (335-343)

nemo_skills/inference/chat_interface/core.py (1)

`get` (136-151)
recipes/multimodal/server/__init__.py (1)
recipes/multimodal/server/backends/base.py (4)
`BackendConfig` (37-59), `GenerationRequest` (63-95), `GenerationResult` (99-126), `InferenceBackend` (129-251)
🪛 Ruff (0.14.8)
recipes/multimodal/server/backends/__init__.py
28-36: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
69-69: Avoid specifying long messages outside the exception class
(TRY003)
79-81: Avoid specifying long messages outside the exception class
(TRY003)
recipes/multimodal/server/__init__.py
32-38: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
nemo_skills/inference/server/serve_unified.py
121-121: Do not catch blind exception: Exception
(BLE001)
149-149: Possible binding to all interfaces
(S104)
259-259: Unpacked variable extra_args is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
388-388: Do not catch blind exception: Exception
(BLE001)
recipes/multimodal/server/unified_server.py
49-49: Possible binding to all interfaces
(S104)
105-105: Store a reference to the return value of asyncio.create_task
(RUF006)
108-108: Store a reference to the return value of asyncio.create_task
(RUF006)
122-122: Store a reference to the return value of asyncio.create_task
(RUF006)
146-146: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
154-154: Do not catch blind exception: Exception
(BLE001)
165-165: Store a reference to the return value of asyncio.create_task
(RUF006)
242-242: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
371-371: Do not catch blind exception: Exception
(BLE001)
434-434: Abstract raise to an inner function
(TRY301)
466-466: Probable use of insecure hash functions in hashlib: md5
(S324)
472-472: Abstract raise to an inner function
(TRY301)
498-498: Abstract raise to an inner function
(TRY301)
501-501: Probable use of insecure hash functions in hashlib: md5
(S324)
535-535: Do not catch blind exception: Exception
(BLE001)
547-547: Do not catch blind exception: Exception
(BLE001)
603-603: Consider moving this statement to an else block
(TRY300)
607-607: Do not catch blind exception: Exception
(BLE001)
611-611: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🔇 Additional comments (17)
recipes/multimodal/server/__init__.py (1)
1-30: LGTM! The module correctly re-exports the core backend interfaces from the `.backends` subpackage, providing a clean public API surface for the unified server package. The docstring accurately describes the supported backends.

recipes/multimodal/server/backends/__init__.py (1)
67-81: LGTM - Lazy loading pattern is appropriate. The lazy loading implementation correctly defers importing backend modules until they're needed, which is ideal for optional dependencies. The error handling provides clear messages for debugging import issues.
nemo_skills/inference/server/serve_unified.py (3)
109-122: The safetensors patching approach is risky and may cause issues. This function overwrites an installed package file, which:
- Could fail on read-only/containerized environments
- May cause unexpected behavior in other applications using safetensors
- Does not restore the original file
Consider documenting this behavior more prominently or using a safer patching mechanism (e.g., monkey-patching at runtime instead of file replacement).
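The runtime monkey-patching the review suggests could look roughly like the sketch below. This is only an illustration of the pattern: `target_module` and `load_file` are stand-in names, since identifying the actual safetensors attribute to patch is a separate task.

```python
import types

# Stand-in for the real module that would normally be imported;
# everything here is hypothetical scaffolding for the pattern.
target_module = types.ModuleType("target_module")
target_module.load_file = lambda path: f"original({path})"

# Keep a reference to the original so the patch can delegate to it.
_original = target_module.load_file

def patched_load_file(path):
    # Apply the compatibility tweak, then delegate to the original.
    return f"patched({_original(path)})"

# The patch is in effect only for this process; the installed
# package file on disk is never modified.
target_module.load_file = patched_load_file

print(target_module.load_file("model.safetensors"))
```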
365-393: LGTM - Server startup with proper error handling. The import-time error handling correctly distinguishes between missing dependencies (`ImportError`) and runtime errors, providing helpful messages in both cases. The traceback printing for generic exceptions aids debugging.
125-177: LGTM - CLI argument structure is comprehensive. The argument parser provides a complete interface for configuring the unified server with sensible defaults and clear help text. Backend-specific options are well-organized.
recipes/multimodal/server/unified_server.py (3)
77-92: LGTM - RequestBatcher design is sound. The batcher correctly implements configurable batching with timeout and immediate processing modes. The use of `asyncio.Lock` for thread safety and `Future` for result propagation is appropriate.
177-231: LGTM - Message parsing utilities are well-implemented. The extraction functions correctly handle OpenAI-format messages with both string and list content types. The audio URL parsing with regex is appropriate for data URLs.
272-318: LGTM - Startup routine is comprehensive. The startup event correctly initializes the backend, creates the batcher, and optionally enables session management. The configuration logging is helpful for debugging.
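For reference, decoding an OpenAI-style audio data URL with a regex can be sketched as below. The regex, function name, and return shape are illustrative; the PR's actual parsing logic may differ.

```python
import base64
import re

# Hypothetical pattern for "data:audio/<format>;base64,<payload>" URLs.
DATA_URL_RE = re.compile(r"^data:audio/(?P<fmt>[\w.+-]+);base64,(?P<payload>.+)$")

def decode_audio_data_url(url: str) -> tuple[str, bytes]:
    match = DATA_URL_RE.match(url)
    if match is None:
        raise ValueError("not an audio data URL")
    return match.group("fmt"), base64.b64decode(match.group("payload"))

# Round-trip a tiny fake WAV header through the parser.
url = "data:audio/wav;base64," + base64.b64encode(b"RIFF").decode()
fmt, payload = decode_audio_data_url(url)
print(fmt, payload)  # wav b'RIFF'
```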
recipes/multimodal/server/session_manager.py (4)
31-75: LGTM - Data structures are well-designed. `TurnData` and `SessionState` provide comprehensive state tracking for multi-turn conversations. The use of `field(default_factory=...)` for mutable defaults is correct, and the `touch()` method provides clean timestamp updates.
77-121: LGTM - SessionManager initialization and create_session are well-implemented. The manager correctly uses `RLock` for reentrant locking, implements TTL-based cleanup, and handles capacity eviction. The session creation flow is robust.
123-162: LGTM - Session retrieval with TTL enforcement. `get_session` correctly checks expiry before returning, and `get_or_create_session` provides a convenient idempotent accessor. The `touch()` call on access properly extends the session lifetime.
224-249: LGTM - Cleanup and eviction logic is correct. The expired session cleanup iterates over a snapshot of session IDs, and the eviction correctly finds the oldest by `last_accessed`. Using `min()` with a key function is clean.
1-26: LGTM! License header, module docstring, and imports are well-structured and appropriate for an abstract base interface module.
28-33: LGTM! Good use of the `(str, Enum)` pattern for serialization compatibility with JSON/OpenAPI.
36-59: LGTM! Well-designed config class with proper handling of mutable defaults and a flexible `from_dict` factory that cleanly separates known fields from extra configuration.
62-96: LGTM! Comprehensive request structure supporting multiple input modalities and generation parameter overrides. The mutable default for `extra_params` is correctly handled with `field(default_factory=dict)`.
98-127: LGTM! Clean result structure with proper metadata tracking and a simple error handling pattern.
```python
def get_backend(backend_name: str) -> type:
    """
    Get backend class by name with lazy loading.

    Args:
        backend_name: One of 'salm', 'tts', 's2s'

    Returns:
        Backend class (not instance)

    Raises:
        ValueError: If backend name is unknown
        ImportError: If backend dependencies are not available
    """
```
Update docstring to reflect all available backends.
The docstring says `backend_name: One of 'salm', 'tts', 's2s'`, but the registry includes 5 backends: `salm`, `tts`, `s2s`, `s2s_incremental`, and `s2s_session`.
🔎 Proposed fix

```diff
 """
 Get backend class by name with lazy loading.

 Args:
-    backend_name: One of 'salm', 'tts', 's2s'
+    backend_name: One of 'salm', 'tts', 's2s', 's2s_incremental', 's2s_session'

 Returns:
     Backend class (not instance)

 Raises:
     ValueError: If backend name is unknown
     ImportError: If backend dependencies are not available
 """
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def get_backend(backend_name: str) -> type:
    """
    Get backend class by name with lazy loading.

    Args:
        backend_name: One of 'salm', 'tts', 's2s', 's2s_incremental', 's2s_session'

    Returns:
        Backend class (not instance)

    Raises:
        ValueError: If backend name is unknown
        ImportError: If backend dependencies are not available
    """
```
🤖 Prompt for AI Agents
In recipes/multimodal/server/backends/__init__.py around lines 53 to 66, the
docstring's parameter description lists only 'salm', 'tts', 's2s' but the
registry exposes five backends; update the docstring to enumerate all supported
backend names (e.g., 'salm', 'tts', 's2s', 's2s_incremental', 's2s_session') and
adjust any surrounding text so the Returns/Raises remain accurate.
```python
def get_generation_params(self, request: GenerationRequest) -> Dict[str, Any]:
    """
    Get effective generation parameters, merging request with config defaults.
    """
    return {
        "max_new_tokens": request.max_new_tokens or self.config.max_new_tokens,
        "temperature": request.temperature or self.config.temperature,
        "top_p": request.top_p or self.config.top_p,
        "top_k": request.top_k or self.config.top_k,
    }
```
Falsy value bug: temperature=0.0 and top_p=0.0 are silently overwritten.
Using `or` to merge request overrides with config defaults treats `0.0` as falsy. If a caller sets `temperature=0.0` for greedy decoding, the config default will be used instead.
🔎 Proposed fix using explicit None checks

```diff
 def get_generation_params(self, request: GenerationRequest) -> Dict[str, Any]:
     """
     Get effective generation parameters, merging request with config defaults.
     """
     return {
-        "max_new_tokens": request.max_new_tokens or self.config.max_new_tokens,
-        "temperature": request.temperature or self.config.temperature,
-        "top_p": request.top_p or self.config.top_p,
-        "top_k": request.top_k or self.config.top_k,
+        "max_new_tokens": request.max_new_tokens if request.max_new_tokens is not None else self.config.max_new_tokens,
+        "temperature": request.temperature if request.temperature is not None else self.config.temperature,
+        "top_p": request.top_p if request.top_p is not None else self.config.top_p,
+        "top_k": request.top_k if request.top_k is not None else self.config.top_k,
     }
```

📝 Committable suggestion
```python
def get_generation_params(self, request: GenerationRequest) -> Dict[str, Any]:
    """
    Get effective generation parameters, merging request with config defaults.
    """
    return {
        "max_new_tokens": request.max_new_tokens if request.max_new_tokens is not None else self.config.max_new_tokens,
        "temperature": request.temperature if request.temperature is not None else self.config.temperature,
        "top_p": request.top_p if request.top_p is not None else self.config.top_p,
        "top_k": request.top_k if request.top_k is not None else self.config.top_k,
    }
```
🤖 Prompt for AI Agents
In recipes/multimodal/server/backends/base.py around lines 221 to 230, the
current merge uses "or" which treats falsy values (e.g., temperature=0.0,
top_p=0.0) as missing and overwrites them with config defaults; change each
merge to an explicit None check (e.g., use request.temperature if
request.temperature is not None else self.config.temperature) for
max_new_tokens, temperature, top_p, and top_k so that legitimate zero or falsy
values from the request are preserved while only None falls back to config.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
```python
has_text_input = request.text is not None
has_audio_input = request.audio_bytes is not None or request.audio_path is not None
```
Validation misses multi-turn audio inputs.
`has_audio_input` only checks `audio_bytes` and `audio_path`, but `GenerationRequest` also supports `audio_bytes_list` and `audio_paths` for multi-turn scenarios. Requests with only multi-turn audio would incorrectly fail validation.
🔎 Proposed fix to include multi-turn audio fields

```diff
 has_text_input = request.text is not None
-has_audio_input = request.audio_bytes is not None or request.audio_path is not None
+has_audio_input = (
+    request.audio_bytes is not None
+    or request.audio_path is not None
+    or request.audio_bytes_list is not None
+    or request.audio_paths is not None
+)
```
+ )📝 Committable suggestion
```python
has_text_input = request.text is not None
has_audio_input = (
    request.audio_bytes is not None
    or request.audio_path is not None
    or request.audio_bytes_list is not None
    or request.audio_paths is not None
)
```
🤖 Prompt for AI Agents
In recipes/multimodal/server/backends/base.py around lines 241 to 242, the
validation flag has_audio_input currently only checks request.audio_bytes and
request.audio_path and therefore misses multi-turn fields; update the check to
also consider request.audio_bytes_list and request.audio_paths (and treat
non-empty lists as valid) so requests that supply only multi-turn audio pass
validation; ensure you account for both presence and non-empty lists when
evaluating has_audio_input.
```python
    "has_llm_cache": state.llm_cache is not None,
    "has_input_embeds_history": state.input_embeds_history is not None
    and len(state.input_embeds_history) > 0,
}
```
Potential `TypeError` if `input_embeds_history` is not a sized container.

The code calls `len(state.input_embeds_history)`, but `input_embeds_history` is typed as `Any`, so it could be any type (e.g., a tensor, or an object without `__len__`). If it doesn't support `len()`, this will raise a `TypeError`.
🔎 Proposed fix

```diff
 return {
     "session_id": state.session_id,
     "frame_idx": state.frame_idx,
     "turn_count": state.turn_count,
     "created_at": state.created_at,
     "last_accessed": state.last_accessed,
     "has_llm_cache": state.llm_cache is not None,
-    "has_input_embeds_history": state.input_embeds_history is not None
-    and len(state.input_embeds_history) > 0,
+    "has_input_embeds_history": state.input_embeds_history is not None
+    and hasattr(state.input_embeds_history, "__len__")
+    and len(state.input_embeds_history) > 0,
 }
```

📝 Committable suggestion
```python
return {
    "session_id": state.session_id,
    "frame_idx": state.frame_idx,
    "turn_count": state.turn_count,
    "created_at": state.created_at,
    "last_accessed": state.last_accessed,
    "has_llm_cache": state.llm_cache is not None,
    "has_input_embeds_history": state.input_embeds_history is not None
    and hasattr(state.input_embeds_history, "__len__")
    and len(state.input_embeds_history) > 0,
}
```
🤖 Prompt for AI Agents
In recipes/multimodal/server/session_manager.py around lines 214 to 217, the
boolean expression uses len(state.input_embeds_history) but input_embeds_history
is typed Any and may lack __len__, causing AttributeError; update the condition
to safely check for a non-empty sequence by first ensuring the attribute is not
None and supports len (e.g., hasattr(state.input_embeds_history, "__len__") or
isinstance(state.input_embeds_history, collections.abc.Sized)) before calling
len, or use a try/except around len to return False on TypeError/AttributeError
so the code only treats it as "has_input_embeds_history" when it is a
length-supporting container with length > 0.
```python
            asyncio.create_task(self._process_batch())
        elif self.batch_timeout == 0:
            # No delay mode
            asyncio.create_task(self._process_batch())
        elif self.timeout_task is None or self.timeout_task.done():
            # Schedule timeout
            self.timeout_task = asyncio.create_task(self._timeout_handler())
```
Store asyncio.create_task references to prevent garbage collection.
Tasks created with asyncio.create_task() without storing the reference can be garbage collected before completion. This could cause batch processing to be interrupted unexpectedly.
🔎 Proposed fix

Consider tracking background tasks in a set:

```diff
+# Track background tasks
+self._background_tasks: set = set()
+
 async def add_request(self, request: GenerationRequest) -> GenerationResult:
     """Add a request and wait for result."""
     future = asyncio.Future()
     pending = PendingRequest(request=request, future=future, timestamp=time.time())
     async with self.lock:
         self.pending_requests.append(pending)
         # Check if we should process immediately
         if len(self.pending_requests) >= self.batch_size:
             if DEBUG:
                 print(f"[Batcher] Batch full ({self.batch_size}), processing immediately")
-            asyncio.create_task(self._process_batch())
+            task = asyncio.create_task(self._process_batch())
+            self._background_tasks.add(task)
+            task.add_done_callback(self._background_tasks.discard)
         elif self.batch_timeout == 0:
             # No delay mode
-            asyncio.create_task(self._process_batch())
+            task = asyncio.create_task(self._process_batch())
+            self._background_tasks.add(task)
+            task.add_done_callback(self._background_tasks.discard)
         elif self.timeout_task is None or self.timeout_task.done():
             # Schedule timeout
             self.timeout_task = asyncio.create_task(self._timeout_handler())
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.8)
105-105: Store a reference to the return value of asyncio.create_task
(RUF006)
108-108: Store a reference to the return value of asyncio.create_task
(RUF006)
🤖 Prompt for AI Agents
In recipes/multimodal/server/unified_server.py around lines 105 to 111, tasks
created via asyncio.create_task() are not stored and can be garbage-collected;
modify the code to keep references (e.g., maintain a self._background_tasks
set), add each created task to that set, attach a done callback to remove the
task when finished (and handle exceptions in the callback), and ensure tasks are
cancelled/awaited on shutdown to avoid leaks or premature cancellation.
```python
    extra_config: Dict[str, Any] = None,
) -> FastAPI:
    """Create and configure the FastAPI app."""
    global backend_instance, request_batcher, session_manager, server_config

    # Extract server-level config from extra_config
    ignore_system_prompt = extra_config.pop("ignore_system_prompt", False) if extra_config else False
    session_ttl = extra_config.pop("session_ttl", 300.0) if extra_config else 300.0
    max_sessions = extra_config.pop("max_sessions", 100) if extra_config else 100
```
Avoid mutating the caller's dict with `pop()` and fix the implicit Optional.

The `extra_config.pop()` calls mutate the dictionary passed by the caller, which could cause unexpected behavior. Also, the type hint should be `Optional[Dict[str, Any]]` instead of `Dict[str, Any] = None`.
🔎 Proposed fix

```diff
 def create_app(
     backend_type: str = BACKEND_TYPE,
     model_path: str = MODEL_PATH,
     codec_model_path: str = CODEC_MODEL_PATH,
     batch_size: int = BATCH_SIZE,
     batch_timeout: float = BATCH_TIMEOUT,
     device: str = "cuda",
     dtype: str = "bfloat16",
-    extra_config: Dict[str, Any] = None,
+    extra_config: Optional[Dict[str, Any]] = None,
 ) -> FastAPI:
     """Create and configure the FastAPI app."""
     global backend_instance, request_batcher, session_manager, server_config

+    # Make a copy to avoid mutating the caller's dict
+    extra_config = dict(extra_config) if extra_config else {}
+
     # Extract server-level config from extra_config
-    ignore_system_prompt = extra_config.pop("ignore_system_prompt", False) if extra_config else False
-    session_ttl = extra_config.pop("session_ttl", 300.0) if extra_config else 300.0
-    max_sessions = extra_config.pop("max_sessions", 100) if extra_config else 100
+    ignore_system_prompt = extra_config.pop("ignore_system_prompt", False)
+    session_ttl = extra_config.pop("session_ttl", 300.0)
+    max_sessions = extra_config.pop("max_sessions", 100)
```
+ max_sessions = extra_config.pop("max_sessions", 100)📝 Committable suggestion
```python
    extra_config: Optional[Dict[str, Any]] = None,
) -> FastAPI:
    """Create and configure the FastAPI app."""
    global backend_instance, request_batcher, session_manager, server_config

    # Make a copy to avoid mutating the caller's dict
    extra_config = dict(extra_config) if extra_config else {}

    # Extract server-level config from extra_config
    ignore_system_prompt = extra_config.pop("ignore_system_prompt", False)
    session_ttl = extra_config.pop("session_ttl", 300.0)
    max_sessions = extra_config.pop("max_sessions", 100)
```
🧰 Tools
🪛 Ruff (0.14.8)
242-242: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
🤖 Prompt for AI Agents
In recipes/multimodal/server/unified_server.py around lines 242 to 250, the
function signature and use of extra_config are wrong: the type hint should be
Optional[Dict[str, Any]] instead of Dict[str, Any] = None, and using
extra_config.pop(...) mutates the caller's dict. Fix by changing the parameter
type to Optional[Dict[str, Any]] and safely read values without mutating the
original (e.g., if extra_config is None set a local empty dict, or make a
shallow copy and use .get(...) for ignore_system_prompt, session_ttl and
max_sessions, or copy = dict(extra_config) then pop from the copy), preserving
defaults if keys are missing.
```python
save_dir = os.environ.get(
    "AUDIO_SAVE_DIR", "/lustre/fsw/portfolios/llmservice/users/vmendelev/tmp/voicebench_test"
)
```
Remove hardcoded user-specific path from default.
The default save directory contains a user-specific path (`/lustre/fsw/portfolios/llmservice/users/vmendelev/tmp/voicebench_test`). This should use a more generic default like `/tmp/unified_server_outputs` or require explicit configuration.
🔎 Proposed fix

```diff
 save_dir = os.environ.get(
-    "AUDIO_SAVE_DIR", "/lustre/fsw/portfolios/llmservice/users/vmendelev/tmp/voicebench_test"
+    "AUDIO_SAVE_DIR", "/tmp/unified_server_outputs"
 )
```

📝 Committable suggestion
```python
save_dir = os.environ.get(
    "AUDIO_SAVE_DIR", "/tmp/unified_server_outputs"
)
```
🤖 Prompt for AI Agents
In recipes/multimodal/server/unified_server.py around lines 511 to 513, the
default AUDIO_SAVE_DIR uses a hardcoded user-specific path; change the default
to a generic location (e.g., "/tmp/unified_server_outputs") or require callers
to set AUDIO_SAVE_DIR by raising a clear error if unset; update the
os.environ.get call to use the generic path (or replace with
os.environ["AUDIO_SAVE_DIR"] and add a clear startup-check that logs/raises if
not provided) and ensure any tests or docs reflecting the old path are updated
accordingly.
Kipok left a comment:
this looks good to me, I don't think there is any other way to support this unfortunately. I didn't review low-level details, but as long as this works for you, I think we can merge.
The only thing I'd suggest is let's maybe rename serve_unified to something more explicit that makes it clear this is for audio models?
Closed in favor of #1250
This is a simple FastAPI server to which anybody can add a backend to comply with nemo-skills client-server approach. I already have some backends, and will push them later.
If somebody has a better idea how to make it easier to run nemo-skills with early research models not supported by vLLM or SGLang, please suggest.