move audio processing into model #1137

Merged
karpnv merged 4 commits into audio_bin from georgea/update-audio_bin
Dec 23, 2025

Conversation


@gwarmstrong gwarmstrong commented Dec 19, 2025

Key Components

1. AudioProcessorConfig (nested dataclass)

@nested_dataclass(kw_only=True)
class AudioProcessorConfig:
    data_dir: str = ""              # Base directory for audio files
    enable_chunking: bool = True    # Whether to chunk long audio
    chunk_task_types: list[str] | None = None
    chunk_threshold_sec: int = 30   # Chunk audio longer than this

2. AudioProcessor (wrapper class)

class AudioProcessor:
    def __init__(self, model, config, eval_config=None, eval_type=None):
        self.model = model
        # Resolves data_dir from config or eval_config
        ...
    
    async def generate_async(self, prompt, task_type=None, **kwargs):
        if isinstance(prompt, list):
            # Check if chunking needed
            # Convert audio files to base64
            # Handle chunking if audio is long
        return await self.model.generate_async(prompt=prompt, **kwargs)

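The comments in `generate_async` summarize the flow; a self-contained sketch of the two audio steps (base64 conversion and threshold-based chunking) might look like the following. Helper names here are hypothetical, not taken from the PR:

```python
import base64

def encode_audio(path: str) -> str:
    """Read an audio file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def split_into_chunks(duration_sec: float, threshold_sec: int) -> list[tuple[float, float]]:
    """Split a long recording into (start, end) windows no longer than threshold_sec."""
    chunks, start = [], 0.0
    while start < duration_sec:
        end = min(start + threshold_sec, duration_sec)
        chunks.append((start, end))
        start = end
    return chunks
```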
3. Integration in generate.py

# In GenerateSolutionsConfig:
audio: AudioProcessorConfig | None = None  # Nested config, None = disabled

# In setup_llm():
if self.cfg.audio is not None:
    audio_supported_servers = {"vllm"}
    if server_type not in audio_supported_servers:
        raise ValueError(f"Audio not supported for {server_type}")
    llm = AudioProcessor(llm, self.cfg.audio, eval_config=..., eval_type=...)

Benefits

1. Clean Separation of Concerns

  • VLLMModel: ~150 lines, pure VLLM API client
  • AudioProcessor: ~350 lines, all audio logic in one place

2. Composability

Audio processing wraps the base model, so it works with any underlying model:

# Works with VLLM
llm = AudioProcessor(VLLMModel(...), config)

# Could work with our tool calling modules
llm = AudioProcessor(ToolCallingWrapper(VLLMModel(...)), config)

Signed-off-by: George Armstrong <georgea@nvidia.com>
@Jorjeous
Member

Probably should set the default chunking threshold to 60s, or decide automatically based on the model being used.
For Megatron it's 60s, for Qwen it's 30s.
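A minimal sketch of that suggestion, picking the threshold from the model name; the helper and the threshold table are hypothetical, not part of the PR:

```python
# Per-model defaults, as suggested in the review comment.
DEFAULT_THRESHOLDS_SEC = {"megatron": 60, "qwen": 30}

def chunk_threshold_for(model_name: str, fallback: int = 30) -> int:
    """Return a chunk threshold based on the model name, else the fallback."""
    name = model_name.lower()
    for key, threshold in DEFAULT_THRESHOLDS_SEC.items():
        if key in name:
            return threshold
    return fallback
```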


@karpnv karpnv left a comment


LGTM

@karpnv karpnv merged commit a88736e into audio_bin Dec 23, 2025
2 checks passed
@karpnv karpnv deleted the georgea/update-audio_bin branch December 23, 2025 00:41