Picking up session-based streaming input from vLLM #986
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb736f8e78
```python
# For streaming input, mark output as not finished while streaming
if isinstance(req_state, OmniRequestState) and req_state.streaming_input:
    ro.finished = False
```
Avoid marking final streaming outputs unfinished
The new streaming-input path forces ro.finished = False whenever req_state.streaming_input is true, but the cleanup branch for a finished request (lines 372–377) only clears input_chunk_queue and never flips streaming_input or calls _finish_request when there are no more chunks. This means the last chunk in a streaming-input session is still emitted as unfinished and the request state remains active, so non‑streaming OpenAI chat handling drops the output entirely (serving_chat.py filters out finished == False at lines 1336–1337). The result is a missing final response and a request-state leak for streaming-input sessions that end without additional queued chunks.
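A hedged sketch of the cleanup the review seems to be asking for, using only names visible in this diff and comment; the helper name is hypothetical, this is not the actual vllm-omni code, and the reply below argues the change is unnecessary:

```python
# Illustration only, not the actual vllm-omni code.
def _maybe_finish_streaming(self, req_state, ro) -> None:
    if isinstance(req_state, OmniRequestState) and req_state.streaming_input:
        if not req_state.input_chunk_queue:
            # No chunks left: stop forcing finished=False and release
            # the request state so the final output is emitted as finished.
            req_state.streaming_input = False
            self._finish_request(req_state)
            ro.finished = True
```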
I believe the concern is not valid; this change mirrors the one in vllm: this.
While a session is active, every output has finished=False, because additional input might still appear from the user. The actual final signal comes through STREAM_FINISHED being pushed to the queue when the generator closes.
Streaming input is an internal API that one can use via engine.generate(); it is not exposed through completions.
There is no request-state leak: _update_streaming_request_state handles finalization.
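To make that lifecycle concrete, here is a minimal caller-side sketch, assuming the engine.generate() streaming-input API used in the example below; handle() is a hypothetical downstream consumer:

```python
async def consume_session(engine, input_generator, sampling_params, request_id):
    async for out in engine.generate(
        input_generator, sampling_params=sampling_params, request_id=request_id
    ):
        # Mid-session, out.finished stays False: more user input may still
        # arrive through the input generator.
        handle(out)  # hypothetical downstream consumer
    # Once the input generator closes, STREAM_FINISHED is pushed to the queue
    # internally and _update_streaming_request_state finalizes the request,
    # so the request state is released without leaking.
```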
Streaming input does not work for prompt embeddings. Here is an example of iteratively providing input tokens to a text Llama model: we basically replace what was decoded with a different token and proceed with the next decoding step.

```python
from vllm.usage.usage_lib import UsageContext
from vllm_omni.engine.arg_utils import AsyncOmniEngineArgs
from vllm_omni.entrypoints.async_omni_llm import AsyncOmniLLM
# Model configuration
MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Create engine args - similar to vLLM's AsyncEngineArgs but with omni extensions
engine_args = AsyncOmniEngineArgs(
model=MODEL,
model_arch="LlamaForCausalLM", # Architecture for TinyLlama
model_stage="thinker",
stage_id=0,
engine_output_type="text",
gpu_memory_utilization=0.8,
max_model_len=512,
enforce_eager=False,
)
# Create vLLM config from engine args
usage_context = UsageContext.LLM_CLASS
vllm_config = engine_args.create_engine_config(usage_context=usage_context)
# Create AsyncOmniLLM using the factory method
engine = AsyncOmniLLM.from_vllm_config(
vllm_config=vllm_config,
engine_args=engine_args,
usage_context=usage_context,
)
```

```python
import asyncio
from collections.abc import AsyncGenerator
from vllm import SamplingParams
from vllm.v1.engine.async_llm import StreamingInput
from vllm.sampling_params import RequestOutputKind
# Get tokenizer for encoding/decoding
tokenizer = engine.tokenizer
# Initial prompt and injection phrase
initial_prompt = "what is your name."
inject_phrase = "i am james and i like to play football everyday"
prompt_tokens = tokenizer.encode(initial_prompt, add_special_tokens=False)
inject_tokens = tokenizer.encode(inject_phrase, add_special_tokens=False)
print(f"Initial prompt: '{initial_prompt}'")
print(f"Prompt tokens: {prompt_tokens}")
print(f"\nInjection phrase: '{inject_phrase}'")
print(f"Injection tokens ({len(inject_tokens)}): {inject_tokens}")
print("=" * 70)
# Sampling params for streaming input
# NOTE: output_kind must be DELTA (not FINAL_ONLY), stop strings not supported
sampling_params = SamplingParams(
max_tokens=1, # Generate 1 token per input chunk
temperature=0.7,
top_p=0.9,
ignore_eos=True,
output_kind=RequestOutputKind.DELTA,
)
session_id = "streaming_session"
# Queue for synchronizing: output consumer tells input producer when to send next chunk
next_input_queue: asyncio.Queue[int | None] = asyncio.Queue()
# Track state
kv_context = prompt_tokens.copy()
decoded_tokens: list[int] = []
async def streaming_input_generator() -> AsyncGenerator[StreamingInput, None]:
"""Async generator that yields input chunks.
This generator:
1. Sends the initial prompt
2. Waits for signal from output consumer
3. Sends injection tokens one at a time
4. Exits when all injection tokens are sent
"""
# Send initial prompt
yield StreamingInput(
prompt={"prompt_token_ids": prompt_tokens},
sampling_params=sampling_params,
)
# Wait for outputs and inject tokens
step = 0
while step < len(inject_tokens):
# Wait for signal from output consumer (contains the step number or None to stop)
signal = await next_input_queue.get()
if signal is None:
break
# Inject next token
inject_token = inject_tokens[step]
inject_text = tokenizer.decode([inject_token])
print(f" Injecting token {step}: '{inject_text}' (id={inject_token})")
kv_context.append(inject_token)
# Send the injection token as next input chunk
yield StreamingInput(
prompt={"prompt_token_ids": [inject_token]},
sampling_params=sampling_params,
)
step += 1
print("Input generator finished")
# Run the streaming session
step = 0
async for output in engine.generate(
streaming_input_generator(), # Pass the async generator directly!
sampling_params=sampling_params,
request_id=session_id,
):
# Process each output
if output.outputs and output.outputs[0].token_ids:
decoded_token = output.outputs[0].token_ids[-1]
decoded_text = tokenizer.decode([decoded_token])
decoded_tokens.append(decoded_token)
kv_context.append(decoded_token)
print(f"Step {step}: decoded '{decoded_text}' (id={decoded_token})")
# Tell input generator to send next token
if step < len(inject_tokens):
await next_input_queue.put(step)
step += 1
# Check if we should stop
if step >= len(inject_tokens):
# Let the input generator know we're done
await next_input_queue.put(None)
        break
```
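One note on the synchronization in this example: the output consumer and the input generator ping-pong through next_input_queue, and because max_tokens=1, each injected chunk drives exactly one decode step, so the session alternates one injected token with one decoded token until all injection tokens have been sent.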
… input requests Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Gaohan123 left a comment
Thanks for your contribution. Please refer to the AI review and my reviews. Besides, I have several questions and suggestions:
- Which version of vLLM does the PR work on?
- Could you please provide an example of validation?
- Please supplement some unit tests to protect the key methods you added.
Why did you remove this?
> which version of vLLM does the PR work on?

This PR is based directly on top of the change we want to take in, input streaming. It was merged to vllm main in 91601ff47810de38fa117e09e34c18504a919a68, and I run vllm-omni with vllm from that commit. If you'd like a specific version of vllm after that commit, let me know and I will move to it.

> Could you please provide an example of validation?

Added an e2e test; here is the basic flow: test

> Please supplement some Unit Tests for protecting key methods you added.

Added unit tests:
- test_omni_ar_scheduler_streaming.py
- test_output_processor_streaming.py
- test_omni_gpu_model_runner_streaming.py

More tests for this change are in vllm: vllm-project/vllm#28973
```python
from vllm.v1.engine.core_client import EngineCoreClient
from vllm.v1.executor.abstract import Executor
from vllm.v1.metrics.loggers import StatLoggerFactory, StatLoggerManager
from vllm.v1.executor import Executor
```
All modifications in the file seem to be related only to logging. Are they related to streaming input?
Not really. I'm trying to use a more recent vllm (with streaming input), which has changes to the logging. AsyncOmniLLM is based on top of AsyncLLM, which now has these new logger-related arguments (here).
Since we don't call the parent constructor, we need to replicate the changes in async_omni_llm as well.
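For context, a rough sketch of what that replication can look like, based only on the imports visible in the diff above; the helper name and the StatLoggerManager constructor arguments are assumptions, not the actual vllm-omni or vllm code:

```python
from vllm.v1.metrics.loggers import StatLoggerFactory, StatLoggerManager

# Hypothetical helper: AsyncOmniLLM does not call AsyncLLM.__init__, so the
# parent's new logger wiring has to be mirrored in its own constructor.
def _init_stat_loggers(self, vllm_config, log_stats: bool,
                       stat_loggers: list[StatLoggerFactory] | None) -> None:
    self.log_stats = log_stats
    self.logger_manager: StatLoggerManager | None = None
    if log_stats:
        # Assumed constructor arguments; check vllm.v1.metrics.loggers
        # for the exact signature in the targeted vllm commit.
        self.logger_manager = StatLoggerManager(
            vllm_config=vllm_config,
            custom_stat_loggers=stat_loggers,
        )
```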
@vklimkov-nvidia Hello, any updates?

Sorry for the delay on this! I got caught up with some updates to our fork (https://github.com/vklimkov-nvidia/vllm/tree/vklimkov/voicechat). I'm getting back into this PR now and should have an update for you shortly.
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Confirmed that since the vllm v0.16 support was merged to main, streaming input is already supported. Closing.

Hi @vklimkov-nvidia - I just had a look at the changes in your PR. It changes vllm-omni files. I am curious how upgrading to vllm 0.16 bypasses the need for your changes?

Have you tested locally?
Purpose
Introduce streaming input support.
Streaming input was recently introduced to vLLM main: vllm-project/vllm#28973.
It is an important feature for real-time processing, for example dialogue agents and simultaneous translation systems. The feature adds `StreamingInput` as an alternative input to the `generate` method. This PR mainly takes care of updating the vllm-omni abstractions to be up to date with the latest developments in `vllm`.

Test Plan
- Check that the change doesn't affect any current tests.
- Introduce a simple e2e test for streaming input.
Test Result
In progress