webui: support video files as input #22830
Conversation
|
IMO this can be an acceptable stop-gap solution. But just one concern is that we will eventually have native video support in mtmd, so we should make sure changes from this PR can be easily reverted when it happens. |
|
This is for mtmd (see issue #20741). Why would this need to be reverted once video support in mtmd is ready? |
|
Please rebase this on latest commit on |
ServeurpersoCom
left a comment
LGTM, video has been added symmetrically to image and audio.
If refactor is required, all three can be maintained in the same way.
|
Is it possible to submit a video file via the OpenAI Chat Completions or Responses API? If so, would you mind showing an example snippet of the JSON payload? |
The PR adds an input_video content part that mirrors input_audio exactly. Based on the diff (chat.service.ts and api.d.ts), the WebUI sends it in the request to /v1/chat/completions, with format set to one of "mp4", "ogg" or "auto". The server side needs to understand input_video for this to work end to end. |
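As a rough illustration of the shape described above, here is a minimal Python sketch that builds a user message carrying an input_video part. The helper names (guess_format, build_input_video_part, build_user_message) and the extension-to-format mapping are assumptions made for this example and are not taken from the PR; only the field names and the "mp4", "ogg" and "auto" values come from the thread.

```python
import base64
from pathlib import Path

# Formats named in the thread; anything else falls back to "auto".
# The extension mapping itself is an assumption for illustration.
KNOWN_FORMATS = {".mp4": "mp4", ".ogg": "ogg"}


def guess_format(filename: str) -> str:
    """Map a file extension to an input_video format (hypothetical helper)."""
    return KNOWN_FORMATS.get(Path(filename).suffix.lower(), "auto")


def build_input_video_part(filename: str, raw: bytes) -> dict:
    """Build one content part shaped like the input_audio part."""
    return {
        "type": "input_video",
        "input_video": {
            "data": base64.b64encode(raw).decode("ascii"),
            "format": guess_format(filename),
        },
    }


def build_user_message(prompt: str, filename: str, raw: bytes) -> dict:
    """Combine a text part and a video part into one user message."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            build_input_video_part(filename, raw),
        ],
    }
```

A message built this way would go into the messages array of a /v1/chat/completions request. As noted in the thread, the server has to understand input_video before such a request can succeed.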
|
Is it required to set a specific flag to enable video input? I tried with Gemma 4 E4B, but there is no option to upload a video file. In the model info, Vision (Video) is not shown either.
But E4B has video capability: unsloth/gemma-4-E4B-it-GGUF · Hugging Face
Starting llama-server with this command:
./build/bin/llama-server \
    -hf unsloth/gemma-4-E4B-it-GGUF:Q8_K_XL \
    -c 32768 \
    --n-gpu-layers auto \
    --mmproj-auto --mmproj-offload |
|
@OPerepadia this is implemented ahead of mtmd. Once video support is ready in mtmd (plus a small update in the server), video input will work. See also #20741. |
|
Thanks @ServeurpersoCom for the format. However, if I send the following message, I get the error below. data holds the base64-encoded video; I encode it the same way I encode an image.
import base64

def encode_video(path):
    # Read the raw bytes and return them as a base64 string.
    with open(path, "rb") as video_file:
        content = video_file.read()
    return base64.b64encode(content).decode("utf-8")

data = encode_video(path)
Here's the message:
{
    "role": "user",
    "content": [
        { "type": "text", "text": "Describe this video." },
        {
            "type": "input_video",
            "input_video": {
                "data": data,
                "format": "mp4"
            }
        }
    ]
}
srv operator(): got exception: {"error":{"code":400,"message":"unsupported content[].type","type":"invalid_request_error"}}
If I send it without the video block below, it works, but the model says it can't find the video, as expected.
{
    "type": "input_video",
    "input_video": {
        "data": data,
        "format": "mp4"
    }
}
I tried with both qwen-3.6 and gemma-4. |
That's exactly the request llama-ui sends, your JSON is correct. The piece that's missing is the backend: the server's content parser doesn't handle input_video yet, so it rejects it. The client half is in, the server half isn't, which is why even the llama-ui itself can't do video end to end right now. |
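Until the server half lands, a client could detect this specific rejection and retry without the video part. The sketch below is only an illustration under that assumption: is_unsupported_content_type and strip_video_parts are hypothetical helper names, and the check is matched against the exact error body quoted earlier in this thread.

```python
import json


def is_unsupported_content_type(body: str) -> bool:
    """Return True if an error body matches the 400 quoted in the thread."""
    try:
        error = json.loads(body).get("error", {})
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or the top-level value is not an object.
        return False
    if not isinstance(error, dict):
        return False
    return (
        error.get("code") == 400
        and "unsupported content[].type" in str(error.get("message", ""))
    )


def strip_video_parts(message: dict) -> dict:
    """Return a copy of a chat message without its input_video parts."""
    content = message.get("content")
    if not isinstance(content, list):
        # Plain string content has no parts to filter.
        return dict(message)
    kept = [part for part in content if part.get("type") != "input_video"]
    return {**message, "content": kept}
```

With the video part stripped, the request goes through, but the model receives no video at all, which matches the behavior reported above.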
|
@chigkim To make video input work, at least 3 modifications are needed:
This PR only updated WebUI. If you want to see how the whole thing works, you can try it with chatllm.cpp (SmolVL, Gemma-4-E2B). |
|
Ah ok, for some reason I thought the entire workflow was ready. :) |

Overview
Support adding video files as input. This can fix #20741.
Video files are handled almost the same way as audio files.
Note
To make video input work, at least 3 modifications are needed:
This PR only updated WebUI.
Detailed Modifications
ChatAttachmentsListItemThumbnailFile) like ChatAttachmentsPreviewThumbnailStrip;
input_video (just like input_audio for audio files);
mp4 and ogg);
video to Modalities.
Test & Screenshots
I have tested this with chatllm.cpp.
Additional information
Some findings or thoughts that are out of the scope of this PR.
Requirements