webui: support video files as input#22830

Merged: allozaur merged 1 commit into ggml-org:master from foldl:webui-video-files on May 17, 2026

Conversation

@foldl
Contributor

@foldl foldl commented May 8, 2026

Overview

Add support for video files as input. This addresses #20741.

The implementation closely follows the existing handling of audio files.

Note

To make video input work, at least 3 modifications are needed:

  • mtmd.
  • server.
  • webui.

This PR only updated WebUI.

Detailed Modifications

  1. Add a menu item for uploading video files;
  2. Show an icon in the chat input box (ChatAttachmentsListItemThumbnailFile) like ChatAttachmentsPreviewThumbnailStrip;
  3. Add a new preview window for video files;
  4. Video files are sent to the server through input_video (just like input_audio for audio files);
  5. Two types of video files are defined (mp4 and ogg);
  6. In the Model Information window, the video modality is shown as "Vision (Video)" and the vision modality as "Vision (Image)";
  7. Add a new boolean field, video, to Modalities.
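
Items 6 and 7 can be illustrated with a minimal sketch. It is written in Python for brevity and is not the WebUI's actual TypeScript code: the `Modalities` dataclass, the `modality_labels` helper, and the "Audio" label are hypothetical stand-ins. Only the `video` field and the two "Vision (...)" labels come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modalities:
    """Hypothetical Python mirror of the WebUI's Modalities type."""
    vision: bool = False
    audio: bool = False
    video: bool = False  # the new boolean field described in item 7


def modality_labels(m: Modalities) -> list[str]:
    """Return display labels in the style described in item 6."""
    labels = []
    if m.vision:
        labels.append("Vision (Image)")
    if m.video:
        labels.append("Vision (Video)")
    if m.audio:
        labels.append("Audio")  # assumed label, not taken from the PR
    return labels


if __name__ == "__main__":
    print(modality_labels(Modalities(vision=True, video=True)))
```

Keeping `video` as a separate flag, rather than folding it into `vision`, is what lets the UI label the two capabilities independently.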

Test & Screenshots

I have tested this with chatllm.cpp.

[Three screenshots]

Additional information

Some findings or thoughts that are out of the scope of this PR.

  • How should the modalities be displayed for image-only models versus image-and-video models?
  • Video files often contain audio. At present, when a file is sent to the server, its media type is inferred from the file extension rather than from the menu item the user clicked.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: NO. I coded this all by myself (copied and modified some existing code).

@ngxson
Contributor

ngxson commented May 8, 2026

IMO this can be an acceptable stop-gap solution. My one concern is that we will eventually have native video support in mtmd, so we should make sure the changes from this PR can be easily reverted when that happens.

@foldl
Contributor Author

foldl commented May 8, 2026

This is for mtmd (see issue #20741). Why would this need to be reverted once video support in mtmd is ready?

@allozaur
Contributor

Please rebase this on latest commit on master and solve conflicts.

@foldl foldl force-pushed the webui-video-files branch from 7713550 to eb04056 on May 16, 2026 12:53
Contributor

@ServeurpersoCom ServeurpersoCom left a comment

LGTM, video has been added symmetrically to image and audio.
If a refactor is needed, all three can be maintained in the same way.

@allozaur allozaur merged commit 4f13cb7 into ggml-org:master May 17, 2026
6 checks passed
@chigkim

chigkim commented May 17, 2026

Is it possible to submit a video file via the OpenAI Chat Completions or Responses API? If so, would you mind showing an example snippet of the JSON payload?
Thanks!

@ServeurpersoCom
Contributor

Is it possible to submit a video file via the OpenAI Chat Completions or Responses API? If so, would you mind showing an example snippet of the JSON payload? Thanks!

The PR adds an input_video content part that mirrors input_audio exactly. Based on the diff (chat.service.ts and api.d.ts), the WebUI sends something like this on /v1/chat/completions:

{
  "model": "your-model",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this video." },
        {
          "type": "input_video",
          "input_video": {
            "data": "<base64 encoded video bytes>",
            "format": "mp4"
          }
        }
      ]
    }
  ]
}

format is one of "mp4", "ogg" or "auto". The server side needs to understand input_video for this to work end to end.
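
For anyone building this payload from a script, here is a minimal Python sketch of the shape shown above. The helper names (`build_video_message`, `build_request_body`) are hypothetical, and the code only constructs the JSON; it does not imply that any server currently accepts it.

```python
import base64
import json

# Format values listed in the comment above.
ALLOWED_FORMATS = {"mp4", "ogg", "auto"}


def build_video_message(video_bytes: bytes, prompt: str, fmt: str = "auto") -> dict:
    """Build a user message with a text part and an input_video part."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}, got {fmt!r}")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "input_video",
                "input_video": {
                    # Raw file bytes, base64-encoded as text.
                    "data": base64.b64encode(video_bytes).decode("ascii"),
                    "format": fmt,
                },
            },
        ],
    }


def build_request_body(model: str, message: dict) -> str:
    """Serialize a /v1/chat/completions request body containing one message."""
    return json.dumps({"model": model, "messages": [message]})
```

A typical call would read the file with `open(path, "rb").read()` and pass the bytes to `build_video_message`.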

kgrama pushed a commit to kgrama/llama.cpp that referenced this pull request May 19, 2026
@OPerepadia

OPerepadia commented May 19, 2026

Is it required to set a specific flag to enable video input?

I tried with Gemma 4 E4B but there is no option to upload a video file.

In the model info, Vision (Video) is not shown either

[Screenshot]

But E4B has video capability

unsloth/gemma-4-E4B-it-GGUF · Hugging Face

Extended Multimodalities – Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B and E4B models).

I start llama-server with this command:

./build/bin/llama-server \
    -hf unsloth/gemma-4-E4B-it-GGUF:Q8_K_XL \
    -c 32768 \
    --n-gpu-layers auto \
    --mmproj-auto --mmproj-offload

@foldl
Contributor Author

foldl commented May 19, 2026

@OPerepadia this was implemented ahead of mtmd. Once video support is ready in mtmd (plus a small update in the server), video input will work.

See also #20741.

xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 19, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 19, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request May 19, 2026
@chigkim

chigkim commented May 21, 2026

Thanks @ServeurpersoCom for the format.

However, if I send the following message, I get the error below.

data holds the video encoded with base64.b64encode, the same way I encode images.

import base64

def encode_video(path):
    # Read the raw file bytes and return them as a base64 text string.
    with open(path, "rb") as video_file:
        content = video_file.read()
    return base64.b64encode(content).decode("utf-8")

data = encode_video(path)

Here's a message.

{
  "role": "user",
  "content": [
    { "type": "text", "text": "Describe this video." },
    {
      "type": "input_video",
      "input_video": {
        "data": data,
        "format": "mp4"
      }
    }
  ]
}

srv operator(): got exception: {"error":{"code":400,"message":"unsupported content[].type","type":"invalid_request_error"}}

If I omit the video block below, the request succeeds, but as expected the model says it cannot find a video.

    {
      "type": "input_video",
      "input_video": {
        "data": data,
        "format": "mp4"
      }
    }

I tried with both qwen-3.6 and gemma-4.
Thanks for your help!

@ServeurpersoCom
Contributor

However, if I send the following message, I get this error below.

That's exactly the request llama-ui sends; your JSON is correct. The piece that's missing is the backend: the server's content parser doesn't handle input_video yet, so it rejects it. The client half is in and the server half isn't, which is why even llama-ui itself can't do video end to end right now.
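
Until the server half lands, a client script can at least recognize this specific rejection. The sketch below is a hypothetical Python helper (its name is an assumption); it assumes the error body has the shape quoted earlier in this thread and checks for it defensively.

```python
import json


def is_unsupported_content_type(error_body: str) -> bool:
    """Return True if the body matches the 400 'unsupported content[].type' error."""
    try:
        payload = json.loads(error_body)
    except json.JSONDecodeError:
        return False
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return False
    message = str(error.get("message", ""))
    return error.get("code") == 400 and "unsupported content[].type" in message
```

A caller could use this to report "the server does not parse input_video yet" instead of surfacing a generic request failure.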

@foldl
Contributor Author

foldl commented May 21, 2026

@chigkim To make video input work, at least 3 modifications are needed:

  • mtmd.
  • server.
  • webui.

This PR only updated WebUI. If you want to see how the whole thing works, you can try it with chatllm.cpp (SmolVL, Gemma-4-E2B).

@chigkim

chigkim commented May 21, 2026

Ah ok, for some reason I thought the entire workflow was ready. :)

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request May 23, 2026
winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request May 27, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
Jcfunk added a commit to Jcfunk/llama.cpp that referenced this pull request Jun 2, 2026
Development

Successfully merging this pull request may close these issues.

Feature Request: support video files in WebUI

6 participants