QVAC-18026 Feature/video by aegioscy · Pull Request #1879 · tetherto/qvac

aegioscy · 2026-05-04T10:19:30Z

QVAC-18026 Wan 2.1 text-to-video generation support

Add complete text-to-video inference pipeline for Wan 2.1 (1.3B & 14B) models with GGML backend support:

Core components: Video dispatch handlers, frame processing, and AVI muxer
Native layer: C++ handlers for Wan video generation with GGML overlay patches
JS API: VideoStableDiffusion class with text-to-video, image-to-video, and frame-to-video modes
Examples: Three runnable demos (generate, img2vid, flf2vid) with tuned hyperparameters
Tests: Comprehensive unit tests for AVI writer, video handlers, and frame validation
Build: Model download scripts, build config, and full TypeScript definitions

Wan model optimizations included: flow_shift=3.0 for smooth motion, frame constraints (4k+1), and motion-explicit prompt guidance.

- Extract shared dl() function into reusable dl-functions.sh module - Update all download-model-*.sh scripts to source shared utilities - Add download-model-wan.sh for Wan 2.1 video generation models - Reduces code duplication and improves maintainability Wan 2.1 downloads (~8.3 GB): - wan2.1_t2v_1.3B_fp16.safetensors (diffusion model) - wan_2.1_vae.safetensors (VAE encoder/decoder) - umt5_xxl_fp16.safetensors (text encoder) Co-authored-by: Cursor <cursoragent@cursor.com>

… shared parsers Phase 1-4 of Wan 2.1 / 2.2 video generation support in the diffusion-cpp addon. Configuration + parsing layer only; dispatch + callback plumbing + JS surface land in follow-up commits on this branch. SdCtxConfig: - Add highNoiseDiffusionModelPath for Wan 2.2 MoE high-noise expert (leave empty for Wan 2.1 and all non-Wan models) - Add previewMode / previewInterval / previewDenoised / previewNoisy for optional mid-denoising preview frames via sd_set_preview_callback - Wire both through SdCtxHandlers (new JS keys: preview_mode, preview_interval, preview_denoised, preview_noisy) and AddonJs (highNoiseDiffusionModelPath in args map) AviWriter (new utility): - addon/src/utils/AviWriter.{hpp,cpp} ports the upstream avi_writer.h MJPG encoder onto an in-memory std::vector<uint8_t> sink (no stdio, no temp files) so video bytes flow through the existing OutputCallBackJs queue - Full input validation (numFrames, fps, jpegQuality, channel count, frame homogeneity, null data) -- StatusError on any rejection SdParsers (new shared module): - Extract parseSampler / parseScheduler / parseCacheMode / parseVaeTileSize / parseCachePreset / requireNum/Str/Bool from SdGenHandlers into addon/src/handlers/SdParsers.{hpp,cpp} - Reused by both SdGenHandlers (image) and SdVidGenHandlers (video) SdVidGenHandlers (new): - SdVidGenConfig struct with full Wan 2.1 + 2.2 surface: mode (txt2vid/img2vid/flf2vid), prompts, dimensions, videoFrames (4k+1 validated), fps, seed, low-noise expert sample params, high-noise expert sample params, moeBoundary, strength, vaceStrength, VAE tiling, cache mode/preset/threshold - 22 JSON handlers with validation for each field Tests (all pass): - 5 new SdCtxHandlers tests for preview_* + high_noise path default - 18 new AviWriter tests covering happy path, RIFF header structure, all validation rejections, JPEG round-trip - 54 new SdVidGenHandlers tests covering every field + integration payload + defaults - Zero regressions across existing 144 fast-unit tests No user-facing JS API changes yet. Co-authored-by: Cursor <cursoragent@cursor.com>

…rapper + examples Builds on the Wan foundation commit by wiring the video path end-to-end from JS to C++ and back. Adds txt2vid / img2vid / flf2vid generation via a new VideoStableDiffusion class that shares the single native binding with the existing ImgStableDiffusion class. Native: - SdModel::process() dispatches on the JSON "mode" field to processImage() (existing) or the new processVideo() path. - processVideo() applies SdVidGenHandlers, validates mode-vs-inputs invariants (img2vid requires init_image; flf2vid requires both; txt2vid rejects both; end_image only valid on flf2vid), decodes init/end/control frames, fills sd_vid_gen_params_t, and encodes the returned sd_image_t* sequence to an in-memory MJPG AVI. - SdVideoFrames RAII wrapper extracted to addon/src/utils/ so it can be unit-tested without a loaded model. - GenerationJob grows endImageBytes and controlFramesBytes plus an optional per-frame frameCallback (unused from JS in this PR; reserved for the preview follow-up). - AddonJs::runJob reads endImageBuffer (single Uint8Array) and controlFramesBuffers (Array of Uint8Array) as typed-array args, no JSON encoding. JS surface: - video.js / video.d.ts: new VideoStableDiffusion class with full per-mode validation, 4k+1 frame-count rule, fps range, moe_boundary range, Uint8Array type checks, and warning when high_noise_* params are set without files.highNoiseDiffusionModel. - addon.js: SdInterface.runJob threads end_image and control_frames through to the native runJob without round-tripping through JSON. - index.js / index.d.ts: unchanged -- image wrapper continues to work exactly as before. Both classes compose the same SdInterface and hit the same binding.cpp entry points. - package.json: exports "./video", ships video.js / video.d.ts, adds generate:video / generate:img2vid / generate:flf2vid scripts. Examples: - examples/generate-video-wan.js (txt2vid @ 832x480, 33 frames) - examples/img2vid-wan.js (reuses assets/von-neumann.jpg as first frame) - examples/flf2vid-wan.js (expects flf-first.png / flf-last.png) Tests: - test_sd_video_frames.cpp: 12 RAII tests (empty states, destruction of 4k+1 production sizes, null-pixel tolerance, bounds-checked operator[], compile-time copy/move deletion). - test_wan_video.cpp: 12 validation tests reusing the SD2.1 context to satisfy isLoaded() and exercise every processVideo() guard before generate_video() runs; plus an opt-in happy-path smoke test (SD_RUN_WAN_SMOKE=1) gated off by default because ggml-metal lacks IM2COL_3D for Wan's 3D convs. Gates: npm run lint, npm run test:dts, npm run build, and the fast subset of addon-test (178/178) all pass. Co-authored-by: Cursor <cursoragent@cursor.com>

Add a vcpkg overlay-port for ggml at vcpkg/ports/ggml/ that pins tetherto/qvac-ext-ggml @ feature/metal-pr-16669-clean (commit bc053644). The fork adds Metal kernels for IM2COL_3D and 3-axis PAD-left, both required by Wan 2.1 / 2.2 video generation; without them ggml hard-aborts mid-run with "unsupported op 'IM2COL_3D'". Rationale lives in portfile.cmake -- the overlay is transient and will be removed once the registry baseline rolls forward. Add JS test coverage for VideoStableDiffusion: - test/unit/video-validation.test.js: 63 input-validation cases mirroring the existing input-validation.test.js pattern. - test/integration/generate-video-wan.test.js: opt-in (WAN_INTEGRATION=1) end-to-end T2V smoke test plus sniffAvi self-tests. Tune the Wan examples: - generate-video-wan.js: env-var-driven (PROMPT, FRAMES, STEPS, SEED, CFG_SCALE, FLOW_SHIFT, ...), inline frame-count cheat sheet, (4*k+1) pre-flight check, default FRAMES bumped to 81 (Wan 1.3B's native training length). - img2vid-wan.js, flf2vid-wan.js: flow_shift 5.0 -> 3.0 to match the upstream test-wan reference scripts. Refresh the C++ smoke-test gating doc in test_wan_video.cpp to reflect that Metal works once the overlay is in place. Drop build.md: the vcpkg overlay rationale already lives next to the overlay (portfile.cmake header), and transient infrastructure doesn't earn its own long-form doc. Co-authored-by: Cursor <cursoragent@cursor.com>

gianni-cor · 2026-05-12T16:18:06Z

packages/diffusion-cpp/addon.js:24 — mapAddonEvent JobEnded fallback will misclassify any non-uint8_t typed-array output as JobEnded and prematurely close the response stream

(General PR comment — line 24 of addon.js is outside this PR's diff hunks so inline anchoring isn't possible.)

The else if (rawData && typeof rawData === 'object') branch at addon.js:24-26 accepts every typed array (Float32Array, Int16Array, …), not just the runtime-stats POJO it's intended for. Today only JsTypedArrayOutputHandler<uint8_t> is registered (AddonJs.hpp:103-104), so it works by coincidence. The moment a second typed-array handler ships — most obviously when SdModel::GenerationJob::frameCallback (declared at SdModel.hpp:139-141, fired at SdModel.cpp:979-985, flagged "reserved for future use" in the PR description) is wired from JS — every per-frame event gets re-typed as JobEnded and the stream ends after the first frame. C++ tests pass; JS silently breaks.

Change packages/diffusion-cpp/addon.js:24 from:

if (rawData && typeof rawData === 'object') {

to:

if (rawData && typeof rawData === 'object' && !ArrayBuffer.isView(rawData)) {

ArrayBuffer.isView is true for every typed array and DataView, false for plain objects (the runtime-stats POJO) — exactly the discrimination needed. One-token fix, latent-bug-proof.

gianni-cor

Five C++-side issues from a re-review of the latest commits (one real bug + three hardening + one feature-completeness clarification). Per-comment inline below; the init_image regression and JS-side issues were posted in earlier comment threads.

…, cancel, etc.) Addresses all 8 outstanding comments on PR #1879 (one regression from commit 59f2663 plus a CHANGES_REQUESTED batch of seven items). Major points below; per-file rationale in the inline comments. == Regression fix (highest priority) * gianni-cor flagged that the new init_image strict-equality check from commit 59f2663 rejects every off-grid frame with a confusing error citing wrapper-picked dims. Root cause: addon.js _fillDimsFromImage was silently doing Math.ceil(d/8)*8, so a 100x100 init_image got dispatched as 104x104 and the native check then threw "100x100 != 104x104" -- citing a value the caller never passed. Fixes: - addon.js _fillDimsFromImage now passes dims through verbatim (no rounding). The image SDEdit path already realigns internally (SdModel.cpp ~600) and the FLUX2 ref path uses auto_resize_ref_image, so dropping the rounding is safe across every path. - video.js _runInternal pre-empts the cryptic native error with a JS-layer off-grid probe: when width/height aren't explicit it reads init_image / end_image / control_frames[i] dimensions and throws a clear "your image is off-grid, pre-align or pass explicit dims" message naming the exact buffer. - Removes the ceil-vs-round inconsistency wart between _fillDimsFromImage (ceil) and the user-facing validator (round). - Three new JS regression tests for off-grid init / end / control, plus one positive test for explicit aligned dims overriding the probe. == JS hardening * params.prompt is documented Required but was never validated -- undefined / "" / 42 each produced a different failure mode (silent noise, silent noise, far-away C++ error). video.js now throws a loud TypeError at the wrapper boundary. Four new prompt-validation tests. * mapAddonEvent JobEnded fallback accepted every typed-array view -- works today only because uint8_t is the sole registered TypedArrayOutputHandler. When frameCallback (SdModel.hpp:139) gets wired through to JS, every per-frame event would have been misclassified as JobEnded and the response stream would have closed after the first frame. One-token fix: add `&& !ArrayBuffer.isView(rawData)` to the discriminator. ArrayBuffer.isView is true for every TypedArray + DataView, false for plain objects -- exactly the discrimination needed for the runtime-stats POJO. == C++ parser hardening (NaN / Inf / int64 / range) * Promoted requireInt from SdVidGenHandlers.cpp's anonymous namespace into parsers::, and added two siblings: - requireFiniteFloat: rejects NaN / +inf / -inf before the float cast (NaN compares false against every bound, so range checks of the form `f < lo || f > hi` previously let it sneak through). - requireInt64: same finite + integer guards as requireInt, range check against representable [INT64_MIN, INT64_MAX] doubles. - requireFiniteFloatInRange: convenience wrapper for [lo, hi] checks. * Routed every previously-vulnerable cast through the new helpers: - SdVidGenHandlers.cpp: seed (int64), cfg_scale, flow_shift, high_noise_cfg_scale, high_noise_flow_shift, vae_tile_overlap, cache_threshold, moe_boundary, strength, vace_strength - SdGenHandlers.cpp (image path, reviewer asked for symmetric fix): eta, cfg_scale, guidance, img_cfg_scale, seed, batch_count, strength, clip_skip, vae_tile_overlap, cache_threshold, width, height, steps, parseUpscaleRepeats * parseVaeTileSize (SdParsers.cpp): numeric form now routes through requireInt (rejects NaN/Inf/fractional/out-of-range), and BOTH forms (numeric and "WxH" string) now reject <= 0. Five new tests. == Cancellation gap + typed status * SdModel.cpp processVideo cancelRequested_ was checked exactly once after generate_video() returns -- the slow tail (per-frame PNG fan-out + AVI mux, multi-second on 81-frame 832x480 videos) had no cancellation visibility. Added 2 checks: top of frame-callback loop body, and immediately before encodeFramesToAvi. * Switched both Job cancelled throws (image path at SdModel.cpp:730, video path at :987, plus the 2 new C1 sites) from bare std::runtime_error to StatusError tagged with localCodeMsg="Cancelled", so the JS layer can discriminate cancel from real internal failures via codeString() ("[ General :: Cancelled ]") instead of string-matching the exception message. Note: this PR deliberately does NOT add `Cancelled = 6` to the shared inference-addon-cpp Errors.hpp enum, because that header ships via vcpkg to every package in the monorepo and a cross-package coordinated change is out of scope. Instead we use the 3-arg StatusError ctor (addonId, localCodeMsg, errorMsg) which produces the same codeString without touching the shared enum. When the enum is updated later, the 4 call sites can switch to the 2-arg ctor in a one-line follow-up. == C5 (preview_*) -- product decision deferred * The header comment at SdCtxHandlers.hpp:112 claimed preview_mode et al are "Wired to sd_set_preview_callback() in SdModel::process()", but a grep across packages/diffusion-cpp for sd_set_preview_callback returns zero matches -- the four config keys are validated and stored but the upstream callback is never installed, so they're a silent no-op end-to-end. Downgraded the misleading comment to an explicit TODO(QVAC-18026 follow-up) documenting the gap and the two viable resolution paths (wire it up alongside sd_set_abort_callback, OR remove the handlers + fields + tests). Reviewer asked which path is intended; this commit picks neither and just stops claiming the wiring exists. The choice can land in a separate PR without holding this one up. == Test surface * +8 JS tests (prompt validation x4, off-grid probe x4) * +5 C++ tests (vae_tile_size zero/negative/fractional/out-of-range rejection, plus the existing IntCoercion suite carried over to the promoted helpers transparently) * Cancel-context test updated to assert the typed "[ General :: Cancelled ]" codeString in addition to the message. Verified locally: JS unit tests: 75/75 pass with prebuild, 75/75 also without (CI sanity-checks mode, no native binary loaded) C++ unit tests: 209/210 pass, 1 opt-in skip (SdWanHappyPathTest needs ~8GB Wan weights) npm run lint: clean npm run test:dts: clean Co-authored-by: Cursor <cursoragent@cursor.com>

aegioscy · 2026-05-12T18:23:15Z

Done in fe4d10f — one-token fix as you suggested. mapAddonEvent (addon.js:24) now discriminates with && !ArrayBuffer.isView(rawData), so future typed-array handlers (per-frame frameCallback, audio, anything else) won't get misclassified as JobEnded and prematurely close the response stream.

Inline comment added at the call site explaining what each side of the discriminator catches and why (TypedArray + DataView vs plain runtime-stats POJO), so the rationale is visible for whoever wires SdModel::GenerationJob::frameCallback through to JS next.

Bumps @qvac/diffusion-cpp to 0.8.0 and documents the Wan 2.1 / Wan 2.2 video pipeline shipped since 0.7.0: new VideoStableDiffusion class (txt2vid / img2vid / flf2vid), MoE high-noise expert routing, streaming MJPG AVI muxer, refactored download helpers + Wan model script, plus the supporting JS + C++ test coverage and validation hardening. Co-authored-by: Cursor <cursoragent@cursor.com>

_fillDimsFromImage was passing raw image dimensions through verbatim since fe4d10f, but the native SdGenHandlers validates width/height % 8 == 0 before the downstream alignment in SdModel::processImage ever runs. Any img2img call with a non-aligned source image (e.g. the bundled 500x627 von-neumann.jpg used by the FLUX2 i2i integration test) therefore failed with: height must be a positive multiple of 8, got: 627 Restore the Math.ceil(d/8)*8 round-up that was removed in fe4d10f. The original motivation for the removal -- avoiding a spurious dim mismatch on the video path where processVideo strict-compares decoded frame dims against vid.width/vid.height -- is already handled at the JS layer by VideoStableDiffusion's off-grid pre-validation in video.js, which runs before this helper and rejects unaligned init/end/control frames with a clear caller-facing error. The ceil() is therefore a no-op on the video path. Co-authored-by: Cursor <cursoragent@cursor.com>

cpp-lint surfaced clang-format drift in 4 files that accumulated across recent Wan-video commits. No semantic changes -- only mechanical line-wrap / arg-break placement to match the project's .clang-format. Co-authored-by: Cursor <cursoragent@cursor.com>

…ntegration test The generate-video-wan.test.js test was using a relative import (require('../../video')) that breaks when test files are bundled and relocated to the test-framework backend directory during mobile test setup. Change to the package export pattern (@qvac/diffusion-cpp/video) used by other integration tests, which remains valid regardless of file location. Fixes: https://github.com/tetherto/qvac/actions/runs/25929776543/job/76221440417 Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

gianni-cor · 2026-05-18T11:09:39Z

/review

gianni-cor · 2026-05-18T11:10:35Z

/review

aegioscy and others added 3 commits May 4, 2026 09:42

aegioscy had a problem deploying to release May 4, 2026 10:19 — with GitHub Actions Failure

aegioscy had a problem deploying to release May 4, 2026 10:21 — with GitHub Actions Failure

aegioscy had a problem deploying to release May 5, 2026 11:26 — with GitHub Actions Failure

aegioscy had a problem deploying to release May 5, 2026 11:27 — with GitHub Actions Failure

gianni-cor requested changes May 12, 2026

View reviewed changes

aegioscy and others added 2 commits May 12, 2026 21:07

Merge branch 'main' into feature/video

51599d3

aegioscy added 2 commits May 13, 2026 08:54

Merge branch 'main' into feature/video

98da3a1

Merge branch 'main' into feature/video

cf14d4d

gabrielgrigoras-serv previously approved these changes May 13, 2026

View reviewed changes

aegioscy and others added 4 commits May 13, 2026 14:04

Merge branch 'main' into feature/video

e8cc385

Merge branch 'main' into feature/video

08e4d9c

Merge branch 'main' into feature/video

2523fa5

gianni-cor previously approved these changes May 15, 2026

View reviewed changes

gianni-cor and others added 12 commits May 15, 2026 10:37

Merge branch 'main' into feature/video

5fee526

Merge branch 'main' into feature/video

c11f0ba

Merge branch 'main' into feature/video

0f00555

Merge branch 'main' into feature/video

eb6f0b2

fix(diffusion-cpp): expose video API from package root

b4008a4

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge branch 'main' into feature/video

8e66466

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge branch 'main' into feature/video

acef512

fix(diffusion-cpp): repair variable names in SdModel after merge

c67541f

Co-authored-by: Cursor <cursoragent@cursor.com>

style(diffusion-cpp): apply git-clang-format

c328e70

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge branch 'main' into feature/video

8a00f5c

gianni-cor approved these changes May 18, 2026

View reviewed changes

maxim-smotrov approved these changes May 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

QVAC-18026 Feature/video#1879

QVAC-18026 Feature/video#1879
gianni-cor merged 47 commits into
mainfrom
feature/video

aegioscy commented May 4, 2026 •

edited

Loading

Uh oh!

gianni-cor commented May 12, 2026

Uh oh!

gianni-cor left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aegioscy commented May 12, 2026

Uh oh!

gianni-cor commented May 18, 2026

Uh oh!

gianni-cor commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

aegioscy commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

QVAC-18026 Wan 2.1 text-to-video generation support

Uh oh!

gianni-cor commented May 12, 2026

Uh oh!

gianni-cor left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aegioscy commented May 12, 2026

Uh oh!

gianni-cor commented May 18, 2026

Uh oh!

gianni-cor commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

aegioscy commented May 4, 2026 •

edited

Loading