Skip to content

QVAC-18026 Feature/video#1879

Merged
gianni-cor merged 47 commits into
mainfrom
feature/video
May 18, 2026
Merged

QVAC-18026 Feature/video#1879
gianni-cor merged 47 commits into
mainfrom
feature/video

Conversation

@aegioscy

@aegioscy aegioscy commented May 4, 2026

Copy link
Copy Markdown
Contributor

QVAC-18026 Wan 2.1 text-to-video generation support

Add complete text-to-video inference pipeline for Wan 2.1 (1.3B & 14B) models with GGML backend support:

  • Core components: Video dispatch handlers, frame processing, and AVI muxer
  • Native layer: C++ handlers for Wan video generation with GGML overlay patches
  • JS API: VideoStableDiffusion class with text-to-video, image-to-video, and frame-to-video modes
  • Examples: Three runnable demos (generate, img2vid, flf2vid) with tuned hyperparameters
  • Tests: Comprehensive unit tests for AVI writer, video handlers, and frame validation
  • Build: Model download scripts, build config, and full TypeScript definitions

Wan model optimizations included: flow_shift=3.0 for smooth motion, frame constraints (4k+1), and motion-explicit prompt guidance.

aegioscy and others added 3 commits May 4, 2026 09:42
- Extract shared dl() function into reusable dl-functions.sh module
- Update all download-model-*.sh scripts to source shared utilities
- Add download-model-wan.sh for Wan 2.1 video generation models
- Reduces code duplication and improves maintainability

Wan 2.1 downloads (~8.3 GB):
- wan2.1_t2v_1.3B_fp16.safetensors (diffusion model)
- wan_2.1_vae.safetensors (VAE encoder/decoder)
- umt5_xxl_fp16.safetensors (text encoder)

Co-authored-by: Cursor <cursoragent@cursor.com>
… shared parsers

Phase 1-4 of Wan 2.1 / 2.2 video generation support in the diffusion-cpp
addon. Configuration + parsing layer only; dispatch + callback plumbing +
JS surface land in follow-up commits on this branch.

SdCtxConfig:
  - Add highNoiseDiffusionModelPath for Wan 2.2 MoE high-noise expert
    (leave empty for Wan 2.1 and all non-Wan models)
  - Add previewMode / previewInterval / previewDenoised / previewNoisy
    for optional mid-denoising preview frames via sd_set_preview_callback
  - Wire both through SdCtxHandlers (new JS keys: preview_mode,
    preview_interval, preview_denoised, preview_noisy) and AddonJs
    (highNoiseDiffusionModelPath in args map)

AviWriter (new utility):
  - addon/src/utils/AviWriter.{hpp,cpp} ports the upstream avi_writer.h
    MJPG encoder onto an in-memory std::vector<uint8_t> sink (no stdio,
    no temp files) so video bytes flow through the existing
    OutputCallBackJs queue
  - Full input validation (numFrames, fps, jpegQuality, channel count,
    frame homogeneity, null data) -- StatusError on any rejection

SdParsers (new shared module):
  - Extract parseSampler / parseScheduler / parseCacheMode /
    parseVaeTileSize / parseCachePreset / requireNum/Str/Bool from
    SdGenHandlers into addon/src/handlers/SdParsers.{hpp,cpp}
  - Reused by both SdGenHandlers (image) and SdVidGenHandlers (video)

SdVidGenHandlers (new):
  - SdVidGenConfig struct with full Wan 2.1 + 2.2 surface: mode
    (txt2vid/img2vid/flf2vid), prompts, dimensions, videoFrames (4k+1
    validated), fps, seed, low-noise expert sample params, high-noise
    expert sample params, moeBoundary, strength, vaceStrength, VAE
    tiling, cache mode/preset/threshold
  - 22 JSON handlers with validation for each field

Tests (all pass):
  - 5 new SdCtxHandlers tests for preview_* + high_noise path default
  - 18 new AviWriter tests covering happy path, RIFF header structure,
    all validation rejections, JPEG round-trip
  - 54 new SdVidGenHandlers tests covering every field + integration
    payload + defaults
  - Zero regressions across existing 144 fast-unit tests

No user-facing JS API changes yet.

Co-authored-by: Cursor <cursoragent@cursor.com>
…rapper + examples

Builds on the Wan foundation commit by wiring the video path end-to-end
from JS to C++ and back. Adds txt2vid / img2vid / flf2vid generation
via a new VideoStableDiffusion class that shares the single native
binding with the existing ImgStableDiffusion class.

Native:
- SdModel::process() dispatches on the JSON "mode" field to
  processImage() (existing) or the new processVideo() path.
- processVideo() applies SdVidGenHandlers, validates mode-vs-inputs
  invariants (img2vid requires init_image; flf2vid requires both;
  txt2vid rejects both; end_image only valid on flf2vid), decodes
  init/end/control frames, fills sd_vid_gen_params_t, and encodes
  the returned sd_image_t* sequence to an in-memory MJPG AVI.
- SdVideoFrames RAII wrapper extracted to addon/src/utils/ so it
  can be unit-tested without a loaded model.
- GenerationJob grows endImageBytes and controlFramesBytes plus an
  optional per-frame frameCallback (unused from JS in this PR;
  reserved for the preview follow-up).
- AddonJs::runJob reads endImageBuffer (single Uint8Array) and
  controlFramesBuffers (Array of Uint8Array) as typed-array args,
  no JSON encoding.

JS surface:
- video.js / video.d.ts: new VideoStableDiffusion class with
  full per-mode validation, 4k+1 frame-count rule, fps range,
  moe_boundary range, Uint8Array type checks, and warning when
  high_noise_* params are set without files.highNoiseDiffusionModel.
- addon.js: SdInterface.runJob threads end_image and control_frames
  through to the native runJob without round-tripping through JSON.
- index.js / index.d.ts: unchanged -- image wrapper continues to
  work exactly as before. Both classes compose the same SdInterface
  and hit the same binding.cpp entry points.
- package.json: exports "./video", ships video.js / video.d.ts,
  adds generate:video / generate:img2vid / generate:flf2vid scripts.

Examples:
- examples/generate-video-wan.js (txt2vid @ 832x480, 33 frames)
- examples/img2vid-wan.js (reuses assets/von-neumann.jpg as first frame)
- examples/flf2vid-wan.js (expects flf-first.png / flf-last.png)

Tests:
- test_sd_video_frames.cpp: 12 RAII tests (empty states, destruction
  of 4k+1 production sizes, null-pixel tolerance, bounds-checked
  operator[], compile-time copy/move deletion).
- test_wan_video.cpp: 12 validation tests reusing the SD2.1 context
  to satisfy isLoaded() and exercise every processVideo() guard
  before generate_video() runs; plus an opt-in happy-path smoke
  test (SD_RUN_WAN_SMOKE=1) gated off by default because ggml-metal
  lacks IM2COL_3D for Wan's 3D convs.

Gates: npm run lint, npm run test:dts, npm run build, and the
fast subset of addon-test (178/178) all pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a vcpkg overlay-port for ggml at vcpkg/ports/ggml/ that pins
tetherto/qvac-ext-ggml @ feature/metal-pr-16669-clean (commit
bc053644). The fork adds Metal kernels for IM2COL_3D and 3-axis
PAD-left, both required by Wan 2.1 / 2.2 video generation; without
them ggml hard-aborts mid-run with "unsupported op 'IM2COL_3D'".
Rationale lives in portfile.cmake -- the overlay is transient and
will be removed once the registry baseline rolls forward.

Add JS test coverage for VideoStableDiffusion:
  - test/unit/video-validation.test.js: 63 input-validation cases
    mirroring the existing input-validation.test.js pattern.
  - test/integration/generate-video-wan.test.js: opt-in
    (WAN_INTEGRATION=1) end-to-end T2V smoke test plus sniffAvi
    self-tests.

Tune the Wan examples:
  - generate-video-wan.js: env-var-driven (PROMPT, FRAMES, STEPS,
    SEED, CFG_SCALE, FLOW_SHIFT, ...), inline frame-count cheat
    sheet, (4*k+1) pre-flight check, default FRAMES bumped to 81
    (Wan 1.3B's native training length).
  - img2vid-wan.js, flf2vid-wan.js: flow_shift 5.0 -> 3.0 to match
    the upstream test-wan reference scripts.

Refresh the C++ smoke-test gating doc in test_wan_video.cpp to
reflect that Metal works once the overlay is in place.

Drop build.md: the vcpkg overlay rationale already lives next to
the overlay (portfile.cmake header), and transient infrastructure
doesn't earn its own long-form doc.

Co-authored-by: Cursor <cursoragent@cursor.com>
@gianni-cor

Copy link
Copy Markdown
Contributor

packages/diffusion-cpp/addon.js:24mapAddonEvent JobEnded fallback will misclassify any non-uint8_t typed-array output as JobEnded and prematurely close the response stream

(General PR comment — line 24 of addon.js is outside this PR's diff hunks so inline anchoring isn't possible.)

The else if (rawData && typeof rawData === 'object') branch at addon.js:24-26 accepts every typed array (Float32Array, Int16Array, …), not just the runtime-stats POJO it's intended for. Today only JsTypedArrayOutputHandler<uint8_t> is registered (AddonJs.hpp:103-104), so it works by coincidence. The moment a second typed-array handler ships — most obviously when SdModel::GenerationJob::frameCallback (declared at SdModel.hpp:139-141, fired at SdModel.cpp:979-985, flagged "reserved for future use" in the PR description) is wired from JS — every per-frame event gets re-typed as JobEnded and the stream ends after the first frame. C++ tests pass; JS silently breaks.

Change packages/diffusion-cpp/addon.js:24 from:

if (rawData && typeof rawData === 'object') {

to:

if (rawData && typeof rawData === 'object' && !ArrayBuffer.isView(rawData)) {

ArrayBuffer.isView is true for every typed array and DataView, false for plain objects (the runtime-stats POJO) — exactly the discrimination needed. One-token fix, latent-bug-proof.

@gianni-cor gianni-cor left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Five C++-side issues from a re-review of the latest commits (one real bug + three hardening + one feature-completeness clarification). Per-comment inline below; the init_image regression and JS-side issues were posted in earlier comment threads.

Comment thread packages/diffusion-cpp/addon/src/model-interface/SdModel.cpp
Comment thread packages/diffusion-cpp/addon/src/model-interface/SdModel.cpp
Comment thread packages/diffusion-cpp/addon/src/handlers/SdVidGenHandlers.cpp Outdated
Comment thread packages/diffusion-cpp/addon/src/handlers/SdParsers.cpp Outdated
Comment thread packages/diffusion-cpp/addon/src/handlers/SdCtxHandlers.hpp Outdated
aegioscy and others added 2 commits May 12, 2026 21:07
…, cancel, etc.)

Addresses all 8 outstanding comments on PR #1879 (one regression from
commit 59f2663 plus a CHANGES_REQUESTED batch of seven items). Major
points below; per-file rationale in the inline comments.

== Regression fix (highest priority)
* gianni-cor flagged that the new init_image strict-equality check from
  commit 59f2663 rejects every off-grid frame with a confusing error
  citing wrapper-picked dims. Root cause: addon.js _fillDimsFromImage
  was silently doing Math.ceil(d/8)*8, so a 100x100 init_image got
  dispatched as 104x104 and the native check then threw "100x100 != 104x104"
  -- citing a value the caller never passed. Fixes:
  - addon.js _fillDimsFromImage now passes dims through verbatim
    (no rounding). The image SDEdit path already realigns internally
    (SdModel.cpp ~600) and the FLUX2 ref path uses
    auto_resize_ref_image, so dropping the rounding is safe across
    every path.
  - video.js _runInternal pre-empts the cryptic native error with a
    JS-layer off-grid probe: when width/height aren't explicit it
    reads init_image / end_image / control_frames[i] dimensions and
    throws a clear "your image is off-grid, pre-align or pass explicit
    dims" message naming the exact buffer.
  - Removes the ceil-vs-round inconsistency wart between
    _fillDimsFromImage (ceil) and the user-facing validator (round).
  - Three new JS regression tests for off-grid init / end / control,
    plus one positive test for explicit aligned dims overriding the
    probe.

== JS hardening
* params.prompt is documented Required but was never validated --
  undefined / "" / 42 each produced a different failure mode (silent
  noise, silent noise, far-away C++ error). video.js now throws a loud
  TypeError at the wrapper boundary. Four new prompt-validation tests.
* mapAddonEvent JobEnded fallback accepted every typed-array view --
  works today only because uint8_t is the sole registered
  TypedArrayOutputHandler. When frameCallback (SdModel.hpp:139) gets
  wired through to JS, every per-frame event would have been
  misclassified as JobEnded and the response stream would have closed
  after the first frame. One-token fix: add `&& !ArrayBuffer.isView(rawData)`
  to the discriminator. ArrayBuffer.isView is true for every TypedArray
  + DataView, false for plain objects -- exactly the discrimination
  needed for the runtime-stats POJO.

== C++ parser hardening (NaN / Inf / int64 / range)
* Promoted requireInt from SdVidGenHandlers.cpp's anonymous namespace
  into parsers::, and added two siblings:
  - requireFiniteFloat: rejects NaN / +inf / -inf before the float
    cast (NaN compares false against every bound, so range checks
    of the form `f < lo || f > hi` previously let it sneak through).
  - requireInt64: same finite + integer guards as requireInt, range
    check against representable [INT64_MIN, INT64_MAX] doubles.
  - requireFiniteFloatInRange: convenience wrapper for [lo, hi] checks.
* Routed every previously-vulnerable cast through the new helpers:
  - SdVidGenHandlers.cpp: seed (int64), cfg_scale, flow_shift,
    high_noise_cfg_scale, high_noise_flow_shift, vae_tile_overlap,
    cache_threshold, moe_boundary, strength, vace_strength
  - SdGenHandlers.cpp (image path, reviewer asked for symmetric fix):
    eta, cfg_scale, guidance, img_cfg_scale, seed, batch_count,
    strength, clip_skip, vae_tile_overlap, cache_threshold, width,
    height, steps, parseUpscaleRepeats
* parseVaeTileSize (SdParsers.cpp): numeric form now routes through
  requireInt (rejects NaN/Inf/fractional/out-of-range), and BOTH
  forms (numeric and "WxH" string) now reject <= 0. Five new tests.

== Cancellation gap + typed status
* SdModel.cpp processVideo cancelRequested_ was checked exactly once
  after generate_video() returns -- the slow tail (per-frame PNG
  fan-out + AVI mux, multi-second on 81-frame 832x480 videos) had no
  cancellation visibility. Added 2 checks: top of frame-callback loop
  body, and immediately before encodeFramesToAvi.
* Switched both Job cancelled throws (image path at SdModel.cpp:730,
  video path at :987, plus the 2 new C1 sites) from bare
  std::runtime_error to StatusError tagged with
  localCodeMsg="Cancelled", so the JS layer can discriminate cancel
  from real internal failures via codeString() ("[ General :: Cancelled ]")
  instead of string-matching the exception message.

  Note: this PR deliberately does NOT add `Cancelled = 6` to the
  shared inference-addon-cpp Errors.hpp enum, because that header
  ships via vcpkg to every package in the monorepo and a cross-package
  coordinated change is out of scope. Instead we use the 3-arg
  StatusError ctor (addonId, localCodeMsg, errorMsg) which produces
  the same codeString without touching the shared enum. When the
  enum is updated later, the 4 call sites can switch to the 2-arg
  ctor in a one-line follow-up.

== C5 (preview_*) -- product decision deferred
* The header comment at SdCtxHandlers.hpp:112 claimed preview_mode et
  al are "Wired to sd_set_preview_callback() in SdModel::process()",
  but a grep across packages/diffusion-cpp for sd_set_preview_callback
  returns zero matches -- the four config keys are validated and stored
  but the upstream callback is never installed, so they're a silent
  no-op end-to-end. Downgraded the misleading comment to an explicit
  TODO(QVAC-18026 follow-up) documenting the gap and the two viable
  resolution paths (wire it up alongside sd_set_abort_callback, OR
  remove the handlers + fields + tests). Reviewer asked which path is
  intended; this commit picks neither and just stops claiming the
  wiring exists. The choice can land in a separate PR without holding
  this one up.

== Test surface
* +8 JS tests (prompt validation x4, off-grid probe x4)
* +5 C++ tests (vae_tile_size zero/negative/fractional/out-of-range
  rejection, plus the existing IntCoercion suite carried over to the
  promoted helpers transparently)
* Cancel-context test updated to assert the typed
  "[ General :: Cancelled ]" codeString in addition to the message.

Verified locally:
  JS unit tests:   75/75 pass with prebuild, 75/75 also without
                   (CI sanity-checks mode, no native binary loaded)
  C++ unit tests: 209/210 pass, 1 opt-in skip
                   (SdWanHappyPathTest needs ~8GB Wan weights)
  npm run lint:    clean
  npm run test:dts: clean

Co-authored-by: Cursor <cursoragent@cursor.com>
@aegioscy

Copy link
Copy Markdown
Contributor Author

Done in fe4d10f — one-token fix as you suggested. mapAddonEvent (addon.js:24) now discriminates with && !ArrayBuffer.isView(rawData), so future typed-array handlers (per-frame frameCallback, audio, anything else) won't get misclassified as JobEnded and prematurely close the response stream.

Inline comment added at the call site explaining what each side of the discriminator catches and why (TypedArray + DataView vs plain runtime-stats POJO), so the rationale is visible for whoever wires SdModel::GenerationJob::frameCallback through to JS next.

aegioscy and others added 4 commits May 13, 2026 14:04
Bumps @qvac/diffusion-cpp to 0.8.0 and documents the Wan 2.1 / Wan 2.2
video pipeline shipped since 0.7.0: new VideoStableDiffusion class
(txt2vid / img2vid / flf2vid), MoE high-noise expert routing, streaming
MJPG AVI muxer, refactored download helpers + Wan model script, plus
the supporting JS + C++ test coverage and validation hardening.

Co-authored-by: Cursor <cursoragent@cursor.com>
gianni-cor
gianni-cor previously approved these changes May 15, 2026
gianni-cor and others added 12 commits May 15, 2026 10:37
_fillDimsFromImage was passing raw image dimensions through verbatim
since fe4d10f, but the native SdGenHandlers validates width/height
% 8 == 0 before the downstream alignment in SdModel::processImage
ever runs. Any img2img call with a non-aligned source image (e.g.
the bundled 500x627 von-neumann.jpg used by the FLUX2 i2i integration
test) therefore failed with:

  height must be a positive multiple of 8, got: 627

Restore the Math.ceil(d/8)*8 round-up that was removed in fe4d10f.
The original motivation for the removal -- avoiding a spurious dim
mismatch on the video path where processVideo strict-compares decoded
frame dims against vid.width/vid.height -- is already handled at the
JS layer by VideoStableDiffusion's off-grid pre-validation in
video.js, which runs before this helper and rejects unaligned
init/end/control frames with a clear caller-facing error. The ceil()
is therefore a no-op on the video path.

Co-authored-by: Cursor <cursoragent@cursor.com>
cpp-lint surfaced clang-format drift in 4 files that accumulated
across recent Wan-video commits. No semantic changes -- only
mechanical line-wrap / arg-break placement to match the project's
.clang-format.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ntegration test

The generate-video-wan.test.js test was using a relative import
(require('../../video')) that breaks when test files are bundled
and relocated to the test-framework backend directory during mobile
test setup.

Change to the package export pattern (@qvac/diffusion-cpp/video)
used by other integration tests, which remains valid regardless of
file location.

Fixes: https://github.com/tetherto/qvac/actions/runs/25929776543/job/76221440417
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@gianni-cor

Copy link
Copy Markdown
Contributor

/review

@gianni-cor

Copy link
Copy Markdown
Contributor

/review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

verified Authorize secrets / label-gate in PR workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants