
Test fixes #208

Closed
reeselevine wants to merge 52 commits into ngxson:master from reeselevine:test-fixes

Conversation

@reeselevine commented Apr 7, 2026

Collaborator

Fix npm run test locally and add a couple webgpu tests

Summary by CodeRabbit

Release Notes

  • New Features

    • GPU acceleration support via WebGPU backend with experimental stability on select devices
    • Performance monitoring capabilities: retrieve token generation rates and timing metrics
    • Performance metrics reset functionality
    • GitHub Pages automatic deployment workflow
  • Improvements

    • Enhanced model selection UI with WebGPU budget awareness and memory limits
    • Expanded WASM build variants for improved compatibility
    • Updated model library and dependencies
    • Improved documentation reflecting GPU acceleration availability
  • Chores

    • Build infrastructure enhancements
    • Type definitions for WebGPU support

@reeselevine closed this Apr 7, 2026
@reeselevine

Collaborator Author

whoops meant to open on my repo

@coderabbitai (Bot) commented Apr 7, 2026

Contributor

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d1f9da21-eed1-45a5-8ec5-9120ba14f414

📥 Commits

Reviewing files that changed from the base of the PR and between 8778d7b and 0362e25.

⛔ Files ignored due to path filters (9)
  • examples/main/package-lock.json is excluded by !**/package-lock.json
  • package-lock.json is excluded by !**/package-lock.json
  • src/asyncify-multi-thread/wllama.wasm is excluded by !**/*.wasm
  • src/asyncify-single-thread/wllama.wasm is excluded by !**/*.wasm
  • src/jspi-multi-thread/wllama.wasm is excluded by !**/*.wasm
  • src/jspi-single-thread/wllama.wasm is excluded by !**/*.wasm
  • src/multi-thread/wllama.wasm is excluded by !**/*.wasm
  • src/single-thread/wllama.wasm is excluded by !**/*.wasm
  • src/webgpu-single-thread/wllama.wasm is excluded by !**/*.wasm
📒 Files selected for processing (39)
  • .github/workflows/deploy-examples-main.yml
  • CMakeLists.txt
  • cpp/actions.hpp
  • cpp/glue.hpp
  • cpp/test_glue.cpp
  • cpp/wllama.cpp
  • examples/main/package.json
  • examples/main/src/components/ChatScreen.tsx
  • examples/main/src/components/GuideScreen.tsx
  • examples/main/src/components/ModelScreen.tsx
  • examples/main/src/config.ts
  • examples/main/src/utils/custom-models.tsx
  • examples/main/src/utils/displayed-model.tsx
  • examples/main/src/utils/types.ts
  • examples/main/src/utils/utils.ts
  • examples/main/src/utils/wllama.context.tsx
  • examples/main/tsconfig.app.json
  • llama.cpp
  • package.json
  • scripts/build_wasm.sh
  • scripts/build_worker.sh
  • scripts/docker-compose.yml
  • src/asyncify-multi-thread/wllama.js
  • src/asyncify-single-thread/wllama.js
  • src/cache-manager.ts
  • src/glue/messages.ts
  • src/jspi-multi-thread/wllama.js
  • src/jspi-single-thread/wllama.js
  • src/mjs.test.ts
  • src/multi-thread/wllama.js
  • src/single-thread/wllama.js
  • src/webgpu-single-thread/wllama.js
  • src/wllama.test.ts
  • src/wllama.ts
  • src/worker.ts
  • src/workers-code/generated.ts
  • src/workers-code/llama-cpp.js
  • tsconfig.build.json
  • vitest.config.ts

📝 Walkthrough

This PR adds comprehensive WebGPU backend support, introduces performance context APIs for timing metrics, restructures WASM build variants (JSPI and asyncify), updates the glue protocol (v1→v2), and enhances the example app UI with WebGPU memory budgeting, model filtering, and performance statistics display.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **WebGPU Backend Infrastructure**<br>`CMakeLists.txt`, `cpp/actions.hpp`, `cpp/glue.hpp`, `cpp/wllama.cpp` | Added CMake options (`GGML_WEBGPU`, `GGML_WEBGPU_JSPI`, `LLAMA_WASM_MEM64`). Extended C++ backend to select and manage WebGPU/CPU device via `ggml_backend_dev_t`, with fallback logic and device unavailability error handling. |
| **Performance Context APIs**<br>`cpp/actions.hpp`, `cpp/glue.hpp`, `cpp/test_glue.cpp`, `cpp/wllama.cpp`, `src/glue/messages.ts` | Introduced `action_perf_context` and `action_perf_reset` handlers with new message types (`pctx_req`, `pctx_res`, `prst_req`, `prst_res`). Wire protocol bumped from v1 to v2. Load request now includes `use_webgpu` and `no_perf` flags. |
| **WASM Build System Refactor**<br>`scripts/build_wasm.sh`, `scripts/build_worker.sh`, `scripts/docker-compose.yml`, `llama.cpp` | Updated Emscripten from 4.0.3 to 4.0.20. Replaced binary single/multi-thread builds with four variants (JSPI & asyncify, each single & multi-thread). Added Dawn WebGPU package integration into build. Updated exported build constants. |
| **TypeScript Worker & Runtime**<br>`src/worker.ts`, `src/workers-code/llama-cpp.js`, `src/wllama.ts` | Refactored worker module loading to support build-type selection (JSPI vs asyncify). Added unsigned pointer normalization in `llama-cpp.js`. Generalized async/sync wrapper logic. Updated `Wllama` class with `preferWebGPU`, `noPerf`, `getPerfContext()`, `resetPerfContext()`, and `usingWebGPU()` APIs. |
| **Example App - Performance Display**<br>`examples/main/src/components/ChatScreen.tsx`, `examples/main/src/utils/utils.ts`, `examples/main/src/utils/wllama.context.tsx` | Added UI for prefill/decode token rates with "Reset" button. Implemented `getWebGPUMemoryBudget()` with iOS-specific capping. Extended context to parameterize WebGPU preference and track runtime WebGPU status. |
| **Example App - Model Selection & Budget**<br>`examples/main/src/components/ModelScreen.tsx`, `examples/main/src/config.ts`, `examples/main/src/utils/displayed-model.tsx`, `examples/main/src/utils/custom-models.tsx` | Added "Prefer WebGPU" checkbox, WebGPU memory budget UI, and model size blocking logic. Implemented GGUF split-file parsing and i-quant model filtering. Updated model list and added `isIQuantModel()` utility. |
| **Configuration & Types**<br>`examples/main/src/utils/types.ts`, `examples/main/package.json`, `examples/main/tsconfig.app.json`, `package.json`, `tsconfig.build.json` | Extended `RuntimeInfo` and `InferenceParams` with WebGPU fields. Updated `@huggingface/jinja` (0.2.2→0.5.3) and added `@webgpu/types` to dependencies and TypeScript configurations. |
| **Documentation & Tests**<br>`examples/main/src/components/GuideScreen.tsx`, `src/wllama.test.ts`, `src/mjs.test.ts`, `vitest.config.ts`, `.github/workflows/deploy-examples-main.yml` | Updated guide copy to reflect experimental WebGPU support. Added WebGPU-specific completion tests. Updated tests to use new build-variant paths. Increased test timeout to 60s. Added GitHub Actions workflow for example app deployment. |
| **Utilities**<br>`src/cache-manager.ts` | Added `await` and metadata file cleanup in `deleteMany` operation. |
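For readers who want a concrete picture of the model-selection row above, here is a rough sketch of split-file parsing, i-quant detection, and size blocking. None of this is taken from the PR's diff: `parseSplitFileName`, `looksLikeIQuant`, and `canOfferOnWebGPU` are hypothetical names, the file-name pattern assumes the usual `-00001-of-00003.gguf` shard suffix, and treating i-quant models as unavailable on WebGPU is an assumption about what the filtering is for. The PR's own `isIQuantModel()` may well differ.

```typescript
// Hypothetical helpers for illustration only; not the PR's implementation.

interface SplitInfo {
  baseName: string; // file name without the shard suffix
  index: number;    // 1-based shard index
  total: number;    // total number of shards
}

// Assumed shard naming: "<base>-00001-of-00003.gguf".
const SPLIT_RE = /^(.*)-(\d{5})-of-(\d{5})\.gguf$/;

function parseSplitFileName(fileName: string): SplitInfo | null {
  const m = SPLIT_RE.exec(fileName);
  if (!m) return null; // not a split file
  const [, baseName = "", index = "", total = ""] = m;
  return {
    baseName,
    index: parseInt(index, 10),
    total: parseInt(total, 10),
  };
}

// Assumption: i-quant files carry a tag such as IQ2_XS or IQ4_NL in their name.
function looksLikeIQuant(fileName: string): boolean {
  return /IQ[1-4]_(XXS|XS|S|M|NL)/i.test(fileName);
}

// Assumption: a model is blocked when it is an i-quant or exceeds the budget.
function canOfferOnWebGPU(
  fileName: string,
  sizeBytes: number,
  budgetBytes: number
): boolean {
  return !looksLikeIQuant(fileName) && sizeBytes <= budgetBytes;
}
```

A UI could call `canOfferOnWebGPU` per model entry and grey out the ones that return `false`; how the budget itself is derived (the `getWebGPUMemoryBudget()` mentioned above) is not sketched here.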

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Wllama
    participant Worker
    participant WASM
    participant Backend as WebGPU/CPU Backend

    Client->>Wllama: loadModel(preferWebGPU: true)
    Wllama->>Wllama: Check navigator.gpu availability
    alt WebGPU Available
        Wllama->>Worker: init(buildType: 'jspi', use_webgpu: true)
    else WebGPU Unavailable
        Wllama->>Wllama: Fallback to CPU (warn)
        Wllama->>Worker: init(buildType: 'jspi', use_webgpu: false)
    end
    Worker->>WASM: wllama_start()
    Worker->>WASM: wllama_action('load', {...use_webgpu, n_gpu_layers...})
    WASM->>Backend: ggml_backend_dev_by_name('WebGPU') or ggml_backend_dev_by_type(CPU)
    Backend-->>WASM: device_handle
    WASM-->>Worker: load_res
    Worker-->>Wllama: Model loaded with device
    Wllama-->>Client: Ready (usingWebGPU() = true/false)
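The fallback branch in the diagram above could look roughly like the following on the TypeScript side. This is a sketch under assumptions: `chooseBackend`, `NavigatorLike`, and `BackendChoice` are hypothetical names, and the check is reduced to "is `navigator.gpu` present", which is all the diagram states. The actual PR may probe further (for example by requesting an adapter).

```typescript
// Minimal structural stand-in for the browser's Navigator, so the sketch
// also runs outside a browser. Hypothetical type, not from the PR.
interface NavigatorLike {
  gpu?: unknown;
}

interface BackendChoice {
  useWebGPU: boolean; // value that would be sent as use_webgpu in the load request
  warning?: string;   // set when WebGPU was requested but is unavailable
}

function chooseBackend(nav: NavigatorLike, preferWebGPU: boolean): BackendChoice {
  if (!preferWebGPU) {
    return { useWebGPU: false };
  }
  if (nav.gpu === undefined || nav.gpu === null) {
    // Requested but unavailable: fall back to CPU and surface a warning.
    return {
      useWebGPU: false,
      warning: "WebGPU requested but navigator.gpu is unavailable; using CPU",
    };
  }
  return { useWebGPU: true };
}
```

In a browser the first argument would simply be `navigator`; the returned `useWebGPU` flag corresponds to the `use_webgpu` field in the diagram's `load` action.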
sequenceDiagram
    participant UI as ChatScreen
    participant Wllama
    participant Worker
    participant WASM

    UI->>Wllama: createCompletion(prompt)
    Wllama->>Worker: wllama_action('completion', ...)
    Worker->>WASM: inference on selected backend
    WASM-->>Worker: completion_res
    Worker-->>Wllama: result
    Wllama-->>UI: completion done
    UI->>Wllama: getPerfContext()
    Wllama->>Worker: wllama_action('perf_context', ...)
    Worker->>WASM: perf_context (retrieve t_p_eval_ms, t_eval_ms, n_p_eval, n_eval)
    WASM-->>Worker: pctx_res {timing, counters}
    Worker-->>Wllama: metrics
    Wllama-->>UI: PerfContextData
    UI->>UI: Display token rates (tok/s)
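The last step of the diagram, turning the perf counters into tok/s, is simple arithmetic. A minimal sketch, assuming the four fields named in the diagram arrive as plain numbers; `PerfSample`, `tokensPerSecond`, and `tokenRates` are hypothetical names, and the real `PerfContextData` type may carry more fields or different names.

```typescript
// Field names follow the diagram; the interface itself is hypothetical.
interface PerfSample {
  t_p_eval_ms: number; // time spent evaluating the prompt (prefill), in ms
  t_eval_ms: number;   // time spent generating tokens (decode), in ms
  n_p_eval: number;    // number of prompt tokens evaluated
  n_eval: number;      // number of tokens generated
}

// Tokens per second, guarding against a zero or negative elapsed time.
function tokensPerSecond(count: number, elapsedMs: number): number {
  return elapsedMs > 0 ? (count * 1000) / elapsedMs : 0;
}

function tokenRates(p: PerfSample): { prefill: number; decode: number } {
  return {
    prefill: tokensPerSecond(p.n_p_eval, p.t_p_eval_ms),
    decode: tokensPerSecond(p.n_eval, p.t_eval_ms),
  };
}
```

For example, 100 prompt tokens in 500 ms works out to 200 tok/s prefill, and 50 generated tokens in 2000 ms to 25 tok/s decode. Returning 0 for an empty sample is a choice made here so that a freshly reset context displays cleanly.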

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • ngxson

Poem

🐰 Hops with joy through WebGPU gates,
Where tokens dance at faster rates,
Perf metrics bloom, a rabbit's delight,
JSPI and asyncify shine so bright,
GPU backends now within reach—
A speedier wllama we teach! 🚀

