Add support for async file read #221

Merged: ngxson merged 5 commits into master from xsn/async_file_read on May 17, 2026

ngxson (Owner) commented May 16, 2026

Add support for async file reads via JSPI, allowing the wasm context to read files on demand.

This implementation hooks into the fread() function and translates the underlying call into an async Blob read. As a result, tensors can be loaded on demand without allocating the whole file. This is particularly useful when, for example, a large part of the weights is offloaded to the WebGPU backend, because those weights are then never allocated in RAM.

Hopefully this allows bigger models (> 2GB in size) to run directly on WebGPU. CC @reeselevine if you are interested in doing some testing.

Note: on systems that don't support JSPI, we fall back to the HeapFS implementation (allocate the file on the heap, then mmap it to avoid copying the file).
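
As a rough illustration of the "async Blob read" that an intercepted fread() ends up triggering, here is a minimal sketch of reading one byte range from a Blob. The helper name `readBlobRange` and its clamping behavior are assumptions made for this example and are not taken from the PR's code.

```typescript
// Hypothetical helper: read `size` bytes starting at `offset` from a Blob
// without pulling the rest of the file into memory.
async function readBlobRange(
  blob: Blob,
  offset: number,
  size: number
): Promise<Uint8Array> {
  // Clamp the range so a read near the end of the file returns a short
  // result, similar to what fread() reports at end-of-file.
  const start = Math.min(Math.max(offset, 0), blob.size);
  const end = Math.min(start + Math.max(size, 0), blob.size);
  // slice() only describes a byte range; the bytes are read when
  // arrayBuffer() is awaited.
  const buf = await blob.slice(start, end).arrayBuffer();
  return new Uint8Array(buf);
}
```

Only the requested range is materialized as an ArrayBuffer, which is the property that lets tensors be fetched piecewise instead of copying the whole model file onto the heap.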

Summary by CodeRabbit

  • New Features

    • Async file-read support for faster, non-blocking model loads in supported environments
    • Optional in-memory 1MB read-ahead cache for small reads
    • Worker/bridge updates to support deferred Blob-based reads and conditional memory-mapped loading
  • Bug Fixes

    • Updated browser compatibility notes: Safari excluded due to missing Memory64 support
  • Documentation

    • Expanded developer guide with a new “File access” section describing async reads, HeapFS usage, caching, and memory considerations
  • Chores

    • Build config updated to enable WebAssembly-level file I/O interception

coderabbitai Bot (Contributor) commented May 16, 2026

📝 Walkthrough


This PR adds JSPI‑gated async file reads for WebAssembly: linker wraps and C++ fread wrappers with a 1MB cache, a JS bridge to request chunks from the main thread, worker-side handlers that await responses and write into wasm heap, and runtime toggles to disable mmap when async is available.
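
The worker-to-main-thread round trip described here can be sketched as a small request/response bridge. Only the verbs `fs.read_req` and `fs.read_res` come from this summary; the `id` field, the message shapes, and the `ReadBridge` class are assumptions for illustration, and the real worker code may correlate requests differently.

```typescript
// Message shapes are assumptions; only the verbs appear in the PR summary.
type ReadReq = { verb: "fs.read_req"; id: number; path: string; offset: number; size: number };
type ReadRes = { verb: "fs.read_res"; id: number; data: ArrayBuffer };

class ReadBridge {
  private nextId = 0;
  private pending = new Map<number, (data: Uint8Array) => void>();
  // Stands in for the worker's postMessage to the main thread.
  private post: (msg: ReadReq) => void;

  constructor(post: (msg: ReadReq) => void) {
    this.post = post;
  }

  // Called where the wasm side needs bytes; resolves when the reply arrives.
  read(path: string, offset: number, size: number): Promise<Uint8Array> {
    const id = this.nextId++;
    return new Promise((resolve) => {
      this.pending.set(id, resolve);
      this.post({ verb: "fs.read_req", id, path, offset, size });
    });
  }

  // Called from the worker's message handler for every fs.read_res.
  onResponse(msg: ReadRes): void {
    const resolve = this.pending.get(msg.id);
    if (!resolve) return; // unknown or already-settled request
    this.pending.delete(msg.id);
    resolve(new Uint8Array(msg.data));
  }
}
```

With JSPI, the wasm caller is suspended while the promise returned by `read` is pending, so the C++ side still sees an ordinary blocking fread().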

Changes

Async file read support for WebAssembly

| Layer | File(s) | Summary |
|---|---|---|
| Capability detection and model load configuration | src/utils.ts, src/wllama.ts | canUseAsyncFileRead() exports JSPI support; model loading uses use_mmap: !canUseAsyncFileRead() and disables use_mlock. |
| C++ filesystem interception and read bridging | CMakeLists.txt, cpp/wllama-fs.h, cpp/wllama.cpp | Linker wrap options for fopen/fclose/fread/fseek/ftell; C++ wrappers maintain FILE*→path map, implement a 1MB read‑ahead cache, and provide js_file_read EM_ASYNC_JS bridge calling _wllama_js_file_read. |
| Worker thread async file read handler | src/workers-code/llama-cpp.js | Worker uses BigInt mmap pointers, optional mmap-backed allocation, writes directly into wasm heap, implements _wllama_js_file_read that posts fs.read_req and awaits fs.read_res, and guards message handling during JSPI suspension. |
| Main thread file blob storage and read response | src/worker.ts | ProxyToWorker tracks useAsyncFile and per-file Blobs, passes allocBuffer to fs.alloc, skips streaming when async enabled, and slices/postMessage fs.read_res with transferable ArrayBuffers for read requests. |
| Documentation and minor UX changes | README-dev.md, examples/main/src/components/GuideScreen.tsx, examples/main/src/utils/custom-models.tsx, llama.cpp | Adds a detailed "File access" doc section describing async reads and HeapFS; updates GuideScreen Safari Memory64 note; removes a max-size validation in examples; advances llama.cpp submodule pointer. |
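
To make the first row concrete, here is a hedged sketch of how JSPI detection could drive the load flags. It assumes JSPI is detected through the `WebAssembly.Suspending` constructor and the `WebAssembly.promising` function; the project's actual check in src/utils.ts may differ, and `detectJspi` and `loadFlags` are hypothetical names.

```typescript
// Assumption: JSPI is present when both WebAssembly.Suspending and
// WebAssembly.promising exist. The `wasm` parameter exists only so the
// check can be exercised with a stand-in object.
function detectJspi(wasm: unknown = (globalThis as any).WebAssembly): boolean {
  const w = wasm as { Suspending?: unknown; promising?: unknown } | null | undefined;
  return typeof w?.Suspending === "function" && typeof w?.promising === "function";
}

interface LoadFlags {
  use_mmap: boolean;
  use_mlock: boolean;
}

// Mirrors the summary: mmap is only used on the HeapFS fallback path,
// and mlock is always disabled.
function loadFlags(asyncFileRead: boolean): LoadFlags {
  return { use_mmap: !asyncFileRead, use_mlock: false };
}
```

A caller would combine the two as `loadFlags(detectJspi())`, so that environments without JSPI keep the memory-mapped HeapFS behavior.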

Sequence Diagram(s)

sequenceDiagram
  participant Wasm as Wasm code
  participant Wrap as __wrap_fread
  participant Cache as C++ cache
  participant Bridge as js_file_read
  participant Worker as Worker (llama-cpp.js)
  participant Main as ProxyToWorker (main thread)
  Wasm->>Wrap: fread(ptr,size,nmemb,FILE*)
  Wrap->>Cache: try_cache(FILE*, pos, bytes)
  alt cache hit
    Cache->>Wrap: copy bytes -> ptr
  else cache miss or large read
    Wrap->>Bridge: js_file_read(path, offset, size, out_ptr)
    Bridge->>Worker: calls _wllama_js_file_read
    Worker->>Main: postMessage fs.read_req
    Main->>Worker: postMessage fs.read_res (ArrayBuffer)
    Worker->>Bridge: resolve, write into wasm heap
    Bridge->>Wrap: return byte count
  end
  Wrap->>Wasm: return elements read
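
The cache-hit and cache-miss branches in the diagram can be sketched as follows. This is a TypeScript approximation of the idea only: the real cache lives in C++ (cpp/wllama-fs.h), and the rule used here for when a read bypasses the cache (`size >= CACHE_SIZE`), the `ReadAheadCache` class, and the `fetchRange` callback are assumptions.

```typescript
const CACHE_SIZE = 1024 * 1024; // 1MB, per the PR summary

type FetchRange = (offset: number, size: number) => Promise<Uint8Array>;

class ReadAheadCache {
  private data = new Uint8Array(0);
  private start = 0;
  private fetchRange: FetchRange;
  fetchCount = 0; // exposed so the effect of caching can be observed

  constructor(fetchRange: FetchRange) {
    this.fetchRange = fetchRange;
  }

  async read(offset: number, size: number): Promise<Uint8Array> {
    // Large reads go straight to the backing store.
    if (size >= CACHE_SIZE) {
      this.fetchCount++;
      return this.fetchRange(offset, size);
    }
    const hit =
      offset >= this.start && offset + size <= this.start + this.data.length;
    if (!hit) {
      // Miss: refill the cache with up to 1MB starting at the requested offset.
      this.fetchCount++;
      this.data = await this.fetchRange(offset, CACHE_SIZE);
      this.start = offset;
    }
    const from = offset - this.start;
    // The result may be shorter than `size` near end-of-file.
    return this.data.subarray(from, Math.min(from + size, this.data.length));
  }
}
```

The point of the read-ahead is that many small sequential reads, such as parsing file metadata, cost one round trip to the main thread instead of one per fread() call.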

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • ngxson/wllama#215: JSPI support and capability detection work related to async Blob/file-read plumbing.

Poem

🐰 I hop from heap to JS and back,

I fetch the bytes the models lack,
One-meg chunks in gentle streams,
I bridge the wasm-to-main-thread beams,
A caching rabbit in the stack.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add support for async file read' directly and clearly summarizes the main change of the PR: adding async file reading capability via JSPI and related infrastructure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
examples/main/src/components/GuideScreen.tsx (1)

51-51: ⚡ Quick win

Consider version-scoping the Safari support claim for maintainability.

The claim "Safari is not supported" is currently accurate (as of Safari 26.5 in May 2026, Memory64 remains unsupported in both stable and Technology Preview releases). However, versioning the statement will help prevent the guide from misleading users as browser support evolves.

For example: "Safari 26.5 and earlier do not support Memory64" or "As of May 2026, Memory64 is not supported in Safari."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/main/src/components/GuideScreen.tsx` at line 51, Update the static
message in the GuideScreen component that currently reads "Safari is not
supported" to include a version or date scope; locate the string in
GuideScreen.tsx (the text node containing "Safari is not supported, due to lack
of Memory64 support") and replace it with a version-scoped phrasing such as "As
of May 2026, Safari (≤ 26.5) does not support Memory64" or "Safari 26.5 and
earlier do not support Memory64" so the statement remains accurate as browser
support changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/wllama-fs.h`:
- Around line 12-43: Global state (s_file_path_map, ready, use_async,
cache_data, cache_start, cache_file) is mutated concurrently from
fopen/fclose/fread and must be serialized; either protect accesses with a mutex
(e.g., add a static std::mutex like s_fs_mutex and lock it in make_sure_ready(),
try_cache(), and all fopen/fclose/fread wrappers) or refactor the single cache
into per-FILE* entries stored in s_file_path_map so each FILE* has its own cache
and is updated under a mutex. Update all code paths that read/write these
symbols (make_sure_ready, try_cache and the functions in the 61-147 region) to
acquire the mutex while touching the globals to prevent races.

In `@README-dev.md`:
- Line 72: The README sentence currently mentions only fopen, fseek and fread
but should list the entire set of intercepted libc stdio functions implemented
in this PR; update that line to explicitly enumerate all wrapped stdio calls
(not just fopen, fseek, fread) by copying the exact function names from the
wrapper implementation (the module that defines the proxies to the main thread)
and replace the sentence so it accurately documents the full intercepted surface
(use the existing symbols fopen, fseek, fread to locate the sentence to edit).

In `@src/worker.ts`:
- Around line 244-263: fileReadResponse currently awaits chunk.arrayBuffer()
without error handling, so a rejection never posts a response and the
worker-side _wllama_js_file_read stays hung; wrap the blob read in try/catch and
on error post a failure message (e.g., verb 'fs.read_err' with an error string
or code) via this.worker.postMessage so the worker bridge can reject the pending
read; apply the same pattern to the analogous range/other read handler
referenced (same logic as in fileReadResponse) and ensure you include
identifying symbols: fileReadResponse, onRecvMsg, and _wllama_js_file_read to
locate and mirror the error-path behavior.
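
The error path requested for src/worker.ts can be sketched like this. The function name `fileReadResponse` and the verb `fs.read_err` come from the comment above; the signature, the `id` field, and the `BlobLike` interface are assumptions added to keep the example self-contained.

```typescript
type ReadReply =
  | { verb: "fs.read_res"; id: number; data: ArrayBuffer }
  | { verb: "fs.read_err"; id: number; error: string };

// Minimal structural type so the sketch can be exercised with a stub.
interface BlobLike {
  slice(start: number, end: number): { arrayBuffer(): Promise<ArrayBuffer> };
}

async function fileReadResponse(
  blob: BlobLike,
  id: number,
  offset: number,
  size: number,
  post: (msg: ReadReply, transfer: ArrayBuffer[]) => void
): Promise<void> {
  try {
    const data = await blob.slice(offset, offset + size).arrayBuffer();
    // Transfer the buffer instead of copying it to the worker.
    post({ verb: "fs.read_res", id, data }, [data]);
  } catch (e) {
    // Always answer, so the pending read on the worker side can settle.
    const error = e instanceof Error ? e.message : String(e);
    post({ verb: "fs.read_err", id, error }, []);
  }
}
```

On the worker side, the handler for `fs.read_err` would reject or resolve the pending read with a failure value so that the suspended wasm call returns instead of hanging.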

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 583cb48d-dbd4-404f-9c29-524d2062ee53

📥 Commits

Reviewing files that changed from the base of the PR and between e923fba and 2d693a9.

⛔ Files ignored due to path filters (1)
  • src/wasm/wllama.wasm is excluded by !**/*.wasm
📒 Files selected for processing (12)
  • CMakeLists.txt
  • README-dev.md
  • cpp/wllama-fs.h
  • cpp/wllama.cpp
  • examples/main/src/components/GuideScreen.tsx
  • llama.cpp
  • src/utils.ts
  • src/wasm/wllama.js
  • src/wllama.ts
  • src/worker.ts
  • src/workers-code/generated.ts
  • src/workers-code/llama-cpp.js

Comment thread cpp/wllama-fs.h
Comment thread README-dev.md
Comment thread src/worker.ts Outdated
ngxson (Owner, Author) commented May 17, 2026

Ok so I was able to load the Qwen3VL-8B-Q4_K_M with this PR, all layers offloaded to GPU:

[Three screenshots attached]

coderabbitai Bot (Contributor) left a comment

🧹 Nitpick comments (1)
examples/main/src/utils/custom-models.tsx (1)

60-64: ⚡ Quick win

Remove the commented-out validation code or restore it conditionally based on async file read availability.

Leaving code commented out is a code smell. The PR removes size validation to support large files via async reads (JSPI), but the fallback path (non-JSPI environments) still uses HeapFS with heap allocation, which has size constraints. Either:

  1. If async reads are guaranteed: Remove the commented code entirely.
  2. If HeapFS fallback is still used: Restore the validation conditionally to prevent cryptic heap allocation failures when users without JSPI support load oversized files.

Currently, users without JSPI will encounter silent runtime errors instead of the helpful upfront message about splitting shards.
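
Option 2 could look roughly like the sketch below. `MAX_GGUF_SIZE` and the `sizes.some(...)` check are named in the review, but the 2 GiB value, the function name `assertLoadableSizes`, the `asyncFileRead` flag, and the exact error text are assumptions for illustration.

```typescript
// Assumption: the real MAX_GGUF_SIZE is defined in the example app; 2 GiB is a
// placeholder chosen to match the "> 2GB" figure in the PR description.
const MAX_GGUF_SIZE = 2 * 1024 * 1024 * 1024;

// Enforce the size limit only on the HeapFS fallback path (no JSPI).
function assertLoadableSizes(sizes: number[], asyncFileRead: boolean): void {
  if (asyncFileRead) return; // on-demand reads are not bound by the heap limit
  if (sizes.some((s) => s >= MAX_GGUF_SIZE)) {
    throw new Error("GGUF file is too big; split it into smaller shards");
  }
}
```

Gating the check this way keeps the upfront error message for browsers that fall back to heap allocation, while leaving large single-file models loadable where async reads are available.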

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/main/src/utils/custom-models.tsx` around lines 60 - 64, The
commented-out size validation around MAX_GGUF_SIZE (the sizes.some(...) check)
must be either removed or restored conditionally: detect whether async file
reads / JSPI are available (the code path that avoids HeapFS) and only enforce
the sizes.some((s) => s >= MAX_GGUF_SIZE) -> throw new Error(...) guard when
running the HeapFS fallback (non-JSPI) so users get a clear "GGUF file is too
big" message instead of cryptic heap allocation failures; if you guarantee
JSPI/async reads for all supported runtimes, remove the commented block
entirely, otherwise re-enable it and gate it on the runtime/feature-detection
flag used elsewhere in this module.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1bee32c-82f1-49a6-9b49-381dc6949138

📥 Commits

Reviewing files that changed from the base of the PR and between cd746d7 and b9fe57a.

📒 Files selected for processing (1)
  • examples/main/src/utils/custom-models.tsx

@ngxson ngxson merged commit e8e134f into master May 17, 2026
6 checks passed
ngxson (Owner, Author) commented May 17, 2026

Quite insane but I managed to run Qwen3.6 27B 100% on WebGPU. Still trying to add MTP support...

Screen.Recording.2026-05-18.at.00.55.23.mov

felladrin (Contributor) commented

> Quite insane but I managed to run Qwen3.6 27B 100% on WebGPU. Still trying to add MTP support...

Great achievement!!

reeselevine (Collaborator) commented

@ngxson not sure if you are running into issues with MTP support on Qwen specifically, but it looks like the main PR didn't add the code that the WebGPU backend needs for GATED_DELTA_NET. I opened a PR here: ggml-org/llama.cpp#23299

3 participants