Parallel Worker Pool for VlConverter + V8 Platform Init Safety #242
Merged
Replace the single-threaded VlConverterRuntime with a WorkerPool that supports configurable numbers of Deno workers. Each worker gets its own tokio LocalSet and JsRuntime, providing true parallelism for concurrent conversion requests.

Key changes:
- WorkerPool struct with round-robin sender selection via AtomicUsize
- spawn_worker_pool(num_workers) creates N independent worker threads
- VlConverter is now Clone (wraps Arc<VlConverterInner>)
- with_num_workers() constructor and num_workers() accessor
- V8 platform initialization is one-time via Once/call_once
- handle_command() helper centralizes command dispatch in worker loop
- Worker startup uses sync channel to propagate initialization errors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
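The startup handshake mentioned in this commit (a sync channel that propagates initialization errors back to the spawning thread) might look roughly like the following. This is a minimal standard-library sketch, not the code from the PR: `init_runtime` and `spawn_worker` are hypothetical names, and the real worker builds a Deno runtime instead of returning a string.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for building a per-worker runtime; fails on request.
fn init_runtime(fail: bool) -> Result<String, String> {
    if fail {
        Err("runtime init failed".to_string())
    } else {
        Ok("runtime ready".to_string())
    }
}

/// Spawn a worker and block until it reports whether startup succeeded.
fn spawn_worker(fail: bool) -> Result<thread::JoinHandle<()>, String> {
    // Capacity 1 lets the worker report its status without blocking.
    let (ready_tx, ready_rx) = mpsc::sync_channel::<Result<(), String>>(1);
    let handle = thread::spawn(move || match init_runtime(fail) {
        Ok(_runtime) => {
            let _ = ready_tx.send(Ok(()));
            // A real worker would enter its command loop here.
        }
        Err(e) => {
            let _ = ready_tx.send(Err(e));
        }
    });
    match ready_rx.recv() {
        Ok(Ok(())) => Ok(handle),
        Ok(Err(e)) => Err(e),
        // The sender was dropped without a message, e.g. the worker panicked.
        Err(_) => Err("worker exited before reporting startup status".to_string()),
    }
}

fn main() {
    assert!(spawn_worker(false).is_ok());
    assert_eq!(spawn_worker(true).unwrap_err(), "runtime init failed");
}
```

The point of the handshake is that the caller learns about a failed worker at spawn time rather than on the first conversion request.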
VlConverter is now Clone and all conversion methods take `&self` since internal state is managed via Arc<VlConverterInner>. Drop the `mut` binding in tests and CLI callers to resolve unused_mut warnings. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
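As an illustration of what "Clone via Arc" implies for callers, here is a small sketch using stand-in types. `Converter`, `ConverterInner`, and `shares_pool_with` are hypothetical and only mirror the shape described in this commit; they are not the crate's types.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the shared inner state (worker pool, config).
struct ConverterInner {
    num_workers: usize,
}

/// Cloning this handle clones the Arc, not the inner state.
#[derive(Clone)]
struct Converter {
    inner: Arc<ConverterInner>,
}

impl Converter {
    fn with_num_workers(n: usize) -> Result<Self, String> {
        if n == 0 {
            return Err("num_workers must be at least 1".to_string());
        }
        Ok(Self { inner: Arc::new(ConverterInner { num_workers: n }) })
    }

    /// Takes `&self`: no exclusive borrow is needed to read shared state.
    fn num_workers(&self) -> usize {
        self.inner.num_workers
    }

    /// True when both handles point at the same inner state.
    fn shares_pool_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.inner, &other.inner)
    }
}

fn main() {
    let a = Converter::with_num_workers(2).unwrap();
    let b = a.clone(); // second handle to the same inner state
    let c = Converter::with_num_workers(2).unwrap(); // independent state
    assert!(a.shares_pool_with(&b));
    assert!(!a.shares_pool_with(&c));
    assert_eq!(b.num_workers(), 2);
}
```

Because every method takes `&self`, callers no longer need a `mut` binding, which is why the tests and CLI callers could drop it.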
Refactor the Python bindings to use a thread-safe, parallel VlConverter:
- Switch VL_CONVERTER from Mutex<VlConverterRs> to RwLock<Arc<VlConverterRs>> so multiple conversion futures can run concurrently
- Add converter_read_handle() and run_converter_future() helpers to centralize the read-lock + block_on pattern across all conversion functions
- Allow Python's GIL to be released during blocking Deno calls via py.allow_threads() inside run_converter_future
- Add set_num_workers(n) / get_num_workers() to configure and inspect the worker pool at runtime; set_num_workers replaces the global converter with a freshly-spawned one
- Expose the new functions in vl_convert.pyi with NumPy-style docstrings

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Verify set/get_num_workers behavior:
- Default is 1 worker
- set_num_workers rejects zero
- Workers can be reconfigured while conversions are running
- Parallel conversions succeed with multiple configured workers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
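The RwLock<Arc<...>> pattern in this commit (a short read lock to clone the handle, a short write lock to swap it) can be sketched with the standard library alone. `Converter` and `ConverterSlot` below are hypothetical stand-ins, and the sketch leaves out the GIL release and the block_on call that the real helpers perform.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the converter; it only records a worker count.
struct Converter {
    num_workers: usize,
}

/// Plays the role of the global `RwLock<Arc<...>>` holder.
struct ConverterSlot {
    current: RwLock<Arc<Converter>>,
}

impl ConverterSlot {
    fn new(num_workers: usize) -> Self {
        Self { current: RwLock::new(Arc::new(Converter { num_workers })) }
    }

    /// Hold the read lock only long enough to clone the Arc.
    fn read_handle(&self) -> Arc<Converter> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Swap in a freshly built converter under a brief write lock.
    fn set_num_workers(&self, n: usize) -> Result<(), String> {
        if n == 0 {
            return Err("num_workers must be at least 1".to_string());
        }
        let fresh = Arc::new(Converter { num_workers: n });
        *self.current.write().unwrap() = fresh;
        Ok(())
    }
}

fn main() {
    let slot = ConverterSlot::new(1);
    let in_flight = slot.read_handle(); // a caller mid-conversion
    slot.set_num_workers(4).unwrap();
    assert_eq!(in_flight.num_workers, 1); // old handle stays valid
    assert_eq!(slot.read_handle().num_workers, 4);
    assert!(slot.set_num_workers(0).is_err());
}
```

The old converter is dropped only when the last in-flight handle goes away, so reconfiguring never invalidates work already underway.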
Add a "Parallel Workers" section showing how to use set_num_workers() and get_num_workers() to scale throughput for batch workloads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On failure, check_png() now writes the actual and expected PNGs to vl-convert-python/tests/failed/ so rendering differences can be inspected. The macOS/Windows Python CI job uploads that directory as an artifact when tests fail. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps the reqwest fetch in `custom_string_resolver` with `backon` exponential backoff (500ms–10s, up to 4 retries). Retries on network errors and transient HTTP failures (429, 5xx); permanent errors (404, 403, etc.) short-circuit immediately without retrying. Fixes intermittent CI failures where Wikimedia rate-limits a second image fetch within the same test run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
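To make the retry policy in this commit concrete, here is a hand-rolled standard-library sketch of the same idea. It does not use backon or reqwest; `FetchError`, `is_transient`, `retry_with_backoff`, and `simulate` are hypothetical names, and the sleep function is injected so the example records delays instead of waiting.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum FetchError {
    Network,
    Status(u16),
}

/// Network errors, 429, and 5xx are worth retrying; other statuses are not.
fn is_transient(err: &FetchError) -> bool {
    match err {
        FetchError::Network => true,
        FetchError::Status(code) => *code == 429 || (500u16..600).contains(code),
    }
}

/// Retry `op` with a doubling delay, capped at `max_delay`.
fn retry_with_backoff<T>(
    mut op: impl FnMut() -> Result<T, FetchError>,
    mut sleep: impl FnMut(Duration),
    min_delay: Duration,
    max_delay: Duration,
    max_retries: u32,
) -> Result<T, FetchError> {
    let mut delay = min_delay;
    let mut retries = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if is_transient(&e) && retries < max_retries => {
                sleep(delay);
                delay = (delay * 2).min(max_delay);
                retries += 1;
            }
            // Permanent error, or retries exhausted.
            Err(e) => return Err(e),
        }
    }
}

/// Replay canned responses with the 500 ms to 10 s, 4-retry settings.
fn simulate(
    responses: Vec<Result<&'static str, FetchError>>,
) -> (Result<&'static str, FetchError>, Vec<Duration>) {
    let mut queue = responses.into_iter();
    let mut delays = Vec::new();
    let result = retry_with_backoff(
        || queue.next().unwrap_or(Err(FetchError::Network)),
        |d| delays.push(d), // record instead of sleeping
        Duration::from_millis(500),
        Duration::from_secs(10),
        4,
    );
    (result, delays)
}

fn main() {
    let (result, delays) = simulate(vec![
        Err(FetchError::Status(429)),
        Err(FetchError::Status(503)),
        Ok("image bytes"),
    ]);
    assert_eq!(result, Ok("image bytes"));
    assert_eq!(delays, vec![Duration::from_millis(500), Duration::from_secs(1)]);
}
```

A 404 in the first position returns immediately with no recorded delay, which is the short-circuit behavior described above.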
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
We only need tokio-sleep + std; gloo-timers is the WASM timer backend and has no use in this crate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…selection

Replaces the AtomicBool dispatch_pending approach with a proper outstanding-request counter per worker. The OutstandingTicket RAII guard increments on selection and decrements on drop, so the counter accurately covers three lifecycle phases: waiting on a full channel, queued in the channel, and actively executing on the worker thread.

Key design points:
- QueuedCommand wraps VlConvertCommand + OutstandingTicket so the ticket travels with the command through the MPSC channel; the worker drops _ticket after handle_command completes, not when it dequeues
- dispatch_cursor rotates the scan start to break ties without index-0 bias
- Early-exit when outstanding == 0 avoids scanning remaining workers
- Cancellation-safe: if a send future is dropped, the ticket drops with it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
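The least-outstanding selection with an RAII ticket can be sketched as follows. The names OutstandingTicket and dispatch_cursor come from the commit message; `Pool`, `select`, and the body of the scan are assumptions made for illustration, and the channels and worker threads are omitted.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// RAII guard: while a ticket is alive it counts as one outstanding request.
struct OutstandingTicket {
    counter: Arc<AtomicUsize>,
}

impl OutstandingTicket {
    fn new(counter: Arc<AtomicUsize>) -> Self {
        counter.fetch_add(1, Ordering::Relaxed);
        Self { counter }
    }
}

impl Drop for OutstandingTicket {
    fn drop(&mut self) {
        self.counter.fetch_sub(1, Ordering::Relaxed);
    }
}

/// Hypothetical pool holding one outstanding-request counter per worker.
struct Pool {
    outstanding: Vec<Arc<AtomicUsize>>,
    dispatch_cursor: AtomicUsize,
}

impl Pool {
    fn new(n: usize) -> Self {
        Self {
            outstanding: (0..n).map(|_| Arc::new(AtomicUsize::new(0))).collect(),
            dispatch_cursor: AtomicUsize::new(0),
        }
    }

    /// Pick the least-loaded worker, scanning from a rotating start index.
    fn select(&self) -> (usize, OutstandingTicket) {
        let n = self.outstanding.len();
        let start = self.dispatch_cursor.fetch_add(1, Ordering::Relaxed) % n;
        let mut best = start;
        let mut best_load = usize::MAX;
        for offset in 0..n {
            let idx = (start + offset) % n;
            let load = self.outstanding[idx].load(Ordering::Relaxed);
            if load < best_load {
                best = idx;
                best_load = load;
                if load == 0 {
                    break; // an idle worker cannot be beaten
                }
            }
        }
        (best, OutstandingTicket::new(Arc::clone(&self.outstanding[best])))
    }
}

fn main() {
    let pool = Pool::new(3);
    let (a, _ticket_a) = pool.select();
    let (b, ticket_b) = pool.select();
    let (c, _ticket_c) = pool.select();
    assert_eq!((a, b, c), (0, 1, 2));
    drop(ticket_b); // worker 1 finishes its command
    let (d, _ticket_d) = pool.select();
    assert_eq!(d, 1); // the only idle worker is chosen
}
```

In the design described above the ticket would travel inside the queued command, so the counter drops only after the worker has finished executing it.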
jonmmease
commented
Feb 21, 2026
Collaborator
Author
jonmmease
left a comment
completed my review
if: failure()
with:
  name: failed-images-python-${{ matrix.options[0] }}
  path: vl-convert-python/tests/failed/
Collaborator
Author
working out intermittent image test failures was a side quest that led to using the backon retry logic
Collaborator
Author
retry logic for fetching images from urls was helpful for improving ci reliability, and should help end user usage as well
fn ensure_v8_platform_initialized() {
    static V8_INIT: Once = Once::new();
    V8_INIT.call_once(|| deno_core::JsRuntime::init_platform(None, false));
}
Collaborator
Author
this is the key to avoiding the segfault
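For readers who want to see the guard's behavior in isolation, here is a standard-library sketch that swaps the deno_core call for a counter. `ensure_platform_initialized`, `spawn_workers`, and `INIT_CALLS` are hypothetical stand-ins; the sketch only demonstrates the once-only, parent-thread-first ordering, not V8 itself.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;
use std::thread;

/// Counts how many times the (stand-in) platform init body actually ran.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the guard above; the real body calls into deno_core.
fn ensure_platform_initialized() {
    static PLATFORM_INIT: Once = Once::new();
    PLATFORM_INIT.call_once(|| {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
    });
}

/// Initialize on the spawning thread first, then create the workers.
fn spawn_workers(n: usize) -> Vec<thread::JoinHandle<()>> {
    ensure_platform_initialized();
    (0..n)
        .map(|_| {
            thread::spawn(|| {
                // A real worker would build its runtime and serve commands here.
            })
        })
        .collect()
}

fn main() {
    for _ in 0..3 {
        for handle in spawn_workers(4) {
            handle.join().unwrap();
        }
    }
    // Three pools were spawned, but the init body ran only once.
    assert_eq!(INIT_CALLS.load(Ordering::SeqCst), 1);
}
```

The ordering matters more than the once-only property: the guard runs before `thread::spawn`, so every worker is created after the platform has been initialized.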
Parallel Worker Pool for VlConverter + V8 Platform Init Safety
Summary
Introduces a configurable parallel worker pool for VlConverter, Python worker-count APIs (set_num_workers / get_num_workers), opt-in warmup APIs (warm_up / warm_up_workers) to pre-initialize workers, and a proactive V8 platform initialization guard that prevents a class of sporadic SIGSEGV crashes when spawning multiple Deno isolates from a multithreaded process. Default behavior is unchanged (1 worker, lazy startup unless warmup is called).

Also adds exponential-backoff retry for remote image loading (image_loading.rs), a pre-existing reliability gap surfaced by CI while testing this PR.

Motivation
Background: The #206 segfault and the #237 workaround
Issue #206 reported a segfault when creating multiple VlConverter instances: each instance was spawning its own OS thread with its own V8 isolate, and the second isolate crashed inside JsRuntime::new_inner. PR #237 fixed the crash by collapsing to a single shared global worker thread (VL_CONVERTER_RUNTIME), so only one V8 isolate ever existed in the process. That fixed the segfault, but it made all conversions from all instances serialize through the same thread, with no path to concurrency.

What was actually causing the crash

Since V8 11.6, V8 enforces W^X on JIT pages using Intel/AMD Memory Protection Keys for Userspace (PKU/MPK). PKU state is stored in the per-thread PKRU register, which is inherited from parent to child at pthread_create() time. V8 writes the correct PKRU state during platform initialization. Any thread not descended from the platform-initializing thread gets the wrong PKRU state for V8's JIT pages, causing SIGSEGV not at startup but sporadically later, when the JIT compiler activates.

The original multi-instance code created new worker threads from arbitrary calling threads (not the platform-initializing thread), so V8's JIT pages were inaccessible from those workers. The global-singleton workaround (#237) was immune because only one worker thread ever existed, but it gave up concurrency to get there.
This PR: the proper fix enables multiple workers
Calling init_platform once on the parent thread (via ensure_v8_platform_initialized) before spawning any worker threads ensures that all workers are descendants of the platform-initializing thread and inherit the correct PKRU state. This is a proactive workaround, not runtime detection: it unconditionally calls init_platform once via a std::sync::Once guard. With this guard in place, N workers can safely coexist, which removes the architectural bottleneck from #237 and enables genuine parallel execution.

References:
- deno_core::JsRuntime docs: "Since V8 11.6, all runtimes must have a common parent thread that initialized the V8 platform."
- deno_core PR #471: Segfault fix on PKU-enabled CPUs
- rusty_v8 issue #1381: Direct SIGSEGV report on Intel 13th-gen hardware
- deno issue #20495: "Unless V8 platform is initialized on main thread the segfaults start appearing once JIT kicks in."

What Changed
Rust: vl-convert-rs/src/converter.rs

VlConverter is now Clone via Arc<VlConverterInner>. All conversion methods take &self instead of &mut self.

New API:
- VlConverter::with_num_workers(n: usize) -> Result<Self, AnyError>: construct with a specific worker count (validated ≥ 1)
- VlConverter::num_workers(&self) -> usize: query the configured worker count
- VlConverter::warm_up(&self) -> Result<(), AnyError>: optionally pre-spawn workers before the first conversion request
- VlConverter::new(): unchanged, defaults to 1 worker

Clone semantics: #[derive(Clone)] on VlConverter clones the Arc<VlConverterInner>; it increments the reference count and does not copy the pool. The clone and the original share the same worker pool, bundle cache, and configuration. The pool is torn down only when all clones are dropped. To get an independent pool (separate workers, separate memory), construct a new VlConverter::with_num_workers(n) rather than cloning.

Per-instance pools (behavioral change from #237): Previously all VlConverter instances shared a single global VL_CONVERTER_RUNTIME thread. Now each instance owns its own worker pool, so separate new() calls no longer share state. Cloning is the way to get multiple handles to one pool.

Worker pool model: One dedicated OS thread per worker, each with a new_current_thread() Tokio runtime + LocalSet (required because Deno's MainWorker uses !Send types). Per-worker bounded channels (capacity 32) provide backpressure. Dispatch uses a least-outstanding strategy: each worker owns an Arc<AtomicUsize> counter; an OutstandingTicket RAII guard increments the counter at dispatch time and decrements it when the worker finishes the command, covering both queue time and execution time. next_sender picks the worker with the lowest outstanding count; a rotating dispatch_cursor breaks ties without index-0 bias. Per-worker dispatch (vs. a single shared queue) is required because each Deno runtime is stateful and cannot have work redistributed mid-execution. Pool startup remains lazy by default, with a new explicit warm_up() path that triggers the same startup handshake ahead of the first request.

V8 init guard:
Called at the start of spawn_worker_pool, before any worker thread is created.

Send retry: If a worker's channel is closed (worker died), send_command_with_retry resets the pool and retries once. If the retry also fails, the error is returned to the caller.

Worker-local transfer state: JSON args / MessagePack scenegraph payloads are now stored per worker in Deno OpState (WorkerTransferState), removing process-wide contention on JSON_ARGS / MSGPACK_RESULTS / NEXT_ID. JsonArgGuard and MsgpackResultGuard remain RAII-based, but now clean up worker-local state on all error paths.

Rust: vl-convert-rs/src/image_loading.rs

The existing custom_string_resolver fetched remote images with a single reqwest call and silently dropped the image on any error. CI exposed this gap: the test_pdf[remote_images] test makes two fetches to Wikimedia in the same run (one for Vega, one for Vega-Lite), and the second was being rate-limited (HTTP 429 / 503), causing the Firefox logo to vanish from the PDF output.

The fix wraps the fetch with backon exponential-backoff retry.

Retry policy:
- Backoff: exponential, 500 ms to 10 s, up to 4 retries
- Network errors and transient HTTP failures (429, 5xx): retried
- Permanent errors (404, 403, etc.): not retried; the resolver returns (None, None) via Ok so backon does not retry

Python: vl-convert-python/src/lib.rs + vl_convert.pyi

The global converter state changed from Mutex<VlConverterRs> to RwLock<Arc<VlConverterRs>>. The read lock is held only briefly to clone the Arc; conversion runs after the lock is released, so concurrent conversion calls don't block each other. set_num_workers write-locks briefly to swap in a new Arc; in-flight callers hold the old Arc and complete normally.

New public API: set_num_workers, get_num_workers, and warm_up_workers.

GIL release: Vega-Lite and Vega conversion functions (those backed by Deno/Tokio) now call py.allow_threads(|| ...) to release the Python GIL during the blocking Rust execution, enabling true concurrency when called from Python threads. (Note: svg_to_png, svg_to_jpeg, and svg_to_pdf are synchronous and not affected.)

Not asyncio-compatible. These functions call block_on internally. Calling them from an asyncio event loop without an executor will stall the loop. Use loop.run_in_executor(None, ...) from async contexts.

Tests and Docs
New Rust tests: test_with_num_workers_rejects_zero, test_num_workers_reports_configured_value, test_warm_up_spawns_pool_without_request, test_warm_up_is_idempotent, test_parallel_conversions_with_shared_converter.

New Python tests (test_workers.py): default count, zero-worker rejection, 16 parallel conversions with 4 workers, warmup-before-submit scenario, and set_num_workers during concurrent submissions.

Python image comparison tests now save failed images to tests/failed/ on assertion failure; CI uploads them as an artifact on job failure for post-mortem diagnosis.

README: added "Configure Worker Parallelism" section.
User-Facing Impact
- Optional warm_up() / warm_up_workers() calls avoid first-request initialization latency.
- Rust conversion methods change from &mut self to &self (backwards-compatible; enables shared access without exclusive borrow).
- set_num_workers / get_num_workers / warm_up_workers are additive.

Memory note: Each worker is a full Deno/V8 runtime with Vega-Lite loaded. Pool state is per VlConverter instance, so reuse or clone one converter handle rather than constructing multiple instances unless independent pools are intentional. See the README's "Parallel Workers" section for usage guidance.

Review Tour
Core Change
- vl-convert-rs/src/converter.rs: New types near top: WorkerPool (~line 60), ensure_v8_platform_initialized (~line 89). VlConverterInner / VlConverter are further down (~lines 1739–1750). Worker pool implementation: spawn_worker_pool, get_or_spawn_sender, warm_up, send_command_with_retry, request. RAII guards (JsonArgGuard, MsgpackResultGuard) at top of file.
- vl-convert-rs/src/image_loading.rs: Retry logic with backon. No behavior change on the happy path; only affects failed/throttled fetches.

Python Integration
- vl-convert-python/src/lib.rs: VL_CONVERTER declaration (global RwLock<Arc<>>), converter_read_handle, run_converter_future (clones the Arc, releases GIL, drives the future), set_num_workers / get_num_workers / warm_up_workers.

Tests and Types
- vl-convert-python/tests/test_workers.py: New file. Five focused tests for the worker API.
- vl-convert-python/vl_convert.pyi: Type stubs for set_num_workers, get_num_workers, and warm_up_workers.

Mechanical

Minor let mut → let changes in vl-convert/src/main.rs, test_specs.rs, test_themes.rs, and docs-only addition to README.md.