Skip to content

feat: add multi-GPU pipeline parallelism via split-mode config to embed addon#1833

Merged
gianni-cor merged 10 commits into
tetherto:mainfrom
donriddo:feat/embed-multi-gpu-pipeline-parallelism
May 2, 2026
Merged

feat: add multi-GPU pipeline parallelism via split-mode config to embed addon#1833
gianni-cor merged 10 commits into
tetherto:mainfrom
donriddo:feat/embed-multi-gpu-pipeline-parallelism

Conversation

@donriddo

@donriddo donriddo commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

🎯 What problem does this PR solve?

The embed addon currently pins model inference to a single GPU device. On multi-GPU systems this leaves additional GPUs idle, preventing users from leveraging pipeline (layer) or tensor (row) parallelism for embedding workloads.

📝 How does it solve it?

  • Adds split-mode config option (none | layer | row) to BertModel setupParams, accepting both hyphen and underscore variants and case-insensitive values. When set to layer or row, the --device flag is omitted so llama.cpp distributes embedding layers across all available GPUs.
  • Adds tensor-split config key (comma-separated proportions) for controlling per-GPU layer allocation in split mode.
  • Updates tryMainGpuFromMap in BackendSelection to accept both main-gpu and main_gpu variants with conflict detection, matching the LLM addon.
  • Adds getEffectiveGpuDeviceCount() to BackendSelection for GPU inventory queries.
  • Adds BertModel::getCommonParams() accessor to expose parsed params for testing.
  • Adds split-mode and tensor-split fields to GGMLConfig in index.d.ts.
  • Bumps version to 0.15.0.

🧪 How was it tested?

  • C++ unit tests (test_bert_model.cpp): 8 new split-mode cases covering none, layer, row, case-insensitive input, underscore variant, CPU fallback clears GPU params, invalid value rejection, and both keys reject.
  • C++ unit tests (test_backend_selection.cpp): 9 new getEffectiveGpuDeviceCount cases covering zero devices, CPU-only, single dGPU, single iGPU, two dGPUs, dGPU+iGPU, two dGPUs+iGPU, two iGPUs, and accel/CPU ignored.
  • JS integration tests (multi-gpu.test.js): 4 tests gated on QVAC_HAS_MULTI_GPU=1 exercising split-mode=layer, split-mode=row, default single-device, and split-mode=layer with tensor-split and main-gpu.

🔌 API Changes

// GGMLBertArgs.config now accepts:
{
  'split-mode': 'none' | 'layer' | 'row',    // optional, default none
  'tensor-split': '1,1',                     // optional, proportional split across GPUs
  'main-gpu': 0 | 'integrated' | 'dedicated' // optional, primary GPU selection
}

@donriddo donriddo requested review from a team as code owners April 30, 2026 12:59
CPU fallback in setupParams was missing two details present in the final
LLM implementation:

- Set params.main_gpu = -1 on CPU fallback so llama.cpp does not retain
  a stale GPU index.
- Reset the local splitMode variable to LLAMA_SPLIT_MODE_NONE after the
  CPU-fallback warning so the --device gate below emits --device correctly
  instead of silently suppressing it when the requested split mode was
  layer or row.

Also add two missing BackendSelection unit tests for the main_gpu underscore
alias and both-key rejection introduced in tryMainGpuFromMap, mirroring the
coverage in the LLM package.

@gianni-cor gianni-cor left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for porting this over from the LLM addon. The core C++ changes look aligned with the upstream LLM implementation, but one important test-wiring difference needs fixing before merge.

packages/qvac-lib-infer-llamacpp-embed/test/integration/multi-gpu.test.js is added, but the embed package's test:integration script still runs only bare test/integration/addon.test.js. The embed PR workflow calls that script directly, so the new multi-GPU integration coverage will not run in desktop CI. In the LLM addon upstream change, test:integration generates/runs the full test/integration/*.test.js set, which is why its new multi-gpu.test.js is exercised.

Please update the embed integration test command/generation path so multi-gpu.test.js is included in CI, or otherwise wire this file into the existing integration test runner.

@gianni-cor

Copy link
Copy Markdown
Contributor

Please also fix the failing/blocked cpp-lint check before this is re-reviewed.

@github-actions

github-actions Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (2/1)
- 1 Team Lead OR Management approval ✅ (2/1)

**Bypass rule:** Triggered (2+ Team Lead approvals (Tier 1 exception)). This PR is approved regardless of tier.

---
*This comment is automatically updated when reviews change.*

donriddo added 5 commits May 1, 2026 12:06
test:integration was hardcoded to addon.test.js, so multi-gpu.test.js
and multi-instance.test.js were never executed in desktop CI. Switch to
the same generate-then-run-all pattern used by the LLM addon: brittle -r
generates test/integration/all.js from the full *.test.js glob, then
bare runs it.
Apply clang-format and clang-tidy fixes flagged by the cpp-lint job:
- Use std::ranges::transform in BackendSelection.cpp and BertModel.cpp
- Drop else-after-return in parseMainGpu
- Rename short iterator names (it -> foundIt/configIt/splitModeIt)
- Use designated initializers for BackendInterface and BertEmbeddings::Layout
- Drop redundant (void) on BackendInterface function pointer
- Move pointer-arithmetic NOLINT to the diagnostic line in batchDecode
- Extract parseSplitMode helper to bring setupParams cognitive complexity
  back under the threshold
- Suppress non-const-global and macro-usage diagnostics in logging.hpp
- Reorder includes in test_bert_model.cpp and collapse getCommonParams
  to a single line for clang-format
@gianni-cor

Copy link
Copy Markdown
Contributor

/review

@gianni-cor

Copy link
Copy Markdown
Contributor

/review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants