feat: add multi-GPU pipeline parallelism via split-mode config to embed addon #1833
CPU fallback in setupParams was missing two details present in the final LLM implementation:
- Set params.main_gpu = -1 on CPU fallback so llama.cpp does not retain a stale GPU index.
- Reset the local splitMode variable to LLAMA_SPLIT_MODE_NONE after the CPU-fallback warning so the --device gate below emits --device correctly instead of silently suppressing it when the requested split mode was layer or row.

Also add two missing BackendSelection unit tests for the main_gpu underscore alias and both-key rejection introduced in tryMainGpuFromMap, mirroring the coverage in the LLM package.
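The two CPU-fallback fixes described above can be sketched roughly as follows. This is not the addon's actual setupParams code: `SplitMode`, `Params`, and `deviceArgs` are hypothetical stand-ins for llama.cpp's split-mode enum, its params struct, and the argument-building step, and the gate condition (emit `--device` only when the split mode is none) is an assumption inferred from the commit message.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for llama.cpp's split-mode enum
// (LLAMA_SPLIT_MODE_NONE / LAYER / ROW).
enum class SplitMode { None, Layer, Row };

// Hypothetical stand-in for the params struct filled in by setupParams.
struct Params {
  int mainGpu = 0;                        // mirrors params.main_gpu
  SplitMode splitMode = SplitMode::None;  // mirrors params.split_mode
};

// Sketch of the fallback handling plus the --device gate.
// Assumption: --device is only emitted when no split mode is active.
std::vector<std::string> deviceArgs(Params& params, SplitMode requested,
                                    bool cpuFallback,
                                    const std::string& device) {
  SplitMode splitMode = requested;
  if (cpuFallback) {
    // (the real code logs a CPU-fallback warning at this point)
    params.mainGpu = -1;          // do not keep a stale GPU index
    splitMode = SplitMode::None;  // so the gate below still emits --device
  }
  params.splitMode = splitMode;

  std::vector<std::string> args;
  if (splitMode == SplitMode::None) {
    args.push_back("--device");
    args.push_back(device);
  }
  return args;
}
```

Without the reset of the local `splitMode`, a request for layer or row mode that fell back to CPU would skip the gate and return no `--device` argument at all, which is the silent suppression the commit fixes.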
gianni-cor
left a comment
Thanks for porting this over from the LLM addon. The core C++ changes look aligned with the upstream LLM implementation, but one important test-wiring difference needs fixing before merge.
packages/qvac-lib-infer-llamacpp-embed/test/integration/multi-gpu.test.js is added, but the embed package's test:integration script still runs only bare test/integration/addon.test.js. The embed PR workflow calls that script directly, so the new multi-GPU integration coverage will not run in desktop CI. In the LLM addon upstream change, test:integration generates/runs the full test/integration/*.test.js set, which is why its new multi-gpu.test.js is exercised.
Please update the embed integration test command/generation path so multi-gpu.test.js is included in CI, or otherwise wire this file into the existing integration test runner.
Please also fix the failing/blocked |
Tier-based Approval Status |
test:integration was hardcoded to addon.test.js, so multi-gpu.test.js and multi-instance.test.js were never executed in desktop CI. Switch to the same generate-then-run-all pattern used by the LLM addon: brittle -r generates test/integration/all.js from the full *.test.js glob, then bare runs it.
Apply clang-format and clang-tidy fixes flagged by the cpp-lint job:
- Use std::ranges::transform in BackendSelection.cpp and BertModel.cpp
- Drop else-after-return in parseMainGpu
- Rename short iterator names (it -> foundIt/configIt/splitModeIt)
- Use designated initializers for BackendInterface and BertEmbeddings::Layout
- Drop redundant (void) on BackendInterface function pointer
- Move pointer-arithmetic NOLINT to the diagnostic line in batchDecode
- Extract parseSplitMode helper to bring setupParams cognitive complexity back under the threshold
- Suppress non-const-global and macro-usage diagnostics in logging.hpp
- Reorder includes in test_bert_model.cpp and collapse getCommonParams to a single line for clang-format
/review
🎯 What problem does this PR solve?
The embed addon currently pins model inference to a single GPU device. On multi-GPU systems this leaves additional GPUs idle, preventing users from leveraging pipeline (layer) or tensor (row) parallelism for embedding workloads.
📝 How does it solve it?
🧪 How was it tested?
🔌 API Changes