QVAC-18064 feat: optimize nmtcpp for Android GPU inference #1875
Squashed commits:
- Optimize nmtcpp for Android GPU inference with Vulkan backend support
- Disable non-android CI jobs temporarily for isolated testing
- Bump qvac-fabric to 7248.2.5 for OpenCL Adreno M-padding fix
- Bump qvac-fabric to 7248.2.6 for Adreno q4_0 threshold fix
- Update vcpkg registry baseline for correct SHA512
- Move beam search KV cache pool to CPU backend
- Propagate config params after GGML context load and fix multi-GPU handling
- Disable OpenCL and revert to tetherto vcpkg registry
- Downgrade qvac-fabric to 7248.2.3 to match main branch registry
- Revert temporary CI workflow changes
- Add Android debug logging for Adreno 830 Vulkan crash investigation
- Prevent backend device accumulation and skip OpenCL comparison test
- Fix clang-format for ggml_backend_load_all_from_path call
- Remove Android debug logging added for Adreno 830 crash investigation

Co-authored-by: Cursor <cursoragent@cursor.com>
…errors Rename new nmt_utils functions/params/locals to camelCase per project readability-identifier-naming rules. Add NOLINT suppression for pre-existing snake_case declarations. Fix implicit const char* -> bool conversions to use explicit nullptr checks.
Security: bounded strnlen, clamp op_offload_min_batch, sanitize backend description, atomic g_backendsLoaded.
Correctness: reset g_backendsLoaded on teardown, null physKey dedup, remove unused selDevName.
Consistency: camelCase rename, JSDoc on silent catch, skip API for disabled test, restore USE_OPENCL comment.
Race fix: first-writer-wins mutex for setenv GGML_VK_OFFLOAD_MIN_BATCH.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
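The clamp and the first-writer-wins `setenv` mentioned in this commit might look roughly like the sketch below. Only the `GGML_VK_OFFLOAD_MIN_BATCH` variable name comes from the PR; the clamp bounds, the function names, and the boolean return convention are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstdlib>
#include <mutex>
#include <stdlib.h>  // POSIX setenv
#include <string>

namespace sketch {

// Hypothetical bounds: the PR does not state the real clamp range.
constexpr int kMinBatchLo = 0;
constexpr int kMinBatchHi = 4096;

// Clamp a caller-supplied value into [kMinBatchLo, kMinBatchHi].
inline int clampOffloadMinBatch(int requested) {
  return std::max(kMinBatchLo, std::min(requested, kMinBatchHi));
}

// Writes GGML_VK_OFFLOAD_MIN_BATCH at most once per process.
// Returns true if this call wrote the value, false if an earlier
// caller already did (first writer wins) or if setenv failed.
inline bool setOffloadMinBatchOnce(int requested) {
  static std::mutex mtx;
  static bool written = false;

  const std::lock_guard<std::mutex> lock(mtx);
  if (written) {
    return false;  // a previous model instance already set the value
  }
  const std::string value = std::to_string(clampOffloadMinBatch(requested));
  if (::setenv("GGML_VK_OFFLOAD_MIN_BATCH", value.c_str(), 1) != 0) {
    return false;
  }
  written = true;
  return true;
}

}  // namespace sketch
```

Serialising the write behind a mutex keeps two model instances that load concurrently from racing on the process environment; the second caller simply keeps whatever the first one set.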
Extract sanitizePrintableAscii from anonymous namespaces in TranslationModel.cpp and NmtLazyInitializeBackend.cpp into shared nmt_utils.hpp/cpp. Consolidate duplicate JSDoc blocks on getActiveBackendDescription in marian.js. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
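As a rough illustration of what a shared `sanitizePrintableAscii` helper with a bounded `strnlen` could look like, here is a minimal sketch. The signature, the 256-byte default limit, and the choice to replace non-printable bytes with `?` are all assumptions; the actual helper in `nmt_utils.hpp` may differ.

```cpp
#include <cstddef>
#include <string.h>  // strnlen (POSIX)
#include <string>

namespace sketch {

// Hypothetical signature: copies at most maxLen bytes from raw and
// replaces anything outside printable ASCII (0x20..0x7E) with '?'.
inline std::string sanitizePrintableAscii(const char* raw,
                                          std::size_t maxLen = 256) {
  if (raw == nullptr) {
    return {};
  }
  // Bounded length: never scans more than maxLen bytes, even if the
  // driver-supplied string is not NUL-terminated within that range.
  const std::size_t len = strnlen(raw, maxLen);

  std::string out;
  out.reserve(len);
  for (std::size_t i = 0; i < len; ++i) {
    const unsigned char c = static_cast<unsigned char>(raw[i]);
    out.push_back((c >= 0x20 && c <= 0x7E) ? static_cast<char>(c) : '?');
  }
  return out;
}

}  // namespace sketch
```

A helper of this shape would let both `TranslationModel.cpp` and `NmtLazyInitializeBackend.cpp` pass a driver-reported device name through the same filter before logging it or returning it to JS.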
🧪 C++ Test Coverage Report
Coverage: 📊 Detailed Coverage
Fix bergamot.cpp: rename snake_case locals/params, add explicit null checks for implicit bool conversions, NOLINT for owning-memory, narrowing conversions, deprecated u8path, and swappable params. Concatenate nested namespaces in bergamot.hpp.
Fix TranslationModel.hpp: add explicit to single-arg ctor, use = default for default ctor, override instead of virtual, remove redundant access specifiers, NOLINT for special-member-functions, sync processString param names.
Fix NOLINT placement in AddonJs.hpp, NmtLazyInitializeBackend.cpp, and TranslationModel.cpp (moved to correct diagnostic lines).
Rename path_model to pathModel in nmt_loader definition.

Co-authored-by: Cursor <cursoragent@cursor.com>
Move NOLINT comments to exact diagnostic lines, use NOLINTNEXTLINE and NOLINTBEGIN/END for multi-line expressions that clang-format may reflow. Fix bergamot.cpp u8path, owning-memory, narrowing, and swappable-param suppressions. Fix NmtLazyInitializeBackend.cpp cognitive complexity and pointer-arithmetic blocks. Use [[maybe_unused]] for unused openclCacheDir parameter. Fix nmt_beam_search.hpp nmt_context NOLINT, nmt_loader path_model rename. Co-authored-by: Cursor <cursoragent@cursor.com>
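For readers unfamiliar with the suppression forms named in this commit, the sketch below shows how `NOLINTNEXTLINE`, `NOLINTBEGIN`/`NOLINTEND`, and `[[maybe_unused]]` are typically placed. The functions are invented for illustration and are not taken from the PR; only the clang-tidy comment syntax and check names are real.

```cpp
#include <cstddef>

namespace sketch {

// A single-line suppression sits directly above the diagnosed line, so it
// stays attached even if clang-format reflows the declaration below it.
// NOLINTNEXTLINE(readability-identifier-naming)
inline int legacy_snake_case_sum(const int* values, std::size_t count) {
  int total = 0;
  // A BEGIN/END pair covers a multi-line region whose line breaks may
  // change after formatting.
  // NOLINTBEGIN(cppcoreguidelines-pro-bounds-pointer-arithmetic)
  for (std::size_t i = 0; i < count; ++i) {
    total += *(values + i);
  }
  // NOLINTEND(cppcoreguidelines-pro-bounds-pointer-arithmetic)
  return total;
}

// Hypothetical function: the parameter is kept for API compatibility but is
// unused, so [[maybe_unused]] replaces a suppression comment.
inline bool initBackend([[maybe_unused]] const char* openclCacheDir) {
  return true;
}

}  // namespace sketch
```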
✅ E2E Mobile Test Results - iOS
Overall Status: PASSED
Automated E2E mobile testing powered by AWS Device Farm
✅ E2E Mobile Test Results - Android
Overall Status: PASSED
Automated E2E mobile testing powered by AWS Device Farm
Tier-based Approval Status
/review
❌ E2E Mobile Test Results - iOS
Overall Status: FAILED
Automated E2E mobile testing powered by AWS Device Farm
❌ E2E Mobile Test Results - Android
Overall Status: FAILED
Automated E2E mobile testing powered by AWS Device Farm
Summary
- … `nmt_backend_init`) and JS (`discoverGpuDevices`): Vulkan0 + OpenCL0 on single-SoC Android devices now resolve to one backend instead of redundant scheduler splits that add sync overhead without parallel-compute benefit.
- … `g_backendsLoaded` guard in `NmtLazyInitializeBackend`) to avoid redundant `dlopen`/registration when multiple model instances are created in the same process.
- … `op_offload_min_batch` config propagated via `GGML_VK_OFFLOAD_MIN_BATCH` env var, allowing callers to force single-token decoder steps onto GPU (Vulkan default of 32 keeps them on CPU).
- … `get_optimal_thread_count()` to stay on performance cores (big.LITTLE SoCs spread across all 8 cores, hitting slow efficiency cores).
- … `use_gpu && flash_attn` (was returning 1 for all non-Metal/CUDA backends), fixing flash-attention alignment on Adreno/Mali. CPU sessions still get padding=1 via the early-return guard.
- … `backends.back()` (CPU is always last in the vector) instead of `backends[0]` (GPU when present). CPU allocation avoids contention with the main GPU inference pipeline on single-GPU mobile SoCs.
- … `getActiveBackendDescription()` API exposed through JS/TS: returns human-readable device description (e.g. "Qualcomm Adreno (TM) 830") for perf-report tagging.
- … `discoverGpuBackends()` test helper for Vulkan-vs-OpenCL comparison benchmarks (OpenCL test currently skipped due to upstream Adreno 830 `M%4` assertion).
- … `nmt_select_gpu_device` → `nmtSelectGpuDevice`, `nmt_name_contains_ci` → `nmtNameContainsCi` to satisfy clang-tidy `readability-identifier-naming`.
- … `USE_OPENCL` is off), so `gpuDevice` ordinals map to distinct physical GPUs without OpenCL duplicates occupying slots.
- … `ggml_backend_reg_get_proc_address` call to prevent crash when `reg` is null.
- … `devices[]` by slot, not by `d.index === gpuIdx`), and include `description` in perf labels.

Test plan
- `npm run test:integration` in `packages/qvac-lib-infer-nmtcpp` (Bergamot, IndicTrans, pivot tests)
- `getActiveBackendDescription()` returns device string on GPU, empty string on CPU/unloaded
- `npm run test:cpp` passes clang-tidy naming checks
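The device deduplication described in the Summary (Vulkan0 and OpenCL0 on the same SoC resolving to one backend) could be sketched along the following lines. The `GpuDevice` struct, the `physKey` field layout, and the rule that devices with an empty key are kept are assumptions for illustration; the PR mentions a `physKey` but does not show how it is built or compared.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

namespace sketch {

// Hypothetical device record; field names are illustrative only.
struct GpuDevice {
  std::string backend;      // e.g. "Vulkan0", "OpenCL0"
  std::string physKey;      // identifies the physical GPU; empty if unknown
  std::string description;  // human-readable device description
};

// Keeps the first device seen for each non-empty physKey. Devices with an
// empty key are always kept, because they cannot be shown to be duplicates.
inline std::vector<GpuDevice> dedupByPhysicalGpu(
    const std::vector<GpuDevice>& devices) {
  std::vector<GpuDevice> unique;
  std::unordered_set<std::string> seen;
  for (const auto& dev : devices) {
    if (!dev.physKey.empty() && !seen.insert(dev.physKey).second) {
      continue;  // same physical GPU already registered via another backend
    }
    unique.push_back(dev);
  }
  return unique;
}

}  // namespace sketch
```

With input ordered so that Vulkan devices come first, a single-SoC phone exposing both Vulkan0 and OpenCL0 would end up with only the Vulkan entry, which matches the behaviour the Summary describes.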