fix: improve mobile download robustness and prevent unhandled rejection crashes#1815
Conversation
1688620 to
6a4458c
Compare
gianni-cor
left a comment
There was a problem hiding this comment.
Thanks for the work on this — the underlying problem is real and the direction is right. A few things to address before this lands.
1. Use Bare.on(...) instead of process.on(...) for the unhandled-rejection handler
The new safety net in test/mobile/integration-runtime.cjs:
if (typeof process !== 'undefined' && typeof process.on === 'function') {
process.on('unhandledRejection', (reason) => {
console.error('[integration-runner] Unhandled rejection:', reason instanceof Error ? reason.stack : reason)
})
}process is not a default global in Bare — it must be required from bare-process. Per the Bare runtime docs, the canonical API is Bare.on('unhandledRejection', ...). The reason process.on(...) happens to work today is that qvac-test-addon-mobile/scripts/build-test-app.js does globalThis.process = require('bare-process') before loading this file, and bare-process then forwards Bare events via EventEmitter.forward(Bare, this, ['unhandledRejection', ...]).
That's a fragile dependency chain:
- If the global-process shim is removed or refactored, the conditional silently no-ops and the safety net is gone — defeating the entire purpose of the PR.
- The existing convention in this repo is
Bare.on(...)— seepackages/qvac-lib-infer-parakeet/test/mobile/integration-runtime.cjsandpackages/qvac-lib-infer-nmtcpp/test/integration/{bergamot,indictrans,opencl-cache,pivot-bergamot}.test.js.
Please follow the existing pattern, which also handles uncaughtException:
if (typeof Bare !== 'undefined' && typeof Bare.on === 'function') {
Bare.on('unhandledRejection', (reason) => {
console.error('[integration-runner] Unhandled rejection:', reason instanceof Error ? reason.stack : reason)
})
Bare.on('uncaughtException', (err) => {
console.error('[integration-runner] Uncaught exception:', err instanceof Error ? err.stack : err)
})
}2. ESIZE (zero-byte download) is not retried
In test/integration/utils.js, downloadFileWithRetries throws an ESIZE error when the downloaded file is below minBytes:
if (stat.size < minBytes) {
fs.unlinkSync(partPath)
throw Object.assign(new Error(`Downloaded file is empty from ${host}`), { code: 'ESIZE' })
}But 'ESIZE' is not in TRANSIENT_ERROR_CODES, so isTransientError() returns false and the loop bails out without retrying. A zero-byte / truncated download is exactly the kind of transient failure that should retry — it's literally one of the three motivations stated in the PR description ("Partial or zero-byte downloads can be cached and re-used on the next run"). The invariant is enforced on subsequent runs (because the file isn't cached) but the current run gets no retry benefit.
Fix: add 'ESIZE' to TRANSIENT_ERROR_CODES, or special-case it as transient in isTransientError().
3. Reduce test-name duplication via a safeTest wrapper
Every wrapped test now has the test name written twice:
test('cacheKey stores tokens but stays under n_predict', { timeout: 600_000 }, async t => {
try {
// ...
} catch (error) {
console.error(error)
t.fail('cacheKey stores tokens but stays under n_predict: ' + error.message)
}
})If anyone renames the test, the duplicate string in the catch block won't auto-update and silently drifts. With 16 modified test files this is a lot of fragile copy-paste, and it's a big chunk of why the diff is +1708 lines. A tiny wrapper removes the duplication and shrinks the diff dramatically:
function safeTest (name, opts, fn) {
test(name, opts, async (t) => {
try {
await fn(t)
} catch (err) {
console.error(err)
t.fail(`${name}: ${err.message}`)
}
})
}Each test then becomes safeTest('foo', opts, async t => { /* body */ }) — no try/catch boilerplate, no duplicated string. The embed PR (#1817) already does something similar with createDeviceModelTest; aligning here would give the three PRs a consistent shape.
4. (Suggested, non-blocking) Avoid mass t.fail cascade with a module-level "model unavailable" cache
Today, if ensureModel() truly fails (e.g., DNS down throughout the run), every test that calls it hits the same t.fail() path. A 16-test suite reports 16 "ENOTFOUND" failures instead of bailing out fast. That isn't worse than the original silent crash, but it is noisy and obscures the real signal.
Consider a small cache that converts cascading failures into skips:
let modelUnavailableReason = null
async function safeEnsureModel (t, args) {
if (modelUnavailableReason) {
t.skip(\`model unavailable: \${modelUnavailableReason}\`)
return null
}
try {
return await ensureModel(args)
} catch (err) {
modelUnavailableReason = err.code || err.message
t.skip(\`model download failed: \${modelUnavailableReason}\`)
return null
}
}Subsequent tests then early-return when safeEnsureModel returns null. This turns a network-outage run into 1 informative skip + N silent skips instead of N noisy failures. Nice-to-have, not blocking on its own.
gianni-cor
left a comment
There was a problem hiding this comment.
Thanks for the round of fixes — items 1, 2, and 4 from the previous review are all addressed cleanly, and the safeTest wrapper made the diff much smaller. A few items still need attention before this lands.
1. sanity-checks is failing on CI (lint, blocking)
Job log shows JS lint errors in cache-state-machine.test.js. The de-indent that came with removing the outer try/catch wrapper was incomplete: many t.ok(...) continuation arguments are at 4 spaces instead of 6, e.g.:
t.ok(
cancelAfterFirstStats._chunkCount <= fullStats._chunkCount,
`cancel-after-first chunkCount (${cancelAfterFirstStats._chunkCount}) <= full run (${fullStats._chunkCount})`
)standard reports indent errors at lines 358, 363, 368, 372, 378–396, etc. — running npx standard --fix packages/qvac-lib-infer-llamacpp-llm/test/integration/cache-state-machine.test.js should clear all of them in one shot.
2. Two test files still use plain test() and would still hit the silent-crash path on mobile (blocking)
The follow-up commit (952891072 fix: extend safeTest coverage to grammar, image, and multi-gpu tests) caught grammar/image/multi-gpu but missed two more files that have the same problem:
ocr-lighton.test.js:106
test('LightON OCR-2 can extract text from document image', { timeout: TEST_CONSTANTS.timeout }, async t => {
for (const deviceConfig of DEVICE_CONFIGS) {
// ...
const { inference } = await setupLightOnInference(t, deviceConfig.device) // calls ensureModel() x2
// ...
}
})There is no skip: clause and DEVICE_CONFIGS (line 35) is constructed for mobile via isMobile, so this test runs on iOS/Android. setupLightOnInference() calls ensureModel() at lines 51 and 54 with no outer containment.
finetuning-pause-resume.test.js:123, 279, 342, 416
Four tests, each calling await ensureModel(...) before an inner try { await model.load() ... } finally { ... } block. The skip predicate is:
const skipFinetuning = useCpu || (noGpu && !isWindows)Mobile (iOS/Android) is not in this set, so these tests will run on mobile. Example pattern:
test('finetuning pause and resume', { ... skip: skipFinetuning }, async t => {
for (const modelVariant of FINETUNE_MODELS) {
// ...
const [modelName, modelDir] = await ensureModel({ ... }) // <-- outside try/finally
// ...
try {
await model.load()
// ...
} finally { ... }
}
})Either convert these to safeTest (preferred, matches the rest of the file), or move the ensureModel calls inside the existing try/finally block.
Out of scope (just for the record)
model-loading.test.js:132 ('sharded model can run inference end-to-end') uses test() not safeTest(). Its skip predicate is !isLinuxX64, so it never runs on mobile — fine to leave as-is.
The previously-suggested safeEnsureModel / model-unavailable-cache pattern (item A) was non-blocking and remains a nice-to-have.
|
/review |
Tier-based Approval Status |
|
/review |
🎯 What problem does this PR solve?
📝 How does it solve it?
🧪 How was it tested?