Skip to content

fix: improve mobile download robustness and prevent unhandled rejection crashes#1815

Merged
gianni-cor merged 9 commits into
tetherto:mainfrom
donriddo:fix/mobile-download-robustness
May 20, 2026
Merged

fix: improve mobile download robustness and prevent unhandled rejection crashes#1815
gianni-cor merged 9 commits into
tetherto:mainfrom
donriddo:fix/mobile-download-robustness

Conversation

@donriddo

@donriddo donriddo commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

🎯 What problem does this PR solve?

  • On iOS/Android, if `ensureModel()` throws a network error while outside a `try/catch`, the Bare runtime treats the rejected promise as an unhandled rejection and terminates the process. Appium sees no FAIL marker — the crash is silent.
  • Transient download failures (DNS, connection reset, timeout, 5xx) produce no retry attempt; a single blip during CI model download causes the entire test suite to fail.
  • Partial or zero-byte downloads can be cached and re-used on the next run, causing all subsequent tests on that run to fail with a corrupted model file.

📝 How does it solve it?

  • `utils.js` — `ensureModel()` now uses `downloadFileWithRetries()`: exponential backoff with jitter for transient errors (DNS, reset, timeout, 5xx), `.part`-file atomic writes, zero-byte cache detection and removal before re-download, and an idle timeout on the response stream.
  • `integration-runtime.cjs` — wraps `import()` in `try/catch` to return a structured error summary instead of propagating the rejection; installs `process.on('unhandledRejection', ...)` as a last-resort containment layer.
  • All 16 integration test files — each test body is now wrapped in `try/catch` so `ensureModel()` / `resolveModel()` / `setupModel()` / `createToolModel()` / `createAddon()` network failures call `t.fail()` instead of leaving an unhandled rejection.

🧪 How was it tested?

  • Lint (`npx standard`) passes clean on all modified test files.
  • `npm run test:integration:generate` regenerates `all.js` and `integration.auto.cjs` successfully.
  • AWS Device Farm — Android (run 25334478072):
    • Google Pixel 9 Pro: PASSED (Group A + Group B)
    • Samsung Galaxy S25 Ultra: PASSED (Group A + Group B)
    • 12/12 passed across 4 device-group combinations; `runModelLoadingTest` and `runMultiInstanceTest` confirmed by Appium UI polling; app state 4 (foreground) after all tests
  • AWS Device Farm — iOS (run 25334478072): infrastructure failure at test spec upload step (Device Farm upload API error, unrelated to code changes)

@donriddo donriddo requested review from a team as code owners April 29, 2026 21:33
@donriddo donriddo force-pushed the fix/mobile-download-robustness branch from 1688620 to 6a4458c Compare April 29, 2026 22:13

@gianni-cor gianni-cor left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work on this — the underlying problem is real and the direction is right. A few things to address before this lands.

1. Use Bare.on(...) instead of process.on(...) for the unhandled-rejection handler

The new safety net in test/mobile/integration-runtime.cjs:

if (typeof process !== 'undefined' && typeof process.on === 'function') {
  process.on('unhandledRejection', (reason) => {
    console.error('[integration-runner] Unhandled rejection:', reason instanceof Error ? reason.stack : reason)
  })
}

process is not a default global in Bare — it must be required from bare-process. Per the Bare runtime docs, the canonical API is Bare.on('unhandledRejection', ...). The reason process.on(...) happens to work today is that qvac-test-addon-mobile/scripts/build-test-app.js does globalThis.process = require('bare-process') before loading this file, and bare-process then forwards Bare events via EventEmitter.forward(Bare, this, ['unhandledRejection', ...]).

That's a fragile dependency chain:

  • If the global-process shim is removed or refactored, the conditional silently no-ops and the safety net is gone — defeating the entire purpose of the PR.
  • The existing convention in this repo is Bare.on(...) — see packages/qvac-lib-infer-parakeet/test/mobile/integration-runtime.cjs and packages/qvac-lib-infer-nmtcpp/test/integration/{bergamot,indictrans,opencl-cache,pivot-bergamot}.test.js.

Please follow the existing pattern, which also handles uncaughtException:

if (typeof Bare !== 'undefined' && typeof Bare.on === 'function') {
  Bare.on('unhandledRejection', (reason) => {
    console.error('[integration-runner] Unhandled rejection:', reason instanceof Error ? reason.stack : reason)
  })
  Bare.on('uncaughtException', (err) => {
    console.error('[integration-runner] Uncaught exception:', err instanceof Error ? err.stack : err)
  })
}

2. ESIZE (zero-byte download) is not retried

In test/integration/utils.js, downloadFileWithRetries throws an ESIZE error when the downloaded file is below minBytes:

if (stat.size < minBytes) {
  fs.unlinkSync(partPath)
  throw Object.assign(new Error(`Downloaded file is empty from ${host}`), { code: 'ESIZE' })
}

But 'ESIZE' is not in TRANSIENT_ERROR_CODES, so isTransientError() returns false and the loop bails out without retrying. A zero-byte / truncated download is exactly the kind of transient failure that should retry — it's literally one of the three motivations stated in the PR description ("Partial or zero-byte downloads can be cached and re-used on the next run"). The invariant is enforced on subsequent runs (because the file isn't cached) but the current run gets no retry benefit.

Fix: add 'ESIZE' to TRANSIENT_ERROR_CODES, or special-case it as transient in isTransientError().

3. Reduce test-name duplication via a safeTest wrapper

Every wrapped test now has the test name written twice:

test('cacheKey stores tokens but stays under n_predict', { timeout: 600_000 }, async t => {
  try {
    // ...
  } catch (error) {
    console.error(error)
    t.fail('cacheKey stores tokens but stays under n_predict: ' + error.message)
  }
})

If anyone renames the test, the duplicate string in the catch block won't auto-update and silently drifts. With 16 modified test files this is a lot of fragile copy-paste, and it's a big chunk of why the diff is +1708 lines. A tiny wrapper removes the duplication and shrinks the diff dramatically:

function safeTest (name, opts, fn) {
  test(name, opts, async (t) => {
    try {
      await fn(t)
    } catch (err) {
      console.error(err)
      t.fail(`${name}: ${err.message}`)
    }
  })
}

Each test then becomes safeTest('foo', opts, async t => { /* body */ }) — no try/catch boilerplate, no duplicated string. The embed PR (#1817) already does something similar with createDeviceModelTest; aligning here would give the three PRs a consistent shape.

4. (Suggested, non-blocking) Avoid mass t.fail cascade with a module-level "model unavailable" cache

Today, if ensureModel() truly fails (e.g., DNS down throughout the run), every test that calls it hits the same t.fail() path. A 16-test suite reports 16 "ENOTFOUND" failures instead of bailing out fast. That isn't worse than the original silent crash, but it is noisy and obscures the real signal.

Consider a small cache that converts cascading failures into skips:

let modelUnavailableReason = null

async function safeEnsureModel (t, args) {
  if (modelUnavailableReason) {
    t.skip(\`model unavailable: \${modelUnavailableReason}\`)
    return null
  }
  try {
    return await ensureModel(args)
  } catch (err) {
    modelUnavailableReason = err.code || err.message
    t.skip(\`model download failed: \${modelUnavailableReason}\`)
    return null
  }
}

Subsequent tests then early-return when safeEnsureModel returns null. This turns a network-outage run into 1 informative skip + N silent skips instead of N noisy failures. Nice-to-have, not blocking on its own.

@gianni-cor gianni-cor left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the round of fixes — items 1, 2, and 4 from the previous review are all addressed cleanly, and the safeTest wrapper made the diff much smaller. A few items still need attention before this lands.

1. sanity-checks is failing on CI (lint, blocking)

Job log shows JS lint errors in cache-state-machine.test.js. The de-indent that came with removing the outer try/catch wrapper was incomplete: many t.ok(...) continuation arguments are at 4 spaces instead of 6, e.g.:

    t.ok(
      cancelAfterFirstStats._chunkCount <= fullStats._chunkCount,
    `cancel-after-first chunkCount (${cancelAfterFirstStats._chunkCount}) <= full run (${fullStats._chunkCount})`
    )

standard reports indent errors at lines 358, 363, 368, 372, 378–396, etc. — running npx standard --fix packages/qvac-lib-infer-llamacpp-llm/test/integration/cache-state-machine.test.js should clear all of them in one shot.

2. Two test files still use plain test() and would still hit the silent-crash path on mobile (blocking)

The follow-up commit (952891072 fix: extend safeTest coverage to grammar, image, and multi-gpu tests) caught grammar/image/multi-gpu but missed two more files that have the same problem:

ocr-lighton.test.js:106

test('LightON OCR-2 can extract text from document image', { timeout: TEST_CONSTANTS.timeout }, async t => {
  for (const deviceConfig of DEVICE_CONFIGS) {
    // ...
    const { inference } = await setupLightOnInference(t, deviceConfig.device)  // calls ensureModel() x2
    // ...
  }
})

There is no skip: clause and DEVICE_CONFIGS (line 35) is constructed for mobile via isMobile, so this test runs on iOS/Android. setupLightOnInference() calls ensureModel() at lines 51 and 54 with no outer containment.

finetuning-pause-resume.test.js:123, 279, 342, 416

Four tests, each calling await ensureModel(...) before an inner try { await model.load() ... } finally { ... } block. The skip predicate is:

const skipFinetuning = useCpu || (noGpu && !isWindows)

Mobile (iOS/Android) is not in this set, so these tests will run on mobile. Example pattern:

test('finetuning pause and resume', { ... skip: skipFinetuning }, async t => {
  for (const modelVariant of FINETUNE_MODELS) {
    // ...
    const [modelName, modelDir] = await ensureModel({ ... })  // <-- outside try/finally
    // ...
    try {
      await model.load()
      // ...
    } finally { ... }
  }
})

Either convert these to safeTest (preferred, matches the rest of the file), or move the ensureModel calls inside the existing try/finally block.

Out of scope (just for the record)

model-loading.test.js:132 ('sharded model can run inference end-to-end') uses test() not safeTest(). Its skip predicate is !isLinuxX64, so it never runs on mobile — fine to leave as-is.

The previously-suggested safeEnsureModel / model-unavailable-cache pattern (item A) was non-blocking and remains a nice-to-have.

@donriddo

Copy link
Copy Markdown
Contributor Author

/review

@github-actions

github-actions Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

@gianni-cor

Copy link
Copy Markdown
Contributor

/review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe-to-test Reviewer has verified safety of packages & github actions in this PR verified Authorize secrets / label-gate in PR workflows verify

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants