Skip to content

fix(etl): atomic counter updates + retry counter reset#2421

Merged
andrew-bierman merged 2 commits into
mainfrom
fix/etl-stale-jobs
May 14, 2026
Merged

fix(etl): atomic counter updates + retry counter reset#2421
andrew-bierman merged 2 commits into
mainfrom
fix/etl-stale-jobs

Conversation

@andrew-bierman
Copy link
Copy Markdown
Collaborator

@andrew-bierman andrew-bierman commented May 14, 2026

Summary

Three related ETL counter bugs that caused stalled/corrupted job state:

  • successRate > 100% (e.g. campsaver 700/600): processValidItemsBatch incremented totalValid in one DB call, then processCatalogEtl incremented totalProcessed in a separate call. CF Worker dying between these two left totalValid > totalProcessed permanently. Fixed by moving totalProcessed into updateEtlJobProgress so all three counters update atomically.

  • Counter double-counting on CF Queue retry: CF Worker CPU timeouts hard-kill the process, bypassing the catch block. CF Queue re-delivers the message with the same jobId, and additive counters accumulated 2× the actual row counts. Fixed by detecting a retry (same jobId already in running state with non-zero totalProcessed) and resetting counters to 0 before re-processing.

  • Broken auto-complete in updateEtlJobProgress: The CASE WHEN totalProcessed = totalValid + totalInvalid check always read a stale totalProcessed (it was updated after this function ran) and never fired. Removed the dead code — the explicit status='completed' set at the end of processCatalogEtl is the real completion mechanism.

Changes

  • updateEtlJobProgress.ts — add processed param, update all three counters atomically, remove broken auto-complete CASE
  • processValidItemsBatch.ts — pass processed: items.length
  • processLogsBatch.ts — pass processed: logs.length
  • processCatalogEtl.ts — retry detection at start; remove standalone totalProcessed increments
  • test/etl.test.ts — two new regression tests: retry counter reset, totalProcessed == totalValid + totalInvalid invariant

Post-Deploy Monitoring & Validation

  • What to monitor: /api/admin/analytics/catalog/etl — watch successRate on newly-processed jobs; should always be ≤ 100%
  • Call /api/admin/analytics/catalog/etl/reset-stuck after deploy to clear existing stale running jobs
  • Expected healthy behavior: new jobs complete with successRate ≤ 100%, totalProcessed == totalValid + totalInvalid
  • Failure signal: any job with successRate > 100% means the atomic update didn't land (rollback this PR)
  • Validation window: first 3 scraper runs after deploy
  • Owner: Andrew

Compound Engineered 🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor

    • Improved ETL job progress tracking to ensure atomic updates, preventing inconsistencies if worker failures occur mid-operation.
  • Tests

    • Added validation test to verify ETL progress accounting invariants remain consistent upon job completion.

Review Change Stack

Copilot AI review requested due to automatic review settings May 14, 2026 15:54
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9ba3a035-6921-4678-9df0-26aa5f5b1867

📥 Commits

Reviewing files that changed from the base of the PR and between feb909d and c27ea32.

📒 Files selected for processing (5)
  • packages/api/src/services/etl/processCatalogEtl.ts
  • packages/api/src/services/etl/processLogsBatch.ts
  • packages/api/src/services/etl/processValidItemsBatch.ts
  • packages/api/src/services/etl/updateEtlJobProgress.ts
  • packages/api/test/etl.test.ts

Walkthrough

The PR refactors ETL progress accounting to use atomic database increments within batch processors instead of separate updates in the main catalog stream function. The updateEtlJobProgress function accepts a new processed counter, and both valid and invalid batch handlers now report their item counts alongside progress in a single update call.

Changes

ETL Progress Atomicity

Layer / File(s) Summary
Progress counter API update
packages/api/src/services/etl/updateEtlJobProgress.ts
updateEtlJobProgress now accepts an optional processed parameter and increments totalProcessed in the database. Completion-related status/timestamp updates are removed from this function.
Batch processors report progress atomically
packages/api/src/services/etl/processValidItemsBatch.ts, packages/api/src/services/etl/processLogsBatch.ts
processValidItemsBatch and processLogsBatch each call updateEtlJobProgress with both their item count and a processed field set to items length, ensuring progress and validity counters increment together in a single update.
Main ETL stream delegates progress to batch handlers
packages/api/src/services/etl/processCatalogEtl.ts
processCatalogEtl removes direct progress updates and remaining-row computation logic, delegating all progress accounting to batch processors. Unused sql import is removed.
Accounting invariant test
packages/api/test/etl.test.ts
New test asserts that totalProcessed equals totalValid + totalInvalid after catalog ETL completion, validating the atomicity invariant.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • PackRat-AI/PackRat#2419: Modifies processCatalogEtl's streaming logic in the same function; changes interact with batch flushing mechanics.
  • PackRat-AI/PackRat#2394: Updates ETL job finalization and progress counter behavior; directly connected to progress accounting changes.

Suggested labels

api

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the primary changes: atomic counter updates in ETL progress tracking and retry counter reset logic for handling re-delivered jobs.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/etl-stale-jobs

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread .github/workflows/migrations.yml Fixed
@cloudflare-workers-and-pages
Copy link
Copy Markdown
Contributor

Deploying packrat-landing with  Cloudflare Pages  Cloudflare Pages

Latest commit: 310e627
Status: ✅  Deploy successful!
Preview URL: https://bb85d6ca.packrat-landing.pages.dev
Branch Preview URL: https://fix-etl-stale-jobs.packrat-landing.pages.dev

View logs

@cloudflare-workers-and-pages
Copy link
Copy Markdown
Contributor

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Preview URL Updated (UTC)
✅ Deployment successful!
View logs
packrat-admin 310e627 Commit Preview URL

Branch Preview URL
May 14 2026, 03:56 PM

@cloudflare-workers-and-pages
Copy link
Copy Markdown
Contributor

Deploying packrat-guides with  Cloudflare Pages  Cloudflare Pages

Latest commit: 310e627
Status: ✅  Deploy successful!
Preview URL: https://6b445eed.packrat-guides-6gq.pages.dev
Branch Preview URL: https://fix-etl-stale-jobs.packrat-guides-6gq.pages.dev

View logs

Three related bugs that corrupted ETL job counters:

1. totalValid > totalProcessed (successRate > 100%): processValidItemsBatch
   called updateEtlJobProgress (which incremented totalValid) in one DB call,
   then processCatalogEtl incremented totalProcessed in a separate DB call.
   If the CF Worker died between these two calls, the job was permanently stuck
   with totalValid > totalProcessed.

   Fix: move totalProcessed increment into updateEtlJobProgress so all three
   counters (valid, invalid, processed) are updated in a single atomic query.

2. Counter double-counting on CF Queue retry: CF Worker CPU timeouts hard-kill
   the process, bypassing the catch block, so CF Queue re-delivers the message
   with the same jobId. The counters were additive, so retried jobs accumulated
   2x (or more) the actual row counts.

   Fix: at the start of processCatalogETL, detect a retry (running status with
   non-zero totalProcessed) and reset counters to 0 before re-processing.

3. Broken auto-complete in updateEtlJobProgress: the CASE WHEN check compared
   totalProcessed against totalValid + totalInvalid, but totalProcessed was
   always updated after updateEtlJobProgress ran, so the check always saw a
   stale value and never fired. The explicit status='completed' set at the end
   of processCatalogETL is the real completion mechanism.

   Fix: remove the dead auto-complete logic from updateEtlJobProgress.

Also adds two regression tests: one for the retry counter reset, one to assert
totalProcessed == totalValid + totalInvalid after a clean run.
…k processing

With chunked ETL (multiple queue messages per file, each covering a byte
range), chunk 2..N arrive with totalProcessed already > 0 from earlier
chunks. The retry-reset guard would have fired on every non-first chunk,
wiping the counters legitimately set by previous chunks.

The core atomic-counter fix (totalProcessed updated inside
updateEtlJobProgress alongside totalValid/totalInvalid) is the primary
guard against successRate > 100%. Stale running jobs are already handled
by the existing POST /etl/reset-stuck endpoint.
@github-actions
Copy link
Copy Markdown
Contributor

Coverage Report for Expo Unit Tests Coverage (./apps/expo)

Status Category Percentage Covered / Total
🔵 Lines 82.61% 480 / 581
🔵 Statements 82.61% (🎯 75%) 480 / 581
🔵 Functions 92.59% 50 / 54
🔵 Branches 90.9% 170 / 187
File CoverageNo changed files found.
Generated in workflow #1223 for commit c27ea32 by the Vitest Coverage Report Action

@github-actions
Copy link
Copy Markdown
Contributor

Coverage Report for API Unit Tests Coverage (./packages/api)

Status Category Percentage Covered / Total
🔵 Lines 76.17% 502 / 659
🔵 Statements 76.17% (🎯 65%) 502 / 659
🔵 Functions 95% 38 / 40
🔵 Branches 88.67% 227 / 256
File CoverageNo changed files found.
Generated in workflow #1223 for commit c27ea32 by the Vitest Coverage Report Action

@andrew-bierman andrew-bierman merged commit 107972a into main May 14, 2026
11 checks passed
@andrew-bierman andrew-bierman deleted the fix/etl-stale-jobs branch May 14, 2026 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants