fix(etl): atomic counter updates + retry counter reset #2421
Walkthrough
The PR refactors ETL progress accounting to use atomic database increments within batch processors instead of separate updates in the main catalog stream function.
Three related bugs that corrupted ETL job counters:

1. totalValid > totalProcessed (successRate > 100%): processValidItemsBatch called updateEtlJobProgress (which incremented totalValid) in one DB call, then processCatalogEtl incremented totalProcessed in a separate DB call. If the CF Worker died between these two calls, the job was permanently stuck with totalValid > totalProcessed.
   Fix: move the totalProcessed increment into updateEtlJobProgress so all three counters (valid, invalid, processed) are updated in a single atomic query.

2. Counter double-counting on CF Queue retry: CF Worker CPU timeouts hard-kill the process, bypassing the catch block, so CF Queue re-delivers the message with the same jobId. The counters were additive, so retried jobs accumulated 2x (or more) the actual row counts.
   Fix: at the start of processCatalogEtl, detect a retry (running status with non-zero totalProcessed) and reset the counters to 0 before re-processing.

3. Broken auto-complete in updateEtlJobProgress: the CASE WHEN check compared totalProcessed against totalValid + totalInvalid, but totalProcessed was always updated after updateEtlJobProgress ran, so the check always saw a stale value and never fired. The explicit status='completed' set at the end of processCatalogEtl is the real completion mechanism.
   Fix: remove the dead auto-complete logic from updateEtlJobProgress.

Also adds two regression tests: one for the retry counter reset, and one asserting totalProcessed == totalValid + totalInvalid after a clean run.
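The first failure mode can be illustrated with a minimal TypeScript sketch. Everything here is an assumption made for illustration: the job row is modelled as an in-memory object, the names `JobCounters`, `BatchDelta`, `splitUpdate`, `applyBatchAtomically`, and `successRate` are hypothetical rather than the project's code, and successRate is assumed to be totalValid / totalProcessed × 100. The sketch contrasts the old two-write shape with a single combined update.

```typescript
// Hypothetical in-memory stand-in for an ETL job row; not the project's schema.
interface JobCounters {
  totalProcessed: number;
  totalValid: number;
  totalInvalid: number;
}

// Hypothetical per-batch increments.
interface BatchDelta {
  processed: number;
  valid: number;
  invalid: number;
}

// Old shape: two separate writes. `crashBetween` simulates the Worker being
// killed after the first write and before the second one.
function splitUpdate(job: JobCounters, delta: BatchDelta, crashBetween: boolean): void {
  job.totalValid += delta.valid;
  job.totalInvalid += delta.invalid;
  if (crashBetween) {
    return; // the totalProcessed write never happens
  }
  job.totalProcessed += delta.processed;
}

// New shape: all three counters are computed together and returned as one new
// row, standing in for a single UPDATE statement. There is no point at which
// only some of the counters have moved.
function applyBatchAtomically(job: JobCounters, delta: BatchDelta): JobCounters {
  return {
    totalProcessed: job.totalProcessed + delta.processed,
    totalValid: job.totalValid + delta.valid,
    totalInvalid: job.totalInvalid + delta.invalid,
  };
}

// Assumed definition of successRate, as a percentage.
function successRate(job: JobCounters): number {
  return job.totalProcessed === 0 ? 0 : (job.totalValid / job.totalProcessed) * 100;
}

// Illustrative numbers only: a job at 600/600 receives a batch of 100 valid
// rows and the Worker dies between the two writes.
const stuck: JobCounters = { totalProcessed: 600, totalValid: 600, totalInvalid: 0 };
splitUpdate(stuck, { processed: 100, valid: 100, invalid: 0 }, true);
// stuck is now { totalProcessed: 600, totalValid: 700, totalInvalid: 0 },
// so successRate(stuck) is above 100.

const healthy = applyBatchAtomically(
  { totalProcessed: 600, totalValid: 600, totalInvalid: 0 },
  { processed: 100, valid: 100, invalid: 0 },
);
// healthy.totalProcessed === healthy.totalValid + healthy.totalInvalid
```

In a real database the same effect would come from issuing one UPDATE that increments all three columns, rather than from returning a new object; the sketch only models the "all or nothing" property.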
…k processing

With chunked ETL (multiple queue messages per file, each covering a byte range), chunks 2..N arrive with totalProcessed already > 0 from earlier chunks. The retry-reset guard would have fired on every non-first chunk, wiping the counters legitimately set by previous chunks.

The core atomic-counter fix (totalProcessed updated inside updateEtlJobProgress alongside totalValid/totalInvalid) is the primary guard against successRate > 100%. Stale running jobs are already handled by the existing POST /etl/reset-stuck endpoint.
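The interaction between chunked processing and the retry-reset guard can be sketched as follows. This is a simplified in-memory model, not the project's code: `Job`, `looksLikeRetry`, `processChunk`, and `runChunks` are hypothetical names, and the guard condition (running status with non-zero totalProcessed) is taken from the description of the earlier commit.

```typescript
type JobStatus = "pending" | "running" | "completed";

// Hypothetical, reduced job row: only the fields the guard looks at.
interface Job {
  status: JobStatus;
  totalProcessed: number;
}

// The retry heuristic as described: a job that is already running and has
// already counted some rows is assumed to be a re-delivered message.
function looksLikeRetry(job: Job): boolean {
  return job.status === "running" && job.totalProcessed > 0;
}

// Process one chunk (one queue message) of `rows` rows.
function processChunk(job: Job, rows: number, useRetryGuard: boolean): void {
  if (useRetryGuard && looksLikeRetry(job)) {
    job.totalProcessed = 0; // reset meant for retries, but chunks 2..N match too
  }
  job.status = "running";
  job.totalProcessed += rows;
}

// Feed a sequence of chunks through a fresh job and report the final count.
function runChunks(chunkSizes: number[], useRetryGuard: boolean): number {
  const job: Job = { status: "pending", totalProcessed: 0 };
  for (const rows of chunkSizes) {
    processChunk(job, rows, useRetryGuard);
  }
  return job.totalProcessed;
}

const withGuard = runChunks([100, 100, 100], true); // 100: every later chunk wiped the count
const withoutGuard = runChunks([100, 100, 100], false); // 300
```

With the guard enabled, only the last chunk's rows survive, which is why the heuristic cannot tell a re-delivered message apart from a legitimate later chunk of the same job.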
310e627 to c27ea32
Summary
Three related ETL counter bugs that caused stalled/corrupted job state:

1. successRate > 100% (e.g. campsaver 700/600): `processValidItemsBatch` incremented `totalValid` in one DB call, then `processCatalogEtl` incremented `totalProcessed` in a separate call. A CF Worker dying between these two calls left `totalValid > totalProcessed` permanently. Fixed by moving `totalProcessed` into `updateEtlJobProgress` so all three counters update atomically.
2. Counter double-counting on CF Queue retry: CF Worker CPU timeouts hard-kill the process, bypassing the catch block. CF Queue re-delivers the message with the same `jobId`, and additive counters accumulated 2× the actual row counts. Fixed by detecting a retry (same `jobId` already in `running` state with non-zero `totalProcessed`) and resetting counters to 0 before re-processing.
3. Broken auto-complete in `updateEtlJobProgress`: the `CASE WHEN totalProcessed = totalValid + totalInvalid` check always read a stale `totalProcessed` (it was updated after this function ran) and never fired. Removed the dead code; the explicit `status='completed'` set at the end of `processCatalogEtl` is the real completion mechanism.

Changes
- `updateEtlJobProgress.ts`: add `processed` param, update all three counters atomically, remove broken auto-complete CASE
- `processValidItemsBatch.ts`: pass `processed: items.length`
- `processLogsBatch.ts`: pass `processed: logs.length`
- `processCatalogEtl.ts`: retry detection at start; remove standalone `totalProcessed` increments
- `test/etl.test.ts`: two new regression tests: retry counter reset, `totalProcessed == totalValid + totalInvalid` invariant

Post-Deploy Monitoring & Validation
- `/api/admin/analytics/catalog/etl`: watch `successRate` on newly-processed jobs; should always be ≤ 100%
- `/api/admin/analytics/catalog/etl/reset-stuck`: call after deploy to clear existing stale `running` jobs
- `successRate ≤ 100%`, `totalProcessed == totalValid + totalInvalid`
- `successRate > 100%` means the atomic update didn't land (roll back this PR)
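The validation conditions above could be checked mechanically with something like the following sketch. It assumes job rows have already been fetched by some means; the `EtlJobSnapshot` shape and the `findCounterViolations` helper are hypothetical and do not describe the actual response format of the analytics endpoint.

```typescript
// Hypothetical shape of a job row as seen by a monitoring script.
interface EtlJobSnapshot {
  jobId: string;
  totalProcessed: number;
  totalValid: number;
  totalInvalid: number;
}

// Return one human-readable message per violated condition.
function findCounterViolations(jobs: EtlJobSnapshot[]): string[] {
  const violations: string[] = [];
  for (const job of jobs) {
    // Equivalent to successRate > 100% when successRate = valid / processed.
    if (job.totalValid > job.totalProcessed) {
      violations.push(job.jobId + ": totalValid exceeds totalProcessed");
    }
    // The invariant the atomic update is meant to preserve.
    if (job.totalProcessed !== job.totalValid + job.totalInvalid) {
      violations.push(job.jobId + ": totalProcessed != totalValid + totalInvalid");
    }
  }
  return violations;
}

// Made-up sample rows: the first is consistent, the second reproduces the
// 700/600 symptom and trips both checks.
const sampleViolations = findCounterViolations([
  { jobId: "job-a", totalProcessed: 600, totalValid: 590, totalInvalid: 10 },
  { jobId: "job-b", totalProcessed: 600, totalValid: 700, totalInvalid: 0 },
]);
```

An empty result on newly processed jobs would be consistent with the atomic update having landed; any entry for a job created after the deploy would be the rollback signal described above.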