Fix target scaler fallback race condition via shared storage#3191
Merged
Conversation
Replace static in-process _targetScalersInError HashSet with a new ITargetScalerErrorRepository abstraction backed by blob storage. When a target scaler throws NotSupportedException (e.g. ServiceBus without Manage claim), the scaler ID is persisted to blob storage so all host instances — including the primary host on a different worker — can see the failure and fall back to incremental monitoring. New files: - ITargetScalerErrorRepository: public interface (AddAsync, GetAsync, ClearAsync) - InMemoryTargetScalerErrorRepository: default in-process fallback - BlobStorageTargetScalerErrorRepository: blob-backed impl using the same IAzureBlobStorageProvider pattern as BlobStorageConcurrencyStatusRepository Key changes: - ScaleManager: inject ITargetScalerErrorRepository, write errors via AddAsync, read via GetAsync before GetScalersToSample - ScaleMonitorService: inject ITargetScalerErrorRepository, read errors before GetScalersToSample, call ClearAsync on startup so customers can recover by restarting after granting Manage claim - DI: register InMemory as default, BlobStorage when Azure Storage is configured (AddAzureStorageCoreServices/AddAzureStorageScaleServices) Co-authored-by: Dobby <dobby@microsoft.com>
Unit tests (P0/P1): - StartAsync_ClearsTargetScalerErrors: verifies ClearAsync called on startup - OnTimer_CrossWorkerFallback_SharedRepository: simulates multi-worker scenario where worker A writes error, worker B reads it - InMemoryTargetScalerErrorRepository_Lifecycle: add/get/clear cycle - InMemoryTargetScalerErrorRepository_AddAsync_Idempotent: double add - GetScalersToSample_WithErrorSet_FiltersCorrectly: selective exclusion - GetScalersToSample_WithNullErrorSet_NoFiltering: null safety - GetScaleStatusAsync_SecondCall_ReadsPersistedError: error persists E2E tests (P2 - requires Azurite/storage): - GetBlobPathAsync_ReturnsExpectedPath - AddAsync_WritesExpectedBlob (round-trip serialization) - GetAsync_ReadsExpectedBlob - GetAsync_NoBlob_ReturnsEmpty (404 handling) - ClearAsync_DeletesBlob - NoStorageConnection_HandledGracefully (mocked) Also adds ITargetScalerErrorRepository to PublicSurfaceTests. Co-authored-by: Dobby <dobby@microsoft.com>
…attern Align with NullConcurrencyStatusRepository convention: - Production default is now NullTargetScalerErrorRepository (discards writes, returns empty) — same pattern as NullConcurrencyStatusRepository - InMemoryTargetScalerErrorRepository moved to test project as a test helper for simulating shared state in cross-worker tests Co-authored-by: Dobby <dobby@microsoft.com>
mathewc
reviewed
Apr 23, 2026
mathewc
reviewed
Apr 23, 2026
mathewc
reviewed
Apr 23, 2026
mathewc
reviewed
Apr 23, 2026
mathewc
reviewed
Apr 23, 2026
mathewc
reviewed
Apr 23, 2026
…face - Move GetAsync into GetScalersToSample (inside IsTargetScalingEnabled block) - Make GetScalersToSample async, accept ITargetScalerErrorRepository directly - Remove null parameter support; require non-null repository - Add ETag-based optimistic concurrency with retry loop in AddAsync - Replace ClearAsync with TTL-based expiry (10-min default) - Change blob format to include lastUpdated timestamp - Make ITargetScalerErrorRepository internal - Remove from public surface area test - Add TTL expiry tests (stale data returns empty, fresh data returns scalers) - Fix existing E2E tests to use new TargetScalerErrorState blob format Co-authored-by: Dobby <dobby@microsoft.com>
mathewc
reviewed
Apr 28, 2026
mathewc
reviewed
Apr 28, 2026
mathewc
reviewed
Apr 28, 2026
- Delete OnTimer_CrossWorkerFallback_SharedRepository (redundant with existing ScaleManagerTests) - Delete InMemoryTargetScalerErrorRepository_Lifecycle and _AddAsync_Idempotent (test helper tests) - Simplify AddAsync: drop if-guard, always call set.Add() Co-authored-by: Dobby <dobby@microsoft.com>
mathewc
approved these changes
Apr 29, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
On Elastic Premium plans, when a target scaler throws
NotSupportedException(e.g., ServiceBus without Manage claim), the fallback to incremental scale monitoring fails if the Scale Controller and primary host run on different workers._targetScalersInErrorwas a static in-processHashSet<string>. Only theadmin/host/scale/statuscode path (called by the Scale Controller) invokes target scalers and populates this set. TheScaleMonitorService.TakeMetricsSamplesAsync()path (running on the primary host) reads the set but never populates it. When these run on different workers, the primary host's_targetScalersInErrorstays empty — it never discovers the failure, excludes the incremental monitor, collects no metrics, and produces no scale-in votes. The app stays stuck at max workers indefinitely.Root Cause
The fundamental issue is that
_targetScalersInErroris process-local shared state that needs to be visible across multiple workers — the same class of problem solved byIScaleMetricsRepository(for metrics) andIConcurrencyStatusRepository(for concurrency snapshots), both of which use blob storage.Fix
Introduce a new internal
ITargetScalerErrorRepositoryabstraction backed by blob storage, following the establishedBlobStorageConcurrencyStatusRepositorypattern.New Files
ITargetScalerErrorRepositoryAddAsync,GetAsyncNullTargetScalerErrorRepositoryBlobStorageTargetScalerErrorRepositoryscale/{hostId}/targetScalersInError.jsonModified Files
ScaleManager.cs_targetScalersInError. InjectedITargetScalerErrorRepository. Writes errors viaAddAsynconNotSupportedException.GetScalersToSampleis now async and accepts the repository directly (reads insideIsTargetScalingEnabledblock).ScaleMonitorService.csITargetScalerErrorRepository. Passes repository toGetScalersToSample.WebJobsServiceCollectionExtensions.csNullTargetScalerErrorRepositoryas defaultStorageServiceCollectionExtensions.csBlobStorageTargetScalerErrorRepositorywhen Azure Storage is configuredKey Design Decisions
ETag-based optimistic concurrency:
AddAsyncuses a read-modify-write loop with ETag conditions to prevent lost updates when multiple instances write concurrently. UsesIfMatchfor existing blobs andIfNoneMatch = *for new blobs. Retries up to 3 times on 412/409.TTL-based expiry (no ClearAsync): Instead of explicitly clearing errors on startup (which races with other instances), the blob stores a
lastUpdatedtimestamp alongside the scaler set.GetAsyncreturns empty if the data is older than 10 minutes. This makes recovery automatic — once the error stops occurring (e.g., customer grants Manage claim and restarts), no newAddAsynccalls refresh the timestamp, and the data ages out.Internal interface:
ITargetScalerErrorRepositoryisinternal— it's an implementation detail, not public API surface.How It Fixes the Race Condition
GetScaleStatusAsync→ catchesNotSupportedException→ writes scaler ID to blob viaAddAsyncTakeMetricsSamplesAsync→GetScalersToSamplereads blob via repository → sees the error → falls back to incremental monitor → collects metrics → produces scale-in votesRecovery Flow
NotSupportedException→AddAsync()→ blob created/updated with current timestampAddAsynccalls → blob ages out after 10 minutes → TBS used againTesting
GetScalersToSample_FallsBackToMonitor_OnTargetScalerErrortest to use repositoryScaleManagerconstructor calls for new parameterGetScalersToSample_WithErrorSet_FiltersCorrectlytestGetScalersToSample_WithNullErrorSet_NoFilteringtest (usesNullTargetScalerErrorRepository)GetScaleStatusAsync_SecondCall_ReadsPersistedErrortestOnTimer_CrossWorkerFallback_SharedRepositorytestGetAsync_StaleData_ReturnsEmptyTTL expiry testGetAsync_FreshData_ReturnsScalersTTL expiry testTargetScalerErrorStateformat