Fix target scaler fallback race condition via shared storage by alrod · Pull Request #3191 · Azure/azure-webjobs-sdk

alrod · 2026-04-08T22:31:32Z

Problem

On Elastic Premium plans, when a target scaler throws NotSupportedException (e.g., ServiceBus without Manage claim), the fallback to incremental scale monitoring fails if the Scale Controller and primary host run on different workers.

_targetScalersInError was a static in-process HashSet<string>. Only the admin/host/scale/status code path (called by the Scale Controller) invokes target scalers and populates this set. The ScaleMonitorService.TakeMetricsSamplesAsync() path (running on the primary host) reads the set but never populates it. When these run on different workers, the primary host's _targetScalersInError stays empty, so it never discovers the failure. It therefore keeps excluding the incremental monitor, collects no metrics, and produces no scale-in votes, and the app stays stuck at max workers indefinitely.

Root Cause

The fundamental issue is that _targetScalersInError is process-local state, yet it needs to be visible across multiple workers. This is the same class of problem solved by IScaleMetricsRepository (for metrics) and IConcurrencyStatusRepository (for concurrency snapshots), both of which use blob storage.

Fix

Introduce a new internal ITargetScalerErrorRepository abstraction backed by blob storage, following the established BlobStorageConcurrencyStatusRepository pattern.

New Files

| File | Description |
| --- | --- |
| `ITargetScalerErrorRepository` | Internal interface: `AddAsync`, `GetAsync` |
| `NullTargetScalerErrorRepository` | No-op default (used when storage isn't configured) |
| `BlobStorageTargetScalerErrorRepository` | Blob-backed impl persisting to `scale/{hostId}/targetScalersInError.json` |
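
As a rough illustration of the first two types, here is a minimal sketch. The PR names the `AddAsync` and `GetAsync` methods but does not show their signatures, so the parameter and return types below are assumptions, not the SDK's actual definitions.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Assumed shape: only the method names AddAsync and GetAsync come from the PR text.
internal interface ITargetScalerErrorRepository
{
    // Records that the target scaler with this ID failed (e.g. NotSupportedException).
    Task AddAsync(string scalerId);

    // Returns the IDs of target scalers currently known to be in error.
    Task<IReadOnlyCollection<string>> GetAsync();
}

// No-op default: discards writes and always reports an empty error set.
internal sealed class NullTargetScalerErrorRepository : ITargetScalerErrorRepository
{
    private static readonly IReadOnlyCollection<string> Empty = Array.Empty<string>();

    public Task AddAsync(string scalerId) => Task.CompletedTask;

    public Task<IReadOnlyCollection<string>> GetAsync() => Task.FromResult(Empty);
}
```

With a no-op default, callers never need a null check: a host without storage behaves as if no target scaler has ever failed.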

Modified Files

| File | Description |
| --- | --- |
| `ScaleManager.cs` | Removed static `_targetScalersInError`. Injected `ITargetScalerErrorRepository`. Writes errors via `AddAsync` on `NotSupportedException`. `GetScalersToSample` is now async and accepts the repository directly (reads inside `IsTargetScalingEnabled` block). |
| `ScaleMonitorService.cs` | Injected `ITargetScalerErrorRepository`. Passes repository to `GetScalersToSample`. |
| `WebJobsServiceCollectionExtensions.cs` | Registers `NullTargetScalerErrorRepository` as default |
| `StorageServiceCollectionExtensions.cs` | Registers `BlobStorageTargetScalerErrorRepository` when Azure Storage is configured |

Key Design Decisions

ETag-based optimistic concurrency: AddAsync uses a read-modify-write loop with ETag conditions to prevent lost updates when multiple instances write concurrently. Uses IfMatch for existing blobs and IfNoneMatch = * for new blobs. Retries up to 3 times on 412/409.
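
The read-modify-write loop can be sketched as follows. This is not the PR's code: `FakeBlob` and `ErrorSetWriter` are hypothetical names, and the blob is an in-memory stand-in so the example runs without a storage account. A failed `TryWrite` plays the role of a 412/409 response, and a null `ifMatch` plays the role of `IfNoneMatch = *`. The limit is modelled here as three attempts in total, which is an assumption about how "retries up to 3 times" is counted.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for one blob with ETag semantics.
sealed class FakeBlob
{
    private readonly object _gate = new object();
    private string? _content;   // null means the blob does not exist yet
    private int _version;       // bumped on every successful write; doubles as the ETag

    // Returns the current content and ETag, or (null, null) if the blob does not exist.
    public (string? Content, string? ETag) Read()
    {
        lock (_gate)
        {
            if (_content == null) return (null, null);
            return (_content, _version.ToString());
        }
    }

    // Conditional write. ifMatch == null means "only if the blob does not exist";
    // otherwise the write succeeds only if the ETag is unchanged since the read.
    public bool TryWrite(string content, string? ifMatch)
    {
        lock (_gate)
        {
            bool conditionHolds = ifMatch == null
                ? _content == null
                : _content != null && ifMatch == _version.ToString();
            if (!conditionHolds) return false;
            _content = content;
            _version++;
            return true;
        }
    }
}

static class ErrorSetWriter
{
    private const int MaxAttempts = 3; // assumed interpretation of the retry limit

    // Adds scalerId to the stored set; re-reads and retries if another writer got in first.
    public static bool Add(FakeBlob blob, string scalerId)
    {
        for (int attempt = 0; attempt < MaxAttempts; attempt++)
        {
            var (content, etag) = blob.Read();
            var set = new SortedSet<string>(StringComparer.Ordinal);
            if (content != null)
                set.UnionWith(content.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries));
            set.Add(scalerId);
            if (blob.TryWrite(string.Join(",", set), etag)) return true;
        }
        return false;
    }
}
```

Because each attempt starts from a fresh read, a writer that loses the race merges its ID into the other writer's result instead of overwriting it.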

TTL-based expiry (no ClearAsync): Instead of explicitly clearing errors on startup (which races with other instances), the blob stores a lastUpdated timestamp alongside the scaler set. GetAsync returns empty if the data is older than 10 minutes. This makes recovery automatic — once the error stops occurring (e.g., customer grants Manage claim and restarts), no new AddAsync calls refresh the timestamp, and the data ages out.
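
A minimal sketch of the staleness check, under stated assumptions: the PR mentions a `TargetScalerErrorState` blob format with a lastUpdated timestamp next to the scaler set, but the property names and the `ErrorStateReader` helper below are illustrative only.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Assumed payload shape; property names are not taken from the PR.
sealed class TargetScalerErrorState
{
    public DateTimeOffset LastUpdated { get; set; }
    public HashSet<string> TargetScalersInError { get; set; } = new HashSet<string>();
}

static class ErrorStateReader
{
    public static readonly TimeSpan Ttl = TimeSpan.FromMinutes(10);

    // Returns the recorded scaler IDs, or an empty collection when the state
    // is missing or was last refreshed more than Ttl ago.
    public static IReadOnlyCollection<string> GetFresh(TargetScalerErrorState? state, DateTimeOffset now)
    {
        if (state == null || now - state.LastUpdated > Ttl)
            return Array.Empty<string>();
        return state.TargetScalersInError;
    }
}
```

Passing `now` in as a parameter keeps the check deterministic in tests such as the stale-data and fresh-data cases listed under Testing.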

Internal interface: ITargetScalerErrorRepository is internal — it's an implementation detail, not public API surface.

How It Fixes the Race Condition

Worker A (running Scale Controller) calls GetScaleStatusAsync → catches NotSupportedException → writes scaler ID to blob via AddAsync
Worker B (primary host) calls TakeMetricsSamplesAsync → GetScalersToSample reads blob via repository → sees the error → falls back to incremental monitor → collects metrics → produces scale-in votes
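
The two steps above can be sketched with a shared set standing in for the blob. This is a simplified model, not the actual `GetScalersToSample` logic: `ScalerSelection`, `CrossWorkerDemo`, and the scaler IDs are hypothetical, and the real method takes more into account than is shown here.

```csharp
using System.Collections.Generic;

static class ScalerSelection
{
    // For each function that has a target scaler, decide what to sample:
    // the target scaler normally, or the incremental monitor if the scaler is recorded as failed.
    public static (List<string> Monitors, List<string> TargetScalers) Select(
        IEnumerable<string> functionsWithTargetScaler,
        ICollection<string> scalersInError)
    {
        var monitors = new List<string>();
        var targetScalers = new List<string>();
        foreach (string id in functionsWithTargetScaler)
        {
            if (scalersInError.Contains(id)) monitors.Add(id);
            else targetScalers.Add(id);
        }
        return (monitors, targetScalers);
    }
}

static class CrossWorkerDemo
{
    public static (List<string> Monitors, List<string> TargetScalers) Run()
    {
        // Stands in for the blob that both workers can reach.
        var shared = new HashSet<string>();

        // Worker A: the target scaler threw NotSupportedException, so its ID is recorded.
        shared.Add("serviceBusScaler");

        // Worker B: reads the shared state before choosing what to sample.
        return ScalerSelection.Select(new[] { "serviceBusScaler", "queueScaler" }, shared);
    }
}
```

With a process-local set, worker B would have seen an empty collection and kept treating `serviceBusScaler` as a working target scaler.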

Recovery Flow

TBS (target-based scaling) call fails → NotSupportedException → AddAsync() → blob created/updated with current timestamp
As long as the error persists, each Scale Controller poll refreshes the timestamp
Customer grants Manage claim → restarts app → TBS succeeds → no AddAsync calls → blob ages out after 10 minutes → TBS used again

Testing

Updated existing GetScalersToSample_FallsBackToMonitor_OnTargetScalerError test to use repository
Updated all ScaleManager constructor calls for new parameter
Added GetScalersToSample_WithErrorSet_FiltersCorrectly test
Added GetScalersToSample_WithNullErrorSet_NoFiltering test (uses NullTargetScalerErrorRepository)
Added GetScaleStatusAsync_SecondCall_ReadsPersistedError test
Added OnTimer_CrossWorkerFallback_SharedRepository test
Added GetAsync_StaleData_ReturnsEmpty TTL expiry test
Added GetAsync_FreshData_ReturnsScalers TTL expiry test
Updated E2E blob tests to verify new TargetScalerErrorState format
All 162 scale + public surface tests pass (0 errors, 0 warnings)

Replace static in-process _targetScalersInError HashSet with a new ITargetScalerErrorRepository abstraction backed by blob storage. When a target scaler throws NotSupportedException (e.g. ServiceBus without Manage claim), the scaler ID is persisted to blob storage so all host instances, including the primary host on a different worker, can see the failure and fall back to incremental monitoring.

New files:
- ITargetScalerErrorRepository: public interface (AddAsync, GetAsync, ClearAsync)
- InMemoryTargetScalerErrorRepository: default in-process fallback
- BlobStorageTargetScalerErrorRepository: blob-backed impl using the same IAzureBlobStorageProvider pattern as BlobStorageConcurrencyStatusRepository

Key changes:
- ScaleManager: inject ITargetScalerErrorRepository, write errors via AddAsync, read via GetAsync before GetScalersToSample
- ScaleMonitorService: inject ITargetScalerErrorRepository, read errors before GetScalersToSample, call ClearAsync on startup so customers can recover by restarting after granting Manage claim
- DI: register InMemory as default, BlobStorage when Azure Storage is configured (AddAzureStorageCoreServices/AddAzureStorageScaleServices)

Co-authored-by: Dobby <dobby@microsoft.com>

Unit tests (P0/P1):
- StartAsync_ClearsTargetScalerErrors: verifies ClearAsync called on startup
- OnTimer_CrossWorkerFallback_SharedRepository: simulates multi-worker scenario where worker A writes error, worker B reads it
- InMemoryTargetScalerErrorRepository_Lifecycle: add/get/clear cycle
- InMemoryTargetScalerErrorRepository_AddAsync_Idempotent: double add
- GetScalersToSample_WithErrorSet_FiltersCorrectly: selective exclusion
- GetScalersToSample_WithNullErrorSet_NoFiltering: null safety
- GetScaleStatusAsync_SecondCall_ReadsPersistedError: error persists

E2E tests (P2 - requires Azurite/storage):
- GetBlobPathAsync_ReturnsExpectedPath
- AddAsync_WritesExpectedBlob (round-trip serialization)
- GetAsync_ReadsExpectedBlob
- GetAsync_NoBlob_ReturnsEmpty (404 handling)
- ClearAsync_DeletesBlob
- NoStorageConnection_HandledGracefully (mocked)

Also adds ITargetScalerErrorRepository to PublicSurfaceTests.

Co-authored-by: Dobby <dobby@microsoft.com>

…attern

Align with NullConcurrencyStatusRepository convention:
- Production default is now NullTargetScalerErrorRepository (discards writes, returns empty), the same pattern as NullConcurrencyStatusRepository
- InMemoryTargetScalerErrorRepository moved to test project as a test helper for simulating shared state in cross-worker tests

Co-authored-by: Dobby <dobby@microsoft.com>

…face

- Move GetAsync into GetScalersToSample (inside IsTargetScalingEnabled block)
- Make GetScalersToSample async, accept ITargetScalerErrorRepository directly
- Remove null parameter support; require non-null repository
- Add ETag-based optimistic concurrency with retry loop in AddAsync
- Replace ClearAsync with TTL-based expiry (10-min default)
- Change blob format to include lastUpdated timestamp
- Make ITargetScalerErrorRepository internal
- Remove from public surface area test
- Add TTL expiry tests (stale data returns empty, fresh data returns scalers)
- Fix existing E2E tests to use new TargetScalerErrorState blob format

Co-authored-by: Dobby <dobby@microsoft.com>

- Delete OnTimer_CrossWorkerFallback_SharedRepository (redundant with existing ScaleManagerTests)
- Delete InMemoryTargetScalerErrorRepository_Lifecycle and _AddAsync_Idempotent (test helper tests)
- Simplify AddAsync: drop if-guard, always call set.Add()

Co-authored-by: Dobby <dobby@microsoft.com>

alrod and others added 3 commits April 8, 2026 15:30

alrod mentioned this pull request Apr 9, 2026

Fix target scaler fallback race condition on multi-worker plans #3190

Closed