
Fix target scaler fallback race condition via shared storage#3191

Merged
alrod merged 5 commits into dev from alrod/fix/target-scaler-shared-storage on Apr 29, 2026

alrod commented Apr 8, 2026


Problem

On Elastic Premium plans, when a target scaler throws NotSupportedException (e.g., ServiceBus without Manage claim), the fallback to incremental scale monitoring fails if the Scale Controller and primary host run on different workers.

_targetScalersInError was a static in-process HashSet<string>. Only the admin/host/scale/status code path (called by the Scale Controller) invokes target scalers and populates this set. The ScaleMonitorService.TakeMetricsSamplesAsync() path (running on the primary host) reads the set but never populates it. When these run on different workers, the primary host's _targetScalersInError stays empty — it never discovers the failure, excludes the incremental monitor, collects no metrics, and produces no scale-in votes. The app stays stuck at max workers indefinitely.

Root Cause

The fundamental issue is that _targetScalersInError is process-local shared state that needs to be visible across multiple workers — the same class of problem solved by IScaleMetricsRepository (for metrics) and IConcurrencyStatusRepository (for concurrency snapshots), both of which use blob storage.

Fix

Introduce a new internal ITargetScalerErrorRepository abstraction backed by blob storage, following the established BlobStorageConcurrencyStatusRepository pattern.

New Files

| File | Description |
| --- | --- |
| ITargetScalerErrorRepository | Internal interface: AddAsync, GetAsync |
| NullTargetScalerErrorRepository | No-op default (used when storage isn't configured) |
| BlobStorageTargetScalerErrorRepository | Blob-backed impl persisting to scale/{hostId}/targetScalersInError.json |
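
As a rough illustration of the abstraction in the table above, the following Python sketch models the two-method contract and the no-op default. It is not the C# code from this PR: the method names `add` and `get`, their signatures, and the scaler ID are assumptions chosen for the example.

```python
import asyncio
from typing import Protocol


class TargetScalerErrorRepository(Protocol):
    """Hypothetical Python analogue of ITargetScalerErrorRepository."""

    async def add(self, scaler_id: str) -> None: ...

    async def get(self) -> set[str]: ...


class NullTargetScalerErrorRepository:
    """No-op default: discards writes and always reports no scalers in error."""

    async def add(self, scaler_id: str) -> None:
        return None  # nothing is persisted

    async def get(self) -> set[str]:
        return set()


async def demo() -> set[str]:
    repo: TargetScalerErrorRepository = NullTargetScalerErrorRepository()
    await repo.add("sb-queue")  # hypothetical scaler ID, silently dropped
    return await repo.get()


print(asyncio.run(demo()))
```

With the null repository registered, callers never need a null check: reads simply come back empty, which matches the "no storage configured" behaviour described in the table.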

Modified Files

| File | Description |
| --- | --- |
| ScaleManager.cs | Removed static _targetScalersInError. Injected ITargetScalerErrorRepository. Writes errors via AddAsync on NotSupportedException. GetScalersToSample is now async and accepts the repository directly (reads inside IsTargetScalingEnabled block). |
| ScaleMonitorService.cs | Injected ITargetScalerErrorRepository. Passes repository to GetScalersToSample. |
| WebJobsServiceCollectionExtensions.cs | Registers NullTargetScalerErrorRepository as default |
| StorageServiceCollectionExtensions.cs | Registers BlobStorageTargetScalerErrorRepository when Azure Storage is configured |

Key Design Decisions

ETag-based optimistic concurrency: AddAsync uses a read-modify-write loop with ETag conditions to prevent lost updates when multiple instances write concurrently. It uses IfMatch for existing blobs and IfNoneMatch = * for new blobs, and retries up to 3 times on HTTP 412 (Precondition Failed) or 409 (Conflict).
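
The read-modify-write loop can be sketched against an in-memory stand-in for a blob. Everything here is illustrative: `FakeBlob`, `PreconditionFailed` and `add_scaler_error` are hypothetical names, not Azure SDK types. The sketch treats "up to 3 times" as three attempts in total, returns `False` when they are exhausted (the PR does not say what happens then), and stores a plain JSON list without the timestamp for brevity.

```python
import json
from dataclasses import dataclass
from typing import Optional

MAX_ATTEMPTS = 3  # assumption: three attempts in total


class PreconditionFailed(Exception):
    """Stands in for an HTTP 412/409 response from the storage service."""


@dataclass
class FakeBlob:
    """In-memory stand-in for one blob with an ETag (not a real SDK type)."""
    content: Optional[str] = None
    etag: int = 0

    def read(self):
        return self.content, self.etag

    def write(self, content, if_match=None, if_none_match=False):
        if if_none_match and self.content is not None:
            raise PreconditionFailed("blob already exists")
        if if_match is not None and if_match != self.etag:
            raise PreconditionFailed("etag mismatch")
        self.content = content
        self.etag += 1  # every successful write changes the ETag


def add_scaler_error(blob, scaler_id):
    """Add scaler_id to the persisted set without losing concurrent updates."""
    for _ in range(MAX_ATTEMPTS):
        content, etag = blob.read()
        scalers = set(json.loads(content)) if content is not None else set()
        scalers.add(scaler_id)
        payload = json.dumps(sorted(scalers))
        try:
            if content is None:
                blob.write(payload, if_none_match=True)  # create only if absent
            else:
                blob.write(payload, if_match=etag)  # update only if unchanged
            return True
        except PreconditionFailed:
            continue  # another writer won the race: re-read and retry
    return False
```

The important property is that a failed conditional write triggers a fresh read, so the retry merges into whatever the competing writer stored rather than overwriting it.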

TTL-based expiry (no ClearAsync): Instead of explicitly clearing errors on startup (which races with other instances), the blob stores a lastUpdated timestamp alongside the scaler set. GetAsync returns empty if the data is older than 10 minutes. This makes recovery automatic — once the error stops occurring (e.g., customer grants Manage claim and restarts), no new AddAsync calls refresh the timestamp, and the data ages out.
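
A minimal sketch of the read-side TTL check follows. The `lastUpdated` field name and the 10-minute window come from the description above; the `scalers` field name, the ISO 8601 timestamp encoding, and the function name are assumptions made for the example.

```python
import json
from datetime import datetime, timedelta, timezone

ERROR_TTL = timedelta(minutes=10)  # expiry window described in the PR


def read_scalers_in_error(blob_json, now):
    """Return the persisted scaler IDs, or an empty set if the data is stale."""
    if blob_json is None:
        return set()  # no blob has been written yet
    state = json.loads(blob_json)
    last_updated = datetime.fromisoformat(state["lastUpdated"])
    if now - last_updated > ERROR_TTL:
        return set()  # stale: treat every target scaler as healthy again
    return set(state["scalers"])


t0 = datetime(2026, 4, 8, 12, 0, tzinfo=timezone.utc)
blob = json.dumps({"lastUpdated": t0.isoformat(), "scalers": ["sb-queue"]})
print(read_scalers_in_error(blob, t0 + timedelta(minutes=5)))
print(read_scalers_in_error(blob, t0 + timedelta(minutes=11)))
```

Because expiry is decided by the reader, no instance ever has to delete or reset the blob, which is what removes the startup race that an explicit clear would have.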

Internal interface: ITargetScalerErrorRepository is internal — it's an implementation detail, not public API surface.

How It Fixes the Race Condition

  1. Worker A (running Scale Controller) calls GetScaleStatusAsync → catches NotSupportedException → writes scaler ID to blob via AddAsync
  2. Worker B (primary host) calls TakeMetricsSamplesAsync → GetScalersToSample reads blob via repository → sees the error → falls back to incremental monitor → collects metrics → produces scale-in votes
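
The two steps above can be modelled with a shared in-memory repository standing in for the blob. This is a simplified sketch, not the ScaleManager logic itself: the function names, the scaler IDs, the rule "sample a monitor unless a healthy target scaler covers it", and the use of `NotImplementedError` in place of NotSupportedException are all assumptions for illustration.

```python
import asyncio


class SharedErrorRepository:
    """In-memory stand-in for the blob-backed repository seen by all workers."""

    def __init__(self):
        self._scalers = set()

    async def add(self, scaler_id):
        self._scalers.add(scaler_id)

    async def get(self):
        return set(self._scalers)


async def get_scale_status(repo, target_scalers):
    """Worker A: invoke each target scaler and record the unsupported ones."""
    for scaler_id, scaler in target_scalers.items():
        try:
            scaler()
        except NotImplementedError:  # stands in for NotSupportedException
            await repo.add(scaler_id)


async def get_scalers_to_sample(repo, monitors, target_scaler_ids):
    """Worker B: keep a monitor unless a healthy target scaler covers it."""
    in_error = await repo.get()
    return [m for m in monitors if m not in target_scaler_ids or m in in_error]


def failing_scaler():
    raise NotImplementedError("Manage claim missing")


async def demo():
    repo = SharedErrorRepository()  # shared between both "workers"
    await get_scale_status(repo, {"sb-queue": failing_scaler, "eventhub": lambda: 1})
    monitors = ["sb-queue", "eventhub", "storage-queue"]
    return await get_scalers_to_sample(repo, monitors, {"sb-queue", "eventhub"})


print(asyncio.run(demo()))
```

In the sketch, `sb-queue` falls back to its incremental monitor because worker A recorded the failure, `eventhub` stays on target-based scaling, and `storage-queue` has no target scaler and is always sampled.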

Recovery Flow

  1. TBS (target-based scaling) call fails → NotSupportedException → AddAsync() → blob created/updated with current timestamp
  2. As long as the error persists, each Scale Controller poll refreshes the timestamp
  3. Customer grants Manage claim → restarts app → TBS succeeds → no AddAsync calls → blob ages out after 10 minutes → TBS used again
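
The recovery flow can be traced with a small simulation on a fake clock measured in minutes. The one-minute poll interval, the `ErrorState` class and the scaler ID are assumptions for the example; only the 10-minute expiry comes from the PR.

```python
TTL_MINUTES = 10  # expiry window described in the PR


class ErrorState:
    """Models the blob: a set of scaler IDs plus a last-updated minute."""

    def __init__(self):
        self.scalers = set()
        self.last_updated = None

    def add(self, scaler_id, now):
        self.scalers.add(scaler_id)
        self.last_updated = now  # every write refreshes the timestamp

    def get(self, now):
        if self.last_updated is None or now - self.last_updated > TTL_MINUTES:
            return set()
        return set(self.scalers)


def simulate(fixed_at, end):
    """Map each minute to whether 'sb-queue' is still reported in error."""
    state = ErrorState()
    timeline = {}
    for now in range(end + 1):  # assumed: one Scale Controller poll per minute
        if now < fixed_at:
            state.add("sb-queue", now)  # target scaler still throws
        timeline[now] = "sb-queue" in state.get(now)
    return timeline


timeline = simulate(fixed_at=5, end=20)
print([minute for minute, in_error in timeline.items() if in_error][-1])
```

With the fix applied at minute 5, the last refresh happens at minute 4, so the entry is still reported through minute 14 and disappears from minute 15 onward.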

Testing

  • Updated existing GetScalersToSample_FallsBackToMonitor_OnTargetScalerError test to use repository
  • Updated all ScaleManager constructor calls for new parameter
  • Added GetScalersToSample_WithErrorSet_FiltersCorrectly test
  • Added GetScalersToSample_WithNullErrorSet_NoFiltering test (uses NullTargetScalerErrorRepository)
  • Added GetScaleStatusAsync_SecondCall_ReadsPersistedError test
  • Added OnTimer_CrossWorkerFallback_SharedRepository test
  • Added GetAsync_StaleData_ReturnsEmpty TTL expiry test
  • Added GetAsync_FreshData_ReturnsScalers TTL expiry test
  • Updated E2E blob tests to verify new TargetScalerErrorState format
  • All 162 scale + public surface tests pass (0 errors, 0 warnings)

alrod and others added 3 commits April 8, 2026 15:30
Replace static in-process _targetScalersInError HashSet with a new
ITargetScalerErrorRepository abstraction backed by blob storage.

When a target scaler throws NotSupportedException (e.g. ServiceBus
without Manage claim), the scaler ID is persisted to blob storage
so all host instances — including the primary host on a different
worker — can see the failure and fall back to incremental monitoring.

New files:
- ITargetScalerErrorRepository: public interface (AddAsync, GetAsync, ClearAsync)
- InMemoryTargetScalerErrorRepository: default in-process fallback
- BlobStorageTargetScalerErrorRepository: blob-backed impl using
  the same IAzureBlobStorageProvider pattern as
  BlobStorageConcurrencyStatusRepository

Key changes:
- ScaleManager: inject ITargetScalerErrorRepository, write errors via
  AddAsync, read via GetAsync before GetScalersToSample
- ScaleMonitorService: inject ITargetScalerErrorRepository, read errors
  before GetScalersToSample, call ClearAsync on startup so customers
  can recover by restarting after granting Manage claim
- DI: register InMemory as default, BlobStorage when Azure Storage
  is configured (AddAzureStorageCoreServices/AddAzureStorageScaleServices)

Co-authored-by: Dobby <dobby@microsoft.com>
Unit tests (P0/P1):
- StartAsync_ClearsTargetScalerErrors: verifies ClearAsync called on startup
- OnTimer_CrossWorkerFallback_SharedRepository: simulates multi-worker
  scenario where worker A writes error, worker B reads it
- InMemoryTargetScalerErrorRepository_Lifecycle: add/get/clear cycle
- InMemoryTargetScalerErrorRepository_AddAsync_Idempotent: double add
- GetScalersToSample_WithErrorSet_FiltersCorrectly: selective exclusion
- GetScalersToSample_WithNullErrorSet_NoFiltering: null safety
- GetScaleStatusAsync_SecondCall_ReadsPersistedError: error persists

E2E tests (P2 - requires Azurite/storage):
- GetBlobPathAsync_ReturnsExpectedPath
- AddAsync_WritesExpectedBlob (round-trip serialization)
- GetAsync_ReadsExpectedBlob
- GetAsync_NoBlob_ReturnsEmpty (404 handling)
- ClearAsync_DeletesBlob
- NoStorageConnection_HandledGracefully (mocked)

Also adds ITargetScalerErrorRepository to PublicSurfaceTests.

Co-authored-by: Dobby <dobby@microsoft.com>
…attern

Align with NullConcurrencyStatusRepository convention:
- Production default is now NullTargetScalerErrorRepository (discards
  writes, returns empty) — same pattern as NullConcurrencyStatusRepository
- InMemoryTargetScalerErrorRepository moved to test project as a test
  helper for simulating shared state in cross-worker tests

Co-authored-by: Dobby <dobby@microsoft.com>
Comment thread src/Microsoft.Azure.WebJobs.Host/Scale/ScaleManager.cs Outdated
Comment thread src/Microsoft.Azure.WebJobs.Host/Scale/ScaleManager.cs Outdated
Comment thread src/Microsoft.Azure.WebJobs.Host/Scale/ScaleMonitorService.cs Outdated
Comment thread src/Microsoft.Azure.WebJobs.Host/Scale/ITargetScalerErrorRepository.cs Outdated
Comment thread test/Microsoft.Azure.WebJobs.Host.UnitTests/PublicSurfaceTests.cs Outdated
…face

- Move GetAsync into GetScalersToSample (inside IsTargetScalingEnabled block)
- Make GetScalersToSample async, accept ITargetScalerErrorRepository directly
- Remove null parameter support; require non-null repository
- Add ETag-based optimistic concurrency with retry loop in AddAsync
- Replace ClearAsync with TTL-based expiry (10-min default)
- Change blob format to include lastUpdated timestamp
- Make ITargetScalerErrorRepository internal
- Remove from public surface area test
- Add TTL expiry tests (stale data returns empty, fresh data returns scalers)
- Fix existing E2E tests to use new TargetScalerErrorState blob format

Co-authored-by: Dobby <dobby@microsoft.com>
Comment thread test/Microsoft.Azure.WebJobs.Host.UnitTests/Scale/ScaleMonitorServiceTests.cs Outdated
Comment thread test/Microsoft.Azure.WebJobs.Host.UnitTests/Scale/ScaleManagerTests.cs Outdated
- Delete OnTimer_CrossWorkerFallback_SharedRepository (redundant with existing ScaleManagerTests)
- Delete InMemoryTargetScalerErrorRepository_Lifecycle and _AddAsync_Idempotent (test helper tests)
- Simplify AddAsync: drop if-guard, always call set.Add()

Co-authored-by: Dobby <dobby@microsoft.com>
@alrod alrod requested a review from mathewc April 29, 2026 00:54