Add ScaleController V3 integration #2462
    // the durability provider does not support target-based scaling.
    // Create an empty target scaler to avoid exceptions (unless target-based scaling is actually turned on).
    return new NoOpTargetScaler(functionId);
}
unknown: should we "fail fast" on the scalehost if TBS is not supported?
At the very least, I think we should have a log that says that the durability provider doesn't support TBS.
Yeah, I don't disagree, although I don't have a ScaleController-provided ILogger instance handy in this class. Let me try to get some clarity on this edge case from the SC team to figure out how we want to deal with this.
this.PlatformInformationService = platformInformationService ?? throw new ArgumentNullException(nameof(platformInformationService));
this.ResolveAppSettingOptions();
DurableTaskOptions.ResolveAppSettingOptions(this.Options, this.nameResolver);
Made this method static so it can be reused in DurableTaskTriggersScaleProvider.
this.TraceHelper = new EndToEndTraceHelper(logger, this.Options.Tracing.TraceReplayEvents);
this.LifeCycleNotificationHelper = lifeCycleNotificationHelper ?? this.CreateLifeCycleNotificationHelper();
this.durabilityProviderFactory = this.GetDurabilityProviderFactory(this.Options, logger, orchestrationServiceFactories);
this.durabilityProviderFactory = GetDurabilityProviderFactory(this.Options, logger, orchestrationServiceFactories);
Also made this static so it may be re-used in GetDurabilityProviderFactory.
…what gets recognized in ScV3's logging pipeline
{
    if (options == null)
    {
        throw new InvalidOperationException($"{nameof(options)} must be set before resolving app settings.");
Shouldn't these be simple ArgumentNullExceptions?
I now see that you copied this from DurableTaskExtension.cs. It made sense for it to be InvalidOperationException in that case since these were not parameters. Now that you've made them into parameters, the correct thing to do is make this ArgumentNullException.
I don't mind making this change but I want to clarify that this method isn't just copied from DurableTaskExtension.cs, it's been refactored out of DurableTaskExtension.cs so that it can be used here (for TBS purposes) and in the DurableTaskExtension.cs. With that in mind, do you still want me to change InvalidOperationException for ArgumentNullException, or keep it as is?
I could also just duplicate the code in case we want different exception types in different locations.
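One way to reconcile the two positions in this thread is sketched below: the shared static helper validates its parameters with ArgumentNullException, while a caller that holds the same values as instance state keeps InvalidOperationException. This is only an illustration; `AppSettingResolver`, `Resolve`, and `ExtensionLike` are hypothetical stand-ins and not the PR's actual types or signatures.

```csharp
using System;

// Hypothetical stand-in for the refactored static helper discussed above.
internal static class AppSettingResolver
{
    internal static string Resolve(string options, Func<string, string> nameResolver)
    {
        // The values are parameters here, so null is an argument error.
        if (options == null) throw new ArgumentNullException(nameof(options));
        if (nameResolver == null) throw new ArgumentNullException(nameof(nameResolver));
        return nameResolver(options);
    }
}

// Hypothetical caller that keeps the same values as instance state.
internal sealed class ExtensionLike
{
    private readonly string options;
    private readonly Func<string, string> nameResolver;

    internal ExtensionLike(string options, Func<string, string> nameResolver)
    {
        this.options = options;
        this.nameResolver = nameResolver;
    }

    internal string ResolveAppSettingOptions()
    {
        // Here a missing value is invalid object state, not a bad argument.
        if (this.options == null)
        {
            throw new InvalidOperationException($"{nameof(this.options)} must be set before resolving app settings.");
        }

        return AppSettingResolver.Resolve(this.options, this.nameResolver);
    }
}

internal static class Program
{
    private static void Main()
    {
        try
        {
            AppSettingResolver.Resolve(null, name => name);
        }
        catch (ArgumentNullException ex)
        {
            // ParamName identifies which argument was null.
            Console.WriteLine($"Argument error for parameter: {ex.ParamName}");
        }

        try
        {
            new ExtensionLike(null, name => name).ResolveAppSettingOptions();
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine($"State error: {ex.Message}");
        }
    }
}
```

With this split, the shared helper does not need to be duplicated just to get different exception types at different call sites.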
/// </summary>
/// <param name="builder">The <see cref="IWebJobsBuilder"/> to configure.</param>
/// <returns>Returns the provided <see cref="IWebJobsBuilder"/>.</returns>
public static IWebJobsBuilder AddDurableScaleForTrigger(this IWebJobsBuilder builder, TriggerMetadata triggerMetadata)
naming nit: I wonder if a method name like AddScalerForDurableTriggers() would be a bit clearer. We're not really adding "Durable scale" to anything.
I agree the name is weird (and there's no "durable scale" operation, whatever that means :-) ), I'm just adhering to the naming conventions in the ScV3 repo.
Other builder methods include:
- AddTimerScaleForTrigger
- AddServiceBusScaleForTrigger
- AddCosmosDbScaleForTrigger
- AddAzureStorageQueuesScaleForTrigger
etc.
I could definitely rename this to AddDurableFunctionsScaleForTrigger if that helps. However, I think the convention of Add<Extension>ScaleForTrigger means that the naming will be strange no matter what.
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTargetScaler.cs
scaleControllerLog = $"Error: target worker count for '{this.functionId}' was negative: '{this.functionId}'." +
    "An exception was thrown." + metricsLog;
this.logger.LogError(scaleControllerLog);
throw new Exception(scaleControllerLog);
I'm just now noticing this, but it's bad practice to throw Exception. Consider using InvalidOperationException instead.
Actually, doesn't this result in double-exception and double logging? If I'm reading the code correctly, we catch this exception a few lines down and make another call to this.logger.LogError. It seems like we need to reconsider the design here, because using exceptions as control flow like this is also bad practice.
Yeah you're right, it's getting double logged; good call out.
A part of me wants a try-catch surrounding this whole operation so we can log any exceptions that arise, and to include in those logs the contextual information needed to debug this. That's what I do in the outer catch block.
Do you think it would suffice to remove the LogError when the target worker count is negative, and simply throw an exception? Alternatively, I could refactor the outer try-catch so that it ends right when the target worker count is calculated (i.e., line 71).
At a minimum we should remove the first LogError and rely on the one inside the catch-block. But I wonder if we even need that one since we're throwing. Wouldn't it be expected that some other code catches these exceptions and logs them as errors?
@alrod, do you know what the expected behavior is for logging and throwing exceptions? Do we log the error or will it be handled in the scale controller?
Still looking for an update on this.
numWorkersToRequest is calculated as:
int numWorkersToRequest = (int)Math.Max(activityWorkers, orchestratorWorkers);
How can numWorkersToRequest be negative?
If you want to throw an exception from GetScaleResultAsync, I would just throw InvalidOperationException/ArgumentOutOfRangeException without writing the error log; Scale Controller will take care of the error logging.
Thanks @alrod. I agree that the current code cannot produce negative numbers; my check just ensures that remains true if the code is ever refactored. It may be a bit too defensive, so I don't mind removing it if others think it's overkill :).
@cgillum: I've removed the double logging on this commit: 0533264
Yeah, I think it's overkill. I vote to remove the code, or at least change the exception to InvalidOperationException.
Changed the exception type to InvalidOperationException: 7f47657
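As a rough sketch of where this thread landed: the defensive check throws InvalidOperationException and writes no log of its own, leaving error logging to whoever catches the exception (the Scale Controller, per the comment above). The `TargetWorkerSketch` class, its method, and the ceiling-division formula for the per-queue worker counts are assumptions made for illustration; only the `Math.Max` line comes from the discussion.

```csharp
using System;

// Hypothetical sketch; not the PR's DurableTaskTargetScaler implementation.
internal static class TargetWorkerSketch
{
    internal static int GetTargetWorkerCount(
        int workItemQueueLength,
        int controlQueueMessages,
        int maxConcurrentActivities,
        int maxConcurrentOrchestrators)
    {
        // Assumed formula: workers needed to drain each queue at full concurrency.
        double activityWorkers = Math.Ceiling(workItemQueueLength / (double)maxConcurrentActivities);
        double orchestratorWorkers = Math.Ceiling(controlQueueMessages / (double)maxConcurrentOrchestrators);

        int numWorkersToRequest = (int)Math.Max(activityWorkers, orchestratorWorkers);

        // Defensive check: throw a specific exception type and do not log here.
        if (numWorkersToRequest < 0)
        {
            throw new InvalidOperationException(
                $"Target worker count was negative: {numWorkersToRequest}.");
        }

        return numWorkersToRequest;
    }
}

internal static class Program
{
    private static void Main()
    {
        Console.WriteLine(TargetWorkerSketch.GetTargetWorkerCount(25, 4, 10, 5));

        try
        {
            // Negative inputs cannot occur with real queue lengths; used only to exercise the guard.
            TargetWorkerSketch.GetTargetWorkerCount(-15, -10, 10, 5);
        }
        catch (InvalidOperationException ex)
        {
            // The caller is the single place where the error is logged.
            Console.WriteLine($"Caller logs once: {ex.Message}");
        }
    }
}
```

Keeping the log call out of the guard means each failure shows up once in the caller's logs rather than two or three times.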
…tionExtensions.cs Co-authored-by: Chris Gillum <cgillum@microsoft.com>
cgillum left a comment
Added a few more comments. I think the only one I consider blocking is the one about double (or maybe even triple) logging errors.
// target worker count should never be negative
if (numWorkersToRequest < 0)
DurableTaskTriggerMetrics? metrics = null;
To fall back to regular scale for Netherite and MSSQL you need to throw NotSupportedException:
Do you have this implemented here?
Can we add a test for this?
I updated the exception here.
Yes, we can add a test.
this was incorporated in my latest commit
@bachuv I'm looking for resolution on two open comment threads in DurableTaskTargetScaler.cs. I'll sign off when those are resolved.
this.config.Options.HubName));

#endif
#if FUNCTIONS_V3_OR_GREATER
In SC we have "FUNCTIONS_V3_OR_GREATER", right?
I believe so. SC has TFM .net6.0 and FUNCTIONS_V3_OR_GREATER targets netcoreapp3.1, which is "older" than .net6.0. So yes, I believe this code triggers in SC.
internal static IWebJobsBuilder AddDurableScaleForTrigger(this IWebJobsBuilder builder, TriggerMetadata triggerMetadata)
{
    // this segment adheres to the followings pattern: https://github.com/Azure/azure-sdk-for-net/pull/38673/files
    builder.Services.AddSingleton(serviceProvider => new DurableTaskTriggersScaleProvider(serviceProvider.GetService<IOptions<DurableTaskOptions>>(), serviceProvider.GetService<INameResolver>(), serviceProvider.GetService<ILoggerFactory>(), serviceProvider.GetService<IEnumerable<IDurabilityProviderFactory>>(), triggerMetadata));
We can pass serviceProvider as parameter to the DurableTaskTriggersScaleProvider ctor and resolve all required services inside the ctor
Thanks @alrod. I think the code did this originally, but as per this comment we decided not to resolve services in the constructor: #2462 (comment)
{
    // this segment adheres to the followings pattern: https://github.com/Azure/azure-sdk-for-net/pull/38673/files
    builder.Services.AddSingleton(serviceProvider => new DurableTaskTriggersScaleProvider(serviceProvider.GetService<IOptions<DurableTaskOptions>>(), serviceProvider.GetService<INameResolver>(), serviceProvider.GetService<ILoggerFactory>(), serviceProvider.GetService<IEnumerable<IDurabilityProviderFactory>>(), triggerMetadata));
    builder.Services.AddSingleton<IScaleMonitorProvider>(serviceProvider => serviceProvider.GetRequiredService<DurableTaskTriggersScaleProvider>());
There was a bug in this pattern, please change according to:
Azure/azure-sdk-for-net#38756
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTargetScaler.cs
var metricsLog = $"Metrics: workItemQueueLength={metrics?.WorkItemQueueLength}. controlQueueLengths={metrics?.ControlQueueLengths}. " +
    $"maxConcurrentOrchestrators={this.MaxConcurrentOrchestrators}. maxConcurrentActivities={this.MaxConcurrentActivities}";
var errorLog = $"Error: target worker count for '{this.functionId}' resulted in exception." + metricsLog + $"Exception: {ex}";
throw new Exception(errorLog, ex);
nit: we're effectively double-logging the exception here by including the same exception information in both errorLog and as an inner exception ex. This can be confusing for someone trying to debug scale controller logs.
You're right, I think I was being overly verbose here. I've updated the code to remove the string representation of the error and instead just rely on the "innerException" parameter: f5c6d3f
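The fix described here can be sketched roughly as follows: the wrapping exception's message carries only the contextual metrics, and the original failure travels as the inner exception instead of being stringified into the message as well. `ComputeTarget` and the metrics string are hypothetical placeholders, and the sketch uses InvalidOperationException for the wrapper, as suggested elsewhere in this review, rather than the plain Exception shown in the diff.

```csharp
using System;

internal static class Program
{
    // Hypothetical placeholder for the target worker count calculation.
    private static int ComputeTarget() => throw new DivideByZeroException("simulated failure");

    private static void Main()
    {
        try
        {
            try
            {
                ComputeTarget();
            }
            catch (Exception ex)
            {
                // Context goes in the message; the original exception is attached
                // once, as the inner exception, and is not repeated in the text.
                string metricsLog = "Metrics: workItemQueueLength=3. maxConcurrentActivities=10.";
                throw new InvalidOperationException(
                    "Error: target worker count resulted in exception. " + metricsLog, ex);
            }
        }
        catch (InvalidOperationException wrapped)
        {
            Console.WriteLine(wrapped.Message);
            Console.WriteLine(wrapped.InnerException?.GetType().Name);
        }
    }
}
```

A logger that receives the wrapped exception will typically print the inner exception on its own, so the details still appear exactly once.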
…caler.cs Co-authored-by: Chris Gillum <cgillum@microsoft.com>
internal static class ScaleUtils
{
#if !FUNCTIONS_V1
    internal static IScaleMonitor GetScaleMonitor(DurabilityProvider durabilityProvider, string functionId, FunctionName functionName, string? connectionName, string hubName)
Can we also add unit tests for ScaleUtils.GetTargetScaler and ScaleUtils.GetScaleMonitor?
[Theory]
[Trait("Category", PlatformSpecificHelpers.TestCategory)]
[InlineData(true)]
[InlineData(false)]
public async void ScaleHostE2ETest(bool isTbsEnabled)
{
    Action<ScaleOptions> configureScaleOptions = (scaleOptions) =>
    {
        scaleOptions.IsTargetScalingEnabled = isTbsEnabled;
        scaleOptions.MetricsPurgeEnabled = false;
        scaleOptions.ScaleMetricsMaxAge = TimeSpan.FromMinutes(4);
        scaleOptions.IsRuntimeScalingEnabled = true;
        scaleOptions.ScaleMetricsSampleInterval = TimeSpan.FromSeconds(1);
    };
    using (FunctionsV2HostWrapper host = (FunctionsV2HostWrapper)TestHelpers.GetJobHost(this.loggerProvider, nameof(this.ScaleHostE2ETest), enableExtendedSessions: false, configureScaleOptions: configureScaleOptions))
    {
        await host.StartAsync();

        IScaleStatusProvider scaleManager = host.InnerHost.Services.GetService<IScaleStatusProvider>();
        var client = await host.StartOrchestratorAsync(nameof(TestOrchestrations.FanOutFanIn), 50, this.output);
        var status = await client.WaitForCompletionAsync(this.output, timeout: TimeSpan.FromSeconds(400));
        var scaleStatus = await scaleManager.GetScaleStatusAsync(new ScaleStatusContext());
        await host.StopAsync();
        Assert.Equal(OrchestrationRuntimeStatus.Completed, status?.RuntimeStatus);

        // We inspect the Host's logs for evidence that the Host is correctly sampling our scaling requests.
        // the expected logs depend on whether TBS is enabled or not
        var expectedSubString = "scale monitors to sample";
        if (isTbsEnabled)
        {
            expectedSubString = "target scalers to sample";
        }

        bool containsExpectedLog = this.loggerProvider.GetAllLogMessages().Select(p => p.FormattedMessage ?? "").Any(p => p.Contains(expectedSubString));
        Assert.True(containsExpectedLog);
    }
}
@alrod: an E2E test, using the pre-existing DF testing infra, has been added here ^
I realize this is not exactly like the E2E tests you suggested (doing so would have taken more time due to how this codebase is structured and the DF initialization boilerplate), but I think this, in addition to the unit tests we already have, should provide sufficient test coverage.
This PR
This PR allows Scale Controller V3 to receive TBS worker requests for DF. This PR builds upon the changes in #2452, so please review that PR first.
Since this PR builds upon the changes in #2452, that is also the branch we're targeting.
Main Changes
The main objective of this PR is the implementation of DurableTaskTriggersScaleProvider, which the Scale Controller Host will utilize to make scaling decisions.
This class implements IScaleMonitorProvider and ITargetScalerProvider by delegating to the GetTargetScaler and GetScaleMonitor implementations of #2452.
When constructing DurableTaskTriggersScaleProvider, we receive the DI services provider from the Scale Controller Host and the Sync Triggers payload for the function app. We read the user's config (hubName, backend type, etc.) from the Sync Triggers payload.
Scale Controller V3 will obtain an instance of this class by invoking private static IWebJobsBuilder AddDurableScaleForTrigger(this IWebJobsBuilder builder, TriggerMetadata triggerMetadata) during initialization.
Finally, note that DurableTaskTriggersScaleProvider will not be constructed in the Functions Host. It is constructed only by the ScaleController Host through a call to AddDurableScaleForTrigger.
Secondary Changes
For code re-use, this PR removes several scaling utility methods defined in DurableTaskExtension.cs and puts them inside a new file: Scale/ScaleUtils.cs. That way, we're able to re-use these utilities in the DurableTaskTriggersScaleProvider implementation.
We also add #nullable enable in all new files.
References:
Issue describing the changes in this PR
resolves #issue_for_this_pr
Pull request checklist
pending_docs.md, release_notes.md, /src/Worker.Extensions.DurableTask/AssemblyInfo.cs