Skip to content

Add ScaleController V3 integration#2462

Merged
davidmrdavid merged 35 commits intodajusto/tbsfrom
dajusto/scv3
Oct 6, 2023
Merged

Add ScaleController V3 integration#2462
davidmrdavid merged 35 commits intodajusto/tbsfrom
dajusto/scv3

Conversation

@davidmrdavid
Copy link
Copy Markdown
Contributor

@davidmrdavid davidmrdavid commented May 2, 2023

This PR

This PR allows Scale Controller V3 to receive TBS worker requests for DF. This PR builds upon the changes in #2452 so please review that PR first.

Since this PR builds upon the changes in #2452, that' is also the branch we're targeting.

Main Changes

The main objective of this PR is the implementation of DurableTaskTriggersScaleProvider, which the Scale Controller Host will utilize to make scaling decisions.

This class implements IScaleMonitorProvider and ITargetScalerProvider by delegating to the GetTargetScaler and GetScaleMonitor implementations of #2452.

When constructing DurableTaskTriggersScaleProvider, we receive the DI services provider from the Scale Controller Host and the Sync Triggers payload for the function app. We read the user's config (hubName, backend type, etc) from the Sync Triggers payload.

Scale Controller V3 will obtain an instance of this class by invoking private static IWebJobsBuilder AddDurableScaleForTrigger(this IWebJobsBuilder builder, TriggerMetadata triggerMetadata) during initialization.

Finally, note that DurableTaskTriggersScaleProvider will not be constructed in the Functions Host. It is constructed only by the ScaleController Host through a call to AddDurableScaleForTrigger.

Secondary Changes

For code re-use, this PR removes a several scaling utility methods defined in DurableTaskExtension.cs and puts them inside a new file: Scale/ScaleUtils.cs. That way, we're able to re-use these utilites in the
DurableTaskTriggersScaleProvider implementation.

We also add #nullable enable in all new files.

References:

Issue describing the changes in this PR

resolves #issue_for_this_pr

Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation PR is ready to merge and referenced in pending_docs.md
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • I have added all required tests (Unit tests, E2E tests)
  • My changes do not require any extra work to be leveraged by OutOfProc SDKs
    • Otherwise: That work is being tracked here: #issue_or_pr_in_each_sdk
  • My changes do not change the version of the WebJobs.Extensions.DurableTask package
    • Otherwise: major or minor version updates are reflected in /src/Worker.Extensions.DurableTask/AssemblyInfo.cs
  • My changes do not add EventIds to our EventSource logs
    • Otherwise: Ensure the EventIds are within the supported range in our existing Windows infrastructure. You may validate this with a deployed app's telemetry. You may also extend the range by completing a PR such as this one.

Comment on lines +92 to +95
// the durability provider does not support target-based scaling.
// Create an empty target scaler to avoid exceptions (unless target-based scaling is actually turned on).
return new NoOpTargetScaler(functionId);
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unknown: should we "fail fast" on the scalehost if TBS is not supported?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least, I think we should have a log that says that the durability provider doesn't support TBS.

Copy link
Copy Markdown
Contributor Author

@davidmrdavid davidmrdavid Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't disagree, although I don't have a ScaleController-provided ILogger instance handy in this class. Let me try to get some clarity on this edge case from the Sc team to figure out how we want to deal with this.

@davidmrdavid davidmrdavid marked this pull request as ready for review May 11, 2023 23:51
@davidmrdavid davidmrdavid changed the title [WIP] Add ScaleController V3 integration Add ScaleController V3 integration May 11, 2023
Comment on lines 153 to 154
this.PlatformInformationService = platformInformationService ?? throw new ArgumentNullException(nameof(platformInformationService));
this.ResolveAppSettingOptions();
DurableTaskOptions.ResolveAppSettingOptions(this.Options, this.nameResolver);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made this method static so it may be re-used in DurableTaskTriggersScaleProvider

this.TraceHelper = new EndToEndTraceHelper(logger, this.Options.Tracing.TraceReplayEvents);
this.LifeCycleNotificationHelper = lifeCycleNotificationHelper ?? this.CreateLifeCycleNotificationHelper();
this.durabilityProviderFactory = this.GetDurabilityProviderFactory(this.Options, logger, orchestrationServiceFactories);
this.durabilityProviderFactory = GetDurabilityProviderFactory(this.Options, logger, orchestrationServiceFactories);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also made this static so it may be re-used in GetDurabilityProviderFactory.

Comment on lines +92 to +95
// the durability provider does not support target-based scaling.
// Create an empty target scaler to avoid exceptions (unless target-based scaling is actually turned on).
return new NoOpTargetScaler(functionId);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least, I think we should have a log that says that the durability provider doesn't support TBS.

{
if (options == null)
{
throw new InvalidOperationException($"{nameof(options)} must be set before resolving app settings.");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Souldn't these be simple ArgumentNullException?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now see that you copied this from DurableTaskExtension.cs. It made sense for it to be InvalidOperationException in that case since these were not parameters. Now that you've made them into parameters, the correct thing to do is make this ArgumentNullException.

Copy link
Copy Markdown
Contributor Author

@davidmrdavid davidmrdavid Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind making this change but I want to clarify that this method isn't just copied from DurableTaskExtension.cs, it's been refactored out of DurableTaskExtension.cs so that it can be used here (for TBS purposes) and in the DurableTaskExtension.cs. With that in mind, do you still want me to change InvalidOperationException for ArgumentNullException, or keep it as is?

I could also just duplicate the code in case we want different exception types in different locations.

/// </summary>
/// <param name="builder">The <see cref="IWebJobsBuilder"/> to configure.</param>
/// <returns>Returns the provided <see cref="IWebJobsBuilder"/>.</returns>
public static IWebJobsBuilder AddDurableScaleForTrigger(this IWebJobsBuilder builder, TriggerMetadata triggerMetadata)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming nit: I wonder if a method name like AddScalerForDurableTriggers() would be a bit clearer. We're not really adding "Durable scale" to anything.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the name is weird (and there's no a "durable scale" operation, whatever that means :-) ), I'm just adhering to the naming conventions in the ScV3 repo.

Other builder methods include:

  • AddTimerScaleForTrigger
  • AddServiceBusScaleForTrigger
  • AddCosmosDbScaleForTrigger
  • AddAzureStorageQueuesScaleForTrigger
    etc.

I could definitely rename this to AddDurableFunctionsScaleForTrigger if that helps. However, I think the convention of Add<Extension>ScaleForTrigger means that the naming will be strange no matter what.

scaleControllerLog = $"Error: target worker count for '{this.functionId}' was negative: '{this.functionId}'." +
"An exception was thrown." + metricsLog;
this.logger.LogError(scaleControllerLog);
throw new Exception(scaleControllerLog);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just now noticing this, but it's bad practice to throw Exception. Consider using InvalidOperationException instead.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, doesn't this result in double-exception and double logging? If I'm reading the code correctly, we catch this exception a few lines down and make another call to this.logger.LogError. It seems like we need to reconsider the design here, because using exceptions as control flow like this is also bad practice.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you're right, it's getting double logged; good call out.

A part of me wants a try-catch surrounding this whole operation so we can log any exceptions that arise, and to include in those logs the contextual information needed to debug this. That's what I do in the outer catch block.

Do you think it would suffice to remove the LogError when the target worker count is negative, and simply throw an exception? Alternatively, I could refactor the outer try-catch so that it ends right when the target worker count is calculated (i.e line 71)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a minimum we should remove the first LogError and rely on the one inside the catch-block. But I wonder if we even need that one since we're throwing. Wouldn't it be expected that some other code catches these exceptions and logs them as errors?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alrod, do you know what the expected behavior is for logging and throwing exceptions? Do we log the error or will it be handled in the scale controller?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looking for an update on this.

Copy link
Copy Markdown
Member

@alrod alrod Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numWorkersToRequest is calcuated as:
int numWorkersToRequest = (int)Math.Max(activityWorkers, orchestratorWorkers);

How numWorkersToRequest can be negative?

If you want to throw an exception from GetScaleResultAsync I would just thorwing InvalidOperationException/ArgumentOutOfRangeException without writing the error log - Scale Controller will take care about the error logging.

Something like:
https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/servicebus/Microsoft.Azure.WebJobs.Extensions.ServiceBus/src/Listeners/ServiceBusTargetScaler.cs#L94

Copy link
Copy Markdown
Contributor Author

@davidmrdavid davidmrdavid Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alrod. I agree that the current code cannot provide negative numbers, my check is just ensuring that remains true if the code is ever refactored. It may a bit too defensive, so I don't mind removing it if others think it's overkill :).

@cgillum: I've removed the double logging on this commit: 0533264

Copy link
Copy Markdown
Member

@alrod alrod Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think it's an overkill. I vote to remove the code. Or at least change the exception to InvalidOperationException.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the exception type to InvalidOperationException: 7f47657

Copy link
Copy Markdown
Member

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few more comments. I think the only one I consider blocking is the one about double (or maybe even triple) logging errors.

scaleControllerLog = $"Error: target worker count for '{this.functionId}' was negative: '{this.functionId}'." +
"An exception was thrown." + metricsLog;
this.logger.LogError(scaleControllerLog);
throw new Exception(scaleControllerLog);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a minimum we should remove the first LogError and rely on the one inside the catch-block. But I wonder if we even need that one since we're throwing. Wouldn't it be expected that some other code catches these exceptions and logs them as errors?


// target worker count should never be negative
if (numWorkersToRequest < 0)
DurableTaskTriggerMetrics? metrics = null;
Copy link
Copy Markdown
Member

@alrod alrod Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fall back to regular scale for Netherite and MSSQL you need to throw NonSupportedException:

azure-sdk-for-net/sdk/servicebus/Microsoft.Azure.WebJobs.Extensions.ServiceBus/src/Listeners/ServiceBusTargetScaler.cs at main · Azure/azure-sdk-for-net (github.com)

Do you have this implemented for here?
Can we add a test for this?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the exception here.

Yes, we can add a test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was incorporated in my latest commit

@bachuv bachuv requested a review from cgillum October 3, 2023 17:15
@cgillum
Copy link
Copy Markdown
Member

cgillum commented Oct 3, 2023

@bachuv I'm looking for resolution on two open comment threads in DurableTaskTargetScaler.cs. I'll sign off when those are resolved.

this.config.Options.HubName));

#endif
#if FUNCTIONS_V3_OR_GREATER
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In SC we have "FUNCTIONS_V3_OR_GREATER", right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe so. SC has TFM .net6.0 and FUNCTIONS_V3_OR_GREATER targets netcoreapp3.1, which is "older" than .net6.0. So yes, I believe this code triggers in SC.

internal static IWebJobsBuilder AddDurableScaleForTrigger(this IWebJobsBuilder builder, TriggerMetadata triggerMetadata)
{
// this segment adheres to the followings pattern: https://github.com/Azure/azure-sdk-for-net/pull/38673/files
builder.Services.AddSingleton(serviceProvider => new DurableTaskTriggersScaleProvider(serviceProvider.GetService<IOptions<DurableTaskOptions>>(), serviceProvider.GetService<INameResolver>(), serviceProvider.GetService<ILoggerFactory>(), serviceProvider.GetService<IEnumerable<IDurabilityProviderFactory>>(), triggerMetadata));
Copy link
Copy Markdown
Member

@alrod alrod Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can pass serviceProvider as parameter to the DurableTaskTriggersScaleProvider ctor and resolve all required services inside the ctor

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alrod. I think the code did this originally, but as per this comment we decided not to resolve services in the constructor: #2462 (comment)

{
// this segment adheres to the followings pattern: https://github.com/Azure/azure-sdk-for-net/pull/38673/files
builder.Services.AddSingleton(serviceProvider => new DurableTaskTriggersScaleProvider(serviceProvider.GetService<IOptions<DurableTaskOptions>>(), serviceProvider.GetService<INameResolver>(), serviceProvider.GetService<ILoggerFactory>(), serviceProvider.GetService<IEnumerable<IDurabilityProviderFactory>>(), triggerMetadata));
builder.Services.AddSingleton<IScaleMonitorProvider>(serviceProvider => serviceProvider.GetRequiredService<DurableTaskTriggersScaleProvider>());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a bug in this pattern, please change according to:
Azure/azure-sdk-for-net#38756

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, addressed here: 1975b41

var metricsLog = $"Metrics: workItemQueueLength={metrics?.WorkItemQueueLength}. controlQueueLengths={metrics?.ControlQueueLengths}. " +
$"maxConcurrentOrchestrators={this.MaxConcurrentOrchestrators}. maxConcurrentActivities={this.MaxConcurrentActivities}";
var errorLog = $"Error: target worker count for '{this.functionId}' resulted in exception." + metricsLog + $"Exception: {ex}";
throw new Exception(errorLog, ex);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we're effectively double-logging the exception here my including the same exception information in both errorLog and as an inner exception ex. This can be confusing for someone trying to debug scale controller logs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I think I was being overly verbose here. I've updated the code to remove the string representation of the error and instead just rely on the "innerException" parameter: f5c6d3f

internal static class ScaleUtils
{
#if !FUNCTIONS_V1
internal static IScaleMonitor GetScaleMonitor(DurabilityProvider durabilityProvider, string functionId, FunctionName functionName, string? connectionName, string hubName)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add unit tests for ScaleUtils.GetTargetScaler and ScaleUtils.GetScaleMonitor

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added them: 635aa75

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and expanded them here: 3ad0f28

Comment on lines +141 to +177
[Theory]
[Trait("Category", PlatformSpecificHelpers.TestCategory)]
[InlineData(true)]
[InlineData(false)]
public async void ScaleHostE2ETest(bool isTbsEnabled)
{
Action<ScaleOptions> configureScaleOptions = (scaleOptions) =>
{
scaleOptions.IsTargetScalingEnabled = isTbsEnabled;
scaleOptions.MetricsPurgeEnabled = false;
scaleOptions.ScaleMetricsMaxAge = TimeSpan.FromMinutes(4);
scaleOptions.IsRuntimeScalingEnabled = true;
scaleOptions.ScaleMetricsSampleInterval = TimeSpan.FromSeconds(1);
};
using (FunctionsV2HostWrapper host = (FunctionsV2HostWrapper)TestHelpers.GetJobHost(this.loggerProvider, nameof(this.ScaleHostE2ETest), enableExtendedSessions: false, configureScaleOptions: configureScaleOptions))
{
await host.StartAsync();

IScaleStatusProvider scaleManager = host.InnerHost.Services.GetService<IScaleStatusProvider>();
var client = await host.StartOrchestratorAsync(nameof(TestOrchestrations.FanOutFanIn), 50, this.output);
var status = await client.WaitForCompletionAsync(this.output, timeout: TimeSpan.FromSeconds(400));
var scaleStatus = await scaleManager.GetScaleStatusAsync(new ScaleStatusContext());
await host.StopAsync();
Assert.Equal(OrchestrationRuntimeStatus.Completed, status?.RuntimeStatus);

// We inspect the Host's logs for evidence that the Host is correctly sampling our scaling requests.
// the expected logs depend on whether TBS is enabled or not
var expectedSubString = "scale monitors to sample";
if (isTbsEnabled)
{
expectedSubString = "target scalers to sample";
}

bool containsExpectedLog = this.loggerProvider.GetAllLogMessages().Select(p => p.FormattedMessage ?? "").Any(p => p.Contains(expectedSubString));
Assert.True(containsExpectedLog);
}
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alrod: an E2E test, using the pre-existing DF testing infra, has been added here ^

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is not exactly like the E2E tests you suggested (doing so would have taken more time due to how this codebase is structured and the DF initialization boilerplate), but I think this, in addition to the unit tests we already have, should have sufficient testing coverage.

@davidmrdavid davidmrdavid merged commit 731d809 into dajusto/tbs Oct 6, 2023
@davidmrdavid davidmrdavid deleted the dajusto/scv3 branch October 6, 2023 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants