Add target-based scaling support for Azure Storage#2452
Conversation
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTriggerMetrics.cs
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTargetScaler.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/DurableTaskJobHostConfigurationExtensions.cs
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/AzureStorageDurabilityProvider.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTargetScaler.cs
Outdated
Show resolved
Hide resolved
…der.cs Co-authored-by: Varshitha Bachu <vabachu@microsoft.com>
|
Question on this point:
I'm worried that updating our documentation won't be enough. Although Functions V3 has been deprecated, I worry that some users may still try to use it for one reason or another. Can we consider making some kind of proactive announcement that this new version of the DF extension no longer supports Functions V3, and perhaps include the expected error message for improved SEO when customers inevitably run into this? |
|
Also, @davidmrdavid, please fix the merge conflicts so that we can be confident about the changes we're reviewing in WebJobs.Extensions.DurableTask.csproj. |
|
@cgillum: thanks for the heads up on merge conflicts, I missed those. They've been addressed.
Yep, I agree with this. I'll obtain the exact error message tomorrow-ish and we can look to include that in some proactive announcement, as well as in this PR and maybe even our release notes. |
cgillum
left a comment
There was a problem hiding this comment.
Added some questions/comments.
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskMetricsProvider.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskScaleMonitor.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskMetricsProvider.cs
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTriggerMetrics.cs
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/AzureStorageDurabilityProvider.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/AzureStorageDurabilityProvider.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTargetScaler.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTargetScaler.cs
Outdated
Show resolved
Hide resolved
| { | ||
| scaleControllerLog = "Tried to request a negative worker count." + scaleControllerLog; | ||
| this.logger.LogError(scaleControllerLog); | ||
| throw new Exception(scaleControllerLog); |
There was a problem hiding this comment.
Is throwing an exception the behavior we want? Or should we fail more gracefully?
There was a problem hiding this comment.
I can't immediately think of a more graceful solution here. My hunch is that if the scaling logic is requesting a negative number, then something is terribly wrong and we should fail exceptionally so the ScaleController can take over and decide how to deal with it. Do you have any suggestions?
There was a problem hiding this comment.
So you're saying this code is executed by the scale controller and not in the Functions host? Are there some doc comments we can add to make this clear? I think that also helps clarify my question about the direct use of ILogger.
There was a problem hiding this comment.
My understanding is that ScaleController V3 calls this directly, although @bachuv just pointed me to a comment that suggests that this code may also be accessed in the runtime driven scale case as well, which would go through the Host. So perhaps there's two cases: one mediated by the Functions Host, and one where it's accessed directly by the Scale Controller.
I'll look to get a conclusive answer on this. For now, I've pushed a series of comments reflecting my current understanding. I won't merge until we have confirmation from the ScaleController team.
There was a problem hiding this comment.
I spoke with Alexey today to get clarity on this.
The caller of this code depends on whether the user is running with or without a VNET.
If running without a VNET, the ScaleController process calls this directly.
If running with a VNET, then the customer needs to enable runtime-driven scaling and in that case it is the Functions Host which calls this code for us.
I think this may complicate things a little bit, especially around throwing exceptions. We should confirm what's the preferred error handling behavior in the VNET case.
There was a problem hiding this comment.
@alrod: if runtime-driven scaling is enabled, and the DF Extension encounters an error in calculating it's target worker count (for a example, if it wrongly determines that it needs a negative target worker count), is safe to throw an exception for the Host to catch or should we handle the failure more gracefully?
There was a problem hiding this comment.
Is the code even make since to exists?
int numWorkersToRequest = (int)Math.Max(activityWorkers, orchestratorWorkers);
Is possible activityWorkers and orchestratorWorkers are < 0 ?
cgillum
left a comment
There was a problem hiding this comment.
These changes look good to me. My only ask would be to consider adding more comments to make clear the cases where code is run in the Scale Controller host instead of running in the Functions host since that has big implications in terms of what context we're allowed to assume.
| using DurableTask.AzureStorage.Tracking; | ||
| using DurableTask.Core; | ||
| using Microsoft.Extensions.Logging; | ||
| using Microsoft.WindowsAzure.Storage; |
There was a problem hiding this comment.
Is the target-based scaling support still based on the old deprecated AS backend?
This PR
This PR adds target-based scaling support for the AzureStorage backend.
Changes
The requirements were relatively simple: we needed to update our WebJobs SDK dependency and then implement and expose some new interfaces:
ITargetScalerandITargetScalerProvider. This PR provides such an implementation for the Azure Storage backend.Backwards compatibility
These new interfaces are not available in older versions of WebJobs, meaning that we need to use the C# preprocessor to prevent older versions of the Functions runtime from trying to load those types. This has worked for Functions V1 and V2. However, our bundles-mandated TFMs cannot differentiate between Functions V3 and V4.
Once we release this change, the Durable Extension will no longer be compatible with Functions V3. Therefore, prior to merge, we should update our documentation to warn users of this change.
Merge pre-requisites
In order to prevent apps w/ extension bundles from auto-upgrading to an incompatible version of the DF Extension, there will be a patch release to Functions V3 to stop automatic bundle upgrades. We need to wait for that patch release to complete.
Testing
This PR adds unit tests to validate that we request the expected number of workers. We use the same inputs as in our internal design document.
We have also done local manual testing validating that the Host recognizes these changes and considers them when issuing scale votes.
Equivalent PRs for other backends
References
Work done in collaboration with @bachuv.
Pull request checklist
pending_docs.mdrelease_notes.md/src/Worker.Extensions.DurableTask/AssemblyInfo.cs