Skip to content

Change new OOProc orchestration trigger protocol from JSON to protobuf#2125

Merged
cgillum merged 2 commits intodevfrom
cgillum/ooproc-protobuf
Apr 18, 2022
Merged

Change new OOProc orchestration trigger protocol from JSON to protobuf#2125
cgillum merged 2 commits intodevfrom
cgillum/ooproc-protobuf

Conversation

@cgillum
Copy link
Copy Markdown
Member

@cgillum cgillum commented Mar 30, 2022

Overview

This PR changes the orchestration trigger payload protocol from JSON strings to protobuf strings. This includes both the request that gets sent to the language worker and the response that comes back. This only applies to the newer OOProc languages, .NET Isolated and Java.

The other big change in this PR is in the way we interpret errors from the OOProc worker. Previously, we made no attempt to interpret error messages. With this PR, we make a point of parsing out the error message and the stack trace details from the RpcExceptions that we get when an activity function fails.

Why are we making this JSON --> protobuf change?

The old JSON schema for the orchestration trigger payload was problematic from the perspective of SDK code reuse. The newer OOProc SDKs are designed to work with a sidecar over gRPC. Having JSON as the protocol for Functions would then require use to either translate the JSON into protobuf or to translate the JSON into the SDK objects directly. This translation would be a lot of one-off code in each language SDK just to support functions. I deemed it much more economical to instead change the Durable extension to use the protobuf format instead so that language SDKs wouldn't need to do any protocol translation.

This change has the potential of also increasing efficiency since protobuf is much more compact and faster to serialize/deserialize compared to JSON. The downside is that it's much more difficult to debug if unexpected data is received at either end. This tradeoff seemed acceptable, however.

Why are we making this error parsing change?

One of the features I'm working on for .NET Isolated and Java is the ability to create custom retry handlers. In order to create a reasonable experience, the error information from an activity needs to be in a format that a developer can reason about. This was not previously possible since the RpcException given to the extension contained unstructured error information. As part of Azure/durabletask#681, we updated DT.Core to support structured failure details. In this PR, we parse the exception to construct a new TaskFailureDetails payload, which is a structure that defines an error in a language neutral way. Having this payload available to orchestrations enables developers to write much simpler error handling logic.

Pull request checklist

  • My changes do not require documentation changes (this is for unreleased .NET Isolated and Java support)
    • Otherwise: Documentation PR is ready to merge and referenced in pending_docs.md
  • My changes should not be added to the release notes for the next release (no impact for existing workloads)
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • I have added all required tests (Unit tests, E2E tests)
  • My changes do not require any extra work to be leveraged by OutOfProc SDKs
  • My changes do not change the version of the WebJobs.Extensions.DurableTask package
    • Otherwise: major or minor version updates are reflected in /src/Worker.Extensions.DurableTask/AssemblyInfo.cs

@cgillum cgillum added the out-of-proc Impacts non-.NET languages (e.g. JavaScript, Python, or PowerShell) which execute out-of-process label Mar 30, 2022
@cgillum cgillum added this to the v2.7.0 milestone Mar 30, 2022
@cgillum cgillum self-assigned this Mar 30, 2022
@cgillum
Copy link
Copy Markdown
Member Author

cgillum commented Mar 30, 2022

Not sure why the commit history for this PR looks so weird. I may try to recreate my clone of this repo after this PR to see if that resolves it. I don't think it should cause any problems as long as I'm doing a squash-merge.

@cgillum cgillum marked this pull request as ready for review March 31, 2022 00:10
@cgillum cgillum requested a review from davidmrdavid March 31, 2022 00:10
Copy link
Copy Markdown
Member Author

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding some context comments.


<ItemGroup>
<PackageReference Include="Microsoft.Azure.DurableTask.AzureStorage" Version="1.10.*" />
<PackageReference Include="Microsoft.Azure.DurableTask.Core" Version="2.8.0" />
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the time of writing, this package version is currently only on myget.org. We'll need to publish it to nuget.org prior to releasing this extension version.

Assert.Equal("World", status?.Input);
Assert.Equal("Hello, World!", status?.Output);
Assert.Equal("World", (string)status?.Input);
Assert.Equal("Hello, World!", (string)status?.Output);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into an issue in an earlier iteration that caused this test to fail. Unfortunately, xunit doesn't know how to display JValue objects in the error messages (it always just shows "[]") so the test output was unfortunately useless. Casting the JValue properties to string, however, fixes the display problem.

@davidmrdavid
Copy link
Copy Markdown
Contributor

I noticed some of the tests here are failing. should I start reviewing this or would you prefer I wait until the tests are "green"?

@cgillum
Copy link
Copy Markdown
Member Author

cgillum commented Apr 4, 2022

@davidmrdavid feel free to start reviewing. The previous iteration was green and my latest commit was pretty small. I'll investigate what the failures are in the meantime.

@cgillum
Copy link
Copy Markdown
Member Author

cgillum commented Apr 4, 2022

Looks like the CI failed because of an Azure Storage SDK deadlock timeout expiration 😬. Re-running it got things green.

Copy link
Copy Markdown
Contributor

@davidmrdavid davidmrdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a first pass through this PR, left some questions. Thanks!

Comment on lines +70 to +72
// The legacy behavior where the DTFx orchestration context schedules the activity results in an input that's
// wrapped in an array. This is unfortunately quite inefficient because it requires us to do an extra serialization
// round-trip in order to correctly interpret the input data.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do anything to deprecate this legacy inputsAreArrays behavior?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunately embedded deep into DurableTask.Core, which makes changing it more difficult since it impacts all languages. If we remove the DurableTask.Core dependency from the .NET Isolated SDK, then we can remove this unnecessary overhead (similar to how Java can bypass this stuff).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. My 2 cents is that it would be great if we could document this blocker in the code, as I don't think the impact is immediately obvious :)

Copy link
Copy Markdown
Contributor

@davidmrdavid davidmrdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the answers so far @cgillum. My only blockers at this point are:
(1) documenting why inputsAsArray is hard to deprecate in code
(2) looking into fixing the \nStack-parsing edge case

Thanks!

@cgillum cgillum force-pushed the cgillum/ooproc-protobuf branch from 3393784 to e3a907c Compare April 18, 2022 20:27
@cgillum
Copy link
Copy Markdown
Member Author

cgillum commented Apr 18, 2022

I somehow messed up my local branch history and had to do a force-push of the latest iteration. The only notable change was related to this comment:

looking into fixing the \nStack-parsing edge case

I changed .IndexOf to .LastIndexOf to handle the case where the exception message has \nStack in it for some reasons, which would otherwise mess up the parsing.

Regarding this:

documenting why inputsAsArray is hard to deprecate in code

I don't recall exactly what the state of the code was when you requested this documentation. I feel the current set of comments are good enough, though. I don't think it's appropriate to go as far as talking about why it is/isn't hard to deprecate something in DTFx, though. That kind of comment belongs in DTFx itself.

I hope you're good with this latest update, assuming all tests pass!

Copy link
Copy Markdown
Contributor

@davidmrdavid davidmrdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@cgillum cgillum merged commit aec7a33 into dev Apr 18, 2022
@cgillum cgillum deleted the cgillum/ooproc-protobuf branch April 18, 2022 22:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

out-of-proc Impacts non-.NET languages (e.g. JavaScript, Python, or PowerShell) which execute out-of-process

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants