-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ValueTask doesn't inline well #21153
Comments
It seems awaiting |
@benaadams, can you share a small code sample where it's not getting inlined? In a quick one I just threw together, the hot path with ValueTask/ValueTaskAwaiter for a synchronously completed ValueTask is entirely inlined. |
@stephentoub its this async call chain https://github.com/mgravell/protobuf-net/blob/async/src/TheAwaitingGame/Program.cs#L144-L152 Perhaps because its |
Also was looking at netcoreapp1.1 rather than 2.0 if something has been resolved. |
Might be different issue than not inlining; have annotated the methods; testing all completed paths.
|
Why is the expectation that ValueTask will be faster than a completed task, or am I misunderstanding the concern? I fully expect that there are overheads associated with a ValueTask, e.g. it stores multiple fields that are returned, it has an extra branch, etc. This is why when I've been asked whether every method should just return a |
If |
"faster" is very subjective. FromResult() allocates a |
Has there been any benchmarks done where the GC is under pressure from other work while still benchmarking the Task vs ValueTask? I imagine that your GC will have a max throughput on your system, and if possible, it might be worth offloading this to the stack. Would that be a valid assumption to make? |
That's the whole reason |
I suspected that would be the answer, but it still strikes me that most use cases are within this particular area, where It's possible this test is just not very illustrative of when |
GC impact compounds and often doesn't show up in full force in microbenchmarks. The more work you have going on in a real scenario, the more allocation affects things globally in the app.
There are definitely many of them, which is why |
(All that said, if there are ways we can improve the performance of |
Heuristic inlining cost estimates for Heuristics should be more or less the same for 1.1 vs 2.0. Inlining of methods that take or return value classes should provide some extra benefit because quite often caller-side struct copy/init code can be removed. But the jit's optimizer is not always able to clean these things up and so the inliner can't yet anticipate that it will do so. |
There are 4 methods that don't like inlining; am investigating best approach. Fighting with the cli and publishing 2.0 atm 😢 , so I can look at the breakout |
Yep, changing the ValueTask type to a reference type instead brings the inlining back to life: protobuf-net/protobuf-net#235 |
@stephentoub It will great if you write some blogs about ValueTask at https://blogs.msdn.microsoft.com/pfxteam/. Just a humble request 😀 |
Can see why its doing what its doing, some of this stuff is pretty big 😄 |
Will take a while to resolve this as isn't striaghtforward; @karelz want to assign to me? @stephentoub has 'd my ego into it |
Mwahahahaha |
@stephentoub I refined and polished the tests, and I still think there's some big wins yet to be had on
However, the |
What do you get for await'ing an already completed
Great. Let's find them. I never said there weren't places that could be improved. |
@stephentoub that's in the gist linked - I snipped those rows here |
I see
Have you profiled? For example, are you sure the overhead is coming from the await rather than from, say, the overhead of invoking an |
I'll add a version that returns the same cached |
with:
I get (net462):
and obviously, I can't cache all the Edit: I went back to
if I had to speculate - some kind of beforefieldinit cost making it cheaper (more inlineable) to do |
Having said that, I think it can be improved, but it needs some juggling |
@benaadams I guess the point I'm getting at here is: you and me - we can hack around this and make our code work fine as though it wasn't an issue; I'm thinking of "regular .NET Jo" - who is just going to use |
Right. Another part is the new .Net conventional wisdom to "async all the things", which I know I've been guilty of spreading the past couple of years as sort of a simplified version of the what we probably should be telling people. The problem is, AsyncMethodBuilder is an expensive construct. If you have a simple REST API that calls into your data access layer, you might be 5 or 6 async calls deep just to get a value that is almost always cached on the heap someplace, for instance. The decision whether or not to use an async path is not always straightforward and has the same trade-offs as any other performance decision, which you can make if you know how it works. But we've kinda told people "if this call might do I/O even once, make it async". I've had quite a few perf tuning sessions end at the realization that AsyncMethodBuilder is the most expensive thing(s) on the call stack. |
@mattnischan AsyncMethodBuilder is already better in 2.0 dotnet/coreclr#9471 though VaueTask doesn't have these changes; also the code paths are a little more complicated for ValueTask as its a superposition of things. Few things to watch for:
The async cost isn't very high; but if a function isn't doing much then it will dominate. While I do think the await performance can be still improved; I must also marvel at the masterpiece of engineering it is. If I make it better, it must be recognised its only through standing on the shoulders of giants. |
Wondering if a level of exception handling can be removed to allow inlining of AsyncMethodBuilder.Start https://github.com/dotnet/coreclr/issues/11156 |
Have tried lots of variations; some gains though for increased complexity. I don't know if it can be done in coreclr/fx (short of Jit magic); I think the state machine roslyn generates might need tweaking instead 😢 |
Best pattern is local function for async; passing the parameters to it via params rather than locals, as that currently allocates (example pattern and issue pattern: dotnet/roslyn#18946)
However, I think some aggressive inlines need to be added to cope with large structs (which is where this started with |
@mgravell random things make benchmarking comparisons hard 😛 |
Closing this as forcing the inlines doesn't seem to improve anything really 😢 |
Pity :(, thanks for pushing on it though @benaadams! |
Its within noise (+/-) for decimal and large structs and there are much larger factors which dominate; e.g. statemachine (create->start->movenext) but I haven't been able to find a general automatic solution. |
My best guess currently on what causes the drop is the async doing more work (obviously); but then that extra work being on memory; whereas the non-async and manual fast-path are working off registers. An inlining doesn't solve it because the state is a very large struct beyond enregistering size and shape. |
Looks to be fixed in 2.1; ValueTask is no longer an outlier
|
Out of curiosity, someone knows what fixed the issue? |
None of the method builder or methods on ValueTask seem to inline very well
Using a ternary check will give better results than a direct
await
e.g.
(task.IsCompleted) ? task.Result : await task;
Note: should be
IsCompletedSuccessfully
though theTask.Status
path doesn't inline either https://github.com/dotnet/corefx/issues/16745#issuecomment-292728779Should the no-wait sync path be marked with aggressive inlines?
Or is the state machine going to automatically add more weight than the ternary check?
/cc @stephentoub @AndyAyersMS
The text was updated successfully, but these errors were encountered: