ValueTask doesn't inline well #21153

benaadams · 2017-04-17T12:31:04Z

None of the method builder or methods on ValueTask seem to inline very well

Inlinee: System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1[System.Decimal].Create - value class System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1<!0> ()
Fail Reason: Native estimate for function size exceeds threshold.

Inlinee: System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1[System.Decimal].get_Task - instance value class System.Threading.Tasks.ValueTask`1<!0> ()
Fail Reason: Native estimate for function size exceeds threshold.

Inlinee: System.Runtime.CompilerServices.ValueTaskAwaiter`1[System.Decimal].GetResult - instance !0 ()
Fail Reason: Native estimate for function size exceeds threshold.

Inliner: TheAwaitingGame.Order+d__5.MoveNext - instance void ()
Inlinee: instance value class System.Threading.Tasks.ValueTask`1 ()
Fail Reason: Native estimate for function size exceeds threshold.

Using a ternary check will give better results than a direct await

e.g. (task.IsCompleted) ? task.Result : await task;

Note: should be IsCompletedSuccessfully though the Task.Status path doesn't inline either https://github.com/dotnet/corefx/issues/16745#issuecomment-292728779

Should the no-wait sync path be marked with aggressive inlines?

Or is the state machine going to automatically add more weight than the ternary check?

/cc @stephentoub @AndyAyersMS

The text was updated successfully, but these errors were encountered:

benaadams · 2017-04-17T13:13:01Z

It seems awaiting .AsTask() rather than the ValueTask gives similar results to the ternary check; other than allocations shooting up protobuf-net/protobuf-net#234

stephentoub · 2017-04-17T13:35:07Z

@benaadams, can you share a small code sample where it's not getting inlined? In a quick one I just threw together, the hot path with ValueTask/ValueTaskAwaiter for a synchronously completed ValueTask is entirely inlined.

benaadams · 2017-04-17T13:39:38Z

@stephentoub its this async call chain https://github.com/mgravell/protobuf-net/blob/async/src/TheAwaitingGame/Program.cs#L144-L152

Perhaps because its ValueTask<decimal>?

benaadams · 2017-04-17T13:41:15Z

Also was looking at netcoreapp1.1 rather than 2.0 if something has been resolved.

benaadams · 2017-04-17T13:53:55Z

Might be different issue than not inlining; have annotated the methods; testing all completed paths.

Type              |                 Method |       Mean |     Op/s | Allocated |
----------------- |----------------------- |-----------:|---------:|----------:|
Straight Call     |                   Sync | 12.4577 us | 80271.89 |      0 kB |
Completed Task    |              TaskAsync | 19.5186 us | 51233.10 |  24.25 kB |
ValueTask         |         ValueTaskAsync | 24.1389 us | 41426.95 |      0 kB |
Ternary Check     |  ValueTaskCheckedAsync | 20.5660 us | 48623.86 |      0 kB |
Manual Await Skip |       HandCrankedAsync | 16.3431 us | 61187.91 |      0 kB |

stephentoub · 2017-04-17T13:57:57Z

Might be different issue

Why is the expectation that ValueTask will be faster than a completed task, or am I misunderstanding the concern? I fully expect that there are overheads associated with a ValueTask, e.g. it stores multiple fields that are returned, it has an extra branch, etc. This is why when I've been asked whether every method should just return a ValueTask<T> instead of a Task<T>, my answer has never been yes and rather it depends on how often an asynchronous completion is expected, whether all T results could be cached, etc., as well as partially why there's no non-generic ValueTask (i.e. it's no better than just returning a completed task).

mattnischan · 2017-04-17T14:02:05Z

If Task<T>.FromResult() is faster than ValueTask<T>, in what cases would it be useful to use ValueTask<T>?

stephentoub · 2017-04-17T14:03:28Z

If Task.FromResult() is faster than ValueTask, in what cases would it be useful to use ValueTask?

"faster" is very subjective. FromResult() allocates a Task<T>, so while it may be "faster" to get the object then and there, you're building up GC debt that'll be paid in spades later.

rogeralsing · 2017-04-17T14:08:57Z

Has there been any benchmarks done where the GC is under pressure from other work while still benchmarking the Task vs ValueTask?

I imagine that your GC will have a max throughput on your system, and if possible, it might be worth offloading this to the stack.
I imagine that it will also have a positive effect on long tail latency, as the value task perf will be more consistent with less variations over time.
e.g. in Kestrel webserver.

Would that be a valid assumption to make?

stephentoub · 2017-04-17T14:12:45Z

Would that be a valid assumption to make?

That's the whole reason ValueTask<T> exists: to minimize allocation in cases where it's expected that an asynchronous method will frequently complete synchronously with arbitrary result values (e.g. T is an Int32 and possible results span the Int32 range such that it would be infeasible to cache a Task<T> for all of them). Common example would be Stream.ReadAsync: if we had it to do over today, I expect we'd have defined this to return ValueTask<int> rather than Task<int>.

mattnischan · 2017-04-17T14:13:32Z

I suspected that would be the answer, but it still strikes me that most use cases are within this particular area, where ValueTask<T> is a representation of some kind of mostly-always-fast-but-occasionally-not type of value, from, say, a data access layer with caching or something like that.

It's possible this test is just not very illustrative of when ValueTask<T> normally wins, i.e. when you're hitting the GC a lot harder than this benchmark is.

stephentoub · 2017-04-17T14:17:54Z

It's possible this test is just not very illustrative

GC impact compounds and often doesn't show up in full force in microbenchmarks. The more work you have going on in a real scenario, the more allocation affects things globally in the app.

most use cases are within this particular area

There are definitely many of them, which is why ValueTask<T> was introduced. But there are also common cases where it's not valuable, e.g. it's pretty common to have an asynchronous operation that returns a bool (imagine an asynchronous enumerable with a MoveNextAsync), and for that, both values (true and false) can be cached in Task<bool>s such that there's no benefit to using a ValueTask<bool>.

stephentoub · 2017-04-17T14:19:39Z

(All that said, if there are ways we can improve the performance of ValueTask<T>, of course we want to do that. @benaadams, I'm looking at you 😄)

AndyAyersMS · 2017-04-17T16:34:58Z

Heuristic inlining cost estimates for ValueTask<T> where T is a value class will be dependent on T in various complex ways (sizeof(T), whether the jit thinks a T is promotable, etc). So decimal vs some other type may show inlining differences.

Heuristics should be more or less the same for 1.1 vs 2.0.

Inlining of methods that take or return value classes should provide some extra benefit because quite often caller-side struct copy/init code can be removed. But the jit's optimizer is not always able to clean these things up and so the inliner can't yet anticipate that it will do so.

benaadams · 2017-04-17T16:58:33Z

There are 4 methods that don't like inlining; am investigating best approach.

Fighting with the cli and publishing 2.0 atm 😢 , so I can look at the breakout

mattnischan · 2017-04-17T17:46:43Z

Yep, changing the ValueTask type to a reference type instead brings the inlining back to life: protobuf-net/protobuf-net#235

imranbaloch · 2017-04-17T19:01:02Z

@stephentoub It will great if you write some blogs about ValueTask at https://blogs.msdn.microsoft.com/pfxteam/. Just a humble request 😀

benaadams · 2017-04-18T03:15:07Z

Can see why its doing what its doing, some of this stuff is pretty big 😄

benaadams · 2017-04-18T21:57:27Z

Will take a while to resolve this as isn't striaghtforward; @karelz want to assign to me? @stephentoub has 'd my ego into it

stephentoub · 2017-04-18T21:58:54Z

Mwahahahaha

mgravell · 2017-04-19T13:06:10Z

@stephentoub I refined and polished the tests, and I still think there's some big wins yet to be had on await for ValueTask<T>. Here's my gist that describes the scenario, and here's the headline numbers:

                                       Method |          Mean |     StdDev |       Op/s | Scaled | Scaled-StdDev |  Gen 0 | Allocated |
--------------------------------------------- |--------------:|-----------:|-----------:|-------:|--------------:|-------:|----------:|
                          int/await/valuetask | 7,056.9673 ns | 20.9213 ns |  141703.93 |  10.68 |          0.07 |      - |      0 kB |
             int/manual/valuetask/iscompleted |   712.3976 ns |  2.2813 ns | 1403710.43 |   1.08 |          0.01 |      - |      0 kB |
 int/manual/valuetask/iscompletedsuccessfully |   698.7291 ns |  1.7101 ns | 1431169.82 |   1.06 |          0.01 |      - |      0 kB |
                                     int/sync |   660.6417 ns |  4.0141 ns | 1513679.87 |   1.00 |          0.00 |      - |      0 kB |

int/sync is regular classic code without any tasks etc; the two int/manual/valuetask/* versions are sync with a check on IsCompleted / IsCompletedSuccessfully, returning an async Awaited(...) local function if not complete, otherwise using .Result. You can see they work great at 1.06-1.08 scaled cost - that's fine, perfectly acceptable.

However, the await version (int/await/valuetask) has a scaled cost of 10.68. That's... huge. Now, I can totally get around that by writing complex C#, but it really feels like there's something missing here - that cost is too high, IMO, compared to manually hacking ValueTask<T> - meaning, it isn't just ValueTask<T> here, but: the interplay with the await machinery.

stephentoub · 2017-04-19T13:12:01Z

What do you get for await'ing an already completed Task<int>?

I still think there's some big wins yet to be had on await for ValueTask

Great. Let's find them. I never said there weren't places that could be improved.

mgravell · 2017-04-19T13:13:27Z

@stephentoub that's in the gist linked - I snipped those rows here

stephentoub · 2017-04-19T13:15:15Z

that's in the gist linked

I see int/await/task, but it's obviously allocating a new object on each call. Do you have one that's not? That's what I was looking for.

the interplay with the await machinery

Have you profiled? For example, are you sure the overhead is coming from the await rather than from, say, the overhead of invoking an async method?

mgravell · 2017-04-19T13:16:46Z

I'll add a version that returns the same cached static readonly Task.FromResult(42) each time - will update in a few minutes

mgravell · 2017-04-19T13:23:59Z

with:

    const int FixedResultValue = 42;
    static readonly Task<int> FixedResultTask = Task.FromResult(FixedResultValue);
    static readonly ValueTask<int> FixedResultValueTask = new ValueTask<int>(FixedResultValue);
    internal int GetLineWorthInt32Sync() => FixedResultValue;
    internal Task<int> GetLineWorthInt32TaskAsync() => FixedResultTask;
    internal ValueTask<int> GetLineWorthInt32ValueTaskAsync() => FixedResultValueTask;

I get (net462):

                                       Method |           Mean |     StdDev |       Op/s | Scaled | Scaled-StdDev |  Gen 0 | Allocated |
--------------------------------------------- |---------------:|-----------:|-----------:|-------:|--------------:|-------:|----------:|
                               int/await/task |  5,887.0756 ns | 45.8721 ns |  169863.62 |   8.99 |          0.08 | 0.6354 |      4 kB |
                  int/manual/task/iscompleted |  1,332.3687 ns | 15.3116 ns |     750543 |   2.04 |          0.02 | 0.6357 |      4 kB |
      int/manual/task/iscompletedsuccessfully |  2,112.0864 ns |  8.5444 ns |  473465.47 |   3.23 |          0.02 | 0.6354 |      4 kB |
                          int/await/valuetask | 10,150.5720 ns | 34.2319 ns |   98516.62 |  15.50 |          0.08 |      - |      0 kB |
             int/manual/valuetask/iscompleted |  2,194.9392 ns | 10.8205 ns |  455593.48 |   3.35 |          0.02 |      - |      0 kB |
 int/manual/valuetask/iscompletedsuccessfully |  2,273.9742 ns |  8.4562 ns |  439758.72 |   3.47 |          0.02 |      - |      0 kB |
                                     int/sync |    654.6944 ns |  2.8883 ns | 1527430.11 |   1.00 |          0.00 |      - |      0 kB |

and obviously, I can't cache all the Int32 -> Task<int> possibilities :) oddly, it looks to be slower using a static readonly ValueTask<int> - cheaper to init on the stack than to copy from static field? meh, who knows!

Edit: I went back to internal ValueTask<int> GetLineWorthInt32ValueTaskAsync() => new ValueTask<int>(FixedResultValue); and the bottom lines recover:

                          int/await/valuetask | 7,121.0357 ns | 14.1383 ns |  140429.01 |  11.33 |          0.04 |      - |      0 kB |
             int/manual/valuetask/iscompleted |   667.0673 ns |  6.2433 ns |  1499099.1 |   1.06 |          0.01 |      - |      0 kB |
 int/manual/valuetask/iscompletedsuccessfully |   663.6910 ns |  3.9172 ns | 1506725.32 |   1.06 |          0.01 |      - |      0 kB |
                                     int/sync |   628.3083 ns |  1.7743 ns | 1591575.31 |   1.00 |          0.00 |      - |      0 kB |

if I had to speculate - some kind of beforefieldinit cost making it cheaper (more inlineable) to do new ValueTask<int>(const) than to touch the static field, but: that doesn't matter for this!

…es/18481#issuecomment-294483221

benaadams · 2017-04-19T22:19:54Z

await task is quite a complicated beast compared to task.IsCompletedSuccessfully as it involves creating MethodBuilder, copying stack into MethodBuilder creating awaiter, getting result from awaiter and saving and restoring the execution context vs a quick check.

Having said that, I think it can be improved, but it needs some juggling

mgravell · 2017-04-19T23:07:22Z

@benaadams I guess the point I'm getting at here is: you and me - we can hack around this and make our code work fine as though it wasn't an issue; I'm thinking of "regular .NET Jo" - who is just going to use await and wonder why they aren't getting the performance they expected

mattnischan · 2017-04-20T16:57:24Z

Right. Another part is the new .Net conventional wisdom to "async all the things", which I know I've been guilty of spreading the past couple of years as sort of a simplified version of the what we probably should be telling people. The problem is, AsyncMethodBuilder is an expensive construct. If you have a simple REST API that calls into your data access layer, you might be 5 or 6 async calls deep just to get a value that is almost always cached on the heap someplace, for instance.

The decision whether or not to use an async path is not always straightforward and has the same trade-offs as any other performance decision, which you can make if you know how it works. But we've kinda told people "if this call might do I/O even once, make it async". I've had quite a few perf tuning sessions end at the realization that AsyncMethodBuilder is the most expensive thing(s) on the call stack.

benaadams · 2017-04-22T08:44:32Z

@mattnischan AsyncMethodBuilder is already better in 2.0 dotnet/coreclr#9471 though VaueTask doesn't have these changes; also the code paths are a little more complicated for ValueTask as its a superposition of things.

Few things to watch for:

how many locals you are using in the async function scope as that's a direct driver of the size of the async state machine struct.
once you've gone async you've paid the cost so its better push up and collect as many awaits in a function that is already async rather than push them down and be very fine grained; then the cost is only paid once.

The async cost isn't very high; but if a function isn't doing much then it will dominate.

While I do think the await performance can be still improved; I must also marvel at the masterpiece of engineering it is. If I make it better, it must be recognised its only through standing on the shoulders of giants.

benaadams · 2017-04-22T10:06:24Z

Wondering if a level of exception handling can be removed to allow inlining of AsyncMethodBuilder.Start https://github.com/dotnet/coreclr/issues/11156

benaadams · 2017-04-23T19:41:57Z

Have tried lots of variations; some gains though for increased complexity. I don't know if it can be done in coreclr/fx (short of Jit magic); I think the state machine roslyn generates might need tweaking instead 😢

benaadams · 2017-04-25T08:38:44Z

Best pattern is local function for async; passing the parameters to it via params rather than locals, as that currently allocates (example pattern and issue pattern: dotnet/roslyn#18946)

                               Method |          Mean |     StdDev |       Op/s | Scaled | Allocated |
------------------------------------- |--------------:|-----------:|-----------:|-------:|----------:|
                                 Sync |   599.9187 ns |  2.1857 ns | 1666892.40 |   1.00 |      0 kB |
 'ValueTask + Local Async via Params' |   668.2087 ns |  4.8867 ns | 1496538.39 |   1.11 |      0 kB |
 'ValueTask + Local Async via Locals' | 1,525.9064 ns |  9.9561 ns |  655348.19 |   2.54 |   2.45 kB |
               'ValueTask Pure Async' | 5,606.6241 ns | 18.2806 ns |  178360.45 |   9.35 |      0 kB |

However, I think some aggressive inlines need to be added to cope with large structs (which is where this started with decimal)

benaadams · 2017-04-26T12:35:44Z

@mgravell random things make benchmarking comparisons hard 😛

benaadams · 2017-04-26T19:38:05Z

Closing this as forcing the inlines doesn't seem to improve anything really 😢

karelz · 2017-04-26T20:17:40Z

Pity :(, thanks for pushing on it though @benaadams!

benaadams · 2017-04-26T20:32:01Z

Its within noise (+/-) for decimal and large structs and there are much larger factors which dominate; e.g. statemachine (create->start->movenext) but I haven't been able to find a general automatic solution.

benaadams · 2017-04-27T22:57:42Z

My best guess currently on what causes the drop is the async doing more work (obviously); but then that extra work being on memory; whereas the non-async and manual fast-path are working off registers.

An inlining doesn't solve it because the state is a very large struct beyond enregistering size and shape.

benaadams · 2018-02-01T02:49:15Z

Looks to be fixed in 2.1; ValueTask is no longer an outlier

                                 Method |     Mean |     Error |    StdDev |     Op/s |  Gen 0 | Allocated |
--------------------------------------- |---------:|----------:|----------:|---------:|-------:|----------:|
                           Sync_Decimal | 10.99 us | 0.0153 us | 0.0136 us | 90,990.3 |      - |       0 B |
                      TaskAsync_Decimal | 17.08 us | 0.0868 us | 0.0812 us | 58,557.6 | 2.1973 |  121680 B |
               TaskCheckedAsync_Decimal | 17.61 us | 0.0695 us | 0.0616 us | 56,795.1 | 2.1973 |  121680 B |
                 ValueTaskAsync_Decimal | 17.87 us | 0.0259 us | 0.0229 us | 55,972.8 |      - |       0 B |
          ValueTaskCheckedAsync_Decimal | 17.14 us | 0.0323 us | 0.0302 us | 58,327.0 |      - |       0 B |
               HandCrankedAsync_Decimal | 14.57 us | 0.0472 us | 0.0441 us | 68,617.4 |      - |       0 B |
           AssertCompletedAsync_Decimal | 15.43 us | 0.0283 us | 0.0236 us | 64,792.3 |      - |       0 B |
                       TaskAsync_Double | 15.49 us | 0.1381 us | 0.1224 us | 64,569.5 | 1.9531 |  109512 B |
                  ValueTaskAsync_Double | 15.59 us | 0.0250 us | 0.0234 us | 64,133.0 |      - |       0 B |

ghost · 2018-03-04T20:08:03Z

Out of curiosity, someone knows what fixed the issue?

stephentoub assigned benaadams Apr 18, 2017

mgravell referenced this issue in protobuf-net/protobuf-net Apr 19, 2017

Added CONSTANT_RESULTS hack for https://github.com/dotnet/corefx/issu…

939192a

…es/18481#issuecomment-294483221

benaadams closed this as completed Apr 26, 2017

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 2.0.0 milestone Jan 31, 2020

ghost locked as resolved and limited conversation to collaborators Dec 24, 2020

ValueTask doesn't inline well #21153

ValueTask doesn't inline well #21153

Comments

benaadams commented Apr 17, 2017

benaadams commented Apr 17, 2017

stephentoub commented Apr 17, 2017

benaadams commented Apr 17, 2017

benaadams commented Apr 17, 2017

benaadams commented Apr 17, 2017

stephentoub commented Apr 17, 2017

mattnischan commented Apr 17, 2017

stephentoub commented Apr 17, 2017

rogeralsing commented Apr 17, 2017

stephentoub commented Apr 17, 2017

mattnischan commented Apr 17, 2017

stephentoub commented Apr 17, 2017 • edited Loading

stephentoub commented Apr 17, 2017

AndyAyersMS commented Apr 17, 2017

benaadams commented Apr 17, 2017

mattnischan commented Apr 17, 2017

imranbaloch commented Apr 17, 2017

benaadams commented Apr 18, 2017

benaadams commented Apr 18, 2017 • edited Loading

stephentoub commented Apr 18, 2017

mgravell commented Apr 19, 2017

stephentoub commented Apr 19, 2017

mgravell commented Apr 19, 2017

stephentoub commented Apr 19, 2017

mgravell commented Apr 19, 2017

mgravell commented Apr 19, 2017 • edited Loading

benaadams commented Apr 19, 2017

mgravell commented Apr 19, 2017

mattnischan commented Apr 20, 2017

benaadams commented Apr 22, 2017

benaadams commented Apr 22, 2017

benaadams commented Apr 23, 2017 • edited Loading

benaadams commented Apr 25, 2017

benaadams commented Apr 26, 2017

benaadams commented Apr 26, 2017

karelz commented Apr 26, 2017

benaadams commented Apr 26, 2017

benaadams commented Apr 27, 2017

benaadams commented Feb 1, 2018

ghost commented Mar 4, 2018

stephentoub commented Apr 17, 2017 •

edited

Loading

benaadams commented Apr 18, 2017 •

edited

Loading

mgravell commented Apr 19, 2017 •

edited

Loading

benaadams commented Apr 23, 2017 •

edited

Loading