
ValueTask doesn't inline well #21153

Closed
benaadams opened this issue Apr 17, 2017 · 40 comments
Labels: area-System.Threading, enhancement (Product code improvement that does NOT require public API changes/additions), tenet-performance (Performance related issue)

@benaadams
Member

Neither the methods on the method builder nor the methods on ValueTask seem to inline very well:

Inlinee: System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1[System.Decimal].Create - value class System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1<!0> ()
Fail Reason: Native estimate for function size exceeds threshold.

Inlinee: System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1[System.Decimal].get_Task - instance value class System.Threading.Tasks.ValueTask`1<!0> ()
Fail Reason: Native estimate for function size exceeds threshold.

Inlinee: System.Runtime.CompilerServices.ValueTaskAwaiter`1[System.Decimal].GetResult - instance !0 ()
Fail Reason: Native estimate for function size exceeds threshold.

Inliner: TheAwaitingGame.Order+d__5.MoveNext - instance void ()
Inlinee: instance value class System.Threading.Tasks.ValueTask`1 ()
Fail Reason: Native estimate for function size exceeds threshold.

Using a ternary check will give better results than a direct await

e.g. (task.IsCompleted) ? task.Result : await task;

Note: this should be IsCompletedSuccessfully, though the Task.Status path doesn't inline either: https://github.com/dotnet/corefx/issues/16745#issuecomment-292728779

Should the no-wait sync path be marked with aggressive inlines?

Or is the state machine going to automatically add more weight than the ternary check?
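
A minimal sketch of the fast-path check being suggested (method names here are illustrative, not from the original benchmark):

    using System.Threading.Tasks;

    static class Sample
    {
        // Hypothetical source that usually completes synchronously.
        static ValueTask<decimal> GetPriceAsync() => new ValueTask<decimal>(9.99m);

        static async ValueTask<decimal> GetPriceCheckedAsync()
        {
            ValueTask<decimal> task = GetPriceAsync();
            // Fast path: skip the awaiter machinery when the result is already there.
            return task.IsCompletedSuccessfully ? task.Result : await task;
        }
    }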

/cc @stephentoub @AndyAyersMS

@benaadams
Member Author

It seems awaiting .AsTask() rather than the ValueTask gives similar results to the ternary check, other than allocations shooting up: protobuf-net/protobuf-net#234

@stephentoub
Member

@benaadams, can you share a small code sample where it's not getting inlined? In a quick one I just threw together, the hot path with ValueTask/ValueTaskAwaiter for a synchronously completed ValueTask is entirely inlined.

@benaadams
Member Author

Also, I was looking at netcoreapp1.1 rather than 2.0, in case something has already been resolved.

@benaadams
Member Author

This might be a different issue than not inlining; I have annotated the methods and am testing all completed paths.

Type              |                 Method |       Mean |     Op/s | Allocated |
----------------- |----------------------- |-----------:|---------:|----------:|
Straight Call     |                   Sync | 12.4577 us | 80271.89 |      0 kB |
Completed Task    |              TaskAsync | 19.5186 us | 51233.10 |  24.25 kB |
ValueTask         |         ValueTaskAsync | 24.1389 us | 41426.95 |      0 kB |
Ternary Check     |  ValueTaskCheckedAsync | 20.5660 us | 48623.86 |      0 kB |
Manual Await Skip |       HandCrankedAsync | 16.3431 us | 61187.91 |      0 kB |

@stephentoub
Member

Might be different issue

Why is the expectation that ValueTask will be faster than a completed task, or am I misunderstanding the concern? I fully expect that there are overheads associated with a ValueTask: it stores multiple fields that are returned, it has an extra branch, etc. This is why, when I've been asked whether every method should just return a ValueTask<T> instead of a Task<T>, my answer has never been yes; rather, it depends on how often an asynchronous completion is expected, whether all T results could be cached, etc. It's also partially why there's no non-generic ValueTask (it would be no better than just returning a completed task).

@mattnischan

If Task<T>.FromResult() is faster than ValueTask<T>, in what cases would it be useful to use ValueTask<T>?

@stephentoub
Member

If Task<T>.FromResult() is faster than ValueTask<T>, in what cases would it be useful to use ValueTask<T>?

"faster" is very subjective. FromResult() allocates a Task<T>, so while it may be "faster" to get the object then and there, you're building up GC debt that'll be paid in spades later.

@rogeralsing

Have there been any benchmarks done where the GC is under pressure from other work while still benchmarking Task vs ValueTask?

I imagine that your GC will have a max throughput on your system, and if possible it might be worth offloading this work to the stack.
I imagine that it will also have a positive effect on long-tail latency, as the ValueTask performance will be more consistent, with fewer variations over time,
e.g. in the Kestrel webserver.

Would that be a valid assumption to make?

@stephentoub
Member

Would that be a valid assumption to make?

That's the whole reason ValueTask<T> exists: to minimize allocation in cases where it's expected that an asynchronous method will frequently complete synchronously with arbitrary result values (e.g. T is an Int32 and the possible results span the Int32 range, such that it would be infeasible to cache a Task<T> for all of them). A common example would be Stream.ReadAsync: if we had it to do over today, I expect we'd have defined this to return ValueTask<int> rather than Task<int>.
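
A hedged sketch of that pattern (illustrative only, not the actual Stream implementation): most calls are satisfied synchronously from a buffer and return an allocation-free ValueTask<int>, and only the occasional real read wraps a Task<int>.

    using System;
    using System.IO;
    using System.Threading.Tasks;

    class BufferedSource
    {
        private readonly Stream _stream;
        private readonly byte[] _buffer = new byte[4096];
        private int _pos, _len;   // buffer refill logic omitted for brevity

        public BufferedSource(Stream stream) { _stream = stream; }

        public ValueTask<int> ReadAsync(byte[] dest, int offset, int count)
        {
            if (_pos < _len)
            {
                // Synchronous completion: no Task<int> allocation.
                int n = Math.Min(count, _len - _pos);
                Array.Copy(_buffer, _pos, dest, offset, n);
                _pos += n;
                return new ValueTask<int>(n);
            }
            // Genuinely asynchronous completion: wrap the underlying Task<int>.
            return new ValueTask<int>(_stream.ReadAsync(dest, offset, count));
        }
    }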

@mattnischan

I suspected that would be the answer, but it still strikes me that most use cases are within this particular area, where ValueTask<T> is a representation of some kind of mostly-always-fast-but-occasionally-not type of value, from, say, a data access layer with caching or something like that.

It's possible this test is just not very illustrative of when ValueTask<T> normally wins, i.e. when you're hitting the GC a lot harder than this benchmark is.

@stephentoub
Member

stephentoub commented Apr 17, 2017

It's possible this test is just not very illustrative

GC impact compounds and often doesn't show up in full force in microbenchmarks. The more work you have going on in a real scenario, the more allocation affects things globally in the app.

most use cases are within this particular area

There are definitely many of them, which is why ValueTask<T> was introduced. But there are also common cases where it's not valuable, e.g. it's pretty common to have an asynchronous operation that returns a bool (imagine an asynchronous enumerable with a MoveNextAsync), and for that, both values (true and false) can be cached in Task<bool>s such that there's no benefit to using a ValueTask<bool>.
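
A small sketch of the cached-Task<bool> point (a hypothetical MoveNextAsync-style method, not an actual library API):

    using System.Threading.Tasks;

    static class CachedBoolTasks
    {
        // Both possible results are cached once, so the synchronous path
        // returns an existing Task<bool> and allocates nothing.
        private static readonly Task<bool> True = Task.FromResult(true);
        private static readonly Task<bool> False = Task.FromResult(false);

        public static Task<bool> MoveNextAsync(bool hasMore) => hasMore ? True : False;
    }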

@stephentoub
Member

(All that said, if there are ways we can improve the performance of ValueTask<T>, of course we want to do that. @benaadams, I'm looking at you 😄)

@AndyAyersMS
Member

Heuristic inlining cost estimates for ValueTask<T> where T is a value class will be dependent on T in various complex ways (sizeof(T), whether the jit thinks a T is promotable, etc). So decimal vs some other type may show inlining differences.

Heuristics should be more or less the same for 1.1 vs 2.0.

Inlining of methods that take or return value classes should provide some extra benefit because quite often caller-side struct copy/init code can be removed. But the jit's optimizer is not always able to clean these things up and so the inliner can't yet anticipate that it will do so.

@benaadams
Member Author

There are 4 methods that don't inline well; I'm investigating the best approach.

Fighting with the CLI and publishing 2.0 at the moment 😢, so that I can look at the breakdown.

@mattnischan

Yep, changing the ValueTask type to a reference type instead brings the inlining back to life: protobuf-net/protobuf-net#235

@imranbaloch

@stephentoub It would be great if you wrote some blog posts about ValueTask at https://blogs.msdn.microsoft.com/pfxteam/. Just a humble request 😀

@benaadams
Member Author

I can see why it's doing what it's doing; some of this stuff is pretty big 😄

@benaadams
Member Author

benaadams commented Apr 18, 2017

This will take a while to resolve as it isn't straightforward; @karelz, want to assign it to me? @stephentoub has :trollface:'d my ego into it

@stephentoub
Member

Mwahahahaha

@mgravell
Member

@stephentoub I refined and polished the tests, and I still think there are some big wins yet to be had on await for ValueTask<T>. Here's my gist that describes the scenario, and here are the headline numbers:

                                       Method |          Mean |     StdDev |       Op/s | Scaled | Scaled-StdDev |  Gen 0 | Allocated |
--------------------------------------------- |--------------:|-----------:|-----------:|-------:|--------------:|-------:|----------:|
                          int/await/valuetask | 7,056.9673 ns | 20.9213 ns |  141703.93 |  10.68 |          0.07 |      - |      0 kB |
             int/manual/valuetask/iscompleted |   712.3976 ns |  2.2813 ns | 1403710.43 |   1.08 |          0.01 |      - |      0 kB |
 int/manual/valuetask/iscompletedsuccessfully |   698.7291 ns |  1.7101 ns | 1431169.82 |   1.06 |          0.01 |      - |      0 kB |
                                     int/sync |   660.6417 ns |  4.0141 ns | 1513679.87 |   1.00 |          0.00 |      - |      0 kB |

int/sync is regular classic code without any tasks etc.; the two int/manual/valuetask/* versions are sync with a check on IsCompleted / IsCompletedSuccessfully, returning an async Awaited(...) local function if not complete, otherwise using .Result. You can see they work great at 1.06-1.08 scaled cost - that's fine, perfectly acceptable.

However, the await version (int/await/valuetask) has a scaled cost of 10.68. That's... huge. Now, I can totally get around that by writing complex C#, but it really feels like there's something missing here - that cost is too high, IMO, compared to manually hacking ValueTask<T> - meaning, it isn't just ValueTask<T> here, but the interplay with the await machinery.
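
For reference, a hedged sketch of the shapes being compared (the names mirror the benchmark rows, but this is not the gist's actual code):

    using System.Threading.Tasks;

    static class BenchShapes
    {
        // Always completes synchronously in this benchmark.
        static ValueTask<int> GetLineAsync() => new ValueTask<int>(1);

        // int/sync: plain synchronous baseline.
        static int Sync() => 1;

        // int/await/valuetask: a straight await, paying the full state-machine cost.
        static async ValueTask<int> AwaitValueTask() => await GetLineAsync();

        // int/manual/valuetask/iscompletedsuccessfully: synchronous fast-path check,
        // falling back to an async local function only when genuinely incomplete.
        static ValueTask<int> ManualValueTask()
        {
            ValueTask<int> task = GetLineAsync();
            if (task.IsCompletedSuccessfully) return new ValueTask<int>(task.Result);
            return Awaited(task);

            async ValueTask<int> Awaited(ValueTask<int> pending) => await pending;
        }
    }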

@stephentoub
Member

What do you get for await'ing an already completed Task<int>?

I still think there's some big wins yet to be had on await for ValueTask

Great. Let's find them. I never said there weren't places that could be improved.

@mgravell
Member

@stephentoub that's in the gist linked - I snipped those rows here

@stephentoub
Member

that's in the gist linked

I see int/await/task, but it's obviously allocating a new object on each call. Do you have one that's not? That's what I was looking for.

the interplay with the await machinery

Have you profiled? For example, are you sure the overhead is coming from the await rather than from, say, the overhead of invoking an async method?

@mgravell
Member

I'll add a version that returns the same cached static readonly Task.FromResult(42) each time - will update in a few minutes

@mgravell
Member

mgravell commented Apr 19, 2017

with:

    const int FixedResultValue = 42;
    static readonly Task<int> FixedResultTask = Task.FromResult(FixedResultValue);
    static readonly ValueTask<int> FixedResultValueTask = new ValueTask<int>(FixedResultValue);
    internal int GetLineWorthInt32Sync() => FixedResultValue;
    internal Task<int> GetLineWorthInt32TaskAsync() => FixedResultTask;
    internal ValueTask<int> GetLineWorthInt32ValueTaskAsync() => FixedResultValueTask;

I get (net462):

                                       Method |           Mean |     StdDev |       Op/s | Scaled | Scaled-StdDev |  Gen 0 | Allocated |
--------------------------------------------- |---------------:|-----------:|-----------:|-------:|--------------:|-------:|----------:|
                               int/await/task |  5,887.0756 ns | 45.8721 ns |  169863.62 |   8.99 |          0.08 | 0.6354 |      4 kB |
                  int/manual/task/iscompleted |  1,332.3687 ns | 15.3116 ns |     750543 |   2.04 |          0.02 | 0.6357 |      4 kB |
      int/manual/task/iscompletedsuccessfully |  2,112.0864 ns |  8.5444 ns |  473465.47 |   3.23 |          0.02 | 0.6354 |      4 kB |
                          int/await/valuetask | 10,150.5720 ns | 34.2319 ns |   98516.62 |  15.50 |          0.08 |      - |      0 kB |
             int/manual/valuetask/iscompleted |  2,194.9392 ns | 10.8205 ns |  455593.48 |   3.35 |          0.02 |      - |      0 kB |
 int/manual/valuetask/iscompletedsuccessfully |  2,273.9742 ns |  8.4562 ns |  439758.72 |   3.47 |          0.02 |      - |      0 kB |
                                     int/sync |    654.6944 ns |  2.8883 ns | 1527430.11 |   1.00 |          0.00 |      - |      0 kB |

and obviously, I can't cache all the Int32 -> Task<int> possibilities :) oddly, it looks to be slower using a static readonly ValueTask<int> - cheaper to init on the stack than to copy from static field? meh, who knows!

Edit: I went back to internal ValueTask<int> GetLineWorthInt32ValueTaskAsync() => new ValueTask<int>(FixedResultValue); and the bottom lines recover:

                          int/await/valuetask | 7,121.0357 ns | 14.1383 ns |  140429.01 |  11.33 |          0.04 |      - |      0 kB |
             int/manual/valuetask/iscompleted |   667.0673 ns |  6.2433 ns |  1499099.1 |   1.06 |          0.01 |      - |      0 kB |
 int/manual/valuetask/iscompletedsuccessfully |   663.6910 ns |  3.9172 ns | 1506725.32 |   1.06 |          0.01 |      - |      0 kB |
                                     int/sync |   628.3083 ns |  1.7743 ns | 1591575.31 |   1.00 |          0.00 |      - |      0 kB |

if I had to speculate - some kind of beforefieldinit cost making it cheaper (more inlineable) to do new ValueTask<int>(const) than to touch the static field, but: that doesn't matter for this!

@benaadams
Member Author

await task is quite a complicated beast compared to task.IsCompletedSuccessfully: it involves creating the method builder, copying the stack into the method builder, creating the awaiter, getting the result from the awaiter, and saving and restoring the execution context, versus a quick check.

Having said that, I think it can be improved, but it needs some juggling.
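
Roughly, the generated fast path looks something like the following (a simplified sketch, not the actual compiler output; the real state machine also creates the builder, hoists locals into a struct, and handles ExecutionContext):

    using System.Threading.Tasks;

    static class AwaitShape
    {
        static ValueTask<int> Source() => new ValueTask<int>(42);

        static int FastPathOnly()
        {
            var awaiter = Source().GetAwaiter();   // ValueTaskAwaiter<int>
            if (awaiter.IsCompleted)
            {
                return awaiter.GetResult();        // synchronous completion
            }
            // Slow path (not shown): the builder schedules the state machine
            // via AwaitUnsafeOnCompleted and MoveNext resumes later.
            throw new System.InvalidOperationException("sketch covers the completed case only");
        }
    }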

@mgravell
Member

@benaadams I guess the point I'm getting at here is: you and me - we can hack around this and make our code work fine as though it wasn't an issue; I'm thinking of "regular .NET Jo" - who is just going to use await and wonder why they aren't getting the performance they expected

@mattnischan

Right. Another part of it is the new .NET conventional wisdom to "async all the things", which I know I've been guilty of spreading over the past couple of years as a sort of simplified version of what we probably should be telling people. The problem is, AsyncMethodBuilder is an expensive construct. If you have a simple REST API that calls into your data access layer, you might be 5 or 6 async calls deep just to get a value that is almost always cached on the heap someplace, for instance.

The decision whether or not to use an async path is not always straightforward and has the same trade-offs as any other performance decision, which you can make if you know how it works. But we've kind of told people "if this call might do I/O even once, make it async". I've had quite a few perf tuning sessions end at the realization that AsyncMethodBuilder is the most expensive thing on the call stack.

@benaadams
Member Author

@mattnischan AsyncMethodBuilder is already better in 2.0 (dotnet/coreclr#9471), though ValueTask doesn't have those changes; also, the code paths are a little more complicated for ValueTask as it's a superposition of things.

A few things to watch for:

  • how many locals you are using in the async function scope, as that's a direct driver of the size of the async state machine struct;
  • once you've gone async you've paid the cost, so it's better to push up and collect as many awaits as possible in a function that is already async, rather than push them down and be very fine grained; then the cost is only paid once (a sketch follows below).

The async cost isn't very high, but if a function isn't doing much then it will dominate.

While I do think the await performance can still be improved, I must also marvel at the masterpiece of engineering it is. If I make it better, it must be recognised it's only through standing on the shoulders of giants.
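
A small sketch of the "push the awaits up" point (hypothetical helper names; illustrative only):

    using System.IO;
    using System.Threading.Tasks;

    static class Granularity
    {
        // Hypothetical leaf operation that usually completes synchronously.
        static ValueTask<int> ReadInt32Async(Stream s) => new ValueTask<int>(42);

        // Fine-grained: this wrapper adds a whole extra state machine
        // (builder + MoveNext) just to forward a value that is often already there.
        static async ValueTask<int> ReadWrappedAsync(Stream s) => await ReadInt32Async(s);

        // Pushed up: keep thin wrappers synchronous and let the one method that is
        // already async own the awaits, so the state-machine cost is paid once.
        static ValueTask<int> ReadForwardedAsync(Stream s) => ReadInt32Async(s);

        static async Task<int> ProcessAsync(Stream s)
        {
            int a = await ReadForwardedAsync(s);
            int b = await ReadForwardedAsync(s);
            return a + b;
        }
    }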

@benaadams
Member Author

Wondering if a level of exception handling can be removed to allow inlining of AsyncMethodBuilder.Start https://github.com/dotnet/coreclr/issues/11156

@benaadams
Member Author

benaadams commented Apr 23, 2017

I have tried lots of variations; there are some gains, though at the cost of increased complexity. I don't know if it can be done in coreclr/corefx (short of JIT magic); I think the state machine Roslyn generates might need tweaking instead 😢

@benaadams
Member Author

The best pattern is a local function for the async part, passing the parameters to it explicitly rather than capturing locals, as capturing currently allocates (example pattern and issue: dotnet/roslyn#18946; see the sketch after the table below)

                               Method |          Mean |     StdDev |       Op/s | Scaled | Allocated |
------------------------------------- |--------------:|-----------:|-----------:|-------:|----------:|
                                 Sync |   599.9187 ns |  2.1857 ns | 1666892.40 |   1.00 |      0 kB |
 'ValueTask + Local Async via Params' |   668.2087 ns |  4.8867 ns | 1496538.39 |   1.11 |      0 kB |
 'ValueTask + Local Async via Locals' | 1,525.9064 ns |  9.9561 ns |  655348.19 |   2.54 |   2.45 kB |
               'ValueTask Pure Async' | 5,606.6241 ns | 18.2806 ns |  178360.45 |   9.35 |      0 kB |

However, I think some aggressive inlines need to be added to cope with large structs (which is where this started with decimal)
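
A hedged sketch of the "via params" vs "via locals" distinction (names are illustrative; the captured-locals variant is the one that currently allocates a closure, per dotnet/roslyn#18946):

    using System.Threading.Tasks;

    static class LocalAsyncPatterns
    {
        static ValueTask<int> GetItemAsync(int id) => new ValueTask<int>(id);

        // "via locals": the async local function captures task/multiplier from
        // the enclosing scope, which currently forces a closure allocation.
        static ValueTask<int> TotalViaLocals(int id, int multiplier)
        {
            ValueTask<int> task = GetItemAsync(id);
            if (task.IsCompletedSuccessfully) return new ValueTask<int>(task.Result * multiplier);
            return Awaited();

            async ValueTask<int> Awaited() => (await task) * multiplier;
        }

        // "via params": everything the async local function needs is passed in
        // as parameters, so nothing is captured and no closure is allocated.
        static ValueTask<int> TotalViaParams(int id, int multiplier)
        {
            ValueTask<int> task = GetItemAsync(id);
            if (task.IsCompletedSuccessfully) return new ValueTask<int>(task.Result * multiplier);
            return Awaited(task, multiplier);

            async ValueTask<int> Awaited(ValueTask<int> pending, int m) => (await pending) * m;
        }
    }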

@benaadams
Member Author

@mgravell random things make benchmarking comparisons hard 😛

@benaadams
Member Author

Closing this as forcing the inlines doesn't seem to improve anything really 😢

@karelz
Member

karelz commented Apr 26, 2017

Pity :(, thanks for pushing on it though @benaadams!

@benaadams
Member Author

It's within noise (+/-) for decimal and large structs, and there are much larger factors which dominate, e.g. the state machine (create -> start -> MoveNext), but I haven't been able to find a general automatic solution.

@benaadams
Member Author

My best guess currently on what causes the drop is the async path doing more work (obviously), but with that extra work operating on memory, whereas the non-async and manual fast paths work off registers.

Inlining doesn't solve it because the state is a very large struct, beyond enregisterable size and shape.

@benaadams
Member Author

This looks to be fixed in 2.1; ValueTask is no longer an outlier:

                                 Method |     Mean |     Error |    StdDev |     Op/s |  Gen 0 | Allocated |
--------------------------------------- |---------:|----------:|----------:|---------:|-------:|----------:|
                           Sync_Decimal | 10.99 us | 0.0153 us | 0.0136 us | 90,990.3 |      - |       0 B |
                      TaskAsync_Decimal | 17.08 us | 0.0868 us | 0.0812 us | 58,557.6 | 2.1973 |  121680 B |
               TaskCheckedAsync_Decimal | 17.61 us | 0.0695 us | 0.0616 us | 56,795.1 | 2.1973 |  121680 B |
                 ValueTaskAsync_Decimal | 17.87 us | 0.0259 us | 0.0229 us | 55,972.8 |      - |       0 B |
          ValueTaskCheckedAsync_Decimal | 17.14 us | 0.0323 us | 0.0302 us | 58,327.0 |      - |       0 B |
               HandCrankedAsync_Decimal | 14.57 us | 0.0472 us | 0.0441 us | 68,617.4 |      - |       0 B |
           AssertCompletedAsync_Decimal | 15.43 us | 0.0283 us | 0.0236 us | 64,792.3 |      - |       0 B |
                       TaskAsync_Double | 15.49 us | 0.1381 us | 0.1224 us | 64,569.5 | 1.9531 |  109512 B |
                  ValueTaskAsync_Double | 15.59 us | 0.0250 us | 0.0234 us | 64,133.0 |      - |       0 B |

@ghost

ghost commented Mar 4, 2018

Out of curiosity, does anyone know what fixed the issue?

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 2.0.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 24, 2020