Performance: Refactors query prefetch mechanism #4361
Conversation
…ons, but we also can't be substantially slower to start all tasks
…d swallow exceptions
…sks and buffers that could occur in that some places
…ot allocating more in the non-test cases, but found a field to reuse; needs benchmarking
@microsoft-github-policy-service agree company="Microsoft"

The benchmarks are off in a gist (also linked in the description), but they take a loooong time to run (I just ran 'em overnight during development), so I don't think it'd make much sense to check them into anything that is regularly run. Since I didn't intend to run them regularly, I assembled the charting by hand - so there's nothing to save there.
/azp run

Azure Pipelines successfully started running 1 pipeline(s).

/azp run

Azure Pipelines successfully started running 1 pipeline(s).
sboshra left a comment
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
Description
Reworks `ParallelPrefetch.PrefetchInParallelAsync` to reduce allocations.

This came out of profiling an application, and discovering that this method is allocating approximately as many bytes worth of `Task[]` as the whole application is creating in `byte[]` for IO. This is because `Task.WhenAny(...)` is (a) used in a loop and (b) makes a defensive copy of the passed `Task`s.

This version is substantially more complicated, and accordingly there are a lot of tests in this PR (code coverage is 100% of lines and blocks). Special attention was paid to exception and cancellation cases.
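To make the allocation source concrete, here is a minimal sketch (hypothetical code, not the SDK's actual implementation) of the loop-over-`Task.WhenAny` pattern being replaced. Each `Task.WhenAny` call copies the supplied tasks into a fresh internal array, so draining N tasks this way allocates on the order of N defensive `Task[]` copies:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public static class WhenAnyDrainSketch
{
    // Drains tasks the way the old loop-over-WhenAny pattern does.
    public static async Task DrainAsync(List<Task> running)
    {
        while (running.Count > 0)
        {
            // Each iteration hides one more Task[] allocation inside WhenAny.
            Task completed = await Task.WhenAny(running);
            running.Remove(completed);
            await completed; // observe any exception the task faulted with
        }
    }
}
```

The per-iteration copy is cheap for a handful of tasks, but it scales with both the concurrency level and the number of prefetchers drained, which is how it ends up rivaling the application's IO buffers in total bytes allocated.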
Improvements
Greatly Reduced Allocations
In my benchmarking, allocations fell anywhere from 30% to 99% depending on the total number of `IPrefetcher`s used. More benchmarking discussion is at the bottom of this PR.
Special Casing For `maxConcurrency`

When `maxConcurrency == 0` we do no work, more efficiently than the current code. When `maxConcurrency == 1` we devolve to a `foreach`, which is just about ideal.

Special Casing When Only 1 `IPrefetcher`

We accept an `IEnumerable<IPrefetcher>`, but when that is only going to yield one `IPrefetcher`, a lot of work (even with the old code) is pointless. New code detects this case (generically; it doesn't look for specific types) and devolves into a single `await`.

Prompter Starting Of Next Task
Old code starts at most one task per pass through the `while` loop, so if multiple `Task`s are sitting there completed, there's a fair amount of work done before they are all replaced with active `Task`s. New code has the completed `Task` start its replacement, which should keep us closer to `maxConcurrency` active `Task`s.

`IEnumerator<IPrefetcher>` Disposed

Small nit, but the old code doesn't dispose the `IEnumerator<IPrefetcher>`. While unlikely, this can put more load on the finalizer thread or potentially leak resources.

Outline
- `maxConcurrency == 0` just returns
- `maxConcurrency == 1` is just a `foreach`
- `maxConcurrency <= BatchSize` is more complicated
  - `BatchSize` `IPrefetcher`s are loaded into a rented array
  - `Task`s are then started for each of those `IPrefetcher`s
  - `Task`s grab and start the next `IPrefetcher` of the `IEnumerator<IPrefetcher>` when they finish with their last one
  - each `Task` is then awaited in order
- `maxConcurrency > BatchSize` reuses a lot of the above case, but is still more complicated
  - `BatchSize` `IPrefetcher`s are loaded and started as above
  - `Task`s grab and start the next `IPrefetcher` when they finish with one
  - additional `IPrefetcher`s (up to `maxConcurrency`) are loaded while there are active `Task`s
  - the batches are tracked in an `object[]`, which is awaited in turn once `maxConcurrency` is reached (or the `IEnumerator<T>` finishes)
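The outline above can be sketched roughly as follows. This is a simplified, hypothetical rendering: the real code rents and reuses arrays in `BatchSize` chunks rather than using a `List<Task>`, and `IPrefetcher` here is a pared-down stand-in for the SDK's interface (the real one takes additional parameters such as a trace and a cancellation token):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Simplified stand-in for the SDK's IPrefetcher.
public interface IPrefetcher
{
    ValueTask PrefetchAsync();
}

public static class ParallelPrefetchSketch
{
    public static async Task PrefetchInParallelAsync(
        IEnumerable<IPrefetcher> prefetchers,
        int maxConcurrency)
    {
        if (maxConcurrency == 0)
        {
            return; // special case: no work at all
        }

        using IEnumerator<IPrefetcher> e = prefetchers.GetEnumerator();

        if (maxConcurrency == 1)
        {
            while (e.MoveNext()) // special case: devolves to a plain foreach
            {
                await e.Current.PrefetchAsync();
            }
            return;
        }

        object sync = new object();

        bool TryGetNext(out IPrefetcher next)
        {
            lock (sync) // the shared enumerator is not thread safe
            {
                if (e.MoveNext())
                {
                    next = e.Current;
                    return true;
                }
                next = null;
                return false;
            }
        }

        // A completed prefetch starts its replacement itself: as soon as one
        // finishes, the same task pulls and runs the next IPrefetcher.
        async Task WorkerAsync(IPrefetcher current)
        {
            do
            {
                await current.PrefetchAsync();
            }
            while (TryGetNext(out current));
        }

        // Start up to maxConcurrency workers.
        List<Task> workers = new List<Task>();
        while (workers.Count < maxConcurrency && TryGetNext(out IPrefetcher first))
        {
            workers.Add(WorkerAsync(first));
        }

        foreach (Task worker in workers) // awaited in order
        {
            await worker;
        }
    }
}
```

Because each worker immediately pulls the next `IPrefetcher` when it finishes, the number of active tasks stays close to `maxConcurrency` without any `Task.WhenAny` calls or their defensive copies.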
We distinguish between the two `maxConcurrency > 1` cases to avoid allocating very large arrays, and to make sure we start some prefetches fairly promptly even when `maxConcurrency` is very large. `BatchSize` is, somewhat arbitrarily, `512` - any value `> 1` and `< 8,192` would be valid.

Type of change
Sort of a bug I guess? Current code allocates a lot more than you'd expect.
Benchmarking
Standard caveats about micro-benchmarking apply, but I did some benchmarking to see how this stacks up versus the old code.
TL;DR - across the board improvements in allocations, with no wall clock regressions in what I believe is the common case. There are some narrow, less common cases, where small wall clock regressions are observed.
I consider the primary case here to be when the `IPrefetcher` actually goes async, and takes some non-trivial time to do its work. My expectation is that the two versions of the code should have about the same wall-clock time when `# tasks > maxConcurrency`, with the new code edging out the old as `# tasks` increases.

That said, I did also test the synchronous completion case, and the "goes async, but then completes immediately" case, to make sure performance wasn't terrible.
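As a rough illustration (hypothetical code, not the actual benchmark harness from the gist), prefetcher bodies matching the three benchmarked cases might look like this, with `IPrefetcher` simplified to a parameterless stand-in for the SDK's interface:

```csharp
using System.Threading.Tasks;

// Simplified stand-in for the SDK's IPrefetcher.
public interface IPrefetcher
{
    ValueTask PrefetchAsync();
}

// Completes synchronously: no async machinery ever runs.
public sealed class SynchronousPrefetcher : IPrefetcher
{
    public ValueTask PrefetchAsync() => default;
}

// Goes async but finishes almost immediately: exercises the async
// completion machinery without doing real work.
public sealed class YieldingPrefetcher : IPrefetcher
{
    public async ValueTask PrefetchAsync() => await Task.Yield();
}

// The primary case: genuinely asynchronous, with non-trivial time spent.
public sealed class DelayingPrefetcher : IPrefetcher
{
    public async ValueTask PrefetchAsync() => await Task.Delay(1);
}
```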
In all cases I expect the new code to perform fewer allocations than the old.
Summarizing some results (the full data is in the previous link)...
Here's old vs new on .NET 6 where the `IPrefetcher` is just an `await Task.Delay(1)` (`< 1` is an improvement):

(benchmark chart omitted)

As expected, wall clock time is basically unaffected (the delay dominates) but allocations are improved across the board. The benefits of the improved replacement-`Task` starting logic are visible at the very extreme ends of max concurrency and prefetcher counts.

Again, but this time the `IPrefetcher` just `return default;`s, so everything completes synchronously:

(benchmark chart omitted)

We see here that between 2 and 8 tasks there are configurations with wall clock regressions. I could try and improve that, but I believe "all synchronous completions" is fantastically rare, so it's not worth the extra code complications.

And finally, the `IPrefetcher` is just `await Task.Yield();`, so everything completes almost immediately but forces all the async completion machinery to run:

(benchmark chart omitted)

Similarly, between 4 and 8 tasks there are some wall clock regressions. While more realistic than the "all synchronous" case, I think this would still be pretty rare - most `IPrefetcher`s should be doing real work after some asynchronous operation.

Since we target netstandard, I also benchmarked under a Framework version (4.6.2) and the results are basically the same:

(benchmark chart omitted)