
Add in-process concurrency support to the PS worker #123

Merged · 11 commits into Azure:dev · Jan 7, 2019

Conversation

@daxian-dbw (Contributor) commented on Jan 5, 2019:

Add in-process concurrency support to the PS worker:

The major changes are:

  • Use PSThreadOptions.ReuseThread for the InitialSessionState when creating the Runspace, so that every PowerShellManager creates only one thread and then reuses it afterwards. The default behavior is to create a new thread every time PowerShell.Invoke is called. (A minimal sketch follows the data tables below.)
  • Update RequestProcessor to process InvocationRequest messages asynchronously via tasks.
  • Implement PowerShellManagerPool using a BlockingCollection (see the pool sketch after the tables below)
    • make the upper bound of the pool configurable via the environment variable PSWorkerInProcConcurrencyUpperBound
    • make the pool expand lazily
    • check out a PowerShellManager via CheckoutIdleWorker on the main thread. Once it gets an idle instance back, the main thread queues a task to process the invocation request on a thread-pool thread and forgets about it -- the main thread can then go ahead and process the next message.
  • Update the RpcLogger so that every PowerShellManager has its own logger instance.
    • also update the way the RequestId and InvocationId are set on the logger. The original way of setting up the context only works for a single-threaded design.
  • Update MessagingStream to use a BlockingCollection to hold all messages that are about to be written out, and use a single thread-pool thread to take items out and write them to the gRPC channel (a sketch of this single-writer pattern also follows the tables below).
    • currently, the way we write out response/log messages is completely task-based/async, using a semaphore for synchronization. However, this approach doesn't guarantee message order.
    • this is because multiple tasks could be blocked on the semaphore, and releasing the semaphore allows a blocked task to enter it, but there is no guaranteed order, such as first-in-first-out, in which blocked threads enter the semaphore.
    • so the unblocked task could be a random one, which can change the order in which messages are written to the gRPC channel.
  • Remove the two system log calls we have in our worker, because they drastically worsen the processing time per invocation request when there are a lot of incoming invocation requests.
    • the logging for the "TriggerMetadata" parameter is not that useful and should be removed
    • the execution-time logging is good to have but not necessary, especially when it impacts throughput.
    • below are some data comparing before/after removing the two system log calls. AveInvTime represents the average processing time (ms) for an invocation request; it drops from ~94 ms to ~2.6 ms, a huge improvement.

Before Removing

| Request | InvRequest | InvDone | AveGetReqTime | AveInvTime | AveFetchFromPoolTime | AveReturnToPoolTime | AveWriteResponseTime |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 6008 | 6000 | 5994 | 28.7243675099867 | 93.2944611277945 | 17.835 | 0.000166833500166834 | 0.0191858525191859 |
| 6108 | 6100 | 6095 | 28.2540929927963 | 94.2059064807219 | 18.02868852 | 0.000164068908941756 | 0.0188679245283019 |
| 6208 | 6200 | 6194 | 27.7989690721649 | 94.585728123991 | 18.11709678 | 0.000161446561188247 | 0.0185663545366484 |

After Removing

| Request | InvRequest | InvDone | AveGetReqTime | AveInvTime | AveFetchFromPoolTime | AveReturnToPoolTime | AveWriteResponseTime |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 7008 | 7000 | 6999 | 35.5720605022831 | 2.67166738105444 | 0.00557142857142857 | 0.000142877553936277 | 0.00257179597085298 |
| 7108 | 7100 | 7099 | 35.3432751828925 | 2.67333427243274 | 0.00549295774647887 | 0.000140864910550782 | 0.00253556838991407 |
| 7208 | 7200 | 7199 | 35.1571864594895 | 2.67356577302403 | 0.00541666666666667 | 0.000138908181691902 | 0.00250034727045423 |
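For reference, here are a few minimal sketches of the mechanisms described above. They illustrate the ideas under stated assumptions rather than reproducing the worker's actual code.

First, a sketch of creating a Runspace from an InitialSessionState with PSThreadOptions.ReuseThread, using the standard System.Management.Automation APIs:

```csharp
using System.Management.Automation;
using System.Management.Automation.Runspaces;

// Minimal sketch, not the actual worker code.
InitialSessionState iss = InitialSessionState.CreateDefault();

// With ReuseThread, the Runspace creates a single pipeline thread and reuses it
// for every PowerShell.Invoke call; the default behavior creates a new thread
// for each invocation.
iss.ThreadOptions = PSThreadOptions.ReuseThread;

Runspace runspace = RunspaceFactory.CreateRunspace(iss);
runspace.Open();

PowerShell pwsh = PowerShell.Create();
pwsh.Runspace = runspace;
pwsh.AddScript("$PID").Invoke();
```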
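Next, a rough sketch of the pool checkout. PowerShellManagerPool, PowerShellManager, CheckoutIdleWorker, and PSWorkerInProcConcurrencyUpperBound come from this PR; the method bodies and the ReclaimUsedWorker name are illustrative assumptions, not the exact implementation.

```csharp
using System;
using System.Collections.Concurrent;

// Illustrative stub; the real PowerShellManager wraps a Runspace/PowerShell pair.
internal class PowerShellManager { }

internal class PowerShellManagerPool
{
    private readonly int _upperBound;
    private readonly BlockingCollection<PowerShellManager> _pool;
    private int _createdCount;

    internal PowerShellManagerPool()
    {
        // The pool's upper bound is configurable via an environment variable.
        string value = Environment.GetEnvironmentVariable("PSWorkerInProcConcurrencyUpperBound");
        _upperBound = int.TryParse(value, out int bound) && bound > 0 ? bound : 1;
        _pool = new BlockingCollection<PowerShellManager>(boundedCapacity: _upperBound);
    }

    // Called only from the main thread. The pool expands lazily: a new manager is
    // created only when no idle one is available and the upper bound has not been
    // reached; otherwise the call blocks until a manager is returned to the pool.
    internal PowerShellManager CheckoutIdleWorker()
    {
        if (_pool.TryTake(out PowerShellManager manager))
        {
            return manager;
        }

        if (_createdCount < _upperBound)
        {
            _createdCount++;
            return new PowerShellManager();
        }

        return _pool.Take(); // blocks until an idle instance shows up
    }

    // Called from a thread-pool thread once an invocation finishes.
    internal void ReclaimUsedWorker(PowerShellManager manager)
    {
        _pool.Add(manager);
    }
}
```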
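Finally, the single-writer idea behind the MessagingStream change could be sketched as follows. StreamingMessage is stubbed out here, and the gRPC write is a placeholder rather than the real duplex-stream API.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Stub standing in for the gRPC StreamingMessage type.
internal class StreamingMessage { }

internal class MessagingStream
{
    private readonly BlockingCollection<StreamingMessage> _msgQueue =
        new BlockingCollection<StreamingMessage>();

    internal MessagingStream()
    {
        // A single consumer drains the queue, so messages reach the gRPC channel in
        // exactly the order they were enqueued. With the semaphore approach, the
        // wake-up order of blocked writers is unspecified, so ordering could change.
        Task.Run(WriteLoopAsync);
    }

    // Producers on any thread simply enqueue; they never touch the gRPC stream.
    internal void Write(StreamingMessage message) => _msgQueue.Add(message);

    private async Task WriteLoopAsync()
    {
        foreach (StreamingMessage message in _msgQueue.GetConsumingEnumerable())
        {
            // Awaiting each write before taking the next item keeps the output ordered.
            await WriteToGrpcChannelAsync(message);
        }
    }

    private Task WriteToGrpcChannelAsync(StreamingMessage message)
    {
        // Placeholder for the real duplex-stream write (e.g. a client stream's WriteAsync).
        return Task.CompletedTask;
    }
}
```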

Throughput investigation

Design Change

The original design was to never block the main thread -- it reads a request message, starts a task for processing and forgets about it, then reads the next message.
With this design, every incoming request is handled immediately -- a task is created for it. As a result, a large number of tasks are created rapidly, and a correspondingly large number of thread-pool threads are assigned to run them. All those threads compete for an idle PowerShellManager from the pool, which results in lock contention in the pool. The average time for fetching an idle PowerShellManager instance gets dramatically worse once the rate of incoming requests reaches about 70/s -- it hits 994 ms in my measurements and becomes a huge bottleneck.

Therefore, I changed the design to make the main thread the only one that checks out idle PowerShellManager instances. When the main thread is able to check one out, it creates a task with that psManager to process the invocation request on a thread-pool thread. When it cannot check one out, it blocks on the pool until an idle instance becomes available.
With this design, the PS worker creates tasks only when a PowerShellManager instance is available, so the number of tasks running in parallel is limited to the size of the pool. This minimizes lock contention when checking instances out of the pool, and also minimizes the pressure on the TaskScheduler. A simplified sketch of the new main loop follows.
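The sketch below reuses the hypothetical PowerShellManagerPool and PowerShellManager types from the earlier pool sketch; the request and processor types are also illustrative stubs, not the worker's real classes.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative stubs; the real worker uses the gRPC StreamingMessage and its own RequestProcessor.
internal class InvocationRequest { }

internal class RequestProcessor
{
    internal void ProcessInvocationRequest(InvocationRequest request, PowerShellManager psManager)
    {
        // Placeholder: run the function through psManager and write the response message.
    }
}

internal static class WorkerLoop
{
    // The main thread is the only consumer of the pool.
    internal static void Run(IEnumerable<InvocationRequest> requests,
                             PowerShellManagerPool pool,
                             RequestProcessor processor)
    {
        foreach (InvocationRequest request in requests)
        {
            // Blocks until an idle PowerShellManager is available, which caps the number
            // of in-flight tasks at the pool size and avoids lock contention on the pool.
            PowerShellManager psManager = pool.CheckoutIdleWorker();

            // Fire-and-forget: the invocation runs on a thread-pool thread while the
            // main thread goes straight back to reading the next request.
            _ = Task.Run(() =>
            {
                try
                {
                    processor.ProcessInvocationRequest(request, psManager);
                }
                finally
                {
                    pool.ReclaimUsedWorker(psManager);
                }
            });
        }
    }
}
```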

Below are some data comparing before/after the design change. Pay attention to the AveFetchFromPoolTime column. With the original design, the average time to check out a PowerShellManager instance from the pool for an invocation request was ~900 ms when requests were arriving quickly. With the new design, the average time for the same operation drops drastically. Note that AveFetchFromPoolTime drops further, to 0.0054 ms, after removing the two system log calls, because each invocation request takes much less time to process, so a PowerShellManager instance becomes idle and is reclaimed by the pool much faster.

System Logging Mystery

In an AzF worker, a system log is in fact a log message written out by calling Console.WriteLine. The AzF host redirects the worker process's STDOUT and can therefore intercept the message and turn it into a system log.
I don't know why the system logging causes such a big impact on invocation processing time when requests arrive quickly. I will discuss this with @pragnagopa offline.

The Throughput Test Results

I use Locust as the infrastructure for load testing locally, so it's guaranteed that only one PS worker is involved (the app would auto-scale if testing against Azure).
Here is the benchmark I got for the C# script function, the JavaScript function, and the PowerShell (new design + no system logging) function:

| Runtime (simulating 300 users) | Requests Per Second |
| --- | --- |
| C# (csx) | 47.5 RPS |
| Node.js | 44.6 RPS |
| PowerShell, pool size 1 | 45.3 RPS |
| PowerShell, pool size 5 | 46.2 RPS |

As you can see, we are at the same level as the C# script function and the JavaScript function.
However, a pool of 5 instances doesn't show better concurrency performance compared with a pool of 1 instance. Based on the data I collected from the measurements, I believe that's because the AzF Host (Function Runtime) becomes the bottleneck once incoming requests arrive very fast (> 70/s), so the PS worker hasn't reached the point where a larger pool benefits from the extra PowerShellManager instances. Looking at the AveFetchFromPoolTime column of the last two tables below (pool-5 vs. pool-1), the average time for fetching an idle instance from the pool is 100x faster in the pool-5 case. However, even though it is 100x slower, the pool-1 case can still get an idle instance in about 0.28 ms on average, which is why we don't see a difference in RPS.

Again, I will have an in-depth discussion with @pragnagopa offline about the bottleneck in the AzF host.

Original Design

| InvRequest | InvDone | AveInvTime | AveFetchFromPoolTime |
| --- | --- | --- | --- |
| 5900 | 5788 | 94 | 884 |
| 6000 | 5886 | 95 | 919 |
| 6100 | 5980 | 96 | 955 |
| 6200 | 6080 | 97 | 994 |

New Design with the 2 System Log Calls (pool size 5)

| Request | InvRequest | InvDone | AveGetReqTime | AveInvTime | AveFetchFromPoolTime | AveReturnToPoolTime | AveWriteResponseTime |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 6008 | 6000 | 5994 | 28.7243675099867 | 93.2944611277945 | 17.835 | 0.000166833500166834 | 0.0191858525191859 |
| 6108 | 6100 | 6095 | 28.2540929927963 | 94.2059064807219 | 18.02868852 | 0.000164068908941756 | 0.0188679245283019 |
| 6208 | 6200 | 6194 | 27.7989690721649 | 94.585728123991 | 18.11709678 | 0.000161446561188247 | 0.0185663545366484 |

New Design without the 2 System Log Calls (pool size 5)

| Request | InvRequest | InvDone | AveGetReqTime | AveInvTime | AveFetchFromPoolTime | AveReturnToPoolTime | AveWriteResponseTime |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 7008 | 7000 | 6999 | 35.5720605022831 | 2.67166738105444 | 0.00557142857142857 | 0.000142877553936277 | 0.00257179597085298 |
| 7108 | 7100 | 7099 | 35.3432751828925 | 2.67333427243274 | 0.00549295774647887 | 0.000140864910550782 | 0.00253556838991407 |
| 7208 | 7200 | 7199 | 35.1571864594895 | 2.67356577302403 | 0.00541666666666667 | 0.000138908181691902 | 0.00250034727045423 |

New Design without the 2 System Log Calls (pool size 1)

| Request | InvRequest | InvDone | AveGetReqTime | AveInvTime | AveFetchFromPoolTime | AveReturnToPoolTime | AveWriteResponseTime |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 7608 | 7600 | 7599 | 35.6027865404837 | 2.59323595209896 | 0.279473684210526 | 0.00013159626266614 | 0.00157915515199368 |
| 7708 | 7700 | 7699 | 35.4072392319668 | 2.59527211326146 | 0.280519480519481 | 0.000129886998311469 | 0.00155864397973763 |
| 7808 | 7800 | 7799 | 35.2155481557377 | 2.59648672906783 | 0.281666666666667 | 0.000128221566867547 | 0.00153865880241057 |

@TylerLeonhardt (Member) left a comment:
Looks good - just a couple questions

@TylerLeonhardt (Member) left a comment:
🚢 it

@daxian-dbw merged commit 5de474c into Azure:dev on Jan 7, 2019.
@daxian-dbw deleted the MinimizeNumberOfTasks branch on January 7, 2019.
@daxian-dbw (Contributor, Author) commented:

Thanks for the review, the in-proc concurrency support is now merged.
