Don't use SerializingExecutor when running with a direct executor. #368

Closed
buchgr opened this issue May 1, 2015 · 9 comments

@buchgr
Contributor

buchgr commented May 1, 2015

I wanted to know what the impact of the SerializingExecutor is when running with a direct executor.

So I ran 3 benchmark configurations, taking the best of 3 runs for each.

Direct Executor + Serializing (current master)
buchgr@buchgr0:~/Code/grpc-java/benchmarks/build/install/grpc-benchmarks/bin$ ./qps_client --port=33333 --host=localhost --channels=8 --outstanding_rpcs_per_channel=10 --warmup_duration=10s --duration=30s --server_payload=1 --client_payload=1 --directexecutor
Channels:                       8
Outstanding RPCs per Channel:   10
Server Payload Size:            1
Client Payload Size:            1
50%ile Latency (in micros):     669
90%ile Latency (in micros):     1817
95%ile Latency (in micros):     3231
99%ile Latency (in micros):     5727
99.9%ile Latency (in micros):   9255
QPS:                            78642
Direct Executor + Serializing Executor without synchronized blocks.
buchgr@buchgr0:~/Code/grpc-java/benchmarks/build/install/grpc-benchmarks/bin$ ./qps_client --port=33333 --host=localhost --channels=8 --outstanding_rpcs_per_channel=10 --warmup_duration=10s --duration=30s --server_payload=1 --client_payload=1 --directexecutor
Channels:                       8
Outstanding RPCs per Channel:   10
Server Payload Size:            1
Client Payload Size:            1
50%ile Latency (in micros):     655
90%ile Latency (in micros):     1647
95%ile Latency (in micros):     3083
99%ile Latency (in micros):     5679
99.9%ile Latency (in micros):   9607
QPS:                            81500
Direct Executor only, no Serializing Executor
buchgr@buchgr0:~/Code/grpc-java/benchmarks/build/install/grpc-benchmarks/bin$ ./qps_client --port=33333 --host=localhost --channels=8 --outstanding_rpcs_per_channel=10 --warmup_duration=10s --duration=30s --server_payload=1 --client_payload=1 --directexecutor
Channels:                       8
Outstanding RPCs per Channel:   10
Server Payload Size:            1
Client Payload Size:            1
50%ile Latency (in micros):     619
90%ile Latency (in micros):     1096
95%ile Latency (in micros):     1132
99%ile Latency (in micros):     3999
99.9%ile Latency (in micros):   11407
QPS:                            99904

So it seems to me that the potential improvement is significant enough to justify a change and not use a SerializingExecutor when running with a direct executor, e.g. by adding an option to the Server / Channel builders.
WDYT @nmittler @louiscryan @ejona86 ?
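
For context on what is being measured here: a minimal sketch, assuming a much-simplified design rather than gRPC's actual io.grpc.SerializingExecutor, of what a serializing executor adds on top of a direct executor. Every callback goes through a queue and a lock so that callbacks never run concurrently, even when the delegate executor runs tasks inline.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;

// Much-simplified illustration of the pattern being benchmarked above; not
// gRPC's actual io.grpc.SerializingExecutor.
final class SimpleSerializingExecutor implements Executor {
  private final Executor delegate;            // e.g. a direct executor: Runnable::run
  private final Queue<Runnable> queue = new ArrayDeque<>();
  private boolean draining;                   // guarded by 'this'

  SimpleSerializingExecutor(Executor delegate) {
    this.delegate = delegate;
  }

  @Override
  public void execute(Runnable task) {
    synchronized (this) {
      queue.add(task);
      if (draining) {
        return;                               // a drain pass is already scheduled
      }
      draining = true;
    }
    delegate.execute(this::drain);
  }

  private void drain() {
    while (true) {
      Runnable next;
      synchronized (this) {
        next = queue.poll();
        if (next == null) {
          draining = false;                   // queue empty, stop draining
          return;
        }
      }
      next.run();                             // tasks run one at a time, in submission order
    }
  }
}
```

When the delegate is a direct executor (Runnable::run), the queue and the synchronized blocks are pure overhead compared to running the task inline, which is the gap the numbers above are measuring.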

@buchgr buchgr self-assigned this May 1, 2015
@nmittler
Member

nmittler commented May 1, 2015

Nice! SGTM

@nmittler
Member

nmittler commented May 1, 2015

Should we just make DirectExecutor the default?

@ejona86
Member

ejona86 commented May 1, 2015

No, we can't really make DirectExecutor the default given all the trouble that caused with Stubby in the past. We really want people to opt-in to the "don't block, ever" requirement.
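
To illustrate the "don't block, ever" requirement (illustrative code only; StreamObserver is gRPC's stub-level callback interface, while the String payload and writeToDatabase are hypothetical): with a direct executor, callbacks run on the transport's event-loop thread, so a single blocking call stalls every RPC multiplexed on that thread.

```java
import io.grpc.stub.StreamObserver;

final class BlockingCallbackExample {
  // Hypothetical application callback; 'writeToDatabase' stands in for any blocking call.
  static StreamObserver<String> replyObserver() {
    return new StreamObserver<String>() {
      @Override public void onNext(String reply) {
        // With an application executor this only ties up an app thread.
        // With a direct executor it runs on the transport's event-loop thread
        // and stalls every other RPC sharing that thread.
        writeToDatabase(reply);
      }
      @Override public void onError(Throwable t) {}
      @Override public void onCompleted() {}
    };
  }

  private static void writeToDatabase(String reply) {
    // stand-in for slow, blocking I/O
  }
}
```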

@louiscryan
Contributor

Impressive!

I'm with Eric on this one. The fact that we have an explicit directive in the builder to use direct executor means that we can still plumb this in just fine.

@buchgr
Contributor Author

buchgr commented May 1, 2015

Would you prefer to add a DirectExecutor to grpc and special-case this class when set via Builder.executor(...), or to have an additional option Builder.directExecutor()? I would prefer the latter.

@ejona86
Member

ejona86 commented May 4, 2015

I agree with @buchgr's proposal on having a directExecutor() on the Builders.

If we wanted our options to be orthogonal, we could have an option disableSerializedCallbacks() or some such. We could still have directExecutor(), which would be equivalent to calling executor(directExecutor) and disableSerializedCallbacks(). I don't have many other use cases for disableSerializedCallbacks(), though, so I question whether it is useful. I've considered fanning out streaming requests to multiple threads, and also an "advanced" direct executor that sets a thread-local or some such.
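
A sketch of how the two API shapes discussed here could look at a call site. The directExecutor() form matches the builder option that the commits below ultimately add; disableSerializedCallbacks() is hypothetical and shown only for comparison.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

final class DirectExecutorOption {
  static ManagedChannel channel() {
    // Preferred shape: one explicit builder method that both installs a direct
    // executor and lets the library skip the SerializingExecutor wrapping.
    return ManagedChannelBuilder.forAddress("localhost", 1234)
        .directExecutor()   // opt in: callbacks may run on the network thread
        .build();

    // The orthogonal alternative would have looked roughly like
    //   builder.executor(MoreExecutors.directExecutor()).disableSerializedCallbacks();
    // where disableSerializedCallbacks() is hypothetical and was never added.
  }
}
```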

@buchgr
Contributor Author

buchgr commented Nov 2, 2015

Hmm, is that still of interest, or is looking into speeding up SerializingExecutor a better idea (i.e. #1050)? I have a lock-free version using CAS and a queue from JCTools pushed to a branch somewhere, but if I recall correctly it had some bugs :).
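
For reference, a minimal sketch of the CAS-based approach described here, assuming a JDK ConcurrentLinkedQueue in place of a JCTools MPSC queue and an AtomicBoolean for the drain flag; this is not the branch mentioned above, just an illustration of the pattern.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of a CAS-based serializing executor. A JCTools MPSC queue could be
// swapped in for ConcurrentLinkedQueue to cut allocation and contention.
final class LockFreeSerializingExecutor implements Executor {
  private final Executor delegate;
  private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
  private final AtomicBoolean draining = new AtomicBoolean();

  LockFreeSerializingExecutor(Executor delegate) {
    this.delegate = delegate;
  }

  @Override
  public void execute(Runnable task) {
    queue.add(task);
    if (draining.compareAndSet(false, true)) {  // CAS instead of synchronized
      delegate.execute(this::drain);
    }
  }

  private void drain() {
    do {
      Runnable next;
      while ((next = queue.poll()) != null) {
        next.run();                             // still strictly serialized
      }
      draining.set(false);
      // A producer may have enqueued after poll() saw an empty queue but before
      // the flag was cleared; if so, try to reclaim the drain duty and loop.
    } while (!queue.isEmpty() && draining.compareAndSet(false, true));
  }
}
```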

@louiscryan
Contributor

This is still of interest as a straightforward fix.

A better form of executor striping + JCTools might be the way to go in the long run, but there is probably more performance mileage to be gotten out of the GRPC Deframer, and I have some stuff queued up there. I believe JCTools released a new version with a bunch of fixes in the SPSC growable array queue.

@buchgr
Contributor Author

buchgr commented Nov 2, 2015

OK, I'll open a PR.

buchgr added a commit to buchgr/grpc-java that referenced this issue Nov 11, 2015
When using a direct executor we don't need to wrap calls in a serializing executor and can thus also avoid any overhead that comes with it.

Benchmarks show that throughput can be improved substantially. On my MBP I get an 8%–12% improvement in throughput, along with slightly better latency.

=== BEFORE ===
Channels:                       4
Outstanding RPCs per Channel:   10
Server Payload Size:            0
Client Payload Size:            0
50%ile Latency (in micros):     452
90%ile Latency (in micros):     600
95%ile Latency (in micros):     726
99%ile Latency (in micros):     1314
99.9%ile Latency (in micros):   5663
Maximum Latency (in micros):    136447
QPS:                            78498

=== AFTER ===
Channels:                       4
Outstanding RPCs per Channel:   10
Server Payload Size:            0
Client Payload Size:            0
50%ile Latency (in micros):     432
90%ile Latency (in micros):     540
95%ile Latency (in micros):     609
99%ile Latency (in micros):     931
99.9%ile Latency (in micros):   3471
Maximum Latency (in micros):    126015
QPS:                            85779
buchgr added a commit to buchgr/grpc-java that referenced this issue Nov 15, 2015
When using a direct executor we don't need to wrap calls in a serializing executor and can thus also avoid the overhead that comes with it.

Benchmarks show that throughput can be improved substantially. On my MBP I get a 23% improvement in throughput, along with significantly better latency across all percentiles.

=== BEFORE ===
Channels:                       4
Outstanding RPCs per Channel:   10
Server Payload Size:            0
Client Payload Size:            0
50%ile Latency (in micros):     452
90%ile Latency (in micros):     600
95%ile Latency (in micros):     726
99%ile Latency (in micros):     1314
99.9%ile Latency (in micros):   5663
Maximum Latency (in micros):    136447
QPS:                            78498

=== AFTER ===
Channels:                       4
Outstanding RPCs per Channel:   10
Server Payload Size:            0
Client Payload Size:            0
50%ile Latency (in micros):     394
90%ile Latency (in micros):     435
95%ile Latency (in micros):     466
99%ile Latency (in micros):     937
99.9%ile Latency (in micros):   1778
Maximum Latency (in micros):    113535
QPS:                            96836
buchgr added a commit to buchgr/grpc-java that referenced this issue Nov 16, 2015
When using a direct executor we don't need to wrap calls in a serializing executor and can thus also avoid the overhead that comes with it.

Benchmarks show that throughput can be improved substantially. On my MBP I get a 24% improvement in throughput, along with significantly better latency across all percentiles.

(running qps_client and qps_server with --address=localhost:1234 --directexecutor)

=== BEFORE ===
Channels:                       4
Outstanding RPCs per Channel:   10
Server Payload Size:            0
Client Payload Size:            0
50%ile Latency (in micros):     452
90%ile Latency (in micros):     600
95%ile Latency (in micros):     726
99%ile Latency (in micros):     1314
99.9%ile Latency (in micros):   5663
Maximum Latency (in micros):    136447
QPS:                            78498

=== AFTER ===
Channels:                       4
Outstanding RPCs per Channel:   10
Server Payload Size:            0
Client Payload Size:            0
50%ile Latency (in micros):     399
90%ile Latency (in micros):     429
95%ile Latency (in micros):     453
99%ile Latency (in micros):     650
99.9%ile Latency (in micros):   1265
Maximum Latency (in micros):    33855
QPS:                            97552
buchgr added further commits to buchgr/grpc-java referencing this issue on Nov 18 and Nov 19, 2015, each carrying the same commit message and benchmark results as the Nov 16, 2015 commit above.
@buchgr buchgr closed this as completed in 602473d Nov 19, 2015