Don't use SerializingExecutor when running with a direct executor. #368

Closed
buchgr opened this issue May 1, 2015 · 9 comments

@buchgr
Contributor

buchgr commented May 1, 2015

I wanted to know what the impact of the SerializingExecutor is when running with a direct executor.

So I ran 3 benchmark configurations, taking the best of 3 runs for each.

Direct Executor + Serializing (current master)
buchgr@buchgr0:~/Code/grpc-java/benchmarks/build/install/grpc-benchmarks/bin$ ./qps_client --port=33333 --host=localhost --channels=8 --outstanding_rpcs_per_channel=10 --warmup_duration=10s --duration=30s --server_payload=1 --client_payload=1 --directexecutor
Channels:                       8
Outstanding RPCs per Channel:   10
Server Payload Size:            1
Client Payload Size:            1
50%ile Latency (in micros):     669
90%ile Latency (in micros):     1817
95%ile Latency (in micros):     3231
99%ile Latency (in micros):     5727
99.9%ile Latency (in micros):   9255
QPS:                            78642
Direct Executor + Serializing Executor without synchronized blocks.
buchgr@buchgr0:~/Code/grpc-java/benchmarks/build/install/grpc-benchmarks/bin$ ./qps_client --port=33333 --host=localhost --channels=8 --outstanding_rpcs_per_channel=10 --warmup_duration=10s --duration=30s --server_payload=1 --client_payload=1 --directexecutor
Channels:                       8
Outstanding RPCs per Channel:   10
Server Payload Size:            1
Client Payload Size:            1
50%ile Latency (in micros):     655
90%ile Latency (in micros):     1647
95%ile Latency (in micros):     3083
99%ile Latency (in micros):     5679
99.9%ile Latency (in micros):   9607
QPS:                            81500
Direct Executor only, no Serializing Executor
buchgr@buchgr0:~/Code/grpc-java/benchmarks/build/install/grpc-benchmarks/bin$ ./qps_client --port=33333 --host=localhost --channels=8 --outstanding_rpcs_per_channel=10 --warmup_duration=10s --duration=30s --server_payload=1 --client_payload=1 --directexecutor
Channels:                       8
Outstanding RPCs per Channel:   10
Server Payload Size:            1
Client Payload Size:            1
50%ile Latency (in micros):     619
90%ile Latency (in micros):     1096
95%ile Latency (in micros):     1132
99%ile Latency (in micros):     3999
99.9%ile Latency (in micros):   11407
QPS:                            99904

So it seems to me that the potential improvement is significant enough to justify a change and not use a SerializingExecutor when running with a direct executor, e.g. by adding an option to the Server / Channel builders.
WDYT @nmittler @louiscryan @ejona86 ?
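
For context on what is being measured here: a minimal sketch, assuming a much-simplified design rather than gRPC's actual io.grpc.SerializingExecutor, of what a serializing executor adds on top of a direct executor. Every callback goes through a queue and a lock so that callbacks never run concurrently, even when the delegate executor runs tasks inline.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;

// Much-simplified illustration of the pattern being benchmarked above; not
// gRPC's actual io.grpc.SerializingExecutor.
final class SimpleSerializingExecutor implements Executor {
  private final Executor delegate;            // e.g. a direct executor: Runnable::run
  private final Queue<Runnable> queue = new ArrayDeque<>();
  private boolean draining;                   // guarded by 'this'

  SimpleSerializingExecutor(Executor delegate) {
    this.delegate = delegate;
  }

  @Override
  public void execute(Runnable task) {
    synchronized (this) {
      queue.add(task);
      if (draining) {
        return;                               // a drain pass is already scheduled
      }
      draining = true;
    }
    delegate.execute(this::drain);
  }

  private void drain() {
    while (true) {
      Runnable next;
      synchronized (this) {
        next = queue.poll();
        if (next == null) {
          draining = false;                   // queue empty, stop draining
          return;
        }
      }
      next.run();                             // tasks run one at a time, in submission order
    }
  }
}
```

When the delegate is a direct executor (Runnable::run), the queue and the synchronized blocks are pure overhead compared to running the task inline, which is the gap the numbers above are measuring.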

@buchgr buchgr self-assigned this May 1, 2015
@nmittler
Member

nmittler commented May 1, 2015

Nice! SGTM

@nmittler
Member

nmittler commented May 1, 2015

Should we just make DirectExecutor the default?

@ejona86
Member

ejona86 commented May 1, 2015

No, we can't really make DirectExecutor the default given all the trouble that caused with Stubby in the past. We really want people to opt-in to the "don't block, ever" requirement.
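
To illustrate the "don't block, ever" requirement (illustrative code only; StreamObserver is gRPC's stub-level callback interface, while the String payload and writeToDatabase are hypothetical): with a direct executor, callbacks run on the transport's event-loop thread, so a single blocking call stalls every RPC multiplexed on that thread.

```java
import io.grpc.stub.StreamObserver;

final class BlockingCallbackExample {
  // Hypothetical application callback; 'writeToDatabase' stands in for any blocking call.
  static StreamObserver<String> replyObserver() {
    return new StreamObserver<String>() {
      @Override public void onNext(String reply) {
        // With an application executor this only ties up an app thread.
        // With a direct executor it runs on the transport's event-loop thread
        // and stalls every other RPC sharing that thread.
        writeToDatabase(reply);
      }
      @Override public void onError(Throwable t) {}
      @Override public void onCompleted() {}
    };
  }

  private static void writeToDatabase(String reply) {
    // stand-in for slow, blocking I/O
  }
}
```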

@louiscryan
Contributor

Impressive!

I'm with Eric on this one. The fact that we have an explicit directive in the builder to use direct executor means that we can still plumb this in just fine.

@buchgr
Contributor Author

buchgr commented May 1, 2015

Would you prefer to add a DirectExecutor to grpc and special-case this class when set via Builder.executor(...), or to have an additional option Builder.directExecutor()? I would prefer the latter.

@ejona86
Member

ejona86 commented May 4, 2015

I agree with @buchgr's proposal on having a directExecutor() on the Builders.

If we wanted our options to be orthogonal, we could have an option disableSerializedCallbacks() or some such. We could still have directExecutor(), which would be equivalent to calling executor(directExecutor) and disableSerializedCallbacks(). I don't have many other use cases for disableSerializedCallbacks(), though, so I question whether it is useful. I've considered fanning out streaming requests to multiple threads, and also an "advanced" direct executor that sets a thread-local or some such.
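
A sketch of how the two API shapes discussed here could look at a call site. The directExecutor() form matches the builder option that the commits below ultimately add; disableSerializedCallbacks() is hypothetical and shown only for comparison.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

final class DirectExecutorOption {
  static ManagedChannel channel() {
    // Preferred shape: one explicit builder method that both installs a direct
    // executor and lets the library skip the SerializingExecutor wrapping.
    return ManagedChannelBuilder.forAddress("localhost", 1234)
        .directExecutor()   // opt in: callbacks may run on the network thread
        .build();

    // The orthogonal alternative would have looked roughly like
    //   builder.executor(MoreExecutors.directExecutor()).disableSerializedCallbacks();
    // where disableSerializedCallbacks() is hypothetical and was never added.
  }
}
```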

@buchgr
Contributor Author

buchgr commented Nov 2, 2015

Hmm, is that still of interest, or is looking into speeding up SerializingExecutor a better idea (i.e. #1050)? I have a lock-free version using CAS and a queue from JCTools pushed to a branch somewhere, but if I recall correctly it had some bugs :).
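
For reference, a minimal sketch of the CAS-based approach described here, assuming a JDK ConcurrentLinkedQueue in place of a JCTools MPSC queue and an AtomicBoolean for the drain flag; this is not the branch mentioned above, just an illustration of the pattern.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of a CAS-based serializing executor. A JCTools MPSC queue could be
// swapped in for ConcurrentLinkedQueue to cut allocation and contention.
final class LockFreeSerializingExecutor implements Executor {
  private final Executor delegate;
  private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
  private final AtomicBoolean draining = new AtomicBoolean();

  LockFreeSerializingExecutor(Executor delegate) {
    this.delegate = delegate;
  }

  @Override
  public void execute(Runnable task) {
    queue.add(task);
    if (draining.compareAndSet(false, true)) {  // CAS instead of synchronized
      delegate.execute(this::drain);
    }
  }

  private void drain() {
    do {
      Runnable next;
      while ((next = queue.poll()) != null) {
        next.run();                             // still strictly serialized
      }
      draining.set(false);
      // A producer may have enqueued after poll() saw an empty queue but before
      // the flag was cleared; if so, try to reclaim the drain duty and loop.
    } while (!queue.isEmpty() && draining.compareAndSet(false, true));
  }
}
```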

@louiscryan
Contributor

This is still of interest as a straightforward fix.

A better form of executor striping + JCTools might be the way to go in the long run, but there is probably more performance mileage to be gotten out of the GRPC Deframer, and I have some stuff queued up there. I believe JCTools released a new version with a bunch of fixes in the SPSC growable array queue.

@buchgr
Contributor Author

buchgr commented Nov 2, 2015

OK, I'll open a PR.

buchgr added a commit to buchgr/grpc-java that referenced this issue Nov 11, 2015
When using a direct executor we don't need to wrap calls in a serializing executor and can thus also avoid any overhead that comes with it.

Benchmarks show that throughput can be improved substantially. On my MBP I get an 8%–12% improvement in throughput, along with slightly better latency.

=== BEFORE ===
Channels:                       4
Outstanding RPCs per Channel:   10
Server Payload Size:            0
Client Payload Size:            0
50%ile Latency (in micros):     452
90%ile Latency (in micros):     600
95%ile Latency (in micros):     726
99%ile Latency (in micros):     1314
99.9%ile Latency (in micros):   5663
Maximum Latency (in micros):    136447
QPS:                            78498

=== AFTER ===
Channels:                       4
Outstanding RPCs per Channel:   10
Server Payload Size:            0
Client Payload Size:            0
50%ile Latency (in micros):     432
90%ile Latency (in micros):     540
95%ile Latency (in micros):     609
99%ile Latency (in micros):     931
99.9%ile Latency (in micros):   3471
Maximum Latency (in micros):    126015
QPS:                            85779
buchgr added a commit to buchgr/grpc-java that referenced this issue Nov 15, 2015
When using a direct executor we don't need to wrap calls in a serializing executor and can thus also avoid the overhead that comes with it.

Benchmarks show that throughput can be improved substantially. On my MBP I get a 23% improvement in throughput, along with significantly better latency across all percentiles.

=== BEFORE ===
Channels:                       4
Outstanding RPCs per Channel:   10
Server Payload Size:            0
Client Payload Size:            0
50%ile Latency (in micros):     452
90%ile Latency (in micros):     600
95%ile Latency (in micros):     726
99%ile Latency (in micros):     1314
99.9%ile Latency (in micros):   5663
Maximum Latency (in micros):    136447
QPS:                            78498

=== AFTER ===
Channels:                       4
Outstanding RPCs per Channel:   10
Server Payload Size:            0
Client Payload Size:            0
50%ile Latency (in micros):     394
90%ile Latency (in micros):     435
95%ile Latency (in micros):     466
99%ile Latency (in micros):     937
99.9%ile Latency (in micros):   1778
Maximum Latency (in micros):    113535
QPS:                            96836
buchgr added a commit to buchgr/grpc-java that referenced this issue Nov 16, 2015
When using a direct executor we don't need to wrap calls in a serializing executor and can thus also avoid the overhead that comes with it.

Benchmarks show that throughput can be improved substantially. On my MBP I get a 24% improvement in throughput, along with significantly better latency across all percentiles.

(running qps_client and qps_server with --address=localhost:1234 --directexecutor)

=== BEFORE ===
Channels:                       4
Outstanding RPCs per Channel:   10
Server Payload Size:            0
Client Payload Size:            0
50%ile Latency (in micros):     452
90%ile Latency (in micros):     600
95%ile Latency (in micros):     726
99%ile Latency (in micros):     1314
99.9%ile Latency (in micros):   5663
Maximum Latency (in micros):    136447
QPS:                            78498

=== AFTER ===
Channels:                       4
Outstanding RPCs per Channel:   10
Server Payload Size:            0
Client Payload Size:            0
50%ile Latency (in micros):     399
90%ile Latency (in micros):     429
95%ile Latency (in micros):     453
99%ile Latency (in micros):     650
99.9%ile Latency (in micros):   1265
Maximum Latency (in micros):    33855
QPS:                            97552
buchgr added further commits to buchgr/grpc-java referencing this issue on Nov 18 and Nov 19, 2015, each carrying the same commit message and benchmark results as the Nov 16, 2015 commit above.
@buchgr buchgr closed this as completed in 602473d Nov 19, 2015