Don't use SerializingExecutor when running with a direct executor. #368
Nice! SGTM

Should we just make DirectExecutor the default?

No, we can't really make DirectExecutor the default given all the trouble that caused with Stubby in the past. We really want people to opt in to the "don't block, ever" requirement.

Impressive! I'm with Eric on this one. The fact that we have an explicit directive in the builder to use a direct executor means that we can still plumb this in just fine.
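For context, a minimal sketch of what that explicit opt-in looks like from the caller's side, assuming current grpc-java builder names (ManagedChannelBuilder, usePlaintext()) and Guava's MoreExecutors.directExecutor(); treat the exact names as illustrative:

```java
import com.google.common.util.concurrent.MoreExecutors;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public final class DirectExecutorOptIn {
  public static void main(String[] args) {
    // Explicit opt-in: all callbacks run directly on the transport thread,
    // so the application promises to never block in them.
    ManagedChannel channel = ManagedChannelBuilder
        .forAddress("localhost", 1234)
        .usePlaintext()
        .executor(MoreExecutors.directExecutor())
        .build();
    channel.shutdownNow();
  }
}
```

The point of making the call explicit is that blocking inside a callback would stall the transport's event loop, so the caller has to accept that contract deliberately.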
Would you prefer to add a dedicated option to the Server / Channel builders for this?
I agree with @buchgr's proposal on having a dedicated builder option. If we wanted our options to be orthogonal, we could have a separate option that controls callback serialization independently of the executor choice.
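To make the orthogonality idea concrete, here is one hypothetical shape such an API could take. The interface and the serializeCallbacks name are invented for illustration; neither exists in grpc-java:

```java
import java.util.concurrent.Executor;

// Hypothetical builder surface, invented for illustration (not grpc-java API):
// the executor choice and callback serialization become orthogonal knobs.
interface ChannelBuilderSketch {
  ChannelBuilderSketch executor(Executor executor);
  // true (default): wrap callbacks in a SerializingExecutor;
  // false: dispatch callbacks straight to the supplied executor.
  ChannelBuilderSketch serializeCallbacks(boolean serialize);
}
```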
Hmm, is that still of interest, or is looking into speeding up SerializingExecutor preferred?
This is still of interest as a straightforward fix. A better form of executor striping + JCTools might be the way to go in the long run.
OK, I'll open a PR.
When using a direct executor we don't need to wrap calls in a serializing executor and can thus also avoid any overhead that comes with it. Benchmarks show that throughput can be improved substantially. On my MBP I get an 8% - 12% improvement in throughput, with slightly better latency as well.

=== BEFORE ===
Channels: 4
Outstanding RPCs per Channel: 10
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 452
90%ile Latency (in micros): 600
95%ile Latency (in micros): 726
99%ile Latency (in micros): 1314
99.9%ile Latency (in micros): 5663
Maximum Latency (in micros): 136447
QPS: 78498

=== AFTER ===
Channels: 4
Outstanding RPCs per Channel: 10
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 432
90%ile Latency (in micros): 540
95%ile Latency (in micros): 609
99%ile Latency (in micros): 931
99.9%ile Latency (in micros): 3471
Maximum Latency (in micros): 126015
QPS: 85779
An updated revision of the change improved things further:

When using a direct executor we don't need to wrap calls in a serializing executor and can thus also avoid the overhead that comes with it. Benchmarks show that throughput can be improved substantially. On my MBP I get a 23% improvement in throughput, with significantly better latency across all percentiles.

=== BEFORE ===
Channels: 4
Outstanding RPCs per Channel: 10
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 452
90%ile Latency (in micros): 600
95%ile Latency (in micros): 726
99%ile Latency (in micros): 1314
99.9%ile Latency (in micros): 5663
Maximum Latency (in micros): 136447
QPS: 78498

=== AFTER ===
Channels: 4
Outstanding RPCs per Channel: 10
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 394
90%ile Latency (in micros): 435
95%ile Latency (in micros): 466
99%ile Latency (in micros): 937
99.9%ile Latency (in micros): 1778
Maximum Latency (in micros): 113535
QPS: 96836
And the final version:

When using a direct executor we don't need to wrap calls in a serializing executor and can thus also avoid the overhead that comes with it. Benchmarks show that throughput can be improved substantially. On my MBP I get a 24% improvement in throughput, with significantly better latency across all percentiles. (Running qps_client and qps_server with --address=localhost:1234 --directexecutor.)

=== BEFORE ===
Channels: 4
Outstanding RPCs per Channel: 10
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 452
90%ile Latency (in micros): 600
95%ile Latency (in micros): 726
99%ile Latency (in micros): 1314
99.9%ile Latency (in micros): 5663
Maximum Latency (in micros): 136447
QPS: 78498

=== AFTER ===
Channels: 4
Outstanding RPCs per Channel: 10
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 399
90%ile Latency (in micros): 429
95%ile Latency (in micros): 453
99%ile Latency (in micros): 650
99.9%ile Latency (in micros): 1265
Maximum Latency (in micros): 33855
QPS: 97552
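A minimal sketch of the idea behind the change (illustrative, not the actual diff), assuming the channel/server wraps the application executor in something like grpc-java's internal SerializingExecutor(Executor):

```java
import java.util.concurrent.Executor;

import com.google.common.util.concurrent.MoreExecutors;

import io.grpc.internal.SerializingExecutor;

final class CallExecutorChoice {
  // Picks the executor that call callbacks are dispatched on.
  static Executor callExecutor(Executor appExecutor) {
    // MoreExecutors.directExecutor() is a singleton, so an identity check
    // is sufficient to detect it.
    if (appExecutor == MoreExecutors.directExecutor()) {
      // A direct executor already runs tasks inline and in order on the
      // transport thread; the serializing wrapper would only add queueing
      // and atomic bookkeeping on every message.
      return appExecutor;
    }
    return new SerializingExecutor(appExecutor);
  }
}
```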
I wanted to know what the impact of the SerializingExecutor is when running with a direct executor. So I did 3 benchmarks, choosing the best out of 3 runs:
1. Direct Executor + SerializingExecutor (current master)
2. Direct Executor + a SerializingExecutor without synchronized blocks (a sketch of this variant follows the list)
3. Direct Executor only, no SerializingExecutor
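For reference, a minimal sketch of the kind of lock-free variant measured in (2), replacing synchronized blocks with a ConcurrentLinkedQueue and an AtomicBoolean (illustrative; not grpc-java's actual class):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: a serializing executor that runs tasks one at a time,
// in submission order, without synchronized blocks.
final class LockFreeSerializingExecutor implements Executor, Runnable {
  private final Executor delegate;
  private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
  private final AtomicBoolean running = new AtomicBoolean();

  LockFreeSerializingExecutor(Executor delegate) {
    this.delegate = delegate;
  }

  @Override
  public void execute(Runnable task) {
    queue.add(task);
    schedule();
  }

  // CAS replaces the synchronized block: at most one drain loop is active.
  private void schedule() {
    if (running.compareAndSet(false, true)) {
      delegate.execute(this);
    }
  }

  @Override
  public void run() {
    try {
      Runnable task;
      while ((task = queue.poll()) != null) {
        task.run();
      }
    } finally {
      running.set(false);
      // Another thread may have enqueued a task after poll() returned null
      // but before the flag was cleared; re-check to avoid losing it.
      if (!queue.isEmpty()) {
        schedule();
      }
    }
  }
}
```

Even with a direct executor as the delegate, this still costs a queue add, a poll, and a CAS per task, which is the overhead that benchmark (3) removes entirely.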
So it seems to me that the potential improvement is significant enough to make some changes and not use a SerializingExecutor when using a direct executor, e.g. by adding an option to the Server / Channel Builders. WDYT @nmittler @louiscryan @ejona86?