lazy read disable for the http1 codec#20148

Merged
mattklein123 merged 6 commits into envoyproxy:main from wbpcode:no-read-disable
Mar 10, 2022
Conversation

@wbpcode
Member

@wbpcode wbpcode commented Mar 1, 2022

Signed-off-by: wbpcode wbphub@live.com

Commit Message: lazy read disable for the http1 codec
Additional Description:

Ref #19900 for more info.

Risk Level: Low.
Testing: N/A.
Docs Changes: N/A.
Release Notes: N/A.
Platform Specific Features: N/A.
Optional Runtime guard: envoy.reloadable_features.http1_lazy_read_disable.
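
The core idea can be sketched as a small predicate (a minimal sketch with hypothetical names; the real implementation lives in Envoy's HTTP/1 codec and is more involved): instead of eagerly toggling readDisable() around every request, only disable reading when the active request's request half is complete but extra bytes are already buffered or still arriving, i.e. the client is pipelining.

```cpp
#include <cstddef>

// Illustrative sketch only: names are hypothetical, not Envoy's real
// codec API. Reading is disabled only when the active request's remote
// half is complete AND pipelined data is pending; it is re-enabled once
// the active request fully completes, so the leftover data is consumed
// then.
bool shouldReadDisable(bool active_request_remote_complete,
                       size_t pending_bytes) {
  if (!active_request_remote_complete) {
    return false; // Still reading the current request: keep reading.
  }
  return pending_bytes > 0; // Pipelined data pending: pause reads.
}
```

In the common non-pipelined case the predicate is never true, so the per-request readDisable(true)/readDisable(false) round trip is skipped entirely.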

Signed-off-by: wbpcode <wbphub@live.com>
@wbpcode
Member Author

wbpcode commented Mar 1, 2022

cc @KBaichoo cc @alyssawilk

@rojkov
Member

rojkov commented Mar 1, 2022

/assign @KBaichoo
/wait on CI

Signed-off-by: wbpcode <wbphub@live.com>
@wbpcode
Member Author

wbpcode commented Mar 2, 2022

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #20148 (comment) was created by @wbpcode.


Contributor

@KBaichoo KBaichoo left a comment


Looks great otherwise, mind posting the performance numbers (compared to the prior implementation) since this is a performance change? Thanks

Comment on lines +1160 to +1164
// Active downstream request remote complete but there is some remaining data in the read buffer
// then try to disable the connection reading. Connection reading will be re-enabled after the
// current active downstream request has completed.
// This ensures that the remaining data can be consumed after the current active downstream
// request has completed.
Contributor


This is a bit hard to grok, perhaps:
Eagerly read disable the connection if the downstream is sending pipelined requests as we serially process them. Reading from the connection will be re-enabled after the active request is completed.

Comment on lines +1148 to +1152
// Active downstream request remote complete but there is some new data comming then try to
// disable the connection reading. Connection reading will be re-enabled after the current
// active downstream request has completed.
// This ensures that the new comming data can be consumed after the current active downstream
// request has completed.
Contributor


This is a bit hard to grok, perhaps:
Read disable the connection if the downstream is sending additional data while we are working on an existing request. Reading from the connection will be re-enabled after the active request is completed.

Signed-off-by: wbpcode <wbphub@live.com>
@wbpcode
Member Author

wbpcode commented Mar 3, 2022

Benchmark with the simplest Envoy configuration and concurrency 1: a multi-process nginx backend returning a 1k response body, and wrk as the client.

wrk -c 64 -d 60s -t 2 -H "host:a.test.com" http://localhost:9090/anything

Result before this PR:

Running 1m test @ http://localhost:9090/anything
  2 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.28ms  808.89us  34.30ms   86.29%
    Req/Sec     9.83k   549.35    10.87k    83.50%
  1174226 requests in 1.00m, 1.27GB read
Requests/sec:  19558.90
Transfer/sec:     21.73MB

Result after this PR:

Running 1m test @ http://localhost:9090/anything
  2 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.13ms  844.64us  19.47ms   89.36%
    Req/Sec    10.32k   618.36    11.67k    79.75%
  1232194 requests in 1.00m, 1.34GB read
Requests/sec:  20531.11
Transfer/sec:     22.81MB

In this simple stress-test scenario, this PR brings about a 4~5% throughput improvement.

@wbpcode
Member Author

wbpcode commented Mar 3, 2022

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #20148 (comment) was created by @wbpcode.


@KBaichoo
Contributor

KBaichoo commented Mar 3, 2022

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #20148 (comment) was created by @KBaichoo.


KBaichoo
KBaichoo previously approved these changes Mar 3, 2022
Contributor

@KBaichoo KBaichoo left a comment


lgtm

/assign @alyssawilk as senior maintainer for merge

@KBaichoo KBaichoo assigned KBaichoo and alyssawilk and unassigned KBaichoo Mar 3, 2022
Contributor

@alyssawilk alyssawilk left a comment


This looks fantastic! I'm inclined to think it's high enough risk to runtime guard, though you're welcome to get a second opinion from Matt or Snow if you would rather not. Definitely let's add a release note to docs/root/version_history/current.rst noting this should be nothing but a perf win, in case there are behavioral changes.
Let's do that and add mock checks in the HTTP/1.1 codec test to regression test, and you'll be good to go!
/wait

@wbpcode
Member Author

wbpcode commented Mar 4, 2022

Got it. A runtime guard sounds reasonable to me.
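
For context, reloadable features in Envoy are checked at runtime (via Runtime::runtimeFeatureEnabled in the real code base). Below is a stand-alone sketch of how such a guard gates the new code path; the flag table and function names are illustrative stand-ins, not Envoy's actual runtime API:

```cpp
#include <map>
#include <string>

// Illustrative stand-in for a runtime feature-flag lookup; the names
// here are hypothetical, not Envoy's actual runtime machinery.
static const std::map<std::string, bool> kRuntimeFlags = {
    {"envoy.reloadable_features.http1_lazy_read_disable", true}};

bool featureEnabled(const std::string& name) {
  auto it = kRuntimeFlags.find(name);
  return it != kRuntimeFlags.end() && it->second;
}

// The guard lets operators revert to the old eager read-disable
// behavior by flipping the flag to false, without a rebuild.
bool useLazyReadDisable() {
  return featureEnabled("envoy.reloadable_features.http1_lazy_read_disable");
}
```

The layered_runtime config later in this thread shows exactly this flag being set to false through a static runtime layer.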

wbpcode added 2 commits March 8, 2022 10:02
Signed-off-by: wbpcode <wbphub@live.com>
Signed-off-by: wbpcode <wbphub@live.com>
Signed-off-by: wbpcode <wbphub@live.com>
@wbpcode
Member Author

wbpcode commented Mar 9, 2022

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Check envoy-presubmit didn't fail.

🐱

Caused by: a #20148 (comment) was created by @wbpcode.


@wbpcode
Member Author

wbpcode commented Mar 9, 2022

cc @alyssawilk Release note added. Tests added. Runtime guard added. 😄

Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@alyssawilk
Contributor

Throwing over to Matt for a last (non-google) look

Member

@mattklein123 mattklein123 left a comment


Nice!

@mattklein123 mattklein123 merged commit b435d3a into envoyproxy:main Mar 10, 2022
JuniorHsu pushed a commit to JuniorHsu/envoy that referenced this pull request Mar 17, 2022
Signed-off-by: wbpcode <wbphub@live.com>
Signed-off-by: kuochunghsu <kuochunghsu@pinterest.com>
@rojkov
Member

rojkov commented Mar 24, 2022

After I merged the latest main I noticed that the performance of the io_uring-backed IoHandle deteriorated significantly; git-bisecting led to this PR. Basically the io_uring-backed IoHandle now gives the same performance as the normal one. The difference is that with the former a CPU core idles more often (CPU load is ~6% lower).

With envoy.reloadable_features.http1_lazy_read_disable set to false I get

rojkov@drozhkov:~/work/nighthawk (main)$ ./bazel-bin/nighthawk_client --duration 30 --rps 100000 --open-loop http://127.0.0.1:10000/
[17:29:50.785794][2078262][I] Starting 1 threads / event loops. Time limit: 30 seconds.
[17:29:50.785852][2078262][I] Global targets: 100 connections and 100000 calls per second.
[17:30:21.335928][2078268][I] Stopping after 30000 ms. Initiated: 3000000 / Completed: 2999899. (Completion rate was 99996.56666895555 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.

benchmark_http_client.latency_2xx (695804 samples)
  min: 0s 001ms 665us | mean: 0s 004ms 196us | max: 0s 088ms 604us | pstdev: 0s 000ms 656us

  Percentile  Count       Value
  0.5         347918      0s 004ms 154us
  0.75        521934      0s 004ms 363us
  0.8         556710      0s 004ms 417us
  0.9         626227      0s 004ms 547us
  0.95        661028      0s 004ms 674us
  0.990625    689282      0s 005ms 881us
  0.99902344  695125      0s 007ms 784us

Queueing and connection setup latency (695904 samples)
  min: 0s 000ms 001us | mean: 0s 000ms 043us | max: 0s 040ms 495us | pstdev: 0s 000ms 342us

  Percentile  Count       Value
  0.5         348046      0s 000ms 001us
  0.75        521951      0s 000ms 001us
  0.8         556746      0s 000ms 002us
  0.9         626314      0s 000ms 002us
  0.95        661119      0s 000ms 004us
  0.990625    689380      0s 001ms 118us
  0.99902344  695226      0s 003ms 347us

Request start to response end (695804 samples)
  min: 0s 001ms 665us | mean: 0s 004ms 196us | max: 0s 088ms 604us | pstdev: 0s 000ms 656us

  Percentile  Count       Value
  0.5         347975      0s 004ms 154us
  0.75        521984      0s 004ms 363us
  0.8         556752      0s 004ms 417us
  0.9         626253      0s 004ms 547us
  0.95        661044      0s 004ms 674us
  0.990625    689281      0s 005ms 881us
  0.99902344  695125      0s 007ms 784us

Response body size in bytes (695804 samples)
  min: 10 | mean: 10.0 | max: 10 | pstdev: 0.0

Response header size in bytes (695804 samples)
  min: 141 | mean: 141.0 | max: 141 | pstdev: 0.0

Initiation to completion (2999899 samples)
  min: 0s 000ms 000us | mean: 0s 001ms 011us | max: 0s 088ms 674us | pstdev: 0s 001ms 824us

  Percentile  Count       Value
  0.5         1500028     0s 000ms 002us
  0.75        2249926     0s 000ms 384us
  0.8         2399925     0s 003ms 856us
  0.9         2699965     0s 004ms 230us
  0.95        2850018     0s 004ms 422us
  0.990625    2971776     0s 005ms 304us
  0.99902344  2996972     0s 007ms 442us

Counter                                 Value       Per second
benchmark.http_2xx                      695804      23193.45
benchmark.pool_overflow                 2304095     76803.11
cluster_manager.cluster_added           1           0.03
default.total_match_count               1           0.03
membership_change                       1           0.03
runtime.load_success                    1           0.03
runtime.override_dir_not_exists         1           0.03
upstream_cx_http1_total                 100         3.33
upstream_cx_overflow                    635383      21179.42
upstream_cx_rx_bytes_total              133594368   4453142.52
upstream_cx_total                       100         3.33
upstream_cx_tx_bytes_total              28532064    951068.14
upstream_rq_pending_overflow            2304095     76803.11
upstream_rq_pending_total               18744       624.80
upstream_rq_total                       695904      23196.78

[17:30:26.456727][2078268][I] Wait for the connection pool drain timed out, proceeding to hard shutdown.
[17:30:26.466043][2078262][I] Done.

With envoy.reloadable_features.http1_lazy_read_disable set to true I get

rojkov@drozhkov:~/work/nighthawk (main)$ ./bazel-bin/nighthawk_client --duration 30 --rps 100000 --open-loop http://127.0.0.1:10000/
[17:15:03.708749][2054509][I] Starting 1 threads / event loops. Time limit: 30 seconds.
[17:15:03.708790][2054509][I] Global targets: 100 connections and 100000 calls per second.
[17:15:34.258833][2054515][I] Stopping after 30000 ms. Initiated: 2999998 / Completed: 2999897. (Completion rate was 99996.54666735732 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.

benchmark_http_client.latency_2xx (580308 samples)
  min: 0s 001ms 622us | mean: 0s 004ms 822us | max: 0s 072ms 667us | pstdev: 0s 001ms 166us

  Percentile  Count       Value
  0.5         290210      0s 004ms 641us
  0.75        435272      0s 005ms 279us
  0.8         464271      0s 005ms 463us
  0.9         522281      0s 006ms 143us
  0.95        551296      0s 006ms 771us
  0.990625    574868      0s 008ms 001us
  0.99902344  579743      0s 009ms 154us

Queueing and connection setup latency (580408 samples)
  min: 0s 000ms 001us | mean: 0s 000ms 052us | max: 0s 052ms 011us | pstdev: 0s 000ms 684us

  Percentile  Count       Value
  0.5         290216      0s 000ms 002us
  0.75        435317      0s 000ms 003us
  0.8         464405      0s 000ms 004us
  0.9         522424      0s 000ms 004us
  0.95        551399      0s 000ms 007us
  0.990625    574967      0s 001ms 379us
  0.99902344  579842      0s 007ms 007us

Request start to response end (580308 samples)
  min: 0s 001ms 622us | mean: 0s 004ms 822us | max: 0s 072ms 667us | pstdev: 0s 001ms 166us

  Percentile  Count       Value
  0.5         290183      0s 004ms 641us
  0.75        435249      0s 005ms 279us
  0.8         464260      0s 005ms 463us
  0.9         522281      0s 006ms 143us
  0.95        551296      0s 006ms 771us
  0.990625    574868      0s 008ms 001us
  0.99902344  579742      0s 009ms 153us

Response body size in bytes (580308 samples)
  min: 10 | mean: 10.0 | max: 10 | pstdev: 0.0

Response header size in bytes (580308 samples)
  min: 141 | mean: 141.0 | max: 141 | pstdev: 0.0

Initiation to completion (2999897 samples)
  min: 0s 000ms 000us | mean: 0s 001ms 100us | max: 0s 072ms 699us | pstdev: 0s 001ms 984us

  Percentile  Count       Value
  0.5         1499962     0s 000ms 064us
  0.75        2249934     0s 000ms 808us
  0.8         2399919     0s 001ms 381us
  0.9         2699938     0s 004ms 674us
  0.95        2849935     0s 005ms 344us
  0.990625    2971773     0s 007ms 007us
  0.99902344  2996969     0s 008ms 717us

Counter                                 Value       Per second
benchmark.http_2xx                      580308      19343.60
benchmark.pool_overflow                 2419589     80652.95
cluster_manager.cluster_added           1           0.03
default.total_match_count               1           0.03
membership_change                       1           0.03
runtime.load_success                    1           0.03
runtime.override_dir_not_exists         1           0.03
upstream_cx_http1_total                 100         3.33
upstream_cx_overflow                    541830      18061.00
upstream_cx_rx_bytes_total              111419136   3713970.45
upstream_cx_total                       100         3.33
upstream_cx_tx_bytes_total              23796728    793224.11
upstream_rq_pending_overflow            2419589     80652.95
upstream_rq_pending_total               13318       443.93
upstream_rq_total                       580408      19346.93

[17:15:39.376828][2054515][I] Wait for the connection pool drain timed out, proceeding to hard shutdown.
[17:15:39.386339][2054509][I] Done.

@wbpcode
Member Author

wbpcode commented Mar 25, 2022

Hi @rojkov, can you provide a more detailed config and also your code base? I can give it a try in my local env. This PR reduces the number of calls to readDisable, although it adds a few extra if-checks, which obviously shouldn't lead to worse performance.

Or can we open a new issue to track and discuss this problem?

@rojkov
Member

rojkov commented Mar 25, 2022

Hi @wbpcode. The new IoHandle is still WIP. This PR doesn't break anything in the main branch; I just wanted to give a heads-up. The culprit may well be in the new IoHandle itself.

I'd appreciate it if you took a look. The code can be obtained from #19082. My current config is:

bootstrap_extensions:
  - name: envoy.extensions.io_socket.io_uring
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.network.socket_interface.v3.IoUringSocketInterface
      use_submission_queue_polling: false
      read_buffer_size: 8192
      io_uring_size: 300
default_socket_interface: "envoy.extensions.network.socket_interface.io_uring"
enable_dispatcher_stats: false
static_resources:
  clusters:
    name: cluster_0
    connect_timeout: 0.25s
    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 1000000000
        max_pending_requests: 1000000000
        max_requests: 1000000000
        max_retries: 1000000000
      - priority: HIGH
        max_connections: 1000000000
        max_pending_requests: 1000000000
        max_requests: 1000000000
        max_retries: 1000000000
    load_assignment:
      cluster_name: cluster_0
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: 127.0.0.1
                    port_value: 4500
  listeners:
    name: listener_0
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
      filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          codec_type: auto
          generate_request_id: false
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: backend
              domains:
              - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: cluster_0
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
              dynamic_stats: false
layered_runtime:
  layers:
  - name: static_layer
    static_layer:
      envoy.reloadable_features.http1_lazy_read_disable: false

For benchmarking I use Nighthawk. The server config:

static_resources:
  listeners:
    # define an origin server on :4500 that always returns "lorem ipsum..."
    - address:
        socket_address:
          address: 0.0.0.0
          port_value: 4500
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                generate_request_id: false
                codec_type: AUTO
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: service
                      domains:
                        - "*"
                http_filters:
                  - name: test-server # before envoy.router because order matters!
                    typed_config:
                      "@type": type.googleapis.com/nighthawk.server.ResponseOptions
                      response_body_size: 10
                      v3_response_headers:
                        - { header: { key: "foo", value: "bar3" } }
                        - {
                            header: { key: "foo", value: "bar2" },
                            append: true,
                          }
                        - { header: { key: "x-nh", value: "1" } }
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                      dynamic_stats: false
admin:
  access_log_path: /tmp/envoy.log
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 8081

I run the client with

./bazel-bin/nighthawk_client --duration 30 --rps 100000 --open-loop http://127.0.0.1:10000/

I also pin Envoy to a single core with

taskset --cpu-list 0 bazel-bin/source/exe/envoy-static -c envoy-config-perf-measurement-io_uring.yaml --concurrency 1 -l warn

ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request Jun 8, 2022
Signed-off-by: wbpcode <wbphub@live.com>