This repository was archived by the owner on Sep 21, 2020. It is now read-only.

GPII-4111: Fix 404 failures #507

Merged
stepanstipl merged 3 commits into gpii-ops:master from stepanstipl:fix-404s
Sep 25, 2019

Conversation

@stepanstipl
Contributor

stepanstipl commented Sep 24, 2019

This PR fixes GPII-4111 - a low rate of 404s we recently observed in our pipeline.

TL;DR

Roughly 0.02% of requests to the database would fail with a 503 error, which in turn results in a 404 error returned by the preferences service. Based on my investigation this happens only for istio-proxy to istio-proxy traffic, and is related to connection pooling (Istio reuses connections for multiple requests). The istio-proxy on the client side will try to send traffic over a connection that has already been closed, and this results in a 503 error: "upstream connect error or disconnect/reset before headers. reset reason: connection termination".

See https://gist.github.com/stepanstipl/51780eb0755ec9d177b6da9df5baaa61 for more details.
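The failure mode described above can be reproduced outside Istio with plain sockets: a server closes a keep-alive connection after the first response, and the client then reuses the "pooled" connection anyway. A minimal sketch (hypothetical, not the Envoy code - just an illustration of the race):

```python
import socket
import threading

def server(listener):
    # Serve exactly one request on a keep-alive connection, then close it,
    # mimicking an upstream that drops pooled connections.
    conn, _ = listener.accept()
    conn.recv(4096)
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n"
                 b"Connection: keep-alive\r\n\r\n")
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
t = threading.Thread(target=server, args=(listener,))
t.start()

client = socket.create_connection(("127.0.0.1", port))
client.settimeout(5)
client.sendall(b"GET / HTTP/1.1\r\nHost: example\r\n\r\n")
first = client.recv(4096)          # first request gets a normal response
t.join()

# Reuse the connection after the peer has already closed it.
try:
    client.sendall(b"GET / HTTP/1.1\r\nHost: example\r\n\r\n")
    second = client.recv(4096)     # connection closed: no response headers
except ConnectionError:
    second = b""

print(first.split(b"\r\n")[0])
print(second)                      # nothing came back the second time
```

In the real setup, the client-side istio-proxy hits exactly this condition and surfaces it as the 503 "reset before headers" error.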

This seems to be caused by a known issue in Envoy - envoyproxy/envoy#6578 - and should be fixed in Istio 1.1.8 and higher.

Fix

Until a version with the fix is available on GKE, this can be worked around by disabling connection pooling, via a DestinationRule with:

trafficPolicy:
  connectionPool:
    http:
      maxRequestsPerConnection: 1
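For context, the fragment above sits under the `spec` of a DestinationRule; a full manifest might look like the sketch below (the resource name and host are placeholders, not the actual names used in this deployment):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: couchdb-no-conn-pool        # placeholder name
spec:
  host: couchdb.default.svc.cluster.local   # placeholder host
  trafficPolicy:
    connectionPool:
      http:
        # Open a fresh upstream connection for every request, so the
        # client-side istio-proxy can never reuse one the server closed.
        maxRequestsPerConnection: 1
```

Applied the usual way with `kubectl apply -f <file>`.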

Performance impact

The performance impact of this seems to be on average ~3 ms higher latency on a -duration=600s -rate=100/s test. Given that this is temporary, and given our current traffic levels and response times, this should not be an issue.

Current:

Latencies     [mean, 50, 95, 99, max]    44.300331ms, 41.563617ms, 83.572071ms, 100.518912ms, 1.000213669s

With connection pooling disabled:

Latencies     [mean, 50, 95, 99, max]    47.106521ms, 43.92287ms, 87.839265ms, 108.368494ms, 191.956161ms
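The ~3 ms figure is just the difference of the two mean latencies reported by the full vegeta runs in the Tests section below:

```python
# Mean latencies (ms) from the two vegeta runs in the Tests section.
with_pooling = 44.300331      # default Istio connection pooling
without_pooling = 47.106521   # maxRequestsPerConnection: 1

delta = without_pooling - with_pooling
print(f"{delta:.2f} ms")      # ≈ 2.81 ms, i.e. roughly 3 ms
```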

Tests

Without the fix applied, a -duration=600s -rate=100/s test against the gpii/_design/views/_view/findPrefsSafeByGpiiKey?key=%22wayne%22&include_docs=true endpoint would usually result in 5-15 503 errors.

bash-5.0# echo "GET $URLSVC" | ./vegeta attack -duration=600s -rate=100/s  | tee results.bin | ./vegeta report^C
bash-5.0# cat results.bin | ./vegeta report
Requests      [total, rate, throughput]  60000, 100.00, 99.98
Duration      [total, attack, wait]      10m0.016821357s, 9m59.989852387s, 26.96897ms
Latencies     [mean, 50, 95, 99, max]    44.300331ms, 41.563617ms, 83.572071ms, 100.518912ms, 1.000213669s
Bytes In      [total, mean]              49314184, 821.90
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    99.99%
Status Codes  [code:count]               200:59992  503:8
Error Set:
503 Service Unavailable

With connection pooling disabled, 10x runs of the same test would finish with 0 errors:

bash-5.0# echo "GET $URLSVC" | ./vegeta attack -duration=600s -rate=100/s  | tee results.bin | ./vegeta report
Requests      [total, rate, throughput]  60000, 100.00, 100.00
Duration      [total, attack, wait]      10m0.025406738s, 9m59.989864885s, 35.541853ms
Latencies     [mean, 50, 95, 99, max]    47.106521ms, 43.92287ms, 87.839265ms, 108.368494ms, 191.956161ms
Bytes In      [total, mean]              49320000, 822.00
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    100.00%
Status Codes  [code:count]               200:60000
Error Set:

I have also run rake test_preferences 10 times and all the tests finished fine, without any 404s or 503s (but based on the investigation this test is not a reliable indicator, as it fails only roughly 1 in 10 times in the dev environment).

@stepanstipl stepanstipl self-assigned this Sep 24, 2019
Contributor

@amatas amatas left a comment


LGTM

@natarajaya
Contributor

Good job on the connection pooling issue investigation!

> Performance impact of this seems to be on average ~ 3ms higher latency on a -duration=600s -rate=100/s test

I don't think this extra connection handling overhead is a problem atm, but let's hope fixed Istio is available on GKE soon!

LGTM
