This repository was archived by the owner on Sep 21, 2020. It is now read-only.
GPII-4111: Fix 404 failures #507
Merged
stepanstipl merged 3 commits into gpii-ops:master on Sep 25, 2019
Good job on the connection pooling investigation!
I don't think this extra connection handling overhead is a problem atm, but let's hope a fixed Istio is available on GKE soon! LGTM
This PR fixes GPII-4111: the small number of 404s we recently observed in our pipeline.
TL;DR
Roughly 0.02% of requests to the database would fail with a 503 error, which in turn results in a 404 error returned by preferences. Based on my investigation, this happens only for istio-proxy to istio-proxy traffic and is related to connection pooling (Istio reuses connections for multiple requests). The istio-proxy on the client side tries to send traffic over a connection that has already been closed, which results in the 503 error "upstream connect error or disconnect/reset before headers. reset reason: connection termination".
See https://gist.github.com/stepanstipl/51780eb0755ec9d177b6da9df5baaa61 for more details.
This seems to be caused by a known issue in Envoy (envoyproxy/envoy#6578) and should be fixed in Istio 1.1.8 and higher.
Fix
Until a version with the fix is available on GKE, this can be worked around by disabling connection pooling via a DestinationRule.
Performance impact
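For reference, the DestinationRule workaround described under Fix might look roughly like the sketch below. This is an assumption, not the manifest from this PR: the resource name, namespace, and host are hypothetical placeholders, and it assumes the workaround relies on Istio's `maxRequestsPerConnection` setting, where a value of 1 means each connection serves a single request and is not reused.

```yaml
# Hypothetical sketch: name, namespace and host are placeholders,
# not the values used in this PR.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: database-no-conn-reuse      # hypothetical name
  namespace: gpii                   # hypothetical namespace
spec:
  host: database.gpii.svc.cluster.local   # hypothetical database service host
  trafficPolicy:
    connectionPool:
      http:
        # One request per connection: the client-side istio-proxy opens a
        # new connection for every request instead of reusing a pooled one,
        # so it never sends on a connection that was already closed.
        maxRequestsPerConnection: 1
```

The trade-off is the extra connection setup on every request, which is the overhead measured in the latency comparison for this change.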
The performance impact of disabling connection pooling seems to be, on average, ~3 ms higher latency on a -duration=600s -rate=100/s test. Given that this is temporary, and given our current traffic levels and response times, this should not be an issue.
Current:
With connection pooling disabled:
Tests
Without the fix applied, a -duration=600s -rate=100/s test against the gpii/_design/views/_view/findPrefsSafeByGpiiKey?key=%22wayne%22&include_docs=true endpoint would usually result in between 5 and 15 503 errors. That is in line with the roughly 0.02% failure rate: the test sends 60,000 requests, and 0.02% of those is 12.
With connection pooling disabled, 10 runs of the same test finished with 0 errors.
I have also run rake test_preferences 10 times and all the tests finished fine, without any 404s or 503s (but based on the investigation, this test is not a reliable indicator, since it fails only roughly 1 in 10 times in the dev environment).