x/build/maintner/godata, x/build/maintner/maintnerd/maintapi: getGoData helper should handle failure without failing the test; otherwise it causes occasional false-positive test failures due to HTTP/2 internal errors while syncing segments #52048
This looks related to #51323, and a problem that needs to be addressed in net/http (possibly via a change to golang.org/x/net/http2). It's not viable for maintner to stop depending on net/http and Google front end servers, so I'm not sure what actions are available in this package.
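For reference, these failures surface in Go as HTTP/2 stream errors of the form "stream error: stream ID N; INTERNAL_ERROR". As a minimal sketch (not code from this issue or from maintner), here is how a caller could recognize that class of error, assuming it surfaces as an http2.StreamError from golang.org/x/net/http2; the helper name isRetryableStreamError is hypothetical:

```go
package main

import (
	"errors"
	"fmt"

	"golang.org/x/net/http2"
)

// isRetryableStreamError reports whether err is an HTTP/2 stream-level
// error with code INTERNAL_ERROR, the kind a remote server may emit
// while restarting, and which a client could reasonably retry.
func isRetryableStreamError(err error) bool {
	var se http2.StreamError
	if errors.As(err, &se) {
		return se.Code == http2.ErrCodeInternal
	}
	return false
}

func main() {
	err := http2.StreamError{StreamID: 3, Code: http2.ErrCodeInternal}
	fmt.Println(isRetryableStreamError(err)) // true
}
```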
#51323 looks related, although the high rate of errors there appears to have started in Oct. 2021, whereas the first of these failures was observed in Oct. 2020 (a year earlier). 🤔 Either way, if we suspect an error in an HTTP/2 implementation (either Go's or Google's), would it make sense for …

(But this probably doesn't need to be a priority either way unless the failure rate increases. Mostly I've filed the issue in case it indicates some deeper systemic problem in conjunction with other …)
Maintner already does retries at a higher level. I don't think we should add code to it to detect and handle an internal error coming from the …

We can possibly add the retry to TestCorpusCheck as a test-only workaround sooner.
(But still nothing to be done here until we make progress on #51323, I think.)
Based on the progress of the investigation in #51323, the conclusion is that, other than the error message not making it easy to see that the error is coming from a remote server for a possibly unavoidable reason (such as backend HTTP servers restarting, and graceful HTTP/2 stream shutdown being out of scope by intentional design), the …

Since maintner already does retries, the problem is only in the test, and adding a retry to the test will be a complete fix for this issue of a flaky test (rather than a workaround, as we previously thought).
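A minimal sketch of what such a test-side retry could look like, assuming the godata.Get API from x/build/maintner/godata; the helper name getGoData mirrors the issue title, but the attempt count and the choice to skip rather than fail are illustrative, not the code that actually landed:

```go
package maintapi

import (
	"context"
	"testing"

	"golang.org/x/build/maintner"
	"golang.org/x/build/maintner/godata"
)

// getGoData fetches the Go corpus for tests, retrying a few times
// so that a transient network error doesn't fail the test.
func getGoData(tb testing.TB) *maintner.Corpus {
	tb.Helper()
	var err error
	for attempt := 1; attempt <= 3; attempt++ {
		var corpus *maintner.Corpus
		if corpus, err = godata.Get(context.Background()); err == nil {
			return corpus
		}
		tb.Logf("getting corpus (attempt %d): %v", attempt, err)
	}
	// A persistent network failure still isn't actionable for the
	// code under test, so skip instead of reporting a false positive.
	tb.Skipf("skipping: couldn't get corpus after retries: %v", err)
	return nil
}
```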
Change https://go.dev/cl/414174 mentions this issue: |
I initially wanted to update this issue with the latest status, and then saw that it'd be quick to send a trivial CL to close it. All that was needed was a trivial change to the following flaky-test-generating logic in the two copies of the getGoData helper:

```diff
 corpusCache, err = Get(context.Background())
 if err != nil {
-	tb.Fatalf("getting corpus: %v", err)
+	// Occasionally getting non-zero network errors
+	// while downloading 2+ GB of data from the internet
+	// is NOT unexpected!
+	//
+	// Doing t.Fatalf here means we're producing a non-actionable
+	// test failure, and while we haven't implemented go.dev/issue/19177 yet,
+	// any non-zero frequency false-positive flaky test failures
+	// may create additional manual triage work...
+	//
+	// So just handle the error from godata.Get by
+	// trying again or skipping the test—either way it
+	// would likely be better than t.Fatalf given the constraints above.
 }
```

While doing that, I tried to confirm my claim that our maintner-using programs handle retries (which I knew to be true without looking earlier, since they haven't needed manual intervention to keep running in a very long time). It turns out they all retry not just by reloading the corpus, but by fatally exiting the program and having k8s restart the pod. That is functional but feels wasteful (computationally). I didn't want to add retry loops to the many …

So that's the story of how I ended up with CL 414174. It felt easier to just detect possibly retryable network problems and retry them in …
On the internet, servers generally don't go out of their way to ensure graceful stream shutdowns, preferring to instead rely on the clients to receive a network error and decide to repeat the same request.

netMutSource implements the MutationSource interface, whose GetMutations method should return a final MutationStreamEvent event with a non-nil Err only when the mutation source encounters a fatal error. Most callers correctly handle such fatal errors with a log.Fatalln or equivalent. In practice, the netMutSource.getNewSegments method does a lot of work to ensure it doesn't give up unnecessarily soon, so most errors are really fatal. However, it was attempting to rely on HTTP requests to a remote server to always succeed on the first try. That is a mistake.

This change augments the getServerSegments and syncSeg methods to annotate errors that they believe might be due to transient network problems, and adds a non-zero number of retries to getNewSegments. That is a good layer to retry at since it allows throwing away the least amount of successful progress made thus far, much less than completely restarting a maintner-based program and having it reload the corpus.

While here, start relying on the maintnerd server having support for long-polling, since it was added in 2017 (CL 42871), which simplifies code by allowing the top-level loop and its nesting to be deleted.

Updates golang/go#52048.

Change-Id: I8148109f97365697beab5575358069adfa095f0e
Reviewed-on: https://go-review.googlesource.com/c/build/+/414174
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Alex Rakoczy <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Dmitri Shuralyov <[email protected]>
Auto-Submit: Dmitri Shuralyov <[email protected]>
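The retry layer the commit message describes could then be a small bounded loop. This is a sketch of the pattern only, with illustrative names (retryTransient, maxTries) rather than the actual getNewSegments code; the isTransient predicate could, for instance, check for a marker like the fetchError sketched above:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// retryTransient runs fn up to maxTries times, retrying only when
// isTransient reports that the returned error may be transient.
// A nil result means fn eventually succeeded.
func retryTransient(ctx context.Context, maxTries int, isTransient func(error) bool, fn func(context.Context) error) error {
	var err error
	for try := 1; try <= maxTries; try++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		if !isTransient(err) {
			return err // genuinely fatal; don't retry
		}
		log.Printf("possibly transient error (try %d of %d): %v", try, maxTries, err)
	}
	return fmt.Errorf("gave up after %d tries: %w", maxTries, err)
}

func main() {
	n := 0
	err := retryTransient(context.Background(), 5,
		func(error) bool { return true }, // treat all errors as transient for this demo
		func(ctx context.Context) error {
			if n++; n < 3 {
				return errors.New("simulated network hiccup")
			}
			return nil
		})
	log.Println("result:", err) // nil: succeeded on the third try
}
```

Retrying at this layer keeps the successfully synced segments, whereas the previous approach of exiting and letting k8s restart the pod discards all in-memory progress and reloads the corpus from scratch.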
I believe CL 414174 is a complete fix and there's nothing more to do here, so tentatively closing, but please reopen if there are further occurrences.
```
greplogs --dashboard -md -l -e 'getting corpus: syncing segment \d+: stream error: stream ID \d+; INTERNAL_ERROR'

2022-03-29T15:43:06-e96d8cf-ae9ce82/windows-amd64-longtest
2021-04-08T21:58:35-83a8520-d67e739/windows-amd64-longtest
2021-04-08T19:58:50-83a8520-bb76193/windows-amd64-longtest
2020-10-19T18:36:14-2476803-06839e3/linux-amd64-longtest
```
(CC @golang/release)