x/build/maintner/godata, x/build/maintner/maintnerd/maintapi: getGoData helper should handle failure without failing the test; otherwise it causes occasional false-positive test failures due to HTTP/2 internal errors while syncing segments #52048
This looks related to #51323, and a problem that needs to be addressed in net/http (possibly via a change to golang.org/x/net/http2). It's not viable for maintner to stop depending on net/http and Google front end servers, so I'm not sure what actions are available in this package.
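For reference, these failures surface in Go as HTTP/2 stream errors of the form "stream error: stream ID N; INTERNAL_ERROR". As a minimal sketch (not code from this issue or from maintner), here is how a caller could recognize that class of error, assuming it surfaces as an http2.StreamError from golang.org/x/net/http2; the helper name isRetryableStreamError is hypothetical:

```go
package main

import (
	"errors"
	"fmt"

	"golang.org/x/net/http2"
)

// isRetryableStreamError reports whether err is an HTTP/2 stream-level
// error with code INTERNAL_ERROR, the kind a remote server may emit
// while restarting, and which a client could reasonably retry.
func isRetryableStreamError(err error) bool {
	var se http2.StreamError
	if errors.As(err, &se) {
		return se.Code == http2.ErrCodeInternal
	}
	return false
}

func main() {
	err := http2.StreamError{StreamID: 3, Code: http2.ErrCodeInternal}
	fmt.Println(isRetryableStreamError(err)) // true
}
```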
#51323 looks related, although the high rate of errors there appears to have started in Oct. 2021, whereas the first of these failures was observed in Oct. 2020 (a year earlier). 🤔 Either way, if we suspect an error in an HTTP/2 implementation (either Go's or Google's), would it make sense for …

(But this probably doesn't need to be a priority either way unless the failure rate increases. Mostly I've filed the issue in case it indicates some deeper systemic problem in conjunction with other …)
Maintner already does retries at a higher level. I don't think we should add code to it to detect and handle an internal error coming from the …

We can possibly add the retry to TestCorpusCheck as a test-only workaround sooner.
(But still nothing to be done here until we make progress on #51323, I think.)
Based on the progress of the investigation in #51323, the conclusion is that, other than the error message not making it easy to see that the error is coming from a remote server for a possibly unavoidable reason (such as backend HTTP servers restarting, and graceful HTTP/2 stream shutdown being out of scope by intentional design), the …

Since maintner already does retries, the problem is only in the test, and adding a retry to the test will be a complete fix for this issue of a flaky test (rather than a workaround, as we previously thought).
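A minimal sketch of what such a test-side retry could look like, assuming the godata.Get API from x/build/maintner/godata; the helper name getGoData mirrors the issue title, but the attempt count and the choice to skip rather than fail are illustrative, not the code that actually landed:

```go
package maintapi

import (
	"context"
	"testing"

	"golang.org/x/build/maintner"
	"golang.org/x/build/maintner/godata"
)

// getGoData fetches the Go corpus for tests, retrying a few times
// so that a transient network error doesn't fail the test.
func getGoData(tb testing.TB) *maintner.Corpus {
	tb.Helper()
	var err error
	for attempt := 1; attempt <= 3; attempt++ {
		var corpus *maintner.Corpus
		if corpus, err = godata.Get(context.Background()); err == nil {
			return corpus
		}
		tb.Logf("getting corpus (attempt %d): %v", attempt, err)
	}
	// A persistent network failure still isn't actionable for the
	// code under test, so skip instead of reporting a false positive.
	tb.Skipf("skipping: couldn't get corpus after retries: %v", err)
	return nil
}
```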
Change https://go.dev/cl/414174 mentions this issue: |
I initially wanted to update this issue with the latest status, and then saw that it'd be quick to send a trivial CL to close it. All that was needed was a trivial change to the following flaky-test-generating logic in the two copies of the getGoData helper:

```diff
 corpusCache, err = Get(context.Background())
 if err != nil {
-	tb.Fatalf("getting corpus: %v", err)
+	// Occasionally getting non-zero network errors
+	// while downloading 2+ GB of data from the internet
+	// is NOT unexpected!
+	//
+	// Doing t.Fatalf here means we're producing a non-actionable
+	// test failure, and while we haven't implemented go.dev/issue/19177 yet,
+	// any non-zero frequency false-positive flaky test failures
+	// may create additional manual triage work...
+	//
+	// So just handle the error from godata.Get by
+	// trying again or skipping the test—either way it
+	// would likely be better than t.Fatalf given the constraints above.
 }
```

While doing that, I tried to confirm my claim that our maintner-using programs handle retries (which I knew to be true without looking earlier, since they haven't needed manual intervention to keep running in a very long time). It turns out they all retry not just by reloading the corpus, but by fatally exiting the program and having k8s restart the pod. That is functional but feels wasteful (computationally). I didn't want to add retry loops to the many …

So that's the story of how I ended up with CL 414174. It felt easier to just detect possibly retryable network problems and retry them in …
On the internet, servers generally don't go out of their way to ensure graceful stream shutdowns, preferring to instead rely on the clients to receive a network error and decide to repeat the same request.

netMutSource implements the MutationSource interface, whose GetMutations method should return a final MutationStreamEvent event with a non-nil Err only when the mutation source encounters a fatal error. Most callers correctly handle such fatal errors with a log.Fatalln or equivalent. In practice, the netMutSource.getNewSegments method does a lot of work to ensure it doesn't give up unnecessarily soon, so most errors are really fatal. However, it was attempting to rely on HTTP requests to a remote server to always succeed on the first try. That is a mistake.

This change augments the getServerSegments and syncSeg methods to annotate errors that they believe might be due to transient network problems, and adds a non-zero number of retries to getNewSegments. That is a good layer to retry at since it allows throwing away the least amount of successful progress made thus far, much less than completely restarting a maintner-based program and having it reload the corpus.

While here, start relying on the maintnerd server having support for long-polling, since it was added in 2017 (CL 42871), which simplifies code by allowing the top-level loop and its nesting to be deleted.

Updates golang/go#52048.

Change-Id: I8148109f97365697beab5575358069adfa095f0e
Reviewed-on: https://go-review.googlesource.com/c/build/+/414174
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Alex Rakoczy <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Dmitri Shuralyov <[email protected]>
Auto-Submit: Dmitri Shuralyov <[email protected]>
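The retry layer the commit message describes could then be a small bounded loop. This is a sketch of the pattern only, with illustrative names (retryTransient, maxTries) rather than the actual getNewSegments code; the isTransient predicate could, for instance, check for a marker like the fetchError sketched above:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// retryTransient runs fn up to maxTries times, retrying only when
// isTransient reports that the returned error may be transient.
// A nil result means fn eventually succeeded.
func retryTransient(ctx context.Context, maxTries int, isTransient func(error) bool, fn func(context.Context) error) error {
	var err error
	for try := 1; try <= maxTries; try++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		if !isTransient(err) {
			return err // genuinely fatal; don't retry
		}
		log.Printf("possibly transient error (try %d of %d): %v", try, maxTries, err)
	}
	return fmt.Errorf("gave up after %d tries: %w", maxTries, err)
}

func main() {
	n := 0
	err := retryTransient(context.Background(), 5,
		func(error) bool { return true }, // treat all errors as transient for this demo
		func(ctx context.Context) error {
			if n++; n < 3 {
				return errors.New("simulated network hiccup")
			}
			return nil
		})
	log.Println("result:", err) // nil: succeeded on the third try
}
```

Retrying at this layer keeps the successfully synced segments, whereas the previous approach of exiting and letting k8s restart the pod discards all in-memory progress and reloads the corpus from scratch.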
I believe CL 414174 is a complete fix and there's nothing more to do here, so tentatively closing, but please reopen if there are further occurrences.
```
greplogs --dashboard -md -l -e 'getting corpus: syncing segment \d+: stream error: stream ID \d+; INTERNAL_ERROR'

2022-03-29T15:43:06-e96d8cf-ae9ce82/windows-amd64-longtest
2021-04-08T21:58:35-83a8520-d67e739/windows-amd64-longtest
2021-04-08T19:58:50-83a8520-bb76193/windows-amd64-longtest
2020-10-19T18:36:14-2476803-06839e3/linux-amd64-longtest
```
(CC @golang/release)