Remote cache poisoned? #1174
@coeuvre Is this something you want to debug? 🤔 @alexeagle The tricky thing here is that this is happening on the RBE platform and our usual approach to deal with cache poisoning ( continuous-integration/buildkite/bazelci.py Line 1538 in fc87a04
@alexeagle I checked with the RBE team, this doesn't look like something on the backend side. The blob is apparently there and fine. 👀 We're suspecting a Bazel bug, although not sure yet what's going on. Could you bump your |
Sure, bazel-contrib/rules_nodejs#2761 thanks for investigating!! (maybe that will be enough to change the input key for that action in the cache such that this isn't reproducible, but that's okay with me I'm not trying to discover Bazel bugs today, just keep our CI green) |
Hmm get some new errors on RBE now
maybe related to a warning I see locally
will try pinning like it says |
Sounds like you may need to update the bazel-toolchain repo? FYI |
thanks @rubensf I was never able to figure this out in prior attempts. We've been using the bazel-0.28.0.bazelrc this whole time ever since setting up our RBE test job and managing to get away with that. I tried following the instructions,
maybe there's still some version skew in that config |
I have only used these configs for C++ so far, so I may be missing something 😅 I think you might just want to remove those Also, I'm not sure misleading -- see comment in bazelbuild/bazel-toolchains#926. I'm not sure if |
@alexeagle I think that error referring to More context: bazelbuild/bazel#13502 |
thanks @philwo I had seen that discussion but didn't put it together that I'd need that 4.1.0 of the toolchains repo to fix this. Now I get
obviously darwin is an os, not a cpu, so some wiring is crossed somewhere? The only thing in rules_nodejs that I can find which is related is our platform definitions in toolchain/node/BUILD.bazel such as
sorry to be stuck here :( |
For Windows, you'll need to also run the For MacOS, there's no RBE support for MacOS so you'll have to deactivate remote support for that too (basically tweak the .bazelrc configurations to not include --remote on Windows/MacOS builds). Note that you can still use remote caching, (basically setting |
@alexeagle Can you send a link to a Buildkite log where that error message is visible? I don't think it should happen when the job runs on the |
After comparing the corrupted blob with the correct one, I think this is caused by a known issue bazelbuild/bazel#12927 (comment). Bazel 4.1.0 and HEAD contain the fix. |
FYI, for the corrupted blob downloads, I expect you can mitigate in any version of bazel by setting --remote_timeout=3600 (or some suitably long period). Chi linked the relevant bug - appears to be a race in gRPC at certain versions, which is triggered by the RPC timeout/retry flow for RPCs that exceed a certain duration. I'm guessing your build is download-bottlenecked at that phase, causing the 85MB file download to take >60s (the default), causing the race to be hit. |
Thank you Eric! I assume the flag is safe to set even with future versions of Bazel? We could add it to our list of remote execution flags here, then it would apply to all jobs running on Bazel CI that use our continuous-integration/buildkite/bazelci.py Lines 1669 to 1675 in 54595a2
Should be safe, yep - it's a default we recommend for RBE and most of our users have had set since the beginning. I'll note that it may not be safe with other remote execution services than RBE...some have races that can lead to hung RPCs and thus hung builds that a smaller remote timeout would have papered over. But if you don't use other services, or don't set this flag on them, you should be fine. IIRC some future version of bazel will make this unnecessary by changing the logic for how remote timeouts apply to long-running RPCs like execution and download, but I don't think that has landed yet, so still useful at HEAD. |
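To make the suggestion above concrete, here is a minimal sketch of how a CI script might append the timeout only for jobs that run against RBE. The helper name `with_remote_timeout` and the `uses_rbe` parameter are hypothetical and are not taken from bazelci.py; only the `--remote_timeout` flag and the 3600 value come from this thread.

```python
from typing import List

# Hypothetical constant; 3600 is the value suggested in this thread.
REMOTE_TIMEOUT_SECONDS = 3600


def with_remote_timeout(flags: List[str], uses_rbe: bool) -> List[str]:
    """Return a copy of `flags` with --remote_timeout appended for RBE jobs."""
    # Leave the flags untouched for other remote execution services, where a
    # long timeout could let a hung RPC stall the build.
    if not uses_rbe:
        return list(flags)
    # Do not override a timeout that the job already sets explicitly.
    if any(f.startswith("--remote_timeout") for f in flags):
        return list(flags)
    return list(flags) + [f"--remote_timeout={REMOTE_TIMEOUT_SECONDS}"]
```

Gating on `uses_rbe` reflects the caveat that a long timeout may hide hung RPCs on other services; whether that distinction is available in a real CI setup is an assumption.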
The bug is fixed in Bazel >= 4.1.0, but it doesn't hurt. Context: #1174
Yes, this work is on my list. |
I think this was resolved, since rules_nodejs was able to upgrade bazel versions with only a small amount of RBE flags in bazel-contrib/rules_nodejs#2792 |
Hey @fweikert
rules_nodejs has had almost all our RBE builds red for the last few days. On PRs I'm pressing the Retry button multiple times to get them green, and our default branch has been red for a day.
https://buildkite.com/bazel/rules-nodejs-nodejs/builds/9591#bdced136-a515-4099-a52b-5984d9575a61 is an example
failure:
It's always this same entry causing the problem. Seems like something has gotten into the remote cache that shouldn't be there. Is it easy to just blow away the storage for that cache instance? (No idea if it's shared-tenant with other rulesets, or if there are other reasons to be careful there.)
Thanks!!