!! AllowCompaction not found for materialized view with id ['User(11)'] #28046
Comments
Still haven't managed to repro this, but here is one insight: In the services.log of the linked buildkite run, we see that at the end envd doesn't manage to connect to the unmanaged replica anymore. This repeats over and over:
I don't know why that happens, but if the controller can't connect to the replica, it also can't send commands to it.
Edit: Maybe I'm mistaken. Reading the code of the replica-isolation test and the run.log, the test runs a bunch of queries against that cluster, and they all succeed, which means the replica must be successfully connected.
Edit 2: Ah, I see. The test creates two replicas. One of them connects successfully and is able to respond to the validation queries, but the other one doesn't. The log check then inspects the logs of both replicas and fails because it doesn't find |
Is this the right command to use though? It seems like |
I managed to reproduce this now. It ended up in a situation that is consistent with what I observed above:
So replica 1 crashed, while replica 2 was still able to respond to queries. This is from the logs of replica 1:
We see that both processes tried to connect to one another, process 0 reports success, and then both panic because the connection was somehow broken. It's not clear why it was broken though. |
It seems like this issue has not reoccurred since we fixed the log collection. Best guess is still that this is caused by some Docker network race condition. If so, we can either figure out and fix the race condition or automatically restart the |
Happened yesterday evening here: https://buildkite.com/materialize/test/builds/85812#0190a8d0-4e87-4715-8438-d8a56696bf67 |
Well, we have all the logs this time:
|
Another occurrence: https://buildkite.com/materialize/test/builds/85872#0190b5ec-cbc0-4a36-8aa5-683a7d5f1bff |
The logs from this last one look different. It's again the first of the two replicas that fails (as it has been in all occurrences I've observed so far, which might tell us something), but this time only one of the two processes halted:
|
Theory about what might be happening: #27896 introduced, as a side effect, the behavior that the storage controller always connects to new replicas as they are created and disconnects from the previous ones in the process. So when a cluster is created with two replicas, the controller connects to the first one, then immediately disconnects and connects to the second one.
On clusterd, when the storage controller disconnects, I believe we drop the storage Timely runtime too. I also believe that dropping the two halves of the storage runtime in the two processes at different times gives the longer-lived Timely runtime the opportunity to try to read from its intra-process channels, notice the absence of the other half, and panic.
One fix is to avoid reconnecting the storage controller if it is already connected to a different replica in the same cluster. |
What version of Materialize are you using?
main
What is the issue?
Seen in https://buildkite.com/materialize/test/builds/85184#01907b01-0715-49df-bf71-39ec90518163
I believe this is caused by #27922
Reproduces locally after a while with:
while true; do bin/mzcompose --find replica-isolation down && bin/mzcompose --find replica-isolation run default restart-environmentd || break; done
Even after waiting further, the AllowCompaction never appears in the logs:
bin/mzcompose --find replica-isolation logs | grep "AllowCompaction { id: User"
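For scripted reproduction, the loop and the log check above could be combined into a bounded retry. This is only a minimal sketch and not part of the repository: the function names `has_allow_compaction` and `repro_until_failure` and the `MAX_ATTEMPTS` variable are hypothetical, and only the `bin/mzcompose` invocations and the grep pattern are taken from the commands above.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the Materialize repository.

# Succeeds if the log text on stdin contains an AllowCompaction entry for a
# user object, using the same pattern as the grep command above.
has_allow_compaction() {
    grep -q "AllowCompaction { id: User"
}

# Runs the workflow from this issue up to MAX_ATTEMPTS times (default 50, an
# arbitrary assumption) and stops at the first failing run, leaving the
# composition up so its logs can still be inspected.
repro_until_failure() {
    attempt=1
    while [ "$attempt" -le "${MAX_ATTEMPTS:-50}" ]; do
        echo "attempt $attempt"
        # Tear down the previous composition; give up if that itself fails.
        bin/mzcompose --find replica-isolation down || return 2
        if ! bin/mzcompose --find replica-isolation run default restart-environmentd; then
            echo "failed on attempt $attempt"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "no failure within ${MAX_ATTEMPTS:-50} attempts"
    return 1
}
```

After `repro_until_failure` reports a failure, `bin/mzcompose --find replica-isolation logs | has_allow_compaction || echo "AllowCompaction missing"` runs the same check as the grep above without printing the matching lines. Bounding the loop is a design choice so that an unattended run terminates even if the failure never shows up.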
ci-regexp: builtins.AssertionError:
ci-apply-to: Replica isolation