
!! AllowCompaction not found for materialized view with id ['User(11)'] #28046

Open
def- opened this issue Jul 4, 2024 · 9 comments
Labels: C-bug (Category: something is broken), ci-flake

Comments


def- commented Jul 4, 2024

What version of Materialize are you using?

main

What is the issue?

Seen in https://buildkite.com/materialize/test/builds/85184#01907b01-0715-49df-bf71-39ec90518163

!! AllowCompaction not found for materialized view with id ['User(11)']
!! AllowCompaction not found for introspection with ids ['System(856)', 'System(857)', 'System(858)', 'System(859)', 'System(860)', 'System(861)', 'System(862)', 'System(863)', 'System(864)', 'System(865)', 'System(866)', 'System(867)', 'System(868)', 'System(869)', 'System(870)', 'System(871)', 'System(872)', 'System(873)', 'System(874)', 'System(875)', 'System(876)', 'System(877)', 'System(878)', 'System(879)', 'System(880)', 'System(881)', 'System(882)', 'System(883)']
!! AllowCompaction not found for index arrangement with id ['User(3)']
!! AllowCompaction not found for introspection with ids ['System(856)', 'System(857)', 'System(858)', 'System(859)', 'System(860)', 'System(861)', 'System(862)', 'System(863)', 'System(864)', 'System(865)', 'System(866)', 'System(867)', 'System(868)', 'System(869)', 'System(870)', 'System(871)', 'System(872)', 'System(873)', 'System(874)', 'System(875)', 'System(876)', 'System(877)', 'System(878)', 'System(879)', 'System(880)', 'System(881)', 'System(882)', 'System(883)']

I believe this is caused by #27922

Reproduces locally after a while with: while true; do bin/mzcompose --find replica-isolation down && bin/mzcompose --find replica-isolation run default restart-environmentd || break; done.
Even after waiting further, the AllowCompaction never appears in logs: bin/mzcompose --find replica-isolation logs | grep "AllowCompaction { id: User"
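
For completeness, the repro loop above can also be scripted. This is only a minimal sketch of the same loop, assuming it runs from the repository root (where bin/mzcompose lives):

```python
import subprocess

# Sketch of the shell repro loop above: tear the composition down and re-run
# the restart-environmentd workflow until one of the two steps fails.
while True:
    down = subprocess.run(["bin/mzcompose", "--find", "replica-isolation", "down"])
    if down.returncode != 0:
        break
    run = subprocess.run(
        ["bin/mzcompose", "--find", "replica-isolation",
         "run", "default", "restart-environmentd"]
    )
    if run.returncode != 0:
        # The workflow failed, i.e. the flake reproduced; inspect the logs next.
        break
```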

ci-regexp: builtins.AssertionError:
ci-apply-to: Replica isolation

def- added the C-bug (Category: something is broken) and ci-flake labels on Jul 4, 2024

teskje commented Jul 4, 2024

Still haven't managed to repro this, but here is one insight: In the services.log of the linked buildkite run, we see that at the end envd doesn't manage to connect to the unmanaged replica anymore. This repeats over and over:

replica-isolation-materialized-1     | environmentd: 2024-07-04T00:27:56.859309Z  INFO mz_compute_client::controller::replica: error connecting to replica, retrying in 1s: transport error replica=User(2)

I don't know why that happens, but if the controller can't connect to the replica, it of course also can't send commands to it.

Edit: Maybe I'm mistaken. Reading the code of the replica-isolation test and the run.log, I see that the test runs a bunch of queries against that cluster and they all succeed, which means the replica must be successfully connected.

Edit 2: Ah, I see. The test creates two replicas; one of them connects successfully and is able to respond to the validation queries, but the other one doesn't. The log check then inspects the logs of both replicas and fails because it doesn't find AllowCompaction commands in the one that didn't connect successfully.
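
For illustration only (not the actual test code; the helper replica_logs and its structure are hypothetical), a check of this shape would fail with builtins.AssertionError exactly when one replica never received its compute commands:

```python
# Illustrative sketch, not the real replica-isolation test code.
# `replica_logs` maps a replica/container name to its captured log text.
def check_allow_compaction(replica_logs: dict[str, str], ids: list[str]) -> None:
    for replica, logs in replica_logs.items():
        for obj_id in ids:
            needle = f"AllowCompaction {{ id: {obj_id}"
            # A replica that never connected has no compute commands in its
            # logs, so this assertion fails (builtins.AssertionError).
            assert needle in logs, (
                f"!! AllowCompaction not found for id {obj_id} in {replica}"
            )
```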


teskje commented Jul 5, 2024

Even after waiting further, the AllowCompaction never appears in logs: bin/mzcompose --find replica-isolation logs | grep "AllowCompaction { id: User"

Is this the right command to use though? It seems like bin/mzcompose --find replica-isolation logs doesn't include any logs from the unmanaged replicas.


teskje commented Jul 5, 2024

I managed to reproduce this now. It ended up in a situation that is consistent with what I observed above:

  • replica-isolation-clusterd_1_1-1 and replica-isolation-clusterd_1_2-1 exited with status (2)
  • replica-isolation-clusterd_2_1-1 and replica-isolation-clusterd_2_2-1 still running

So replica 1 crashed, while replica 2 was still able to respond to queries.

This is from the logs of replica 1:

  • replica-isolation-clusterd_1_1-1
    2024-07-05T05:11:39.569498Z  INFO mz_cluster::communication: initializing network for timely instance, with 2 processes for epoch number (3, 1) process=0 addresses=["clusterd_1_1:2103", "clusterd_1_2:2103"]
    2024-07-05T05:11:39.569760Z  INFO mz_cluster::communication: awaiting connection from peer process=0
    2024-07-05T05:11:39.572382Z  INFO mz_cluster::communication: accepted connection from peer process=0
    2024-07-05T05:11:39.572661Z  INFO mz_cluster::communication: successfully initialized network process=0
    2024-07-05T05:11:39.575169Z  WARN mz_timely_util::panic: halting process: timely communication error: reading data: Connection reset by peer (os error 104)
    
  • replica-isolation-clusterd_1_2-1
    2024-07-05T05:11:39.570901Z  INFO mz_cluster::communication: initializing network for timely instance, with 2 processes for epoch number (3, 1) process=1 addresses=["clusterd_1_1:2103", "clusterd_1_2:2103"]
    2024-07-05T05:11:39.571228Z  INFO mz_cluster::communication: Attempting to connect to process at clusterd_1_1:2103 process=1
    2024-07-05T05:11:39.571248Z  INFO mz_cluster::communication: awaiting connection from peer process=1
    2024-07-05T05:11:39.595786Z  WARN mz_timely_util::panic: halting process: timely communication error: reading data: socket closed
    

We see that both processes tried to connect to one another, process 0 reports success, and then both panic because the connection was somehow broken. It's not clear why it was broken though.


teskje commented Jul 12, 2024

It seems like this issue has not reoccurred since we fixed the log collection. My best guess is still that this is caused by some Docker network race condition, in which case we can either figure out and fix the race condition or automatically restart the clusterd containers when they halt.
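
For the second option, the underlying knob is Docker Compose's restart policy on the clusterd containers. The snippet below is only a sketch of what the generated compose entry would need to contain (service name, image tag, and command are placeholders; how this is threaded through mzcompose's service definitions is not shown):

```python
# Sketch only: shape of a docker-compose service entry for one clusterd
# container with a restart policy. All values except "restart" are placeholders.
clusterd_1_1 = {
    "image": "materialize/clusterd:latest",  # placeholder tag
    "command": ["clusterd"],                 # abridged
    # Restart the container whenever the process exits with a non-zero status,
    # e.g. after the "timely communication error" halt seen above.
    "restart": "on-failure",
}
```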


ggevay commented Jul 13, 2024


teskje commented Jul 13, 2024

Well, we have all the logs this time:

  • clusterd process 0:
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:16.491291Z  INFO clusterd: booting os.os_type=Ubuntu os.version=22.04 os.bitness=64-bit build.version="0.109.0-dev" build.sha="367e8fb2c30524db8b2a3d5b18d9525fb5483cfe" build.time="2024-07-12T19:23:54Z" cpus.logical=4 cpus.physical=4 cpu0.brand="" cpu0.frequency=2100 memory.total=8145408000 memory.used=3007365120 memory.limit=<unknown> swap.total=0 swap.used=0 swap.limit=<unknown> tracing.max_level=info
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:16.491516Z  INFO clusterd: serving internal HTTP server on 0.0.0.0:6878
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:16.494676Z  INFO clusterd: listening for storage controller connections on 0.0.0.0:2100
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:16.494713Z  INFO mz_service::grpc: Starting to listen on 0.0.0.0:2100
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:16.494803Z  INFO clusterd: listening for compute controller connections on 0.0.0.0:2101
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:16.494815Z  INFO mz_service::grpc: Starting to listen on 0.0.0.0:2101
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:25.219760Z  INFO mz_service::grpc: GrpcServer: remote client connected
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:25.220971Z  INFO mz_cluster::server: Building timely container with config TimelyConfig { workers: 1, process: 0, addresses: ["clusterd_1_1:2103", "clusterd_1_2:2103"], arrangement_exert_proportionality: 1337 }
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:25.221034Z  INFO mz_cluster::communication: initializing network for timely instance, with 2 processes for epoch number (2, 1) process=0addresses=["clusterd_1_1:2103", "clusterd_1_2:2103"]
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:25.221437Z  INFO mz_cluster::communication: awaiting connection from peer process=0
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:25.221605Z  INFO mz_service::grpc: GrpcServer: remote client connected
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:25.223009Z  INFO mz_cluster::communication: accepted connection from peer process=0
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:25.223338Z  INFO mz_cluster::communication: successfully initialized network process=0
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:25.223880Z  INFO mz_cluster::server: Building timely container with config TimelyConfig { workers: 1, process: 0, addresses: ["clusterd_1_1:2102", "clusterd_1_2:2102"], arrangement_exert_proportionality: 16 }
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:25.224032Z  INFO mz_cluster::communication: initializing network for timely instance, with 2 processes for epoch number (2, 1) process=0addresses=["clusterd_1_1:2102", "clusterd_1_2:2102"]
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:25.225326Z  INFO mz_cluster::communication: awaiting connection from peer process=0
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:25.225777Z  INFO mz_cluster::communication: accepted connection from peer process=0
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:25.226512Z  INFO mz_cluster::communication: successfully initialized network process=0
replica-isolation-clusterd_1_1-1  | 2024-07-12T21:56:25.226747Z  WARN mz_timely_util::panic: halting process: timely communication error: reading data: Connection reset by peer (os error 104)
  • clusterd process 1:
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:16.723042Z  INFO clusterd: booting os.os_type=Ubuntu os.version=22.04 os.bitness=64-bit build.version="0.109.0-dev" build.sha="367e8fb2c30524db8b2a3d5b18d9525fb5483cfe" build.time="2024-07-12T19:23:54Z" cpus.logical=4 cpus.physical=4 cpu0.brand="" cpu0.frequency=2100 memory.total=8145408000 memory.used=2977255424 memory.limit=<unknown> swap.total=0 swap.used=0 swap.limit=<unknown> tracing.max_level=info
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:16.723239Z  INFO clusterd: serving internal HTTP server on 0.0.0.0:6878
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:16.726467Z  INFO clusterd: listening for storage controller connections on 0.0.0.0:2100
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:16.726506Z  INFO mz_service::grpc: Starting to listen on 0.0.0.0:2100
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:16.726569Z  INFO clusterd: listening for compute controller connections on 0.0.0.0:2101
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:16.726611Z  INFO mz_service::grpc: Starting to listen on 0.0.0.0:2101
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:25.219760Z  INFO mz_service::grpc: GrpcServer: remote client connected
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:25.221015Z  INFO mz_cluster::server: Building timely container with config TimelyConfig { workers: 1, process: 1, addresses: ["clusterd_1_1:2103", "clusterd_1_2:2103"], arrangement_exert_proportionality: 1337 }
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:25.221106Z  INFO mz_cluster::communication: initializing network for timely instance, with 2 processes for epoch number (2, 1) process=1addresses=["clusterd_1_1:2103", "clusterd_1_2:2103"]
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:25.221438Z  INFO mz_cluster::communication: Attempting to connect to process at clusterd_1_1:2103 process=1
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:25.221697Z  INFO mz_cluster::communication: awaiting connection from peer process=1
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:25.221984Z  INFO mz_service::grpc: GrpcServer: remote client connected
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:25.223206Z  INFO mz_cluster::server: Building timely container with config TimelyConfig { workers: 1, process: 1, addresses: ["clusterd_1_1:2102", "clusterd_1_2:2102"], arrangement_exert_proportionality: 16 }
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:25.223480Z  INFO mz_cluster::communication: initializing network for timely instance, with 2 processes for epoch number (2, 1) process=1addresses=["clusterd_1_1:2102", "clusterd_1_2:2102"]
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:25.223978Z  INFO mz_cluster::communication: Attempting to connect to process at clusterd_1_1:2102 process=1
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:25.224160Z  INFO mz_cluster::communication: awaiting connection from peer process=1
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:25.226202Z  INFO mz_cluster::communication: successfully initialized network process=1
replica-isolation-clusterd_1_2-1  | 2024-07-12T21:56:25.230212Z  WARN mz_timely_util::panic: halting process: timely communication error: reading data: socket closed


def- commented Jul 15, 2024

Another occurrence: https://buildkite.com/materialize/test/builds/85872#0190b5ec-cbc0-4a36-8aa5-683a7d5f1bff
I'll temporarily disable this since it flakes a bunch in CI.


teskje commented Jul 15, 2024

The logs from this last one look different. It's again the first of the two replicas that fails (it has been the first one in all occurrences I've observed so far, which might tell us something), but this time only one of the two processes halted:

replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:46.351709Z  INFO clusterd: booting os.os_type=Ubuntu os.version=22.04 os.bitness=64-bit build.version="0.109.0-dev" build.sha="e95f71ba69fe18ec93e070ec9ba9cb392b61dfab" build.time="2024-07-15T00:29:43Z" cpus.logical=4 cpus.physical=4 cpu0.brand="" cpu0.frequency=2100 memory.total=8145408000 memory.used=3074244608 memory.limit=<unknown> swap.total=0 swap.used=0 swap.limit=<unknown> tracing.max_level=info
replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:46.351887Z  INFO clusterd: serving internal HTTP server on 0.0.0.0:6878
replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:46.354929Z  INFO clusterd: listening for storage controller connections on 0.0.0.0:2100
replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:46.355077Z  INFO mz_service::grpc: Starting to listen on 0.0.0.0:2100
replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:46.355210Z  INFO clusterd: listening for compute controller connections on 0.0.0.0:2101
replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:46.355277Z  INFO mz_service::grpc: Starting to listen on 0.0.0.0:2101
replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:53.195702Z  INFO mz_service::grpc: GrpcServer: remote client connected
replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:53.199103Z  INFO mz_cluster::server: Building timely container with config TimelyConfig { workers: 1, process: 0, addresses: ["clusterd_1_1:2103", "clusterd_1_2:2103"], arrangement_exert_proportionality: 1337 }
replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:53.199146Z  INFO mz_cluster::communication: initializing network for timely instance, with 2 processes for epoch number (2, 1) process=0addresses=["clusterd_1_1:2103", "clusterd_1_2:2103"]
replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:53.199438Z  INFO mz_cluster::communication: awaiting connection from peer process=0
replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:53.199615Z  INFO mz_service::grpc: GrpcServer: remote client connected
replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:53.202704Z  INFO mz_cluster::communication: accepted connection from peer process=0
replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:53.202912Z  INFO mz_cluster::communication: successfully initialized network process=0
replica-isolation-clusterd_1_1-1  | 2024-07-15T10:34:53.204129Z  WARN mz_timely_util::panic: halting process: timely communication error: reading data: Connection reset by peer (os error 104)
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:45.852286Z  INFO clusterd: booting os.os_type=Ubuntu os.version=22.04 os.bitness=64-bit build.version="0.109.0-dev" build.sha="e95f71ba69fe18ec93e070ec9ba9cb392b61dfab" build.time="2024-07-15T00:29:43Z" cpus.logical=4 cpus.physical=4 cpu0.brand="" cpu0.frequency=2100 memory.total=8145408000 memory.used=3111383040 memory.limit=<unknown> swap.total=0 swap.used=0 swap.limit=<unknown> tracing.max_level=info
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:45.852483Z  INFO clusterd: serving internal HTTP server on 0.0.0.0:6878
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:45.854965Z  INFO clusterd: listening for storage controller connections on 0.0.0.0:2100
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:45.855001Z  INFO mz_service::grpc: Starting to listen on 0.0.0.0:2100
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:45.855071Z  INFO clusterd: listening for compute controller connections on 0.0.0.0:2101
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:45.855078Z  INFO mz_service::grpc: Starting to listen on 0.0.0.0:2101
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:53.197206Z  INFO mz_service::grpc: GrpcServer: remote client connected
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:53.198906Z  INFO mz_service::grpc: GrpcServer: remote client connected
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:53.200928Z  INFO mz_cluster::server: Building timely container with config TimelyConfig { workers: 1, process: 1, addresses: ["clusterd_1_1:2103", "clusterd_1_2:2103"], arrangement_exert_proportionality: 1337 }
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:53.201064Z  INFO mz_cluster::communication: initializing network for timely instance, with 2 processes for epoch number (2, 1) process=1addresses=["clusterd_1_1:2103", "clusterd_1_2:2103"]
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:53.201479Z  INFO mz_cluster::communication: Attempting to connect to process at clusterd_1_1:2103 process=1
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:53.201677Z  INFO mz_cluster::communication: awaiting connection from peer process=1
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:53.205217Z  INFO mz_cluster::server: Building timely container with config TimelyConfig { workers: 1, process: 1, addresses: ["clusterd_1_1:2102", "clusterd_1_2:2102"], arrangement_exert_proportionality: 16 }
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:53.205382Z  INFO mz_cluster::communication: initializing network for timely instance, with 2 processes for epoch number (2, 1) process=1addresses=["clusterd_1_1:2102", "clusterd_1_2:2102"]
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:53.208060Z  INFO mz_cluster::communication: Attempting to connect to process at clusterd_1_1:2102 process=1
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:53.209990Z  INFO mz_cluster::communication: awaiting connection from peer process=1
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:34:53.836647Z  INFO mz_service::grpc: GrpcServer: remote client connected
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:35:04.202817Z  INFO mz_service::grpc: GrpcServer: remote client connected
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:35:04.203341Z  INFO mz_service::grpc: GrpcServer: remote client connected
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:35:04.336914Z  INFO mz_service::grpc: GrpcServer: remote client connected
replica-isolation-clusterd_1_2-1  | 2024-07-15T10:35:06.092240Z  INFO mz_service::grpc: GrpcServer: remote client connected
[...]


teskje commented Jul 15, 2024

Theory about what might be happening: #27896 introduced, as a side-effect, the behavior that the storage controller will always connect to new replicas as they are created, and disconnect from the previous ones in the process. So when a cluster is created with two replicas, it will connect to the first one, then immediately disconnect and connect to the second one. On clusterd, when the storage controller disconnects, I believe we drop the storage Timely runtime too. And I also believe that dropping the two halves of the storage runtime in the two processes at different times gives the longer-lived Timely runtime the opportunity to try to read from its intra-process channels, notice the absence of the other half, and panic.

One fix is to avoid the reconnection of the storage controller if it is already connected to a different replica in the same cluster.
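
As a rough model of that guard (Python pseudocode with hypothetical names, not the actual Rust storage controller code): before connecting to a newly created replica, check whether a connection to some replica of the same cluster already exists and, if so, keep it instead of reconnecting.

```python
# Rough, hypothetical model of the proposed guard; the real logic would live in
# the Rust storage controller, and all names here are made up.
class StorageControllerModel:
    def __init__(self) -> None:
        # cluster_id -> replica_id the controller is currently connected to
        self.connected_replica: dict[str, str] = {}

    def on_replica_created(self, cluster_id: str, replica_id: str) -> None:
        if cluster_id in self.connected_replica:
            # Already connected to another replica of this cluster: keep that
            # connection instead of dropping it, which (per the theory above)
            # would tear down the storage Timely runtime on the old replica.
            return
        self.connected_replica[cluster_id] = replica_id
        self.connect(cluster_id, replica_id)

    def connect(self, cluster_id: str, replica_id: str) -> None:
        ...  # establish the connection to the replica (not modeled here)
```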
