Jetstream bus pods can't survive the restart #3458

leonid-podolinskiy-hs · 2025-02-06T09:03:21Z

Describe the bug
We are using jetstream bus in several places. We've noticed that after some time that it working correctly, when one of the pods restarted by the k8s and it is not coming up
To Reproduce
Steps to reproduce the behavior:

Create jetstream bus with 3 pods and persistent storage
Wait for k8s to restart pod, also was able to reproduce it by deleting one of the pods 1 day after they were alive

Expected behavior
Pod coming up successfuly and syncing with the cluster

Environment (please complete the following information):

Kubernetes: 1.28
Argo Events: 1.9.3 (also happens on 1.9.5)
Nats: reproduced on versions 2.10.10 and 2.10.25

Additional context
Logs:


[7] 2025/02/05 23:24:32.627012 [DBG] <ip address>:43598 - rid:21 - Router Ping Timer
[7] 2025/02/05 23:24:33.242146 [DBG] <ip address>:6222 - rid:24 - Router Ping Timer
2025/02/05 23:24:35 Trapped "terminated" signal
2025/02/05 23:24:35 Shutting down...
[7] 2025/02/05 23:24:35.274352 [DBG] Trapped "user defined signal 2" signal
[7] 2025/02/05 23:24:35.274423 [INF] Entering lame duck mode, stop accepting new clients
[7] 2025/02/05 23:24:35.274536 [INF] Initiating JetStream Shutdown...
[7] 2025/02/05 23:24:35.274756 [DBG] Exiting stream monitor for 'js > default' [S-R3F-TbLcjFQM]
[7] 2025/02/05 23:24:35.274780 [DBG] JETSTREAM - JetStream connection closed: Client Closed
[7] 2025/02/05 23:24:35.274814 [DBG] JETSTREAM - JetStream connection closed: Client Closed
[7] 2025/02/05 23:24:35.274838 [DBG] JETSTREAM - JetStream connection closed: Client Closed
[7] 2025/02/05 23:24:35.274854 [DBG] JETSTREAM - JetStream connection closed: Client Closed
[7] 2025/02/05 23:24:35.274903 [DBG] JETSTREAM - JetStream connection closed: Client Closed
[7] 2025/02/05 23:24:35.274906 [DBG] JETSTREAM - JetStream connection closed: Client Closed
[7] 2025/02/05 23:24:35.274964 [DBG] Exiting metadata monitor
[7] 2025/02/05 23:24:35.274982 [DBG] SYSTEM - System connection closed: Internal Client
[7] 2025/02/05 23:24:35.274991 [INF] JetStream Shutdown
[7] 2025/02/05 23:24:35.275005 [DBG] Shutting down all raft nodes
[7] 2025/02/05 23:24:35.275010 [DBG] Stepping down all leader raft nodes
[7] 2025/02/05 23:24:35.275025 [DBG] SYSTEM - System connection closed: Internal Client
[7] 2025/02/05 23:24:35.275028 [DBG] SYSTEM - System connection closed: Internal Client
[7] 2025/02/05 23:24:35.275089 [DBG] RAFT [j8C6znlo - _meta_] Shutdown
[7] 2025/02/05 23:24:35.275151 [DBG] RAFT [j8C6znlo - S-R3F-TbLcjFQM] Shutdown
[7] 2025/02/05 23:24:35.275227 [DBG] RAFT [j8C6znlo - C-R3F-ptSslgK2] Shutdown
[7] 2025/02/05 23:24:35.275263 [INF] Initiating Shutdown...
[7] 2025/02/05 23:24:35.275294 [DBG] Route accept loop exiting..
[7] 2025/02/05 23:24:35.275321 [INF] <ip address>:51378 - rid:20 - Router connection closed: Server Shutdown
[7] 2025/02/05 23:24:35.275330 [INF] <ip address>:43582 - rid:16 - Router connection closed: Server Shutdown
[7] 2025/02/05 23:24:35.275335 [INF] <ip address>:6222 - rid:24 - Router connection closed: Server Shutdown
[7] 2025/02/05 23:24:35.275339 [INF] <ip address>:43572 - rid:15 - Router connection closed: Server Shutdown
[7] 2025/02/05 23:24:35.275339 [DBG] Exiting consumer monitor for 'js > default > group-3223285732' [C-R3F-ptSslgK2]
[7] 2025/02/05 23:24:35.275344 [INF] <ip address>:43588 - rid:17 - Router connection closed: Server Shutdown
[7] 2025/02/05 23:24:35.275357 [INF] <ip address>:43598 - rid:21 - Router connection closed: Server Shutdown
[7] 2025/02/05 23:24:35.275366 [INF] <ip address>:51356 - rid:18 - Router connection closed: Server Shutdown
[7] 2025/02/05 23:24:35.275372 [INF] <ip address>:51368 - rid:19 - Router connection closed: Server Shutdown
[7] 2025/02/05 23:24:35.275373 [DBG] SYSTEM - System connection closed: Client Closed
[7] 2025/02/05 23:24:35.275544 [INF] Server Exiting..
[7] 2025/02/05 23:24:35.275557 [DBG] maxprocs: No GOMAXPROCS change to reset
[7] 2025/02/05 23:24:52.677852 [INF] Starting nats-server
[7] 2025/02/05 23:24:52.677901 [INF]   Version:  2.10.25
[7] 2025/02/05 23:24:52.677905 [INF]   Git:      [006039e]
[7] 2025/02/05 23:24:52.677907 [DBG]   Go build: go1.23.5
[7] 2025/02/05 23:24:52.677912 [INF]   Cluster:  my-bus-name
[7] 2025/02/05 23:24:52.677913 [INF]   Name:     eventbus-my-bus-name-js-2
[7] 2025/02/05 23:24:52.677916 [INF]   Node:     j8C6znlo
[7] 2025/02/05 23:24:52.677918 [INF]   ID:       NCLLBKHJZSFYFLL4AZVMEB56THWQYQ5AVIQKSHVUKHR24QNGKOHCIIPF
[7] 2025/02/05 23:24:52.677921 [WRN] Plaintext passwords detected, use nkeys or bcrypt
[7] 2025/02/05 23:24:52.677926 [INF] Using configuration file: /etc/nats-config/nats-js.conf
[7] 2025/02/05 23:24:52.678750 [INF] Starting http monitor on 0.0.0.0:8222
[7] 2025/02/05 23:24:52.678905 [INF] Starting JetStream
[7] 2025/02/05 23:24:52.678925 [DBG] JetStream creating dynamic configuration - 46.33 GB memory, 10.00 MB disk
[7] 2025/02/05 23:24:52.681774 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[7] 2025/02/05 23:24:52.681800 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[7] 2025/02/05 23:24:52.681803 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[7] 2025/02/05 23:24:52.681805 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[7] 2025/02/05 23:24:52.681807 [INF]
[7] 2025/02/05 23:24:52.681809 [INF]          https://docs.nats.io/jetstream
[7] 2025/02/05 23:24:52.681811 [INF]
[7] 2025/02/05 23:24:52.681813 [INF] ---------------- JETSTREAM ----------------
[7] 2025/02/05 23:24:52.681820 [INF]   Max Memory:      46.33 GB
[7] 2025/02/05 23:24:52.681824 [INF]   Max Storage:     10.00 MB
[7] 2025/02/05 23:24:52.681826 [INF]   Store Directory: "/data/jetstream/store/jetstream"
[7] 2025/02/05 23:24:52.681829 [INF]   Encryption:      ChaCha20-Poly1305
[7] 2025/02/05 23:24:52.681843 [INF] -------------------------------------------
[7] 2025/02/05 23:24:52.682001 [DBG]   Exports:
[7] 2025/02/05 23:24:52.682013 [DBG]      $JS.API.>
[7] 2025/02/05 23:24:52.683512 [DBG] Enabled JetStream for account "js"
[7] 2025/02/05 23:24:52.683562 [DBG]   Max Memory:      -1 B
[7] 2025/02/05 23:24:52.683579 [DBG]   Max Storage:     -1 B
[7] 2025/02/05 23:24:52.683593 [DBG] Recovering JetStream state for account "js"
[7] 2025/02/05 23:24:52.696104 [DBG]   Stream metafile is encrypted, reading encrypted keyfile
[7] 2025/02/05 23:24:52.696166 [WRN]   Error decrypting our stream metafile: unable to recover keys
[7] 2025/02/05 23:24:52.696191 [DBG] JetStream state for account "js" recovered
[7] 2025/02/05 23:24:52.696255 [INF] Starting JetStream cluster
[7] 2025/02/05 23:24:52.696266 [DBG] JetStream cluster checking for stable cluster name and peers
[7] 2025/02/05 23:24:52.696270 [INF] Creating JetStream metadata controller
[7] 2025/02/05 23:24:52.702284 [WRN] Filestore [_meta_] Stream state magic and version mismatch
[7] 2025/02/05 23:24:52.702306 [WRN] Filestore [_meta_] Recovering stream state from index errored: corrupt state file
[7] 2025/02/05 23:24:52.711213 [INF] JetStream cluster recovering state
[7] 2025/02/05 23:24:52.717555 [DBG] RAFT [j8C6znlo - _meta_] Started
[7] 2025/02/05 23:24:52.717650 [INF] Listening for client connections on 0.0.0.0:4222
[7] 2025/02/05 23:24:52.717660 [INF] TLS required for client connections
[7] 2025/02/05 23:24:52.717664 [DBG] Get non local IPs for "0.0.0.0"
[7] 2025/02/05 23:24:52.717784 [DBG]   ip=<ip address>
[7] 2025/02/05 23:24:52.717799 [INF] Server is ready
[7] 2025/02/05 23:24:52.717816 [DBG] Starting metadata monitor
[7] 2025/02/05 23:24:52.717864 [INF] Cluster name is my-bus-name
[7] 2025/02/05 23:24:52.717908 [INF] Listening for route connections on 0.0.0.0:6222
[7] 2025/02/05 23:24:52.718302 [DBG] maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
[7] 2025/02/05 23:24:52.718299 [DBG] JetStream cluster creating raft group:&{Name:S-R3F-TbLcjFQM Peers:[j8C6znlo pSKec2hq Zncm7LR9] Storage:File Cluster:my-bus-name Preferred: node:<nil>}
[7] 2025/02/05 23:24:52.720351 [WRN] Filestore [S-R3F-TbLcjFQM] Stream state magic and version mismatch
[7] 2025/02/05 23:24:52.720367 [WRN] Filestore [S-R3F-TbLcjFQM] Recovering stream state from index errored: corrupt state file
[7] 2025/02/05 23:24:52.733505 [DBG] RAFT [j8C6znlo - S-R3F-TbLcjFQM] Started
[7] 2025/02/05 23:24:52.734598 [WRN] Filestore [default] Stream state magic and version mismatch
[7] 2025/02/05 23:24:52.734611 [WRN] Filestore [default] Recovering stream state from index errored: corrupt state file
[7] 2025/02/05 23:24:52.741288 [DBG] RAFT [j8C6znlo - S-R3F-TbLcjFQM] Update peers from leader to map[Zncm7LR9:0xc0003c6708 j8C6znlo:0xc0003c66d8 pSKec2hq:0xc0003c66f0]
[7] 2025/02/05 23:24:52.744431 [DBG] JetStream cluster creating raft group:&{Name:C-R3F-ptSslgK2 Peers:[j8C6znlo pSKec2hq Zncm7LR9] Storage:File Cluster:my-bus-name Preferred: node:<nil>}
[7] 2025/02/05 23:24:52.744600 [DBG] Starting stream monitor for 'js > default' [S-R3F-TbLcjFQM]
[7] 2025/02/05 23:24:52.745303 [WRN] Catchup for stream 'js > default' resetting first sequence: 6 on catchup request
[7] 2025/02/05 23:24:52.745378 [DBG] RAFT [j8C6znlo - S-R3F-TbLcjFQM] Pausing our apply channel
[7] 2025/02/05 23:24:52.751038 [WRN] Filestore [C-R3F-ptSslgK2] Stream state magic and version mismatch
[7] 2025/02/05 23:24:52.751063 [WRN] Filestore [C-R3F-ptSslgK2] Recovering stream state from index errored: corrupt state file
[7] 2025/02/05 23:24:52.789930 [DBG] RAFT [j8C6znlo - C-R3F-ptSslgK2] Started
[7] 2025/02/05 23:24:52.793386 [DBG] JETSTREAM - JetStream connection closed: Client Closed
[7] 2025/02/05 23:24:52.793430 [DBG] JETSTREAM - JetStream connection closed: Client Closed
[7] 2025/02/05 23:24:52.793448 [WRN] Consumer create failed for 'js > default > group-3223285732': error creating store for consumer: cipher: message authentication failed (10104)
[7] 2025/02/05 23:24:52.817920 [WRN] Waiting for routing to be established...
2025/02/05 23:24:53 Starting NATS Server Reloader v0.14.0
2025/02/05 23:24:53 Watching file: /etc/nats-config/nats-js.conf
2025/02/05 23:24:53 Live, ready to kick pid 7 on config changes (files=1)
[7] 2025/02/05 23:24:54.702919 [WRN] Waiting for routing to be established...
[7] 2025/02/05 23:24:55.482355 [DBG] <ip address>:55544 - rid:15 - Starting TLS route server handshake
[7] 2025/02/05 23:24:55.487550 [DBG] <ip address>:55544 - rid:15 - TLS handshake complete
[7] 2025/02/05 23:24:55.487569 [DBG] <ip address>:55544 - rid:15 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:55.487582 [INF] <ip address>:55544 - rid:15 - Route connection created
[7] 2025/02/05 23:24:55.488519 [DBG] <ip address>:55544 - rid:15 - Registering remote route "NCCG6PVTCCWD7E5W77YJSRSEL5XWODY4WYC7QKXJY74NZVLB7PCPPO72"
[7] 2025/02/05 23:24:55.497278 [DBG] <ip address>:54386 - rid:16 - Starting TLS route server handshake
[7] 2025/02/05 23:24:55.497523 [DBG] <ip address>:55552 - rid:17 - Starting TLS route server handshake
[7] 2025/02/05 23:24:55.498915 [DBG] <ip address>:54402 - rid:18 - Starting TLS route server handshake
[7] 2025/02/05 23:24:55.500381 [DBG] <ip address>:54386 - rid:16 - TLS handshake complete
[7] 2025/02/05 23:24:55.500403 [DBG] <ip address>:54386 - rid:16 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:55.500411 [INF] <ip address>:54386 - rid:16 - Route connection created
[7] 2025/02/05 23:24:55.500787 [DBG] <ip address>:55552 - rid:17 - TLS handshake complete
[7] 2025/02/05 23:24:55.500806 [DBG] <ip address>:55552 - rid:17 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:55.500810 [INF] <ip address>:55552 - rid:17 - Route connection created
[7] 2025/02/05 23:24:55.501298 [DBG] <ip address>:54386 - rid:16 - Registering remote route "NAIEAAQLI73HIJ5HOOOE7KIFYU6RJK3YIAC4B7TBG2RHKQKVYKE3KGRD"
[7] 2025/02/05 23:24:55.501541 [DBG] <ip address>:55552 - rid:17 - Registering remote route "NCCG6PVTCCWD7E5W77YJSRSEL5XWODY4WYC7QKXJY74NZVLB7PCPPO72"
[7] 2025/02/05 23:24:55.504001 [DBG] <ip address>:54402 - rid:18 - TLS handshake complete
[7] 2025/02/05 23:24:55.504023 [DBG] <ip address>:54402 - rid:18 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:55.504027 [INF] <ip address>:54402 - rid:18 - Route connection created
[7] 2025/02/05 23:24:55.509258 [DBG] SYSTEM - System connection closed: Internal Client
[7] 2025/02/05 23:24:55.509561 [DBG] RAFT [j8C6znlo - C-R3F-ptSslgK2] Shutdown
[7] 2025/02/05 23:24:55.510776 [DBG] RAFT [j8C6znlo - C-R3F-ptSslgK2] Deleted
[7] 2025/02/05 23:24:55.510812 [DBG] Checking JetStream cluster state
[41] 2025/02/05 23:24:55.514664 [INF] Prometheus exporter listening at http://0.0.0.0:7777/metrics
[7] 2025/02/05 23:24:55.531801 [DBG] <ip address>:54402 - rid:18 - Registering remote route "NAIEAAQLI73HIJ5HOOOE7KIFYU6RJK3YIAC4B7TBG2RHKQKVYKE3KGRD"
[7] 2025/02/05 23:24:55.536958 [DBG] Recovered JetStream cluster metadata
[7] 2025/02/05 23:24:55.537007 [WRN] JetStream has not established contact with a meta leader
[7] 2025/02/05 23:24:55.540088 [DBG] <ip address>:54418 - rid:19 - Starting TLS route server handshake
[7] 2025/02/05 23:24:55.545412 [DBG] <ip address>:54418 - rid:19 - TLS handshake complete
[7] 2025/02/05 23:24:55.545429 [DBG] <ip address>:54418 - rid:19 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:55.545434 [INF] <ip address>:54418 - rid:19 - Route connection created
[7] 2025/02/05 23:24:55.546272 [DBG] <ip address>:54418 - rid:19 - Registering remote route "NAIEAAQLI73HIJ5HOOOE7KIFYU6RJK3YIAC4B7TBG2RHKQKVYKE3KGRD"
[7] 2025/02/05 23:24:55.552453 [DBG] RAFT [j8C6znlo - S-R3F-TbLcjFQM] Resuming our apply channel
[7] 2025/02/05 23:24:55.576884 [DBG] <ip address>:55560 - rid:20 - Starting TLS route server handshake
[7] 2025/02/05 23:24:55.581179 [DBG] <ip address>:55560 - rid:20 - TLS handshake complete
[7] 2025/02/05 23:24:55.581198 [DBG] <ip address>:55560 - rid:20 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:55.581210 [INF] <ip address>:55560 - rid:20 - Route connection created
[7] 2025/02/05 23:24:55.582074 [DBG] <ip address>:55560 - rid:20 - Registering remote route "NCCG6PVTCCWD7E5W77YJSRSEL5XWODY4WYC7QKXJY74NZVLB7PCPPO72"
[7] 2025/02/05 23:24:55.586518 [DBG] <ip address>:55574 - rid:21 - Starting TLS route server handshake
[7] 2025/02/05 23:24:55.589785 [DBG] <ip address>:55574 - rid:21 - TLS handshake complete
[7] 2025/02/05 23:24:55.589805 [DBG] <ip address>:55574 - rid:21 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:55.589812 [INF] <ip address>:55574 - rid:21 - Route connection created
[7] 2025/02/05 23:24:55.590485 [INF] <ip address>:55574 - rid:21 - Router connection closed: Client Closed
[7] 2025/02/05 23:24:55.718375 [DBG] Checking JetStream cluster state
[7] 2025/02/05 23:24:56.437479 [DBG] <ip address>:44534 - rid:22 - Starting TLS route server handshake
[7] 2025/02/05 23:24:56.440599 [DBG] <ip address>:44534 - rid:22 - TLS handshake complete
[7] 2025/02/05 23:24:56.440616 [DBG] <ip address>:44534 - rid:22 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:56.440642 [INF] <ip address>:44534 - rid:22 - Route connection created
[7] 2025/02/05 23:24:56.441565 [DBG] <ip address>:44534 - rid:22 - Sent local subscriptions to route
[7] 2025/02/05 23:24:56.441581 [DBG] <ip address>:44534 - rid:22 - Registering remote route "NAIEAAQLI73HIJ5HOOOE7KIFYU6RJK3YIAC4B7TBG2RHKQKVYKE3KGRD" for account "sys"
[7] 2025/02/05 23:24:56.520140 [DBG] <ip address>:55580 - rid:23 - Starting TLS route server handshake
[7] 2025/02/05 23:24:56.522976 [DBG] <ip address>:55580 - rid:23 - TLS handshake complete
[7] 2025/02/05 23:24:56.522992 [DBG] <ip address>:55580 - rid:23 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:56.522997 [INF] <ip address>:55580 - rid:23 - Route connection created
[7] 2025/02/05 23:24:56.524113 [DBG] <ip address>:55580 - rid:23 - Sent local subscriptions to route
[7] 2025/02/05 23:24:56.524132 [DBG] <ip address>:55580 - rid:23 - Registering remote route "NCCG6PVTCCWD7E5W77YJSRSEL5XWODY4WYC7QKXJY74NZVLB7PCPPO72" for account "sys"
[7] 2025/02/05 23:24:56.540807 [DBG] <ip address>:55590 - rid:24 - Starting TLS route server handshake
[7] 2025/02/05 23:24:56.544421 [DBG] <ip address>:55590 - rid:24 - TLS handshake complete
[7] 2025/02/05 23:24:56.544440 [DBG] <ip address>:55590 - rid:24 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:56.544445 [INF] <ip address>:55590 - rid:24 - Route connection created
[7] 2025/02/05 23:24:56.545236 [INF] <ip address>:55590 - rid:24 - Router connection closed: Client Closed
[7] 2025/02/05 23:24:56.562569 [DBG] <ip address>:55552 - rid:17 - Router Ping Timer
[7] 2025/02/05 23:24:56.593806 [DBG] <ip address>:54402 - rid:18 - Router Ping Timer
[7] 2025/02/05 23:24:56.617707 [DBG] <ip address>:55560 - rid:20 - Router Ping Timer
[7] 2025/02/05 23:24:56.638885 [DBG] <ip address>:55544 - rid:15 - Router Ping Timer
[7] 2025/02/05 23:24:56.698853 [DBG] <ip address>:54386 - rid:16 - Router Ping Timer
[7] 2025/02/05 23:24:56.718137 [DBG] Checking JetStream cluster state
[7] 2025/02/05 23:24:56.719258 [DBG] <ip address>:54418 - rid:19 - Router Ping Timer
[7] 2025/02/05 23:24:57.447278 [DBG] RAFT [j8C6znlo - _meta_] AppendEntry updating leader to "pSKec2hq"
[7] 2025/02/05 23:24:57.448479 [DBG] RAFT [j8C6znlo - S-R3F-TbLcjFQM] AppendEntry updating leader to "pSKec2hq"
[7] 2025/02/05 23:24:57.451173 [INF] JetStream cluster new metadata leader: eventbus-my-bus-name-js-1/my-bus-name
[7] 2025/02/05 23:24:57.529033 [DBG] <ip address>:55580 - rid:23 - Router Ping Timer
[7] 2025/02/05 23:24:57.547491 [DBG] <ip address>:44538 - rid:25 - Starting TLS route server handshake
[7] 2025/02/05 23:24:57.550783 [DBG] <ip address>:44538 - rid:25 - TLS handshake complete
[7] 2025/02/05 23:24:57.550803 [DBG] <ip address>:44538 - rid:25 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:57.550810 [INF] <ip address>:44538 - rid:25 - Route connection created
[7] 2025/02/05 23:24:57.551907 [INF] <ip address>:44538 - rid:25 - Router connection closed: Client Closed
[7] 2025/02/05 23:24:57.581792 [DBG] <ip address>:44534 - rid:22 - Router Ping Timer
[7] 2025/02/05 23:24:57.585828 [DBG] <ip address>:44546 - rid:26 - Starting TLS route server handshake
[7] 2025/02/05 23:24:57.590761 [DBG] <ip address>:44546 - rid:26 - TLS handshake complete
[7] 2025/02/05 23:24:57.590780 [DBG] <ip address>:44546 - rid:26 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:57.590787 [INF] <ip address>:44546 - rid:26 - Route connection created
[7] 2025/02/05 23:24:57.591455 [INF] <ip address>:44546 - rid:26 - Router connection closed: Client Closed
[7] 2025/02/05 23:24:57.719211 [DBG] Checking JetStream cluster state
[7] 2025/02/05 23:24:57.721636 [DBG] Trying to connect to route on eventbus-my-bus-name-js-1.eventbus-my-bus-name-js-svc.my-bus-name.svc:6222 (<ip address>:6222)
[7] 2025/02/05 23:24:57.721683 [DBG] Trying to connect to route on eventbus-my-bus-name-js-1.eventbus-my-bus-name-js-svc.my-bus-name.svc:6222 (<ip address>:6222)
[7] 2025/02/05 23:24:57.721871 [DBG] Trying to connect to route on eventbus-my-bus-name-js-0.eventbus-my-bus-name-js-svc.my-bus-name.svc:6222 (<ip address>:6222)
[7] 2025/02/05 23:24:57.722232 [DBG] <ip address>:6222 - rid:27 - Starting TLS route client handshake
[7] 2025/02/05 23:24:57.722401 [DBG] <ip address>:6222 - rid:28 - Starting TLS route client handshake
[7] 2025/02/05 23:24:57.722444 [DBG] <ip address>:6222 - rid:29 - Starting TLS route client handshake
[7] 2025/02/05 23:24:57.723171 [ERR] Error trying to connect to route (attempt 1): lookup for host "eventbus-my-bus-name-js-2.eventbus-my-bus-name-js-svc.my-bus-name.svc": lookup eventbus-my-bus-name-js-2.eventbus-my-bus-name-js-svc.my-bus-name.svc on 172.20.0.10:53: no such host
[7] 2025/02/05 23:24:57.725247 [DBG] <ip address>:6222 - rid:28 - TLS handshake complete
[7] 2025/02/05 23:24:57.725261 [DBG] <ip address>:6222 - rid:28 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:57.725263 [DBG] <ip address>:6222 - rid:28 - Route connect msg sent
[7] 2025/02/05 23:24:57.725292 [INF] <ip address>:6222 - rid:28 - Route connection created
[7] 2025/02/05 23:24:57.726031 [DBG] <ip address>:6222 - rid:28 - Detected duplicate remote route "NAIEAAQLI73HIJ5HOOOE7KIFYU6RJK3YIAC4B7TBG2RHKQKVYKE3KGRD"
[7] 2025/02/05 23:24:57.726046 [INF] <ip address>:6222 - rid:28 - Router connection closed: Duplicate Route
[7] 2025/02/05 23:24:57.726405 [DBG] <ip address>:6222 - rid:29 - TLS handshake complete
[7] 2025/02/05 23:24:57.726419 [DBG] <ip address>:6222 - rid:29 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:57.726422 [DBG] <ip address>:6222 - rid:29 - Route connect msg sent
[7] 2025/02/05 23:24:57.726448 [INF] <ip address>:6222 - rid:29 - Route connection created
[7] 2025/02/05 23:24:57.726496 [DBG] <ip address>:6222 - rid:27 - TLS handshake complete
[7] 2025/02/05 23:24:57.726512 [DBG] <ip address>:6222 - rid:27 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:57.726515 [DBG] <ip address>:6222 - rid:27 - Route connect msg sent
[7] 2025/02/05 23:24:57.726518 [DBG] Trying to connect to route on eventbus-my-bus-name-js-0.eventbus-my-bus-name-js-svc.my-bus-name.svc:6222 (<ip address>:6222)
[7] 2025/02/05 23:24:57.726525 [INF] <ip address>:6222 - rid:27 - Route connection created
[7] 2025/02/05 23:24:57.727058 [DBG] <ip address>:6222 - rid:30 - Starting TLS route client handshake
[7] 2025/02/05 23:24:57.727142 [DBG] <ip address>:6222 - rid:29 - Route pool size reached, closing extra connection to "NCCG6PVTCCWD7E5W77YJSRSEL5XWODY4WYC7QKXJY74NZVLB7PCPPO72"
[7] 2025/02/05 23:24:57.727169 [DBG] <ip address>:6222 - rid:29 - Detected duplicate remote route "NCCG6PVTCCWD7E5W77YJSRSEL5XWODY4WYC7QKXJY74NZVLB7PCPPO72"
[7] 2025/02/05 23:24:57.727174 [INF] <ip address>:6222 - rid:29 - Router connection closed: Duplicate Route
[7] 2025/02/05 23:24:57.727360 [DBG] <ip address>:6222 - rid:27 - Detected duplicate remote route "NCCG6PVTCCWD7E5W77YJSRSEL5XWODY4WYC7QKXJY74NZVLB7PCPPO72"
[7] 2025/02/05 23:24:57.727375 [INF] <ip address>:6222 - rid:27 - Router connection closed: Duplicate Route
[7] 2025/02/05 23:24:57.730340 [DBG] <ip address>:6222 - rid:30 - TLS handshake complete
[7] 2025/02/05 23:24:57.730357 [DBG] <ip address>:6222 - rid:30 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256
[7] 2025/02/05 23:24:57.730373 [DBG] <ip address>:6222 - rid:30 - Route connect msg sent
[7] 2025/02/05 23:24:57.730394 [INF] <ip address>:6222 - rid:30 - Route connection created
[7] 2025/02/05 23:24:57.731368 [DBG] <ip address>:6222 - rid:30 - Route pool size reached, closing extra connection to "NAIEAAQLI73HIJ5HOOOE7KIFYU6RJK3YIAC4B7TBG2RHKQKVYKE3KGRD"
[7] 2025/02/05 23:24:57.731381 [DBG] <ip address>:6222 - rid:30 - Detected duplicate remote route "NAIEAAQLI73HIJ5HOOOE7KIFYU6RJK3YIAC4B7TBG2RHKQKVYKE3KGRD"
[7] 2025/02/05 23:24:57.731384 [INF] <ip address>:6222 - rid:30 - Router connection closed: Duplicate Route
[7] 2025/02/05 23:25:11.297552 [WRN] Healthcheck failed: "JetStream consumer 'js > default > group-3223285732' is not current"
[7] 2025/02/05 23:25:21.297805 [WRN] Healthcheck failed: "JetStream consumer 'js > default > group-3223285732' is not current"
[7] 2025/02/05 23:25:25.537037 [DBG] JetStream cluster checking for orphans
[7] 2025/02/05 23:25:26.563203 [DBG] <ip address>:55552 - rid:17 - Router Ping Timer
[7] 2025/02/05 23:25:26.594754 [DBG] <ip address>:54402 - rid:18 - Router Ping Timer
[7] 2025/02/05 23:25:26.618600 [DBG] <ip address>:55560 - rid:20 - Router Ping Timer
[7] 2025/02/05 23:25:26.639029 [DBG] <ip address>:55544 - rid:15 - Router Ping Timer
[7] 2025/02/05 23:25:26.699495 [DBG] <ip address>:54386 - rid:16 - Router Ping Timer
[7] 2025/02/05 23:25:26.719937 [DBG] <ip address>:54418 - rid:19 - Router Ping Timer
[7] 2025/02/05 23:25:27.529931 [DBG] <ip address>:55580 - rid:23 - Router Ping Timer
[7] 2025/02/05 23:25:27.583573 [DBG] <ip address>:44534 - rid:22 - Router Ping Timer
[7] 2025/02/05 23:25:31.297596 [WRN] Healthcheck failed: "JetStream consumer 'js > default > group-3223285732' is not current"
[7] 2025/02/05 23:25:41.297792 [WRN] Healthcheck failed: "JetStream consumer 'js > default > group-3223285732' is not current"
[7] 2025/02/05 23:25:51.297586 [WRN] Healthcheck failed: "JetStream consumer 'js > default > group-3223285732' is not current"
[7] 2025/02/05 23:25:56.564140 [DBG] <ip address>:55552 - rid:17 - Router Ping Timer

And further logs are combination of pings and this message

[7] 2025/02/05 23:25:11.297552 [WRN] Healthcheck failed: "JetStream consumer 'js > default > group-3223285732' is not current"

till the moment k8s restarting the pod due to failed healthcheck

there are several errors there, probably some are the reasons of the others:

not sure if this is relevant as i saw the same on the healthy pods as well

[7] 2025/02/05 23:24:52.696104 [DBG]   Stream metafile is encrypted, reading encrypted keyfile
[7] 2025/02/05 23:24:52.696166 [WRN]   Error decrypting our stream metafile: unable to recover keys

[7] 2025/02/05 23:24:52.702284 [WRN] Filestore [_meta_] Stream state magic and version mismatch
[7] 2025/02/05 23:24:52.702306 [WRN] Filestore [_meta_] Recovering stream state from index errored: corrupt state file
[7] 2025/02/05 23:24:52.720351 [WRN] Filestore [S-R3F-TbLcjFQM] Stream state magic and version mismatch
[7] 2025/02/05 23:24:52.720367 [WRN] Filestore [S-R3F-TbLcjFQM] Recovering stream state from index errored: corrupt state file
[7] 2025/02/05 23:24:52.734598 [WRN] Filestore [default] Stream state magic and version mismatch
[7] 2025/02/05 23:24:52.734611 [WRN] Filestore [default] Recovering stream state from index errored: corrupt state file
[7] 2025/02/05 23:24:52.751038 [WRN] Filestore [C-R3F-ptSslgK2] Stream state magic and version mismatch
[7] 2025/02/05 23:24:52.751063 [WRN] Filestore [C-R3F-ptSslgK2] Recovering stream state from index errored: corrupt state file

[7] 2025/02/05 23:24:52.793448 [WRN] Consumer create failed for 'js > default > group-3223285732': error creating store for consumer: cipher: message authentication failed (10104)

event bus configuration

apiVersion: argoproj.io/v1alpha1
kind: EventBus
metadata:
  name: my-bus
spec:
  jetstream:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                controller: eventbus-controller
            topologyKey: kubernetes.io/hostname
          weight: 50
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                controller: eventbus-controller
            topologyKey: topology.kubernetes.io/zone
          weight: 50
    containerTemplate:
      resources:
        requests:
          cpu: 10m
          memory: 32Mi
    metadata:
    metricsContainerTemplate:
      resources:
        requests:
          cpu: 10m
          memory: 32Mi
    persistence:
      accessMode: ReadWriteOnce
      storageClassName: gp3
      volumeSize: 1Gi
    reloaderContainerTemplate:
      resources:
        requests:
          cpu: 10m
          memory: 32Mi
    replicas: 3
    settings: |
      max_file_store: 10MB
    startArgs:
    - -D
    streamConfig: |
      maxAge: 24h

Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

The text was updated successfully, but these errors were encountered:

leonid-podolinskiy-hs · 2025-02-06T09:23:10Z

Few additions.

Once it happens to one pod, it will 100% happen to 2 other pods when they will be restarted
To fix the issue we have to delete the PVC of the pod, after it pod is coming up correctly. So to recover the cluster we need to do it for all 3 pods (can be done one by one or all together)

leonid-podolinskiy-hs added the bug Something isn't working label Feb 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jetstream bus pods can't survive the restart #3458

Jetstream bus pods can't survive the restart #3458

leonid-podolinskiy-hs commented Feb 6, 2025 •

edited

Loading

leonid-podolinskiy-hs commented Feb 6, 2025

Jetstream bus pods can't survive the restart #3458

Jetstream bus pods can't survive the restart #3458

Comments

leonid-podolinskiy-hs commented Feb 6, 2025 • edited Loading

leonid-podolinskiy-hs commented Feb 6, 2025

leonid-podolinskiy-hs commented Feb 6, 2025 •

edited

Loading