
[CI] Rolling upgrade tests failing to start after upgrading node #53042

@mark-vieira

We have a bunch of BWC tests failing in master:

Execution failed for task ':x-pack:qa:rolling-upgrade:v7.7.0#oneThirdUpgradedTest'.
> `cluster{:x-pack:qa:rolling-upgrade:v7.7.0}` failed to wait for cluster health yellow after 40 SECONDS
  IO error while waiting cluster
    503 Service Unavailable
  > IO error while waiting cluster
    > 503 Service Unavailable

The problem here is the cluster failing to come up after upgrading one of the cluster nodes from 7.7.0 (i.e. latest from 7.x branch) to 8.0.0 (i.e. master).

The logs are littered with lots of SSL/crypto-type errors, as well as this one:

»  Caused by: java.lang.IllegalArgumentException: Unknown NamedWriteable [org.elasticsearch.cluster.ClusterState$Custom][]
»  	at org.elasticsearch.common.io.stream.NamedWriteableRegistry.getReader(NamedWriteableRegistry.java:113) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput.readNamedWriteable(NamedWriteableAwareStreamInput.java:45) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput.readNamedWriteable(NamedWriteableAwareStreamInput.java:39) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.cluster.ClusterState.readFrom(ClusterState.java:728) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.cluster.coordination.ValidateJoinRequest.<init>(ValidateJoinRequest.java:33) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.transport.RequestHandlerRegistry.newRequest(RequestHandlerRegistry.java:56) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:175) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:118) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:667) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
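For anyone reading the trace above without the source at hand, here is a minimal sketch of how a name-keyed reader lookup ends up throwing this kind of exception. It is not the Elasticsearch implementation: `ReaderRegistrySketch`, its nested `Custom` interface, and the `"snapshots"` entry are hypothetical names for illustration only. In the message, the first bracket is the category class and the second is the name that was looked up, so the trailing `[]` in the log suggests the name read from the stream was empty.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical, simplified stand-in for a name-keyed reader registry.
// This is NOT the real NamedWriteableRegistry; it only mirrors the lookup idea.
public class ReaderRegistrySketch {

    // Hypothetical category type, standing in for something like ClusterState.Custom.
    public interface Custom {}

    // category -> (name -> reader). Readers here just take and return strings.
    private final Map<Class<?>, Map<String, Function<String, Object>>> readers = new HashMap<>();

    public void register(Class<?> category, String name, Function<String, Object> reader) {
        readers.computeIfAbsent(category, c -> new HashMap<>()).put(name, reader);
    }

    public Function<String, Object> getReader(Class<?> category, String name) {
        Map<String, Function<String, Object>> byName = readers.get(category);
        Function<String, Object> reader = (byName == null) ? null : byName.get(name);
        if (reader == null) {
            // Same shape as the logged message: [category][name]
            throw new IllegalArgumentException(
                "Unknown NamedWriteable [" + category.getName() + "][" + name + "]");
        }
        return reader;
    }

    public static void main(String[] args) {
        ReaderRegistrySketch registry = new ReaderRegistrySketch();
        registry.register(Custom.class, "snapshots", s -> s);

        try {
            // An empty name has no registered reader, so the lookup fails.
            registry.getReader(Custom.class, "");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under that reading, the upgraded 8.0.0 node fails while deserializing the cluster state carried by the `ValidateJoinRequest`, before any join validation logic runs. Why the name would be empty is not something this sketch or the logs above establish.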

It's not clear to me (since these are all info and warn level logs) which of these errors is actually stopping the cluster from forming. My guess is that the "failed to join" errors are the problem, given that the whole point of these tests is to ensure that an 8.0 node can talk to a 7.7 cluster.

https://gradle-enterprise.elastic.co/s/6asy246orjjj6/console-log?task=:x-pack:qa:rolling-upgrade:v7.7.0%23oneThirdUpgradedTest

There have been over 20 of these failures today across all CI builds (pull requests, feature branches, etc.). It didn't reproduce locally for me, however, and I'm quite surprised we haven't seen an intake build fail with this yet.
