Serialization bugs are painful to track down #38939

@DaveCTurner

Occasionally we come across a serialization bug, particularly when nodes of multiple versions are involved. Here is a report of an issue in a cross-cluster search scenario involving 6.5.1 nodes, 5.6.2 nodes, and indices dating all the way back to 2.x. The exception we get is not very helpful:

[2019-02-14T23:53:52,630][WARN ][o.e.t.n.Netty4Transport  ] [IK-PRD-M3] exception caught on transport layer [[id: 0xd97a9d8c, L:/10.10.1.184:51594 - R:10.10.1.166/10.10.1.166:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [7719647], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.transport.TransportActionProxy$ProxyResponseHandler@7f2fcd88], error [false]; resetting

Today our two best options for diagnosing this are to reproduce it (often tricky without the user's exact setup) or to grab a packet capture and find the problematic message (which only works if they are not using TLS). It'd be awesome if we could capture and log the whole content of the problematic message: that would avoid messing around with tcpdump and would work even if TLS is enabled.

This kind of issue tends to be easy for the user to reproduce, so this capture-and-log thing would not need to happen all the time: we could instead consider something that can be enabled dynamically.
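
To make the idea concrete, here is a rough, standalone sketch of what "capture and log when enabled" could look like. It is not Elasticsearch code: the class `MessageCaptureSketch`, the `captureEnabled` flag, and the `deserialize` helper are all hypothetical names invented for illustration, and a real implementation would presumably hook into the transport layer and use a dynamic cluster setting rather than a static field.

```java
import java.nio.ByteBuffer;
import java.util.function.Consumer;

// Hypothetical illustration only; none of these names exist in Elasticsearch.
final class MessageCaptureSketch {

    // Stand-in for a dynamically updatable setting (assumption).
    private static volatile boolean captureEnabled = false;

    static void setCaptureEnabled(boolean enabled) {
        captureEnabled = enabled;
    }

    // Render the raw bytes as lowercase hex so they can be pasted into a log line.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    // Run the reader over the message, then check that it consumed every byte.
    // If not, fail with a message that includes the full content when capture is on.
    static void deserialize(byte[] message, Consumer<ByteBuffer> reader) {
        ByteBuffer buffer = ByteBuffer.wrap(message);
        reader.accept(buffer);
        if (buffer.hasRemaining()) {
            String detail = "Message not fully read: consumed [" + buffer.position()
                + "] of [" + message.length + "] bytes";
            if (captureEnabled) {
                detail += ", content [" + toHex(message) + "]";
            }
            throw new IllegalStateException(detail);
        }
    }

    public static void main(String[] args) {
        setCaptureEnabled(true);
        // Five bytes on the wire, but the reader only expects a 4-byte int.
        byte[] message = {0x00, 0x00, 0x00, 0x2a, 0x7f};
        try {
            deserialize(message, buf -> buf.getInt());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the flag off, the failure looks like today's exception; with it on, the same failure carries the bytes needed to investigate offline, without tcpdump and regardless of TLS, because the capture happens after decryption.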

Labels

:Distributed Coordination/Network, Team:Distributed (Obsolete), help wanted, adoptme
