Description
Occasionally we come across a serialization bug, particularly when nodes of multiple versions are involved. Here is a report of an issue in a cross-cluster search scenario involving 6.5.1 nodes, 5.6.2 nodes, and indices dating all the way back to 2.x. The exception we get is not very helpful:
[2019-02-14T23:53:52,630][WARN ][o.e.t.n.Netty4Transport ] [IK-PRD-M3] exception caught on transport layer [[id: 0xd97a9d8c, L:/10.10.1.184:51594 - R:10.10.1.166/10.10.1.166:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [7719647], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.transport.TransportActionProxy$ProxyResponseHandler@7f2fcd88], error [false]; resetting
Today our two best options for diagnosing this are to reproduce it (often tricky without the user's exact setup) or to grab a packet capture and find the problematic message (which only works if they are not using TLS). It would help a lot if we could capture and log the full content of the problematic message: that would avoid messing around with tcpdump, and it would work even when TLS is enabled.
This kind of issue tends to be easy for the user to reproduce, so the capture-and-log mechanism would not need to be active all the time: it could instead be something that is enabled dynamically.
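To make the idea concrete, here is a minimal sketch of what such a capture-and-log hook might look like. It is not existing Elasticsearch code: the class `MessageCaptureSketch`, the `CAPTURE_ENABLED` flag (a stand-in for a dynamically updatable setting), and the `describeIfNotFullyRead` method are all hypothetical names, and Base64 is just one possible encoding for getting the raw bytes into a log line.

```java
import java.util.Base64;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper for illustration only; not part of Elasticsearch.
final class MessageCaptureSketch {

    // Stand-in for a dynamic setting: can be flipped at runtime without a restart.
    static final AtomicBoolean CAPTURE_ENABLED = new AtomicBoolean(false);

    /**
     * Returns null if the handler consumed the whole message. Otherwise returns
     * a description of the mismatch, including the full message content encoded
     * as Base64 when capture is enabled.
     */
    static String describeIfNotFullyRead(long requestId, byte[] message, int bytesConsumed) {
        if (bytesConsumed == message.length) {
            return null; // fully read, nothing to report
        }
        String summary = "Message not fully read for requestId [" + requestId
            + "], consumed [" + bytesConsumed + "] of [" + message.length + "] bytes";
        if (!CAPTURE_ENABLED.get()) {
            return summary; // capture is off: report the mismatch only
        }
        // Capture is on: attach the raw bytes so they can be decoded offline.
        return summary + ", content [" + Base64.getEncoder().encodeToString(message) + "]";
    }
}
```

With something along these lines, a user who can reproduce the problem would switch capture on, trigger the failure once, and send us the log line. We could then decode the bytes and replay them against the relevant deserialization code, with no packet capture needed and regardless of TLS.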