[FIXED] Gateway message tracing hop header duplication#7442
johnweldon wants to merge 2 commits into nats-io:main
Conversation
Force-pushed from c5d96d3 to 1f70d0d
When a message with tracing enabled is routed to multiple gateways simultaneously, all gateways should receive the same hop count value. This test demonstrates a bug where setHopHeader() was called once per gateway (inside the routing loop), causing each gateway to receive a different hop count and corrupting message headers.

The tests verify:
- All gateways receive messages with identical hop counts
- Both at the client API level (TestGatewayMsgTraceDuplicateHopHeader)
- And at the raw HMSG protocol level (TestGatewayMsgTraceDuplicateHopHeaderRawProtocol)

Without the corresponding fix, these tests fail because the first gateway receives hop "1", the second receives hop "2", and so on, instead of all receiving the same value. This reproduces the production issue where standalone "1" characters appeared in message headers, caused by multiple setHopHeader() calls corrupting header boundaries and creating duplicate trace headers.

Signed-off-by: John Weldon <jweldon@nats.io>
Fixes a bug where setHopHeader() was called inside the gateway routing loop, causing it to be invoked once per gateway instead of once per message. This resulted in different gateways receiving different hop counts and corrupted message headers.

The bug manifested in production as intermittent malformed headers with standalone "1" characters appearing between headers:

    type: ResourceStatus\r\n1\r\ntype: ResourceStatus

Root cause: multiple setHopHeader() calls on the same message buffer:
- First call: adds "Nats-Trace-Hop: 1" but corrupts the previous header's \r\n
- Second call: fails to remove the old hop header (removal requires a \n before the header name, which is missing due to the corruption)
- Result: duplicate hop headers, missing newlines, parser confusion

The fix moves setHopHeader() before the gateway loop so it is called exactly once. All gateways now receive identical message content with the same hop count. The existing defer statement properly restores c.pa state after the loop completes.

This matches the pattern used in route handling (client.go:5342-5345) and aligns with the comment at client.go:4441: "If others will use this later we need to save and restore original."

Resolves an issue reported in production (NATS Server 2.11.8) where clients could not parse response bodies due to header corruption in multi-gateway deployments with message tracing enabled.

Signed-off-by: John Weldon <jweldon@nats.io>
Force-pushed from 1f70d0d to 671d3a6
I'm thinking the reason for calling
I need to think about whether this needs to be inside the loop or not. If you look at the route/LN case, it is per route/LN and I do restore the pa at each iteration. It may be that this is what was missing in the gateway case.
Setting the "hop" header for each gateway could cause header corruption. This is now fixed. A test dealing with gateways has been improved to include more than one gateway, which would have demonstrated the issue. The test now passes and ensures that the hop is different per gateway. Related to #7442 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
I do believe that the hop must be set per gateway. This is the only way for a tool to reconstruct the proper routing. I have submitted a PR with the fix (similar to what we do for routes/leaf) and extended the existing gateway trace test to use at least 2 gateways (instead of a single one), which demonstrated the issue and now passes. The test does verify that the hop is different for each gateway.
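The save-and-restore pattern this comment refers to (restoring the pa at each iteration, as in the route/LN case) can be sketched roughly as follows. parseState and hopsPerGateway are hypothetical stand-ins for the server's c.pa handling, not actual NATS code:

```go
package main

import "fmt"

// parseState stands in for the server's per-message parse state (c.pa);
// the single field here is illustrative, not the real layout.
type parseState struct {
	hop string
}

// hopsPerGateway shows the per-iteration save/restore pattern: set a
// per-gateway hop, use it for that send, then restore the saved state
// so the next iteration starts from the original, uncorrupted message.
func hopsPerGateway(pa parseState, gateways int) []string {
	saved := pa // save original state before the gateway loop
	hops := make([]string, 0, gateways)
	for gw := 1; gw <= gateways; gw++ {
		pa.hop = fmt.Sprint(gw) // each gateway gets its own hop value
		hops = append(hops, pa.hop)
		pa = saved // restore before the next iteration
	}
	return hops
}

func main() {
	// Each gateway sees a distinct hop, and no state leaks across sends.
	fmt.Println(hopsPerGateway(parseState{}, 3)) // → [1 2 3]
}
```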
Thank you all |
Summary
Fixes a critical bug in gateway message routing where the hop header was added multiple times (once per gateway) instead of once per message, causing header corruption and message parsing failures.
Problem
In production environments (NATS Server 2.11.8) with message tracing enabled and multi-gateway deployments, intermittent malformed headers appeared:
type: ResourceStatus\r\n1\r\ntype: ResourceStatus\r\nNats-Trace-Hop: 2.1.1.1
This caused clients to fail parsing response bodies, breaking message delivery.
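Why a stray "1" between headers breaks clients can be shown with a minimal sketch. parseHeaders below is an illustrative parser, not the actual NATS client implementation: it expects "Key: Value" lines, so a bare "1" with no colon is rejected and the whole header block fails to parse.

```go
package main

import (
	"fmt"
	"strings"
)

// parseHeaders is a naive "Key: Value" header parser used only to
// illustrate the failure mode; it is not the NATS client's parser.
func parseHeaders(block string) (map[string]string, error) {
	h := map[string]string{}
	for _, line := range strings.Split(block, "\r\n") {
		if line == "" {
			continue
		}
		k, v, ok := strings.Cut(line, ":")
		if !ok {
			return nil, fmt.Errorf("malformed header line: %q", line)
		}
		h[strings.TrimSpace(k)] = strings.TrimSpace(v)
	}
	return h, nil
}

func main() {
	// The corrupted sequence reported in production: a bare "1" landed
	// between two header lines.
	corrupted := "type: ResourceStatus\r\n1\r\ntype: ResourceStatus\r\nNats-Trace-Hop: 2.1.1.1"
	if _, err := parseHeaders(corrupted); err != nil {
		fmt.Println("parse failed:", err) // the bare "1" line is rejected
	}
}
```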
Root Cause
The setHopHeader() call was inside the gateway routing loop in sendMsgToGateways(), executing once per outbound gateway instead of once per message. Multiple calls on the same message buffer caused:

- The first call to corrupt the previous header's \r\n terminator
- The second call to fail to remove the old hop header (removal requires a \n before the header name, but it was missing due to the corruption)
- Duplicate hop headers and missing newlines that confused the parser

Solution
Moved setHopHeader() before the gateway loop (line 2568) so it's called exactly once per message. All gateways now receive identical message content with the same hop count. The existing defer statement properly restores c.pa state.

This matches the pattern used in route handling and aligns with the comment at client.go:4441: "If others will use this later we need to save and restore original."

Testing
Added two comprehensive tests:

- TestGatewayMsgTraceDuplicateHopHeader - validates hop count consistency at the client API level
- TestGatewayMsgTraceDuplicateHopHeaderRawProtocol - validates hop count consistency at the raw HMSG protocol level

Both tests fail without the fix (showing different hop counts per gateway) and pass with it.
Impact
Signed-off-by: John Weldon jweldon@nats.io