
[FIXED] Gateway message tracing hop header duplication#7442

Closed
johnweldon wants to merge 2 commits into nats-io:main from johnweldon:jw4/gateway-trace-header-bug

Conversation

@johnweldon
Member

Summary

Fixes a critical bug in gateway message routing where the hop header was added multiple times (once per gateway) instead of once per message, causing header corruption and message parsing failures.

Problem

In production environments (NATS Server 2.11.8) with message tracing enabled and multi-gateway deployments, intermittent malformed headers appeared:

type: ResourceStatus\r\n1\r\ntype: ResourceStatus\r\nNats-Trace-Hop: 2.1.1.1

This caused clients to fail parsing response bodies, breaking message delivery.

Root Cause

The setHopHeader() call was inside the gateway routing loop in sendMsgToGateways(), executing once per outbound gateway instead of once per message. Multiple calls on the same message buffer caused:

  1. First call: Added "Nats-Trace-Hop: 1" but corrupted the previous header's trailing \r\n
  2. Second call: Failed to remove the old hop header (header removal requires \n before the header name, but it was missing due to corruption)
  3. Result: Duplicate hop headers, missing newlines, and parser confusion where the "1" from the hop count bled into the wrong position
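The failure mode in step 2 can be sketched with a toy model (this is not the server's implementation; `removeHeader` and the header strings below are illustrative): once the `\r\n` preceding the stale hop header has been clobbered, a `\n`-anchored removal no longer finds it, so the stale header survives and the next append duplicates it.

```go
package main

import (
	"fmt"
	"strings"
)

// removeHeader is a toy model (not server code) of '\n'-anchored header
// removal: the header is only found when a '\n' immediately precedes its
// name, mirroring the constraint described in step 2 above.
func removeHeader(hdr, name string) string {
	i := strings.Index(hdr, "\n"+name)
	if i < 0 {
		return hdr // not found: the stale header survives
	}
	end := strings.Index(hdr[i+1:], "\n")
	if end < 0 {
		return hdr[:i]
	}
	// Drop the matched line; hdr[i+1+end:] starts at the '\n' that
	// terminates it, so the surrounding CRLF framing is preserved.
	return hdr[:i] + hdr[i+1+end:]
}

func main() {
	// Intact headers: removal finds and drops the old hop header.
	good := "NATS/1.0\r\ntype: ResourceStatus\r\nNats-Trace-Hop: 1\r\n\r\n"
	fmt.Printf("%q\n", removeHeader(good, "Nats-Trace-Hop:"))

	// Corrupted buffer: the "\r\n" before the hop header was clobbered,
	// so the '\n'-anchored search misses it and the header is left
	// behind, ready to be duplicated by the next append.
	bad := "NATS/1.0\r\ntype: ResourceStatus Nats-Trace-Hop: 1\r\n\r\n"
	fmt.Printf("%q\n", removeHeader(bad, "Nats-Trace-Hop:"))
}
```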

Solution

Moved setHopHeader() before the gateway loop (line 2568) so it's called exactly once per message. All gateways now receive identical message content with the same hop count. The existing defer statement properly restores c.pa state.

This matches the pattern used in route handling and aligns with the comment at client.go:4441:

"If others will use this later we need to save and restore original."
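The control-flow change this PR proposes can be modeled in miniature (the gateway names and maps below are illustrative only; no real server code or types are used): stamping the hop inside the loop makes each outbound copy of the same message diverge, while stamping once before the loop gives every gateway the same value.

```go
package main

import "fmt"

// hopsPerGateway mimics the buggy placement: the hop header is stamped
// inside the gateway loop, so each gateway sees a different value on
// what should be one and the same message.
func hopsPerGateway(gateways []string) map[string]int {
	seen := make(map[string]int)
	hop := 0
	for _, gw := range gateways {
		hop++ // setHopHeader() inside the loop: shared buffer mutated per gateway
		seen[gw] = hop
	}
	return seen
}

// hopOnce mimics this PR's proposed change: the hop is stamped once,
// before the loop, so every gateway receives identical content.
func hopOnce(gateways []string) map[string]int {
	seen := make(map[string]int)
	hop := 1 // setHopHeader() before the loop: one mutation per message
	for _, gw := range gateways {
		seen[gw] = hop
	}
	return seen
}

func main() {
	gws := []string{"gw-A", "gw-B", "gw-C"}
	fmt.Println(hopsPerGateway(gws)) // gateways see diverging hop values
	fmt.Println(hopOnce(gws))        // every gateway sees the same value
}
```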

Testing

Added two comprehensive tests:

  • TestGatewayMsgTraceDuplicateHopHeader - Validates hop count consistency at the client API level
  • TestGatewayMsgTraceDuplicateHopHeaderRawProtocol - Validates hop count consistency at the raw HMSG protocol level

Both tests fail without the fix (showing different hop counts per gateway) and pass with it.

Impact

  • Severity: High - causes message parsing failures in production
  • Scope: All multi-gateway deployments with message tracing enabled
  • Fix: A single moved line; minimal, low-risk change

Signed-off-by: John Weldon jweldon@nats.io

@johnweldon johnweldon requested a review from a team as a code owner on October 18, 2025 03:15
@johnweldon
Member Author

@kozlovic @wallyqs

@johnweldon johnweldon force-pushed the jw4/gateway-trace-header-bug branch from c5d96d3 to 1f70d0d on October 18, 2025 03:26
When a message with tracing enabled is routed to multiple gateways
simultaneously, all gateways should receive the same hop count value.
This test demonstrates a bug where setHopHeader() was being called
once per gateway (inside the routing loop), causing each gateway to
receive a different hop count and corrupting message headers.

The tests verify:
- All gateways receive messages with identical hop counts
- Both at the client API level (TestGatewayMsgTraceDuplicateHopHeader)
- And at the raw HMSG protocol level (TestGatewayMsgTraceDuplicateHopHeaderRawProtocol)

Without the corresponding fix, these tests fail because the first gateway
receives hop "1", the second receives hop "2", etc., instead of all
receiving the same value.

This reproduces the production issue where standalone "1" characters
appeared in message headers, caused by multiple setHopHeader() calls
corrupting header boundaries and creating duplicate trace headers.

Signed-off-by: John Weldon <jweldon@nats.io>
Fixes a bug where setHopHeader() was called inside the gateway routing
loop, causing it to be invoked once per gateway instead of once per
message. This resulted in different gateways receiving different hop
counts and corrupted message headers.

The bug manifested in production as intermittent malformed headers with
standalone "1" characters appearing between headers:
  type: ResourceStatus\r\n1\r\ntype: ResourceStatus

Root cause: Multiple setHopHeader() calls on the same message buffer:
- First call: Adds "Nats-Trace-Hop: 1" but corrupts previous header's \r\n
- Second call: Fails to remove old hop header (requires \n before header)
- Result: Duplicate hop headers, missing newlines, parser confusion

The fix moves setHopHeader() before the gateway loop so it's called
exactly once. All gateways now receive identical message content with
the same hop count. The existing defer statement properly restores
c.pa state after the loop completes.

This matches the pattern used in route handling (client.go:5342-5345)
and aligns with the comment at client.go:4441:
  "If others will use this later we need to save and restore original."

Resolves issue reported in production (NATS Server 2.11.8) where
clients could not parse response bodies due to header corruption in
multi-gateway deployments with message tracing enabled.

Signed-off-by: John Weldon <jweldon@nats.io>
@johnweldon johnweldon force-pushed the jw4/gateway-trace-header-bug branch from 1f70d0d to 671d3a6 on October 18, 2025 03:28
@MauriceVanVeen
Member

I'm thinking the reason for calling setHopHeader per gateway is so that each gateway counts as a separate hop, not just as one? In that case, c.setHeader would probably need to be adjusted instead of moving the setHopHeader call.

@kozlovic
Member

I need to think about whether this needs to be inside the loop or not. If you look at the route/LN case, it is per route/LN and I do restore the pa at each iteration. It may be that this is what was missing in the gateway case.

kozlovic added a commit that referenced this pull request Oct 18, 2025
Setting the "hop" header for each gateway could cause header
corruption. This is now fixed.

A test dealing with gateway has been improved to include more than
one gateway, which would have demonstrated the issue. The test now
passes and ensures that the hop is different per gateway.

Related to #7442

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@kozlovic
Member

I do believe that the hop must be set per gateway; this is the only way for a tool to reconstruct the proper routing. I have submitted a PR with the fix (similar to what we do for routes/leaf) and extended the existing gateway trace test to use at least 2 gateways (instead of a single one), which demonstrated the issue and now passes. The test does verify that the hop is different for each gateway.
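The save/restore-per-iteration pattern described here can be sketched as follows (a hypothetical simplification: `parseState`, `sendToGateways`, and the header strings are illustrative, not the server's types). Each gateway still gets its own hop value, as tracing requires, but every mutation starts from a saved copy of the original state, so nothing compounds on a shared buffer.

```go
package main

import "fmt"

// parseState stands in for the per-message parse state (c.pa in the
// real server) that the loop mutates.
type parseState struct{ hdr string }

// sendToGateways illustrates the save/restore-per-iteration pattern
// used for routes/leafnodes: save the original parse state once,
// restore it at the top of each iteration, then set a per-gateway hop.
// Hops differ per gateway, but each is applied to a clean copy.
func sendToGateways(gateways []string, pa parseState) []string {
	orig := pa // save the original state once
	out := make([]string, 0, len(gateways))
	for i, gw := range gateways {
		pa = orig // restore before mutating for this gateway
		pa.hdr = fmt.Sprintf("%sNats-Trace-Hop: %d\r\n", pa.hdr, i+1)
		out = append(out, gw+" <- "+pa.hdr)
	}
	return out
}

func main() {
	msgs := sendToGateways([]string{"gw-A", "gw-B"},
		parseState{hdr: "type: ResourceStatus\r\n"})
	for _, m := range msgs {
		fmt.Printf("%q\n", m) // distinct hops, no duplicated headers
	}
}
```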

@johnweldon
Member Author

Thank you all

@johnweldon johnweldon closed this Oct 20, 2025
neilalexander pushed a commit that referenced this pull request Oct 28, 2025
neilalexander pushed a commit that referenced this pull request Oct 30, 2025
neilalexander pushed a commit that referenced this pull request Oct 30, 2025
