
[v17] fix(app): prevent premature removal of app servers from session cache#59955

Merged: tigrato merged 2 commits into `branch/v17` from `bot/backport-59900-branch/v17` on Oct 6, 2025.

tigrato commented Oct 6, 2025

Backport #59900 to branch/v17

changelog: Fixed an issue where temporarily unreachable app servers were permanently removed from the session cache, causing persistent connection failures with the error "no application servers remaining to connect".

When a user accesses an app and a new session (i.e. an app server
certificate) is created, the Teleport Proxy caches a `session` object
for the entire lifetime of that certificate.

The session context includes the list of app servers available when the
session is created. This list is static and is used to dial the target
app service via reverse tunnel. This is already suboptimal, as new app
servers (app services with a different `host_id`) that join later are
ignored. The static list exists to provide affinity, so that the same
proxy keeps forwarding a user's requests to the same app service.
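The snapshot semantics described above can be illustrated by copying the registered servers into the session at creation time. This is a simplified sketch under assumptions: `registry`, `session`, and `newSession` are hypothetical names, and the real proxy builds its list from cluster state rather than a plain slice.

```go
package main

import "fmt"

// registry is a hypothetical view of the app servers currently registered
// for one app, identified by host_id.
type registry struct {
	hostIDs []string
}

// session holds the server list captured when the session was created.
type session struct {
	servers []string
}

// newSession copies the registry so later changes are not seen by the session.
func newSession(r *registry) *session {
	servers := make([]string, len(r.hostIDs))
	copy(servers, r.hostIDs)
	return &session{servers: servers}
}

func main() {
	r := &registry{hostIDs: []string{"host-a", "host-b"}}
	s := newSession(r)
	r.hostIDs = append(r.hostIDs, "host-c") // a new app service joins later
	fmt.Println(s.servers)                  // [host-a host-b]
	fmt.Println(r.hostIDs)                  // [host-a host-b host-c]
}
```

The session never learns about `host-c`, which is the limitation noted above; the fixed list is what lets repeated requests land on the same app service.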

The core issue lies in a piece of code that this PR removes. That code
prunes app servers from the session list if the proxy fails to connect
to them.
This is problematic because an app server might be only temporarily
offline (e.g. it restarted or dropped its connection), yet it is removed
from the session permanently. Even if it later reconnects to the
cluster, the session no longer includes it, and once the list of servers
is empty, all future requests fail.

This issue is especially harmful for machine clients that automatically
retry on disconnection. As the retries continue, the session prunes
every server, leaving it in a permanent failure state. Since the session
is cached for the full certificate duration, the only workaround is to
manually restart the Teleport Proxy to refresh the session state.

This PR removes the session pruning logic as a temporary fix, allowing
clients to retry dialing previously failed app servers once they
reconnect during the session.
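With pruning gone, a failed dial leaves the session's list untouched, so a server that comes back is tried again on the next request. This sketch reuses the same hypothetical model as a standalone program; `session`, `dial`, the `up` map, and `errAllUnreachable` with its message are assumptions for illustration, not Teleport identifiers.

```go
package main

import (
	"errors"
	"fmt"
)

// errAllUnreachable is a hypothetical error used only in this sketch.
var errAllUnreachable = errors.New("no app server in the session answered")

type session struct {
	servers []string // captured at session creation, never pruned
}

// dial tries every captured server and does not modify the list on failure.
func (s *session) dial(up map[string]bool) (string, error) {
	for _, hostID := range s.servers {
		if up[hostID] {
			return hostID, nil
		}
	}
	return "", errAllUnreachable
}

func main() {
	s := &session{servers: []string{"host-a"}}
	up := map[string]bool{"host-a": false} // host-a is restarting

	_, err := s.dial(up)
	fmt.Println("while down:", err)

	up["host-a"] = true // host-a reconnects to the cluster
	host, err := s.dial(up)
	fmt.Println("after reconnect:", host, err) // after reconnect: host-a <nil>
}
```

The list is still static, so servers that join later remain invisible; the fix only stops the list from shrinking.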

Signed-off-by: Tiago Silva <tiago.silva@goteleport.com>
tigrato enabled auto-merge October 6, 2025 12:38
tigrato added this pull request to the merge queue Oct 6, 2025
Merged via the queue into branch/v17 with commit b503ef6 Oct 6, 2025
39 checks passed
tigrato deleted the bot/backport-59900-branch/v17 branch October 6, 2025 13:30
fheinecke mentioned this pull request Oct 15, 2025