
upstream: Null-deref if conn closed and TCP health timeout pending#6422

Closed
andrewjjenkins wants to merge 1 commit into envoyproxy:master from andrewjjenkins:fix-ossfuzz-11100

Conversation

@andrewjjenkins
Contributor

Description: Fixes Null-deref in TCP Health Monitor found by ossfuzz
Risk Level: Low, only implementation change is a nullptr check
Testing: Added unit test that reproduces the backtrace without the fix
Docs Changes: None
Release Notes: N/A (I'm new here - do we typically relnote fuzz bugs?)

Fixes no-longer-embargoed ossfuzz 11100:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=11100&q=envoy&colspec=ID%20Type%20Component%20Status%20Proj%20Reported%20Owner%20Summary

If a TCP health checker connection is closed but the timeout timer (timeout_timer_) fires anyway (for instance, because it was still pending at close time), then TcpHealthCheckerImpl::TcpActiveHealthCheckSession::onTimeout() attempts to dereference client_, which is nullptr.

We need a guard in onTimeout(), similar to the guard in HttpHealthCheckerImpl::HttpActiveHealthCheckSession::onTimeout(). I added a unit test following the pattern of HttpHealthCheckerImplTest.TimeoutAfterDisconnect that reproduces the ossfuzz backtrace without the proposed fix and does not backtrace with it.

Signed-off-by: Andrew Jenkins <andrew.jenkins@volunteers.acasi.info>
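The originally proposed guard can be sketched as follows. This is a minimal standalone model of the race, not Envoy's actual classes: FakeClient and TcpSessionSketch are hypothetical stand-ins for the real connection and session types.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the upstream client connection.
struct FakeClient {
  bool closed = false;
  void close() { closed = true; }
};

// Models the session: it owns a client and has a pending timeout. If the
// connection is torn down before the timeout fires, onTimeout() must not
// touch client_.
struct TcpSessionSketch {
  std::unique_ptr<FakeClient> client_ = std::make_unique<FakeClient>();
  bool timed_out = false;

  void onDisconnect() { client_.reset(); }  // client_ becomes nullptr

  // The proposed guard: mirror the HTTP health checker and bail out when the
  // connection is already gone instead of dereferencing a null client_.
  void onTimeout() {
    if (!client_) {
      return;  // connection already closed; nothing to time out
    }
    timed_out = true;
    client_->close();
  }
};
```

Without the `if (!client_)` check, calling onTimeout() after onDisconnect() is exactly the null dereference the fuzzer found.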
@andrewjjenkins
Contributor Author

This is regarding issue #4709 (noting so there's a backref)

Member

@mattklein123 left a comment

Thanks for working on this and fixing. I would suggest a slightly different fix, which is to disable the timer when disconnection happens vs checking for nullptr here. This is more consistent with what we do elsewhere. Thank you!

/wait

@andrewjjenkins
Contributor Author

Thanks for taking a look. I appreciate your suggestion: it'd be better to disable the dangling timer in the first place instead of waiting for it to fire and ignoring it. That's going to be a bit more work, especially as I'm still wrapping my head around the timer/client interactions. Here's what I'm thinking; let me know if I'm headed in the wrong direction.

  1. Find places in the TcpHealthChecker where we are disconnecting / setting client_ to nullptr: looks like std::move() calls in disconnect. Disable timeout_timer_ there.
  2. Update my proposed unit test to check that timeout_timer_ is disabled (not that we can safely call timeout_timer_->callback())
  3. Repeat step 1 for the HttpHealthChecker that already has this if (client_) nullptr guard.
  4. Repeat step 2 for the HttpHealthCheckerImplTest.TimeoutAfterDisconnect

Basically should I make both HTTP and TCP health checkers consistently prefer disabling the timer over nullptr checks?

@mattklein123
Member

@andrewjjenkins yup that's right. One trick we do in unit tests is to call timers_->disableTimer() before resetting, which allows for easier mocking.

@andrewjjenkins
Contributor Author

andrewjjenkins commented Apr 5, 2019

@mattklein123 Update: I went back and tried to find all paths where the TcpHealthChecker client is closed but timeout_timer_ is not disabled. I found one: this particular crash is triggered by a Network::ConnectionEvent::LocalClose event on the TCP socket, via a path outside of the HealthCheckerImpls. The specific sequence:

  1. The health check interval fires and the health checker creates a new client_; the connection begins. The timeout timer is started and waiting in the background.

  2. The health checker's config causes us to attempt to set an invalid socket option (in this case, TCP_KEEPIDLE to 0, by having a config with upstream_connection_options.tcp_keepalive.keepalive_time: {}):

     setsockopt(53, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
     setsockopt(53, SOL_TCP, TCP_KEEPIDLE, [0], 4) = -1 EINVAL (Invalid argument)

  3. ClientConnectionImpl schedules an immediate_error_event_ = ConnectionEvent::LocalClose.

  4. When the immediate_error_event_ is handled, TcpHealthCheckerImpl::TcpActiveHealthCheckSession::onEvent is invoked. Since this is a LocalClose, std::move(client_) is invoked, so client_ is now nullptr. But timeout_timer_ is still waiting.

  5. Some time later, timeout_timer_ fires, onTimeout runs, and nullptr-derefs on client_.

So it seems the TcpHealthCheckerImpl code assumes that all LocalCloses are triggered from paths inside itself, and therefore that all of those paths can be relied on to disable timeout_timer_. But this bug has found one path where LocalClose is triggered from "outside" (setting the invalid sockopt).
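The trigger in step 2 can be reproduced in isolation: on Linux, the kernel requires TCP_KEEPIDLE to be at least 1, so setting it to 0 fails with EINVAL. This sketch is Linux-specific and try_keepidle_zero is a hypothetical helper name, not anything in Envoy.

```cpp
#include <cassert>
#include <cerrno>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Returns the errno from setting TCP_KEEPIDLE to 0, or -1 if no socket could
// be created. On Linux this is expected to yield EINVAL -- the same failure
// the fuzzer's empty keepalive_time config produced.
int try_keepidle_zero() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return -1;
  }
  int zero = 0;
  int rc = setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &zero, sizeof(zero));
  int result = (rc == -1) ? errno : 0;
  close(fd);
  return result;
}
```

Either way the call fails, which is what sends ClientConnectionImpl down the immediate_error_event_ path.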

I tried to trigger similar "outside" closes via other paths:

  • ulimit-ing the number of fds so that the health checker couldn't call socket() (this trips a RELEASE_ASSERT() instead),
  • shrinking /proc/sys/net/ipv4/ip_local_port_range so that time-wait sockets exhausted the ephemeral port range and connect() failed, causing RemoteClose.

Neither reproduces the crash, so I think it's only the immediate_error_event_ = Network::ConnectionEvent::LocalClose path that causes the problem.

So I'm evaluating these options:

  1. Update TcpHealthCheckerImpl::TcpActiveHealthCheckSession::onEvent to disable the timeout_timer_ even for LocalCloses. Try to minimize double-disabling of the timeout_timer_ for performance, by removing the other places where we disable timeout_timer_ immediately before closing.
  2. Update ClientConnection code to not make these sockopt issues a LocalClose. This feels like a much bigger and systematic change; I'm happy to do it but I'm unsure if this is correct.
  3. Find some way to differentiate between our own LocalCloses inside of TcpHealthCheckerImpl and those from the connection layer (that seems to be violating some abstraction by teaching TcpHealthCheckerImpl too much - ideally it shouldn't care where the close came from).

I'll start heading down the path of choice 1 unless you have other thoughts. Thanks for working with me on this!
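Option (1) amounts to making onEvent() the single owner of timer teardown, regardless of where the close originated. A minimal sketch, with the enum and session shape as simplified stand-ins for Envoy's actual types:

```cpp
#include <cassert>

enum class ConnectionEvent { Connected, LocalClose, RemoteClose };

struct TimerSketch {
  bool enabled_ = false;
  void enableTimer() { enabled_ = true; }
  void disableTimer() { enabled_ = false; }
};

struct OnEventSketch {
  TimerSketch timeout_timer_;
  bool client_alive_ = false;

  void startCheck() {
    client_alive_ = true;
    timeout_timer_.enableTimer();
  }

  // Option (1): any close event -- whether the health checker initiated it or
  // the connection layer did (e.g. an immediate error from a bad sockopt) --
  // disables the pending timeout timer exactly once, here.
  void onEvent(ConnectionEvent event) {
    if (event == ConnectionEvent::LocalClose ||
        event == ConnectionEvent::RemoteClose) {
      timeout_timer_.disableTimer();
      client_alive_ = false;
    }
  }
};
```

Centralizing the disable here also allows removing the redundant per-call-site disables mentioned above.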

@mattklein123
Member

@andrewjjenkins sorry for the delay and thanks for the super detailed analysis. Yes I would definitely do (1). It will be the cleanest solution. I might consider doing it for the HTTP HC also, but not a big deal since that code has been that way for a while. Thank you!

@stale

stale bot commented Apr 15, 2019

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

The stale bot added the stale label (stalebot believes this issue/PR has not been touched recently) on Apr 15, 2019.
@stale

stale bot commented Apr 22, 2019

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot closed this Apr 22, 2019
@andrewjjenkins
Contributor Author

@mattklein123 sorry for disappearing - I have an update, could you please reopen? (I can also open a new one if you'd prefer)

I explored disabling the timeout timer in onEvent() and there are two downsides:

  1. The onEvent(Connection::LocalClose) path is also invoked via TcpHealthCheckerImpl::TcpActiveHealthCheckSession::~TcpActiveHealthCheckSession() and via healthy closes (for example, a health check completes but reuse_connection_ is false), so I end up needing a good deal of unit-test surgery to EXPECT_CALL() for things only invoked via the destructor or on the happy path (along with the double-disable on every happy-path health check).

  2. This approach treats a LocalClose as an unhandled halt - no change to health state, and the health check is never re-attempted (because we never called handleFailure or handleSuccess and therefore never started interval_timer_). If a system call to setsockopt() fails, I think it might make more sense to treat it via handleFailure, instead of never probing again.

So I've got some code that instead treats LocalCloses that aren't from ourselves as failures, causing us to safely retry after the interval. Of course, no guarantee that anything will change and we'll succeed when we retry.
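One way to sketch that idea: track whether the session itself initiated the close, and treat any unexpected LocalClose as a failed probe that re-arms the interval timer. The expect_close_ flag and the helper names here are hypothetical illustrations, not Envoy's actual members.

```cpp
#include <cassert>

struct RetryTimer {
  bool enabled_ = false;
  void enableTimer() { enabled_ = true; }
  void disableTimer() { enabled_ = false; }
};

struct FailureSketch {
  RetryTimer timeout_timer_;
  RetryTimer interval_timer_;
  bool expect_close_ = false;  // true when we initiated the close ourselves
  int failures_ = 0;

  void startProbe() { timeout_timer_.enableTimer(); }

  // Our own, benign close (e.g. probe done and reuse_connection_ is false).
  void closeConnection() {
    expect_close_ = true;
    onClose();
  }

  void onClose() {
    timeout_timer_.disableTimer();
    if (!expect_close_) {
      // Close came from "outside" (e.g. a failed setsockopt): count it as a
      // failed health check and schedule a retry after the interval, instead
      // of silently never probing again.
      ++failures_;
      interval_timer_.enableTimer();
    }
    expect_close_ = false;
  }
};
```

An external close thus marks the host unhealthy and re-arms the interval timer, while a self-initiated close records no failure.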

What do you think? I'm happy to share what I've got and get some feedback and see what to do next.

@mattklein123
Member

> What do you think? I'm happy to share what I've got and get some feedback and see what to do next.

For some reason I can't reopen this PR. Can you open a fresh one and we can discuss in code review? I think that will be easier.


Labels: stale (stalebot believes this issue/PR has not been touched recently), waiting

2 participants