
upstream: Null-deref if conn closed and TCP health timeout pending#6422

Closed
andrewjjenkins wants to merge 1 commit into envoyproxy:master from andrewjjenkins:fix-ossfuzz-11100

Conversation

@andrewjjenkins
Contributor

Description: Fixes Null-deref in TCP Health Monitor found by ossfuzz
Risk Level: Low, only implementation change is a nullptr check
Testing: Added unit test that reproduces the backtrace without the fix
Docs Changes: None
Release Notes: N/A (I'm new here - do we typically relnote fuzz bugs?)

Fixes no-longer-embargoed ossfuzz 11100:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=11100&q=envoy&colspec=ID%20Type%20Component%20Status%20Proj%20Reported%20Owner%20Summary

If a TCP health checker connection is closed but the timeout timer (timeout_timer_) fires anyway (for instance, because it was still pending at close time), then TcpHealthCheckerImpl::TcpActiveHealthCheckSession::onTimeout() attempts to dereference client_, which is nullptr.

We need a guard in onTimeout(), similar to the guard in HttpHealthCheckerImpl::HttpActiveHealthCheckSession::onTimeout(). I added a unit test following the pattern of HttpHealthCheckerImplTest.TimeoutAfterDisconnect that reproduces the ossfuzz backtrace without the proposed fix and does not backtrace with it.

Signed-off-by: Andrew Jenkins <andrew.jenkins@volunteers.acasi.info>
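The originally proposed guard can be sketched as follows. This is a minimal standalone model of the race, not Envoy's actual classes: FakeClient and TcpSessionSketch are hypothetical stand-ins for the real connection and session types.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the upstream client connection.
struct FakeClient {
  bool closed = false;
  void close() { closed = true; }
};

// Models the session: it owns a client and has a pending timeout. If the
// connection is torn down before the timeout fires, onTimeout() must not
// touch client_.
struct TcpSessionSketch {
  std::unique_ptr<FakeClient> client_ = std::make_unique<FakeClient>();
  bool timed_out = false;

  void onDisconnect() { client_.reset(); }  // client_ becomes nullptr

  // The proposed guard: mirror the HTTP health checker and bail out when the
  // connection is already gone instead of dereferencing a null client_.
  void onTimeout() {
    if (!client_) {
      return;  // connection already closed; nothing to time out
    }
    timed_out = true;
    client_->close();
  }
};
```

Without the `if (!client_)` check, calling onTimeout() after onDisconnect() is exactly the null dereference the fuzzer found.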
@andrewjjenkins
Contributor Author

This is regarding issue #4709 (noting so there's a backref)

Member

@mattklein123 left a comment

Thanks for working on this and fixing. I would suggest a slightly different fix, which is to disable the timer when disconnection happens vs checking for nullptr here. This is more consistent with what we do elsewhere. Thank you!

/wait

@andrewjjenkins
Contributor Author

Thanks for taking a look. I appreciate your suggestion: it'd be better to disable the dangling timer in the first place instead of waiting for it to fire and ignoring it. That's going to be a bit more work, especially as I'm still wrapping my head around the timer/client interactions. Here's what I'm thinking; let me know if I'm headed in the wrong direction.

  1. Find places in the TcpHealthChecker where we are disconnecting / setting client_ to nullptr: looks like std::move() calls in disconnect. Disable timeout_timer_ there.
  2. Update my proposed unit test to check that timeout_timer_ is disabled (not that we can safely call timeout_timer_->callback())
  3. Repeat step 1 for the HttpHealthChecker that already has this if (client_) nullptr guard.
  4. Repeat step 2 for the HttpHealthCheckerImplTest.TimeoutAfterDisconnect

Basically should I make both HTTP and TCP health checkers consistently prefer disabling the timer over nullptr checks?

@mattklein123
Member

@andrewjjenkins yup that's right. One trick we do in unit tests is to call timers_->disableTimer() before resetting, which allows for easier mocking.

@andrewjjenkins
Contributor Author

andrewjjenkins commented Apr 5, 2019

@mattklein123 Update: I went back and tried to find all paths where the TcpHealthChecker client is closed but timeout_timer_ is not disabled. I found one: this particular crash is triggered by a Network::ConnectionEvent::LocalClose event on the TCP socket, via a path outside of the HealthCheckerImpls. The specific sequence:

  1. The health check interval fires and the health checker creates a new client_; the connection begins. The timeout timer is started and waiting in the background.

  2. The health checker's config causes us to attempt to set an invalid socket option (in this case, TCP_KEEPIDLE to 0, by having a config with upstream_connection_options.tcp_keepalive.keepalive_time: {}):

     setsockopt(53, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
     setsockopt(53, SOL_TCP, TCP_KEEPIDLE, [0], 4) = -1 EINVAL (Invalid argument)

  3. ClientConnectionImpl schedules an immediate_error_event_ = ConnectionEvent::LocalClose.

  4. When the immediate_error_event_ is handled, TcpHealthCheckerImpl::TcpActiveHealthCheckSession::onEvent is invoked. Since this is a LocalClose, std::move(client_) is invoked, so client_ is now nullptr. But timeout_timer_ is still waiting.

  5. Some time later, timeout_timer_ fires, onTimeout runs, and nullptr-derefs on client_.

So it seems the TcpHealthCheckerImpl code assumes that all LocalCloses are triggered from paths inside itself, and therefore that all of those paths can be relied on to disable timeout_timer_. But this bug has found one path where LocalClose is triggered from "outside" (setting the invalid sockopt).
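The trigger in step 2 can be reproduced in isolation: on Linux, the kernel requires TCP_KEEPIDLE to be at least 1, so setting it to 0 fails with EINVAL. This sketch is Linux-specific and try_keepidle_zero is a hypothetical helper name, not anything in Envoy.

```cpp
#include <cassert>
#include <cerrno>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Returns the errno from setting TCP_KEEPIDLE to 0, or -1 if no socket could
// be created. On Linux this is expected to yield EINVAL -- the same failure
// the fuzzer's empty keepalive_time config produced.
int try_keepidle_zero() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return -1;
  }
  int zero = 0;
  int rc = setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &zero, sizeof(zero));
  int result = (rc == -1) ? errno : 0;
  close(fd);
  return result;
}
```

Either way the call fails, which is what sends ClientConnectionImpl down the immediate_error_event_ path.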

I tried to trigger similar "outside" closes via other paths:

  • ulimit-ing the number of fds so that the health checker couldn't call socket() (this trips a RELEASE_ASSERT() instead),
  • shrinking /proc/sys/net/ipv4/ip_local_port_range so that time-wait sockets exhausted the ephemeral port range and connect() failed, causing RemoteClose.

Neither reproduces the crash, so I think it's only the immediate_error_event_ = Network::ConnectionEvent::LocalClose path that causes the problem.

So I'm evaluating these options:

  1. Update TcpHealthCheckerImpl::TcpActiveHealthCheckSession::onEvent to disable the timeout_timer_ even for LocalCloses. Try to minimize double-disabling of the timeout_timer_ for performance, by removing the other places where we disable timeout_timer_ immediately before closing.
  2. Update ClientConnection code to not make these sockopt issues a LocalClose. This feels like a much bigger and systematic change; I'm happy to do it but I'm unsure if this is correct.
  3. Find some way to differentiate between our own LocalCloses inside of TcpHealthCheckerImpl and those from the connection layer (that seems to be violating some abstraction by teaching TcpHealthCheckerImpl too much - ideally it shouldn't care where the close came from).

I'll start heading down the path of choice 1 unless you have other thoughts. Thanks for working with me on this!
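Option (1) amounts to making onEvent() the single owner of timer teardown, regardless of where the close originated. A minimal sketch, with the enum and session shape as simplified stand-ins for Envoy's actual types:

```cpp
#include <cassert>

enum class ConnectionEvent { Connected, LocalClose, RemoteClose };

struct TimerSketch {
  bool enabled_ = false;
  void enableTimer() { enabled_ = true; }
  void disableTimer() { enabled_ = false; }
};

struct OnEventSketch {
  TimerSketch timeout_timer_;
  bool client_alive_ = false;

  void startCheck() {
    client_alive_ = true;
    timeout_timer_.enableTimer();
  }

  // Option (1): any close event -- whether the health checker initiated it or
  // the connection layer did (e.g. an immediate error from a bad sockopt) --
  // disables the pending timeout timer exactly once, here.
  void onEvent(ConnectionEvent event) {
    if (event == ConnectionEvent::LocalClose ||
        event == ConnectionEvent::RemoteClose) {
      timeout_timer_.disableTimer();
      client_alive_ = false;
    }
  }
};
```

Centralizing the disable here also allows removing the redundant per-call-site disables mentioned above.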

@mattklein123
Member

@andrewjjenkins sorry for the delay and thanks for the super detailed analysis. Yes I would definitely do (1). It will be the cleanest solution. I might consider doing it for the HTTP HC also, but not a big deal since that code has been that way for a while. Thank you!

@stale

stale bot commented Apr 15, 2019

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

The stale bot added the stale label (stalebot believes this issue/PR has not been touched recently) on Apr 15, 2019.
@stale

stale bot commented Apr 22, 2019

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot closed this Apr 22, 2019
@andrewjjenkins
Contributor Author

@mattklein123 sorry for disappearing - I have an update, could you please reopen? (I can also open a new one if you'd prefer)

I explored disabling the timeout timer in onEvent() and there are two downsides:

  1. The onEvent(Connection::LocalClose) path is also invoked via TcpHealthCheckerImpl::TcpActiveHealthCheckSession::~TcpActiveHealthCheckSession() and via healthy closes (for example, a health check completes but reuse_connection_ is false), so I end up needing a good deal of unit-test surgery to EXPECT_CALL() for things only invoked via the destructor or on the happy path (along with the double-disable on every happy-path health check).

  2. This approach treats a LocalClose as an unhandled halt - no change to health state, and the health check is never re-attempted (because we never called handleFailure or handleSuccess and therefore never started interval_timer_). If a system call to setsockopt() fails, I think it might make more sense to treat it via handleFailure, instead of never probing again.

So I've got some code that instead treats LocalCloses that aren't from ourselves as failures, causing us to safely retry after the interval. Of course, no guarantee that anything will change and we'll succeed when we retry.
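One way to sketch that idea: track whether the session itself initiated the close, and treat any unexpected LocalClose as a failed probe that re-arms the interval timer. The expect_close_ flag and the helper names here are hypothetical illustrations, not Envoy's actual members.

```cpp
#include <cassert>

struct RetryTimer {
  bool enabled_ = false;
  void enableTimer() { enabled_ = true; }
  void disableTimer() { enabled_ = false; }
};

struct FailureSketch {
  RetryTimer timeout_timer_;
  RetryTimer interval_timer_;
  bool expect_close_ = false;  // true when we initiated the close ourselves
  int failures_ = 0;

  void startProbe() { timeout_timer_.enableTimer(); }

  // Our own, benign close (e.g. probe done and reuse_connection_ is false).
  void closeConnection() {
    expect_close_ = true;
    onClose();
  }

  void onClose() {
    timeout_timer_.disableTimer();
    if (!expect_close_) {
      // Close came from "outside" (e.g. a failed setsockopt): count it as a
      // failed health check and schedule a retry after the interval, instead
      // of silently never probing again.
      ++failures_;
      interval_timer_.enableTimer();
    }
    expect_close_ = false;
  }
};
```

An external close thus marks the host unhealthy and re-arms the interval timer, while a self-initiated close records no failure.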

What do you think? I'm happy to share what I've got and get some feedback and see what to do next.

@mattklein123
Member

> What do you think? I'm happy to share what I've got and get some feedback and see what to do next.

For some reason I can't reopen this PR. Can you open a fresh one and we can discuss in code review? I think that will be easier.


Labels: stale (stalebot believes this issue/PR has not been touched recently), waiting

2 participants