upstream: Null-deref if conn closed and TCP health timeout pending #6422
andrewjjenkins wants to merge 1 commit into envoyproxy:master
Fixes no-longer-embargoed ossfuzz 11100: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=11100&q=envoy&colspec=ID%20Type%20Component%20Status%20Proj%20Reported%20Owner%20Summary
If a TCP health checker connection is closed but the timeout_timer fires anyway (for instance, the timeout timer is pending), then the TcpHealthCheckerImpl::TcpActiveHealthCheckSession::onTimeout() method attempts to dereference client_, which is nullptr.
We need guards in onTimeout, similar to the guards in HttpHealthCheckerImpl::HttpActiveHealthCheckSession::onTimeout().
I added a unit test following the pattern of HttpHealthCheckerImplTest.TimeoutAfterDisconnect that reproduces the backtrace on ossfuzz without my proposed fix, and does not backtrace with my fix.
Signed-off-by: Andrew Jenkins <andrew.jenkins@volunteers.acasi.info>
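The guard described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not Envoy's actual code: `FakeClient` and `SketchTcpSession` are hypothetical stand-ins, and the real session also manages timers and host health state that are omitted here.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the health-check client connection.
struct FakeClient {
  bool closed = false;
  void close() { closed = true; }
};

// Hypothetical, heavily simplified session (not Envoy's class).
class SketchTcpSession {
public:
  // Start a health check: create the client connection.
  void onInterval() {
    assert(client_ == nullptr);
    client_ = std::make_unique<FakeClient>();
  }

  // Remote close: the client is released, but in this sketch nothing
  // cancels a timeout that is already pending.
  void onRemoteClose() { client_.reset(); }

  // Guarded timeout handler. Returns true if there was a live client to
  // close, false if the connection was already gone.
  bool onTimeout() {
    ++timeouts_seen_;
    if (client_ == nullptr) {
      return false; // without this check, client_->close() would crash
    }
    client_->close();
    client_.reset();
    return true;
  }

  int timeoutsSeen() const { return timeouts_seen_; }
  bool hasClient() const { return client_ != nullptr; }

private:
  std::unique_ptr<FakeClient> client_;
  int timeouts_seen_ = 0;
};
```

A test in the spirit of TimeoutAfterDisconnect would call `onInterval()`, then `onRemoteClose()`, then `onTimeout()`, and expect the last call to return without touching `client_`.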
This is regarding issue #4709 (noting so there's a backref).
mattklein123
left a comment
Thanks for working on this and fixing. I would suggest a slightly different fix, which is to disable the timer when disconnection happens vs checking for nullptr here. This is more consistent with what we do elsewhere. Thank you!
/wait
Thanks for taking a look. I appreciate your suggestion: it'd be better to disable the dangling timer in the first place instead of waiting for it to fire and ignoring it. That's going to be a bit more work, especially as I'm wrapping my head around the timer/client interactions. Here's what I'm thinking; let me know if I'm headed in the wrong direction.
Basically, should I make both the HTTP and TCP health checkers consistently prefer disabling the timer over nullptr checks?
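For comparison, the "disable the timer on disconnect" approach might look like the sketch below. Everything here is an assumption made for illustration: `SketchTimer`, `FakeConnection`, and `SketchSession` are hypothetical names, the timer is a toy one-shot object driven by hand, and none of it is Envoy's real timer or health checker API.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical one-shot timer, fired manually to simulate the event loop.
class SketchTimer {
public:
  explicit SketchTimer(std::function<void()> cb) : cb_(std::move(cb)) {}
  void enableTimer() { enabled_ = true; }
  void disableTimer() { enabled_ = false; }
  bool enabled() const { return enabled_; }

  // Simulates the deadline passing. Returns true if the callback ran.
  bool fireIfEnabled() {
    if (!enabled_) {
      return false;
    }
    enabled_ = false;
    cb_();
    return true;
  }

private:
  std::function<void()> cb_;
  bool enabled_ = false;
};

// Hypothetical stand-in for the client connection.
struct FakeConnection {};

// Hypothetical session that cancels its timeout whenever the client goes away.
class SketchSession {
public:
  SketchSession() : timeout_timer_([this] { onTimeout(); }) {}

  void onInterval() {
    client_ = std::make_unique<FakeConnection>();
    timeout_timer_.enableTimer();
  }

  // Every disconnect path disables the timer before dropping the client,
  // so onTimeout() can assume client_ is still valid.
  void onDisconnect() {
    timeout_timer_.disableTimer();
    client_.reset();
  }

  SketchTimer& timer() { return timeout_timer_; }
  int timeouts() const { return timeouts_; }

private:
  void onTimeout() {
    assert(client_ != nullptr); // invariant kept by onDisconnect()
    ++timeouts_;
    client_.reset();
  }

  std::unique_ptr<FakeConnection> client_;
  SketchTimer timeout_timer_;
  int timeouts_ = 0;
};
```

The trade-off in this sketch is that the invariant only holds if every path that releases the client also disables the timer, which is the audit discussed in the rest of this thread.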
@andrewjjenkins yup that's right. One trick we do in unit tests is to call
@mattklein123 Update: I went back and tried to find all paths where the TcpHealthChecker client is closed but timeout_timer_ is not disabled. I found one:
This particular crash is triggered by a
(some time later, after timeout_timer_)
So I'm feeling like the TcpHealthCheckerImpl code is assuming that all
I tried other paths, like
So I think it's only the
So I'm evaluating these options:
I'll start heading down the path of choice 1 unless you have other thoughts. Thanks for working with me on this!
@andrewjjenkins sorry for the delay and thanks for the super detailed analysis. Yes, I would definitely do (1). It will be the cleanest solution. I might consider doing it for the HTTP HC also, but not a big deal since that code has been that way for a while. Thank you!
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
@mattklein123 sorry for disappearing - I have an update, could you please reopen? (I can also open a new one if you'd prefer) I explored disabling the timeout timer in
So I've got some code that instead treats
What do you think? I'm happy to share what I've got and get some feedback and see what to do next.
For some reason I can't reopen this PR. Can you open a fresh one and we can discuss in code review? I think that will be easier.
Description: Fixes Null-deref in TCP Health Monitor found by ossfuzz
Risk Level: Low, only implementation change is a nullptr check
Testing: Added unit test that reproduces the backtrace without the fix
Docs Changes: None
Release Notes: N/A (I'm new here - do we typically relnote fuzz bugs?)
Fixes no-longer-embargoed ossfuzz 11100
If a TCP health checker connection is closed but the timeout_timer fires anyway (for instance, the timeout timer is pending), then the TcpHealthCheckerImpl::TcpActiveHealthCheckSession::onTimeout() method attempts to dereference client_, which is nullptr:
We need a guard in onTimeout to check client_, similar to the guard in HttpHealthCheckerImpl::HttpActiveHealthCheckSession::onTimeout(). I added a unit test following the pattern of HttpHealthCheckerImplTest.TimeoutAfterDisconnect that reproduces the backtrace on ossfuzz without my proposed fix, and does not backtrace with my fix.
Signed-off-by: Andrew Jenkins <andrew.jenkins@volunteers.acasi.info>