upstream: fix subset_lb crash when host configured with no metadata #15171

Merged
snowp merged 5 commits into envoyproxy:main from yanjunxiang-google:yanjun_oss_fuzz_crash
Mar 10, 2021

Conversation

@yanjunxiang-google
Contributor

@yanjunxiang-google yanjunxiang-google commented Feb 24, 2021

Problem Description:
When executing a fuzz test case with empty host metadata, Envoy crashed at subset_lb.cc:rebuildSingle(), line 131.

Root cause:
The current Envoy code assumes host->metadata is valid and uses it directly without a NULL check. When the metadata is actually NULL, which is a valid configuration, Envoy crashes.

Fix:
1) Add an if (metadata != nullptr) check in subset_lb.cc:rebuildSingle() before accessing metadata.
2) GTEST code is added to reproduce the issue and to verify the fix.
3) The oss-fuzz test file that exposed the issue is pushed with the change.

Testing:
1) The fix is tested by running the oss-fuzz test with the special test file, which has a host configured without metadata.
2) The fix is also tested by the newly added GTEST code.
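
The guard described under "Fix" can be illustrated with a minimal sketch. This is not Envoy's code: `Host`, `Metadata`, and `subsetValue` are hypothetical stand-ins, and the only point is that a host whose metadata pointer is null must be handled before any lookup is attempted.

```cpp
#include <map>
#include <memory>
#include <optional>
#include <string>

// Hypothetical stand-ins for illustration; these are not Envoy's real types.
using Metadata = std::map<std::string, std::string>;

struct Host {
  std::string address;
  // Null when the endpoint was configured without a metadata block.
  std::shared_ptr<const Metadata> metadata;
};

// Returns the host's value for `key`, or std::nullopt when the host has no
// metadata at all or the key is absent from its metadata.
std::optional<std::string> subsetValue(const Host& host, const std::string& key) {
  const std::shared_ptr<const Metadata>& metadata = host.metadata;
  if (metadata != nullptr) {  // Guard: never dereference a null metadata pointer.
    const auto it = metadata->find(key);
    if (it != metadata->end()) {
      return it->second;
    }
  }
  return std::nullopt;
}
```

Without the `metadata != nullptr` check, the `find` call would dereference a null pointer for a host configured with no metadata, which is the class of crash this PR describes.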

Release Notes:
N/A

Issues:

Fix #30705 : https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=30705

Signed-off-by: Yanjun Xiang <yanjunxiang@google.com>

https://oss-fuzz.com/testcase-detail/5135200453525504

The problem is that when executing a fuzz test case with empty host
metadata, Envoy crashed at subset_lb.cc:rebuildSingle(), line 131.

The root cause is that the current Envoy code assumes host->metadata is
valid and uses it directly without a NULL check. When the metadata is
actually NULL, which is a valid configuration, Envoy crashes.

The fix is to add an if (metadata != nullptr) check before accessing metadata.

GTEST code is added to reproduce the issue and to verify the fix.

The fix is also verified by running the oss-fuzz test with that
special testcase input file.

Signed-off-by: Yanjun Xiang <yanjunxiang@google.com>
@yanjunxiang-google
Contributor Author

/assign @asraa @adisuissa

Contributor

@asraa asraa left a comment

Thanks @yanjunxiang-google!
Could you please make the issue title something like upstream: fix subset_lb crash when host configured with no metadata?

Also, could you please change the bug link to the monorail one? https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=30705. The detailed report eventually expires after some time. You can also commit the regression corpus entry as an additional regression test.

Contributor

@adisuissa adisuissa left a comment

Thanks for working on this!
I've added a couple of comments.
Please update the PR first comment fields, such as risk level, etc.
@asraa: should the fuzz test-case be pushed as well?

Signed-off-by: Yanjun Xiang <yanjunxiang@google.com>
@yanjunxiang-google yanjunxiang-google changed the title from "This commit is to fix oss-fuzz test issue:" to "upstream: fix subset_lb crash when host configured with no metadata" Feb 25, 2021
@yanjunxiang-google
Contributor Author

> Thanks @yanjunxiang-google!
> Could you please make the issue title something like upstream: fix subset_lb crash when host configured with no metadata?
>
> Also, could you please change the bug link to the monorail one? https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=30705. The detailed report eventually expires after some time. You can also commit the regression corpus entry as an additional regression test.

Done!

@yanjunxiang-google
Contributor Author

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Check envoy-presubmit didn't fail.

🐱

Caused by: a #15171 (comment) was created by @yanjunxiang-google.

see: more, trace.

@yanjunxiang-google
Contributor Author

Reopen the PR.

@yanjunxiang-google
Contributor Author

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Check envoy-presubmit didn't fail.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #15171 (comment) was created by @yanjunxiang-google.

see: more, trace.

adisuissa
adisuissa previously approved these changes Feb 26, 2021
Contributor

@adisuissa adisuissa left a comment

Thanks for fixing this, LGTM!

{":authority", "host"}});
response->waitForEndStream();
EXPECT_TRUE(response->complete());
EXPECT_THAT(response->headers(), Http::HttpStatusIs("404"));
Contributor

Do you know why this is giving a 404 instead of 503?
It might be configurable, but the default is 503

enum ClusterNotFoundResponseCode {

Contributor Author

@yanjunxiang-google yanjunxiang-google Feb 26, 2021

The configuration for this test is below: the cluster and endpoints are configured. My understanding is that with this configuration the service is available but the page is not found, hence 404 is returned.

static_resources:
  clusters:
  - name: cluster_0
    connect_timeout: 5s
    lb_subset_config:
      subset_selectors:
      - keys:
        - type
        single_host_per_subset: true
    load_assignment:
      cluster_name: cluster_0
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 46423
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 40245
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 39563
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 37041

Contributor

Actually now that I think about it I think we're probably hitting a NR here and not actually hitting the bad code. I think the test file sets up a requirement for the header x-type to be set to either a or b in order to actually route to the upstream. Can you take a look and make sure we're actually attempting to route via the subset lb?

Contributor Author

Thanks! Let me take a look.

Contributor Author

1. The bad code in subset_lb.cc:SubsetLoadBalancer::rebuildSingle() is hit by a configuration update, not by traffic. This means that in the newly added test,
   TEST_P(HttpSubsetLbIntegrationTest, SubsetLoadBalancerSingleHostPerSubsetNoMetadata),
   initialize(), which eventually calls SubsetLoadBalancer::rebuildSingle(), can trigger the crash. We reproduced the crash this way and also verified the fix with it.

2. The original test function, TEST_P(HttpSubsetLbIntegrationTest, SubsetLoadBalancer), sets up the config with metadata as

    ...
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 34033
        metadata:
          filter_metadata:
            envoy.lb:
              type: a
      - endpoint:
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 35437
        metadata:
          filter_metadata:
            envoy.lb:
              type: b
    ...

   It also calls runTest() with type_a and type_b, as in TEST_P(..., SubsetLoadBalancer). In that test, Envoy is expected to route via the subset lb.

   However, the newly added test function has no type a/b metadata config and does not call runTest() with the two different request types, so it probably won't route via the subset lb.

3. In the newly added test function, the code after initialize() (creating an HTTP connection, sending an HTTP request, and verifying the response) does not specifically test the code change in SubsetLoadBalancer::rebuildSingle(). It is a general-purpose check: send some traffic and verify that the response is as expected.

Please let me know whether this addresses your question.

Thanks!

Contributor

@snowp snowp left a comment

Thanks for the fix!

// Two hosts with the same metadata value were found. Ignore all but one of them, and
// set a metric for how many times this happened.
collision_count++;
if (metadata != nullptr) {
Contributor

Maybe do

if (metadata == nullptr) {
  continue;
}

to reduce the nesting?
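
The guard-clause shape suggested here can be sketched in isolation. The following is a simplified, hypothetical model rather than the code in subset_lb.cc: `SketchHost`, `SingleHostMap`, and `buildSingleHostMap` are invented names, and the metadata is reduced to a flat string map. It keeps one host per metadata value, counts collisions as the quoted comment describes, and skips hosts without metadata using an early `continue`.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not Envoy's real types.
using Metadata = std::map<std::string, std::string>;

struct SketchHost {
  std::string address;
  std::shared_ptr<const Metadata> metadata;  // May be null.
};

struct SingleHostMap {
  std::map<std::string, std::string> host_by_value;  // metadata value -> address
  int collision_count = 0;
};

SingleHostMap buildSingleHostMap(const std::vector<SketchHost>& hosts,
                                 const std::string& key) {
  SingleHostMap result;
  for (const SketchHost& host : hosts) {
    if (host.metadata == nullptr) {
      continue;  // Guard clause: nothing to index for this host.
    }
    const auto it = host.metadata->find(key);
    if (it == host.metadata->end()) {
      continue;  // The host has metadata, but not for this key.
    }
    // Keep the first host seen for each value; count the rest as collisions.
    const bool inserted = result.host_by_value.emplace(it->second, host.address).second;
    if (!inserted) {
      ++result.collision_count;
    }
  }
  return result;
}
```

The early `continue` keeps the main path of the loop at one indentation level, which is the readability gain the review comment asks for; the behavior is the same as wrapping the body in `if (metadata != nullptr)`.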

Contributor Author

sure, will do.

Comment on lines +225 to +227
// Set single_host_per_subset to be true
auto* subset_selector = cluster->mutable_lb_subset_config()->mutable_subset_selectors(0);
subset_selector->set_single_host_per_subset(true);
Contributor

Seems like this crash is limited to when this option is enabled, could we be more explicit about this in the comments or test name? Maybe even add a test case where this is disabled for completeness sake.

Contributor Author

Yes, this crash only happens when single_host_per_subset is set to true. Without it, subset_lb.cc:rebuildSingle() bails out early, hence no crash. I will add some comments here to cover this.
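
The early bail-out mentioned in this reply can be pictured with a small sketch. It is an assumption-laden model, not the actual check in subset_lb.cc: `SelectorSketch` and `needsSingleHostRebuild` are hypothetical names, and the real condition in Envoy may be expressed differently.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical mirror of one subset selector from the cluster config.
struct SelectorSketch {
  std::vector<std::string> keys;
  bool single_host_per_subset = false;
};

// Returns true only when at least one selector asks for a single host per
// subset. A rebuild routine modeled this way would return immediately when
// this is false and so would never touch host metadata.
bool needsSingleHostRebuild(const std::vector<SelectorSketch>& selectors) {
  return std::any_of(selectors.begin(), selectors.end(),
                     [](const SelectorSketch& s) { return s.single_host_per_subset; });
}
```

Under this model, a test that leaves single_host_per_subset disabled exercises only the early return, which is why the reviewer asks for the option to be called out explicitly in the test name or comments.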

@yanjunxiang-google
Contributor Author

yanjunxiang-google commented Feb 26, 2021 via email

Signed-off-by: Yanjun Xiang <yanjunxiang@google.com>
@yanjunxiang-google
Contributor Author

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #15171 (comment) was created by @yanjunxiang-google.

see: more, trace.

@asraa
Contributor

asraa commented Mar 1, 2021

I think you will need to merge main when this is merged #15238

@yanjunxiang-google
Contributor Author

> I think you will need to merge main when this is merged #15238

Thanks for the info!

@yanjunxiang-google
Contributor Author

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #15171 (comment) was created by @yanjunxiang-google.

see: more, trace.

Contributor

@snowp snowp left a comment

Thanks!

@snowp snowp added the waiting label Mar 3, 2021
@yanjunxiang-google
Contributor Author

@snowp

I checked the code and the test logs. Here are some comments:

1. The bad code in subset_lb.cc:SubsetLoadBalancer::rebuildSingle() is hit by a configuration update, not by traffic. This means that in the newly added test,
   TEST_P(HttpSubsetLbIntegrationTest, SubsetLoadBalancerSingleHostPerSubsetNoMetadata),
   initialize(), which eventually calls SubsetLoadBalancer::rebuildSingle(), can trigger the crash. We reproduced the crash this way and also verified the fix with it.

2. The original test function, TEST_P(HttpSubsetLbIntegrationTest, SubsetLoadBalancer), sets up the config with metadata as

    ...
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 34033
        metadata:
          filter_metadata:
            envoy.lb:
              type: a
      - endpoint:
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 35437
        metadata:
          filter_metadata:
            envoy.lb:
              type: b
    ...

   It also calls runTest() with type_a_request and type_b_request, as in TEST_P(..., SubsetLoadBalancer). In that test, Envoy routes via the subset lb, and this is verified.

   However, the newly added test function has no type a/b metadata config and does not call runTest() with the two different request types, so it probably won't route via the subset lb.

3. In the newly added test function, the code after initialize() (creating an HTTP connection, sending an HTTP request, and verifying the response) does not specifically test the code change in SubsetLoadBalancer::rebuildSingle(). It is a general-purpose check: send some traffic and verify that the response is as expected.

Please let me know whether this addresses your question.

Thanks!

@snowp
Contributor

snowp commented Mar 4, 2021

Even if we're not having to validate the behavior via the traffic it seems odd to have a request that doesn't hit the subset LB at all? Would it be hard to make it route via the subset LB? I think we'd just need to set a single header

@yanjunxiang-google
Contributor Author

> Even if we're not having to validate the behavior via the traffic it seems odd to have a request that doesn't hit the subset LB at all? Would it be hard to make it route via the subset LB? I think we'd just need to set a single header

Sure, let me add this.

@yanjunxiang-google
Contributor Author

working on this today

Signed-off-by: Yanjun Xiang <yanjunxiang@google.com>
@yanjunxiang-google
Contributor Author

Even if we're not having to validate the behavior via the traffic it seems odd to have a request that doesn't hit the subset LB at all? Would it be hard to make it route via the subset LB? I think we'd just need to set a single header

Sure, let me add this.

With the request header setup below,
Http::TestRequestHeaderMapImpl{{":method", "GET"},
                               {":path", "/test"},
                               {":scheme", "http"},
                               {":authority", "host"},
                               {"x-type", "a"},
                               {"x-hash", "hash-a"}});

Envoy was observed trying to route the request, but it failed with "no healthy host for HTTP connection pool". I think this is expected because no metadata is present in the host configuration. The complete error logs are below. Please check the latest code change.

[2021-03-09 16:05:42.229][130][trace][connection] [source/common/network/connection_impl.cc:349] [C47] readDisable: disable=true disable_count=0 state=0 buffer_length=80
[2021-03-09 16:05:42.229][130][debug][http] [source/common/http/conn_manager_impl.cc:883] [C47][S7508058493963056235] request headers complete (end_stream=true):
':authority', 'host'
':path', '/test'
':method', 'GET'
'x-type', 'a'
'x-hash', 'hash-a'
'content-length', '0'

[2021-03-09 16:05:42.229][130][debug][http] [source/common/http/filter_manager.cc:774] [C47][S7508058493963056235] request end stream
[2021-03-09 16:05:42.229][130][debug][router] [source/common/router/router.cc:428] [C47][S7508058493963056235] cluster 'cluster_0' match for URL '/test'
[2021-03-09 16:05:42.229][130][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1367] no healthy host for HTTP connection pool
[2021-03-09 16:05:42.230][130][debug][http] [source/common/http/filter_manager.cc:858] [C47][S7508058493963056235] Sending local reply with details no_healthy_upstream
[2021-03-09 16:05:42.230][130][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0x6543f616080 for 300000ms, min is 300000ms
[2021-03-09 16:05:42.230][130][debug][http] [source/common/http/conn_manager_impl.cc:1454] [C47][S7508058493963056235] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '19'
'content-type', 'text/plain'
'date', 'Tue, 09 Mar 2021 16:05:42 GMT'
'server', 'envoy'

Thanks!
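
The difference between the earlier 404 and the 503 in these logs can be summarized with a toy model. It is a sketch under stated assumptions and does not reproduce Envoy's router: the rule that a route matches only when x-type is "a" or "b" is taken from the reviewer's description of the test file, and `responseStatus` and `subset_values` are hypothetical names.

```cpp
#include <map>
#include <set>
#include <string>

using Headers = std::map<std::string, std::string>;

// Toy model of the two failure modes discussed in this thread.
// `subset_values` holds the metadata values for which some subset has a host.
int responseStatus(const Headers& request, const std::set<std::string>& subset_values) {
  const auto it = request.find("x-type");
  // Assumed route rule: only x-type "a" or "b" matches a route.
  if (it == request.end() || (it->second != "a" && it->second != "b")) {
    return 404;  // No route matched, so the subset LB is never consulted.
  }
  if (subset_values.count(it->second) == 0) {
    return 503;  // Route matched, but no host is available for this value.
  }
  return 200;  // Route matched and a host was selected.
}
```

In this model, a request without x-type corresponds to the original 404 expectation, and a request with x-type set to "a" against hosts that carry no metadata corresponds to the 503 with no_healthy_upstream shown in the logs.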

@yanjunxiang-google
Contributor Author

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #15171 (comment) was created by @yanjunxiang-google.

see: more, trace.

Signed-off-by: Yanjun Xiang <yanjunxiang@google.com>
@snowp
Contributor

snowp commented Mar 9, 2021

Yeah that looks right to me, thanks for updating the test!

@snowp snowp merged commit e8cd93d into envoyproxy:main Mar 10, 2021
aunu53 pushed a commit to aunu53/envoy that referenced this pull request Mar 19, 2021
…nvoyproxy#15171)

This commit is to fix oss-fuzz test issue:
https://oss-fuzz.com/testcase-detail/5135200453525504

The problem is that when executing a fuzz test case with empty host
metadata, Envoy crashed at subset_lb.cc:rebuildSingle(), line 131.

The root cause is that the current Envoy code assumes host->metadata is
valid and uses it directly without a NULL check. When the metadata is
actually NULL, which is a valid configuration, Envoy crashes.

The fix is to add an if (metadata != nullptr) check before accessing metadata.

GTEST code is added to reproduce the issue and to verify the fix.

The fix is also verified by running the oss-fuzz test with that
special testcase input file.

Signed-off-by: Yanjun Xiang <yanjunxiang@google.com>
Signed-off-by: Auni Ahsan <auni@google.com>
@yanjunxiang-google yanjunxiang-google deleted the yanjun_oss_fuzz_crash branch July 19, 2021 21:15