Skip to content

[grpc]: fix ex_authz grpc client race condition#17619

Merged
yanavlasov merged 21 commits intoenvoyproxy:mainfrom
chaoqin-li1123:fix_ex_authz_race
Aug 13, 2021
Merged

[grpc]: fix ex_authz grpc client race condition#17619
yanavlasov merged 21 commits intoenvoyproxy:mainfrom
chaoqin-li1123:fix_ex_authz_race

Conversation

@chaoqin-li1123
Copy link
Contributor

@chaoqin-li1123 chaoqin-li1123 commented Aug 6, 2021

fixes commit #15745

Signed-off-by: chaoqin-li1123 chaoqinli@google.com

Commit Message: Creating the raw grpc client outside the callback cause all thread to use the same grpc client and data race.
Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per chat I think we should have more tests for this.

chaoqin-li1123 added 3 commits August 7, 2021 00:36
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
jmarantz
jmarantz previously approved these changes Aug 7, 2021
Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the mock test is helpful but I would like to see us test against real threads more. The mock test is very tuned to the implementation, but a real threading test would ensure we've fixed the problem we found, and possibly find others we don't know about.

I would be OK with this as a follow-up also, but I'd like a senior maintainer to weigh in.

A pattern for testing with real threads can be found here:

class ThreadLocalRealThreadsTestBase : public ThreadLocalStoreNoMocksTestBase {

/assign-from @envoyproxy/senior-maintainers

@repokitteh-read-only
Copy link

@envoyproxy/senior-maintainers assignee is @zuercher

🐱

Caused by: a #17619 (review) was submitted by @jmarantz.

see: more, trace.

Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I'm with @jmarantz on hoping we can get a regression test that TSAN would have caught, as well as adding some thread self-sameness checks on constructor/destructor/send ops for clients to catch this eagerly during development. Thanks.
/wait

Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
@chaoqin-li1123
Copy link
Contributor Author

I added a thread consistency check. The thread consistency assertion will always be triggered before any race condition. Because thread sameness is a stronger assumption than thread safety. Thread safety only implies no concurrent writes, but the thread consistency check will guarantee that every member function is executed in the same thread.

chaoqin-li1123 added 3 commits August 9, 2021 20:21
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
@chaoqin-li1123
Copy link
Contributor Author

The death can pass locally, try to figure out why it fails the CI.

chaoqin-li1123 added 7 commits August 10, 2021 04:51
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Copy link
Contributor

@yanavlasov yanavlasov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

Http::FilterFactoryCb cb = factory.createFilterFactoryFromProto(*proto_config, "stats", context);
Http::MockFilterChainFactoryCallbacks filter_callback;
EXPECT_CALL(filter_callback, addStreamFilter(_));
cb(filter_callback);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be useful to add test that calls cb on a separate thread and verifies that getOrCreateRawAsyncClient is called on that separate thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, test added.

@yanavlasov
Copy link
Contributor

LGTM, I think adding one extra test to verify that gRPC client in the the ext_authz factory is created on the right thread, would be good.
I think test with multiple worker threads can be done separately.

Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
)EOF";
#ifndef ENVOY_DISABLE_DEPRECATED_FEATURES
expectCorrectProtoGrpc(envoy::config::core::v3::ApiVersion::AUTO, google_grpc_service_yaml);
expectCorrectProtoGrpc(envoy::config::core::v3::ApiVersion::V2, google_grpc_service_yaml);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: not worth covering V2, suggest removing that, and we might move AUTO to be v3 by default, so maybe just leave as a TODO.

htuch
htuch previously approved these changes Aug 12, 2021
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
@chaoqin-li1123
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17619 (comment) was created by @chaoqin-li1123.

see: more, trace.

@yanavlasov yanavlasov merged commit d59a43b into envoyproxy:main Aug 13, 2021
oschaaf pushed a commit to envoyproxy/nighthawk that referenced this pull request Aug 19, 2021
- Envoy now demands that `Envoy::Grpc::AsyncClient` is created, used and destroyed on the same thread. (See
  envoyproxy/envoy#17619). Nighthawk creates the `Envoy::Grpc::AsyncClient` on the worker thread, but destroys it on the main which triggers an assertion failure. Fixing this by:
  - Adding a new method `RequestSource::destroyOnThread()` which complements the pre-existing `RequestSource::initOnThread()`. While the latter gets called when a worker thread starts, the former now gets called when a worker thread is shutting down.
  - Implementing `RequestSource::destroyOnThread()` on the only request source implementation that uses `Envoy::Grpc::AsyncClient`, i.e. `Nighthawk::RemoteRequestSourceImpl`, the method destroys the object that
    owns `Envoy::Grpc::AsyncClient` when called.
- no changes to `.bazelrc`, `.bazelversion`, `run_envoy_docker.sh`.
- lowering the coverage threshold to `93.2` to accommodate the fact that this code path is only covered in integration tests. The unit-tests use a mock.

Signed-off-by: Jakub Sobon <mumak@google.com>
jmarantz pushed a commit that referenced this pull request Sep 28, 2021
Commit Message: Add test with real threading as a followup of #17619
Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:

Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants