Skip to content

jwt_authn: support fetching jwks in the background#16298

Merged
lizan merged 21 commits intoenvoyproxy:mainfrom
qiwzhang:remote-jwks-async-fetch
May 28, 2021
Merged

jwt_authn: support fetching jwks in the background#16298
lizan merged 21 commits intoenvoyproxy:mainfrom
qiwzhang:remote-jwks-async-fetch

Conversation

@qiwzhang
Copy link
Contributor

@qiwzhang qiwzhang commented May 4, 2021

It is to fix: #14556 (comment)

Currently, remote Jwks is fetched on-demand, in the worker thread after the requests come. The first few requests need to pause to wait for the Jwks.

Add a new feature to fetch remote Jwks in the main thread, before the listener is activated.

Detail changes:

  • Change the filter config to add async_fetch field inside RemoteJwks message
  • Add a new class: JwksAsyncFetcher class to handle this new config.
  • Add two new statistics counters jwks_fetch_success and jwks_fetch_fail.

Risk Level: Low since new feature is guarded by the new config.
Testing: unit-tested and integration tested
Docs Changes: None
Release Notes: Yes

@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
envoyproxy/api-shepherds assignee is @lizan
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #16298 was opened by qiwzhang.

see: more, trace.

Copy link
Member

@rojkov rojkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good! I added a few nitpicking comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need for this else branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: setting debug_name_ could be done in the ctor init list. This way it could be marked as const.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The callbacks can be marked as const.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: create_fetcher_fn_ would be more readable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/handle_fetch_done/handleFetchDone/

Copy link
Member

@lizan lizan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API review

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this flag, it can be determined by whether the field async_fetch is set or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we reuse normal HTTP RetryPolicy here, and make HTTP Async Client aware of that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I will add core.Retry config as part of RemoteJwks, Change common/JwksFetcher to support it.
That can be done as separate pr. The issue is tracked here

In this pr, I will remove retry

@lizan
Copy link
Member

lizan commented May 4, 2021

Can you check CI?

@rojkov let me know once the implementation looks good to you, then I will take a second look.

@qiwzhang
Copy link
Contributor Author

qiwzhang commented May 5, 2021

Tried to build docs locally. But run into this issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Debug log the failure error reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JwksFetcher has debug log. Not need to put here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is this if condition needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is safe to check. this is for the case neither local_jwks nor remote_jwks is set. of local_jwks is set, but somehow it is empty.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: better stick to the same style for lambdas. Either like in this line or in the line 31.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@qiwzhang
Copy link
Contributor Author

qiwzhang commented May 6, 2021

Update: working on the jwt_authn:filter_integration_test failure. Added 4 AsyncFetch integration_tests. Two of them crashed with fast_listener as false. They only crashes at bazel.release mode, not at debug mode, suspected of race condition.

For now, it is not easy to debug such crash. My local development environment is not working. I have to use ci/run_envoy_docker. Not sure how to run gdb within docker.

@qiwzhang
Copy link
Contributor Author

qiwzhang commented May 9, 2021

How do I run gdb for a bazel test target? especially inside docker, and inside ci_do.sh. For now, I can only produce this crash in

ci/run_envoy_docker.sh `ci/do_ci.sh bazel.debug //test/extensions/filters/http/jwt_authn:filter_integration_test`

Here is the backtrace:

[2021-05-09 06:01:22.958][7771][debug][main] [source/server/worker_impl.cc:131] worker entering dispatch loop
[2021-05-09 06:01:22.958][7764][trace][main] [source/common/event/dispatcher_impl.cc:252] item added to deferred deletion list (size=2)
[2021-05-09 06:01:22.958][7764][debug][pool] [source/common/http/http1/conn_pool.cc:53] [C0] response complete
[2021-05-09 06:01:22.958][7764][debug][pool] [source/common/conn_pool/conn_pool_base.cc:200] [C0] destroying stream: 0 remaining
[2021-05-09 06:01:22.958][7764][trace][http] [source/common/http/http1/codec_impl.cc:607] [C0] parsed 1093 bytes
[2021-05-09 06:01:22.958][7764][trace][main] [source/common/event/dispatcher_impl.cc:115] clearing deferred deletion list (size=2)
[2021-05-09 06:01:22.959][7772][debug][grpc] [source/common/grpc/google_async_client_impl.cc:50] completionThread running
[2021-05-09 06:01:22.959][7771][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:981] adding TLS cluster cluster_0
[2021-05-09 06:01:22.959][7771][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1235] membership update for TLS cluster cluster_0 added 1 removed 0
[2021-05-09 06:01:22.960][7771][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:981] adding TLS cluster pubkey_cluster
[2021-05-09 06:01:22.960][7771][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1235] membership update for TLS cluster pubkey_cluster added 1 removed 0
[2021-05-09 06:01:22.960][7771][warning][jwt] [source/extensions/filters/http/jwt_authn/jwks_cache.cc:41] ======= Create cache slot
libc++abi: terminating with uncaught exception of type std::__1::bad_function_call: std::exception
[2021-05-09 06:01:22.960][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Aborted, suspect faulting address 0x2b73300001e4c
[2021-05-09 06:01:22.960][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2021-05-09 06:01:22.960][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 0/1.19.0-dev/test/DEBUG/BoringSSL
[2021-05-09 06:01:23.007][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #0: Envoy::SignalAction::sigHandler() [0x53ac8dc]
[2021-05-09 06:01:23.007][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #1: __restore_rt [0x7f7f5afa1980]
[2021-05-09 06:01:23.047][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #2: std::__1::__function::__value_func<>::operator()() [0x2db9194]
[2021-05-09 06:01:23.086][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #3: std::__1::function<>::operator()() [0x2db8f15]
[2021-05-09 06:01:23.126][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #4: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x5103d19]
[2021-05-09 06:01:23.166][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #5: Envoy::Event::DispatcherImpl::run() [0x5103895]
[2021-05-09 06:01:23.205][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #6: Envoy::Server::WorkerImpl::threadRoutine() [0x3a61314]
[2021-05-09 06:01:23.245][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #7: Envoy::Server::WorkerImpl::start()::$_5::operator()() [0x3a69fec]
[2021-05-09 06:01:23.285][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #8: std::__1::__invoke<>() [0x3a69fad]
[2021-05-09 06:01:23.324][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #9: std::__1::__invoke_void_return_wrapper<>::__call<>() [0x3a69f5d]
[2021-05-09 06:01:23.364][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #10: std::__1::__function::__alloc_func<>::operator()() [0x3a69f2d]
[2021-05-09 06:01:23.403][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #11: std::__1::__function::__func<>::operator()() [0x3a6908e]
[2021-05-09 06:01:23.443][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #12: std::__1::__function::__value_func<>::operator()() [0x2db91a5]
[2021-05-09 06:01:23.443][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #13: std::__1::function<>::operator()() [0x2db8f15]
[2021-05-09 06:01:23.483][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #14: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::operator()() [0x62f87a2]
[2021-05-09 06:01:23.522][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #15: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x62f8775]
[2021-05-09 06:01:23.522][7771][critical][backtrace] [bazel-out/k8-dbg/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #16: start_thread [0x7f7f5af966db]
================================================================================

@qiwzhang qiwzhang force-pushed the remote-jwks-async-fetch branch from 95c694d to 965523a Compare May 10, 2021 07:21
@qiwzhang
Copy link
Contributor Author

@lizan @rojkov @htuch This PR is ready for review. PTAL. Thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be redundant: fetcher_ calls its cancel() method in its destructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true, I can remove it.

@qiwzhang
Copy link
Contributor Author

/retest envoy-presubmit (check linux_x64 gcc)

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #16298 (comment) was created by @qiwzhang.

see: more, trace.

@qiwzhang
Copy link
Contributor Author

test envoy-presubmit (check linux_x64 gcc) is failed du to following error: Not sure how to fix it.

ERROR: /source/source/exe/BUILD:47:17: C++ compilation of rule '//source/exe:envoy_main_entry_lib' failed (Exit 1): gcc failed: error executing command 
gcc: fatal error: cannot execute ‘/usr/lib/gcc/x86_64-linux-gnu/9/cc1plus’: execv: Argument list too long

Copy link
Member

@rojkov rojkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but could you get rid of the double definition of CreateJwksFetcherCb? Or at least give the types different names if you believe the types are fundamentally different.

@qiwzhang
Copy link
Contributor Author

Done. removed the double definitions. Thanks @rojkov

rojkov
rojkov previously approved these changes May 13, 2021
Copy link
Member

@rojkov rojkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@lizan I think it's good for a second pass.

Copy link
Member

@lizan lizan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one nit, LGTM otherwise

@qiwzhang
Copy link
Contributor Author

check-format failed in pre-submit tests. The error is release note current.rst is out of order after merged with top of tree.

Should I rebase this change to fix it?

@qiwzhang qiwzhang force-pushed the remote-jwks-async-fetch branch from 54a0b47 to 1925efc Compare May 19, 2021 16:48
Signed-off-by: Wayne Zhang <qiwzhang@google.com>
qiwzhang added 10 commits May 27, 2021 09:14
Signed-off-by: Wayne Zhang <qiwzhang@google.com>
Signed-off-by: Wayne Zhang <qiwzhang@google.com>
Signed-off-by: Wayne Zhang <qiwzhang@google.com>
Signed-off-by: Wayne Zhang <qiwzhang@google.com>
Signed-off-by: Wayne Zhang <qiwzhang@google.com>
Signed-off-by: Wayne Zhang <qiwzhang@google.com>
Signed-off-by: Wayne Zhang <qiwzhang@google.com>
Signed-off-by: Wayne Zhang <qiwzhang@google.com>
Signed-off-by: Wayne Zhang <qiwzhang@google.com>
Signed-off-by: Wayne Zhang <qiwzhang@google.com>
@qiwzhang qiwzhang force-pushed the remote-jwks-async-fetch branch from 1925efc to 453d8a9 Compare May 27, 2021 16:15
qiwzhang added 2 commits May 27, 2021 09:49
Signed-off-by: Wayne Zhang <qiwzhang@google.com>
Signed-off-by: Wayne Zhang <qiwzhang@google.com>
@lizan lizan merged commit db16625 into envoyproxy:main May 28, 2021
elcasteel added a commit to kgateway-dev/kgateway that referenced this pull request Sep 9, 2021
Expose the Envoy configuration created here: envoyproxy/envoy#16298
soloio-bulldozer bot pushed a commit to kgateway-dev/kgateway that referenced this pull request Sep 10, 2021
* Add JwksAsyncFetch configuration

Expose the Envoy configuration created here: envoyproxy/envoy#16298
* variable name
* codegeen
* Update jwt.proto
* import definition of JwksAsyncFetch in config
* don't import definition
* define JwksAsyncFetch in config.proto
ben-taussig-solo pushed a commit to kgateway-dev/kgateway that referenced this pull request Sep 10, 2021
* Add JwksAsyncFetch configuration

Expose the Envoy configuration created here: envoyproxy/envoy#16298
* variable name
* codegeen
* Update jwt.proto
* import definition of JwksAsyncFetch in config
* don't import definition
* define JwksAsyncFetch in config.proto
soloio-bulldozer bot pushed a commit to kgateway-dev/kgateway that referenced this pull request Sep 13, 2021
* Add support for headerBodyRequestTransform property in aws destinationSpec
* Add testing around new property
* Document Local AWS config process
* Add changelog entry
* Merge branch 'master' into aws-lambda-header-body-request-transform
* Add staged transformations
* Merge branch 'aws-lambda-header-body-request-transform' of github.com:solo-io/gloo into aws-lambda-header-body-request-transform
* Only apply transformation filter if transformations have been added
* Add notes on debug workflow
* Minor updates to extracted payload values
* rename method to httpMethod
* Merge refs/heads/master into aws-lambda-header-body-request-transform
* Adding changelog file to new location
* Deleting changelog file from old location
* Respond to comments
* git push
Merge branch 'aws-lambda-header-body-request-transform' of github.com:solo-io/gloo into aws-lambda-header-body-request-transform
* Add JwksAsyncFetch configuration (#5311)

* Add JwksAsyncFetch configuration

Expose the Envoy configuration created here: envoyproxy/envoy#16298
* variable name
* codegeen
* Update jwt.proto
* import definition of JwksAsyncFetch in config
* don't import definition
* define JwksAsyncFetch in config.proto
* Adding changelog file to new location
* Deleting changelog file from old location
* Merge branch 'aws-lambda-header-body-request-transform' of github.com:solo-io/gloo into aws-lambda-header-body-request-transform
* extract early stage filter utility
* Add sam's commentary on port 21001 to e2e README
* Pass pointer to &transformationPlugin.RequireEarlyTransformation
* Revert "extract early stage filter utility"

This reverts commit a5ed578.
* Merge branch 'aws-lambda-header-body-request-transform' of github.com:solo-io/gloo into aws-lambda-header-body-request-transform
* Make earlyPluginStage private again
* Update description + fieldname of request transformation option
* Merge refs/heads/master into aws-lambda-header-body-request-transform
* Update function signature of aws.NewPlugin in aws plugin test
* Merge branch 'aws-lambda-header-body-request-transform' of github.com:solo-io/gloo into aws-lambda-header-body-request-transform
* Remove log statements
* Set regular transforms when request transformations are applied, use camelCase for queryString extractor
* Merge refs/heads/master into aws-lambda-header-body-request-transform
* Testing updates
* Merge branch 'aws-lambda-header-body-request-transform' of github.com:solo-io/gloo into aws-lambda-header-body-request-transform
* Remove transformation plugin RequireTransformationFilter property
* Improve commentary around the use of earlyTransformsAdded property in aws plugin
leyao-daily pushed a commit to leyao-daily/envoy that referenced this pull request Sep 30, 2021
It is to fix: envoyproxy#14556 (comment)

Currently,  remote Jwks is fetched on-demand, in the worker thread after the requests come. The first few requests need to pause to wait for the Jwks.  

Add a new feature to fetch remote Jwks in the main thread, before the listener is activated.

Detail changes:
* Change the filter config to add async_fetch field inside RemoteJwks message
* Add a new class: JwksAsyncFetcher class to handle this new config.
* Add two new statistics counters `jwks_fetch_success` and `jwks_fetch_fail`.

Risk Level: Low since new feature is guarded by the new config.
Testing:  unit-tested and integration tested
Docs Changes:  None
Release Notes:  Yes

Signed-off-by: Wayne Zhang <qiwzhang@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

jwt_authn: JWKs health

5 participants