
Quarkus OIDC makes first request very slow #41452

Closed
wesleysalimansdvb opened this issue Jun 26, 2024 · 26 comments · Fixed by #42765
Labels
area/oidc kind/bug Something isn't working
Comments

@wesleysalimansdvb
Contributor

wesleysalimansdvb commented Jun 26, 2024

Describe the bug

Hi,

We use Quarkus OIDC for multi-tenancy. We do not use Keycloak; we use our own authentication system. Therefore we set quarkus.oidc.discovery-enabled to false, which results in our OIDC config not being loaded on startup but on the first request. This makes the first request made to our container (512 vCPU, 1024 memory) very slow (around 4 seconds).

Is there a way to make a healthcheck manually load the OIDC providers before the first request reaches the container? Or could a new configuration property be added to load the providers on startup (as proposed in #34890)?

quarkus.oidc.enabled=true
quarkus.oidc.discovery-enabled=false
quarkus.oidc.use-blocking-dns-lookup=true
quarkus.oidc.bca.public-key=${OUR_PUBLIC_KEY}
quarkus.oidc.bca.token.audience=bsp
quarkus.oidc.bca.token.issued-at-required=false
quarkus.oidc.bca.application-type=service
quarkus.oidc.bca.token.signature-algorithm=rs256
quarkus.oidc.auth-server-url=${JWKS_LOCATION}
quarkus.oidc.jwks-path=${JWKS_LOCATION}
quarkus.oidc.token.issued-at-required=false
quarkus.oidc.token.signature-algorithm=es256
quarkus.oidc.application-type=service

We also have our own TenantConfigResolver to resolve the tenant config based on the token:

import io.quarkus.oidc.OidcRequestContext;
import io.quarkus.oidc.OidcTenantConfig;
import io.quarkus.oidc.TenantConfigResolver;
import io.smallrye.mutiny.Uni;
import io.vertx.ext.web.RoutingContext;
import jakarta.enterprise.context.ApplicationScoped;
import lombok.SneakyThrows;
import org.eclipse.microprofile.config.inject.ConfigProperty;

@ApplicationScoped
public class TokenTenantResolver implements TenantConfigResolver {
  private static final String ISSUER_CLAIM = "iss";

  @ConfigProperty(name = "quarkus.oidc.bca.token.issuer")
  String bcaIssuer;

  @ConfigProperty(name = "quarkus.oidc.bca.public-key")
  String bcaPublicKey;

  @SneakyThrows
  @Override
  public Uni<OidcTenantConfig> resolve(
      RoutingContext context, OidcRequestContext<OidcTenantConfig> requestContext) {
    final var jwtString = context.request().headers().get("Authorization");

    if (jwtString == null) {
      // No bearer token: fall back to the default tenant.
      return Uni.createFrom().nullItem();
    }

    // Decode the JWT claims without verifying the signature; verification
    // happens later against the resolved tenant config.
    final var jwtClaims = io.quarkus.oidc.runtime.OidcUtils.decodeJwtContent(jwtString);

    if (jwtClaims != null && bcaIssuer.equals(jwtClaims.getValue(ISSUER_CLAIM))) {
      OidcTenantConfig config = new OidcTenantConfig();
      config.setTenantId("bca");
      config.setPublicKey(bcaPublicKey);
      config.getToken().setIssuedAtRequired(false);
      config.setApplicationType(OidcTenantConfig.ApplicationType.SERVICE);
      return Uni.createFrom().item(config);
    } else {
      // Resolve to the default tenant config.
      return Uni.createFrom().nullItem();
    }
  }
}

Expected behavior

All the fields in our OIDC tenant config are "deterministic", in the sense that they could easily be loaded on startup. So initializing them on startup would be our expected behavior.

Actual behavior

The OIDC config does not get loaded on startup, so all OIDC classes etc. are initialized on the first request, which makes that first request very slow.

How to Reproduce?

No response

Output of uname -a or ver

AWS ECS Fargate container

Output of java -version

openjdk 21.0.1 2023-10-17 LTS
OpenJDK Runtime Environment Corretto-21.0.1.12.1 (build 21.0.1+12-LTS)
OpenJDK 64-Bit Server VM Corretto-21.0.1.12.1 (build 21.0.1+12-LTS, mixed mode, sharing)

Quarkus version or git rev

3.11.0

Build tool (ie. output of mvnw --version or gradlew --version)

docker

Additional information

No response

@wesleysalimansdvb wesleysalimansdvb added the kind/bug Something isn't working label Jun 26, 2024

quarkus-bot bot commented Jun 26, 2024

/cc @pedroigor (oidc), @sberyozkin (oidc)

@gsmet
Member

gsmet commented Jun 26, 2024

Any chance you could get some profiling done?

We have some guidance here: https://github.com/quarkusio/quarkus/blob/main/TROUBLESHOOTING.md .

@wesleysalimansdvb
Contributor Author

I do not have the time right now to do any profiling; maybe these CloudWatch logs give you some insight?
oidc-initalizing.xlsx

@wesleysalimansdvb
Contributor Author

My "main" question right now (as a quick hack) is: how can I manually initialize the OIDC client so it is ready for my first request (via a healthcheck or something)?

@gsmet
Member

gsmet commented Jun 26, 2024

So https://quarkus.io/guides/lifecycle#startup_annotation is what should be used to execute something at startup, but I have no idea what you should initialize.

Question for @michalvavrik or @sberyozkin.
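
For reference, a warm-up hook along those lines could look roughly like the sketch below (a minimal sketch using the @Startup annotation from that guide; the class name and the warm-up body are placeholders, since what exactly needs initializing is the open question here):

import io.quarkus.runtime.Startup;

import jakarta.annotation.PostConstruct;
import jakarta.enterprise.context.ApplicationScoped;

@Startup
@ApplicationScoped
public class OidcWarmup {

  @PostConstruct
  void warmUp() {
    // Runs once during application startup, before the first HTTP request.
    // Whatever warm-up work is needed (for example, parsing a known-good
    // token so the OIDC/jose4j code paths get exercised) would go here;
    // this body is a placeholder, not an actual Quarkus OIDC API call.
  }
}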

@sberyozkin
Member

sberyozkin commented Jun 26, 2024

@wesleysalimansdvb Hi, 4 secs is a very long period of time; fetching the JWKS at runtime should take less than a second.

Why do you think that disabling the discovery postpones pulling the keys until the first request? It only skips the step of fetching the discovery metadata document that points to the JWKS and other endpoints; the JWKS itself will, by default, be fetched immediately: https://github.com/quarkusio/quarkus/blob/main/extensions/oidc/runtime/src/main/java/io/quarkus/oidc/runtime/OidcRecorder.java#L404

I wonder if quarkus.oidc.use-blocking-dns-lookup=true is the culprit here? It is only required for slow systems, to avoid a blocked IO thread problem while a connection to the provider is being established.

As a side note, you may want to experiment with https://quarkus.io/version/main/guides/security-openid-connect-multitenancy#issuer-based-tenant-resolver to avoid having to create a custom resolver like the one above, which parses the token itself...

@wesleysalimansdvb
Contributor Author

wesleysalimansdvb commented Jun 26, 2024

quarkus.oidc.use-blocking-dns-lookup=true was added because we kept getting Vert.x blocked thread exceptions.

2024-06-24 07:13:24,827 WARN [io.ver.cor.imp.BlockedThreadChecker] (vertx-blocked-thread-checker) Thread Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 3160 ms, time limit is 2000 ms: io.vertx.core.VertxException: Thread blocked

After enabling this property those exceptions are gone. 4 secs is indeed a long time, but that is how long our logging says it takes to initialize all classes and algorithms and parse the token.

Thanks for the side note, this is something we will definitely try out!

@sberyozkin
Member

@wesleysalimansdvb
So that stack trace says that it takes 3.16+ seconds to establish a connection with your provider. It sounds like that is what is happening: you get something close to 4 secs for it to be established.
You can try any of the quickstarts with a live Keycloak and with discovery disabled, and I can guarantee you it won't take 4 seconds to handle the first request.

@sberyozkin
Member

IMHO this is more about your dev env's setup than anything else; a lot of users have discovery disabled for their endpoints, since most OAuth2 social providers don't support it. Please prove otherwise with profiling if you'd like to see some progress on this issue.

@sberyozkin
Member

sberyozkin commented Jun 26, 2024

@wesleysalimansdvb

Thanks for the side note, this is something we will definitely try out!

Sure, please give it a try and let Michal and me know how it goes.

@sberyozkin sberyozkin added the triage/needs-reproducer We are waiting for a reproducer. label Jun 27, 2024
@sberyozkin
Member

sberyozkin commented Jun 28, 2024

@wesleysalimansdvb Can you temporarily register a test client application and share the details with me offline (on Zulip, or post them to sbiarozk at redhat dot com with a CC to someone else on your team)? I can try it from my laptop to check what is going on there. I'd be worried even about 2 secs if I ever saw it happening in such cases, but I have not seen it...

@bartm-dvb
Contributor

Hi @sberyozkin, I want to give you a heads-up that @wesleysalimansdvb is out of office until next week. He will get back to you when he returns.

@sberyozkin
Member

Hi @bartm-dvb, thanks for the heads-up. Sure, let's continue next week or so and see what we can realistically do at the Quarkus level.

@geoand
Contributor

geoand commented Jul 18, 2024

Is there anything new regarding this issue?

@bartm-dvb
Contributor

I understood that further communication is happening through Zulip for now.

@sberyozkin
Member

We have touched on it with @wesleysalimansdvb, who spent more time looking at the issue and could confirm that about 4 secs are spent. However, as I said to @wesleysalimansdvb, we can't fix anything based on images like the one attached to this issue.

So, there are two things Quarkus OIDC does when the discovery is disabled but the JWKS endpoint is configured directly:

  1. It performs a networking operation to fetch the JWK key set
  2. It processes this key set

The team assumes Quarkus OIDC spends this time on step 2.

My bet is, given that you had to enable the blocking DNS lookup to work around slow DNS resolution, that the time is spent on step 1, which I believe is in fact confirmed by the stack trace @wesleysalimansdvb pasted above: without this option, Quarkus reports the thread as blocked after 3.16 secs, and the actual time required to perform the networking op is higher.

I'd like to encourage you to run quarkus-quickstarts/security-openid-connect-web-authentication with discovery disabled. I can assure you that you won't notice a difference with or without the discovery.

I also suggested that @wesleysalimansdvb set a breakpoint in OidcRecorder where the JWKS is resolved; @wesleysalimansdvb has already done one PR, so it would not be a problem to get the source, etc.

IMHO this is not really a Quarkus issue and I would not mind closing it. But I know this is an important one for you, so let's keep discussing what can be done about it. Thanks

@geoand geoand added triage/needs-feedback We are waiting for feedback. and removed triage/needs-reproducer We are waiting for a reproducer. labels Jul 22, 2024
@bartm-dvb
Contributor

FYI, I am trying out the quarkus-quickstarts to reproduce the issue. I'll be back with my findings later.

@bartm-dvb
Contributor

bartm-dvb commented Aug 1, 2024

@sberyozkin So what seems to take the most time in step 2 ("It processes this key set") is initialising jose4j. In both the profiling output and the logs, it seems to take a lot of time. On my local machine it takes 64 ms; in the logs @wesleysalimansdvb attached it takes 298 ms. Is initialising jose4j something that could be moved to build time?

2024-06-24 09:23:17,817 DEBUG [org.jos.jwa.AlgorithmFactoryFactory] (vert.x-eventloop-thread-0) Initialized jose4j in 298ms

@geoand geoand removed the triage/needs-feedback We are waiting for feedback. label Aug 1, 2024
@wesleysalimansdvb
Contributor Author

@sberyozkin could you please look at this thread?

@sberyozkin
Member

sberyozkin commented Aug 22, 2024

@wesleysalimansdvb @bartm-dvb Sure, we can try to do that. Even the worst case, 298 ms, is still only about a third of a second. In any case, it is worth optimizing. Would you be interested in a small PR? Like adding a build step that calls out to the OIDC recorder, which would make some neutral jose4j call causing its initialization?

@sberyozkin
Member

@wesleysalimansdvb @bartm-dvb, in general the OIDC provider initialization must happen at runtime, as the keys are fetched at runtime. But having some neutral jose4j call which can be recorded can certainly be done. For example, AlgorithmFactoryFactory.getInstance() should do.

Let me know if you'll be up for it.
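
A minimal sketch of what such a build step and recorder pair could look like, assuming an extension-style deployment/runtime module split (the class and method names are illustrative and not taken from #42765; the only APIs assumed are Quarkus' @BuildStep/@Record/@Recorder machinery and jose4j's AlgorithmFactoryFactory.getInstance()):

// Deployment module (illustrative class name)
import io.quarkus.deployment.annotations.BuildStep;
import io.quarkus.deployment.annotations.ExecutionTime;
import io.quarkus.deployment.annotations.Record;
import io.quarkus.deployment.builditem.ServiceStartBuildItem;

public class OidcJose4jProcessor {

  @BuildStep
  @Record(ExecutionTime.RUNTIME_INIT)
  ServiceStartBuildItem initializeJose4jAtStartup(OidcJose4jRecorder recorder) {
    // Recorded at build time, executed once during application startup,
    // so jose4j is warmed up before the first request arrives.
    recorder.initializeJose4j();
    // Producing a build item keeps this step in the build chain.
    return new ServiceStartBuildItem("jose4j-initialization");
  }
}

// Runtime module (illustrative class name)
import io.quarkus.runtime.annotations.Recorder;
import org.jose4j.jwa.AlgorithmFactoryFactory;

@Recorder
public class OidcJose4jRecorder {

  public void initializeJose4j() {
    // A neutral jose4j call: getInstance() performs the one-off algorithm
    // factory setup behind the "Initialized jose4j in ..." DEBUG log above.
    AlgorithmFactoryFactory.getInstance();
  }
}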

@wesleysalimansdvb
Contributor Author

Hi @sberyozkin, we will be proposing a PR one of these days :)

@sberyozkin
Member

sberyozkin commented Aug 26, 2024

Hi @wesleysalimansdvb, thanks very much. I just thought I'd create a quick one today, #42765, as I got a few minutes; I had not noticed your comment today.

I'm sure there will be more opportunities for OIDC PRs :-)

Note I've marked #42765 to close this issue. I know it might not answer all your questions, but I feel we should rather focus on more specific issues related to possible slow first requests, if you notice them. Please create them as necessary going forward.

@wesleysalimansdvb
Contributor Author

Hahah, thanks @sberyozkin for your help and for proposing a fix :)
I might open a separate recommendation to make it optional to fetch the JWKS at boot time, so we can also initialize the algorithms that are used before the first request :)

@sberyozkin
Member

Hi @wesleysalimansdvb There is a quarkus.oidc.jwks.resolve-early flag which you can set to false; give that a try, but yeah, please keep opening issues/PRs.

Thanks to you and @bartm-dvb, cheers

@quarkus-bot quarkus-bot bot added this to the 3.16 - main milestone Aug 27, 2024
@bartm-dvb
Contributor

@sberyozkin With the 3.16-RC1 I am seeing a ~30% improvement in first-request latency with the OIDC quickstart, thanks for your work!
