Skip to content

Conversation

joe-elliott
Copy link
Contributor

@joe-elliott joe-elliott commented Jun 16, 2020

What this PR does:

  • Adjusts the behavior of the frontend readiness handler at /ready to return 200 only if the frontend is ready to receive requests
    • Note that this also impacts the single binary
  • Adds appropriate unit/integration tests
  • Adds documentation on scaling the query frontend
  • Minor updates/fixes to scalable frontend proposal

Which issue(s) this PR fixes:
Fixes #None

Addresses this accepted proposal: https://github.com/cortexproject/cortex/blob/master/docs/proposals/scalable-query-frontend.md#querier-discovery-lag

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@joe-elliott joe-elliott marked this pull request as ready for review June 16, 2020 20:21
@pstibrany
Copy link
Contributor

pstibrany commented Jun 16, 2020

All Cortex componets have /ready handler. I don’t quite understand why does frontend need its own. Would it make sense to extend /ready handler with frontend check, similar to ingester?

(I guess I better read the proposal first)

@joe-elliott
Copy link
Contributor Author

The current readiness handler returns 200 if all of the Cortex modules have successfully started and are running. This is normally fine, but in the case of the query frontend it is not actually ready to receive requests until a querier has established a GRPC connection to it. Depending on settings this could be 10s of seconds.

In discussions surrounding the original proposal it was decided an additional handler was more appropriate then modifying the behavior of the original, but we can revisit this conversation.

@pstibrany
Copy link
Contributor

In discussions surrounding the original proposal it was decided an additional handler was more appropriate then modifying the behavior of the original, but we can revisit this conversation.

By using separate handler, we risk sending requests to Cortex instance which already has querier connected to it, but is otherwise still initializing. If initialization of other modules fails and Cortex exits, queued requests will be dropped.

Benefit of using separate handler is that query-frontend may start serving requests faster, while e.g. WAL is being replayed or blocks being downloaded (both of which can take minutes).

I am not sure what is better. I will review previous discussion to educate myself.

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @joe-elliott for working on it! LGTM. I just have few little requests:

  • Could you rebase master, please?
  • Could you add a CHANGELOG entry, please?
  • What do you think to add /ready and /query-frontend/ready to docs/apis.md and explain the use case for /query-frontend/ready?

@gouthamve
Copy link
Contributor

I would be in favour of re-using the existing handler and adding the check to it.

Benefit of using separate handler is that query-frontend may start serving requests faster, while e.g. WAL is being replayed or blocks being downloaded (both of which can take minutes).

If this is the case for single-binary, I am not sure how much benefit we'll have by getting the frontend ready but not the other components.

Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
@joe-elliott joe-elliott force-pushed the query-frontend-readiness branch from 2f3e77b to 1309aba Compare June 17, 2020 18:04
Signed-off-by: Joe Elliott <[email protected]>
@joe-elliott
Copy link
Contributor Author

Thanks @joe-elliott for working on it! LGTM. I just have few little requests:

Done!

If this is the case for single-binary, I am not sure how much benefit we'll have by getting the frontend ready but not the other components.

In the single binary scenario the current readiness endpoint works well for insertion into a gossip ring or readiness for ingestion, but not for reads. The idea behind this endpoint is that a query path load balancer or k8s service could use this endpoint to delay querying the new shard until the query-frontend has attached queriers.

@pstibrany
Copy link
Contributor

In the single binary scenario the current readiness endpoint works well for insertion into a gossip ring or readiness for ingestion, but not for reads. The idea behind this endpoint is that a query path load balancer or k8s service could use this endpoint to delay querying the new shard until the query-frontend has attached queriers.

Discussion is whether to extend existing /ready handler with check for query-frontend, or create new handler. I'm not convinced that new handler is necessary, but don't have strong opinion on it.

@pstibrany
Copy link
Contributor

Also in single-binary mode, I assume that querier running in the same binary "connects" to the query-frontend quickly.

@joe-elliott
Copy link
Contributor Author

Also in single-binary mode, I assume that querier running in the same binary "connects" to the query-frontend quickly.

You could configure the querier to use localhost. In this case each query-frontend would have a single querier attached.

However, I believe that the recommended configuration would be exactly as we have it now: The querier component of each single binary will be configured to use a DNS entry for discovery of the frontend component of all the other single binaries. In this case the connection lag would be no different than microservices mode.

Co-authored-by: Marco Pracucci <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
@joe-elliott joe-elliott force-pushed the query-frontend-readiness branch from 1640287 to 67dfcfe Compare June 19, 2020 19:15
@pracucci pracucci self-requested a review June 22, 2020 07:16
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joe-elliott I think Peter and Goutham are right. We don't need a dedicated readiness endpoint. When we brainstormed the proposal, we weren't aware of publishNotReadyAddresses existence, but thanks to it we can just change the query-frontend readiness probe endpoint to also ensure that least 1 querier is connected (we customise the readiness also for ingesters, see pkg/cortex/cortex.go:391). What's your take?

@joe-elliott joe-elliott marked this pull request as draft June 29, 2020 15:34
@joe-elliott joe-elliott force-pushed the query-frontend-readiness branch from ee14105 to 8fdf0f8 Compare June 30, 2020 14:36
joe-elliott and others added 3 commits June 30, 2020 10:52
Co-authored-by: Marco Pracucci <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
@joe-elliott joe-elliott force-pushed the query-frontend-readiness branch from 8fdf0f8 to e8a3557 Compare June 30, 2020 14:52
@joe-elliott
Copy link
Contributor Author

joe-elliott commented Jun 30, 2020

Ok, I think this is in a good place now. As discussed we settled on modifying the behavior of the /ready endpoint.

I will also be submitting a PR to this repo: https://github.com/grafana/cortex-jsonnet to handle the k8s configuration changes. Deploying these changes will need to be coordinated.

Also: Should we do more to highlight this change then the changelog and additional doc? I'm concerned this can have unexpected impacts when people roll it out.

@joe-elliott joe-elliott marked this pull request as ready for review June 30, 2020 15:42
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you!

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @joe-elliott for addressing my feedback. LGTM (modulo a couple of nits)

Signed-off-by: Joe Elliott <[email protected]>
Signed-off-by: Joe Elliott <[email protected]>
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for patiently addressing our feedback!

@pstibrany pstibrany merged commit 76ac8a9 into cortexproject:master Jul 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants