Conversation
```diff
- this.server.ext('onRequest', adoptToHapiOnPreAuthFormat(fn, this.log));
+ this.server.ext('onPreAuth', adoptToHapiOnPreAuthFormat(fn, this.log));
```
The onRequest lifecycle event occurs prior to "route lookup", which prevented the use of request.route.options.tags within the pre-auth handler to determine whether or not the route is fleet-agent specific. Changing this from onRequest to onPreAuth fixes this specific issue, but potentially introduces others.
@restrry this was originally using onRequest as part of #36690, are you aware of any reason why this can't be changed to onPreAuth instead?
onRequest was required to support Spaces rewriting url.
Gotcha, thanks! Could we theoretically change http.registerOnPreAuth to use onPreAuth, and introduce a http.registerOnPreRouting which uses onRequest?
> Gotcha, thanks! Could we theoretically change http.registerOnPreAuth to use onPreAuth, and introduce a http.registerOnPreRouting which uses onRequest?
Yeah, it shouldn't be that hard. I want to make sure it's necessary to extend the API for the current implementation. Let me know and the platform team can provide an interceptor implementation.
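To make the ordering concrete, here's a tiny self-contained simulation (illustrative names only, not the actual Hapi/Kibana API) of why route metadata such as tags is visible at onPreAuth but not at onRequest:

```ts
// Simulated request lifecycle: route lookup happens between the
// 'onRequest' and 'onPreAuth' extension points, so route metadata
// (tags, options, ...) is only populated for the latter.
interface SimRoute { path: string; tags: string[] }
interface SimRequest { path: string; route?: SimRoute }
type Ext = (req: SimRequest) => void;

class SimServer {
  private routes: SimRoute[] = [];
  private exts: { onRequest: Ext[]; onPreAuth: Ext[] } = { onRequest: [], onPreAuth: [] };
  route(r: SimRoute) { this.routes.push(r); }
  ext(point: 'onRequest' | 'onPreAuth', fn: Ext) { this.exts[point].push(fn); }
  inject(path: string): SimRequest {
    const req: SimRequest = { path };
    this.exts.onRequest.forEach((fn) => fn(req));         // before route lookup
    req.route = this.routes.find((r) => r.path === path); // route lookup
    this.exts.onPreAuth.forEach((fn) => fn(req));         // after route lookup
    return req;
  }
}

const server = new SimServer();
server.route({ path: '/api/fleet/agents/checkin', tags: ['fleet:agent'] });

const seen: Record<string, string[] | undefined> = {};
server.ext('onRequest', (req) => { seen.onRequest = req.route?.tags; });
server.ext('onPreAuth', (req) => { seen.onPreAuth = req.route?.tags; });
server.inject('/api/fleet/agents/checkin');

console.log(seen.onRequest); // undefined: tags not available yet
console.log(seen.onPreAuth); // ['fleet:agent']
```

This is just the lifecycle-ordering argument in runnable form; a real registerOnPreRouting/registerOnPreAuth split would live in core's http service.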
```ts
core.http.registerOnPreResponse(
  (request: KibanaRequest, preResponse: OnPreResponseInfo, toolkit: OnPreResponseToolkit) => {
    if (isAgentRequest(request) && preResponse.statusCode !== 429) {
```
This assumes that only the pre-auth handler could cause a response with status code 429, which isn't absolutely certain. It's possible that one of the HTTP route handlers is replying with a 429 itself, or propagating a 429 from Elasticsearch, which would mess up the counter.
The request object is different from the pre-auth to the pre-response, so I wasn't able to use a Set or WeakSet to track whether or not this was a 429 that we returned from within the pre-auth handler... Any other ideas?
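One possible direction, sketched below with plain functions: correlate the two stages by a stable request identifier rather than by object identity, and suppress the counter decrement only for 429s the pre-auth limiter itself produced. This assumes a stable id is obtainable in both interceptors, which would need verifying against the actual request objects:

```ts
// Tracks which 429s were generated by the pre-auth limiter itself,
// keyed by a (hypothetical) stable request id rather than object identity.
const preAuth429Ids = new Set<string>();

// Returns 429 when shedding load, null to continue the lifecycle.
function onPreAuth(requestId: string, overLimit: boolean): number | null {
  if (overLimit) {
    preAuth429Ids.add(requestId); // remember that *we* produced this 429
    return 429;
  }
  return null;
}

// Decides whether the pre-response interceptor should decrement the counter.
function onPreResponse(requestId: string, statusCode: number): 'decrement' | 'skip' {
  // A 429 propagated from a route handler or Elasticsearch still decrements;
  // only the limiter's own 429s skip, since those never incremented.
  const ours = preAuth429Ids.delete(requestId);
  return statusCode === 429 && ours ? 'skip' : 'decrement';
}
```

The `delete` call doubles as cleanup, so the set never grows past the number of in-flight rejected requests.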
Can we wrap Ingest manager specific route handlers in a function instead of adding global interceptors? https://github.com/elastic/kibana/pull/70495/files#diff-280f58825bd033ffbe7792f8423f6122R88
I'm also not sure that we want to provide access to response data in interceptors.
something along these lines (not tested):

```ts
import { RequestHandlerWrapper } from 'src/core/server';

class Lock {
  private counter = 0;
  constructor(private readonly maxConnections: number = 1) {}
  increase() {
    this.counter += 1;
  }
  decrease() {
    this.counter -= 1;
  }
  canHandle() {
    return this.counter < this.maxConnections;
  }
}

const lock = new Lock();

export const concurrencyLimit: RequestHandlerWrapper = (handler) => {
  return async (context, request, response) => {
    if (!lock.canHandle()) {
      return response.customError({ body: 'Too Many Agents', statusCode: 429 });
    }
    lock.increase();
    try {
      return await handler(context, request, response);
    } finally {
      lock.decrease();
    }
  };
};

concurrencyLimit(postAgentEnrollHandler);
concurrencyLimit(postAgentCheckinHandler);
```
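A self-contained variant of this wrapper idea (Kibana types swapped for plain functions so it runs anywhere) that exercises the intended semantics: with maxConnections of 1, a second in-flight request is rejected with 429, and the slot is released only after the wrapped handler actually settles:

```ts
// Counting lock: tracks in-flight requests against a fixed ceiling.
class Lock {
  private counter = 0;
  constructor(private readonly maxConnections: number = 1) {}
  increase() { this.counter += 1; }
  decrease() { this.counter -= 1; }
  canHandle() { return this.counter < this.maxConnections; }
}

type Handler = () => Promise<number>; // stand-in for a route handler; returns a status code

function concurrencyLimit(lock: Lock, handler: Handler): Handler {
  return async () => {
    if (!lock.canHandle()) return 429;
    lock.increase();
    try {
      // `await` matters here: without it, `finally` would release the
      // slot before the handler actually finished.
      return await handler();
    } finally {
      lock.decrease();
    }
  };
}

// Drives two overlapping requests through a single-slot lock.
async function demo(): Promise<number[]> {
  const lock = new Lock(1);
  let release!: () => void;
  const gate = new Promise<void>((res) => (release = res));
  const slowHandler: Handler = async () => { await gate; return 200; };
  const wrapped = concurrencyLimit(lock, slowHandler);
  const first = wrapped();        // occupies the single slot
  const second = await wrapped(); // rejected while the first is in flight
  release();                      // let the first handler finish
  return [await first, second];
}
```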
this doesn't let us reject traffic without validating API Keys as described here, right?
Yeah... I haven't seen the comment. Now I'm surprised that the auth has such a huge penalty.
> As @kobelb mentioned, it's critical that we can reject incoming traffic as cheaply as possible. It translates to saved $$ for our customers, making us able to offer a more competitive solution.
@kobelb for what use cases should we consider this a critical option? Auth & spaces do a huge number of requests to ES on every page load.
Would server session help to reduce the number of requests? Does this problem exist for API keys only?
Fleet's usage of Kibana APIs is different than our traditional usage. The Elastic Agent will be using Kibana APIs to enroll themselves and retrieve their configuration. As such, we're potentially dealing with thousands of Elastic Agents hitting these APIs... Whether or not this is "critical" is debatable and largely dependent on what the ingest-management team is seeing during their load-testing, but skipping auth reduces the load on Kibana when this circuit breaker is hit.
@roncohen despite the current implementation being imperfect and potentially misinterpreting the 429, can we perform load-testing with the circuit breaking being done before authentication and after authentication to determine what type of impact this has on Fleet's scalability?
> Would server session help to reduce the number of requests?
I don't think it will, Kibana will still have to make a query to Elasticsearch to return the server-side session document to authenticate the user.
> Does this problem exist for API keys only?
Any call to an Elasticsearch API will incur some performance penalty. However, there are differences in the caching strategy for API Keys vs username/password. Regardless of the type of credentials, performing any unnecessary work before their circuit breaker has a chance to short-circuit the operation is inefficient.
> @roncohen despite the current implementation being imperfect and potentially misinterpreting the 429, can we perform load-testing with the circuit breaking being done before authentication and after authentication to determine what type of impact this has on Fleet's scalability?
+1 I'd love to adjust the concurrency value (what should the value be) and merge this into master ASAP so we can test in cloud as we did with long polling
```ts
public async setup(core: CoreSetup, deps: IngestManagerSetupDeps) {
  const maxConcurrentRequests = 1;
```
All this logic was put here out of laziness; it should really be placed in another file...
@joshdover any input on the general approach and/or the limitations which I've explicitly noted?

What's the primary reason for executing this logic in the pre-auth stage rather than in the regular request handler? Are we trying to avoid executing some expensive logic, or are we just trying to register some behavior in a single place? If it's the latter, I would recommend implementing this as a wrapper or even a Proxy object around the

It was intended to address #67987 (comment). Authenticating incurs some overhead, so if we can short-circuit before performing the authentication, we minimize the load on Elasticsearch even more.

Thanks so much for taking a stab at this! I really appreciate the effort here. As @kobelb mentioned, it's critical that we can reject incoming traffic as cheaply as possible. It translates to saved $$ for our customers, making us able to offer a more competitive solution.

Thanks for getting #70775 together so quickly @restrry! @roncohen, is my understanding correct that Fleet's performance tests can't be done against feature branches? While I'd prefer to hold off on adding disparate

ccing @jfsiii and @scunningham: what do you think about testing this feature branch on a local or self-hosted GCP instance just to compare the difference this makes? e.g. try to enroll 7k-10k agents over a very short period and see how Kibana/Elasticsearch does in the branch vs. master? A simple "autocannon" test would also be sufficient. It's probably worth trying out a few different
@kobelb should we do a version/branch that doesn't add the

As long as it's feasible, that'd be great.

@elasticmachine merge upstream
💔 Build Failed

Failed CI Steps

Test Failures:
- Chrome X-Pack UI Functional Tests: x-pack/test/functional/apps/discover/feature_controls/discover_spaces.ts "discover feature controls spaces space with no features disabled shows discover navlink"
- Chrome X-Pack UI Functional Tests: x-pack/test/functional/apps/dev_tools/feature_controls/dev_tools_spaces.ts "console feature controls spaces space with no features disabled shows 'Dev Tools' navlink"
- Chrome X-Pack UI Functional Tests: x-pack/test/functional/apps/canvas/feature_controls/canvas_spaces.ts "Canvas app spaces feature controls space with no features disabled shows canvas navlink"

...and 13 more failures, only showing the first 3.

To update your PR or re-run it, just comment with:
@kobelb @roncohen just wanted to mention that I am working on this locally #67987 (comment). I'll push any changes/numbers to my fork and we can decide what to do from there.

I put a draft PR up at #71221 but it's having some issues. I'd love some additional 👀

@jfsiii I overlooked the fact that

I was wrong about this... Whether or not we should be using the completion of the

So we do have test coverage that explicitly verifies that the

Works by the design of the Observables. @joshdover yes, you can add
> kibana/src/core/server/http/router/request.ts, lines 188 to 191 at 159369b

I should have clarified. The aborted API is working by design, but using the completion as a signal that the request completed would be abusing the API. I agree we should add the completed event if it is necessary. @kobelb LMK if you would like a PR for that.

It looks like
In my testing, it seems like
Could you help me verify if this is true, and whether it has recently changed or was always the case?

I reread the discussion. Looks like it's wrong to use toPromise here. Sorry for the noise.
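The toPromise pitfall is easy to reproduce. In RxJS, toPromise resolves with the stream's last value only once the stream completes; a stream like aborted$ that emits but is never explicitly completed leaves that promise pending forever. A minimal stand-in (not rxjs; illustrative only):

```ts
// Minimal observable-like subject showing why a "resolve on complete"
// promise (toPromise-style) differs from "resolve on first emission".
class TinySubject<T> {
  private last: T | undefined;
  private emitted = false;
  private onComplete: Array<(v: T | undefined) => void> = [];
  private onFirst: Array<(v: T) => void> = [];

  next(value: T) {
    this.last = value;
    if (!this.emitted) {
      this.emitted = true;
      this.onFirst.forEach((fn) => fn(value));
    }
  }
  complete() { this.onComplete.forEach((fn) => fn(this.last)); }

  // Like rxjs toPromise(): settles only when the stream completes.
  lastValueOnComplete(): Promise<T | undefined> {
    return new Promise((res) => this.onComplete.push(res));
  }
  // Like rxjs firstValueFrom(): settles on the first emission.
  firstValue(): Promise<T> {
    return new Promise((res) => this.onFirst.push(res));
  }
}

const aborted$ = new TinySubject<string>();
let completedStyle: string | undefined;
let firstStyle: string | undefined;
aborted$.lastValueOnComplete().then((v) => (completedStyle = v as string));
aborted$.firstValue().then((v) => (firstStyle = v));
aborted$.next('aborted');
// no complete() call: mirrors a request event stream that emits on abort
// but is never explicitly completed, so the toPromise-style promise hangs
```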
This is a proof-of-concept for implementing concurrency controls for a maximum number of agents being connected to a single instance of Kibana. There are some deficiencies and open questions inline using PR comments.
Proof-of-concept for #67987