Access pattern observation in keyspace ("pagetrace") #10275
Comments
In past systems, we had an API endpoint that let us temporarily enable debug/trace logging at runtime for specific source code files, with regex filtering on the output. For example, we could enable trace logging for the getpage handler and regex-filter by tenant/shard to dump keys for 30 seconds. This might be a simple and general solution, if our logging/tracing library supports it.
Yeah, this should evolve into something with an API for toggling tracing per tenant (we may even have an issue for that somewhere). However, we use Grafana for logs, and that doesn't cope well with passing around big dumps. If we want to get a dump of something like 100K keys and then visualize it somehow, we'll probably need to output those some other way (or embrace some other system for recording results that works better than Loki).
Aside: my favorite one of these was EMC Isilon, where you could subscribe to performance metrics on a particular directory in a filesystem, good times.
Yeah, these debug events would be emitted via the API endpoint response as a stream, not via the regular log sink.
The
Wdyt?
I'm a little anxious about using trace+regex here, the overhead could be substantial, and we'll probably be using this in situations where we already have a performance problem. I was thinking about maybe something designed for minimum cost, like:
Discussed offline. The performance risks of a generalized tracing endpoint appear too big for us to ship something to production for debugging in a matter of days. We'll do the simple, performant thing for now: add an API endpoint that registers a fixed-size channel for a timeline, and emits compact binary data to the client via HTTP.
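For illustration only, here is a minimal sketch of the "fixed-size channel plus compact binary output" idea, using only the Rust standard library. All names (`PageTraceEvent`, `PageTracer`, `register_tracer`, `drain_to_bytes`) and the 24-byte record layout are assumptions made for this example, not the actual pageserver types or wire format, and the HTTP layer is left out entirely.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical compact record for one getpage access.
/// The 16-byte key width is an assumption for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PageTraceEvent {
    pub key: [u8; 16],
    pub lsn: u64,
}

impl PageTraceEvent {
    /// 16 bytes of key followed by 8 bytes of little-endian LSN.
    pub const ENCODED_LEN: usize = 24;

    /// Appends the fixed-width binary encoding of this event to `out`.
    pub fn encode(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.key);
        out.extend_from_slice(&self.lsn.to_le_bytes());
    }
}

/// Sender half, held by the (hypothetical) getpage path for one timeline.
pub struct PageTracer {
    tx: SyncSender<PageTraceEvent>,
    /// Number of events dropped because the buffer was full or the reader left.
    pub dropped: u64,
}

impl PageTracer {
    /// Records an event without ever blocking the request path.
    /// Returns false (and counts a drop) if the event could not be queued.
    pub fn record(&mut self, ev: PageTraceEvent) -> bool {
        match self.tx.try_send(ev) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                self.dropped += 1;
                false
            }
        }
    }
}

/// Registers a tracer backed by a bounded channel of `capacity` events.
pub fn register_tracer(capacity: usize) -> (PageTracer, Receiver<PageTraceEvent>) {
    let (tx, rx) = sync_channel(capacity);
    (PageTracer { tx, dropped: 0 }, rx)
}

/// Drains whatever is currently buffered into one binary chunk,
/// which an HTTP handler could then write to the response body.
pub fn drain_to_bytes(rx: &Receiver<PageTraceEvent>) -> Vec<u8> {
    let mut body = Vec::new();
    for ev in rx.try_iter() {
        ev.encode(&mut body);
    }
    body
}
```

The point of `try_send` on a bounded channel is that a slow or absent client costs the getpage path at most one failed enqueue per request: events are dropped and counted rather than queued without limit.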
In INC-362 we saw strong signals that the client (compute) was getting something wrong with caching: we suspect it is re-requesting the same data repeatedly, but can't prove it.
To diagnose issues like this, we need an ability to get a raw dump of the keys touched by getpage requests.
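As a rough sketch of how such a dump could be used to confirm or rule out the re-request theory, the function below counts how often each key appears and reports the ones seen more than once. The `Key` type (a fixed 16-byte array) and the `repeated_keys` helper are hypothetical and exist only for this example; a real dump would first need decoding from whatever format the endpoint emits.

```rust
use std::collections::HashMap;

/// Hypothetical fixed-width key representation, assumed for this sketch.
pub type Key = [u8; 16];

/// Returns (key, count) pairs for keys that appear more than once in the dump,
/// most frequently requested first (ties broken by key order).
pub fn repeated_keys(dump: &[Key]) -> Vec<(Key, usize)> {
    let mut counts: HashMap<Key, usize> = HashMap::new();
    for key in dump {
        *counts.entry(*key).or_insert(0) += 1;
    }
    let mut repeats: Vec<(Key, usize)> = counts
        .into_iter()
        .filter(|(_, n)| *n > 1)
        .collect();
    repeats.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    repeats
}
```

A dump in which a small set of keys accounts for most of the requests would be consistent with the client re-fetching pages it should have cached.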
Candidate impls: