Kubernetes secret provider: Caching it to avoid performance issues #3594
Comments
Duplicate of elastic/beats#3442 |
First thoughts: Caching the result by default is not as straightforward a solution as it seemed. To cache the results we need a way to update them. There are several ways to do that, but most of them (TTL, for example) might lead to incorrect results if the cache is not updated in time. One way that seems efficient is to use watchers, but that will have an impact on memory. We have seen this in other cases. So even though there is a way to cache the results, we have two solutions that don't seem ideal:
Now we need to decide which of option 1 and option 2 is the lesser evil. |
Do we know why the watchers in that case increased memory, and can we confirm for sure that it will have the same effect here? If we do know why, can we fix it? Watching the secret keys seems like the ideal solution here. There is an increasing number of support cases where we are making problematic amounts of calls to the k8s API, so trying to reduce them seems better in the long term. |
We have some SDHs related to that, and an issue as well. Since we would be taking the same approach here, it seems certain that memory would be affected. @axw Could you please share your input? Should we try to go for this solution and see the results? If they are badly received (many SDHs about memory), maybe we undo this decision? |
@constanca-m I don't have a lot of insight to provide here, but I tend to agree with @cmacknz that watchers sound like the right approach. I think we should first:
Assuming there's no fundamental reason not to use watchers here, then let's go ahead with that but also perform testing & memory profiling. We should probably perform a test in a cluster with many secrets, to make sure memory usage is only proportional to the secrets that are actually referenced. Perhaps the memory issue is related to caching things that will never be accessed? In which case, perhaps we could combine the current approach with a watcher, and only cache/update secrets that are requested. Something like synchronously fetch a secret on first use (if it's not cached), and then keep it up to date with a watcher. WDYT @gizas? |
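The "fetch on first use, then keep it up to date with a watcher" idea above could look roughly like the following. This is only a sketch under assumptions: `lazySecretCache`, `fetchFunc` and `OnUpdate` are hypothetical names, not part of elastic-agent, and `fetchFunc` stands in for whatever call reaches the Kubernetes API server.

```go
package main

import "sync"

// fetchFunc is a hypothetical stand-in for a request to the Kubernetes API server.
type fetchFunc func(key string) (string, error)

// lazySecretCache only stores secrets that have been requested at least once.
type lazySecretCache struct {
	mu     sync.Mutex
	values map[string]string
	fetch  fetchFunc
}

func newLazySecretCache(fetch fetchFunc) *lazySecretCache {
	return &lazySecretCache{values: make(map[string]string), fetch: fetch}
}

// Get returns the cached value and fetches synchronously on first use.
// The lock is held during the fetch to keep the sketch simple.
func (c *lazySecretCache) Get(key string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.values[key]; ok {
		return v, nil
	}
	v, err := c.fetch(key)
	if err != nil {
		return "", err
	}
	c.values[key] = v
	return v, nil
}

// OnUpdate is what a watcher callback would call when a secret changes.
// Keys that were never requested are ignored, so the cache grows only
// with the secrets that are actually referenced.
func (c *lazySecretCache) OnUpdate(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.values[key]; ok {
		c.values[key] = value
	}
}
```

Note that this only bounds the memory of the cache itself. Whether the watcher holds its own copy of every secret in the namespace is a separate question that would still need profiling.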
This is the issue we found in our previous analysis: kubernetes/client-go#871. By using watchers we will indeed see a memory increase; this is why we use the cache. But we still have not verified whether the amount of increase is justified. For this issue I am thinking we can provide an on/off option so users can choose whether to use it. |
I like this option. I will implement this approach as the default then, and add the new option if the user does not want to use it.
I am unsure how the testing would be done, I need to do more research on that. |
I ran some tests and I realized that we fetch the secret for multiple resources at the same time. However, we don't update the secret value immediately after it has been updated. This means that sometimes we keep using a secret value that is no longer up to date and we get incorrect values. So based on this, what approach is best? Option 1
Option 2
@gizas @axw Which one seems the best? |
@constanca-m could you please elaborate on this? When you say it's not updated immediately, do you mean that it's never updated? Or there's a delay? If there's a delay, how long is it and what causes it? Watchers still feel right to me, but if there's an existing consistency issue then I guess I'd be OK with a quick fix to add an LRU cache with an option to disable. Unrelated to this issue specifically, I went to see what opentelemetry-collector-contrib is doing. I found open-telemetry/opentelemetry-collector-contrib#23067, and open-telemetry/opentelemetry-collector-contrib#23226 which was opened to help reduce memory usage. May be a good source of inspiration if we need to go down the watcher/informer route. |
It is updated. I tried to see the update of a secret by adding a field to a data stream like this:

processors:
  - add_fields:
      target: my-new-fields
      fields:
        name: my-secret-field
        value: ${kubernetes_secrets.default.my-secret-name.file}

And what I notice is that we would still receive documents for a few minutes with the old secret value, but after some time it gets updated. I thought it was something periodic, but from my tests the period was not consistent - it could be 3, 5 or 7 minutes sometimes. I tried to follow the code to see where the function to update was being called, but it leads to a dead end (it leads to the functions of this file and then nowhere else). @gizas, as you have more experience, do you know where this update is being triggered? |
The workflow that I have in my mind is:
You probably have it already; this is the initial PR for https://github.com/elastic/beats/pull/24789/files. But to our issue now:
So it seems that all of this is event based, and we need a relevant watch event that will perform a lookup and call, e.g., https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/transpiler/ast.go#L438. @constanca-m this is my theory for not having consistent times
@constanca-m what will be the changes for Option 1?
@axw I will try to use this in combination with issue 3417, which can be part of the wider meta issue.
|
The old workflow was in this README file. This is a very old file that even points to the beats repository. I can find most of the new steps in the elastic agent repository, and I can see the whole workflow until the kubernetes secret provider starts running. Like you said:
I am thinking for option 1:
We need to register the time so we can check when to update and fetch the secret again. Let's say we want to have values that are never older than 5 minutes.
So for this option we would now have a variable such as The problem: where is this fetch function being called? Because like you linked, I cannot find usages for the
I also think the watchers would carry the problem that we would be trying to check more secrets than we actually need. So if we only want to fetch one secret, and we have 10 more in the same namespace and node, we could be watching them as well for no reason. |
+1 for secret_time then! I like it better (can we just call it ttl? :) ) @axw what do you think? |
Of course, I just used the first name that came to mind |
I opened the PR for this issue: #3822 |
Sounds like a reasonable approach. One other thing to consider: what happens if the secret is not cached locally, and there's also no matching secret in Kubernetes? Should we cache that knowledge? Otherwise every attempt to fetch will trigger a request to the API server. |
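One way to cache the knowledge that a secret does not exist, sketched under assumptions: `negativeCache`, `lookupResult` and `errSecretNotFound` are hypothetical names, and the `fetch` function is assumed to return `errSecretNotFound` when the API server reports no matching secret. The sketch is not safe for concurrent use and leaves out expiry.

```go
package main

import "errors"

// errSecretNotFound is a hypothetical sentinel returned by fetch
// when no matching secret exists in Kubernetes.
var errSecretNotFound = errors.New("secret not found")

// lookupResult records either a value or the fact that the secret is missing.
type lookupResult struct {
	value string
	found bool
}

// negativeCache remembers both successful and "not found" lookups.
type negativeCache struct {
	results map[string]lookupResult
	fetch   func(key string) (string, error)
}

func newNegativeCache(fetch func(string) (string, error)) *negativeCache {
	return &negativeCache{results: make(map[string]lookupResult), fetch: fetch}
}

// Get returns the value and whether the secret exists. A missing secret
// is cached too, so repeated lookups do not reach the API server.
func (c *negativeCache) Get(key string) (string, bool, error) {
	if r, ok := c.results[key]; ok {
		return r.value, r.found, nil
	}
	v, err := c.fetch(key)
	switch {
	case errors.Is(err, errSecretNotFound):
		c.results[key] = lookupResult{found: false}
		return "", false, nil
	case err != nil:
		// Other errors may be transient, so they are not cached.
		return "", false, err
	}
	c.results[key] = lookupResult{value: v, found: true}
	return v, true, nil
}
```

Negative entries would need to expire as well (for example with the same ttl), otherwise a secret created after the first failed lookup would never be picked up.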
Issue
This issue was first mentioned in #3442.
When we use a kubernetes secret provider, we need to get its result every time:
elastic-agent/internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
Line 80 in 69cc860
This is not ideal, because we could be just fetching the result one time and using it for the future, instead of repeating the same steps every time.
We need to implement a way to reduce the amount of requests to the API server for this by caching the result.
To do