Vault does not scale well with 2000+ nodes #237
Comments
Pinging @thommay as we have already discussed this issue at the Chef Community Summit in London.
cc @stevendanna since he was part of this conversation in London.
This is based on @tfheen's work in #178, and a suggestion from @rveznaver in #237 Signed-off-by: Thom May <[email protected]>
So, there is another limitation I've encountered at a customer: the default settings for JSON on the Chef Server side limit payloads to around 1.5 MB, so their vaults were not processing even delete commands. Had to …
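In case it helps anyone who lands here with the same symptom: assuming the ceiling being hit is erchef's request-size limit, a commonly used workaround is raising it in `/etc/opscode/chef-server.rb` and reconfiguring. The value below is purely illustrative.

```ruby
# /etc/opscode/chef-server.rb
# Assumption: the ~1.5 MB JSON ceiling mentioned above corresponds to erchef's
# request-size limit; the value is in bytes and purely illustrative.
opscode_erchef['max_request_size'] = 4_000_000
# Apply with: chef-server-ctl reconfigure
```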
I think there's no reason not to do 1(b) right now, even though we might want to contemplate doing something else later. I think my preference would be that for small numbers of clients - say less than 250 - we keep the same design as currently, but for any larger bags than that we go to one item per client - …
If we turned on request pipelining for data bag items in Chef it would be very cheap to fall back, too; and it would presumably give a fairly respectable speed increase when doing most …
Test a couple of secrets with a large number of nodes to see if it scales properly.
Did anyone do that testing?
I'm closing this as done. If there are remaining edge cases and you end up here, please speak up.
I believe this was fixed by sparse mode and released in Chef Vault 3.1, which is included in ChefDK 2.
I have deployed a secret in our environment to validate that refresh works in practice but have not gone further. I'd be glad to support anyone going further.
Any plans to support conversion of vault items in "default" mode into "sparse" mode?
Hi,
This will be a long post, so brace yourselves :)
We have noticed that several scaling issues arise when we share a secret across more than 2000 nodes.
Firstly, the `knife vault refresh` takes ages and uses a lot of memory (since the searches return whole node objects). In addition, the search queries strain the SOLR engine on the Chef server to the point where we had to tune it to the max because it was running out of memory. This issue has already been somewhat addressed by #177 and #178. The latter may be improved by setting `rows: 0` when we are only interested in the number of results (e.g. `numresults = query.search(:node, "name:#{nodename}", filter_result: { name: ['name'] }, rows: 0)[2]`).
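For illustration, here is a minimal, self-contained sketch of that count-only search. It assumes it runs inside a configured Chef client/knife context, and the node name is made up:

```ruby
require 'chef/search/query'

# Count nodes matching a name without pulling whole node objects back:
# filter_result keeps only the 'name' attribute and rows: 0 asks the server
# for zero result rows, while the third element of the returned triple
# [results, start, total] still carries the total number of matches.
query      = Chef::Search::Query.new
nodename   = 'client-0001.example.org' # hypothetical node name
numresults = query.search(:node,
                          "name:#{nodename}",
                          filter_result: { name: ['name'] },
                          rows: 0)[2]
puts "#{nodename}: #{numresults} match(es)"
```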
Secondly, because of the way vault uses data bags, once a secret is shared, the `_keys` data bag item will contain an encrypted entry for each node. This results in very large data bag items that are wholly transferred over the network although a node requires only a single entry (the symmetric key encrypted by its own public key). For example, a 4K secret will have a 3.3M `_keys` item if shared across ~7500 nodes (sizes approximated using `knife data bag show databag item_keys -Fjson`).
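To make that growth concrete, here is a rough sketch of the shape such a companion `_keys` item takes. The field names follow chef-vault's usual layout as far as I remember it; the node names and values are made up:

```ruby
# Illustrative shape of a vault's companion _keys data bag item (not dumped
# from a real server). The shared symmetric key is stored once per client and
# admin, encrypted with that entity's public key, so the item grows linearly
# with the number of nodes the secret is shared with.
keys_item = {
  'id'           => 'item_keys',
  'admins'       => ['alice'],
  'clients'      => ['node0001.example.org', 'node0002.example.org'], # ... ~7500 names
  'search_query' => 'role:webserver',
  'node0001.example.org' => '<base64 RSA-encrypted copy of the shared key>',
  'node0002.example.org' => '<base64 RSA-encrypted copy of the shared key>'
  # ... one such entry for every client and admin listed above
}
```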
As one can imagine, several secrets may saturate the network on the Chef server.

These are the solutions I have thought about so far:
1. We could split the `_keys` item per node in two ways, either:

   (a) one keys item per node, holding that node's encrypted key for every secret it has access to. This would enable us to get all keys for a given node with a single query to the Chef Server (bear in mind I'm talking about keys, not secrets; those would remain unchanged and would take a couple more queries). However, to fully optimise this solution the Chef client would have to implement a caching mechanism whereby it would fetch the key item at the beginning of a run and hit the cache unless it cannot decrypt the secret (a known race issue when a node wants to decrypt before the keys item is refreshed) or it wants to create/update a secret.

   (b) one keys item per node per secret (see the fetch sketch after this list). This would be simpler to implement; however, it would create a lot of data bag items and more queries to the Chef Server. That said, given that the items would be requested directly (i.e. not using wildcard searches), I do not think it would create that much of an issue for the SOLR engine. I'm not sure about the increased number of data bag items (one secret would create 7500 + 1 items in our case).
2. Instead of creating key items, we could encrypt/decrypt the secret with the Chef Server's private/public key pair and rely on ACLs for authorisation and HTTPS for encryption. So a given node would request an encrypted data bag item, the Chef Server would authorise it depending on ACLs, decrypt the secret, and respond with the decrypted secret. It would still be encrypted in transit since the Chef client connects over HTTPS; however, it would require the Chef Server to decrypt the secret on each request. This would simplify the vault implementation and completely solve the aforementioned slow refresh issue, but it would require additional work on the Chef Server side and would effectively allow the Chef Server to decrypt all secrets (which is currently not the case). I am not certain whether this could be avoided.
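For what it's worth, here is a minimal client-side sketch of what option 1(b) could look like. The data bag name, the per-node item naming scheme, and the `key` field are purely illustrative assumptions, not chef-vault's actual implementation:

```ruby
require 'chef/config'
require 'chef/data_bag_item'

# Option 1(b), sketched: one key item per node per secret, fetched directly by
# name. No search is involved and only this node's RSA-encrypted copy of the
# shared key crosses the wire, instead of the whole multi-megabyte _keys item.
node_name     = Chef::Config[:node_name]       # e.g. "node0001.example.org"
per_node_item = "mysecret_keys_#{node_name}"   # hypothetical naming scheme
item = Chef::DataBagItem.load('secrets', per_node_item)
encrypted_shared_key = item['key']             # hypothetical field name
```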
Questions, comments, and feedback are more than welcome!