
Vault does not scale well with 2000+ nodes #237

Closed
rveznaver opened this issue Oct 24, 2016 · 12 comments

Hi,

This will be a long post, so brace yourselves :)
We have noticed that several scaling issues arise when we share a secret across more than 2000 nodes.

Firstly, the knife vault refresh takes ages and uses a lot of memory (since the searches return whole node objects). In addition, the search queries strain the SOLR engine on the Chef server to the point where we had to tune it to the max because it was running out of memory. This issue has already been somewhat addressed by #177 and #178. The latter may be improved by setting rows: 0 when we are only interested in the number of results (e.g. numresults = query.search(:node, "name:#{nodename}", filter_result: { name: ['name'] }, rows: 0)[2]).
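
For illustration, here is a minimal count-only search of the kind mentioned above, written against Chef::Search::Query as it is available inside knife plugins and chef-client runs; the node name is a placeholder:

    # A minimal sketch of a filtered, count-only node search. Assumes it runs
    # somewhere Chef::Config is already loaded (knife plugin or chef-client).
    require 'chef/search/query'

    nodename = 'node0001.example.org' # placeholder

    # filter_result trims each returned node object down to a single field,
    # and rows: 0 asks for no result rows at all; the third element of the
    # returned array is still the total number of matches.
    numresults = Chef::Search::Query.new.search(
      :node,
      "name:#{nodename}",
      filter_result: { name: ['name'] },
      rows: 0
    )[2]

    puts numresults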

Secondly, because of the way vault uses data bags, once a secret is shared, the _keys data bag item will contain an encrypted entry for each node. This results in very large data bag items that are wholly transferred over the network although a node requires only a single entry (the symmetric key encrypted by its own public key). For example, a 4K secret will have a 3.3M _keys item if shared across ~7500 nodes (sizes approximated using knife data bag show databag item_keys -Fjson). As one can imagine, several secrets may saturate the network on the Chef server.
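
To make the linear growth concrete, here is a rough sketch of the shape of a _keys item, assuming the usual chef-vault layout of id, admins, clients, and search_query fields plus one RSA-encrypted entry per client; all names and values are illustrative:

    # Abridged sketch of a chef-vault _keys data bag item: every client that
    # may read the secret gets its own copy of the shared symmetric key,
    # encrypted with that client's public key, so the item grows linearly
    # with the number of nodes (a few hundred bytes per entry).
    keys_item = {
      'id'           => 'mysecret_keys',
      'admins'       => ['alice'],
      'clients'      => ['node0001.example.org', 'node0002.example.org'], # ... ~7500 names
      'search_query' => 'role:base',
      'alice'                => '<base64 RSA-encrypted symmetric key>',
      'node0001.example.org' => '<base64 RSA-encrypted symmetric key>',
      'node0002.example.org' => '<base64 RSA-encrypted symmetric key>',
      # ... one entry per node, hence a multi-megabyte item at ~7500 nodes
    }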

These are the solutions I have thought about so far:

  1. Data bag item per node
    We could split the _keys item per node in two ways, either:
    • one key item per node, containing all the encrypted symmetric keys that node needs
      This would enable us to get all keys for a given node with a single query to the Chef Server (bear in mind I'm talking about keys, not secrets; those would remain unchanged and would take a couple more queries). However, to fully optimise this solution, the Chef client would have to implement a caching mechanism whereby it fetches the key item at the beginning of a run and hits the cache unless it cannot decrypt a secret (a known race when a node tries to decrypt before the keys item is refreshed) or it wants to create/update a secret.
    • multiple key items per node, each containing a single encrypted symmetric key
      This would be simpler to implement, but it would create a lot of data bag items and more requests to the Chef Server. However, given that the items would be requested directly (i.e. not using wildcard searches), I do not think it would put much strain on the SOLR engine. I am less sure about the sheer number of data bag items (one secret would create 7500 + 1 items in our case). A rough sketch of the client-side lookup for this variant follows the list.
  2. Get rid of _keys items and add features to the Chef Server
    Instead of creating key items, we could encrypt/decrypt the secret with the Chef Server's private/public key pair and rely on ACLs for authorisation and HTTPS for encryption. A given node would request an encrypted data bag item, the Chef Server would authorise it based on ACLs, decrypt the secret, and respond with the plaintext. The secret would still be encrypted in transit since the Chef client connects over HTTPS, but the Chef Server would have to decrypt it on every request. This would simplify the vault implementation and completely solve the slow refresh described above; however, it would require additional work on the Chef Server side and would effectively allow the Chef Server to decrypt all secrets (which is currently not the case). I am not certain whether that could be avoided.
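
As referenced under option 1(b), here is a minimal sketch of the client-side lookup, assuming a hypothetical <item>_key_<node> naming scheme with a fallback to the existing monolithic _keys item:

    # Sketch of option 1(b) on the read side: fetch a per-node key item by id
    # (no SOLR search, constant-size payload) and fall back to the monolithic
    # _keys item if it does not exist. The "mysecret_key_<node>" naming scheme
    # is hypothetical, not an existing chef-vault convention. Assumes it runs
    # inside a chef-client run where Chef::Config is populated.
    require 'chef/data_bag_item'

    node_name = Chef::Config[:node_name]

    begin
      keys = Chef::DataBagItem.load('secrets', "mysecret_key_#{node_name}")
    rescue Net::HTTPServerException
      # 404: this vault still uses the single _keys item holding every client.
      keys = Chef::DataBagItem.load('secrets', 'mysecret_keys')
    end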

Questions, comments, and feedback are more than welcome!


rveznaver commented Oct 24, 2016

Pinging @thommay as we have already discussed this issue at the Chef Community Summit in London


thommay commented Oct 27, 2016

cc @stevendanna since he was part of this conversation in London

thommay pushed a commit that referenced this issue Nov 4, 2016
This is based on @tfheen's work in #178, and a suggestion from
@rveznaver in #237

Signed-off-by: Thom May <[email protected]>
thommay pushed a commit that referenced this issue Nov 7, 2016
This is based on @tfheen's work in #178, and a suggestion from
@rveznaver in #237

Signed-off-by: Thom May <[email protected]>

vinyar commented Nov 11, 2016

There is another limitation I've encountered at a customer.

The default settings on the Chef server side limit JSON request bodies to around 1.5 MB, so their vaults could not even process delete commands; we had to fall back to knife data bag delete to get rid of the old clients.
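
For anyone hitting the same ceiling: the maximum accepted request body size can be raised in /etc/opscode/chef-server.rb followed by a chef-server-ctl reconfigure. Defaults vary between Chef Server releases, so the value below is purely illustrative:

    # /etc/opscode/chef-server.rb -- raise the maximum request body size (bytes)
    # so oversized _keys items can still be updated or deleted. 4_000_000 is an
    # arbitrary illustrative value; size it to your largest vault item.
    opscode_erchef['max_request_size'] = 4_000_000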


thommay commented Nov 11, 2016

I think there's no reason not to do 1(b) right now, even though we might want to contemplate doing something else later. My preference would be that for small numbers of clients - say fewer than 250 - we keep the current design, but for anything larger we go to one item per client - foo_key_nodename - with foo_keys storing the metadata. On the client, we'd simply try to fetch foo_key_nodename first and fall back to foo_keys if that was not successful.
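
A rough sketch of that write-side decision; the threshold, item-name patterns, and helpers are illustrative only and are not the chef-vault implementation:

    # Sketch: pick the key-item layout based on how many clients share the
    # secret. Threshold and item names are hypothetical, not chef-vault's code.
    require 'base64'
    require 'openssl'
    require 'chef/data_bag_item'

    SPARSE_THRESHOLD = 250

    # clients maps client names to their OpenSSL::PKey::RSA public keys.
    def save_keys(data_bag, item_name, clients, shared_secret)
      encrypt = ->(pubkey) { Base64.strict_encode64(pubkey.public_encrypt(shared_secret)) }
      save = lambda do |raw|
        item = Chef::DataBagItem.new
        item.data_bag(data_bag)
        item.raw_data = raw
        item.save
      end

      metadata = { 'id' => "#{item_name}_keys", 'clients' => clients.keys }

      if clients.size < SPARSE_THRESHOLD
        # Small vault: keep every encrypted key in the single _keys item.
        clients.each { |name, pubkey| metadata[name] = encrypt.call(pubkey) }
      else
        # Large vault: one tiny item per client, so each node only ever
        # downloads its own encrypted copy of the symmetric key.
        clients.each do |name, pubkey|
          save.call('id' => "#{item_name}_key_#{name}", name => encrypt.call(pubkey))
        end
      end
      save.call(metadata)
    end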


thommay commented Nov 11, 2016

If we turned on request pipelining for data bag items in Chef, falling back would be very cheap, and it would presumably give a fairly respectable speed increase for most knife vault operations as well.


thommay commented Jan 24, 2017

So with the merge of the above PRs (#246 and #252), is there anything else we need to do here?

@rveznaver

Test a couple of secrets with a large number of nodes to see if it scales properly.
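
One rough way to check, sketched against the chef-vault gem's ChefVault::Item API; the vault and item names are placeholders, and a working knife/chef-client configuration is assumed:

    # Rough timing sketch: load a vault item shared with many nodes and see
    # how long a refresh takes. Vault/item names are placeholders.
    require 'benchmark'
    require 'chef-vault'

    item = ChefVault::Item.load('secrets', 'mysecret')
    puts Benchmark.measure { item.refresh }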


btm commented Aug 23, 2017

Did anyone do that testing?


btm commented Aug 24, 2017

I'm closing this as done. If there are remaining edge cases and you end up here, please speak up.

btm closed this as completed Aug 24, 2017

btm commented Aug 24, 2017

I believe this was fixed by sparse mode and released in Chef Vault 3.1, which is included in ChefDK 2.


kamaradclimber commented Aug 24, 2017 via email

@josephmilla

Any plans to support converting vault items from "default" mode to "sparse" mode?
