2.0.6 to 2.2 alpha upgrade issue with DynamoDB backend #1050
Comments
Awesome thanks. If there is anything I can do to further help troubleshoot, please let me know. I can downgrade, upgrade, clear out databases as necessary. I've been a little stuck as to the best way to troubleshoot as I haven't had a clear way to reproduce. |
@ekristen thanks for the help! we're about to push 2.2 out the door, so this issue is our highest priority right now. |
I'll try and dupe this tonight then, see if I can't gather any additional info.
On Jun 7, 2017, at 15:49, Ev Kontsevoy wrote:
@ekristen thanks for the help! we're about to push 2.2 out the door, so this issue is our highest priority right now.
|
@ekristen I tried reproducing this issue but could not. I created a cluster that had two Auth Servers pointing to the same DynamoDB table running Teleport 2.0.6. I then upgraded the first Auth Server to Teleport 2.2.0-beta.1 and added a Node running Teleport 2.2.0-beta.1 successfully. I then upgraded the second Auth Server to Teleport 2.2.0-beta.1 and added a Node running Teleport 2.2.0-beta.1 successfully. Can you provide the following:
If I can't reproduce the issue, would you be willing to provide access to a staging environment that I can connect to for debugging? |
Ugh, once again I cannot dupe it either, but it definitely is happening. I just can't seem to control the conditions under which it does; at some point I lose the ability to register a new node with the cluster, with the error I've provided. Perhaps some sort of random TTL value is causing keys in dynamodb to expire at some point, thus triggering that error. Question: Is there a good way to determine what values in dynamodb would need to be present or missing to trigger the error I've mentioned? FWIW the steps I've taken are as follows ...
After all those steps, everything is still working, but that's not abnormal. This seems to only happen after a random amount of time, thus my question earlier about expiring keys and what would trigger the error I'm seeing. FWIW: the new feature you added that brings the roles down from the primary cluster to trusted clusters is working very well, I've been testing it out. Well done. Thanks! |
I started digging around the data, I saw this in dynamodb for my primary cluster {"kind":"cert_authority","version":"v2","metadata":{"name":"primary-cluster","namespace":"default","expires":"0001-01-01T00:00:00Z"}, "spec": ["hidden"]} However on the auth nodes in the cache I see at path {"kind":"cert_authority","version":"v2","metadata":{"name":"primary-cluster","namespace":"default","expires":"2017-06-09T09:38:31.088768281Z"},"spec": ["hidden"]} Does the cache get renewed? Maybe this is part of the problem? I'm just trying to understand how that error occurs in the first place, I think that'd be the best way of tracking this down. If you can help me understand that I think I can help narrow this down more. Thank you. |
@ekristen What you are seeing in DynamoDB for the expires value is a zero value, meaning that the key will not expire. What you are seeing in |
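To make the zero-value point concrete, here is a minimal Python sketch that classifies the two `metadata.expires` values quoted above. It assumes only what the thread states: `0001-01-01T00:00:00Z` means "never expires", and any other value is an RFC 3339 timestamp. The helper names `parse_expires` and `describe` are hypothetical and are not part of Teleport.

```python
import json
from datetime import datetime, timezone

# Zero timestamp as it appears in the DynamoDB record quoted in this thread.
ZERO_TIME = "0001-01-01T00:00:00Z"

def parse_expires(value):
    """Return None for the zero value (never expires), else an aware UTC datetime."""
    if value == ZERO_TIME:
        return None
    head, _, frac = value.rstrip("Z").partition(".")
    # The cached value carries nanoseconds; Python datetimes keep microseconds.
    frac = (frac + "000000")[:6]
    parsed = datetime.strptime(f"{head}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)

def describe(record_json, now):
    """Classify a stored record as 'never expires', 'valid', or 'expired' at time `now`."""
    record = json.loads(record_json)
    expires = parse_expires(record["metadata"]["expires"])
    if expires is None:
        return "never expires"
    return "expired" if expires <= now else "valid"
```

Run against the two records above, the DynamoDB copy is classified as never expiring, while the cached copy becomes "expired" once `now` passes 2017-06-09T09:38:31Z. That matches the explanation that only the cache entry carries a real expiry.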
@russjones can you determine what has to be missing for the error |
@ekristen It looks like the certificate authority has lost its private keys. It's strange because Teleport creates it with a TTL of 0 (so it never expires), so that should never happen. However, that TTL only applies to Teleport expiring the key. That's why we asked about DynamoDB settings: do you have any custom settings that would expire keys at the DynamoDB level? Because as far as I know, this does not occur with any other backend, correct? |
Sorry for not including that information the first time, I forgot.
I'm going to keep a close eye on this, and upgrade to the next version once released. I've dumped a working copy of the teleport database, so if this happens again I'll be able to compare differences as well as see if restoration of changed keys allows it to function again. |
@ekristen @russjones I am removing P0 label and removing this from the upcoming 2.2 milestone, but keeping it open in case you find a way to reproduce this. |
@ekristen This issue should be resolved now. If you see it again ping me and I'll investigate further. |
It's back. I haven't added any new nodes in a while and now that I'm trying to add a new one, I'm getting the error again. @russjones @kontsevoy |
So I've been saving backups of the dynamodb, it seems that the entry for Upon further inspection the newer |
I replaced the value Somewhere along the line when the dynamodb entry gets updated it's losing the signing_keys array. |
@russjones ping. |
@ekristen Two questions:
|
|
@ekristen You just started the cluster with a static token and after that lost the cluster keys? |
The authority key in the dynamodb table, 'teleport/authorities/host/my-cluster-name', was/is missing the signing_keys array. I had backed up all the keys after the original setup, following the last time this happened. I opened the backup, found the same key, and confirmed that its value had the signing_keys array and that the rest of the info matched. I then replaced the value in dynamodb with the original value, generated a node key, and told my new server to confirm itself to the authority server, and it worked. |
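The backup comparison described above could be scripted roughly as follows. This is only a sketch under stated assumptions: both the backup and the live table are taken to be already exported into Python dicts mapping key path to the stored JSON string, and `signing_keys` is taken to sit directly under `spec`, as the thread describes. The function names and the dump format are hypothetical; nothing here calls Teleport or DynamoDB.

```python
import json

def _spec(value):
    """Parse a stored JSON value and return its spec as a dict (empty if absent or not an object)."""
    spec = json.loads(value).get("spec")
    return spec if isinstance(spec, dict) else {}

def lost_signing_keys(backup, current):
    """Return the sorted key paths whose backup has spec.signing_keys but whose current value does not."""
    lost = []
    for key, old_value in backup.items():
        if key not in current:
            continue  # a deleted entry is a different failure than a stripped one
        had_keys = bool(_spec(old_value).get("signing_keys"))
        has_keys = bool(_spec(current[key]).get("signing_keys"))
        if had_keys and not has_keys:
            lost.append(key)
    return sorted(lost)

# Illustrative data only; real entries hold key material, shown here as a placeholder.
backup = {
    "teleport/authorities/host/my-cluster-name": json.dumps(
        {"kind": "cert_authority", "spec": {"signing_keys": ["PLACEHOLDER"]}}
    ),
}
current = {
    "teleport/authorities/host/my-cluster-name": json.dumps(
        {"kind": "cert_authority", "spec": {}}
    ),
}
print(lost_signing_keys(backup, current))
```

Running a check like this on a schedule against periodic dumps would narrow down when the array disappears, which is the piece of information the thread has been missing.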
@russjones any ideas? This case is so strange.... there's literally no place in Teleport where private keys are ever deleted. I doubt we even have code for this, i.e. we wouldn't be able to do it even if we tried. Can this be a case where there's a rogue teleport daemon (of the old version) somewhere trying to "upgrade" itself? |
The only 3 instances that have access to the dynamodb table are all the same. Since replacing the key's value back last week, it has not been "updated" again. I did notice in the code that the structure omits the signing_keys field when empty when converting from JSON. Is it possible that at some point Go thinks that signing_keys is empty and omits it when converting? |
It happened again. Basically the entry lost the signing_keys array in the dynamodb. Version 2.2.0 |
Everything about the JSON data is the same in terms of all the properties and values except that under the spec object, I replaced the value with the backup I have and of course it works again, but there's definitely a bug somewhere. I just do not know where. |
@ekristen @russjones we believe this is a naming conflict. are you using trusted clusters? It's possible that a remote cluster's public key is overwriting the local cluster named similarly. Russell here can assist with upgrading to 2.3 (where this issue has been resolved). |
I am using trusted clusters, but they all have separate names. As long as I don't lose current capabilities, I'll gladly upgrade to try and fix this and/or to help troubleshoot further. |
@ekristen In Teleport 2.3 we did an overhaul of the Trusted Cluster state machine and we uncovered an issue where if you had clusters with the same name you could wipe out the signing keys of another cluster. That's been resolved now. Do you know what triggers this behavior? Is it just time or do you perform some administrative action on Teleport which then triggers the loss of the signing key? |
@russjones I haven't found any specific trigger. It seems to be time as best as I can figure. I can't rule out trusted clusters since I'm using them, but I only have 2 and they are named differently. I'd be happy to upgrade to 2.3. I know you've started moving enterprise based code out of the public repo, am I going to lose any current capabilities by upgrading? |
@ekristen Okay, as long as all clusters (your main cluster and all trusted clusters) have different names, you probably are not hitting that particular issue. What features of Teleport are you using? The main Enterprise only features are external identity providers (SAML/OIDC) and RBAC. |
What is included in RBAC? I'm using trusted clusters + the role mapping to allow a role on the primary cluster to be mapped down to a role on the trusted clusters. |
@ekristen Starting with Teleport 2.3 all users will have the same role in both the main cluster and the trusted cluster |
So does RBAC control the roles then? |
Yes, Teleport essentially has no RBAC, all users operate as |
Originally reported by @ekristen in #896 (in comments at the bottom):
Well I upgraded to alpha8 and now I cannot add any more nodes. Getting the cluster has no signing keys. There seems to be something with upgrading to a new version using dynamodb that breaks everything.
Logs:
I cannot since I've already upgraded. However I was on 2.0.6 and now I am on Teleport v2.2.0-alpha.8 git:v2.1.0-alpha.6-43-g14cf169d-dirty
I can tell you that I've now seen this happen multiple times across multiple versions. I am using dynamodb as a backend. I attempted to replicate it using dir mode only and a single auth server and was unable to.
After that I went back to using dynamodb with multiple auth servers. However, I only use 1 when registering nodes, so while the other 2 are running the auth service, nothing is talking to them.
Everything seemed great for a while, I was able to add nodes and this bug seemed to not be present anymore until I upgraded to alpha8 and I completely lost the ability to register nodes again.