
2.0.6 to 2.2 alpha upgrade issue with DynamoDB backend #1050

Closed
kontsevoy opened this issue Jun 7, 2017 · 33 comments · Fixed by #1108

@kontsevoy
Contributor

Originally reported by @ekristen in #896 (in comments at the bottom):

Well, I upgraded to alpha8 and now I cannot add any more nodes; I'm getting "cluster has no signing keys". Something about upgrading to a new version while using DynamoDB breaks everything.

Logs:

level=warning
msg="[AUTH] Node \"server-001\" [11fbfa42-17b9-4cfe-a863-65e791663838] can not join:
      certificate generation error: my-cluster has no signing keys"
file="auth/auth.go:464" func="auth.(*AuthServer).GenerateServerKeys"

I cannot since I've already upgraded. However I was on 2.0.6 and now I am on Teleport v2.2.0-alpha.8 git:v2.1.0-alpha.6-43-g14cf169d-dirty

I can tell you that I've now seen this happen multiple times across multiple versions. I am using dynamodb as a backend. I attempted to replicate it using dir mode only and a single auth server and was unable to.

After that I went back to using DynamoDB with multiple auth servers; however, I only use one when registering nodes, so while the other two are running the auth service, nothing is talking to them.

Everything seemed great for a while, I was able to add nodes and this bug seemed to not be present anymore until I upgraded to alpha8 and I completely lost the ability to register nodes again.

@ekristen

ekristen commented Jun 7, 2017

Awesome thanks. If there is anything I can do to further help troubleshoot, please let me know. I can downgrade, upgrade, clear out databases as necessary. I've been a little stuck as to the best way to troubleshoot as I haven't had a clear way to reproduce.

@kontsevoy
Contributor Author

@ekristen thanks for the help! we're about to push 2.2 out the door, so this issue is our highest priority right now.

@ekristen

ekristen commented Jun 7, 2017 via email

@russjones
Contributor

@ekristen I tried reproducing this issue but could not. I created a cluster that had two Auth Servers pointing to the same DynamoDB table running Teleport 2.0.6. I then upgraded the first Auth Server to Teleport 2.2.0-beta.1 and added a Node running Teleport 2.2.0-beta.1 successfully. I then upgraded the second Auth Server to Teleport 2.2.0-beta.1 and added a Node running Teleport 2.2.0-beta.1 successfully.

Can you provide the following:

  1. Starting from scratch, provide the configuration files and steps on how you are constructing your cluster and how to reproduce the issue. Feel free to include as much detail as possible, as it will make reproducing this issue much easier for me.
  2. Can you tell us the region you are running DynamoDB in and any DynamoDB settings.

If I can't reproduce the issue, would you be willing to provide access to a staging environment I can connect to to debug?

@ekristen

ekristen commented Jun 8, 2017

Ugh, once again I cannot reproduce it either, but it definitely is happening. I just can't seem to control the conditions it happens under, but at some point I lose the ability to register a new node with the cluster, with the error I've provided.

Perhaps some sort of random TTL value is causing keys in dynamodb to expire at some point thus triggering that error.

Question: Is there a good way to determine what values in dynamodb would need to be there or missing to trigger the error I've mentioned?

FWIW the steps I've taken are as follows ...

  1. Stopped all teleport services on servers
  2. Cleared all /var/lib/teleport directories out
  3. Installed 2.2.0-alpha8 on 3 auth servers using dynamodb
  4. Then installed the proxy on my bastion server
  5. Added my first user
  6. Added another server
  7. Added 2 more servers
  8. On my second environment, I stopped all services, cleared all /var/lib/teleport directories out
  9. Installed 2.2.0-alpha8 on 1 auth server using dir mode
  10. Run tctl nodes add --ttl=15m --roles=trustedcluster on the main cluster
  11. Add trusted cluster using trusted cluster role
  12. Add 1 server to the trusted cluster
  13. Upgraded all 3 auth servers to 2.2.0-beta.1
  14. Add 1 more new server to main cluster
  15. Add 1 more server to trusted cluster

After all those steps, everything is still working but that's not abnormal. This seems to only happen after a random set of time, thus my question earlier about expiring keys and what would trigger the error I'm seeing.

FWIW: the new feature you added that brings the roles down from the primary cluster to trusted clusters is working very well, I've been testing it out. Well done. Thanks!

@ekristen

ekristen commented Jun 8, 2017

I started digging around the data. In DynamoDB, at path teleport/authorities/host/primary-cluster, I see that my primary cluster expires 0001-01-01:

{"kind":"cert_authority","version":"v2","metadata":{"name":"primary-cluster","namespace":"default","expires":"0001-01-01T00:00:00Z"}, "spec": ["hidden"]}

However, on the auth nodes, in the cache at path /var/lib/teleport/cache/node/authorities/host/primary-cluster, I see that it expires tomorrow:

{"kind":"cert_authority","version":"v2","metadata":{"name":"primary-cluster","namespace":"default","expires":"2017-06-09T09:38:31.088768281Z"},"spec": ["hidden"]}

Does the cache get renewed? Maybe this is part of the problem?

I'm just trying to understand how that error occurs in the first place, I think that'd be the best way of tracking this down. If you can help me understand that I think I can help narrow this down more.

Thank you.

@russjones
Contributor

@ekristen What you are seeing in DynamoDB for the expires value is a zero value, meaning the key will not expire.

What you are seeing in /var/lib/teleport/cache is a cached value that will be used in case DynamoDB cannot be accessed, as part of the new high availability work in 2.2. It should be renewed. You can see implementation details for the new high availability work (as well as how to turn it off) in the following PR: #853.

@ekristen

ekristen commented Jun 8, 2017

@russjones can you determine what has to be missing for the error "has no signing keys" to be thrown? I think that is the best way to track this down.

@russjones
Contributor

@ekristen It looks like the certificate authority has lost its private keys. That's strange, because Teleport creates it with a TTL of 0 (so it never expires), so this should never happen.

However, that TTL only applies to Teleport expiring the key. That's why we asked about your DynamoDB settings: do you have any custom settings that would expire keys at the DynamoDB level? Because as far as I know, this does not occur with any other backend, correct?
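For reference, the error in the original report is the kind produced by a guard like the following sketch (firstSigningKey and the surrounding code here are illustrative, not Teleport's actual auth code): when the certificate authority record loaded from the backend has an empty signing-keys list, no host certificate can be generated.

```go
package main

import "fmt"

// firstSigningKey is a hypothetical version of the check behind the
// "has no signing keys" error: it fails when the certificate
// authority record has an empty signing_keys array.
func firstSigningKey(clusterName string, signingKeys [][]byte) ([]byte, error) {
	if len(signingKeys) == 0 {
		return nil, fmt.Errorf("%v has no signing keys", clusterName)
	}
	return signingKeys[0], nil
}

func main() {
	// A CA record whose signing_keys array is missing or empty
	// reproduces the message from the report.
	_, err := firstSigningKey("my-cluster", nil)
	fmt.Println(err) // my-cluster has no signing keys
}
```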

@ekristen

ekristen commented Jun 8, 2017

Sorry for not including that information the first time, I forgot.

Table name: my-teleport-cluster
Primary partition key: HashKey (String)
Primary sort key: FullPath (String)
Time to live attribute: DISABLED
Table status: Active
Creation date: March 31, 2017 at 9:51:10 AM UTC-6
Provisioned read capacity units: 5
Provisioned write capacity units: 5
Storage size (in bytes): 87.82 KB
Item count: 73
Region: US West (Oregon)
Amazon Resource Name (ARN): arn:aws:dynamodb:us-west-2:xxxx

I'm going to keep a close eye on this, and upgrade to the next version once released. I've dumped a working copy of the teleport database, so if this happens again I'll be able to compare differences as well as see if restoration of changed keys allows it to function again.

@kontsevoy kontsevoy removed the P0 label Jun 10, 2017
@kontsevoy
Contributor Author

@ekristen @russjones I am removing P0 label and removing this from the upcoming 2.2 milestone, but keeping it open in case you find a way to reproduce this.

@russjones
Contributor

@ekristen This issue should be resolved now. If you see it again ping me and I'll investigate further.

@ekristen

ekristen commented Aug 1, 2017

It's back. I haven't added any new nodes in a while and now that I'm trying to add a new one, I'm getting the error again. @russjones @kontsevoy

@ekristen

ekristen commented Aug 1, 2017

So I've been saving backups of the dynamodb, it seems that the entry for teleport/authorities/host/my-cluster has changed!

Upon further inspection the newer teleport/authorities/host/my-cluster has the same checking_keys but the signing_keys entry is just gone.

@ekristen

ekristen commented Aug 1, 2017

I replaced the value teleport/authorities/host/my-cluster with the original value from my backup that has the signing_keys array and everything works.

Somewhere along the line, when the DynamoDB entry gets updated, it's losing the signing_keys array.

@kontsevoy
Contributor Author

@russjones ping.

@russjones
Contributor

@ekristen Two questions:

  1. Did this occur after you tried adding a trusted cluster?
  2. Are you running an old version of Teleport anywhere within the main or trusted cluster? A single old binary can corrupt an entire cluster.

@ekristen

ekristen commented Aug 2, 2017

  1. No, I hadn't tried adding anything in a while. Everything was working for a while; that's why we closed the issue. I went to add a new server to the primary cluster and ran into this.

  2. To my knowledge everything is the same, but I will double check.

@russjones
Contributor

@ekristen You just started the cluster with a static token and after that lost the cluster keys?

@ekristen

ekristen commented Aug 4, 2017

The authority key in the DynamoDB table, 'teleport/authorities/host/my-cluster-name', was/is missing the signing_keys array. I had backed up all the keys after the original setup, the last time this happened. I opened the backup, found the same key, confirmed it had the signing_keys array and that the rest of the info matched, and replaced the value in DynamoDB with the original value.

I then generated a node key and told my new server to confirm itself to the authority server and it worked.

@kontsevoy
Contributor Author

@russjones any ideas? This case is so strange... there's literally no place in Teleport where private keys are ever deleted. I doubt we even have code for this, i.e. we wouldn't be able to do it even if we tried.

Can this be a case where there's a rogue teleport daemon (of the old version) somewhere trying to "upgrade" itself?

@ekristen

ekristen commented Aug 7, 2017

The only 3 instances that have access to the dynamodb table are all the same.

Since replacing the key's value last week, it has not been "updated" again.

I did notice in the code that the struct omits the signing_keys field when it is empty when converting to JSON. Is it possible that at some point Go thinks signing_keys is empty and omits it when marshaling?

@ekristen

It happened again. Basically, the entry in DynamoDB lost the signing_keys array. Version 2.2.0.

@ekristen

ekristen commented Sep 27, 2017

Everything about the JSON data is the same in terms of all the properties and values except that under the spec object, signing_keys is just gone.

I replaced the value with the backup I have and of course it works again, but there's definitely a bug somewhere. I just do not know where.
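One way to spot this kind of corruption quickly is to compare the field names under spec between a backup and the current DynamoDB value. A minimal sketch (specKeys is a hypothetical helper, and the JSON blobs are trimmed stand-ins for the real entries):

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// specKeys returns the sorted field names found under "spec" in a
// cert_authority JSON blob; a missing signing_keys shows up as a
// difference between the backup and the current value.
func specKeys(raw []byte) []string {
	var doc struct {
		Spec map[string]json.RawMessage `json:"spec"`
	}
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil
	}
	keys := make([]string, 0, len(doc.Spec))
	for k := range doc.Spec {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	backup := []byte(`{"spec":{"checking_keys":["k"],"signing_keys":["k"]}}`)
	current := []byte(`{"spec":{"checking_keys":["k"]}}`)
	fmt.Println(specKeys(backup))  // [checking_keys signing_keys]
	fmt.Println(specKeys(current)) // [checking_keys]
}
```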

@kontsevoy
Contributor Author

@ekristen @russjones we believe this is a naming conflict. Are you using trusted clusters? It's possible that a remote cluster's public key is overwriting a similarly named local cluster's key. Russell can assist with upgrading to 2.3 (where this issue has been resolved).

@ekristen

ekristen commented Oct 2, 2017

I am using trusted clusters, but they all have separate names. As long as I don't lose current capabilities, I'll gladly upgrade to try to fix this and/or to help troubleshoot further.

@russjones
Contributor

@ekristen In Teleport 2.3 we did an overhaul of the Trusted Cluster state machine, and we uncovered an issue where, if you had clusters with the same name, you could wipe out the signing keys of another cluster. That's been resolved now.

Do you know what triggers this behavior? Is it just time or do you perform some administrative action on Teleport which then triggers the loss of the signing key?

@ekristen

ekristen commented Oct 2, 2017

@russjones I haven't found any specific trigger. It seems to be time as best as I can figure. I can't rule out trusted clusters since I'm using them, but I only have 2 and they are named differently.

I'd be happy to upgrade to 2.3. I know you've started moving enterprise-based code out of the public repo; am I going to lose any current capabilities by upgrading?

@russjones
Contributor

russjones commented Oct 2, 2017

@ekristen Okay, as long as all clusters (your main cluster and all trusted clusters) have different names, you probably are not hitting that particular issue.

What features of Teleport are you using? The main Enterprise only features are external identity providers (SAML/OIDC) and RBAC.

@ekristen

ekristen commented Oct 2, 2017

What is included in RBAC?

I'm using trusted clusters + the role mapping to allow a role on the primary cluster to be mapped down to a role on the trusted clusters.

@russjones
Contributor

@ekristen Starting with Teleport 2.3, all users will have the same role, admin, in both the main cluster and the trusted cluster.

@ekristen

ekristen commented Oct 2, 2017

So does RBAC control the roles then?

@russjones
Contributor

Yes, Teleport essentially has no RBAC, all users operate as admin.

hatched pushed a commit to hatched/teleport-merge that referenced this issue Nov 30, 2022
hatched pushed a commit that referenced this issue Dec 20, 2022