
2.0.6 to 2.2 alpha upgrade issue with DynamoDB backend #1050

Closed
kontsevoy opened this issue Jun 7, 2017 · 33 comments · Fixed by #1108

@kontsevoy
Contributor

Originally reported by @ekristen in #896 (in comments at the bottom):

Well, I upgraded to alpha8 and now I cannot add any more nodes; I'm getting "cluster has no signing keys". Something about upgrading to a new version while using DynamoDB breaks everything.

Logs:

level=warning
msg="[AUTH] Node \"server-001\" [11fbfa42-17b9-4cfe-a863-65e791663838] can not join:
      certificate generation error: my-cluster has no signing keys"
file="auth/auth.go:464" func="auth.(*AuthServer).GenerateServerKeys"

I cannot since I've already upgraded. However I was on 2.0.6 and now I am on Teleport v2.2.0-alpha.8 git:v2.1.0-alpha.6-43-g14cf169d-dirty

I can tell you that I've now seen this happen multiple times across multiple versions. I am using dynamodb as a backend. I attempted to replicate it using dir mode only and a single auth server and was unable to.

After that I went back to using DynamoDB with multiple auth servers; however, I only use one when registering nodes, so while the other two are running the auth service, nothing is talking to them.

Everything seemed great for a while, I was able to add nodes and this bug seemed to not be present anymore until I upgraded to alpha8 and I completely lost the ability to register nodes again.

@ekristen

ekristen commented Jun 7, 2017

Awesome thanks. If there is anything I can do to further help troubleshoot, please let me know. I can downgrade, upgrade, clear out databases as necessary. I've been a little stuck as to the best way to troubleshoot as I haven't had a clear way to reproduce.

@kontsevoy
Contributor Author

@ekristen thanks for the help! we're about to push 2.2 out the door, so this issue is our highest priority right now.

@ekristen

ekristen commented Jun 7, 2017 via email

@russjones
Contributor

@ekristen I tried reproducing this issue but could not. I created a cluster that had two Auth Servers pointing to the same DynamoDB table running Teleport 2.0.6. I then upgraded the first Auth Server to Teleport 2.2.0-beta.1 and added a Node running Teleport 2.2.0-beta.1 successfully. I then upgraded the second Auth Server to Teleport 2.2.0-beta.1 and added a Node running Teleport 2.2.0-beta.1 successfully.

Can you provide the following:

  1. Starting from scratch, provide the configuration files and steps on how you are constructing your cluster and how to reproduce the issue. Feel free to include as much detail as possible, as it will make reproducing this issue much easier for me.
  2. Can you tell us the region you are running DynamoDB in and any DynamoDB settings.

If I can't reproduce the issue, would you be willing to provide access to a staging environment I can connect to to debug?

@ekristen

ekristen commented Jun 8, 2017

Ugh, once again I cannot reproduce it either, but it definitely is happening. I just can't seem to control the conditions it happens under, but at some point I lose the ability to register a new node with the cluster, with the error I've provided.

Perhaps some sort of random TTL value is causing keys in dynamodb to expire at some point thus triggering that error.

Question: Is there a good way to determine what values in dynamodb would need to be there or missing to trigger the error I've mentioned?

FWIW the steps I've taken are as follows ...

  1. Stopped all teleport services on servers
  2. Cleared all /var/lib/teleport directories out
  3. Installed 2.2.0-alpha8 on 3 auth servers using dynamodb
  4. Then installed the proxy on my bastion server
  5. Added my first user
  6. Added another server
  7. Added 2 more servers
  8. On my second environment, I stopped all services, cleared all /var/lib/teleport directories out
  9. Installed 2.2.0-alpha8 on 1 auth server using dir mode
  10. Run tctl nodes add --ttl=15m --roles=trustedcluster on the main cluster
  11. Add trusted cluster using trusted cluster role
  12. Add 1 server to the trusted cluster
  13. Upgraded all 3 auth servers to 2.2.0-beta.1
  14. Add 1 more new server to main cluster
  15. Add 1 more server to trusted cluster

After all those steps, everything is still working but that's not abnormal. This seems to only happen after a random set of time, thus my question earlier about expiring keys and what would trigger the error I'm seeing.

FWIW: the new feature you added that brings the roles down from the primary cluster to trusted clusters is working very well, I've been testing it out. Well done. Thanks!

@ekristen

ekristen commented Jun 8, 2017

I started digging around the data. In DynamoDB, at path teleport/authorities/host/primary-cluster, I see that my primary cluster expires 0001-01-01:

{"kind":"cert_authority","version":"v2","metadata":{"name":"primary-cluster","namespace":"default","expires":"0001-01-01T00:00:00Z"}, "spec": ["hidden"]}

However, on the auth nodes, in the cache at path /var/lib/teleport/cache/node/authorities/host/primary-cluster, I see that it expires tomorrow:

{"kind":"cert_authority","version":"v2","metadata":{"name":"primary-cluster","namespace":"default","expires":"2017-06-09T09:38:31.088768281Z"},"spec": ["hidden"]}

Does the cache get renewed? Maybe this is part of the problem?

I'm just trying to understand how that error occurs in the first place, I think that'd be the best way of tracking this down. If you can help me understand that I think I can help narrow this down more.

Thank you.

@russjones
Contributor

@ekristen What you are seeing in DynamoDB for the expires value is a zero value, meaning the key will not expire.

What you are seeing in /var/lib/teleport/cache is a cached value that will be used in case DynamoDB cannot be accessed, as part of the new high availability work in 2.2. It should be renewed. You can see implementation details for the new high availability work (as well as how to turn it off) in the following PR: #853.

@ekristen

ekristen commented Jun 8, 2017

@russjones can you determine what has to be missing for the error "has no signing keys" to be thrown? I think that is the best way to track this down.

@russjones
Contributor

@ekristen It looks like the certificate authority has lost its private keys. That's strange, because Teleport creates it with a TTL of 0 (so it never expires), so this should never happen.

However, that TTL only applies to Teleport expiring the key. That's why we asked about your DynamoDB settings: do you have any custom settings that would expire keys at the DynamoDB level? Because as far as I know, this does not occur with any other backend, correct?
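For reference, the error in the original report is the kind produced by a guard like the following sketch (firstSigningKey and the surrounding code here are illustrative, not Teleport's actual auth code): when the certificate authority record loaded from the backend has an empty signing-keys list, no host certificate can be generated.

```go
package main

import "fmt"

// firstSigningKey is a hypothetical version of the check behind the
// "has no signing keys" error: it fails when the certificate
// authority record has an empty signing_keys array.
func firstSigningKey(clusterName string, signingKeys [][]byte) ([]byte, error) {
	if len(signingKeys) == 0 {
		return nil, fmt.Errorf("%v has no signing keys", clusterName)
	}
	return signingKeys[0], nil
}

func main() {
	// A CA record whose signing_keys array is missing or empty
	// reproduces the message from the report.
	_, err := firstSigningKey("my-cluster", nil)
	fmt.Println(err) // my-cluster has no signing keys
}
```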

@ekristen

ekristen commented Jun 8, 2017

Sorry for not including that information the first time, I forgot.

Table name: my-teleport-cluster
Primary partition key: HashKey (String)
Primary sort key: FullPath (String)
Time to live attribute: DISABLED
Table status: Active
Creation date: March 31, 2017 at 9:51:10 AM UTC-6
Provisioned read capacity units: 5
Provisioned write capacity units: 5
Storage size (in bytes): 87.82 KB
Item count: 73
Region: US West (Oregon)
Amazon Resource Name (ARN): arn:aws:dynamodb:us-west-2:xxxx

I'm going to keep a close eye on this, and upgrade to the next version once released. I've dumped a working copy of the teleport database, so if this happens again I'll be able to compare differences as well as see if restoration of changed keys allows it to function again.

@kontsevoy kontsevoy removed the P0 label Jun 10, 2017
@kontsevoy
Contributor Author

@ekristen @russjones I am removing P0 label and removing this from the upcoming 2.2 milestone, but keeping it open in case you find a way to reproduce this.

@russjones
Contributor

@ekristen This issue should be resolved now. If you see it again ping me and I'll investigate further.

@ekristen

ekristen commented Aug 1, 2017

It's back. I haven't added any new nodes in a while and now that I'm trying to add a new one, I'm getting the error again. @russjones @kontsevoy

@ekristen

ekristen commented Aug 1, 2017

So I've been saving backups of the dynamodb, it seems that the entry for teleport/authorities/host/my-cluster has changed!

Upon further inspection the newer teleport/authorities/host/my-cluster has the same checking_keys but the signing_keys entry is just gone.

@ekristen

ekristen commented Aug 1, 2017

I replaced the value teleport/authorities/host/my-cluster with the original value from my backup that has the signing_keys array and everything works.

Somewhere along the line, when the DynamoDB entry gets updated, it's losing the signing_keys array.

@kontsevoy
Contributor Author

@russjones ping.

@russjones
Contributor

@ekristen Two questions:

  1. Did this occur after you tried adding a trusted cluster?
  2. Are you running an old version of Teleport anywhere within the main or trusted cluster? A single old binary can corrupt an entire cluster.

@ekristen

ekristen commented Aug 2, 2017

  1. No, I hadn't tried adding anything in a while. Everything was working for a while; that's why we closed the issue. I went to add a new server to the primary cluster and ran into this.

  2. To my knowledge everything is the same, but I will double check.

@russjones
Contributor

@ekristen You just started the cluster with a static token and after that lost the cluster keys?

@ekristen

ekristen commented Aug 4, 2017

The authority key in the DynamoDB table, 'teleport/authorities/host/my-cluster-name', was/is missing the signing_keys array. I had backed up all the keys after the original setup, the last time this happened. I opened the backup, found the same key, confirmed it had the signing_keys array and that the rest of the info matched, and replaced the value in DynamoDB with the original value.

I then generated a node key and told my new server to confirm itself to the authority server and it worked.

@kontsevoy
Contributor Author

@russjones any ideas? This case is so strange... there's literally no place in Teleport where private keys are ever deleted. I doubt we even have code for this, i.e. we wouldn't be able to do it even if we tried.

Can this be a case where there's a rogue teleport daemon (of the old version) somewhere trying to "upgrade" itself?

@ekristen

ekristen commented Aug 7, 2017

The only 3 instances that have access to the dynamodb table are all the same.

Since replacing the key's value last week, it has not been "updated" again.

I did notice in the code that the struct omits the signing_keys field when it is empty when converting to JSON. Is it possible that at some point Go thinks signing_keys is empty and omits it when marshaling?

@ekristen

It happened again. Basically, the entry in DynamoDB lost the signing_keys array. Version 2.2.0.

@ekristen

ekristen commented Sep 27, 2017

Everything about the JSON data is the same in terms of all the properties and values except that under the spec object, signing_keys is just gone.

I replaced the value with the backup I have and of course it works again, but there's definitely a bug somewhere. I just do not know where.
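One way to spot this kind of corruption quickly is to compare the field names under spec between a backup and the current DynamoDB value. A minimal sketch (specKeys is a hypothetical helper, and the JSON blobs are trimmed stand-ins for the real entries):

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// specKeys returns the sorted field names found under "spec" in a
// cert_authority JSON blob; a missing signing_keys shows up as a
// difference between the backup and the current value.
func specKeys(raw []byte) []string {
	var doc struct {
		Spec map[string]json.RawMessage `json:"spec"`
	}
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil
	}
	keys := make([]string, 0, len(doc.Spec))
	for k := range doc.Spec {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	backup := []byte(`{"spec":{"checking_keys":["k"],"signing_keys":["k"]}}`)
	current := []byte(`{"spec":{"checking_keys":["k"]}}`)
	fmt.Println(specKeys(backup))  // [checking_keys signing_keys]
	fmt.Println(specKeys(current)) // [checking_keys]
}
```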

@kontsevoy
Contributor Author

@ekristen @russjones we believe this is a naming conflict. Are you using trusted clusters? It's possible that a remote cluster's public key is overwriting a similarly named local cluster's key. Russell can assist with upgrading to 2.3 (where this issue has been resolved).

@ekristen

ekristen commented Oct 2, 2017

I am using trusted clusters, but they all have separate names. As long as I don't lose current capabilities, I'll gladly upgrade to try to fix this and/or to help troubleshoot further.

@russjones
Contributor

@ekristen In Teleport 2.3 we did an overhaul of the Trusted Cluster state machine, and we uncovered an issue where, if you had clusters with the same name, you could wipe out the signing keys of another cluster. That's been resolved now.

Do you know what triggers this behavior? Is it just time or do you perform some administrative action on Teleport which then triggers the loss of the signing key?

@ekristen

ekristen commented Oct 2, 2017

@russjones I haven't found any specific trigger. It seems to be time as best as I can figure. I can't rule out trusted clusters since I'm using them, but I only have 2 and they are named differently.

I'd be happy to upgrade to 2.3. I know you've started moving enterprise-based code out of the public repo; am I going to lose any current capabilities by upgrading?

@russjones
Contributor

russjones commented Oct 2, 2017

@ekristen Okay, as long as all clusters (your main cluster and all trusted clusters) have different names, you probably are not hitting that particular issue.

What features of Teleport are you using? The main Enterprise only features are external identity providers (SAML/OIDC) and RBAC.

@ekristen

ekristen commented Oct 2, 2017

What is included in RBAC?

I'm using trusted clusters + the role mapping to allow a role on the primary cluster to be mapped down to a role on the trusted clusters.

@russjones
Contributor

@ekristen Starting with Teleport 2.3, all users will have the same role, admin, in both the main cluster and the trusted cluster.

@ekristen

ekristen commented Oct 2, 2017

So does RBAC control the roles then?

@russjones
Contributor

Yes, Teleport essentially has no RBAC, all users operate as admin.

hatched pushed a commit to hatched/teleport-merge that referenced this issue Nov 30, 2022
hatched pushed a commit that referenced this issue Dec 20, 2022