
Post-election leader setup failure #11276

Closed
mgeggie opened this issue Apr 5, 2021 · 5 comments

Comments


mgeggie commented Apr 5, 2021

Describe the bug
During a Vault HA leadership election, the newly elected leader failed to complete its post-unseal tasks, resulting in a cluster-wide failure because the intra-cluster certificates were never configured.

To Reproduce
Steps to reproduce the behavior:

  1. We have not been able to reproduce this. An attempt to reproduce using docker-compose can be found in zumper/vault-docker

Expected behavior
When one Vault leader fails and another takes over, the new leader should successfully configure the cluster TLS.

Environment:

  • Vault Server Version (retrieve with vault status): 1.6.0
  • Vault CLI Version (retrieve with vault version): Vault v1.6.0 (7ce0bd9)
  • Server Operating System/Architecture: Ubuntu 16.04 LTS (xenial)/x86_64

Vault server configuration file(s):

# /etc/.d/server.hcl
# Managed by Salt

storage "consul" {
	address = "169.254.1.1:8500"
	path    = "vault/"
}

listener "tcp" {
	address = "0.0.0.0:8200"
	tls_key_file = "/etc/vault.d/vault-prod-complete.key.pem"
	tls_cert_file = "/etc/vault.d/vault-prod-20201207-complete.cert.pem"
	tls_min_version = "tls12"
}

seal "awskms" {
	kms_key_id = "<kms_key_id>"
}

ui = "true"

log_level = "trace"

# NOTE: DO NOT change api_addr to use hostnames!
#
# Consul 1.6.1 does not support recursive DNS lookups. If this is set to a non-IP value,
# then when a lookup for vault.service.consul reaches Consul's DNS server, Consul will
# resolve a CNAME to a Vault server's hostname, which it cannot then resolve to an IP
# address itself, resulting in a DNS resolution failure even though the service appears
# healthy in Consul!
api_addr = "https://10.0.32.53:8200"

pid_file = "/var/run/vault.pid"

Additional context
Logs from the particular failover:

vault[13307]: 2021-03-15T16:13:32.898Z [INFO]  core: acquired lock, enabling active operation
vault[13307]: 2021-03-15T16:13:32.903Z [DEBUG] core: generating cluster private key
vault[13307]: 2021-03-15T16:13:32.914Z [DEBUG] core: generating local cluster certificate
vault[13307]: 2021-03-15T16:13:32.935Z [INFO]  core: post-unseal setup starting
vault[13307]: 2021-03-15T16:13:37.499Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
vault[13307]: 2021-03-15T16:13:37.499Z [DEBUG] core.cluster-listener: error handshaking cluster connection: error="unsupported protocol"

For comparison, a prior successful failover in a similar timeframe:

[prod-main-ops3-use1]
vault[7646]: 2021-03-15T16:02:09.338Z [INFO]  core: acquired lock, enabling active operation
vault[7646]: 2021-03-15T16:02:09.348Z [DEBUG] core: generating cluster private key
vault[7646]: 2021-03-15T16:02:09.360Z [DEBUG] core: generating local cluster certificate
vault[7646]: 2021-03-15T16:02:09.387Z [INFO]  core: post-unseal setup starting
vault[7646]: 2021-03-15T16:02:09.388Z [DEBUG] core: clearing forwarding clients
vault[7646]: 2021-03-15T16:02:09.388Z [DEBUG] core: forwarding: stopping heartbeating
vault[7646]: 2021-03-15T16:02:09.388Z [DEBUG] core: done clearing forwarding clients
vault[7646]: 2021-03-15T16:02:09.388Z [DEBUG] core: persisting feature flags
vault[7646]: 2021-03-15T16:02:09.400Z [INFO]  core: loaded wrapping token key
vault[7646]: 2021-03-15T16:02:09.400Z [INFO]  core: successfully setup plugin catalog: plugin-directory=
vault[7646]: 2021-03-15T16:02:09.404Z [INFO]  core: successfully mounted backend: type=kv path=secret/
vault[7646]: 2021-03-15T16:02:09.404Z [INFO]  core: successfully mounted backend: type=system path=sys/
vault[7646]: 2021-03-15T16:02:09.404Z [INFO]  core: successfully mounted backend: type=identity path=identity/
vault[7646]: 2021-03-15T16:02:09.405Z [INFO]  core: successfully mounted backend: type=pki path=pki/
vault[7646]: 2021-03-15T16:02:09.405Z [INFO]  core: successfully mounted backend: type=aws path=aws/
vault[7646]: 2021-03-15T16:02:09.405Z [INFO]  core: successfully mounted backend: type=database path=database/
vault[7646]: 2021-03-15T16:02:09.405Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
vault[7646]: 2021-03-15T16:02:09.405Z [INFO]  secrets.database.database_3beb3d7d: initializing database rotation queue
vault[7646]: 2021-03-15T16:02:09.420Z [INFO]  core: successfully enabled credential backend: type=token path=token/
vault[7646]: 2021-03-15T16:02:09.420Z [INFO]  core: successfully enabled credential backend: type=aws path=aws/
vault[7646]: 2021-03-15T16:02:09.420Z [INFO]  core: successfully enabled credential backend: type=github path=github/
vault[7646]: 2021-03-15T16:02:09.422Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
vault[7646]: 2021-03-15T16:02:09.422Z [DEBUG] core.cluster-listener: error handshaking cluster connection: error="unsupported protocol"
vault[7646]: 2021-03-15T16:02:09.425Z [INFO]  core: restoring leases
vault[7646]: 2021-03-15T16:02:09.425Z [INFO]  rollback: starting rollback manager
vault[7646]: 2021-03-15T16:02:09.425Z [DEBUG] expiration: collecting leases
vault[7646]: 2021-03-15T16:02:09.428Z [DEBUG] identity: loading entities
vault[7646]: 2021-03-15T16:02:09.430Z [DEBUG] identity: entities collected: num_existing=13
vault[7646]: 2021-03-15T16:02:09.431Z [DEBUG] identity: entities loading: progress=0
vault[7646]: 2021-03-15T16:02:09.438Z [INFO]  identity: entities restored
vault[7646]: 2021-03-15T16:02:09.438Z [DEBUG] identity: identity loading groups
vault[7646]: 2021-03-15T16:02:09.439Z [DEBUG] identity: groups collected: num_existing=0
vault[7646]: 2021-03-15T16:02:09.439Z [INFO]  identity: groups restored
vault[7646]: 2021-03-15T16:02:09.445Z [TRACE] activity: scanned existing logs: out=[]
vault[7646]: 2021-03-15T16:02:09.445Z [DEBUG] core: request forwarding setup function
vault[7646]: 2021-03-15T16:02:09.445Z [DEBUG] core: clearing forwarding clients
vault[7646]: 2021-03-15T16:02:09.445Z [DEBUG] core: done clearing forwarding clients
vault[7646]: 2021-03-15T16:02:09.445Z [DEBUG] core: leaving request forwarding setup function
vault[7646]: 2021-03-15T16:02:09.447Z [TRACE] activity: no intent log found
vault[7646]: 2021-03-15T16:02:09.448Z [TRACE] activity: scanned existing logs: out=[]
vault[7646]: 2021-03-15T16:02:09.448Z [INFO]  http: TLS handshake error from 10.0.1.31:60788: remote error: tls: bad certificate
vault[7646]: 2021-03-15T16:02:09.487Z [INFO]  core: usage gauge collection is disabled
vault[7646]: 2021-03-15T16:02:09.498Z [INFO]  core: post-unseal setup complete
vault[7646]: 2021-03-15T16:02:09.517Z [INFO]  secrets.database.database_3beb3d7d: populating role rotation queue
vault[7646]: 2021-03-15T16:02:09.531Z [INFO]  secrets.database.database_3beb3d7d: starting periodic ticker
vault[7646]: 2021-03-15T16:02:09.697Z [INFO]  http: TLS handshake error from 10.0.1.21:59524: remote error: tls: bad certificate
vault[7646]: 2021-03-15T16:02:09.850Z [DEBUG] core.cluster-listener: performing server cert lookup
vault[7646]: 2021-03-15T16:02:09.924Z [DEBUG] core.request-forward: got request forwarding connection
vault[7646]: 2021-03-15T16:02:10.073Z [DEBUG] auth.aws.auth_aws_d3262f5e.initialize: starting initialization
vault[7646]: 2021-03-15T16:02:10.424Z [DEBUG] core.cluster-listener: performing server cert lookup
vault[7646]: 2021-03-15T16:02:10.435Z [INFO]  http: TLS handshake error from 10.0.1.163:53386: remote error: tls: bad certificate
vault[7646]: 2021-03-15T16:02:10.497Z [DEBUG] core.request-forward: got request forwarding connection
prod-main-ops3-use1 vault[7646]: 2021-03-15T16:02:10.833Z [INFO]  http: TLS handshake error from 10.0.1.31:60808: remote error: tls: bad certificate
prod-main-ops3-use1 vault[7646]: 2021-03-15T16:02:11.193Z [INFO]  http: TLS handshake error from 10.0.1.21:59570: remote error: tls: bad certificate
vault[7646]: 2021-03-15T16:02:11.545Z [INFO]  http: TLS handshake error from 10.0.1.163:53420: remote error: tls: bad certificate
vault[7646]: 2021-03-15T16:02:12.389Z [TRACE] secrets.database.database_3beb3d7d.postgresql-database-plugin: initialize: transport=builtin status=started
vault[7646]: 2021-03-15T16:02:12.429Z [TRACE] secrets.database.database_3beb3d7d.postgresql-database-plugin: initialize: transport=builtin status=finished verify=true err=<nil> took=39.771291ms
vault[7646]: 2021-03-15T16:02:12.429Z [TRACE] secrets.database.database_3beb3d7d.postgresql-database-plugin: create user: transport=builtin status=started
vault[7646]: 2021-03-15T16:02:12.440Z [TRACE] secrets.database.database_3beb3d7d.postgresql-database-plugin: create user: transport=builtin status=finished err=<nil> took=10.879113ms
vault[7646]: 2021-03-15T16:02:13.458Z [TRACE] secrets.database.database_3beb3d7d.postgresql-database-plugin: update user: transport=builtin status=started
vault[7646]: 2021-03-15T16:02:13.465Z [TRACE] secrets.database.database_3beb3d7d.postgresql-database-plugin: update user: transport=builtin status=finished err=<nil> took=7.185957ms
vault[7646]: 2021-03-15T16:02:13.754Z [INFO]  http: TLS handshake error from 10.0.1.31:60836: remote error: tls: bad certificate
vault[7646]: 2021-03-15T16:02:14.017Z [INFO]  http: TLS handshake error from 10.0.1.21:59636: remote error: tls: bad certificate
vault[7646]: 2021-03-15T16:02:14.224Z [DEBUG] expiration: leases collected: num_existing=470395
vault[7646]: 2021-03-15T16:02:14.416Z [DEBUG] expiration: leases loading: progress=500
vault[7646]: 2021-03-15T16:02:14.461Z [INFO]  http: TLS handshake error from 10.0.1.163:53522: remote error: tls: bad certificate
vault[7646]: 2021-03-15T16:02:14.535Z [DEBUG] expiration: leases loading: progress=1000
vault[7646]: 2021-03-15T16:02:14.682Z [DEBUG] expiration: leases loading: progress=1500
vault[7646]: 2021-03-15T16:02:14.780Z [DEBUG] expiration: leases loading: progress=2000
vault[7646]: 2021-03-15T16:02:14.931Z [DEBUG] expiration: leases loading: progress=2500
vault[7646]: 2021-03-15T16:02:15.059Z [DEBUG] expiration: leases loading: progress=3000
...
vault[7646]: 2021-03-15T16:02:16.082Z [DEBUG] expiration: leases loading: progress=7000
...
vault[7646]: 2021-03-15T16:02:18.638Z [DEBUG] expiration: leases loading: progress=13000
vault[7646]: 2021-03-15T16:02:19.367Z [TRACE] secrets.database.database_3beb3d7d.postgresql-database-plugin: update user: transport=builtin status=started
vault[7646]: 2021-03-15T16:02:19.390Z [DEBUG] expiration: leases loading: progress=13500
vault[7646]: 2021-03-15T16:02:19.462Z [TRACE] secrets.database.database_3beb3d7d.postgresql-database-plugin: update user: transport=builtin status=finished err=<nil> took=95.061359ms
vault[7646]: 2021-03-15T16:02:19.529Z [DEBUG] expiration: leases loading: progress=14000
...
vault[7646]: 2021-03-15T16:02:26.377Z [DEBUG] expiration: leases loading: progress=32000
vault[7646]: 2021-03-15T16:02:26.433Z [INFO]  expiration: revoked lease: lease_id=auth/aws/login/h00609ecd46485d1c33ee7e7cda596957d428fc6273f73129c77b1eccbeb70c18
vault[7646]: 2021-03-15T16:02:26.504Z [DEBUG] expiration: leases loading: progress=32500
...
vault[7646]: 2021-03-15T16:05:05.442Z [DEBUG] expiration: leases loading: progress=407000
vault[7646]: 2021-03-15T16:05:05.488Z [WARN]  expiration: lease count exceeds warning lease threshold
vault[7646]: 2021-03-15T16:05:05.608Z [DEBUG] expiration: leases loading: progress=407500
...
vault[7646]: 2021-03-15T16:05:34.170Z [DEBUG] expiration: leases loading: progress=470000
vault[7646]: 2021-03-15T16:05:34.478Z [INFO]  expiration: lease restore complete
vault[7646]: 2021-03-15T16:06:06.488Z [WARN]  expiration: lease count exceeds warning lease threshold
vault[7646]: 2021-03-15T16:06:56.511Z [INFO]  expiration: revoked lease: lease_id=auth/aws/login/h314c00f982d379ef7a16b302821155bbd0a611c5ba50e90ef00afea6f261fec3
@ncabatoff
Collaborator

Hi @mgeggie,

Thanks for reporting this issue. That sounds a lot like #10456, fixed in 1.6.1. Can you try upgrading to a newer version? Also, if it gets stuck again, can you send a SIGQUIT to get a stack trace in the logs?

@mgeggie
Author

mgeggie commented Apr 7, 2021

Hi @ncabatoff we just completed our production Vault cluster upgrade to 1.7.0 after having a spate of leader election issues. If we see a recurrence, we'll try to capture stack traces to add here.

@ncabatoff
Collaborator

Cool. Correction to my previous comment: use SIGUSR2, not SIGQUIT. Both will yield stack traces, but SIGQUIT will make Vault exit, which isn't usually what's desired.
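For reference, a minimal sketch of that advice. The first (commented) command is what you would run against the server from the config above, using its `pid_file` path; the rest only simulates the SIGUSR2-vs-SIGQUIT difference with a plain shell process, since Vault installs its own SIGUSR2 handler that logs goroutine stack traces:

```shell
#!/bin/sh
# Against the Vault server from the config above, the stack dump would be:
#   kill -USR2 "$(cat /var/run/vault.pid)"    # pid_file from server.hcl
# SIGQUIT would also dump stacks, but it terminates the process.

# Simulation: a trapped USR2 lets the process keep running to completion,
# unlike an untrapped QUIT, which would kill it.
sh -c 'trap "echo handled USR2, still running" USR2; sleep 3; echo done' &
pid=$!
sleep 1                 # give the child time to install its trap
kill -USR2 "$pid"       # trapped: the process keeps running
wait "$pid"             # prints the trap message, then "done"
```

Note that a POSIX shell runs the trap only after the foreground `sleep` completes, so both lines appear at the end.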

@dmattia

dmattia commented Apr 14, 2021

We also just upgraded from v1.5.4 to v1.5.7 (where the issue persisted) and then to v1.7.0, where we haven't seen a recurrence yet. So far it looks good!

@ncabatoff
Collaborator

Ok, thanks for the update @dmattia. I'm going to close this because otherwise I'll likely forget; please re-open if you have a recurrence.
