Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: don't activate alarm w/missing AlarmType #13084

Merged
merged 2 commits into from
Jun 4, 2021

Conversation

dlowe
Copy link
Contributor

@dlowe dlowe commented Jun 4, 2021

This is a crashing bug I found while fuzzing etcd under Mayhem for API. Full disclosure: working on this is my day job!

On a fresh-out-of-the-box bin/etcd:

$ curl -d "{\"action\":\"ACTIVATE\"}" localhost:2379/v3/maintenance/alarm
curl: (52) Empty reply from server

The server has crashed hard, with:

go.etcd.io/etcd/server/v3/mvcc/backend.(*batchTx).unsafePut
	etcd/release/etcd/server/mvcc/backend/batch_tx.go:138
go.etcd.io/etcd/server/v3/mvcc/backend.(*batchTx).UnsafePut
	etcd/release/etcd/server/mvcc/backend/batch_tx.go:115
go.etcd.io/etcd/server/v3/mvcc/backend.(*batchTxBuffered).UnsafePut
	etcd/release/etcd/server/mvcc/backend/batch_tx.go:337
go.etcd.io/etcd/server/v3/etcdserver/api/v3alarm.(*AlarmStore).Activate
	etcd/release/etcd/server/etcdserver/api/v3alarm/alarms.go:69
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Alarm
	etcd/release/etcd/server/etcdserver/apply.go:724
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Apply
	etcd/release/etcd/server/etcdserver/apply.go:192
go.etcd.io/etcd/server/v3/etcdserver.(*authApplierV3).Apply
	etcd/release/etcd/server/etcdserver/apply_auth.go:61
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntryNormal
	etcd/release/etcd/server/etcdserver/server.go:2206
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply
	etcd/release/etcd/server/etcdserver/server.go:2116
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries
	etcd/release/etcd/server/etcdserver/server.go:1357
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll
	etcd/release/etcd/server/etcdserver/server.go:1179
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8
	etcd/release/etcd/server/etcdserver/server.go:1111
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run
	etcd/release/etcd/pkg/schedule/schedule.go:157

(And in fact fails with the same exception on every subsequent restart until the offending alarm is cleaned out of the data directory!)

The test I've added is at the same level (http request from the outside world) where I discovered the bug, because I'm not super familiar with the etcd codebase. Let me know if it'd be more appropriate to add a narrower unit test.

I tried to keep my fix in line with the existing behavior of applierV3backend.Alarm(), which is to say it turns bad requests into silent noops.

Finally: I've found at least one other crashing bug through fuzzing, as well as other, less severe issues. Please let me know if you don't want further PRs based on these findings!

@@ -366,6 +366,21 @@ func testV3CurlResignMissiongLeaderKey(cx ctlCtx) {
}
}

func TestV3CurlMaintenanceAlarmMissiongAlarm(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Missiong" is how it's spelled in all the other tests in this file, so I decided to be consistent rather than correct 🤷

@codecov-commenter
Copy link

codecov-commenter commented Jun 4, 2021

Codecov Report

Merging #13084 (a26fa0c) into main (b00803a) will increase coverage by 8.53%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #13084      +/-   ##
==========================================
+ Coverage   53.33%   61.86%   +8.53%     
==========================================
  Files         420      415       -5     
  Lines       33368    32911     -457     
==========================================
+ Hits        17796    20360    +2564     
+ Misses      13739    10411    -3328     
- Partials     1833     2140     +307     
Flag Coverage Δ
all 61.86% <66.66%> (+8.53%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/etcdserver/apply.go 87.99% <66.66%> (+14.30%) ⬆️
server/mvcc/backend/hooks.go 0.00% <0.00%> (-100.00%) ⬇️
client/pkg/v3/tlsutil/cipher_suites.go 0.00% <0.00%> (-100.00%) ⬇️
client/pkg/v3/transport/sockopt_unix.go 0.00% <0.00%> (-100.00%) ⬇️
client/pkg/v3/transport/timeout_dialer.go 0.00% <0.00%> (-100.00%) ⬇️
client/pkg/v3/transport/timeout_listener.go 0.00% <0.00%> (-100.00%) ⬇️
client/pkg/v3/transport/keepalive_listener.go 0.00% <0.00%> (-92.31%) ⬇️
client/pkg/v3/srv/srv.go 0.00% <0.00%> (-89.42%) ⬇️
client/pkg/v3/transport/listener_opts.go 13.33% <0.00%> (-86.67%) ⬇️
client/pkg/v3/types/set.go 11.00% <0.00%> (-82.00%) ⬇️
... and 263 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b00803a...a26fa0c. Read the comment docs.

@gyuho
Copy link
Contributor

gyuho commented Jun 4, 2021

This is a crashing bug I found while fuzzing etcd under Mayhem for API. Full disclosure: working on this is my day job!

Nice! Was there any tooling involved for fuzzing?

@gyuho
Copy link
Contributor

gyuho commented Jun 4, 2021

@dlowe Thanks for the fix. Two things to address before merging this.

  1. Can we add this change to our 3.5 changelog?

  2. Can we change this warning to panic?

		lg.Warn("alarm raised", zap.String("alarm", m.Alarm.String()), zap.String("from", types.ID(m.MemberID).String()))
		switch m.Alarm {
		case pb.AlarmType_CORRUPT:
			a.s.applyV3 = newApplierV3Corrupt(a)
		case pb.AlarmType_NOSPACE:
			a.s.applyV3 = newApplierV3Capped(a)
		default:
			lg.Warn("unimplemented alarm activation", zap.String("alarm", fmt.Sprintf("%+v", m)))
		}

to

lg.Panic("unimplemented alarm activation", zap.String("alarm", fmt.Sprintf("%+v", m)))

@gyuho gyuho added this to the etcd-v3.5 milestone Jun 4, 2021
dlowe added 2 commits June 4, 2021 12:18
Narrowly prevent etcd from crashing when given a bad ACTIVATE payload, e.g.:

$ curl -d "{\"action\":\"ACTIVATE\"}" ${ETCD}/v3/maintenance/alarm
curl: (52) Empty reply from server
@dlowe dlowe force-pushed the crash-on-missing-alarm-type branch from eeb3402 to a26fa0c Compare June 4, 2021 19:19
@dlowe
Copy link
Contributor Author

dlowe commented Jun 4, 2021

Nice! Was there any tooling involved for fuzzing?

Mayhem for API itself is a command-line tool (mapi).

Other than that, nope! I just ran bin/etcd locally, then ranmapi against the etcd openapi spec and the base URL, and it did the rest. (Admittedly, the fact that I was able to get interesting results without having to deal with authentication was a pleasant surprise. That's often the trickiest part of running against a new API.)

Can we add this change to our 3.5 changelog?
Can we change this warning to panic?

Yup! Both done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants