
[3.4] panic: runtime error: invalid memory address or nil pointer dereference #14256

Open · ahrtr opened this issue Jul 22, 2022 · 18 comments

ahrtr (Member) commented Jul 22, 2022

I have seen this error multiple times in the 3.4 pipeline:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xab2922]
goroutine 3500 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
	/home/runner/go/pkg/mod/go.etcd.io/[email protected]/tx.go:101
go.etcd.io/etcd/mvcc/backend.(*concurrentReadTx).UnsafeRange(0xc000598f00, 0x15792dc, 0x3, 0x3, 0xc0000e34e8, 0x11, 0x12, 0x0, 0x0, 0x0, ...)
	/home/runner/work/etcd/etcd/mvcc/backend/read_tx.go:195 +0x642
go.etcd.io/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc000598f30, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x67, 0x0, 0x66, ...)
	/home/runner/work/etcd/etcd/mvcc/kvstore_txn.go:147 +0x293
go.etcd.io/etcd/mvcc.(*storeTxnRead).Range(0xc000598f30, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0xc00005e800, ...)
	/home/runner/work/etcd/etcd/mvcc/kvstore_txn.go:51 +0xaf
go.etcd.io/etcd/mvcc.(*metricsTxnWrite).Range(0xc000598f60, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0x40fa00, ...)
	/home/runner/work/etcd/etcd/mvcc/metrics_txn.go:37 +0xaf
go.etcd.io/etcd/mvcc.(*readView).Range(0xc0005d95c0, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0x0, ...)
	/home/runner/work/etcd/etcd/mvcc/kv_view.go:39 +0x125
go.etcd.io/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc001bb20c0)
	/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:405 +0x1ef
go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc001bb20c0)
	/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:180 +0x2b
created by go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
	/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:179 +0x285
FAIL	go.etcd.io/etcd/etcdserver/api/v2store	3.484s
FAIL

Refer to https://github.com/etcd-io/etcd/runs/7460637358?check_suite_focus=true

Please see my comment: #14256 (comment)

ahrtr (Member, Author) commented Jul 26, 2022

@ramses, any update on this issue?

ahrtr (Member, Author) commented Jul 27, 2022

I suggest diffing the code between release-3.4 and release-3.5 (or main). cc @SimFG to take a look as well.

If there is no progress by early next week, I may take a deep dive myself.

ramses commented Jul 28, 2022

@ramses, any update on this issue?

@ahrtr, I was not investigating this issue; I was investigating the "split-brain" issue: #14143

Of these two, which one is currently more concerning to you?

SimFG (Contributor) commented Jul 28, 2022

@ahrtr Ok, I will look into it.

ahrtr (Member, Author) commented Jul 28, 2022

I suggest resolving this issue first. @SimFG @ramses

Regarding #14143:

  1. I am NOT convinced that it's a real issue in etcd. Apart from the lazyfs issue, I even suspect it's an issue on the client side, including the test case or jetcd. Of course, it needs more investigation.
  2. We can start by qualifying our corruption detection mechanism (such as Fix corruption checks v3.5 #14282) and our test process (such as Jepsen or our functional tests).

SimFG (Contributor) commented Jul 28, 2022

@ramses, you can continue working on your current issue so that we avoid duplicating work. cc @ahrtr; if there is any difficulty, I will keep you informed.

ramses commented Jul 28, 2022

  1. I am NOT convinced that it's a real issue in etcd. [...] Of course, it needs more investigation.

I agree on both counts, @ahrtr. I'll keep digging to be sure which one it is.

@ramses, you can continue working on your current issue so that we avoid duplicating work. cc @ahrtr; if there is any difficulty, I will keep you informed.

Noted, @SimFG. By the way, nice to meet you (in the GitHub-verse) 🙂

SimFG (Contributor) commented Aug 1, 2022

@ahrtr I didn't find a solution to this problem. At first I thought the begin method had never been called on the transaction, which would leave tx nil, but after a try that turned out not to be the case. Then I guessed that the server had failed to start, so the tx initialization never ran, but verifying that did not produce a result either.
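
(For illustration only: the nil-tx hypothesis reduces to the minimal standalone sketch below. This is not etcd's actual code path; it only shows that calling Bucket on a nil *bolt.Tx panics with exactly the SIGSEGV at tx.go:101 seen in the trace above.)

package main

import bolt "go.etcd.io/bbolt"

func main() {
	// tx was never assigned by a successful Begin(), so it is nil,
	// mirroring the hypothesis about concurrentReadTx.tx.
	var tx *bolt.Tx

	// (*Tx).Bucket reads internal state through the nil receiver and
	// panics with "invalid memory address or nil pointer dereference".
	tx.Bucket([]byte("key"))
}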

ahrtr (Member, Author) commented Aug 1, 2022

@ahrtr I didn't find a solution to this problem. At first I thought the begin method had never been called on the transaction, which would leave tx nil, but after a try that turned out not to be the case. Then I guessed that the server had failed to start, so the tx initialization never ran, but verifying that did not produce a result either.

Thanks @SimFG for the feedback. Let me take care of this issue.

ahrtr (Member, Author) commented Aug 2, 2022

This issue can happen on both main and release-3.4, and I believe release-3.5 has this issue as well.

The root cause is that etcd stops immediately before (*readView) Range is called, so the tx is nil; accordingly, the subsequent range operation panics.

I managed to reproduce this issue intentionally by simulating the situation described above. The code change (on 3.4) is below:

wachao-a01:etcd wachao$ git diff
diff --git a/etcdmain/etcd.go b/etcdmain/etcd.go
index 73328a73d..ca76e3910 100644
--- a/etcdmain/etcd.go
+++ b/etcdmain/etcd.go
@@ -294,6 +294,7 @@ func startEtcdOrProxyV2() {
        case <-stopped:
        }
 
+       time.Sleep(10 * time.Second)
        osutil.Exit(0)
 }
 
diff --git a/etcdserver/api/v3rpc/watch.go b/etcdserver/api/v3rpc/watch.go
index 6a9204024..e7c27713a 100644
--- a/etcdserver/api/v3rpc/watch.go
+++ b/etcdserver/api/v3rpc/watch.go
@@ -402,6 +402,11 @@ func (sws *serverWatchStream) sendLoop() {
                                events[i] = &evs[i]
                                if needPrevKV && !isCreateEvent(evs[i]) {
                                        opt := mvcc.RangeOptions{Rev: evs[i].Kv.ModRevision - 1}
+                                       var s *etcdserver.EtcdServer = sws.sg.(*etcdserver.EtcdServer)
+                                       go func() {
+                                               s.Stop()
+                                       }()
+                                       time.Sleep(1*time.Second)
                                        r, err := sws.watchable.Range(evs[i].Kv.Key, nil, opt)
                                        if err == nil && len(r.KVs) != 0 {
                                                events[i].PrevKv = &(r.KVs[0])
diff --git a/mvcc/backend/read_tx.go b/mvcc/backend/read_tx.go
index 91fe72ec5..f088e5fa9 100644
--- a/mvcc/backend/read_tx.go
+++ b/mvcc/backend/read_tx.go
@@ -16,6 +16,7 @@ package backend
 
 import (
        "bytes"
+       "fmt"
        "math"
        "sync"
 
@@ -192,6 +193,9 @@ func (rt *concurrentReadTx) UnsafeRange(bucketName, key, endKey []byte, limit in
        rt.txMu.RUnlock()
        if !ok {
                rt.txMu.Lock()
+               if rt.tx == nil {
+                       fmt.Println("###### IN (*concurrentReadTx) UnsafeRange, tr.tx is nil!!!")
+               }
                bucket = rt.tx.Bucket(bucketName)
                rt.buckets[bn] = bucket
                rt.txMu.Unlock()

Then the issue can always be reproduced by running the test TestV3WatchWithPrevKV:

wachao-a01:integration wachao$ go test -v -cpu 2 -run TestV3WatchWithPrevKV
=== RUN   TestV3WatchWithPrevKV
###### IN (*concurrentReadTx) UnsafeRange, tr.tx is nil!!!
    v3_watch_test.go:1212: timeout waiting for watch response
{"level":"warn","ts":"2022-08-02T10:38:23.889+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-e005a80b-1a7c-4a2c-83ed-a52cd49a61c4/localhost:83867718980294516900","attempt":0,"error":"rpc error: code = Unknown desc = raft: stopped"}
    v3_watch_test.go:1168: rpc error: code = Unknown desc = raft: stopped
    v3_watch_test.go:1198: rpc error: code = Canceled desc = context canceled
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x174694d]

goroutine 216 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
	/Users/wachao/go/gopath/pkg/mod/go.etcd.io/[email protected]/tx.go:101
go.etcd.io/etcd/mvcc/backend.(*concurrentReadTx).UnsafeRange(0xc000620300, 0x2316484, 0x3, 0x3, 0xc00011e5b8, 0x11, 0x12, 0x0, 0x0, 0x0, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/backend/read_tx.go:199 +0x64d
go.etcd.io/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc000620330, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x3, 0x0, 0x2, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kvstore_txn.go:147 +0x293
go.etcd.io/etcd/mvcc.(*storeTxnRead).Range(0xc000620330, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0xc000100000, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kvstore_txn.go:51 +0xaf
go.etcd.io/etcd/mvcc.(*metricsTxnWrite).Range(0xc000620360, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/metrics_txn.go:37 +0xaf
go.etcd.io/etcd/mvcc.(*readView).Range(0xc0002101f0, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0xc000351300, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kv_view.go:39 +0x125
go.etcd.io/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc000176b40)
	/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:410 +0x276
go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc000176b40)
	/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:180 +0x2b
created by go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
	/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:179 +0x285
exit status 2
FAIL	go.etcd.io/etcd/integration	31.025s

The probability of running into this issue in a production environment is really low, and it's even harder to reproduce in the pipeline after merging #14290. Since the etcdserver has already been stopped when this issue occurs, the last transaction should already have been committed, so it should be safe. I therefore don't think it's a blocker for 3.4.20 any more.

I will think about a formal fix or refactoring in the future.
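
One possible direction for such a fix (a sketch only, not a committed change, and assuming callers can tolerate an empty result from a read against an already-stopped backend; that assumption would need review) is to guard the nil tx in (*concurrentReadTx) UnsafeRange, instead of the debug print used in the reproduction above. Hunk line numbers are omitted:

@@ inside (*concurrentReadTx) UnsafeRange @@
        rt.txMu.RUnlock()
        if !ok {
                rt.txMu.Lock()
+               // Hypothetical guard: the backend may already have been stopped
+               // and its tx reset to nil; fail the read gracefully instead of
+               // panicking inside bbolt's (*Tx).Bucket.
+               if rt.tx == nil {
+                       rt.txMu.Unlock()
+                       return nil, nil
+               }
                bucket = rt.tx.Bucket(bucketName)
                rt.buckets[bn] = bucket
                rt.txMu.Unlock()

An alternative would be to surface an error, or to block the stop until in-flight reads drain; which is better depends on how callers handle an empty range result.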

JohnJAS commented Aug 29, 2022

I think I've hit the same issue when starting etcd v3.4.20 via a static pod in a k8s cluster:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x75745b]

goroutine 265 [running]:
go.uber.org/zap.(*Logger).check(0x0, 0x1, 0x10a3ea5, 0x36, 0xc0075d9300)
    /root/go/pkg/mod/go.uber.org/[email protected]/logger.go:264 +0x9b
go.uber.org/zap.(*Logger).Warn(0x0, 0x10a3ea5, 0x36, 0xc0075d9300, 0x2, 0x2)
    /root/go/pkg/mod/go.uber.org/[email protected]/logger.go:194 +0x45
go.etcd.io/etcd/etcdserver.(*EtcdServer).requestCurrentIndex(0xc0000b0600, 0xc000dabc20, 0x5f382e3fe91c3e3, 0x0, 0x0, 0x0)
    /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:805 +0x873
go.etcd.io/etcd/etcdserver.(*EtcdServer).linearizableReadLoop(0xc0000b0600)
    /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:721 +0x2d6
go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach.func1(0xc0000b0600, 0xc0007f5d60)
    /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2698 +0x57
created by go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach
    /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2696 +0x1b1

ahrtr (Member, Author) commented Aug 29, 2022

One quick question: did you see this issue just once, or can you reproduce it easily? @JohnJAS

JohnJAS commented Aug 29, 2022

Hi @ahrtr,
It is hard for me to answer that now, because I have only checked one environment, where I found this issue inside the dead etcd container.
But it is true that we kept hitting this etcd issue after we upgraded to v3.4.20; our daily pipeline was blocked for several days until we downgraded the etcd version.
I will try to manually upgrade etcd to see whether I can reproduce this issue easily, and I will update my comments later.

JohnJAS commented Aug 29, 2022

@ahrtr I've reproduced this on another cluster and hit the same issue.

I manually upgraded the etcd pods on each node, one by one, from v3.4.16 to v3.4.20. The etcd pods then crashed and exited as follows.

Error logs:

2022-08-29 06:29:02.702707 E | rafthttp: failed to read 88d5d8ebd854df5a on stream MsgApp v2 (unexpected EOF)
2022-08-29 06:29:02.702723 I | rafthttp: peer 88d5d8ebd854df5a became inactive (message send to peer failed)
2022-08-29 06:29:02.703504 W | rafthttp: lost the TCP streaming connection with peer 88d5d8ebd854df5a (stream Message reader)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x75745b]

goroutine 287 [running]:
go.uber.org/zap.(*Logger).check(0x0, 0x1, 0x10a3ea5, 0x36, 0xc003fb7480)
        /root/go/pkg/mod/go.uber.org/[email protected]/logger.go:264 +0x9b
go.uber.org/zap.(*Logger).Warn(0x0, 0x10a3ea5, 0x36, 0xc003fb7480, 0x2, 0x2)
        /root/go/pkg/mod/go.uber.org/[email protected]/logger.go:194 +0x45
go.etcd.io/etcd/etcdserver.(*EtcdServer).requestCurrentIndex(0xc000398600, 0xc005210f00, 0xd15682e845891c12, 0x0, 0x0, 0x0)
        /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:805 +0x873
go.etcd.io/etcd/etcdserver.(*EtcdServer).linearizableReadLoop(0xc000398600)
        /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:721 +0x2d6
go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach.func1(0xc000398600, 0xc00011ec20)
        /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2698 +0x57
created by go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach
        /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2696 +0x1b1

One more thing I have observed: when I run the helm upgrade command, which internally calls the kube-apiserver API (kube-apiserver depends on etcd in the backend), it always triggers the etcd crash.

So I think this issue is not related to the environment and is easy to reproduce.

However, etcd v3.4.16 works well; that version still has some CVEs, though, which is why we want to upgrade etcd.

I will test etcd v3.4.19, which includes the fixes for those CVEs, tomorrow.

I will update later.

rtheis commented Aug 29, 2022

We hit the same problem with etcd version 3.4.20 and it is very easy to reproduce in our environment.

JohnJAS commented Aug 30, 2022

I can confirm that etcd v3.4.19 doesn't have this issue, and we are happy to see that the two CVEs are fixed in that version. It looks like some fix in v3.4.20 introduced the current issue.

However, the error message in my case is not 100% the same as yours, though both are nil pointer issues. I won't create another issue in case they share the same root cause.

I hope my research helps narrow down the root cause.

ahrtr (Member, Author) commented Aug 30, 2022

Thanks @JohnJAS and @rtheis for the feedback. It seems to be a different issue from this one, so I raised #14402 to track it. Please feel free to deliver a PR for it if you are interested.
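
(A side note on the traces above: the 0x0 receiver in zap.(*Logger).check means a nil *zap.Logger was used; logging through a logger that was never initialized panics in exactly this way. A minimal standalone illustration:)

package main

import "go.uber.org/zap"

func main() {
	// A *zap.Logger that was never initialized is nil.
	var lg *zap.Logger

	// (*Logger).Warn calls (*Logger).check, which dereferences the nil
	// receiver and crashes with SIGSEGV, as at logger.go:264 above.
	lg.Warn("this message is never printed")
}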

AnassZn commented Apr 29, 2024

runtime error: index out of range [0] with length 0
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xd4d846]
