Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.4.20] Panic probably due to nil log object #14402

Closed
Tracked by #14438
ahrtr opened this issue Aug 30, 2022 · 9 comments
Closed
Tracked by #14438

[3.4.20] Panic probably due to nil log object #14402

ahrtr opened this issue Aug 30, 2022 · 9 comments

Comments

@ahrtr
Copy link
Member

ahrtr commented Aug 30, 2022

Based on the feedback from @JohnJAS, this issue can't be reproduced in 3.4.19, instead it can only be reproduced on 3.4.20.

It looks like the lg is nil.

Thanks both @JohnJAS and @rtheis.

2022-08-29 06:29:02.702707 E | rafthttp: failed to read 88d5d8ebd854df5a on stream MsgApp v2 (unexpected EOF)
2022-08-29 06:29:02.702723 I | rafthttp: peer 88d5d8ebd854df5a became inactive (message send to peer failed)
2022-08-29 06:29:02.703504 W | rafthttp: lost the TCP streaming connection with peer 88d5d8ebd854df5a (stream Message reader)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x75745b]

goroutine 287 [running]:
go.uber.org/zap.(*Logger).check(0x0, 0x1, 0x10a3ea5, 0x36, 0xc003fb7480)
        /root/go/pkg/mod/go.uber.org/[email protected]/logger.go:264 +0x9b
go.uber.org/zap.(*Logger).Warn(0x0, 0x10a3ea5, 0x36, 0xc003fb7480, 0x2, 0x2)
        /root/go/pkg/mod/go.uber.org/[email protected]/logger.go:194 +0x45
go.etcd.io/etcd/etcdserver.(*EtcdServer).requestCurrentIndex(0xc000398600, 0xc005210f00, 0xd15682e845891c12, 0x0, 0x0, 0x0)
        /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:805 +0x873
go.etcd.io/etcd/etcdserver.(*EtcdServer).linearizableReadLoop(0xc000398600)
        /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:721 +0x2d6
go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach.func1(0xc000398600, 0xc00011ec20)
        /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2698 +0x57
created by go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach
        /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2696 +0x1b1
@kkkkun
Copy link
Contributor

kkkkun commented Aug 31, 2022

It may have a relationship with PR https://github.com/etcd-io/etcd/pull/11616/files

@ahrtr
Copy link
Member Author

ahrtr commented Aug 31, 2022

@kkkkun Please feel free to deliver a PR for this. thx.

@vsvastey
Copy link
Contributor

vsvastey commented Sep 1, 2022

@ahrtr I would like to take a look at the issue, but I'm new to the project and haven't managed to reproduce the bug.
Could you (or anyone else) please help me find out how to reproduce the bug? I believe it might be helpful for anyone else who wants to tackle this issue.

@ahrtr
Copy link
Member Author

ahrtr commented Sep 2, 2022

Thanks @vsvastey .

The reason should be that the Logger isn't correctly initialized in some situation.

@JohnJAS and @rtheis, could you please provide the steps to reproduce this issue? thx

vsvastey added a commit to vsvastey/etcd that referenced this issue Sep 3, 2022
In v3.5 it is assumed that the logger should not be nil, however it is
still a case in v3.4. The PR targeted to v3.5 was backported to 3.4 and
that's why it's possible to get panic on nil logger in 3.4. This commit
fixed this issue.

Fixes etcd-io#14402
@vsvastey
Copy link
Contributor

vsvastey commented Sep 3, 2022

I've digged a bit into git history.
So, what I understood is that capnslog is removed for 3.5 and in main. (corresponding issue: #11426) After that, the Logger should not be nil.

However, for 3.4 it is still legal case for Logger to be nil. But the PR (#12795) that was targeted to main was also backported to 3.4 and that is how the code without nil checks get to the branch.

It will be enough just to add a couple of checks whether the Logger is nil. The fix should go for 3.4 only.

The PR is there. However, I haven't had a chance to test it since I don't know how to reproduce the bug.

vsvastey added a commit to vsvastey/etcd that referenced this issue Sep 3, 2022
In v3.5 it is assumed that the logger should not be nil, however it is
still a case in v3.4. The PR targeted to v3.5 was backported to 3.4 and
that's why it's possible to get panic on nil logger in 3.4. This commit
fixed this issue.

Fixes etcd-io#14402

Signed-off-by: Vladimir Sokolov <[email protected]>
@rtheis
Copy link

rtheis commented Sep 3, 2022

We easily reproduce the problem by deploying an IBM Cloud Kubernetes Service or Red Hat OpenShift on IBM Cloud cluster. However, you wouldn't be able to hack in your own etcd version to such clusters. My hope is that you could recreate by deploying etcd with your own Kubernetes cluster.

@JohnJAS
Copy link

JohnJAS commented Sep 5, 2022

We installed an on-premise multiple control-plane nodes k8s cluster. I think it should be easy to reproduce by installing a k8s cluster via kubeadm. In my case, when you are trying to run helm upgrade command, the etcd pods will always be crashed.

@ahrtr
Copy link
Member Author

ahrtr commented Sep 5, 2022

Thanks both @rtheis and @JohnJAS

ahrtr added a commit to ahrtr/etcd that referenced this issue Sep 6, 2022
Issues:
1. etcd-io#14402 fixed in 3.4 only;
2. etcd-io#14382 fixed in both 3.5 and main.

Signed-off-by: Benjamin Wang <[email protected]>
@ahrtr
Copy link
Member Author

ahrtr commented Sep 6, 2022

Resolved by #14420

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

5 participants