Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zero assigned two Alphas the same Raft index. #5436

Closed
danielmai opened this issue May 13, 2020 · 1 comment · Fixed by #5571
Closed

Zero assigned two Alphas the same Raft index. #5436

danielmai opened this issue May 13, 2020 · 1 comment · Fixed by #5571
Labels
area/operations Related to operational aspects of the DB, including signals, flags, env vars, etc. kind/bug Something is broken. status/accepted We accept to investigate/work on it.

Comments

@danielmai
Copy link
Contributor

danielmai commented May 13, 2020

What version of Dgraph are you using?

master (5890a6c)

Have you tried reproducing the issue with the latest release?

No

What is the hardware spec (RAM, OS)?

Ubuntu Linux

Steps to reproduce the issue (command/config used to run Dgraph).

  1. Run Dgraph Zero
    dgraph zero --port_offset=0 --replicas 5
    
  2. Run Dgraph Alphas
    dgraph alpha -o 100
    dgraph alpha --my=localhost:7182 -o 102
    dgraph alpha --my=localhost:7183 -o 103
    

Expected behaviour and actual result.

Two of the Alphas got assigned the same Raft ID:

Alpha 1 logs
I0512 14:47:35.697407   21051 run.go:531] gRPC server started.  Listening on port 9180
I0512 14:47:35.697438   21051 run.go:532] HTTP server started.  Listening on port 8180
I0512 14:47:35.795500   21051 pool.go:160] CONNECTING to localhost:5080
I0512 14:47:35.804842   21051 groups.go:135] Connected to group zero. Assigned group: 1
I0512 14:47:35.804865   21051 groups.go:137] Raft Id after connection to Zero: 0x1

Alpha 2 logs
I0512 14:47:45.670334   21129 run.go:531] gRPC server started.  Listening on port 9182
I0512 14:47:45.670351   21129 run.go:532] HTTP server started.  Listening on port 8182
I0512 14:47:45.768813   21129 pool.go:160] CONNECTING to localhost:5080
I0512 14:47:45.776634   21129 groups.go:135] Connected to group zero. Assigned group: 1
I0512 14:47:45.776651   21129 groups.go:137] Raft Id after connection to Zero: 0x2

Alpha 3 logs
I0512 14:47:57.999506   21598 run.go:531] gRPC server started.  Listening on port 9183
I0512 14:47:57.999516   21598 run.go:532] HTTP server started.  Listening on port 8183
I0512 14:47:58.098817   21598 pool.go:160] CONNECTING to localhost:5080
I0512 14:47:58.108609   21598 groups.go:135] Connected to group zero. Assigned group: 1
I0512 14:47:58.108632   21598 groups.go:137] Raft Id after connection to Zero: 0x2

Both Alpha 2 and Alpha 3 say "Raft Id after connection to Zero: 0x2".

When checking /state, I only see two Alpha members, where member id 2 is sometimes either Alpha when refreshing the /state page multiple times:

        "1": {
            "checksum": "14719752810080575978",
            "members": {
                "1": {
                    "addr": "localhost:7180",
                    "groupId": 1,
                    "id": "1",
                    "lastUpdate": "1589320055",
                    "leader": true
                },
                "2": {
                    "addr": "localhost:7183",
                    "groupId": 1,
                    "id": "2",
                    "lastUpdate": "1589320630"
                }
            },

or

        "1": {
            "checksum": "14719752810080575978",
            "members": {
                "1": {
                    "addr": "localhost:7180",
                    "groupId": 1,
                    "id": "1",
                    "lastUpdate": "1589320055",
                    "leader": true
                },
                "2": {
                    "addr": "localhost:7182",
                    "groupId": 1,
                    "id": "2",
                    "lastUpdate": "1589320648"
                }
            },

I expected the Alpha 1 to have Raft ID 0x1, Alpha 2 to have Raft ID 0x2, and Alpha 3 to have Raft ID 0x3.

Here are the Zero logs:

$ dgraph zero --port_offset=0 --replicas 5
[Decoder]: Using assembly version of decoder
[Sentry] 2020/05/12 14:47:24 Integration installed: ContextifyFrames
[Sentry] 2020/05/12 14:47:24 Integration installed: Environment
[Sentry] 2020/05/12 14:47:24 Integration installed: Modules
[Sentry] 2020/05/12 14:47:24 Integration installed: IgnoreErrors
[Decoder]: Using assembly version of decoder
[Sentry] 2020/05/12 14:47:24 Integration installed: ContextifyFrames
[Sentry] 2020/05/12 14:47:24 Integration installed: Environment
[Sentry] 2020/05/12 14:47:24 Integration installed: Modules
[Sentry] 2020/05/12 14:47:24 Integration installed: IgnoreErrors
I0512 14:47:25.147987   20950 init.go:99] 

Dgraph version   : v2.0.0-rc1-309-g5890a6c48
Dgraph SHA-256   : e3add49c6395c1fa651eeb9a8c404489b39c59fae285af139603128d471fab9e
Commit SHA-1     : 5890a6c48
Commit timestamp : 2020-05-11 22:53:45 +0530
Branch           : master
Go version       : go1.14.1

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph     , visit https://discuss.dgraph.io.
To say hi to the community       , visit https://dgraph.slack.com.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.


I0512 14:47:25.148236   20950 run.go:108] Setting up grpc listener at: 0.0.0.0:5080
I0512 14:47:25.148397   20950 run.go:108] Setting up http listener at: 0.0.0.0:6080
badger 2020/05/12 14:47:25 INFO: All 0 tables opened in 0s
I0512 14:47:25.195823   20950 node.go:148] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc000604100 Applied:0 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x2869bb0 DisableProposalForwarding:false}
I0512 14:47:25.196513   20950 node.go:326] Group 0 found 0 entries
I0512 14:47:25.196610   20950 log.go:34] 1 became follower at term 0
I0512 14:47:25.196719   20950 log.go:34] newRaft 1 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
I0512 14:47:25.196753   20950 log.go:34] 1 became follower at term 1
I0512 14:47:25.196971   20950 run.go:304] Running Dgraph Zero...
E0512 14:47:25.197475   20950 raft.go:516] While proposing CID: Not Zero leader. Aborting proposal: cid:"a0066c20-b304-483d-8ece-dcd4246b6d72" . Retrying...
I0512 14:47:25.203852   20950 node.go:185] Setting conf state to nodes:1 
I0512 14:47:25.203990   20950 raft.go:706] Done applying conf change at 0x1
I0512 14:47:26.197203   20950 log.go:34] 1 no leader at term 1; dropping index reading msg
I0512 14:47:27.997070   20950 log.go:34] 1 is starting a new election at term 1
I0512 14:47:27.997096   20950 log.go:34] 1 became pre-candidate at term 1
I0512 14:47:27.997102   20950 log.go:34] 1 received MsgPreVoteResp from 1 at term 1
I0512 14:47:27.997123   20950 log.go:34] 1 became candidate at term 2
I0512 14:47:27.997128   20950 log.go:34] 1 received MsgVoteResp from 1 at term 2
I0512 14:47:27.997145   20950 log.go:34] 1 became leader at term 2
I0512 14:47:27.997159   20950 log.go:34] raft.node: 1 elected leader 1 at term 2
I0512 14:47:27.997199   20950 raft.go:671] I've become the leader, updating leases.
I0512 14:47:27.997216   20950 assign.go:42] Updated Lease id: 1. Txn Ts: 1
W0512 14:47:28.197258   20950 node.go:674] [0x1] Read index context timed out
I0512 14:47:28.209228   20950 raft.go:509] CID set for cluster: 09adb769-d622-4238-b241-27417e03ee9b
I0512 14:47:28.220323   20950 license_ee.go:45] Enterprise trial license proposed to the cluster: key:"z1-325843994757820810" license:<maxNodes:18446744073709551615 expiryTs:1591912048 > 
I0512 14:47:35.797176   20950 zero.go:417] Got connection request: cluster_info_only:true 
I0512 14:47:35.797346   20950 zero.go:435] Connected: cluster_info_only:true 
I0512 14:47:35.798407   20950 zero.go:417] Got connection request: addr:"localhost:7180" 
I0512 14:47:35.798645   20950 pool.go:160] CONNECTING to localhost:7180
W0512 14:47:35.799909   20950 pool.go:254] Connection lost with localhost:7180. Error: rpc error: code = Unknown desc = No node has been set up yet
I0512 14:47:35.804592   20950 zero.go:562] Connected: id:1 group_id:1 addr:"localhost:7180" 
I0512 14:47:45.770118   20950 zero.go:417] Got connection request: cluster_info_only:true 
I0512 14:47:45.770379   20950 zero.go:435] Connected: cluster_info_only:true 
I0512 14:47:45.771460   20950 zero.go:417] Got connection request: id:2 addr:"localhost:7182" 
I0512 14:47:45.771727   20950 pool.go:160] CONNECTING to localhost:7182
W0512 14:47:45.772614   20950 pool.go:254] Connection lost with localhost:7182. Error: rpc error: code = Unknown desc = No node has been set up yet
I0512 14:47:45.776289   20950 zero.go:562] Connected: id:2 group_id:1 addr:"localhost:7182" 
I0512 14:47:58.100099   20950 zero.go:417] Got connection request: cluster_info_only:true 
I0512 14:47:58.100309   20950 zero.go:435] Connected: cluster_info_only:true 
I0512 14:47:58.101187   20950 zero.go:417] Got connection request: addr:"localhost:7183" 
I0512 14:47:58.101456   20950 pool.go:160] CONNECTING to localhost:7183
W0512 14:47:58.102427   20950 pool.go:254] Connection lost with localhost:7183. Error: rpc error: code = Unknown desc = No node has been set up yet
I0512 14:47:58.108148   20950 zero.go:562] Connected: id:2 group_id:1 addr:"localhost:7183" 

Checking the wal entries for Zero shows that both instances got assigned id 2:

$ dgraph debug --wal zw
...
2 . 21 . EntryNormal . 62 B   . Member: id:2 group_id:1 addr:"localhost:7182"  .
2 . 22 . EntryNormal . 64 B   . Member: id:2 group_id:1 addr:"localhost:7183"  .
...

I noticed this because I ran a drop all and then a data load and only two of the three Alphas showed logs when I ran the operation.

@danielmai danielmai added kind/bug Something is broken. status/accepted We accept to investigate/work on it. area/operations Related to operational aspects of the DB, including signals, flags, env vars, etc. labels May 13, 2020
@ashish-goswami
Copy link
Contributor

I tried to reproduce it multiple time with above instructions. Also tried it with 3 node zero cluster, could not reproduce it. Had a chat with @danielmai, even he saw it only once. Looks like a rarely occurring issue.

martinmr added a commit that referenced this issue Jun 8, 2020
Currently, zero uses the MaxRaftId to assign RAFT IDs to new nodes. However, the
MaxRaftId value is not immediately updated and the lock is released in between
creating the proposal and sending the proposal to the RAFT node. This can lead
to situations where more than one new node is assigned the same ID.

To solve this, this change introduces a new field to generate the IDs that is
updated immediately, thus preventing multiple nodes from being assigned
the same ID.

Fixes #5436
martinmr added a commit that referenced this issue Jun 8, 2020
Currently, zero uses the MaxRaftId to assign RAFT IDs to new nodes. However, the
MaxRaftId value is not immediately updated and the lock is released in between
creating the proposal and sending the proposal to the RAFT node. This can lead
to situations where more than one new node is assigned the same ID.

To solve this, this change introduces a new field to generate the IDs that is
updated immediately, thus preventing multiple nodes from being assigned
the same ID.

Fixes #5436
martinmr added a commit that referenced this issue Jun 8, 2020
Currently, zero uses the MaxRaftId to assign RAFT IDs to new nodes. However, the
MaxRaftId value is not immediately updated and the lock is released in between
creating the proposal and sending the proposal to the RAFT node. This can lead
to situations where more than one new node is assigned the same ID.

To solve this, this change introduces a new field to generate the IDs that is
updated immediately, thus preventing multiple nodes from being assigned
the same ID.

Fixes #5436
martinmr added a commit that referenced this issue Jun 9, 2020
Currently, zero uses the MaxRaftId to assign RAFT IDs to new nodes. However, the
MaxRaftId value is not immediately updated and the lock is released in between
creating the proposal and sending the proposal to the RAFT node. This can lead
to situations where more than one new node is assigned the same ID.

To solve this, this change introduces a new field to generate the IDs that is
updated immediately, thus preventing multiple nodes from being assigned
the same ID.

Fixes #5436

cherry-picked from c093805
martinmr added a commit that referenced this issue Jun 9, 2020
Currently, zero uses the MaxRaftId to assign RAFT IDs to new nodes. However, the
MaxRaftId value is not immediately updated and the lock is released in between
creating the proposal and sending the proposal to the RAFT node. This can lead
to situations where more than one new node is assigned the same ID.

To solve this, this change introduces a new field to generate the IDs that is
updated immediately, thus preventing multiple nodes from being assigned
the same ID.

Fixes #5436

cherry-picked from c093805
dna2github pushed a commit to dna2fork/dgraph that referenced this issue Jul 18, 2020
Currently, zero uses the MaxRaftId to assign RAFT IDs to new nodes. However, the
MaxRaftId value is not immediately updated and the lock is released in between
creating the proposal and sending the proposal to the RAFT node. This can lead
to situations where more than one new node is assigned the same ID.

To solve this, this change introduces a new field to generate the IDs that is
updated immediately, thus preventing multiple nodes from being assigned
the same ID.

Fixes hypermodeinc#5436
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/operations Related to operational aspects of the DB, including signals, flags, env vars, etc. kind/bug Something is broken. status/accepted We accept to investigate/work on it.
Development

Successfully merging a pull request may close this issue.

2 participants