Nodes in cluster die with the error: Unable to unmarshal proposal #1169

Closed
bhuthesh opened this issue Jul 10, 2017 · 9 comments

bhuthesh commented Jul 10, 2017

I'm running a cluster using binary built from master (commit 07abc9a).

After starting a bulk load with dgraphloader, I get many retry errors of the form `Retrying req: 59. Error: rpc error: code = Unknown desc = failed to apply mutations error: internal error: context deadline exceeded`.

I think the mutations are timing out due to some nodes dying.

The last log lines before a node dies are:
RECEIVED: MsgApp
SENDING: MsgAppResp
Unable to unmarshal proposal: [8 161 214 170 many many lines (of binary protocol?) 119 51 46 111] %!q(MISSING)

EDIT: Tagging @janardhan1993

janardhan1993 self-assigned this Jul 10, 2017
janardhan1993 added the kind/bug label Jul 11, 2017
janardhan1993 added this to the v0.8 milestone Jul 11, 2017
@bhuthesh (Author)

Hi @janardhan1993, any timeline on when v0.8 will be released?


janardhan1993 commented Jul 12, 2017

We started using a slice pool for proposals in master. We were returning the slice to the pool as soon as the proposal was applied in memory, but the Raft memory storage keeps the entries around and sends them to the replicas, so a slice could be reused while it is still referenced. We should be returning the slice only after the corresponding entry is compacted (after a snapshot).

@bhuthesh: Can you please try the bugs/github_issues branch and let us know if it fixes your issue?
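
For readers unfamiliar with the buffer-reuse pattern described in the comment above, here is a minimal Go sketch of the race, assuming a sync.Pool-backed proposal buffer. All names (`proposalPool`, `applyProposal`, `proposalBuffers`, `compactedThrough`) are hypothetical and do not correspond to Dgraph's actual code; the point is only that the backing slice must not be recycled until Raft has compacted the entry, because the Raft memory storage may still serve that entry's bytes to replicas, which then fail to unmarshal the reused buffer.

```go
// Illustrative sketch only: names are hypothetical, not Dgraph's real API.
package main

import "sync"

var proposalPool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 256) },
}

// Buggy pattern: the buffer goes back to the pool right after apply,
// even though raft's memory storage may still hand the same bytes to
// slower replicas, which then see garbage and fail to unmarshal.
func applyAndRecycleTooEarly(data []byte) {
	applyProposal(data)
	proposalPool.Put(data[:0]) // premature: the entry is still referenced
}

// Fixed pattern: hold the buffer until the entry index is compacted
// (covered by a snapshot), and only then return it to the pool.
type proposalBuffers struct {
	mu      sync.Mutex
	pending map[uint64][]byte // raft entry index -> backing slice
}

func (p *proposalBuffers) applied(index uint64, data []byte) {
	applyProposal(data)
	p.mu.Lock()
	p.pending[index] = data // keep it; memory storage still uses it
	p.mu.Unlock()
}

func (p *proposalBuffers) compactedThrough(index uint64) {
	p.mu.Lock()
	for idx, data := range p.pending {
		if idx <= index {
			proposalPool.Put(data[:0]) // safe: entry is gone from storage
			delete(p.pending, idx)
		}
	}
	p.mu.Unlock()
}

func applyProposal(data []byte) {
	// placeholder for unmarshalling and applying the proposal
	_ = data
}

func main() {}
```

Putting `data[:0]` back keeps the backing array pooled while resetting its length; tying the `Put` to the compaction/snapshot point rather than to apply is the lifecycle change the comment above describes.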


srh commented Jul 12, 2017

I get a similar thing. It happens if I make a 3-node cluster, shut down one node, load a bunch of data, then reconnect the node. It possibly also requires making an additional mutation query to trigger; I haven't narrowed that down yet.

2017/07/11 23:12:09 draft.go:590: Unable to unmarshal proposal: unexpected EOF "\b\xa9\xf8\xee\x9f\x03\x12\xe2\xc7\x04\b\x01\x122\tw\f\x02\x00\x00\x00\x00\x00\x12\x04name\x1a\x1bHinkle's Employee Singer #2 \t:\x02en\x12\x1c\tw\f\x02\x00\x00\x00\x00\x00\x12\v_predicate_\x1a\x04name\x12\x1c\t\x12Mv\n\xd351h\x12\bstarring)x\f\x02\x00\x00\x00\x00\x00\x12 \t\x12Mv\n\xd351h\x12\v_predicate_\x1a\bstarring\x12$\tx\f\x02\x00\x00\x00\x00\x00\x12\x10performance.film)\x12Mv\n\xd351h\x12(\t"


srh commented Jul 12, 2017

It doesn't require an additional mutation query, and I can also reproduce it on bugs/github_issues.

Here are the basic steps I used:

cd ~
mkdir blah
cd blah
echo 'default: fp % 1 + 1' > groups.conf
mkdir svr{1..3}

# Start node 1
~/go/bin/dgraph --group_conf ~/blah/groups.conf --groups "0,1" \
  --idx 1 --port_offset 0 --my "127.0.0.1:12345" --debugmode=true \
  --p ~/blah/svr1/p --w ~/blah/svr1/w > ~/blah/svr1/dgraph.log 2>&1 &

# Start node 2
~/go/bin/dgraph --group_conf ~/blah/groups.conf --groups "0,1" --idx 2 \
  --port_offset 1 --my "127.0.0.1:12346" --peer "127.0.0.1:12345" \
  --debugmode=true --p ~/blah/svr2/p --w ~/blah/svr2/w \
  > ~/blah/svr2/dgraph.log 2>&1 &

# Start node 3
~/go/bin/dgraph --group_conf ~/blah/groups.conf --groups "0,1" --idx 3 \
  --port_offset 2 --my "127.0.0.1:12347" --peer "127.0.0.1:12346" \
  --debugmode=true --p ~/blah/svr3/p --w ~/blah/svr3/w \
  > ~/blah/svr3/dgraph.log 2>&1 &

# Kill node 3
kill %3

# Load data (from benchmarks repo)
cd ~/benchmarks
~/go/bin/dgraphloader -r ~/1million.rdf.gz -s data/21million.schema 

# When that finishes, revive node 3
cd ~/blah
~/go/bin/dgraph --group_conf ~/blah/groups.conf --groups "0,1" --idx 3 \
  --port_offset 2 --my "127.0.0.1:12347" --peer "127.0.0.1:12346" \
  --debugmode=true --p ~/blah/svr3/p --w ~/blah/svr3/w \
  > ~/blah/svr3/dgraph.log 2>&1 &

I'll note that I didn't run this as a script; there was some reasonable human waiting time between commands.

Edit: Since I can reproduce it, I'll go and try to bisect the error.


srh commented Jul 12, 2017

This is going to be fixed with the latest update to #1176.

@bhuthesh (Author)

@janardhan1993 Since it is reproducible in bugs/github_issues, as mentioned by @srh, shall I wait for more fixes before I test it again?

@janardhan1993 (Contributor)

@bhuthesh you can try now. I forgot to push the changes to the branch; the fix is there now.

@bhuthesh (Author)

@janardhan1993 I was able to run 22 servers and index some 50M RDFs without losing any nodes. It seems the issue has been fixed in bugs/github_issues.

@janardhan1993 (Contributor)

Fix merged to master

manishrjain added the kind/bug label Mar 22, 2018
jarifibrahim pushed a commit that referenced this issue Mar 16, 2020
Important changes
```
 - Changes to overlap check in compaction.
 - Remove 'this entry should've been caught' log.
 - Changes to write stalling on levels 0 and 1.
 - Compression is disabled by default in Badger.
 - Bloom filter caching in a separate ristretto cache.
 - Compression/Encryption in background.
 - Disable cache by default in badger.
```

The following new changes are being added from Badger:
`git log ab4352b00a17...91c31ebe8c22`

```
91c31eb Disable cache by default (#1257)
eaf64c0 Add separate cache for bloom filters (#1260)
1bcbefc Add BypassDirLock option (#1243)
c6c1e5e Add support for watching nil prefix in subscribe API (#1246)
b13b927 Compress/Encrypt Blocks in the background (#1227)
bdb2b13 fix changelog for v2.0.2 (#1244)
8dbc982 Add Dkron to README (#1241)
3d95b94 Remove coveralls from Travis Build(#1219)
5b4c0a6 Fix ValueThreshold for in-memory mode (#1235)
617ed7c Initialize vlog before starting compactions in db.Open (#1226)
e908818 Update CHANGELOG for Badger 2.0.2 release. (#1230)
bce069c Fix int overflow for 32bit (#1216)
e029e93 Remove ExampleDB_Subscribe Test (#1214)
8734e3a Add missing package to README for badger.NewEntry (#1223)
78d405a Replace t.Fatal with require.NoError in tests (#1213)
c51748e Fix flaky TestPageBufferReader2 test (#1210)
eee1602 Change else-if statements to idiomatic switch statements. (#1207)
3e25d77 Rework concurrency semantics of valueLog.maxFid (#1184) (#1187)
4676ca9 Add support for caching bloomfilters (#1204)
c3333a5 Disable compression and set ZSTD Compression Level to 1 (#1191)
0acb3f6 Fix L0/L1 stall test (#1201)
7e5a956 Support disabling the cache completely. (#1183) (#1185)
82381ac Update ristretto to version  8f368f2 (#1195)
3747be5 Improve write stalling on level 0 and 1
5870b7b Run all tests on CI (#1189)
01a00cb Add Jaegar to list of projects (#1192)
9d6512b Use fastRand instead of locked-rand in skiplist (#1173)
2698bfc Avoid sync in inmemory mode (#1190)
2a90c66 Remove the 'this entry should've caught' log from value.go (#1170)
0a06173 Fix checkOverlap in compaction (#1166)
0f2e629 Fix windows build (#1177)
03af216 Fix commit sha for WithInMemory in CHANGELOG. (#1172)
23a73cd Update CHANGELOG for v2.0.1 release. (#1181)
465f28a Cast sz to uint32 to fix compilation on 32 bit (#1175)
ea01d38 Rename option builder from WithInmemory to WithInMemory. (#1169)
df99253 Remove ErrGCInMemoryMode in CHANGELOG. (#1171)
8dfdd6d Adding changes for 2.0.1 so far (#1168)
```