Takes too long time for a cluster to recover from leader crash #2866

kikimo · 2021-09-15T02:26:13Z

Describe the bug (must be provided)

It takes a very long time for the cluster to recover from a leader crash.

Your Environments (must be provided)

OS: uname -a
Compiler: g++ --version or clang++ --version
CPU: lscpu
Commit id (e.g. a3ffc7d8)
https://github.com/liuyu85cn/nebula.git 730a39d

How To Reproduce (must be provided)

Steps to reproduce the behavior:

1. Create a cluster with 3 storage + 3 graph + 1 meta, and a space with 3 partitions + 3 replicas.
2. Keep inserting edges.
3. Send SIGSTOP to one leader (you can check the partition leaders by typing show hosts in nebula-console).

Although nebula storage elects a new leader very quickly, it takes nearly 5 minutes for the cluster to get back to normal.
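One way to put a number on "back to normal" is to poll a health check and time how long it takes to succeed again. The sketch below is only an illustration in Python: `time_until_healthy` and its `probe` argument are hypothetical names, and `probe` stands for whatever check you trust (for example, a function that inserts one edge through your client and returns True on success). It is not part of nebula.

```python
import time


def time_until_healthy(probe, timeout=600.0, interval=1.0):
    """Poll `probe` until it returns a truthy value.

    Returns the elapsed seconds, or None if `timeout` seconds pass first.
    `probe` is a hypothetical, caller-supplied callable.
    """
    start = time.monotonic()
    while True:
        try:
            ok = bool(probe())
        except Exception:
            # A failed request just means "not recovered yet".
            ok = False
        elapsed = time.monotonic() - start
        if ok:
            return elapsed
        if elapsed >= timeout:
            return None
        time.sleep(interval)
```

Start the timer right after sending SIGSTOP to the leader; the returned value is then the recovery time as seen by a client, which can differ from the time raft needs to elect a new leader.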

liuyu85cn · 2021-09-28T05:41:10Z

In one reproduction, raft took 15 minutes to do a rollback, and the graph client appeared to be stuck.

Sophie-Xie · 2021-09-28T05:47:45Z

> In one reproduction, raft took 15 minutes to do a rollback, and the graph client appeared to be stuck.

@Aiee Go client, please check it.

Sophie-Xie · 2021-09-28T05:48:19Z

raft #2903

critical27 · 2021-10-12T03:56:55Z

This could be quite a complicated case. Frankly speaking, I have no idea what happens when SIGSTOP is sent to the thrift server; this may involve the system futex. (Raft taking 15 minutes to do a rollback clearly cannot be blamed on the rollback itself.)

We can investigate later.

kikimo · 2021-10-12T12:17:11Z

> This could be quite a complicated case. Frankly speaking, I have no idea what happens when SIGSTOP is sent to the thrift server; this may involve the system futex. (Raft taking 15 minutes to do a rollback clearly cannot be blamed on the rollback itself.)
>
> We can investigate later.

SIGSTOP simply stops the process from being scheduled on the CPU until it is resumed, exactly as when you run gdb and attach to the process. If the process blocks in a futex() call, I'm 100% sure that is not caused by SIGSTOP. I'll add more detail later.
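The point about SIGSTOP can be seen with a small Python sketch, assuming a POSIX system. It is an illustration only and not nebula code; the echo child and the helper names `reply_within` and `demo` are hypothetical. A stopped child keeps its pipes open, so writes to it still succeed, but it does not answer until SIGCONT arrives.

```python
import os
import select
import signal
import subprocess
import sys

# Child program: echo every line read from stdin back to stdout.
ECHO_CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line)\n"
    "    sys.stdout.flush()\n"
)


def reply_within(proc, message, timeout):
    """Send one line to the child; return True if an echo arrives in time."""
    proc.stdin.write(message)
    proc.stdin.flush()
    ready, _, _ = select.select([proc.stdout], [], [], timeout)
    if not ready:
        return False
    return proc.stdout.readline() == message


def demo():
    proc = subprocess.Popen(
        [sys.executable, "-u", "-c", ECHO_CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    try:
        before = reply_within(proc, b"ping\n", 5.0)
        os.kill(proc.pid, signal.SIGSTOP)
        # Block until the kernel reports the child as stopped.
        os.waitpid(proc.pid, os.WUNTRACED)
        # The write succeeds (the pipe is still open), but nothing replies.
        during = reply_within(proc, b"ping\n", 0.5)
        os.kill(proc.pid, signal.SIGCONT)
        # The child resumes and echoes the queued line.
        after = reply_within(proc, b"ping\n", 5.0)
        return before, during, after
    finally:
        proc.kill()
        proc.wait()
```

From the peer's side, a stopped process looks like one that is alive but silent: nothing is closed or reset, so a caller only notices through its own timeouts.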

kikimo · 2021-12-23T05:59:13Z

closed by #3435

kikimo added the type/bug Type: something is unexpected label Sep 15, 2021

kikimo added this to the v2.6.0 milestone Sep 15, 2021

kikimo assigned liuyu85cn Sep 15, 2021

kikimo changed the title ~~takes too long time for cluster to recover from leader crash~~ takes too long time for s cluster to recover from leader crash Sep 15, 2021

kikimo changed the title ~~takes too long time for s cluster to recover from leader crash~~ Takes too long time for s cluster to recover from leader crash Sep 15, 2021

kikimo changed the title ~~Takes too long time for s cluster to recover from leader crash~~ Takes too long time for a cluster to recover from leader crash Sep 16, 2021

jamieliu1023 mentioned this issue Sep 18, 2021

Weekly Report 2021-09-18 vesoft-inc/nebula-community#27

Closed

Sophie-Xie assigned critical27 and Aiee Sep 28, 2021

Sophie-Xie unassigned liuyu85cn Sep 28, 2021

Sophie-Xie added the need to discuss Solution: issue or PR without a clear conclusion on whether to handle it label Oct 8, 2021

Sophie-Xie assigned CPWstatic and unassigned critical27 and Aiee Oct 8, 2021

Sophie-Xie removed the need to discuss Solution: issue or PR without a clear conclusion on whether to handle it label Oct 9, 2021

critical27 assigned kikimo, critical27 and liuyu85cn and unassigned CPWstatic Oct 12, 2021

critical27 modified the milestones: v2.6.0, v2.7.0 Oct 12, 2021

Sophie-Xie added this to the v3.0.0 milestone Oct 15, 2021

Sophie-Xie unassigned kikimo Dec 15, 2021

Sophie-Xie unassigned liuyu85cn Dec 15, 2021

kikimo closed this as completed Dec 23, 2021

jamieliu1023 mentioned this issue Dec 25, 2021

Weekly Report 2021-12-24 vesoft-inc/nebula-community#80

Closed
