Crash when exiting shell on single CPU systems #882

Closed
Tracked by #1100
raphCode opened this issue Nov 18, 2021 · 3 comments · Fixed by #1051
Labels: stability, suspected bug

Comments

@raphCode
Contributor

Basic information

zellij --version: 0.21.0 / 0.20.1 (crash happens with both)
uname -av: Linux skydragon 5.15.2-arch1-1 #1 SMP PREEMPT Fri, 12 Nov 2021 19:22:10 +0000 x86_64 GNU/Linux

Further information
This only happens on my vserver; I cannot reproduce it locally. It happens both via ssh and on a tty accessed through the hoster's VNC feature.
This server has a fairly slow CPU, and I already had problems with earlier zellij builds there, though I no longer remember the details.
The issue seems to be specific to this exact machine.

Reproduction steps:

  1. open zellij
  2. (optional) add a new pane or tab so that zellij stays open after closing a pane
  3. Press Ctrl-D or type exit to quit the shell
  4. Crashes 99% of the time

Backtrace from a debug build of the tree at commit 639566d:

Error occurred in server:
Originating Thread(s):
1. stdin_handler_thread: AcceptInput
2. screen_thread: WriteCharacter

Error: thread 'screen' panicked at 'failed to drain terminal: Sys(EBADF)': zellij-server/src/tab.rs:719
   0: zellij_utils::errors::handle_panic
             at zellij/zellij-utils/src/errors.rs:25:21
   1: zellij_server::start_server::{{closure}}
             at zellij/zellij-server/src/lib.rs:201:13
   2: std::panicking::rust_panic_with_hook
             at /rustc/1.56.1/library/std/src/panicking.rs:628:17
   3: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/1.56.1/library/std/src/panicking.rs:521:13
   4: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/1.56.1/library/std/src/sys_common/backtrace.rs:141:18
   5: rust_begin_unwind
             at /rustc/1.56.1/library/std/src/panicking.rs:517:5
   6: core::panicking::panic_fmt
             at /rustc/1.56.1/library/core/src/panicking.rs:101:14
   7: core::result::unwrap_failed
             at /rustc/1.56.1/library/core/src/result.rs:1617:5
   8: core::result::Result<T,E>::expect
             at /rustc/1.56.1/library/core/src/result.rs:1259:23
   9: zellij_server::tab::Tab::write_to_pane_id
             at zellij/zellij-server/src/tab.rs:717:17
  10: zellij_server::tab::Tab::write_to_active_terminal
             at zellij/zellij-server/src/tab.rs:707:9
  11: zellij_server::screen::screen_thread_main
             at zellij/zellij-server/src/screen.rs:670:30
  12: zellij_server::init_session::{{closure}}
             at zellij/zellij-server/src/lib.rs:569:17
  13: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/1.56.1/library/std/src/sys_common/backtrace.rs:125:18
  14: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}
             at /rustc/1.56.1/library/std/src/thread/mod.rs:481:17
  15: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/1.56.1/library/core/src/panic/unwind_safe.rs:271:9
  16: std::panicking::try::do_call
             at /rustc/1.56.1/library/std/src/panicking.rs:403:40
  17: __rust_try
  18: std::panicking::try
             at /rustc/1.56.1/library/std/src/panicking.rs:367:19
  19: std::panic::catch_unwind
             at /rustc/1.56.1/library/std/src/panic.rs:129:14
  20: std::thread::Builder::spawn_unchecked::{{closure}}
             at /rustc/1.56.1/library/std/src/thread/mod.rs:480:30
  21: core::ops::function::FnOnce::call_once{{vtable.shim}}
             at /rustc/1.56.1/library/core/src/ops/function.rs:227:5
  22: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/1.56.1/library/alloc/src/boxed.rs:1636:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/1.56.1/library/alloc/src/boxed.rs:1636:9
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/1.56.1/library/std/src/sys/unix/thread.rs:106:17
  23: start_thread
  24: __GI___clone
@raphCode
Contributor Author

Yeee-ha!
Found a reproducer, at least on my second vServer: running with only one CPU triggers this bug almost every time! My first server has only one CPU, this is why I was seeing this almost every time.

Here is how you disable all CPU cores but one in Linux at runtime:
echo 0 | sudo tee /sys/devices/system/cpu/cpu*/online

After that, try the reproducer in my first post.
The issue smells like a race condition, based on which threads are scheduled first.
I look forward to seeing where the problem comes from, since safe Rust should not exhibit such bugs.

Cores can be enabled again with echo 1 | ...
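
Before retrying the reproducer, it can help to confirm that the cores were really taken offline. The following is a minimal sketch, not part of zellij: it reads the standard Linux sysfs file /sys/devices/system/cpu/online, which lists online CPUs in "cpulist" form such as 0-3,5. The helper name parse_cpu_list is hypothetical and only used for illustration.

```rust
use std::fs;

/// Parse a Linux "cpulist" string such as "0-3,5" into individual CPU indices.
/// Returns None if any part is not a number or a valid "start-end" range.
/// (Hypothetical helper, for illustration only.)
fn parse_cpu_list(list: &str) -> Option<Vec<usize>> {
    let mut cpus = Vec::new();
    for part in list.trim().split(',').filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            Some((start, end)) => {
                let start = start.parse::<usize>().ok()?;
                let end = end.parse::<usize>().ok()?;
                if start > end {
                    return None;
                }
                cpus.extend(start..=end);
            }
            None => cpus.push(part.parse::<usize>().ok()?),
        }
    }
    Some(cpus)
}

fn main() {
    // Standard sysfs file listing the CPUs that are currently online.
    match fs::read_to_string("/sys/devices/system/cpu/online") {
        Ok(raw) => match parse_cpu_list(&raw) {
            Some(cpus) => println!("online CPUs: {:?} ({} total)", cpus, cpus.len()),
            None => eprintln!("unexpected format: {:?}", raw),
        },
        Err(err) => eprintln!("could not read the sysfs file: {err}"),
    }
}
```

If the command above worked, this should report a single online CPU, which is the configuration where the crash triggers almost every time.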

raphCode changed the title from "Crash when exiting shell" to "Crash when exiting shell on single CPU systems" on Jan 30, 2022
@raphCode
Contributor Author

raphCode commented Feb 7, 2022

Since this bug depends on a race condition, I managed to reproduce it even on multi-CPU systems:

  1. Hold/Press Alt + n a bunch of times to open a lot of panes
  2. Hold or quickly press Ctrl + d multiple times to exit the shells in the panes

Please note that I tried this on a pretty slow Intel Atom CPU, so it might not trigger on faster systems.

I just want to raise awareness that there are currently at least two crashes caused by race conditions around the handling of the terminal file descriptor. I say two because I got two different error messages (the one from the first post and the one below).

Error occurred in server:
Originating Thread(s):
1. stdin_handler_thread: AcceptInput
2. screen_thread: WriteCharacter

Error: thread 'screen' panicked at 'failed to write to terminal: Sys(EBADF)': zellij-server/src/tab/mod.rs:750

raphCode added a commit to raphCode/zellij that referenced this issue Feb 10, 2022
Quick and dirty bandaid fix for some server crashes I have been hitting lately.
The underlying issue seems to be a race condition somewhere when the shell in the pane
exits and the tty file descriptor becomes invalid, but zellij still wants to write to or read from it?

Bug trigger:
- open some panes
- exit the shells in the panes by spamming Ctrl-D

Works best when the system runs on a single CPU; run the following to disable all
cores but one:
echo 0 | sudo tee /sys/devices/system/cpu/cpu*/online
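
To illustrate the general idea of such a bandaid, here is a minimal sketch that treats EBADF as "the pane is already gone" instead of panicking through expect(). This is not the code from the referenced commit or from the fix in #1051: the names write_to_pane, WriteOutcome and ClosedTty are hypothetical, and a mock writer stands in for the real tty file descriptor so the example stays deterministic.

```rust
use std::io::{self, Write};

/// errno value for EBADF ("bad file descriptor") on Linux.
const EBADF: i32 = 9;

/// Outcome of forwarding input to a pane (hypothetical type).
#[derive(Debug, PartialEq)]
enum WriteOutcome {
    /// The bytes were written and flushed; carries the byte count.
    Written(usize),
    /// The terminal's file descriptor was already closed.
    PaneGone,
}

/// Write `bytes` to the pane's terminal and flush it.
/// EBADF is reported as `PaneGone`; any other error is passed to the caller.
fn write_to_pane<W: Write>(tty: &mut W, bytes: &[u8]) -> io::Result<WriteOutcome> {
    let result = tty.write_all(bytes).and_then(|()| tty.flush());
    match result {
        Ok(()) => Ok(WriteOutcome::Written(bytes.len())),
        Err(err) if err.raw_os_error() == Some(EBADF) => Ok(WriteOutcome::PaneGone),
        Err(err) => Err(err),
    }
}

/// Stand-in for a terminal whose file descriptor has been closed.
struct ClosedTty;

impl Write for ClosedTty {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::from_raw_os_error(EBADF))
    }

    fn flush(&mut self) -> io::Result<()> {
        Err(io::Error::from_raw_os_error(EBADF))
    }
}

fn main() {
    // A Vec<u8> plays the role of a healthy terminal.
    let mut open_tty: Vec<u8> = Vec::new();
    println!("{:?}", write_to_pane(&mut open_tty, b"ls\n"));

    // Writing to the closed terminal no longer panics.
    println!("{:?}", write_to_pane(&mut ClosedTty, &[0x04]));
}
```

Swallowing the error this way only hides the symptom; the ordering problem between the thread that closes the pane and the thread that writes to it would still need a proper fix.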
a-kenji added the stability label on Mar 9, 2022
@tlinford
Contributor

> Yeee-ha! Found a reproducer, at least on my second vServer: running with only one CPU triggers this bug almost every time! My first server has only one CPU, this is why I was seeing this almost every time.
>
> Here is how you disable all CPU cores but one in Linux at runtime: echo 0 | sudo tee /sys/devices/system/cpu/cpu*/online
>
> After that, try the reproducer in my first post. The issue smells like a race condition, based on which threads are scheduled first. I look forward to see where the problem comes from, since safe Rust should not exhibit such bugs.

Cool, I can easily reproduce it with these steps. (super weird to go from 16 cores to 1 btw 😄)
Did some digging and can add one piece of info for now: the crash seems to always happen when trying to write a 0x4 (End of Transmission) byte.
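
For context, 0x4 is the byte a terminal sends for Ctrl-D, which matches the reproduction steps: the write that crashes is the one carrying the keypress that makes the shell exit. A minimal sketch of that mapping follows; the helper name ctrl is hypothetical.

```rust
/// Map a letter to the byte a terminal sends for Ctrl + that letter.
/// This is the ASCII control code, obtained by keeping the low five bits.
/// (Hypothetical helper, for illustration only.)
fn ctrl(key: u8) -> u8 {
    key & 0x1f
}

fn main() {
    // 'd' is 0x64, and 0x64 & 0x1f is 0x04 (End of Transmission).
    println!("Ctrl-D sends byte 0x{:02x}", ctrl(b'd'));
}
```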
