Sergey Bronnikov edited this page Jan 19, 2024 · 15 revisions


Fault injection is a technique for improving test coverage by deliberately introducing faults so that rarely executed code paths, in particular error handling paths, get exercised. It is widely considered an important part of developing robust software. Faults can be injected at several levels of the system: in userspace, in the kernel, or in (virtualized) hardware. The table below lists tools for each level.
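As a minimal in-process illustration of the idea, the sketch below forces `open()` to fail with `ENOSPC` so that the error handling branch of a function actually runs. `save_record` and `run_with_injected_fault` are hypothetical names invented for this example; they do not come from any of the tools listed on this page.

```python
import errno
from unittest import mock


def save_record(path: str, data: str) -> bool:
    """Hypothetical function under test: reports I/O failures instead of crashing."""
    try:
        with open(path, "w") as f:
            f.write(data)
        return True
    except OSError:
        # This is the error handling path that fault injection is meant to reach.
        return False


def run_with_injected_fault() -> bool:
    """Make every open() call fail with ENOSPC while save_record runs."""
    fault = OSError(errno.ENOSPC, "No space left on device")
    with mock.patch("builtins.open", side_effect=fault):
        return save_record("/tmp/record.txt", "payload")


if __name__ == "__main__":
    print(run_with_injected_fault())
```

Patching inside the test process is the simplest form of fault injection, but it only reaches code written in the same language. The tools in the table work at lower levels and therefore apply to unmodified binaries.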

| Tool | Level | Target | Comment |
|---|---|---|---|
| CharybdeFS | Userspace (FUSE) | Filesystem | Requires Thrift |
| PetardFS | Userspace (FUSE) | Filesystem | https://github.com/jrandall/petardfs |
| UnreliableFS | Userspace (FUSE) | Filesystem | https://github.com/ligurio/unreliablefs |
| libeatmydata | Userspace (LD_PRELOAD) | Filesystem, fsync() | replace fsync() with no-op, https://github.com/stewartsmith/libeatmydata |
| cleancache | Userspace (LD_PRELOAD) | Filesystem cache | drop files content from page cache after use, https://github.com/kahing/bin/blob/master/cleancache.c |
| Device Mapper | Kernel space | Disk I/O | Use Device Mapper's error/flakey/delay/dm-dust devices to return errors/corruption from, or delay/split IO to a synthesized block device (kernel, requires kernel to have been built with device mapper support, appropriate additional device mapper modules (dm-dust is only available on kernel >=5.2) and to have device mapper userspace bits). |
| QEMU | Hardware | Disk, Memory | blkdebug https://github.com/qemu/qemu/blob/master/docs/devel/blkdebug.txt |
| sysrq | Kernel space | OS crash | echo c > /proc/sysrq-trigger, https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html |
| BSOD | Kernel space | OS crash | Windows only, https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/forcing-a-system-crash-from-the-keyboard |
| strace | Userspace | POSIX API calls | https://strace.io/ |
| libfiu | Userspace (LD_PRELOAD) | POSIX API calls | Use libfiu to perform fault injection on POSIX API calls, http://blitiri.com.ar/p/libfiu/ |
| SystemTap | Userspace | POSIX API calls | Using SystemTap to do fault injection (kernel, requires a kernel to have been built with lots of stuff), https://lwn.net/Articles/289932/ |
| strobe time | Userspace | Time | https://github.com/jepsen-io/jepsen/tree/main/jepsen/resources |
| libfaketime | Userspace (LD_PRELOAD) | Time | https://github.com/wolfcw/libfaketime |
| timeskew | Userspace (LD_PRELOAD) | Time | https://github.com/vi/timeskew |
| Linux kernel's fault injector | Kernel space | - | Use the Linux kernel's fault injector to inject an error into the underlying block device (kernel, requires kernel to have been built with FAIL_MAKE_REQUEST=y). |
| trickle | Userspace | Network | Bandwidth shaper for Unix-like systems, https://github.com/mariusae/trickle |
| tc (Linux), dummynet (FreeBSD) | Kernel space | Network | tc(8), dummynet(4) |
| Linux kernel NVMe fault injection | Kernel space | NVMe | https://www.kernel.org/doc/html/latest/fault-injection/nvme-fault-injection.html |
| Linux kernel notifier fault injection | Kernel space | Kernel events | https://www.kernel.org/doc/html/latest/fault-injection/notifier-error-inject.html |
| Linux kernel fault injection capabilities infrastructure | Kernel space | Memory | https://www.kernel.org/doc/html/latest/fault-injection/fault-injection.html |
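To show how one of the userspace tools listed here is typically driven, the following sketch builds a strace command line that makes a chosen system call fail with a chosen errno, using strace's `-e inject=` option. The helper name `strace_inject_argv`, its defaults, and the example target command are assumptions made for illustration; check `strace(1)` on your system, since syscall tampering requires a reasonably recent strace.

```python
import subprocess
from typing import List


def strace_inject_argv(command: List[str], syscall: str = "openat",
                       error: str = "ENOENT", when: str = "") -> List[str]:
    """Build an argv that runs `command` under strace with one syscall failing."""
    spec = f"inject={syscall}:error={error}"
    if when:
        # Optional "when" expression, e.g. "3" for the third call or "2+" for all from the second.
        spec += f":when={when}"
    # -f follows child processes; trace= limits output to the tampered syscall.
    return ["strace", "-f", "-e", f"trace={syscall}", "-e", spec, *command]


def run_with_syscall_fault(command: List[str], **kwargs: str) -> int:
    """Run the command under strace (strace must be installed) and return its exit code."""
    return subprocess.run(strace_inject_argv(command, **kwargs)).returncode


if __name__ == "__main__":
    print(" ".join(strace_inject_argv(["cat", "/etc/hostname"], when="3")))
```

Because the fault is injected at the system call boundary, the program under test needs no changes and can be written in any language.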

Many fault injection libraries work by means of LD_PRELOAD:

  • https://github.com/yasuoka/mleakdetect
  • libfaketime - changes the speed at which time passes for the program
  • libeatmydata - disables the fsync() call for the program
  • fakeroot - runs a program on Linux with simulated superuser privileges for file operations
  • libshape and trickle - limit network bandwidth
  • unreliablefs - injects filesystem failures (implemented as a FUSE filesystem rather than an LD_PRELOAD library)

Importantly, this approach does not require changing the program's source code, so you never end up building one version of the source code for testing and a different one for release.
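The mechanics of the LD_PRELOAD approach can be sketched as follows: the test harness only changes the environment of the child process, and the binary itself stays untouched. The helper names and the library path in the usage section are hypothetical; the actual location of a library such as libeatmydata depends on your distribution.

```python
import os
import subprocess
from typing import Dict, List, Optional


def preload_env(library: str, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with `library` prepended to LD_PRELOAD."""
    env = dict(os.environ if base is None else base)
    existing = env.get("LD_PRELOAD", "")
    # The dynamic linker accepts a colon-separated list; earlier entries take precedence.
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


def run_preloaded(library: str, argv: List[str]) -> int:
    """Run an unmodified binary with the fault-injecting library preloaded."""
    return subprocess.run(argv, env=preload_env(library)).returncode


if __name__ == "__main__":
    # Hypothetical install path, used here only to show the resulting variable.
    lib = "/usr/lib/x86_64-linux-gnu/libeatmydata.so"
    print(preload_env(lib, base={})["LD_PRELOAD"])
```

Note that LD_PRELOAD only affects dynamically linked programs that call the intercepted functions through libc; statically linked binaries bypass it.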

References:

  1. Restricting program memory https://alex.dzyoba.com/blog/restrict-memory/
  2. Chaos Engineering tools https://github.com/dastergon/awesome-chaos-engineering#notable-tools

Network Condition Profiles

Here is a list of network conditions with values that you can plug into Comcast, a command-line tool for simulating poor network connections. Please add any others that you come across.

source

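| Name | Latency (ms) | Bandwidth (kbit/s) | Packet-loss (%) |
|---|---|---|---|
| GPRS (good) | 500 | 50 | 2 |
| EDGE (good) | 300 | 250 | 1.5 |
| 3G/HSDPA (good) | 250 | 750 | 1.5 |
| DIAL-UP (good) | 185 | 40 | 2 |
| DSL (poor) | 70 | 2000 | 2 |
| DSL (good) | 40 | 8000 | 0.5 |
| WIFI (good) | 40 | 30000 | 0.2 |
| Satellite | 1500 | - | 0.2 |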
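The profiles above can be turned into Comcast invocations with a small helper like the one sketched below. It assumes that latency is given in milliseconds, bandwidth in kbit/s and packet loss in percent, and that Comcast accepts the `--device`, `--latency`, `--target-bw` and `--packet-loss` flags; verify both assumptions against the Comcast documentation for your version. The `PROFILES` and `comcast_args` names and the default device `eth0` are made up for this example.

```python
from typing import Dict, List, Optional, Tuple

# name -> (latency in ms, bandwidth in kbit/s or None when unspecified, packet loss in %)
PROFILES: Dict[str, Tuple[int, Optional[int], float]] = {
    "GPRS (good)": (500, 50, 2),
    "EDGE (good)": (300, 250, 1.5),
    "3G/HSDPA (good)": (250, 750, 1.5),
    "DIAL-UP (good)": (185, 40, 2),
    "DSL (poor)": (70, 2000, 2),
    "DSL (good)": (40, 8000, 0.5),
    "WIFI (good)": (40, 30000, 0.2),
    "Satellite": (1500, None, 0.2),
}


def comcast_args(profile: str, device: str = "eth0") -> List[str]:
    """Build a Comcast command line for one of the named profiles."""
    latency, bandwidth, loss = PROFILES[profile]
    args = ["comcast", f"--device={device}",
            f"--latency={latency}", f"--packet-loss={loss}%"]
    if bandwidth is not None:
        # Profiles without a bandwidth value leave the link unshaped.
        args.append(f"--target-bw={bandwidth}")
    return args


if __name__ == "__main__":
    for name in PROFILES:
        print(name, "->", " ".join(comcast_args(name)))
```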

Statistics of bugs in distributed systems

What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems, see transcription, Usenix

Crash recovery bugs are caused by five types of bug patterns:

  • incorrect backup (17%)
  • incorrect crash/reboot detection (18%)
  • incorrect state identification (16%)
  • incorrect state recovery (28%)
  • concurrency (21%)

Almost all (97%) of crash recovery bugs involve no more than four nodes. This finding indicates that we can detect crash recovery bugs in a small set of nodes, rather than thousands.

A majority (87%) of crash recovery bugs require a combination of no more than three crashes and no more than one reboot. This suggests that almost all node crash scenarios can be tested systematically with a very limited number of crashes and reboots.
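To make the two findings above concrete, the sketch below enumerates every fault schedule for a small cluster under those bounds: at most four nodes, three crashes and one reboot. The event model (a node must be up to crash and down to reboot) and the `fault_schedules` name are assumptions made for this illustration, not the methodology of the cited study.

```python
from typing import FrozenSet, Iterator, List, Tuple

Event = Tuple[str, int]  # ("crash" or "reboot", node index)


def fault_schedules(nodes: int = 4, max_crashes: int = 3,
                    max_reboots: int = 1) -> Iterator[List[Event]]:
    """Yield every non-empty ordered sequence of crash/reboot events within the bounds."""

    def extend(prefix: List[Event], down: FrozenSet[int],
               crashes: int, reboots: int) -> Iterator[List[Event]]:
        if prefix:
            yield list(prefix)
        if crashes < max_crashes:
            for n in range(nodes):
                if n not in down:  # only a running node can crash
                    yield from extend(prefix + [("crash", n)], down | {n},
                                      crashes + 1, reboots)
        if reboots < max_reboots:
            for n in sorted(down):  # only a crashed node can be rebooted
                yield from extend(prefix + [("reboot", n)], down - {n},
                                  crashes, reboots + 1)

    yield from extend([], frozenset(), 0, 0)


if __name__ == "__main__":
    total = sum(1 for _ in fault_schedules())
    print(f"{total} schedules for 4 nodes, <=3 crashes, <=1 reboot")
```

The search space stays small enough to enumerate exhaustively, which is the practical consequence of the bounds reported above. Each schedule would still have to be combined with the points in the workload at which the events fire.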

Crash recovery bugs are difficult to fix. 12% of the fixes are incomplete, and 6% of the fixes only reduce the possibility of bug occurrence. This indicates that new approaches to validate crash recovery bug fixes are necessary.

Uncovering Bugs in Distributed Storage Systems during Testing (not in Production!) (slides)

Using fault injection for testing Linux kernel components (slides)
