fault injection
Fault injection is a technique for improving test coverage by deliberately introducing faults so that rarely executed code paths, in particular error handling paths, get exercised. It is widely considered an important part of developing robust software. There are many ways to inject faults when assessing a system.
Tool | Level | Target | Comment |
---|---|---|---|
CharybdeFS | Userspace (FUSE) | Filesystem | Requires Thrift |
PetardFS | Userspace (FUSE) | Filesystem | https://github.com/jrandall/petardfs |
UnreliableFS | Userspace (FUSE) | Filesystem | https://github.com/ligurio/unreliablefs |
libeatmydata | Userspace (LD_PRELOAD) | Filesystem, fsync() | Replaces fsync() with a no-op, https://github.com/stewartsmith/libeatmydata |
cleancache | Userspace (LD_PRELOAD) | Filesystem cache | Drops file contents from the page cache after use, https://github.com/kahing/bin/blob/master/cleancache.c |
Device Mapper | Kernel space | Disk I/O | Use Device Mapper's error, flakey, delay, or dm-dust targets to return errors or corruption from, or to delay or split I/O to, a synthesized block device. Requires a kernel built with Device Mapper support, the relevant Device Mapper modules (dm-dust is only available on kernel >= 5.2), and the Device Mapper userspace tools. |
QEMU | Hardware | Disk, Memory | blkdebug, https://github.com/qemu/qemu/blob/master/docs/devel/blkdebug.txt |
sysrq | Kernel space | OS crash | echo c > /proc/sysrq-trigger, https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html |
BSOD | Kernel space | OS crash | Windows only, https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/forcing-a-system-crash-from-the-keyboard |
strace | Userspace | POSIX API calls | https://strace.io/ |
libfiu | Userspace (LD_PRELOAD) | POSIX API calls | Use libfiu to perform fault injection on POSIX API calls, http://blitiri.com.ar/p/libfiu/ |
SystemTap | Userspace | POSIX API calls | Use SystemTap to do fault injection (requires a kernel built with the configuration options that SystemTap depends on), https://lwn.net/Articles/289932/ |
strobe time | Userspace | Time | https://github.com/jepsen-io/jepsen/tree/main/jepsen/resources |
libfaketime | Userspace (LD_PRELOAD) | Time | https://github.com/wolfcw/libfaketime |
timeskew | Userspace (LD_PRELOAD) | Time | https://github.com/vi/timeskew |
Linux kernel's fault injector | Kernel space | - | Use the Linux kernel's fault injector to inject an error into the underlying block device (requires a kernel built with CONFIG_FAIL_MAKE_REQUEST=y). |
trickle | Userspace | Network | Bandwidth shaper for Unix-like systems, https://github.com/mariusae/trickle |
tc (Linux), dummynet (FreeBSD) | Kernel space | Network | tc(8), dummynet(4) |
Linux kernel NVMe fault injection | Kernel space | NVMe | https://www.kernel.org/doc/html/latest/fault-injection/nvme-fault-injection.html |
Linux kernel notifier fault injection | Kernel space | Kernel events | https://www.kernel.org/doc/html/latest/fault-injection/notifier-error-inject.html |
Linux kernel fault injection capabilities infrastructure | Kernel space | Memory | https://www.kernel.org/doc/html/latest/fault-injection/fault-injection.html |
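The tools in the table inject faults from outside the program. The same idea can be sketched inside a test process: the hedged Python example below replaces `os.fsync` with a stub that raises `EIO`, so the error handling branch of a hypothetical `save_durably` helper actually runs. The helper, the stub, and the variable names are illustrative assumptions, not part of any tool listed here.

```python
import errno
import os
import tempfile
from unittest import mock


def save_durably(path, data):
    """Write data and fsync it; return False instead of raising on I/O errors."""
    try:
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        return True
    except OSError:
        # The error handling path that ordinary tests rarely reach.
        return False


def failing_fsync(fd):
    """Injected fault (hypothetical stub): pretend the disk returned an I/O error."""
    raise OSError(errno.EIO, "injected fsync failure")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.bin")
    # Normal run: no fault, the write succeeds.
    ok_without_fault = save_durably(target, b"payload")
    # Faulty run: os.fsync is replaced only for the duration of the block.
    with mock.patch("os.fsync", side_effect=failing_fsync):
        ok_with_fault = save_durably(target, b"payload")
```

Patching in process is the cheapest form of fault injection, but it only covers code that goes through the patched name; the LD_PRELOAD, FUSE, and kernel level tools above also reach native code and third-party libraries.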
Many fault injection libraries work thanks to LD_PRELOAD:
- https://github.com/yasuoka/mleakdetect
- libfaketime - changes the rate at which time passes
- libeatmydata - disables the fsync() call for our program
- fakeroot - runs a program on Linux with simulated superuser privileges for file operations
- libshape and trickle - limit network bandwidth
- unreliablefs - injects file system failures
Importantly, this approach does not require changing the program's source code, so you never end up building one version of the source for testing and another for release.
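A minimal sketch of how a test launcher might enable an LD_PRELOAD library without touching the program under test. The helper name `preload_env` and the library path are assumptions for illustration; the real path of a preload library differs between distributions.

```python
import os


def preload_env(library_path, base_env=None):
    """Return a copy of the environment with library_path prepended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # The dynamic linker accepts a colon-separated list of libraries.
    env["LD_PRELOAD"] = f"{library_path}:{existing}" if existing else library_path
    return env


# Hypothetical install path, used only to show the resulting environment.
FAULT_LIB = "/usr/lib/libeatmydata.so"
env = preload_env(FAULT_LIB, base_env={"PATH": "/usr/bin"})
# Passing env=env to subprocess.run() would start the unmodified binary
# with the fault injection library loaded ahead of libc.
```

Because only the environment of the child process changes, the tested binary is the same one that ships in the release.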
- Restricting program memory https://alex.dzyoba.com/blog/restrict-memory/
- Chaos Engineering tools https://github.com/dastergon/awesome-chaos-engineering#notable-tools
Here's a list of network conditions with values that you can plug into Comcast. Please add any more that you may come across.
Name | Latency (ms) | Bandwidth (Kbit/s) | Packet loss (%) |
---|---|---|---|
GPRS (good) | 500 | 50 | 2 |
EDGE (good) | 300 | 250 | 1.5 |
3G/HSDPA (good) | 250 | 750 | 1.5 |
DIAL-UP (good) | 185 | 40 | 2 |
DSL (poor) | 70 | 2000 | 2 |
DSL (good) | 40 | 8000 | 0.5 |
WIFI (good) | 40 | 30000 | 0.2 |
Satellite | 1500 | - | 0.2 |
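The table can be turned into Comcast command lines with a small helper. This is a sketch under assumptions: the units (milliseconds, Kbit/s, percent) and the flag names `--device`, `--latency`, `--target-bw`, and `--packet-loss` are recalled from the Comcast README and should be checked against `comcast --help`; `eth0` and the function name `comcast_args` are placeholders.

```python
# Values copied from the table above: (latency, bandwidth, packet loss).
# None means the table gives no bandwidth value for that profile.
PROFILES = {
    "GPRS (good)": (500, 50, 2),
    "EDGE (good)": (300, 250, 1.5),
    "3G/HSDPA (good)": (250, 750, 1.5),
    "DIAL-UP (good)": (185, 40, 2),
    "DSL (poor)": (70, 2000, 2),
    "DSL (good)": (40, 8000, 0.5),
    "WIFI (good)": (40, 30000, 0.2),
    "Satellite": (1500, None, 0.2),
}


def comcast_args(profile, device="eth0"):
    """Build an argument list for the given profile (flag names are assumptions)."""
    latency, bandwidth, loss = PROFILES[profile]
    args = ["comcast", f"--device={device}", f"--latency={latency}"]
    if bandwidth is not None:
        # Skip the bandwidth limit when the table leaves it unspecified.
        args.append(f"--target-bw={bandwidth}")
    args.append(f"--packet-loss={loss}%")
    return args


edge = comcast_args("EDGE (good)")
satellite = comcast_args("Satellite")
```

The resulting lists can be passed to `subprocess.run()`; keeping the profiles in one dictionary makes it easy to loop a test suite over every network condition.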
What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems, see transcription, Usenix
Crash recovery bugs are caused by five types of bug patterns:
- incorrect backup (17%)
- incorrect crash/reboot detection (18%)
- incorrect state identification (16%)
- incorrect state recovery (28%)
- concurrency (21%)
Almost all (97%) of crash recovery bugs involve no more than four nodes. This finding indicates that we can detect crash recovery bugs in a small set of nodes, rather than thousands.
A majority (87%) of crash recovery bugs require a combination of no more than three crashes and no more than one reboot. It suggests that we can systematically test almost all node crash scenarios with very limited crashes and reboots.
Crash recovery bugs are difficult to fix. 12% of the fixes are incomplete, and 6% of the fixes only reduce the possibility of bug occurrence. This indicates that new approaches to validate crash recovery bug fixes are necessary.
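The bounds quoted above (at most four nodes, three crashes, one reboot) keep the scenario space small enough to enumerate. The sketch below lists which nodes crash and which of them reboot; it deliberately ignores ordering and timing of the faults, and the node names and the function `fault_scenarios` are illustrative assumptions rather than anything taken from the study.

```python
from itertools import combinations


def fault_scenarios(nodes, max_crashes=3, max_reboots=1):
    """Yield (crashed, rebooted) tuples within the given bounds.

    crashed is a set of distinct nodes that crash; rebooted is the subset
    of those nodes that is restarted afterwards.
    """
    for n_crashes in range(1, max_crashes + 1):
        for crashed in combinations(nodes, n_crashes):
            for n_reboots in range(max_reboots + 1):
                for rebooted in combinations(crashed, n_reboots):
                    yield crashed, rebooted


scenarios = list(fault_scenarios(["n1", "n2", "n3", "n4"]))
# 14 crash sets, each combined with "no reboot" or one rebooted node: 42 scenarios.
```

A harness could run each scenario against a four-node test cluster; adding the order of crashes or the moment of injection multiplies the count but keeps it finite.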
Uncovering Bugs in Distributed Storage Systems during Testing (not in Production!) (slides)
Using fault injection for testing Linux kernel components (slides)
Copyright © 2014-2025 Sergey Bronnikov.