Resilience engineering notes

Alias: http://resiliencepapers.club (thanks to John Allspaw).

If you're not sure what to read first, check out Resilience engineering: Where do I start?

This file contains notes about people active in resilience engineering, as well as some influential researchers who are no longer with us, organized alphabetically. It also includes people and papers from related fields, such as cognitive systems engineering and naturalistic decision-making.

You might also be interested in my notes on David Woods's Resilience Engineering short course.

Note: there are now multiple contributors to this repository.

For each person, I list concepts that they reference in their writings, along with some publications. The publications lists aren't comprehensive: they're ones I've read or have added to my to-read list.

Some big ideas:

John Allspaw

Allspaw is the former CTO of Etsy. He applies concepts from resilience engineering to the tech industry. He is one of the founders of Adaptive Capacity Labs, a resilience engineering consultancy.

Allspaw tweets as @allspaw.

Selected publications

Selected talks

Lisanne Bainbridge

Bainbridge is (was?) a psychology researcher. (I have not been able to find any recent information about her.)

Contributions

Ironies of automation

Bainbridge is famous for her 1983 Ironies of automation paper, which continues to be frequently cited.

Concepts

  • automation
  • design errors
  • human factors / ergonomics
  • cognitive modelling
  • cognitive architecture
  • mental workload
  • situation awareness
  • cognitive error
  • skill and training
  • interface design

Selected publications

Andrea Baker

Baker is a practitioner who provides training services in human and organizational performance (HOP) and learning teams.

Baker tweets as @thehopmentor.

Concepts

  • Human and organizational performance (HOP)
  • Learning teams
  • Industrial empathy

Selected publications

Johan Bergström

Bergström is a safety researcher and consultant. He runs the Master Program of Human Factors and Systems Safety at Lund University.

Bergström tweets as @bergstrom_johan.

Concepts

  • Analytical traps in accident investigation
    • Counterfactual reasoning
    • Normative language
    • Mechanistic reasoning
  • Generic competencies

Selected publications

Selected talks

Todd Conklin

Conklin's books are on my reading list, but I haven't read anything by him yet. I have listened to his great Preaccident investigation podcast.

Conklin tweets as @preaccident.

Selected publications

Richard I. Cook

Cook is a medical doctor who studies failures in complex systems. He is one of the founders of Adaptive Capacity Labs, a resilience engineering consultancy.

Cook tweets as @ri_cook.

Concepts

  • complex systems
  • degraded mode
  • sharp end (cf. Reason's blunt end)
  • Going solid
  • Cycle of error
  • "new look"

Selected publications

Selected talks

Sidney Dekker

Dekker is a human factors and safety researcher with a background in aviation. His books aimed at a lay audience (Drift Into Failure, Just Culture, The Field Guide to 'Human Error' investigations) have been enormously influential. He was a founder of the MSc programme in Human Factors & Systems Safety at Lund University. His PhD advisor was David Woods.

Dekker tweets as @sidneydekkercom.

Contributions

Drift into failure

Dekker developed the theory of drift, characterized by five concepts:

  1. Scarcity and competition
  2. Decrementalism, or small steps
  3. Sensitive dependence on initial conditions
  4. Unruly technology
  5. Contribution of the protective structure

Just Culture

Dekker examines how cultural norms defining justice can be re-oriented to minimize the negative impact and maximize learning when things go wrong.

  1. Retributive justice as society's traditional idea of justice: distributing punishment to those responsible based on severity of the violation
  2. Restorative justice as an improvement for both victims and practitioners: distributing obligations of rebuilding trust to those responsible based on who is hurt and what they need
  3. First, second, and third victims: an incident's negative impact is felt by more than just the obvious victims
  4. Learning theory: people break rules when they have learned there are no negative consequences, and there are actually positive consequences - in other words, they break rules to get things done to meet production pressure
  5. Reporting culture: contributing to reports of adverse events is meant to help the organization understand what went wrong and how to prevent recurrence, but accurate reporting requires appropriate and proportionate accountability actions
  6. Complex systems: normal behavior of practitioners and professionals in the context of a complex system can appear abnormal or deviant in hindsight, particularly in the eyes of non-expert juries and reviewers
  7. The nature of practitioners: professionals want to do good work, and therefore want to be held accountable for their mistakes; they generally want to help similarly-situated professionals avoid the same mistake.

Concepts

  • Drift into failure
  • Safety differently
  • New view vs old view of human performance & error
  • Just culture
  • complexity
  • broken part
  • Newton-Descartes
  • diversity
  • systems theory
  • unruly technology
  • decrementalism
  • generic competencies

Selected publications

John C. Doyle

Doyle is a control systems researcher. He is seeking to identify the universal laws that capture the behavior of resilient systems, and is concerned with the architecture of such systems.

Concepts

  • Robust yet fragile
  • layered architectures
  • constraints that deconstrain
  • protocol-based architectures
  • emergent constraints
  • Universal laws and architectures
  • conservation laws
  • universal architectures
  • Highly optimized tolerance

Selected publications

Bob Edwards

Edwards is a practitioner who provides training services in human and organizational performance (HOP).

Edwards tweets as @thehopcoach.

Anders Ericsson

Ericsson introduced the idea of deliberate practice as a mechanism for achieving a high level of expertise.

Ericsson isn't directly associated with the field of resilience engineering. However, Gary Klein's work is informed by his, and I have a particular interest in how people improve in expertise, so I'm including him here.

Concepts

  • Expertise
  • Deliberate practice
  • Protocol analysis

Selected publications

Paul Feltovich

Feltovich is a retired Senior Research Scientist at the Florida Institute for Human & Machine Cognition (IHMC), who has done extensive research in human expertise.

Selected publications

Meir Finkel

Finkel is a Colonel in the Israel Defense Forces (IDF) and the Director of the IDF's Ground Forces Concept Development and Doctrine Department.

Selected publications

Ivonne Andrade Herrera

Herrera is an associate professor in the department of industrial economics and technology management at NTNU and a senior research scientist at SINTEF. Her areas of expertise include safety management and resilience engineering in avionics and air traffic management.

List of publications

Erik Hollnagel

Contributions

ETTO principle

Hollnagel proposed that there is always a fundamental tradeoff between efficiency and thoroughness, which he called the ETTO principle.

Safety-I vs. Safety-II

Safety-I: avoiding things that go wrong

  • looking at what goes wrong
  • bimodal view of work and activities (acceptable vs unacceptable)
  • find-and-fix approach
  • prevent transition from 'normal' to 'abnormal'
  • causality credo: believe that adverse outcomes happen because something goes wrong (they have causes that can be found and treated)
  • it either works or it doesn't
  • systems are decomposable
  • functioning is bimodal

Safety-II: performance variability rather than bimodality

  • the system’s ability to succeed under varying conditions, so that the number of intended and acceptable outcomes (in other words, everyday activities) is as high as possible
  • performance is always variable
  • performance variation is ubiquitous
  • things that go right
  • focus on frequent events
  • remain sensitive to possibility of failure
  • be thorough as well as efficient

FRAM

Hollnagel proposed the Functional Resonance Analysis Method (FRAM) for modeling complex socio-technical systems.

Concepts

  • ETTO (efficiency thoroughness tradeoff) principle
  • FRAM (functional resonance analysis method)
  • Safety-I and Safety-II
  • things that go wrong vs things that go right
  • causality credo
  • performance variability
  • bimodality
  • emergence
  • work-as-imagined vs. work-as-done
  • joint cognitive systems

Selected publications

Leila Johannesen

Johannesen is currently a UX researcher and community advocate at IBM. Her PhD dissertation work examined how humans cooperate, including studies of anesthesiologists.

Concepts

  • common ground

Selected publications

Gary Klein

Klein studies how experts are able to quickly make effective decisions in high-tempo situations.

Klein tweets as @KleInsight.

Concepts

  • naturalistic decision making (NDM)
  • intuitive expertise
  • cognitive task analysis
  • common ground
  • problem detection
  • automation as a "team player"

Selected publications

Nancy Leveson

Nancy Leveson is a computer science researcher with a focus on software safety.

Contributions

STAMP

Leveson developed the accident causality model known as STAMP: the Systems-Theoretic Accident Model and Process.

See STAMP for some more detailed notes of mine.

Concepts

  • Software safety
  • STAMP (systems-theoretic accident model and processes)
  • STPA (system-theoretic process analysis) hazard analysis technique
  • CAST (causal analysis based on STAMP) accident analysis technique
  • Systems thinking
  • hazard
  • interactive complexity
  • system accident
  • dysfunctional interactions
  • safety constraints
  • control structure
  • dead time
  • time constants
  • feedback delays

Selected publications

Carl Macrae

Macrae is a social psychology researcher who has done safety research in multiple domains, including aviation and healthcare. He helped set up the new healthcare investigation agency in England. He is currently a professor of organizational behavior and psychology at the Nottingham University Business School.

Macrae tweets as @CarlMacrae.

Concepts

  • risk resilience

Selected publications

Laura Maguire

Maguire is a cognitive systems engineering researcher who is currently completing a PhD at Ohio State University. Maguire has done safety work in multiple domains, including forestry, avalanches, and software services.

Maguire tweets as @LauraMDMaguire.

Anne-Sophie Nyssen

Nyssen is a psychology professor at the University of Liège, who does research on human error in complex systems, in particular in medicine.

A list of publications can be found on her website linked above.

Elinor Ostrom

Ostrom was a Nobel Prize-winning economics and political science researcher.

Selected publications

Concepts

  • tragedy of the commons
  • polycentric governance
  • social-ecological system framework

Jean Pariès

Pariès is the president of Dédale, a safety and human factors consultancy.

Selected publications

Selected talks

Emily Patterson

Patterson is a researcher who applies human factors engineering to improve patient safety in healthcare.

Selected publications

Charles Perrow

Perrow is a sociologist who studied the Three Mile Island disaster. "Normal Accidents" is cited by numerous other influential systems engineering publications such as Vaughan's "The Challenger Launch Decision".

Concepts

  • Complex systems: A system of tightly-coupled components with common mode connections that is prone to unintended feedback loops, complex controls, low observability, and poorly-understood mechanisms. They are not always high-risk, and thus their failure is not always catastrophic.
  • Normal accidents: Complex systems with many components exhibit unexpected interactions in the face of inevitable component failures. When these components are tightly-coupled, failed parts cannot be isolated from other parts, resulting in unpredictable system failures. Crucially, adding more safety devices and automated system controls often makes these coupling problems worse.
  • Common-mode: The failure of one component that serves multiple purposes results in multiple associated failures, often with high interactivity and low linearity - both ingredients for unexpected behavior that is difficult to control.
  • Production pressures and safety: Organizations adopt processes and devices to improve safety and efficiency, but production pressure often defeats any safety gained from the additions: the safety devices allow or encourage more risky behavior. As an unfortunate side-effect, the system is now also more complex.
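The common-mode idea above can be sketched as a small dependency lookup. This is only an illustration under stated assumptions, not a model taken from Perrow's book: the function name `failed_functions` and the plant layout in `depends_on` are hypothetical.

```python
def failed_functions(depends_on, failed_component):
    """Return the functions that stop working when one component fails.

    `depends_on` maps each function to the set of components it needs.
    """
    return sorted(
        function
        for function, components in depends_on.items()
        if failed_component in components
    )


# Hypothetical plant: three functions share a single power bus.
depends_on = {
    "cooling": {"pump_a", "power_bus"},
    "instrumentation": {"power_bus"},
    "alarm": {"power_bus", "sensor_1"},
    "ventilation": {"fan"},
}

# Losing the shared component takes out several functions at once,
# while losing a dedicated component takes out only one.
print(failed_functions(depends_on, "power_bus"))
print(failed_functions(depends_on, "fan"))
```

In this toy layout, the failure of the shared bus also disables the alarm and instrumentation that operators would rely on to diagnose the loss of cooling, which is the kind of interaction the common-mode concept points at.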

Selected publications

Shawna J. Perry

Perry is a medical researcher who studies emergency medicine.

Concepts

  • Underground adaptations
  • Articulated functions vs. important functions
  • Unintended effects
  • Apparent success vs real success
  • Exceptions
  • Dynamic environments

Selected publications

Jens Rasmussen

Jens Rasmussen was a very influential researcher in human factors and safety systems.

Contributions

Skill-rule-knowledge (SRK) model

TBD

Dynamic safety model

Rasmussen proposed a state-based model of a socio-technical system as a system that moves within a region of a state space. The region is surrounded by different boundaries:

  • economic failure
  • unacceptable work load
  • functionally acceptable performance

Figure: Migration to the boundary

Source: Risk management in a dynamic society: a modelling problem

Incentives push the system towards the boundary of acceptable performance: accidents happen when the boundary is exceeded.
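The migration described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not Rasmussen's own formalism: the one-dimensional state, the `pressure` and `noise` parameters, and the function name `steps_until_boundary` are all hypothetical.

```python
import random


def steps_until_boundary(pressure, noise, boundary=1.0, max_steps=10_000, seed=0):
    """Return the step at which a toy operating point crosses the boundary.

    Hypothetical model: the operating point starts at 0.0 and, on each step,
    moves by a constant `pressure` (incentives) plus uniform random `noise`
    (everyday variability). Returns None if the boundary is never crossed
    within `max_steps`.
    """
    rng = random.Random(seed)
    position = 0.0
    for step in range(1, max_steps + 1):
        position += pressure + rng.uniform(-noise, noise)
        if position >= boundary:
            return step
    return None


# With the same random variability, stronger pressure reaches the boundary
# of acceptable performance no later than weaker pressure does.
weak = steps_until_boundary(pressure=0.03, noise=0.02)
strong = steps_until_boundary(pressure=0.05, noise=0.02)
print(weak, strong)
```

The sketch collapses the three boundaries into one and ignores counter-pressures such as safety campaigns, so it only shows the direction of the effect.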

AcciMaps

TBD

Risk management framework

Rasmussen proposed a multi-layer view of socio-technical systems:

Figure: Risk management framework

Source: Risk management in a dynamic society: a modelling problem

Concepts

  • Dynamic safety model
  • Migration toward accidents
  • Risk management framework
  • Boundaries:
    • boundary of functionally acceptable performance
    • boundary to economic failure
    • boundary to unacceptable work load
  • Cognitive systems engineering
  • Skill-rule-knowledge (SRK) model
  • AcciMaps
  • Means-ends hierarchy
  • Ecological interface design
  • Systems approach
  • Control-theoretic
  • decisions, acts, and errors
  • hazard source
  • anatomy of accidents
  • energy
  • systems thinking
  • trial and error experiments
  • defence in depth (fallacy)
  • Role of managers
    • Information
    • Competency
    • Awareness
    • Commitment
  • Going solid

Selected publications

James Reason

Reason is a psychology researcher who did work on understanding and categorizing human error.

Contributions

Accident causation model (Swiss cheese model)

Reason developed an accident causation model that is sometimes known as the Swiss cheese model of accidents. In this model, Reason introduced the terms "sharp end" and "blunt end".
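The intuition behind the model, that a hazard becomes an accident only when it finds a hole in every defensive layer, can be sketched numerically. This is a simplified illustration, not Reason's formulation: it assumes the layers fail independently (which real defenses often do not), and the function names and probabilities are hypothetical.

```python
import math
import random


def breach_probability(hole_probabilities):
    """Probability that a hazard passes every layer, assuming independence."""
    return math.prod(hole_probabilities)


def simulate_breaches(hole_probabilities, trials=100_000, seed=0):
    """Estimate the same probability by Monte Carlo sampling."""
    rng = random.Random(seed)
    breaches = 0
    for _ in range(trials):
        # A hazard gets through only if it finds a hole in every layer.
        if all(rng.random() < p for p in hole_probabilities):
            breaches += 1
    return breaches / trials


layers = [0.5, 0.5]  # hypothetical chance of a hole in each layer
print(breach_probability(layers))
print(simulate_breaches(layers))
```

Under the independence assumption, adding a layer can only lower the breach probability. When layers share a common cause of failure, that guarantee no longer holds.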

Human Error model: Slips, lapses and mistakes

Reason developed a model of the types of errors that humans make:

  • slips
  • lapses
  • mistakes

Concepts

  • Blunt end
  • Human error
  • Slips, lapses and mistakes
  • Swiss cheese model

Selected publications

Emilie M. Roth

Roth is a cognitive psychologist who serves as the principal scientist at Roth Cognitive Engineering, a small company that conducts research and application in the areas of human factors and applied cognitive psychology (cognitive engineering).

Selected publications

Nadine Sarter

Sarter is a researcher in industrial and operations engineering. She is the director of the Center for Ergonomics at the University of Michigan.

Concepts

  • cognitive ergonomics
  • organization safety
  • human-automation/robot interaction
  • human error / error management
  • attention / interruption management
  • design of decision support systems

Selected publications

James C. Scott

Scott is an anthropologist who also does research in political science. While Scott is not a member of the resilience engineering community, his book Seeing like a state has long been a staple of the cognitive systems engineering and resilience engineering communities.

Concepts

  • authoritarian high-modernism
  • legibility
  • mētis

Selected publications

Steven Shorrock

Shorrock is a chartered psychologist and a chartered ergonomist and human factors specialist. He is the editor-in-chief of EUROCONTROL HindSight magazine. He runs the excellent Humanistic Systems blog.

Shorrock tweets as @StevenShorrock.

Diane Vaughan

Vaughan is a sociology researcher who did a famous study of the NASA Challenger accident.

Concepts

  • normalization of deviance

Selected publications

Barry Turner

Turner was a sociologist who greatly influenced the field of organization studies.

Selected publications

Robert L. Wears

Wears was a medical researcher who studied emergency medicine.

Concepts

  • Underground adaptations
  • Articulated functions vs. important functions
  • Unintended effects
  • Apparent success vs real success
  • Exceptions
  • Dynamic environments
  • Systems of care are intrinsically hazardous

Selected publications

David Woods

Woods has a research background in cognitive systems engineering and did work researching NASA accidents. He is one of the founders of Adaptive Capacity Labs, a resilience engineering consultancy.

Woods tweets as @ddwoods2.

Contributions

Woods has contributed an enormous number of concepts.

The adaptive universe

Woods uses the adaptive universe as a lens for understanding the behavior of all different kinds of systems.

All systems exist in a dynamic environment, and must adapt to change.

A successful system will need to adapt by virtue of its success.

Systems can be viewed as units of adaptive behavior (UAB) that interact. UABs exist at different scales (e.g., cell, organ, individual, group, organization).

All systems have competence envelopes, which are constrained by boundaries.

The resilience of a system is determined by how it behaves when it comes near to a boundary.

See Resilience Engineering Short Course for more details.

Charting adaptive cycles

  • Trigger
  • Units of adaptive behavior
  • Goals and goal conflicts
  • Pressure points
  • Subcycles

Graceful extensibility

From The theory of graceful extensibility: basic rules that govern adaptive systems:

(Longer wording)

  1. Adaptive capacity is finite
  2. Events will produce demands that challenge boundaries on the adaptive capacity of any UAB
  3. Adaptive capacities are regulated to manage the risk of saturating CfM
  4. No UAB can have sufficient ability to regulate CfM to manage the risk of saturation alone
  5. Some UABs monitor and regulate the CfM of other UABs in response to changes in the risk of saturation
  6. Adaptive capacity is the potential for adjusting patterns of action to handle future situations, events, opportunities and disruptions
  7. Performance of a UAB as it approaches saturation is different from the performance of that UAB when it operates far from saturation
  8. All UABs are local
  9. There are bounds on the perspective of any UAB, but these limits are overcome by shifts and contrasts over multiple perspectives.
  10. Reflective systems risk mis-calibration

(Shorter wording)

  1. Boundaries are universal
  2. Surprise occurs, continuously
  3. Risk of saturation is monitored and regulated
  4. Synchronization across multiple units of adaptive behavior in a network is necessary
  5. Risk of saturation can be shared
  6. Pressure changes what is sacrificed when
  7. Pressure for optimality undermines graceful extensibility
  8. All adaptive units are local
  9. Perspective contrast overcomes bounds
  10. Mis-calibration is the norm
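One way to read "Risk of saturation can be shared" is as pooling spare capacity across neighboring units. The sketch below is a loose toy interpretation, not a model from Woods's paper: the function name `total_unmet_demand`, the numeric demands and capacities, and the assumption that capacity can be pooled without cost are all hypothetical.

```python
def total_unmet_demand(demands, capacities, share=False):
    """Return the total demand that the units fail to handle.

    Each unit has a finite capacity. Without sharing, a unit that saturates
    drops its excess demand even if a neighbor has spare capacity. With
    sharing, spare capacity anywhere in the network can absorb the excess.
    """
    if len(demands) != len(capacities):
        raise ValueError("demands and capacities must have the same length")
    if share:
        return max(0, sum(demands) - sum(capacities))
    return sum(max(0, d - c) for d, c in zip(demands, capacities))


# A surprise pushes the first unit past its capacity while the second is idle.
print(total_unmet_demand([12, 3], [10, 10]))
print(total_unmet_demand([12, 3], [10, 10], share=True))
```

Sharing helps only while spare capacity exists somewhere in the network. If every unit is already stretched, pooling changes nothing, which is consistent with the rule that adaptive capacity is finite.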

Concepts

Many of these are mentioned in Woods's short course.

  • the adaptive universe
  • unit of adaptive behavior (UAB), adaptive unit
  • adaptive capacity
  • continuous adaptation
  • graceful extensibility
  • sustained adaptability
  • Tangled, layered networks (TLN)
  • competence envelope
  • adaptive cycles/histories
  • precarious present (unease)
  • resilient future
  • tradeoffs, five fundamental
  • efflorescence: the degree that changes in one area tend to recruit or open up beneficial changes in many other aspects of the network - which opens new opportunities across the network ...
  • reverberation
  • adaptive stalls
  • borderlands
  • anticipate
  • synchronize
  • proactive learning
  • initiative
  • reciprocity
  • SNAFUs
  • robustness
  • surprise
  • dynamic fault management
  • software systems as "team players"
  • multi-scale
  • brittleness
  • decompensation
  • working at cross-purposes
  • proactive learning vs getting stuck
  • oversimplification
  • fixation
  • fluency law, veil of fluency
  • capacity for manoeuvre (CfM)
  • crunches
  • sharp end, blunt end
  • adaptive landscapes
  • law of stretched systems: Every system is continuously stretched to operate at capacity.
  • cascades
  • adapt how to adapt
  • unit working hard to stay in control
  • you can monitor how hard you're working to stay in control (monitor risk of saturation)
  • reality trumps algorithms
  • stand down
  • time matters
  • Properties of resilient organizations
    • Tangible experience with surprise
    • uneasy about the precarious present
    • push initiative down
    • reciprocity
    • align goals across multiple units
  • goal conflicts, goal interactions (follow them!)
  • to understand system, must study it under load
  • adaptive races are unstable
  • adaptive traps
  • roles, nesting of
  • hidden interdependencies
  • net adaptive value
  • matching tempos
  • tilt toward florescence
  • linear simplification
  • common ground
  • problem detection
  • joint cognitive systems
  • automation as a "team player"
  • "new look"
  • sacrifice judgment
  • task tailoring
  • substitution myth
  • directability
  • directed attention
  • inter-predictability
  • error of the third kind: solving the wrong problem
  • buffering capacity
  • context gap
  • Norbert's contrast
  • anomaly response

Selected publications

Selected talks

John Wreathall

Wreathall is an expert in human performance in safety. He works at the WreathWood Group, a risk and safety studies consultancy.

Wreathall tweets as @wreathall.

Selected publications
