awesome-spaceflight-software-best-practices

This repository contains the data used for a research project on the history of catastrophic spacecraft failures caused by software faults, and on the best practices associated with spacecraft flight software development.

Contents

List of Spacecraft Failures Caused by Software Faults

Mission Year Description and Proximate Cause
Mariner 1 1962 Two radar systems guided the Atlas-Agena launch vehicle which carried the Mariner 1 spacecraft: one measured the rocket’s velocity (the Rate System) while the other measured the distance and angle to a tracking antenna at the launch site (the Track System). The two systems differed in their timing by 43 ms, so the guidance computer, which used their data to relay control signals back to the rocket, was programmed with a 43 ms offset. The offset equation required smoothed velocity data from the Rate System, denoted by an overbar. During transcription of the guidance equations the overbar was omitted, and the omission was not noticed in review. The Rate System hardware failed during launch, so guidance control switched to using data from the Track System, but the incorrect equation indicated that the vehicle was erratically accelerating and decelerating, and correction commands were sent back to the rocket. The rocket was operating nominally, and without these erroneous correction commands would probably have launched successfully. The commands caused the rocket to veer dangerously off course and a range-safety officer triggered the self-destruct command.
Gemini 5 1965 Gemini 5 was a generally successful mission, completing many experiments and orbiting Earth 120 times, breaking the Soviet record for the duration of a crewed space mission. The landing, however, was 89 nautical miles off course, and could have been dangerously so had the navigational error been larger. The omission of part of a timing calculation (the time elapsed from GMT midnight prior to launch), which was used to calculate the position of the spacecraft, caused the actual longitudinal position to deviate from the reported position by 7.89°. The onboard computer then reduced lift to avoid an overshoot, which in actuality resulted in an undershoot of 89 nautical miles.
Proton M71 1971 The Proton rocket failed to ignite for its second insertion burn an hour after launch. It was discovered afterwards that the 8-digit software command for the second ignition had been entered backwards and thus never activated. As with most Soviet space program failures, very little information is available about this anomaly.
Viking-1 1982 NASA's Viking-1 was the first successful Mars lander in history, and the craft operated successfully for over 6 years before a mis-sized command overwrote code that was used by the system to control the pointing direction of the antenna, which caused a permanent loss of communication and the unnecessarily early end of the mission in March 1983.
Phobos 1 1988 A block of unnecessary test code was left in the system’s firmware prior to launch due to the difficulty of removing it and time constraints. The unneeded section of firmware was partitioned from the rest of the software in order to render it inoperative. A couple of months after launch, however, a 20-30 page command sequence sent by ground control was missing a single character at the end of the sequence. The test-bed normally used to catch errors of this kind was out of commission, and under time pressure the operator decided not to wait for it to become available and issued the commands without any testing. The missing character happened to be pivotal to how the spacecraft software interpreted the command: the truncated sequence was precisely the code that would access the partitioned test code and cause it to operate. The test code issued incorrect GN&C commands that disabled the attitude control thrusters, so the solar panels no longer locked onto the sun and the batteries quickly depleted. The spacecraft entered a tumble and contact was lost.
Phobos 2 1989 During a routine imaging session, the Phobos 2 spacecraft disabled its transmitter as planned in order to conserve power. After the imaging session, the probe failed to resume transmission as planned. It is reported that the software failed because a component of the system named ‘e minimal’ was not included in the onboard software - the program was a fault-containment component which would put the probe into a safe mode if voltage dropped below a certain level. The component was viewed as unnecessary but would likely have prevented the failure in this case. A voltage drop, perhaps caused by temporary misalignment of the solar panels, is the likely cause of the initial loss of contact. Through emergency commands, the ground team was able to re-establish contact and record 17 minutes of telemetry from the small antenna, which indicated that the spacecraft was tumbling but was insufficient to support any further corrective action before contact was lost permanently.
Clementine 1994 A series of software errors escaped detection due to inconsistencies between the operational spacecraft system and the verification tool configurations. The software issued erroneous commands to fire the thrusters which exhausted the fuel supply in 11 minutes and put the spacecraft into an unrecoverable spin.
Ariane 5 Flight 501 1996 The Ariane design has multi-level redundancy at the hardware level, with all parts of the Inertial Reference System and main flight control system computer (both hardware and software) duplicated and running in ‘hot’ stand-by mode - i.e. running the same instructions and ready to take over instantaneously. In this case, the redundancy was of no use as both copies of the system contained the same programming error.
An exception occurred when a 64-bit floating-point value was converted into a 16-bit signed integer, which could not hold a value of that magnitude. The calculation was in a redundant software module retained from the reused Ariane 4 flight software, whose purpose was to allow launch parameters to be recalculated if a launch was delayed in the final moments, rather than delaying the launch by a day or more and rerunning the entire launch sequence. The code, which ran satisfactorily on the Ariane 4, produced a much larger value on the Ariane 5 because of that vehicle’s higher horizontal velocities, and the conversion overflowed, causing an operand error. The inertial reference system then reported error-logging data, which the launcher interpreted as flight data and incorporated into flight control calculations. These completely erroneous flight control instructions resulted in large corrective commands for attitude deviations that were in fact spurious, and the massive aerodynamic forces caused by the resulting attitude change disintegrated the vehicle.
The major recommendations from the Ariane 5 failure report are:
- Don’t allow software to run if it is not presently necessary.
- Run realistic tests with full system hardware (or as much as is technically feasible).
- Any sensor that encounters a failure should keep sending best-effort data.
Additionally, it seems clear that the critical flight software was not properly partitioned and protected from interference by this module of non-critical code.
The software module perhaps should not have been included even in the Ariane 4 flight software. It did serve a purpose, but that purpose (saving restart time in the case of a launch delay) may not have justified the additional complexity the feature created. It is an ever-present reality that non-critical features will always seem worth adding at the time, while the accompanying loss of simplicity, maintainability and re-usability is often not fully considered.
Mars Pathfinder 1997 A priority inversion resulted in repeated resets of the system by the watchdog timer. The failure resulted in a moderate loss of mission time.
Titan IV/MILSTAR 1999 A MILSTAR (Military Strategic and Tactical Relay) satellite launched atop a Titan IV rocket and a Centaur upper stage in April 1999. The failure was caused by a mistyped constant value for the roll rate filter in the inertial measurement system file. The value was entered as -0.1992476 instead of -1.992476 (an order of magnitude lower due to the misplaced decimal point). The erroneous value corrupted the calculation of roll rate data and caused the loss of roll axis control. The reaction control system attempted to correct the attitude but quickly exhausted its fuel supply. As a result of the launch failure, the satellite was placed in an unstable orbit far too low (it was intended for geosynchronous orbit) to provide the services it was designed for, and it was permanently shut down after 10 days.
The initial transcription mistake, and then a one-time review that failed to spot the error, resulted in this value becoming the truth baseline for the entire remaining design and launch process. This shows the risk of placing too much emphasis on one-time or early checks of critical software, and the need to implement staged reviews during the entire development and integration process which check the entire toolchain, rather than just recent changes.
Mars Climate Orbiter (MCO) 1999 A ground software file called ‘SM_FORCES’ (Small Forces), used in trajectory calculations that were sent to the spacecraft, used imperial units rather than the metric units (lbf-sec instead of Newton-sec) used in the rest of the software, including the ‘AMD’ (Angular Momentum Desaturation) file that it sent data to. The AMD file was used during the 9-month trip to Mars for propulsion maneuvers to remove angular momentum build-up in the craft’s reaction wheels. The AMD corrections, being calculated with partially incorrect units, resulted in an increasingly inaccurate trajectory model over the course of the journey. The final insertion at Mars was approximately 170 km lower in altitude than expected, far below an altitude the spacecraft could survive.
In more detail on how the units were misinterpreted: the files used in the trajectory calculations for the MCO had the 4.45 lbf-to-Newton conversion factor buried in the equation, with no obvious identification or comments. The new software development team missed the fact that the equations included this conversion factor and replaced them with the new equation for the MCO without adding any accommodation for the conversion.
Mars Polar Lander (MPL) 1999 Despite extensive recommendations from the MCO crash investigation, designed to avoid a software fault causing the subsequent loss of the already-flying MPL mission, a different software error resulted in its destruction during the landing sequence in December 1999, less than three months after the MCO loss. The cause could not be confirmed due to a lack of communication (and the accompanying telemetry) during the landing, but the most likely cause was identified as premature shutdown of the descent engines caused by a touchdown indication misinterpreted from vibrations of the landing legs. Engine shutdown would have occurred at 40 m altitude in this scenario, at a velocity of 13 m/s, increasing to 22 m/s at the surface (due to gravitational acceleration). The touchdown thus likely occurred at around 10 times the nominal design speed and was not survivable. The system did have an accommodation for these spurious signals, requiring persistence across two consecutive sensor readings, but later testing (after the crash) showed that most combinations of parachute deployment, atmospheric turbulence and landing leg deployment vibrations would satisfy this test and thus result in erroneous landing detection.
The investigation revealed that some parts of the code, including the critical section involved in the fault, were worked on by a single developer. The investigators recommended that every code module be worked on by at least two developers. Another important detail revealed by the investigation was that integration tests combining the flight system, propulsion systems and thermal control were not fully completed, and might have caught the error had they been.
Another finding was that the touchdown logic was flawed in that it allowed engine shutdown from an altitude of 40 metres, while radar sensing was disabled; shutdown should probably have been delayed and enabled based on other inputs.
Zenit 3SL 2000 On March 12, 2000 a Zenit-3SL was to launch the ICO F-1 communications satellite into orbit from Sea Launch’s ocean-based launch platform. The ground software sending launch sequence instructions to the launch vehicle was missing a command to close a pneumatic valve on the second stage prior to launch. The unclosed valve led to a loss of helium from the second-stage pressurisation tank and the inability of the vehicle to reach orbital velocity.
Spirit rover 2004 A design error in the file system services module from the COTS (Commercial off-the-shelf) software resulted in a large amount of memory being used to represent the file system structure (containing all files ever created), including deleted files. When the flash memory filled up the rover was unable to operate until careful corrective actions were taken by the ground team over a 2-week period. The failure resulted in a moderate loss of mission time.
CryoSat-1 2005 ESA’s CryoSat-1 launched on a Rokot/Briz-KM rocket combination in October 2005. The flight software was missing a command for second-stage engine cutoff, so the stage continued burning until its fuel was depleted, which prevented the upper stage from separating. The second and upper stages thus remained connected, along with the CryoSat satellite, and descended into the ocean.
Mars Global Surveyor (MGS) 2006 The Mars Global Surveyor (MGS) operated for almost 10 years, including over 8 additional years after the primary mission was completed. Contact was lost in November 2006, just after the 4th mission extension. In June 2006, five months before the loss of contact, an update to the contingency positioning instructions for the spacecraft’s High Gain Antenna was written to an incorrect memory address. This error eventually caused one of the onboard batteries to overheat through overexposure to the sun, and it also precluded any corrective action, because it caused the spacecraft to miscalculate its orientation to Earth and therefore left it unable to communicate with NASA through the Deep Space Network. The fact that one fault both caused a catastrophic failure and made communication (and corrective action) impossible makes the MGS case an important example for safe coding principles. Another relevant note is that the erroneous update was itself part of an effort to correct an operator input error introduced in an update sent to the spacecraft in September 2005.
Key Lessons from MGS:
- Erroneous updates 14 months before the loss of contact, and then corrective updates 5 months before loss of contact (which were also erroneous) finally manifested themselves in November 2006.
- The erroneous memory write both caused a failure in spacecraft hardware and, by causing incorrect alignment, removed any ability to communicate with the spacecraft and correct the original hardware fault.
- Efforts to correct a minor error ended up introducing another error which proved to be much more serious. It must be remembered that if not done carefully, corrective actions can cause even worse problems than what they are trying to correct.
Ekspress-AM4 2011 The Ekspress-AM4 satellite launched atop a Proton rocket and Briz-M upper stage on August 18, 2011. An incorrect entry in the flight software which set the time allowed for the delta rotation prior to the third burn of the upper stage resulted in the pitch axis gimbal ring in the gyroscopic system hitting a hard stop at its gimbal limit. The rocket was then unable to correctly maneuver into its target orbit.
Deep Impact Probe 2013 The Deep Impact Probe successfully completed its mission to study the comet Tempel 1, using an impactor to eject debris and study the composition of the comet. During its extended mission, the spacecraft began to continuously reset itself on August 13, 2013. Mission control tried to address the issue but was unable to do so before the spacecraft lost its orientation, which both made communication impossible and depleted the solar-charged batteries. The fault manifested when the clock counter on the spacecraft overflowed the range of the 32-bit integer holding the timing variable. The system used a starting date (epoch) of January 1, 2000 at 00:00:00 and counted time in 100 ms increments from that epoch, so the clock counter exceeded the limit (2^32, or 4,294,967,296) of the 32-bit integer holding its value 13.6 years after the epoch - i.e. on August 13, 2013. The system was trying to enter safe mode by resetting everything, but this fault containment strategy did not involve resetting the system clock, which continued counting even during fault detection and recovery.
Hitomi (ASTRO-H) 2016 The Hitomi gyroscope-based inertial reference system was misreporting a rotation of 21.7° around its Z-axis after a pointing maneuver. The star tracker would normally be available to cross-check this (and would have correctly reported a stable orientation), but a bias value setting the minimum brightness required for the star tracker to lock onto stars for position and attitude determination had been set too high, so it could not be used; the setting had been chosen to make attitude calculations quicker and maximise observation time. The situation was reported to ground control, and corrective maneuvers were calculated and uploaded to the spacecraft, but an earlier operator error meant that the rotation calculations were incorrect. Specifically, negative position values must be converted to positive before input into a calculation tool, and this was missed by the operator, who was carrying out the procedure for the first time. The uploaded instructions then caused the thrusters to further increase the anomalous rotation speed. The series of errors culminated in the breakup of the spacecraft due to the excessive centrifugal forces.
Exomars Schiaparelli 2016 After the parachute was deployed, the inertial measurement unit reported a larger than expected Z-axis pitch rate and a saturation flag was raised; the turbulence caused by the parachute deployment was the cause of the temporarily higher rate. The GN&C software continued to use this saturated value in its calculations while the craft was actually only oscillating, as the flag was not designed to clear itself after saturation during the EDL phase. The continued integration of this erroneous value caused the attitude estimate of the craft to deviate by ~165° on the Z-axis (i.e. almost upside down). The incorrect orientation estimate caused the system to calculate a faulty altitude - a negative altitude, in fact - which had no plausibility check in the software logic. The completely incorrect parameters propagated further through the landing logic and resulted in the landing thrusters being turned off almost immediately after beginning to fire, running for only 3 seconds instead of 30 seconds. The craft was still at an altitude of 3.7 km, fell freely to the surface, and was destroyed on impact at around 150 m/s.
Important lessons from the Schiaparelli loss are:
- The persistence flag setting was not properly verified during integration and was believed to be 15ms - the craft would probably have landed successfully if the flag did, in fact, persist for only 15ms before taking new measurements again.
- Flawed logic in the GN&C system which continued to integrate the faulty attitude measurements and accumulated an attitude determination almost upside down, which was not plausible given the radar echoes were still being received from the Martian surface during the entire EDL phase.
- Inputs should always be checked for plausibility. Both the attitude (upside down) and the altitude (negative altitude - i.e. below the surface) were not plausible values yet were incorporated into the EDL sequence. Another implausible yet untested part of the landing sequence was the fact that the altitude changed from 3.7km to a negative value, in under 1 second.
Soyuz-2.1b 2017 A Soyuz-2.1b with a Fregat-M upper stage launched with 19 satellites on 28 November 2017. Roscosmos confirmed after a lengthy investigation that the coordinates of the Baikonur Cosmodrome in Kazakhstan had been hardcoded into the launch software for the Fregat upper stage. The launch algorithm operated successfully until a new launch location was opened in Russia (the Vostochny Cosmodrome). The current programmers were unaware of this unidentified reference to the Baikonur coordinates in the flight software and it was never updated or removed. The mistake led to the rocket still attempting to correct its orientation based on false coordinates while the main engine ignited for a preprogrammed burn. The incorrect orientation led to a trajectory which ended in the Atlantic Ocean.
Beresheet 2019 On 11 April 2019 Israel attempted its first moon landing with the Beresheet lander. One of the inertial measurement units reported an error, and during attempts to restart the unit the entire system reset and the main thruster was shut down early, at an altitude of around 150 m. This led to a hard landing and loss of the spacecraft.
Chandrayaan-2 Lander 2019 The Vikram lander started to deviate from its planned trajectory during the descent at about 2 km altitude. Its thrust control algorithms were incorrectly configured, as was the flight-time calculation algorithm. These errors accumulated until the lander's speed and trajectory were significantly off-nominal, and it broke up on impact at a landing speed of ~50 m/s, causing loss of the rover payload.
Boeing Starliner 2020 Boeing’s first orbital test flight of its planned crew capsule was unsuccessful: the spacecraft had to cancel its planned rendezvous and docking with the space station due to a software fault, and it also suffered issues during the landing phase due to additional software defects. The first coding error concerned the Mission Elapsed Timer (MET) of the Starliner, which was incorrectly set to an earlier starting time (17 hours before launch) rather than to the terminal count of the Atlas V launch vehicle. The spacecraft clock was therefore incorrect, flight maneuvers were not conducted at the correct times, and too much fuel was exhausted trying to recover to a stable orbit. The rendezvous and docking with the space station had to be cancelled.
After the MET anomaly, Boeing and NASA teams conducted a full review of the flight software and discovered another error which could also have caused a catastrophic loss of the vehicle during the return to Earth. A list of 61 recommendations was presented by NASA for Boeing to implement before a repeat of the unmanned orbital test flight could occur.
Astra Rocket 3.3 2022 Astra Space's Rocket 3.3 began to tumble due to a wiring error. The upper-stage engine (Aether) had tumble-correction software, but it was unable to correct the tumble because a number of data packets were lost.
Hakuto-R 2023 The guidance software of the Japanese company ispace's Mission 1 lander was not designed to handle a higher-than-expected crater rim near its landing site. This caused an incorrect calculation of the lander's altitude, because the fault detection software rejected the incoming altitude measurements (which were actually correct). No hardware faults were identified.
Luna-25 2023 Roscosmos's Luna-25 Lunar Lander crashed into the lunar surface after a faulty guidance command in the onboard software caused the engine to burn for 127 seconds instead of 84 seconds during a maneuver to lower its orbit in preparation for landing. The onboard computer failed to turn on an accelerometer in the BIUS-L device, which measures the angular velocity of the spacecraft.

Crash Investigation Reports

An Analysis of Causation in Aerospace Accidents

Anomaly Trends for Robotic Missions to Mars

Ariane Flight 501 Report to the Inquiry Board

ESA EXOMARS 2016 Schiaparelli Anomaly Inquiry

Evaluating Accident Models Using Recent Aerospace Accidents

Mars Climate Orbiter Mishap Investigation Board Phase I Report

Mars Global Surveyor Spacecraft Loss of Contact

Phobos at Mars

Report on the Loss of the Mars Polar Lander and Deep Space 2 Missions

Sea Launch - Summary of Investigation and Return-to-Flight Preparations

The Failures of the Mars Climate Orbiter and Mars Polar Lander - A Perspective from the People Involved

The Role of Software in Spacecraft Accidents

What Really Happened on Mars

Software Safety Standards and Coding Guidelines

Key Coding Standards/Guidelines

Standard/Guideline Year Language(s)
DO-178B 1992 General
NASA C 1994 C
MISRA C:1998 1998 C
ESA C and C++ 2000 C/C++
MISRA C:2004 2004 C
NASA Software Safety Guidebook 2004 All (incl. C/C++)
JSF (Joint Strike Fighter) C++ 2005 C++
NASA C++ 2005 C++
Power of Ten 2006 C
MISRA C++ 2008 C++
JPL C 2009 C
DO-178C 2011 General
MISRA C:2012 2012 C
BARR C 2018 C

Embedded C (ISO/IEC TR 18037)

JSF++ (Joint Strike Fighter) C++ Coding Standard

NASA Software Safety Guidebook (GB-8719.13)

NASA Software Safety Standard (STD-8719.13B)

NASA C Coding Standard and Style Guide (SEL-94-003)

NASA C++ Coding Standard and Style Guide

ESA Software Safety Guidebook (ECSS-E-HB-40A)

ESA Software Engineering Standards (PSS-05-01)

ESA C and C++ Coding Standards

ESA Software Dependability and Safety (ECSS-Q-HB-80-03)

ESA Space Engineering Software (ECSS-E-ST-40C)

ESA Testing (ECSS-E-ST-10-03C)

ESA Risk Management (ECSS-M-ST-80C)

ESA Technology Readiness Levels

ESA Electrical, Electronic and Electromechanical Components (ECSS-Q-ST-60C)

ESA Calculation of Radiation and its Effects (ECSS-E-HB-10-12A)

ESA Radiation Hardness Assurance (ECSS-Q-ST-60-15C)

Important rules or guidelines that are common to most of the standards include:

  • Document and justify deviations from the standard: this expectation was established early on in DO-178B, MISRA C:1998 and the ESA C and C++ coding standard, and has carried through to be included in most of the standards since.
  • Use modern, safer language features rather than their older, less safe counterparts: C++ casts rather than C-style casts, C++ smart pointers rather than new/delete or C-style pointers
  • Include default/fall-through clauses for all switch, if...else statements: default for all switch statements, a final else for all if...else statements, a final catch for any unhandled exceptions
  • Include default handlers for exceptions: use std::set_unexpected() and std::set_terminate() to handle errant exceptions
  • Functions should have a single point of exit: as per the overarching industry standard (IEC 61508)
  • Qualify as const any variables that are not modified: i.e. utilise const to increase compile time error-detection
  • Use simple control flow (avoid recursion, avoid use of statements like goto and continue): recursion and jumping logic make program flow harder to reason about, especially for future programmers maintaining the software
  • Avoid dynamic memory allocation/deallocation after initialisation: most memory allocators/deallocators (e.g. malloc, free) have non-deterministic behaviour, and not allocating all memory at initialisation opens up a large class of errors such as forgetting to free memory, over-utilisation of memory etc.
  • typedef the basic numerical types: this makes the signedness and size of the main types immediately clear in the code, rather than relying on implementation-specific assumptions (see the sketch after this list)
  • Objects should be declared at the most limited scope possible: the more limited the scope and lifetime of an object, the lower the chances that it can be improperly accessed
  • Be explicit rather than implicit: examples include implicit conversion to bool in an if statement, using brackets to control the order of operations
  • Use unambiguous typography: identifiers should not differ only by case or by easily misidentified characters such as the letter ‘O’ and the number ‘0’, ‘l’ and ‘1’, ‘i’ and ‘l’, ‘S’ and ‘5’, ‘Z’ and ‘2’, ‘n’ and ‘h’, or ‘B’ and ‘8’.
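
As a concrete illustration, the short C sketch below (not taken from any standard's official examples; all names are invented for this sketch) shows several of these rules applied together: typedef'd fixed-width types, const-qualified inputs, objects declared at the narrowest possible scope, simple control flow with a single point of exit, and a default clause in every switch.

```c
#include <stdint.h>

typedef uint16_t sensor_raw_t;   /* signedness and size explicit in the name */

typedef enum { MODE_IDLE, MODE_CRUISE, MODE_SAFE } flight_mode_t;

/* Convert a raw reading to engineering units; the inputs are const-qualified
 * and the function has a single point of exit. */
int32_t scale_reading(const sensor_raw_t raw, const int32_t gain)
{
    int32_t result;              /* declared at the narrowest scope needed */

    result = (int32_t)raw * gain;

    return result;
}

const char *mode_name(const flight_mode_t mode)
{
    const char *name = "UNKNOWN";

    switch (mode) {
    case MODE_IDLE:   name = "IDLE";   break;
    case MODE_CRUISE: name = "CRUISE"; break;
    case MODE_SAFE:   name = "SAFE";   break;
    default:          break;           /* default clause always present */
    }

    return name;
}
```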

A selection of noteworthy guideline-specific rules are included below:
ESA C and C++ advised that programmers should optimise for clear, maintainable code as a first preference; only optimising for speed, memory usage or compactness when absolutely necessary (Rule 7). ESA also has a very strict subset for use onboard spacecraft which bans the use of exceptions, templates, namespaces, multiple or virtual inheritance and dynamic memory allocation (Rule 124).
MISRA C++: 2008 allows the use of goto in limited circumstances (only for forward jumps, not back, and only in the same function body).
JPL (Jet Propulsion Laboratory) C states as its goals: Reliability; Portability (i.e. not compiler or linker dependent); Maintainability (code should be consistent, readable, simple in design, and easy to debug); Testability (by minimising, in each code module, the code size, complexity and static path count - the number of paths through a piece of code); Reusability; Extensibility; and Readability.
BARR-C:2018 takes a much more serious view of style, considering it a major component of successful error prevention. The standard lists almost 70 style-related guidelines, in sharp contrast to almost all other safety-critical and space-related software standards, which leave stylistic concerns squarely in the domain of organisation and team preference.
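
The forward-only use of goto permitted by MISRA C++:2008 (mentioned above) typically looks like the sketch below, written here in C with hypothetical init/release functions: a single clean-up path for error handling within one function, with no backward jumps.

```c
#include <stdbool.h>

/* Hypothetical subsystem init/release functions. */
extern bool sensor_init(void);
extern bool actuator_init(void);
extern void sensor_release(void);

bool subsystem_start(void)
{
    bool ok = false;

    if (!sensor_init()) {
        goto done;                 /* forward jump only, within one function */
    }
    if (!actuator_init()) {
        sensor_release();          /* undo the earlier initialisation */
        goto done;
    }

    ok = true;

done:
    return ok;
}
```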

Detailed Recommendations

These recommendations are not focused on minor variations in grammar/formatting - the likes of which inspire voluminous and never-ending debates between programmers. Nor are they focused on generally applicable principles such as maintaining architectural integrity and adopting well-worn design patterns. The focus is instead on issues that are directly related to space-related software safety, and that can clearly be linked with safety implications in operational code.

Techniques to deal with software faults can broadly be categorised into three primary areas: prevention, detection, and containment; as further defined by Dvorak (2009):

“Prevention includes strong requirements capture and analysis, architectural principles, model-based engineering, coding standards, and code-structuring rules. Detection includes not only traditional testing but also compliance checkers, static source code analyzers, logic model checking, and test randomization techniques. Containment includes hierarchical backup methods, memory protection techniques, and broad use of software safety margins.”

Each area is important and needs to receive due consideration during design and construction of the software system. Prevention of faults has obvious benefits, rapid and accurate detection may be the difference between having time to save a mission or not, and effective containment can help to reduce the impact of faults and keep the other subsystems running smoothly.

Some of the recommendations below are time- or cost-intensive and are in direct or partial conflict with other considerations. In general, making a system safer (for humans) can reduce its reliability (ability to achieve its tasks) or performance. On the other hand, making a system more complex, or having it encompass more functionality, can make it less safe. Adequate risk management requires weighing these competing interests against cost and schedule considerations and designing a compromise suitable for the scale of the mission and the operational environment (ECSS 2008). If good design has put a system at the Pareto-optimal frontier of safety, given the cost and schedule constraints, safety can only be further improved by impairing another parameter such as cost, speed or flexibility (Dvorak 2009).

At the same time, a holistic view of system safety and fault management will improve overall safety. The only way to increase safety in some circumstances may be to situate it in a different area of control (software/hardware/operator/procedural). An example is provided below in figure 3 from NASA’s Software Safety Guidebook.

Cause Control Example
Hardware Hardware Pressure vessel with pressure relief valve.
Hardware Software Fault detection and safing function; or arm/fire checks which activate or prevent hazardous conditions.
Hardware Operator Operator opens switch to remove power from failed unit.
Software Hardware Hardwired timer or discrete hardware logic to screen invalid commands or data. Sensor directly triggering a safety switch to override a software control system. Hard stops for a robotic arm.
Software Software Two independent processors, one checking the other and intervening if a fault is detected. Emulating expected performance and detecting deviations.
Software Operator Operator sees control parameter violation on display and terminates process.
Operator Hardware Three electrical switches in series in a firing circuit to tolerate two operator errors.
Operator Software Software validation of operator-initiated hazardous command. Software prevents operation in unsafe mode.
Operator Operator Two crew members, one commanding and the other monitoring.

Language Choice

In early space missions, computer instructions were coded by hand in assembly. As the industry developed in the 60s, 70s and 80s, programming was done in the higher-level languages Fortran and Ada, which were still considered highly safe. As other technical industries gained prominence and drew the most talented engineers, the languages used in those industries, including C and C++, became the default for safety-critical implementations, thanks to their increased functionality and large pool of qualified engineers (Leveson 2013). In the present day, finding enough engineers experienced in Fortran or Ada (let alone assembly) would likely be impossible for a new project, and probably not desirable given the level of safety possible with subsets of C and C++ (Rierson 2017).

The other major consideration relevant during language choice is the level of tool support in the specific mission domain, taking into account the relevant hardware.

C has proven the most suitable option in many spacecraft software engineering scenarios, given its ability to closely manipulate the underlying hardware, support for some higher-level concepts, and wide tool support (Bagnara et al. 2018). Nevertheless, these strengths come with several contingent weaknesses:

  • Wide tool support has essentially led to the creation of countless ‘dialects’ of C, where each compiler, each with hundreds of optimisation and other options that can be enabled or disabled, outputs different object code at the end of the build process
  • The ‘trust the programmer’ approach adopted by the C Standard allows almost anything to be done with the language - including highly unsafe actions that most often lead to faults

C++ has seen a faster pace of innovation and introduction of new features with major updates in 1998, 2003 and 2011. The popularity of C++ in many other industries, including gaming and robotics, is mainly thanks to its speed and wide feature-set. Younger aerospace companies like SpaceX or Rocket Lab use C++ for flight software to take advantage of the large number of qualified engineers, and perhaps also due to their higher level of risk tolerance (SpaceX 2020, RocketLab 2020).

Use a Coding Standard

The IEEE (Institute of Electrical and Electronics Engineers) and the IEC (International Electrotechnical Commission) are the two main standards bodies responsible for publishing software safety standards. These standards cover the full software lifecycle and focus more on documentation and process requirements (Lawrence 1995), which is not the primary focus of this paper. Coding standards have been covered above, and are an important safety element of any project. While each company or institution will have its own preferences and internal guidelines, it is important that a defined set of standards exists. Project standards should be:

  • Minimal: don’t add unnecessary, pedantic or confusing guidelines that will be hard for programmers to understand and comply with.
  • Established: using a common, established, fixed guideline (e.g. MISRA, Power of Ten etc.) will increase the proportion of people interacting with the code that will be familiar with the rule set. Also, using an unmodified (or as minimally modified as possible) rule set allows easy machine-checkable analysis by the established tools in the market - changing rules to fit a project may seem valuable at the time, but it may then be time consuming, and often impossible, to perfectly configure static analysis tools for automated checking of compliance with the modified standard.

The coding standard chosen will define the subset of the language that is considered safe for the intended use. Each standard defines its language subset to remove the most error-prone, ambiguous or non-deterministic features of the language. Once a language, coding standard and toolchain are chosen for a project, they must remain fixed to allow unambiguous verification.

The recommended coding rules listed in the previous section will most likely be present in whichever coding standard is chosen, and in general they provide a solid foundation on which to build the project-specific coding guidelines.

Avoiding Complexity

The single most influential factor in avoiding downstream software faults is avoiding complexity. Minimising the complexity of a system, wherever possible, is the highest leverage activity that can be applied early in a development process. A simpler design for a software module can reduce the number of logical paths through the code, and the number of potential faults, by an order of magnitude or more.

NASA recognised the worrying trend of increasing size and complexity in flight software, and commissioned a large-scale study of the issue in 2007, published in 2009. The definition of complexity used in the study was chosen to be intuitive and practical: “how hard something is to understand or verify” (Dvorak 2009). A more specific definition is also referenced: complexity represents the number of variables and their interdependencies, with more of either increasing the overall complexity (Dorner 1997). Moving to an even more specific definition of complexity as it applies to software, the UK Ministry of Defence classifies a system as complex “if its design is unsuitable for the application of exhaustive simulation and test, and therefore its behavior cannot be verified by exhaustive testing” (UK MOD 1999). Dvorak, author of the NASA study, notes that software has become the ‘complexity sponge’ of modern spacecraft, as a seemingly never-ending stream of new features and capabilities is expected to be implemented by new systems. Other trends that have led to growing complexity include increasing autonomy, longer missions (and multiple pre-planned mission extensions) and higher expectations that fault protection code will maintain system operations rather than simply entering a safe mode (Reinholtz in Dvorak 2009).

The level of complexity of the mission itself, as well as the hardware and software of the spacecraft, plays an outsized role in the eventual success of a mission. To put it quite simply - the more complex the mission, the higher the chances of failure or an impairment. This fact seems like common sense but is apparently regularly forgotten as missions become ever-more complex while development and integration costs are expected to remain ‘lean’. In figure 2 below, the inevitability of increasing development cost is obvious as mission complexity increases. This study (Bearden et al. 2012) assessed factors like number of payloads, design life, data recording/uplink/downlink requirements, redundancy requirements etc. to measure overall complexity of the flight systems in each mission. Another clear message from the study is the fact that trying to cut corners or institute a ‘better, faster, cheaper’ approach will increase the chances of mission failure or impairment - almost all of the failures/impairments in the study were developed significantly below the average cost for missions of similar complexity.

[Figure 2: Flight system development cost mapped against a complexity index for several dozen NASA and US Government space missions (Bearden et al. 2012)]

The predominant theme of the study is that earlier is better. Earlier investment in requirements definition, design and testing will drastically reduce cost, complexity and the error rate of operational software. The cost to fix bugs in the latter stages of development (integration, acceptance and operation) can be 1-3 orders of magnitude higher than the cost to fix bugs found in the earlier requirements definition and design phases (NIST 2002; Dabney 2003). Front-loading investment in this fashion, and not hesitating to plan, review and design as much as necessary before starting to code, was a major realisation during the shuttle program and has remained in the collective memory of the industry ever since (Leveson 2013).

The space shuttle program provided the test-bed for many technological advancements in the space industry, and one of the major improvements was a detailed process for converting system requirements into software specifications. Specialised experts served as ‘Requirements Analysts’ whose role it was to work closely with hardware and systems engineers to understand precisely the intent of the requirements, and provide guidance for the ideal wording and implementation options for the software engineering department (Billings et al. 1994). A Requirements Review Board was also established to review all requirements and ensure their suitability and to prioritise their implementation in terms of resource availability. The importance of requirements has become a standard industry view since then. Good requirements should be unambiguous, concise, necessary, traceable and consistent with each other (Rierson 2017).

The complexity study references another ongoing large-scale project at NASA - the cFS (Core Flight System) reference architecture. While every spacecraft and mission is unique in many ways, several onboard services are universally required, including navigation, attitude control, thermal control, uplink, downlink, commanding, telemetry, data management, fault protection and several others. Every time a new mission team was put together there was an effort to take advantage of this fact and re-use software from the most similar previous missions. The problem with this approach is that the previous mission team did not design their software to be re-used, so the new mission requirements demanded an almost total refactoring of the code. The cFS reference architecture is designed to avoid this issue (Edelberg et al. 2018). The basic onboard services are designed with only the essential and minimal capabilities, so they can be adapted for new spacecraft with minimal changes to the core code. Any mission-specific requirements can be added on top of the reference architecture. This approach also allows the most experienced NASA software engineers and architects to build the best practices accumulated over dozens of space missions into the cFS. Developed in-house at first, in a piecemeal fashion, the cFS has matured into a fully extensible architecture framework for space missions (Sukhwani et al. 2016). When the project was made open-source in 2011 it quickly picked up contributors and interest from a large community in the space industry, including internationally. Components of the cFS have been used in numerous space missions.

Many of the recommended coding rules mentioned earlier are designed to minimise non-deterministic behaviour, which also happens to be a major contributor to program complexity (Rierson 2017).

Complexity in flight software, and indeed increasing complexity over time, is inevitable. The key is to minimise ‘incidental complexity’, to borrow a term from NASA’s complexity study. Incidental complexity is any complexity introduced into the system by design choices, rather than by unavoidable and essential mission constraints. Autonomy and artificial intelligence will also be increasingly implemented onboard spacecraft, with ground control often unable to respond due to occultations or round-trip light-time delays (Lutz 2011). Still, any attempt at simplifying code where it is safe to do so will improve its readability, maintainability, re-usability and safety.

Aim for High Cohesion and Low Coupling

Cohesion refers to the interrelatedness of the parts within a particular software component. Coupling refers to how many dependencies a component has on other components. Simple, stable interfaces ensure coupling can remain low. A cohesive but lightly coupled design implies well-thought-out modularity and minimal cyclic dependencies (Dvorak 2009). The 1986 classic ‘Why Do Computers Stop and What Can Be Done About It?’ advocates high-cohesion modularity as the key to maintaining high availability, as any failure is restricted to its module rather than propagating to the entire system, which is what would be likely to occur in a highly coupled design (Gray 1986). NASA also recommends loosely coupled designs to improve portability and extensibility (Lockheed 2005).
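
As a small illustration of a low-coupling boundary in C, the hypothetical header below (all names invented for this sketch) exposes only an opaque handle and a narrow set of functions; callers depend on this small, stable interface while the cohesive internal state stays private to the implementation file.

```c
/* star_tracker.h - sketch of a low-coupling module boundary. */
#ifndef STAR_TRACKER_H
#define STAR_TRACKER_H

#include <stdint.h>
#include <stdbool.h>

typedef struct star_tracker star_tracker_t;   /* opaque: internals hidden */

typedef struct {
    float    q[4];        /* attitude quaternion                     */
    uint32_t timestamp;   /* tick at which the solution was produced */
} attitude_solution_t;

star_tracker_t *star_tracker_create(void);
bool            star_tracker_solve(star_tracker_t *st, attitude_solution_t *out);
void            star_tracker_destroy(star_tracker_t *st);

#endif /* STAR_TRACKER_H */
```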

Fault Management

According to NASA’s Fault Management Handbook (2012), “[Fault Management] encompasses functions that enable an operational system to prevent, detect, isolate, diagnose, and respond to anomalous and failed conditions interfering with intended operations.”

Estimates vary for the number of bugs remaining in software after a rigorous testing process, but all agree some will inevitably remain - from around 1 fault per 100 lines of code down to 1 fault per 1,000 lines of code at best. The nature of complex software is such that extensive reviews can take place, with many different sets of eyes each bringing their own unique experience and skills, and yet some bugs will still make it all the way through unscathed. These errors can also occur in any combination - 10^2 residual errors can co-occur in roughly 10^4 unique pairs (C(100,2) = 4,950), and the problem clearly worsens by orders of magnitude if more than 2 errors combine - making protection against, and testing of, all combinations all but impossible (Holzmann in Dvorak 2009). The loss of the MGS is an example of the combination of several otherwise minor and recoverable errors.

Figure 4 below models the software development process. The numbers serve only to illustrate the percentage of faults introduced and detected in each phase of the development process, rather than referencing any specific piece of software. Many faults are already present after the requirements definition and design phases, and not all will be caught in those or subsequent phases. It is often assumed that all or most software errors are introduced during the coding phase, but this is inaccurate.

[Figure 4: Illustration of defect insertion and removal rates in each stage of a representative software development project (Dvorak 2009)]

The concession that even the most rigorously tested and verified software will still contain residual bugs carries with it a clear message - fault recovery and containment efforts are vital elements in the effort to avoid catastrophic mission failures. In other words - steps must be taken to ensure that a failure of a system component is contained so that it does not cause a failure of the entire system. ESA’s software engineering handbook (ECSS 2013) notes that software failure modes are most prevalent in components enabling boot/initialisation, command, safe mode and remote access - these are therefore considered most critical.

Monitoring for unexpected behaviour, range violations and missed deadlines forms the foundation of a fault detection system (Rasmussen in Dvorak 2009). Once a fault is detected, the system must be able to isolate and recover from it, either autonomously or via an operator after entering a safe state. In general, and especially in hard real-time systems, the speed of a fault management response (i.e. the fault management latency) is part of its relevance - a response that comes too slowly will be partially or completely ineffective (NASA 2012). A fault must be isolated and contained before its effects can propagate into other subsystems.
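
A minimal sketch of such a monitor is shown below, assuming a hypothetical flight executive that provides system_tick_ms() and request_safe_mode(); it illustrates the range and deadline checks described above, not any particular mission's fault protection design.

```c
#include <stdint.h>

#define ALTITUDE_MIN_M        0.0f       /* negative altitude is implausible   */
#define ALTITUDE_MAX_M        500000.0f  /* above this the sensor is suspect   */
#define HEARTBEAT_DEADLINE_MS 200u       /* GN&C task must check in this often */

typedef struct {
    float    altitude_m;        /* latest altimeter reading                */
    uint32_t last_heartbeat_ms; /* tick at which GN&C task last checked in */
} monitor_inputs_t;

/* Provided elsewhere by the (hypothetical) flight executive. */
extern uint32_t system_tick_ms(void);
extern void     request_safe_mode(const char *reason);

void fault_monitor_step(const monitor_inputs_t *in)
{
    /* Range violation: reject implausible sensor values. */
    if (in->altitude_m < ALTITUDE_MIN_M || in->altitude_m > ALTITUDE_MAX_M) {
        request_safe_mode("altitude out of plausible range");
        return;
    }

    /* Missed deadline: the GN&C task has not reported within its budget. */
    if ((system_tick_ms() - in->last_heartbeat_ms) > HEARTBEAT_DEADLINE_MS) {
        request_safe_mode("GN&C heartbeat deadline missed");
    }
}
```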

Rather ironically, fault protection software is itself a potential source of errors and can cause or contribute to failures (e.g. the Deep Impact Probe). A loss of system integrity like this, as bugs are patched, only to introduce new bugs which need to be patched and so on, is dubbed the ‘death spiral’ at NASA (Rasmussen in Dvorak 2009). Adding fault protection code must, therefore, be balanced with a consideration of the level of complexity and the number of faults doing so will likely add to the system.

Redundancy and Margin for Error

NASA’s Fault Management Handbook (NASA 2012) examines complementary but distinct forms of redundancy, the most important of which are:
Identical (hardware): identical copies of hardware components that are at a higher risk of suffering a failure during the mission lifetime (e.g. spare reaction wheels that can be spun up if any of the three primary wheels fail).
Functional: utilisation of different hardware, software or operational procedures which can perform replacement functions of a failed component (see examples below)

It is advisable to consider both identical and functional redundancy in the form of:
Hardware: where possible engineer in hardware redundancy (e.g. 3 processors running all calculations enabling a 2-out-of-3 voting system) and other hardware subsystems, perhaps of a degraded but operational form (e.g. low gain antenna for emergency uplink/downlink if the primary high-gain antenna fails).
Software: maintain margin in both processor and storage allowances to ensure spare capacity for high-load periods and for future updates.
Memory: maintain margins around critical memory locations (Holzmann in Dvorak 2009). Critical areas of memory should be surrounded by a buffer filled with an identifiable but non-executable pattern, so that unintentional overflows or stray memory writes can be detected (a small sketch of such guard regions follows this list).
GN&C: include automatic onboard emergency maneuvers and multiple sources of truth to confirm trajectory, altitude and orientation (e.g. radiometric, gravitational effects, gyros, optical etc.).
Simplified Hierarchical Backup: Holzmann in Dvorak (2009) recommends a novel idea which is a type of expanded safe mode capability. The idea is to implement each critical system component in a much simpler version (order of magnitude less code and perhaps less efficient but simpler algorithms). The simpler version provides only the absolute bare minimum functionality for that component, and due to being drastically smaller and simpler can be tested and verified to a much higher degree, and expected to retain very few residual defects. If a fault occurs the system can switch that component over to its simplified backup software and attempt to retain basic functionality. This technique can be compared to a spare tyre in a car - it provides degraded but functional performance for the vehicle to keep operating until corrective measures can be taken. NASA’s official Software Safety Guidebook also recommends a similar system of redundant architecture for the kernel where a simplified but higher-assurance version can take control if the primary ‘performance’ version is corrupted somehow (O’Connor 2004).
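
The memory-margin item above could look something like the following sketch, assuming a hypothetical fill pattern and a periodic integrity check; a real system would place the guard regions deliberately (e.g. via the linker script) rather than relying on the compiler's placement of these arrays, and would report a violation through its fault-management service.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define GUARD_WORDS   8u
#define GUARD_PATTERN 0xDEADBEEFu

static volatile uint32_t guard_low[GUARD_WORDS];
static uint8_t           critical_buffer[256];   /* memory worth protecting */
static volatile uint32_t guard_high[GUARD_WORDS];

void guards_init(void)
{
    for (size_t i = 0; i < GUARD_WORDS; i++) {
        guard_low[i]  = GUARD_PATTERN;
        guard_high[i] = GUARD_PATTERN;
    }
}

/* Accessor so the rest of the system uses the buffer through one interface. */
uint8_t *critical_buffer_ptr(void)
{
    return critical_buffer;
}

/* Returns false if either guard has been overwritten, indicating a stray
 * write or an overflow into or out of the critical buffer. */
bool guards_intact(void)
{
    for (size_t i = 0; i < GUARD_WORDS; i++) {
        if (guard_low[i] != GUARD_PATTERN || guard_high[i] != GUARD_PATTERN) {
            return false;
        }
    }
    return true;
}
```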

Smart Testing and Verification Regime

Testing of any non-trivial software component or system should be:
Automated: the code, with any changes made that day, should run through automated nightly tests and flag any failures for review the next morning
Comprehensive: testing should include multiple static analysis tools (to counter any potential errors in the tools themselves), all applicable warnings should be enabled (i.e. pedantic settings), any machine-checkable standards rules compliance should be enabled and any project-specific unit and integration tests that have been written.
Creative: include off-nominal stress-tests (also known as ‘fuzz’ testing) such as unexpected inputs, correct but out-of-order commands, fault injection from sensors or other system components, abnormally high workloads etc. Stress testing will help quantify the robustness (ability to withstand stress) and elasticity (ability to ‘degrade gracefully’ and return to normal operation after a stressor is resolved) of the software. While normal testing attempts to verify that the software performs as expected, stress testing specifically attempts to target weak points and ‘break’ the software. Both Lutz (1993) and Leveson (2004) noted a dearth of off-nominal testing as a prevalent factor in the software failures they analysed. A small sketch of such a test follows this list.
Realistic: regular high-fidelity tests should be run in testbeds simulating the operational flight environment with emulators or preferably the actual boxes in a fully integrated testbed setup. Several of the failure investigations researched as part of this report identified limited or unrealistic integrated tests. Hardware-in-the-loop testing is essential and should include as many actual system components as possible, and realistic emulations/simulations where that is not possible.
Independent: As far as possible, verification testing should be conducted by independent parties - that is to say, not the programmers who coded the module, and not someone reporting to the same manager or department head.
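
For illustration, the sketch below shows a tiny off-nominal test harness in C; estimate_altitude() is a hypothetical function under test, and the harness uses a fixed-seed pseudo-random generator so runs are repeatable. The only property it checks is a robustness one: the estimate must never come back negative or non-finite, even for deliberately implausible inputs.

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

/* Hypothetical function under test, linked in from the flight code. */
extern float estimate_altitude(float radar_range_m, float pitch_rad);

static uint32_t lcg_state = 12345u;          /* fixed seed => repeatable runs */

static float next_input(float lo, float hi)
{
    lcg_state = lcg_state * 1664525u + 1013904223u;      /* simple LCG */
    return lo + (hi - lo) * ((float)(lcg_state >> 8) / 16777216.0f);
}

int main(void)
{
    int failures = 0;

    for (int i = 0; i < 100000; i++) {
        /* Deliberately include out-of-range and saturated values. */
        float range = next_input(-1000.0f, 100000.0f);
        float pitch = next_input(-10.0f, 10.0f);
        float alt   = estimate_altitude(range, pitch);

        if (!(alt >= 0.0f) || !isfinite(alt)) {
            printf("FAIL: range=%f pitch=%f alt=%f\n", range, pitch, alt);
            failures++;
        }
    }

    printf("%d robustness failures\n", failures);
    return (failures == 0) ? 0 : 1;
}
```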

Consistent and Formal Communication

Project participants, including software engineers, hardware engineers, safety engineers and project managers need to establish formal and fluid communication channels from the outset and maintain these channels through all project stages. Figure 5, from NASA’s Software Safety Guidebook (2004), illustrates this point. The most important point here is that software safety is not just the role of the software developers writing the code, or of the system safety engineers, or anyone else. Software and system safety is the responsibility of all project participants, and can only be achieved if all disciplines contribute to the end goal of an operational system that is as safe as possible. Most system failures that were investigated in depth have highlighted insufficient or disorganised communication between relevant project departments as a contributor to the failure.

[Figure 5: Flowchart of project disciplines involved in ensuring software safety, and the important communication channels between them (NASA 2004)]

Prioritise Defects According to Safety or Mission Risk

Standards and processes are not aligned across the flight software industry (Shapiro 2006). Fault and defect analysis and classification systems are also highly divergent. Some are exhaustive and detailed like that of Kaner, Falk and Nguyen (1999) and Beizer (2003) with dozens of categories and hundreds of types, and some are less detailed and more descriptive such as Whittaker (2002). A review of these and other software defect taxonomies shows that different approaches are taken in each in terms of how to categorise and characterise defects.

The IEEE (2010) has attempted to design a classification system general enough to be applicable to any stage of the project lifecycle and to operating/database systems, applications, firmware, and embedded software. The system is highly detailed and requires the identification and delineation of 18 attributes for defects and 20 additional attributes for failures (when a defect results in actual incorrect performance). The system is designed to allow prioritisation according to the potential effects and severity of the defect. An effective system of prioritisation of defects allows manpower and resources to flow to the most safety-critical tasks - indeed, even the space shuttle program had some defects that were left in operational code, having been documented and approved as effectively not worth rectifying (Leveson 2013).

Design Flexibility into the System

Maintaining the flexibility and malleability of the software is important both to take advantage of unexpected science opportunities and to aid in fault recovery (Lutz & Mikulski 2004). When an anomaly occurs in either the hardware or the software of a spacecraft, changes to the software or to flight control procedures are usually the only options for corrective action (unless there is redundancy for the affected component). This means that flight software is generally expected to cover for latent or newly emerged faults in all hardware subsystems (Lutz 2011).
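
One common mechanism for preserving this kind of in-flight flexibility is to route behaviour through a table of function pointers that ground commands can re-point at newly uploaded code. The sketch below is a minimal, hypothetical illustration of that pattern, not a depiction of any particular mission's patching system.

```c
/* Minimal sketch of a patchable dispatch table: behaviour is looked up through
 * a function-pointer table that a (hypothetical) ground command can re-point
 * at a replacement routine uploaded after launch. All names are illustrative. */
#include <stdio.h>

typedef int (*handler_fn)(int arg);

static int nominal_pointing(int arg) { return arg; }       /* original routine */
static int patched_pointing(int arg) { return arg + 1; }   /* uploaded patch   */

static handler_fn dispatch_table[] = { nominal_pointing };
#define HANDLER_POINTING 0

/* Invoked in response to a ground command to install a patch at runtime. */
static void install_patch(int slot, handler_fn fn)
{
    if (slot >= 0 && slot < (int)(sizeof dispatch_table / sizeof dispatch_table[0]))
        dispatch_table[slot] = fn;
}

int main(void)
{
    printf("before patch: %d\n", dispatch_table[HANDLER_POINTING](10));
    install_patch(HANDLER_POINTING, patched_pointing);
    printf("after patch:  %d\n", dispatch_table[HANDLER_POINTING](10));
    return 0;
}
```

In a real system the replacement routine would arrive via a memory upload and the re-pointing would itself be range-checked and logged; the sketch only shows the indirection that makes such a patch possible without re-flashing the whole image.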

Employ Formal Methods to Prove Correctness of the Most Critical Software Components

Modern onboard software systems usually run dozens of concurrent threads of execution through millions of logical paths of code. Full test coverage via conventional methods quickly becomes impossible, which necessitates the use of other techniques to validate the most critical parts of the system (Holzmann 2004).

Even with an unlimited budget, however, it would likely be impossible to mathematically prove the correctness of a large, complex code base. What is often feasible is to apply formal methods to small sections of the most critical software components. Formal methods pick up, to some extent, where testing ends: testing cannot guarantee a faultless software design, since it can only confirm the presence of faults, not their absence. Formal methods are aimed at the other side of the coin, namely proving the absence of faults.

Formal verification of software involves converting the specifications/requirements and the code itself into mathematical representations in order to prove correctness through deductive reasoning. The resulting theorems are run through a solver, which confirms or denies their correctness based on logical inference from previous statements. By mathematically checking the correctness of a representation of a code module, or a section of that module, all paths which branch from the proven section can be considered correct (LaRC 2016). This is the source of formal verification's power, and what separates it from traditional testing. Strange and conceptually perplexing bugs can often be uncovered using formal verification tools (Miller et al. 2005).
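
As a flavour of what such a mathematical representation can look like in practice, the sketch below uses contract-style annotations in the ACSL notation understood by tools such as Frama-C: the requires/ensures clauses and loop invariants are the statements a prover reasons over. The function and its bounds are invented for illustration; a real proof effort would of course cover far more of the system.

```c
#include <stdint.h>

/*@ requires 0 <= n <= 1000;
    requires \valid_read(samples + (0 .. n-1));
    requires \forall integer i; 0 <= i < n ==> -1000 <= samples[i] <= 1000;
    assigns \nothing;
    ensures 0 <= \result <= 1000 * n;
*/
int32_t sum_positive(const int32_t *samples, int32_t n)
{
    int32_t total = 0;
    /*@ loop invariant 0 <= i <= n;
        loop invariant 0 <= total <= 1000 * i;
        loop assigns i, total;
        loop variant n - i;
    */
    for (int32_t i = 0; i < n; i++) {
        if (samples[i] > 0) {
            total += samples[i];   /* bounded inputs keep this free of overflow */
        }
    }
    return total;
}
```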

The main drawback of implementing formal methods is the time investment required, both to train staff and to actually carry out formal verification for anything more than a modest system. Codifying requirements into formal semantics and decomposing them to the required level of detail for each subsystem is a lengthy process, usually taking several months at a minimum even for domain experts (e.g. see Brat et al. 2015). Smaller-scale projects like university CubeSats or the space programs of developing nations will likely not have the expertise or funding required. Formal methods cost less and provide a higher return on investment when applied early in the development process, before higher levels of complexity are introduced.

Consider Radiation Effects

As discussed in an earlier section of this report, the effects of radiation damage to electrical components, circuits and processors must be taken into account for all space missions, including those in low-Earth orbit. Processor choice, the level of shielding and the spacecraft design should all be employed to minimise the risk of radiation-induced changes to the software logic.
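
Software-level mitigations are also common: one frequently used pattern is to hold safety-critical variables in triplicate and read them back through a majority vote, so that a single-event upset in one copy is out-voted and scrubbed. The sketch below is a minimal, generic illustration of that idea rather than any particular mission's implementation.

```c
/* Software triple modular redundancy (TMR) for a single 32-bit value:
 * three copies are kept and reads go through a bitwise majority vote,
 * re-writing all copies when a discrepancy is detected. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t copy[3];   /* three redundant copies of the value */
} tmr_u32_t;

static void tmr_write(tmr_u32_t *v, uint32_t value)
{
    v->copy[0] = v->copy[1] = v->copy[2] = value;
}

static uint32_t tmr_read(tmr_u32_t *v)
{
    /* A bit is set in the result if it is set in at least two of the three
     * copies; the copies are then scrubbed back to agreement. */
    uint32_t voted = (v->copy[0] & v->copy[1]) |
                     (v->copy[1] & v->copy[2]) |
                     (v->copy[0] & v->copy[2]);
    tmr_write(v, voted);
    return voted;
}

int main(void)
{
    tmr_u32_t mode;
    tmr_write(&mode, 0x0000000Au);
    mode.copy[1] ^= 0x00000004u;                         /* simulate a bit flip */
    printf("voted value: 0x%08X\n", tmr_read(&mode));    /* still 0x0000000A    */
    return 0;
}
```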

Defensive Design and Programming

The goal of defensive programming is to prevent faults from resulting in a failure (O’Connor 2004). Common approaches include the use of watchdog timers, exception handling and range checks on data (Lutz 1993). Separating or isolating critical code, through space and/or time partitioning or by using separate processors/microcontrollers (Feller et al. 2013), can also avoid failures like those of Phobos 1 and Ariane 501, where noncritical or unnecessary code interfered with critical code operations.

A well-modularised defensive design, combined with solid fault management, can in practice result in a system where each individual software component is both self-protecting and self-checking: self-protecting in the sense that it is resistant to incorrect or potentially dangerous input from calling components, and self-checking in the sense that it regularly checks its own operations for faults and attempts to correct them or otherwise enter a safe state.
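
The sketch below illustrates that pattern in miniature: a command routine that range-checks its input before acting, and a periodic self-check that only services the watchdog while internal state remains consistent. All names and limits are invented for illustration and the watchdog is stubbed out.

```c
#include <stdbool.h>
#include <stdint.h>

#define DUTY_MAX 1000u   /* maximum legal thruster duty command */

static uint32_t commanded_duty = 0;

/* Self-protecting: rejects dangerous input from callers instead of trusting it. */
bool thruster_set_duty(uint32_t duty)
{
    if (duty > DUTY_MAX) {
        /* log_event(EVT_RANGE_VIOLATION); -- hypothetical telemetry hook */
        return false;               /* refuse rather than propagate the fault */
    }
    commanded_duty = duty;
    return true;
}

/* On real hardware this would write to the watchdog register; stubbed here. */
static void watchdog_kick(void) { }

/* Self-checking: called from the main control loop; the watchdog is only
 * stroked while internal invariants hold, so corrupted state leads to a
 * reset into a safe mode rather than continued faulty operation. */
void periodic_self_check(void)
{
    if (commanded_duty <= DUTY_MAX) {
        watchdog_kick();
    }
    /* else: withhold the kick and let the watchdog force a processor reset */
}

int main(void)
{
    (void)thruster_set_duty(1500u);   /* rejected: out of range */
    (void)thruster_set_duty(500u);    /* accepted               */
    periodic_self_check();
    return 0;
}
```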

Constant Iteration and Process Improvement

NASA and its main subcontractors in the 1960s, 70s and 80s (principally IBM, Draper Laboratory and Rockwell International) pioneered many of the design and process principles now considered indispensable in safety-critical software engineering. The crewed missions before the shuttle program (Mercury, Gemini, Apollo, and Skylab) were increasingly reliant on software, and the software engineering department became an ever more influential and crucial part of the space program (Leveson 2013). This process culminated in the space shuttle program, which was characterised by the disciplined adoption of project management techniques and the systematic handling of errors (Billing et al. 1994). A formalised system was developed for error analysis and correction which involved four steps:

  1. Correct the error
  2. Locate and correct the cause(s) of the error
  3. Locate and update aspects of the process that will ensure detection of similar errors in the future
  4. Check for similar errors in the code base and correct them

The shuttle program is considered one of the most successful software projects in history and established the state-of-the-art, best-practice software engineering techniques used in most space programs today. Its success was not just due to technical processes, but also to a strong level of camaraderie and professionalism amongst the team and a dogged striving for a zero-error system (Leveson 2013).

Setting up feedback loops, so that lessons learned in the later stages of development are used to refine the process, will result in a progressively lower error rate and a consistently smoother development lifecycle (Keller 1997).

Bi-Directional Mapping from System Requirements to Software Requirements to Code to Tests

It should be possible to directly trace every requirement to the code that implements it, and every line of code back to the requirement it implements. This rule also implies that the requirements should be precisely and fully implemented in the code, and that there is no code or functionality that is not directly prescribed by the requirements. The same applies to the mapping between system requirements and software requirements, and between code and tests. The image below illustrates this principle:

Figure: Flowchart illustrating bi-directional and direct mapping between each part of the development chain (Rierson 2017)
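
One lightweight way of keeping part of that mapping explicit in the code itself is to tag each implementing function and each test with the requirement identifiers it traces to, so that tooling can flag untested requirements or untraceable code. The requirement IDs and functions below are invented for illustration.

```c
#include <assert.h>

/* Implements SWR-042: commanded attitude rates shall be limited to +/-2 deg/s. */
static double clamp_rate(double rate_dps)
{
    if (rate_dps >  2.0) return  2.0;
    if (rate_dps < -2.0) return -2.0;
    return rate_dps;
}

/* Verifies SWR-042 (traces to system requirement SYS-017). */
static void test_clamp_rate_swr_042(void)
{
    assert(clamp_rate( 5.0) ==  2.0);
    assert(clamp_rate(-5.0) == -2.0);
    assert(clamp_rate( 1.0) ==  1.0);
}

int main(void)
{
    test_clamp_rate_swr_042();
    return 0;
}
```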
