Skip to content

Commit a26abb1

Browse files
committed
finish writing it up
1 parent a476f7e commit a26abb1

File tree

1 file changed

+105
-46
lines changed

1 file changed

+105
-46
lines changed

docs/RFCs/0004-crashtracker-design.md

+105-46
Original file line numberDiff line numberDiff line change
@@ -1,71 +1,102 @@
1-
# RFC 0004: Crashtracker design
1+
# RFC 0004: Crashtracker design
22

33
## Background
44

55
A crashtracker implementation must balance between two conflicting aims.
66
On the one hand, its purpose is to **reliably collect and report the information necessary to diagnose and debug all crashes** in the tracked process.
7-
On the other hand, a crashtracker, like other instrumentation technology, should ideally **cause no observable difference in the execution of the processes where it is installed.** These aims are in tension: design decisions which increase the quality and reliability of data-collection often increase the (potential) for impact on the customer system.
7+
On the other hand, a crashtracker, like other instrumentation technology, should ideally **cause no observable difference in the execution of the processes where it is installed**.
8+
These aims are in tension: design decisions which increase the quality and reliability of data-collection often increase the (potential) for impact on the customer system.
89

910
This document lays out the design priorities of the crashtracker, describes how the current architecture affects/implements those priorities, and then lays out a plan for future improvements to the crashtracking ecosystem.
1011

1112
## Design Priorities
1213

13-
1. When the system is functioning normally, the crashtracker should, to the greatest extent possible, **have no impact on the functioning of a non-crashing system**.
14-
15-
2. When a crash does occur, the crashtracker should **prioritize the reliable collection and reporting of crash information**.
16-
17-
3. When a crash does occur, the crashtracker should **make a best-effort attempt to minimize impact** on the crashing process.
14+
1. When the system is functioning normally, the crashtracker should, to the greatest extent possible, **have no impact on the functioning of a non-crashing system**.
15+
2. When a crash does occur, the crashtracker should **prioritize the reliable collection and reporting of crash information**.
16+
3. When a crash does occur, the crashtracker should **make a best-effort attempt to minimize impact** on the crashing process.
1817

1918
## Architectural model
2019

2120
### Collector
2221

2322
For all languages other than .Net and Java, this is a signal handler.
24-
25-
2623
The collector interacts with the system in the following ways:
2724

2825
1. Registers a signal handler (`sigaction`).
2926
The old handler is saved into a global variable to enable changing.
30-
* Risks to normal operation
31-
* Chaining handlers is not atomic.
27+
- Risks to normal operation
28+
- Chaining handlers is not atomic.
3229
It is possible for concurrent races to occur.
3330
Mitigation: doing this operation once, early during system initialization.
34-
* If the system expects and handles `sigsegv`, `sigbus` and `sigabort` as part of its ordinary functioning, this can violate principle 1\.
31+
- If the system expects and handles `sigsegv`, `sigbus` and `sigabort` as part of its ordinary functioning, this can violate principle 1.
3532
This is not common for most languages, but Java and .Net turn some crashes into recoverable user exceptions.
3633
A crashtracker signal handler risks breaking that mechanism.
3734
Mitigation: not using signal handler based crashtracking for these languages.
38-
* Risks during a crash
39-
* N/A
35+
- Risks during a crash
36+
- If user code installs signal handlers, these can conflict with the crashtracker.
37+
This can occur either intentionally by the user, or unexpectedly as the result of installing a framework which does so.
38+
Mitigation:
39+
If the user installs their handler after the crashtracker is initialized, then the crashtracker will likely not work unless the user was careful to chain signal handlers.
40+
There is no obvious mitigation to take in that case.
41+
There is no known tooling to reliably detect when a signal handler is replaced.
42+
If the user installs their handler before the crashtracker is initialized, then we make a best-effort attempt to chain their handler.
43+
Chaining handlers is not directly supported by POSIX, so a best-effort attempt is the best we can do.
4044
2. When a crash occurs, the signal handler is invoked.
41-
* Risks to normal operation
42-
* N/A
43-
* Risks during a crash
44-
* In some cases (e.g. stack overflow), invoking a signal handler can itself fail.
45-
Mitigation: support the use of `sigaltstack`
45+
- Risks to normal operation
46+
- N/A
47+
- Risks during a crash
48+
- In some cases (e.g. stack overflow), invoking a signal handler can itself fail.
49+
Mitigation: support the use of `sigaltstack`
50+
- The signal handler could crash.
51+
This would prevent the chaining of signal handlers, and could interfere with the users existing debugging workflow.
52+
Mitigation:
53+
Operating inside a crashing process is inherently risky.
54+
We do our best to limit the operations we perform, and to move as much as possible across a process boundary, but there will always be a risk here.
55+
Provide an environment variable to enable customers to disable the crashtracker.
56+
- The signal handler could hang.
57+
This is potentially worse for the customer, since a crashed process will be restarted by the kubernetes controller, while a hanging process will remain alive until reaped.
58+
This consumes resources and reduces the ability of the customer application to handle load.
59+
Mitigation:
60+
Operating inside a crashing process is inherently risky.
61+
We do our best to limit the operations we perform, and to move as much as possible across a process boundary, but there will always be a risk here.
62+
Provide an environment variable to enable customers to disable the crashtracker.
4663
3. Backtrace collected.
47-
* Risks to normal operation
48-
* N/A
49-
* Risks during a crash
50-
* If the stack is corrupted, attempting to collect the backtrace can crash the crashing process.
64+
- Risks to normal operation
65+
- N/A
66+
- Risks during a crash
67+
- If the stack is corrupted, attempting to collect the backtrace can crash the crashing process.
5168
This prevents chaining the old signal handler and can break customer debugging workflows.
52-
Mitigation: provide an environment variable allowing customers to turn off backtrace collection
53-
* If the stack is corrupted, attempting to collect the backtrace can crash the crashing process.
69+
Mitigation: provide an environment variable allowing customers to turn off backtrace collection
70+
- If the stack is corrupted, attempting to collect the backtrace can crash the crashing process.
5471
This will prevent the collection and reporting of any additional crash info.
55-
Mitigation
56-
4. Results written to a pipe (unix socket)
57-
* Risks to normal operation
58-
* Maintaining the open pipe requires a file-descriptor.
59-
* Maintaining a unix socket requires a synchronization point, typically a file in the file system
60-
* Risks during a crash
61-
* If the pipe/socket is not drained, the collector will stall when attempting to write.
72+
Mitigation:
73+
Collect the backtrace last, allowing less risky operations to complete first.
74+
As a stretch mitigation, we could do out-of-process unwinding.
75+
This would increase the complexity of the crashtracker, but would avoid this issue.
76+
4. Results written to a pipe (unix socket)
77+
- Risks to normal operation
78+
- Maintaining the open pipe requires a file-descriptor.
79+
- Maintaining a unix socket requires a synchronization point, typically a file in the file system.
80+
- Risks during a crash
81+
- If the pipe/socket is not drained, the collector will stall when attempting to write.
6282
This could hang the crashing process.
63-
* If the pipe/socket is not available or is closed early, the
64-
5. Files collected from `/proc`
65-
6. **Risks to normal operation**:
66-
* N/A
67-
* Risks during a crash
68-
* \<\>
83+
Mitigation: Minimize the amount of data sent.
84+
- If the pipe/socket is not available or is closed early, data could not be sent, or the crashtracker could hang.
85+
Mitigation:
86+
Data is sent to the receiver in priority order.
87+
If the channel is closed, the receiver will send as much data as possible, as well as a flag indicating that an error occurred.
88+
5. Files collected from `/proc`
89+
- Risks to normal operation
90+
- N/A
91+
- Risks during a crash
92+
- Due to user security controls, these files may not be available to read.
93+
Mitigation: The crashtracker receiver will still transmit partial crash reports.
94+
- The files may be large.
95+
Mitigation:
96+
None currently.
97+
In the future, we could add a file size limit and only send the first `n` bytes.
98+
- These files could contain sensitive data
99+
Mitigation: only send a select set of files which do not contain sensitive data.
69100

70101
### Receiver
71102

@@ -76,12 +107,40 @@ The receiver interacts with the system in the following ways:
76107
1. Forks a new process.
77108
This can either occur eagerly, at process initialization or lazily, as part of crash handling.
78109
Both options have issues.
79-
* Risks to normal operation
80-
* Eager initialization consumes a PID and a slot in the process table.
81-
This was originally believed to be low-risk operation.
82-
It turns out that there are a few subtle parent/child interactions.
83-
* Risks during a crash
84-
* Forking inside a signal handler is unsafe (see Notes on [signal-safety(7) \- Linux manual page](https://man7.org/linux/man-pages/man7/signal-safety.7.html)).
110+
- Risks to normal operation
111+
- Eager initialization consumes a PID and a slot in the process table.
112+
This can cause either initialization to fail, or block the creation of another process on the customer system.
113+
Mitigation: This is unlikely to occur.
114+
- Eager initialization can lead to an additional child process.
115+
This was originally believed to be harmless.
116+
However, it turns out that some frameworks assume that they are the only thing that can spawn workers, and do a `waitpid` on all children.
117+
The existence of an additional child process hangs the framework.
118+
Mitigation: Either do lazy initialization, or use a proper daemonized sidecar.
119+
- Risks during a crash
120+
- Lazy initialization consumes a PID and a slot in the process table.
121+
This can cause either initialization to fail, or block the creation of another process on the customer system.
122+
Mitigation: This is unlikely to occur.
123+
- Forking inside a signal handler is unsafe (see Notes on [signal-safety(7) \- Linux manual page](https://man7.org/linux/man-pages/man7/signal-safety.7.html)).
85124
Mitigation: Do cursed heroics to avoid triggering `at_fork` handlers.
86-
2. Listens for a crashreport a socket/pipe
87-
3. Transmits the message to
125+
2. Listens for a crashreport a socket/pipe.
126+
- Risks to normal operation
127+
- May consume a small amount of resources.
128+
Mitigation: None.
129+
- Risks during a crash
130+
- If the pipe is not drained fast enough, the crashtracker collector will hang.
131+
Mitigation:
132+
Currently, none, other than careful coding to limit work done while draining the socket.
133+
In the future we might wish to play with process niceness to increase the chance of the receiver running and draining the socket.
134+
- The collector might crash or be terminated, truncating the message.
135+
Mitigation: If an unexpected input is received, including EOF, make a best effort attempt to format and send a partial crash report.
136+
3. Transmits the message to the backend.
137+
- Risks to normal operation
138+
- NA
139+
- Risks during a crash
140+
- The endpoint may be inaccessible.
141+
Mitigation: None.
142+
- It may take a signifcant amount of time to transmit the crashreport.
143+
This could cause the user process to hang.
144+
Mitigation: a configurable timeout on transmission (default 3s).
145+
- The message could be modified or corrupted in transit
146+
Mitigation: allow the optional use of `tls`.

0 commit comments

Comments
 (0)