You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A crashtracker implementation must balance between two conflicting aims.
6
6
On the one hand, its purpose is to **reliably collect and report the information necessary to diagnose and debug all crashes** in the tracked process.
7
-
On the other hand, a crashtracker, like other instrumentation technology, should ideally **cause no observable difference in the execution of the processes where it is installed.** These aims are in tension: design decisions which increase the quality and reliability of data-collection often increase the (potential) for impact on the customer system.
7
+
On the other hand, a crashtracker, like other instrumentation technology, should ideally **cause no observable difference in the execution of the processes where it is installed**.
8
+
These aims are in tension: design decisions which increase the quality and reliability of data-collection often increase the (potential) for impact on the customer system.
8
9
9
10
This document lays out the design priorities of the crashtracker, describes how the current architecture affects/implements those priorities, and then lays out a plan for future improvements to the crashtracking ecosystem.
10
11
11
12
## Design Priorities
12
13
13
-
1. When the system is functioning normally, the crashtracker should, to the greatest extent possible, **have no impact on the functioning of a non-crashing system**.
14
-
15
-
2. When a crash does occur, the crashtracker should **prioritize the reliable collection and reporting of crash information**.
16
-
17
-
3. When a crash does occur, the crashtracker should **make a best-effort attempt to minimize impact** on the crashing process.
14
+
1. When the system is functioning normally, the crashtracker should, to the greatest extent possible, **have no impact on the functioning of a non-crashing system**.
15
+
2. When a crash does occur, the crashtracker should **prioritize the reliable collection and reporting of crash information**.
16
+
3. When a crash does occur, the crashtracker should **make a best-effort attempt to minimize impact** on the crashing process.
18
17
19
18
## Architectural model
20
19
21
20
### Collector
22
21
23
22
For all languages other than .Net and Java, this is a signal handler.
24
-
25
-
26
23
The collector interacts with the system in the following ways:
27
24
28
25
1. Registers a signal handler (`sigaction`).
29
26
The old handler is saved into a global variable to enable changing.
30
-
* Risks to normal operation
31
-
* Chaining handlers is not atomic.
27
+
- Risks to normal operation
28
+
- Chaining handlers is not atomic.
32
29
It is possible for concurrent races to occur.
33
30
Mitigation: doing this operation once, early during system initialization.
34
-
* If the system expects and handles `sigsegv`, `sigbus` and `sigabort` as part of its ordinary functioning, this can violate principle 1\.
31
+
- If the system expects and handles `sigsegv`, `sigbus` and `sigabort` as part of its ordinary functioning, this can violate principle 1.
35
32
This is not common for most languages, but Java and .Net turn some crashes into recoverable user exceptions.
36
33
A crashtracker signal handler risks breaking that mechanism.
37
34
Mitigation: not using signal handler based crashtracking for these languages.
38
-
* Risks during a crash
39
-
* N/A
35
+
- Risks during a crash
36
+
- If user code installs signal handlers, these can conflict with the crashtracker.
37
+
This can occur either intentionally by the user, or unexpectedly as the result of installing a framework which does so.
38
+
Mitigation:
39
+
If the user installs their handler after the crashtracker is initialized, then the crashtracker will likely not work unless the user was careful to chain signal handlers.
40
+
There is no obvious mitigation to take in that case.
41
+
There is no known tooling to reliably detect when a signal handler is replaced.
42
+
If the user installs their handler before the crashtracker is initialized, then we make a best-effort attempt to chain their handler.
43
+
Chaining handlers is not directly supported by POSIX, so a best-effort attempt is the best we can do.
40
44
2. When a crash occurs, the signal handler is invoked.
41
-
* Risks to normal operation
42
-
* N/A
43
-
* Risks during a crash
44
-
* In some cases (e.g. stack overflow), invoking a signal handler can itself fail.
45
-
Mitigation: support the use of `sigaltstack`
45
+
- Risks to normal operation
46
+
- N/A
47
+
- Risks during a crash
48
+
- In some cases (e.g. stack overflow), invoking a signal handler can itself fail.
49
+
Mitigation: support the use of `sigaltstack`
50
+
- The signal handler could crash.
51
+
This would prevent the chaining of signal handlers, and could interfere with the users existing debugging workflow.
52
+
Mitigation:
53
+
Operating inside a crashing process is inherently risky.
54
+
We do our best to limit the operations we perform, and to move as much as possible across a process boundary, but there will always be a risk here.
55
+
Provide an environment variable to enable customers to disable the crashtracker.
56
+
- The signal handler could hang.
57
+
This is potentially worse for the customer, since a crashed process will be restarted by the kubernetes controller, while a hanging process will remain alive until reaped.
58
+
This consumes resources and reduces the ability of the customer application to handle load.
59
+
Mitigation:
60
+
Operating inside a crashing process is inherently risky.
61
+
We do our best to limit the operations we perform, and to move as much as possible across a process boundary, but there will always be a risk here.
62
+
Provide an environment variable to enable customers to disable the crashtracker.
46
63
3. Backtrace collected.
47
-
* Risks to normal operation
48
-
* N/A
49
-
* Risks during a crash
50
-
* If the stack is corrupted, attempting to collect the backtrace can crash the crashing process.
64
+
- Risks to normal operation
65
+
- N/A
66
+
- Risks during a crash
67
+
- If the stack is corrupted, attempting to collect the backtrace can crash the crashing process.
51
68
This prevents chaining the old signal handler and can break customer debugging workflows.
52
-
Mitigation: provide an environment variable allowing customers to turn off backtrace collection
53
-
* If the stack is corrupted, attempting to collect the backtrace can crash the crashing process.
69
+
Mitigation: provide an environment variable allowing customers to turn off backtrace collection
70
+
- If the stack is corrupted, attempting to collect the backtrace can crash the crashing process.
54
71
This will prevent the collection and reporting of any additional crash info.
55
-
Mitigation
56
-
4. Results written to a pipe (unix socket)
57
-
* Risks to normal operation
58
-
* Maintaining the open pipe requires a file-descriptor.
59
-
* Maintaining a unix socket requires a synchronization point, typically a file in the file system
60
-
* Risks during a crash
61
-
* If the pipe/socket is not drained, the collector will stall when attempting to write.
72
+
Mitigation:
73
+
Collect the backtrace last, allowing less risky operations to complete first.
74
+
As a stretch mitigation, we could do out-of-process unwinding.
75
+
This would increase the complexity of the crashtracker, but would avoid this issue.
76
+
4. Results written to a pipe (unix socket)
77
+
- Risks to normal operation
78
+
- Maintaining the open pipe requires a file-descriptor.
79
+
- Maintaining a unix socket requires a synchronization point, typically a file in the file system.
80
+
- Risks during a crash
81
+
- If the pipe/socket is not drained, the collector will stall when attempting to write.
62
82
This could hang the crashing process.
63
-
* If the pipe/socket is not available or is closed early, the
64
-
5. Files collected from `/proc`
65
-
6.**Risks to normal operation**:
66
-
* N/A
67
-
* Risks during a crash
68
-
*\<\>
83
+
Mitigation: Minimize the amount of data sent.
84
+
- If the pipe/socket is not available or is closed early, data could not be sent, or the crashtracker could hang.
85
+
Mitigation:
86
+
Data is sent to the receiver in priority order.
87
+
If the channel is closed, the receiver will send as much data as possible, as well as a flag indicating that an error occurred.
88
+
5. Files collected from `/proc`
89
+
- Risks to normal operation
90
+
- N/A
91
+
- Risks during a crash
92
+
- Due to user security controls, these files may not be available to read.
93
+
Mitigation: The crashtracker receiver will still transmit partial crash reports.
94
+
- The files may be large.
95
+
Mitigation:
96
+
None currently.
97
+
In the future, we could add a file size limit and only send the first `n` bytes.
98
+
- These files could contain sensitive data
99
+
Mitigation: only send a select set of files which do not contain sensitive data.
69
100
70
101
### Receiver
71
102
@@ -76,12 +107,40 @@ The receiver interacts with the system in the following ways:
76
107
1. Forks a new process.
77
108
This can either occur eagerly, at process initialization or lazily, as part of crash handling.
78
109
Both options have issues.
79
-
* Risks to normal operation
80
-
* Eager initialization consumes a PID and a slot in the process table.
81
-
This was originally believed to be low-risk operation.
82
-
It turns out that there are a few subtle parent/child interactions.
83
-
* Risks during a crash
84
-
* Forking inside a signal handler is unsafe (see Notes on [signal-safety(7) \- Linux manual page](https://man7.org/linux/man-pages/man7/signal-safety.7.html)).
110
+
- Risks to normal operation
111
+
- Eager initialization consumes a PID and a slot in the process table.
112
+
This can cause either initialization to fail, or block the creation of another process on the customer system.
113
+
Mitigation: This is unlikely to occur.
114
+
- Eager initialization can lead to an additional child process.
115
+
This was originally believed to be harmless.
116
+
However, it turns out that some frameworks assume that they are the only thing that can spawn workers, and do a `waitpid` on all children.
117
+
The existence of an additional child process hangs the framework.
118
+
Mitigation: Either do lazy initialization, or use a proper daemonized sidecar.
119
+
- Risks during a crash
120
+
- Lazy initialization consumes a PID and a slot in the process table.
121
+
This can cause either initialization to fail, or block the creation of another process on the customer system.
122
+
Mitigation: This is unlikely to occur.
123
+
- Forking inside a signal handler is unsafe (see Notes on [signal-safety(7) \- Linux manual page](https://man7.org/linux/man-pages/man7/signal-safety.7.html)).
85
124
Mitigation: Do cursed heroics to avoid triggering `at_fork` handlers.
86
-
2. Listens for a crashreport a socket/pipe
87
-
3. Transmits the message to
125
+
2. Listens for a crashreport a socket/pipe.
126
+
- Risks to normal operation
127
+
- May consume a small amount of resources.
128
+
Mitigation: None.
129
+
- Risks during a crash
130
+
- If the pipe is not drained fast enough, the crashtracker collector will hang.
131
+
Mitigation:
132
+
Currently, none, other than careful coding to limit work done while draining the socket.
133
+
In the future we might wish to play with process niceness to increase the chance of the receiver running and draining the socket.
134
+
- The collector might crash or be terminated, truncating the message.
135
+
Mitigation: If an unexpected input is received, including EOF, make a best effort attempt to format and send a partial crash report.
136
+
3. Transmits the message to the backend.
137
+
- Risks to normal operation
138
+
- NA
139
+
- Risks during a crash
140
+
- The endpoint may be inaccessible.
141
+
Mitigation: None.
142
+
- It may take a signifcant amount of time to transmit the crashreport.
143
+
This could cause the user process to hang.
144
+
Mitigation: a configurable timeout on transmission (default 3s).
145
+
- The message could be modified or corrupted in transit
0 commit comments