...
        delayed ACKs, reordered or lost packets etc.
- [x] Keep some per-flow state
  - Will likely be needed for the sampling
  - [x] Could potentially include keeping track of average RTT, which
        may be useful for some decisions (ex. how often to sample,
        when an entry can be removed etc.)
  - [x] Could potentially include keeping track of minimum RTT (as
...
    unnecessarily large, which slows down the cleaning and may block
    new entries
- [ ] Use libxdp to load XDP program

## Done
- [x] Clean up commits and add signed-off-by tags
...
    so that tools such as [ppviz](https://github.com/pollere/ppviz)
    work for both pping implementations.
- [x] Add timestamps to output (as original pping)
- [x] Add support for other hooks
  - TC-BPF on ingress instead of XDP

# Potential issues
## Limited information in different output formats
The ppviz format is a bit limited in what information it can
include. One of these limitations is that it does not include any
protocol information, as it was designed with only TCP in mind. When
using PPing with protocols other than TCP, it may therefore not be
possible to distinguish flows with different protocols. PPing will
therefore emit a warning if attempting to use the ppviz format with
protocols other than TCP, but will still allow it.

Another piece of information tracked by PPing which can't be included
in the ppviz format is whether the calculated RTT includes the local
processing delay or not (that is, whether the packet was timestamped
on ingress and matched on egress instead of being timestamped on
egress and matched on ingress). Currently this information is only
included in the JSON format, but could potentially be added to the
standard format if deemed important.

## Cannot detect end of ICMP "flow"
ICMP is not a flow-based protocol, so there is no signaling that an
ICMP "flow" is about to close. Consequently, there is no way for
PPing to automatically detect that an ICMP flow has stopped and
delete its flow-state entry (and send a timely flow closing event).

A consequence of this is that ICMP flow entries will stick around
and occupy space in the flow state map until they are cleaned out by
the periodic cleanup process. The current timeout for inactive flows
is a very generous 5 minutes, which means a lot of short ICMP flows
could quickly fill up the flow map and potentially block other flows
for a long while.
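
As a rough illustration, the periodic cleanup could be a userspace
pass over the flow map like the sketch below. The `flow_key` and
`flow_info` layouts and the `last_seen` field are assumptions made
for illustration; only the 5-minute timeout and the existence of a
periodic cleanup process come from the text above.

```c
/* Minimal sketch of a periodic userspace cleanup pass, assuming each
 * flow-state entry stores a last_seen timestamp (ns, CLOCK_MONOTONIC).
 * Struct layouts and field names are illustrative. */
#include <bpf/bpf.h>
#include <linux/types.h>
#include <time.h>

#define FLOW_TIMEOUT_NS (300ULL * 1000000000ULL) /* 5 min inactivity */

struct flow_key {        /* illustrative flow tuple */
	__u32 saddr, daddr;
	__u16 sport, dport;
	__u8 proto;
};

struct flow_info {       /* illustrative flow state */
	__u64 last_seen;
};

static void clean_flow_map(int map_fd)
{
	struct flow_key cur, next, *prev = NULL;
	struct flow_info info;
	struct timespec ts;
	__u64 now;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	now = (__u64)ts.tv_sec * 1000000000ULL + ts.tv_nsec;

	/* Walk all entries, deleting those that have been idle too long */
	while (bpf_map_get_next_key(map_fd, prev, &next) == 0) {
		if (bpf_map_lookup_elem(map_fd, &next, &info) == 0 &&
		    now - info.last_seen > FLOW_TIMEOUT_NS)
			bpf_map_delete_elem(map_fd, &next);
		cur = next;
		prev = &cur;
	}
}
```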

## RTT-based sampling
The RTT-based sampling feature means that timestamp entries may only
be created at an interval proportional to the flow's RTT. This allows
flows with shorter RTTs to get more frequent RTT samples than flows
with long RTTs. However, as the flow's RTT can only be updated based
on the calculated RTT samples, this creates a situation where the
RTT's update rate depends on itself: flows with short RTTs will
update the RTT more often, which in turn affects how often they can
update it.
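
A minimal sketch of such a sampling gate on egress, assuming the flow
state carries a smoothed RTT (`srtt` is an illustrative field name;
`last_timestamp` is discussed under "Rate-limiting new timestamps"
below):

```c
/* Sketch of an RTT-proportional sampling gate in the egress program.
 * srtt and DEFAULT_INTERVAL_NS are illustrative assumptions. */
#define DEFAULT_INTERVAL_NS (100 * 1000 * 1000ULL) /* before 1st sample */

static __always_inline int allow_new_timestamp(struct flow_state *f,
					       __u64 now)
{
	/* Sample at most once per (approximate) RTT */
	__u64 interval = f->srtt ? f->srtt : DEFAULT_INTERVAL_NS;

	if (now - f->last_timestamp < interval)
		return 0;
	f->last_timestamp = now; /* non-atomic; see "Concurrency issues" */
	return 1;
}
```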

This feedback loop mainly becomes problematic if the sampling rate is
based on the sRTT, which may grow. In that case the sRTT will
generally be prone to growing faster than it shrinks: if it starts
out low it will quickly be updated to higher RTTs, but once high it
will take longer to decrease to a lower RTT again, because samples
then arrive more rarely.
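
For instance, with a classic EWMA of the kind TCP uses for its sRTT
(gain 1/8; the exact formula used here is an assumption), a flow at
sRTT 10 ms receiving a 100 ms sample jumps to ~21 ms in one update,
while a flow at 100 ms receiving a 10 ms sample only drops to ~89 ms,
and the latter flow also gets updates ten times less often since its
sampling interval scales with its sRTT:

```c
/* Sketch: EWMA sRTT update with gain 1/8 (assumed, RFC 6298-style).
 * Starting low, srtt climbs quickly; starting high, it decays slowly,
 * and high-sRTT flows also sample (and thus update) less often. */
static __always_inline __u64 update_srtt(__u64 srtt, __u64 rtt)
{
	if (!srtt)
		return rtt;                     /* first sample */
	return srtt - (srtt >> 3) + (rtt >> 3); /* 7/8*srtt + 1/8*rtt */
}
```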

## Concurrency issues

The program uses "global" (not `PERCPU`) hash maps to keep state. As
the BPF programs need to see the global view to function properly,
using `PERCPU` maps is not an option. The program must be able to
match against stored packet timestamps regardless of the CPU the
packets are processed on, and must also have a global view of the
flow state in order for the sampling to work correctly.
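
In libbpf's BTF-style map syntax, the shared maps would be declared
roughly as below. The map names `packet_ts` and `flow_state` come
from this document; the key/value types and sizes are illustrative
assumptions:

```c
#include <bpf/bpf_helpers.h>

/* Sketch: global (cross-CPU) hash maps. BPF_MAP_TYPE_PERCPU_HASH
 * would be cheaper, but would break timestamp matching and sampling,
 * since a packet may be timestamped on one CPU and matched on
 * another. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);        /* illustrative size */
	__type(key, struct packet_id);     /* flow tuple + identifier */
	__type(value, __u64);              /* timestamp (ns) */
} packet_ts SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct network_tuple); /* illustrative key type */
	__type(value, struct flow_state);
} flow_state SEC(".maps");
```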

As the BPF programs may run concurrently on different CPU cores
accessing these global hash maps, this may result in some concurrency
issues. In practice, I do not believe these will occur particularly
often, as the hash-map entries are per-flow, and I'm under the
impression that packets from the same flow will typically be
processed by the same CPU. Furthermore, most of the concurrency
issues will not be that problematic even if they do occur. For now,
I've therefore left these concurrency issues unattended, even if some
of them could be avoided with atomic operations and/or spinlocks, in
order to keep things simple and not hurt performance.

The (known) potential concurrency issues are:

### Tracking last seen identifier
The tc/egress program keeps track of the last seen outgoing
identifier for each flow by storing it in the `flow_state` map. This
is done to detect the first packet with a new identifier. If multiple
packets are processed concurrently, several of them could potentially
detect themselves as being first with the same identifier (which only
matters if they also pass the rate-limit check). Alternatively, if
the concurrent packets have different identifiers there may be a lost
update (although for TCP timestamps, concurrent packets would
typically be expected to have the same timestamp).

A possibly more severe issue is out-of-order packets. If a packet
with an old identifier arrives out of order, that identifier could be
detected as a new identifier. Consider, for example, the following
sequence of four packets with just two different identifiers (id1 and
id2):

id1 -> id2 -> id1 -> id2

The tc/egress program would then consider each of these packets to
have a new identifier and try to create a new timestamp for each of
them, if the sampling strategy allows it. However, even if the
sampling strategy allows it, the (incorrect) creation of timestamps
for id1 and id2 the second time would only succeed if the first
timestamps for id1 and id2 have already been matched against (and
thus deleted). Even if that is the case, they would only result in
reporting an incorrect RTT if there are also new matches against
these identifiers.

This issue could be avoided entirely by requiring that new-id >
old-id instead of simply checking that new-id != old-id, as TCP
timestamps should increase monotonically. That may however not be a
suitable solution for other types of identifiers.
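
For TCP timestamps specifically, such a check needs to be
wraparound-safe, since the 32-bit timestamp value can roll over. A
sketch in the style of the kernel's `before()`/`after()` macros:

```c
/* Sketch: wraparound-safe "strictly newer" check for 32-bit TCP
 * timestamps. Using this instead of new_id != old_id means an
 * out-of-order packet carrying an old timestamp is never mistaken
 * for a new identifier. */
static __always_inline int ts_after(__u32 new_id, __u32 old_id)
{
	return (__s32)(new_id - old_id) > 0;
}
```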

### Rate-limiting new timestamps
In the tc/egress program, packets to timestamp are sampled by using a
per-flow rate-limit, which is enforced by storing when the last
timestamp was created in the `flow_state` map. If multiple packets
perform this check concurrently, it's possible that several of them
think they are allowed to create timestamps before any of them is
able to update `last_timestamp`. When they do update
`last_timestamp` it might also end up slightly incorrect, although if
they are processed concurrently they should also generate very
similar timestamps.

If the packets have different identifiers (which would typically not
be expected for concurrent TCP timestamps), then this would allow
some packets to bypass the rate-limit. By bypassing the rate-limit,
the flow would use up some additional map space and report a few more
RTTs than expected (however, the reported RTTs should still be
correct).

If the packets have the same identifier, they must first have managed
to bypass the previous check for unique identifiers (see [previous
point](#tracking-last-seen-identifier)), and only one of them will be
able to successfully store a timestamp entry.
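
If this race ever needed closing, the check and the update could be
collapsed into a single compare-and-swap on `last_timestamp`, so that
only one of the concurrent packets wins the slot. A sketch, assuming
BPF atomics support (clang 12+ and kernel 5.12+ for BPF_CMPXCHG):

```c
/* Sketch: atomically claim the right to create a new timestamp.
 * Only one concurrent packet can swap last -> now successfully. */
static __always_inline int claim_timestamp_slot(struct flow_state *f,
						__u64 now, __u64 interval)
{
	__u64 last = f->last_timestamp;

	if (now - last < interval)
		return 0;
	return __sync_val_compare_and_swap(&f->last_timestamp,
					   last, now) == last;
}
```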

### Matching against stored timestamps
The XDP/ingress program could potentially match multiple concurrent
packets with the same identifier against a single timestamp entry in
`packet_ts` before any of them manages to delete the entry. This
would result in multiple RTTs being reported for the same identifier,
but if the packets are processed concurrently these RTTs should be
very similar, so this would mainly result in over-reporting rather
than in reporting incorrect RTTs.
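
One way to bound this to a single report would be to treat the map
delete itself as the claim: `bpf_map_delete_elem()` only returns 0
for the caller that actually removed the entry. A sketch, where
`report_rtt()` is a hypothetical helper standing in for whatever
pushes the RTT to userspace:

```c
/* Sketch: report an RTT only if this CPU is the one that deleted the
 * timestamp entry; concurrent matchers get -ENOENT and stay silent. */
static __always_inline void match_and_report(struct packet_id *pid,
					     __u64 now)
{
	__u64 *tstamp = bpf_map_lookup_elem(&packet_ts, pid);

	if (!tstamp)
		return;

	__u64 rtt = now - *tstamp;

	if (bpf_map_delete_elem(&packet_ts, pid) == 0)
		report_rtt(pid, rtt); /* hypothetical reporting helper */
}
```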

### Updating flow statistics
Both the tc/egress and XDP/ingress programs will try to update some
flow statistics each time they successfully parse a packet with an
identifier. Specifically, they'll update the number of packets and
bytes sent/received. This is not done in an atomic fashion, so there
could potentially be some lost updates resulting in an underestimate.
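
If the lost updates ever mattered, the counters could use BPF's
atomic-add builtin (supported for map values since the early BPF_XADD
days); the field names here are illustrative:

```c
/* Sketch: race-free per-flow counters via atomic add (BPF_XADD) */
static __always_inline void count_packet(struct flow_state *f,
					 __u32 bytes)
{
	__sync_fetch_and_add(&f->pkts, 1);
	__sync_fetch_and_add(&f->bytes, bytes);
}
```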

Furthermore, whenever the XDP/ingress program calculates an RTT, it
will check if this is the lowest RTT seen so far for the flow. If
multiple RTTs are calculated concurrently, several could pass this
check at the same time and there may be a lost update. It should only
be possible for multiple RTTs to be calculated concurrently in case
either the [timestamp rate-limit was
bypassed](#rate-limiting-new-timestamps) or [multiple packets managed
to match against the same
timestamp](#matching-against-stored-timestamps).
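
A lost minimum-RTT update could likewise be avoided with a small
bounded compare-and-swap loop (again assuming BPF CMPXCHG support;
`min_rtt` is an illustrative field name, with 0 meaning unset):

```c
/* Sketch: update min_rtt without losing concurrent updates. The loop
 * is bounded to keep the verifier happy; a few retries suffice since
 * contention here should be rare. */
static __always_inline void update_min_rtt(struct flow_state *f,
					   __u64 rtt)
{
	int i;

	for (i = 0; i < 4; i++) {
		__u64 cur = f->min_rtt;

		if (cur && rtt >= cur)
			break; /* not a new minimum */
		if (__sync_val_compare_and_swap(&f->min_rtt,
						cur, rtt) == cur)
			break; /* we won the swap */
	}
}
```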

It's worth noting that with sampling the reported minimum RTT is only
an estimate anyway (the packet with the true minimum RTT may never
have its RTT calculated). And even without sampling there is some
inherent sampling due to TCP timestamps only being updated at a
limited rate (1000 Hz).

### Outputting flow opening/closing events
A flow is not considered opened until a reply has been seen for
it. The `flow_state` map keeps information about whether the flow has
been opened or not, which is checked and updated for each reply. The
check and update of this information is not performed atomically,
which may result in multiple replies thinking they are the first,
emitting multiple flow-opened events, in case they are processed
concurrently.

Likewise, when a flow is closed, the program checks if the flow has
been opened to determine if a flow closing message should be sent. If
multiple replies are processed concurrently, it's possible that one
of them updates the flow-open information and emits a flow opening
message, but another reply closes the flow without thinking it has
ever been opened, thus not sending a flow closing message.
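
Both races stem from the open flag being checked and set in two
steps. A single atomic transition would let exactly one reply emit
the opened event, and would make the opened state visible to any
concurrent closer. A sketch, assuming an illustrative `__u32
has_opened` flag and BPF CMPXCHG support:

```c
/* Sketch: 0 -> 1 transition on the open flag; exactly one reply sees
 * the swap succeed and emits the flow-opened event. */
static __always_inline int mark_flow_opened(struct flow_state *f)
{
	return __sync_val_compare_and_swap(&f->has_opened, 0, 1) == 0;
}
```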