
Commit 3d1a250

pping: Do both timestamping and matching on ingress and egress

Perform both timestamping and matching on both ingress and egress hooks. This makes it more similar to Kathie's pping, allowing the tool to capture RTTs in both directions when deployed on just a single interface.

Like Kathie's pping, by default filter out RTTs for packets going to the local machine (these will only include local processing delays). This behavior can be disabled by passing the -l/--localfilt-off option.

As packets that are timestamped on ingress and matched on egress will include the local machine's processing delay, add the "match_on_egress" member to the JSON output, which can be used to differentiate between RTTs that include the local processing delay and those that don't.

Finally, report the source and destination addresses from the perspective of the reply packet, rather than the timestamped packet, to be consistent with Kathie's pping.

Overall, refactor large parts of pping_kern to allow both timestamping and matching, as well as updating both the flow and reverse flow and handling the flow-events related to them, in one go. Also update README to reflect changes.

Concerns with this commit:
- Performance may be worse due to the increased complexity of handling both directions of the flow. Additionally, if local-filtering is used (enabled by default), will also have to perform a fib-lookup.
- For the standard and ppviz output formats, it's not possible to tell if the reported RTT includes delays from local processing or not.
- No longer works on kernel 5.4 (verifier seems to object against lookup of config.use_srtt).

Other notes:
- Verifier seems to have a much easier time verifying the refactored code, "only" processing 156k instructions (down from ~850k) despite the overall increase in complexity.

Signed-off-by: Simon Sundberg <[email protected]>
1 parent d4e7d01 commit 3d1a250
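
To make the core change concrete, here is a minimal userspace sketch in C of running the same timestamp-and-match logic on every packet, whichever hook it is seen on. It is not the code from pping_kern.c: the fixed-size table only stands in for the `packet_ts` map, endpoints are plain integers, and `sim_packet`, `ts_entry`, `sim_result`, `find_entry()` and `handle_packet()` are hypothetical names. Matching before timestamping is also an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TS_ENTRIES 64 /* hypothetical capacity of the stand-in table */

/* Hypothetical, simplified packet: endpoints are plain integers. */
struct sim_packet {
	uint32_t src, dst; /* endpoints of the flow the packet belongs to */
	uint32_t tsval;    /* identifier that may be timestamped */
	uint32_t tsecr;    /* reply identifier to match in the reverse flow */
	uint64_t now_ns;   /* time at which the packet was seen */
	bool is_egress;    /* hook on which the packet was seen */
};

struct ts_entry {
	bool used;
	uint32_t src, dst, id;
	uint64_t ts_ns;
};

struct sim_result {
	bool matched;
	bool match_on_egress;
	uint64_t rtt_ns;
};

static struct ts_entry table[MAX_TS_ENTRIES]; /* stand-in for packet_ts */

static struct ts_entry *find_entry(uint32_t src, uint32_t dst, uint32_t id)
{
	for (int i = 0; i < MAX_TS_ENTRIES; i++)
		if (table[i].used && table[i].src == src &&
		    table[i].dst == dst && table[i].id == id)
			return &table[i];
	return NULL;
}

/* Runs on every packet, on either hook: match first, then timestamp. */
static struct sim_result handle_packet(const struct sim_packet *p)
{
	struct sim_result res = { false, false, 0 };
	struct ts_entry *e;

	assert(p != NULL);

	/* Match: look up the reply identifier in the reverse flow. */
	e = find_entry(p->dst, p->src, p->tsecr);
	if (e) {
		res.matched = true;
		res.match_on_egress = p->is_egress;
		res.rtt_ns = p->now_ns - e->ts_ns;
		e->used = false; /* a timestamp is matched at most once */
	}

	/* Timestamp: only the first packet carrying this identifier. */
	if (!find_entry(p->src, p->dst, p->tsval)) {
		for (int i = 0; i < MAX_TS_ENTRIES; i++) {
			if (!table[i].used) {
				table[i] = (struct ts_entry){ true, p->src, p->dst,
							      p->tsval, p->now_ns };
				break;
			}
		}
	}
	return res;
}
```

With this sketch, a request seen on ingress at 1000 ns and its reply seen on egress at 1500 ns give an RTT of 500 ns with `match_on_egress` set, which is the case the commit message describes as including the local machine's processing delay.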

File tree

5 files changed

+440
-289
lines changed

pping/README.md

Lines changed: 38 additions & 39 deletions
@@ -13,20 +13,21 @@ spinbit and DNS queries. See the [TODO-list](./TODO.md) for more potential
 features (which may or may not ever get implemented).

 The fundamental logic of pping is to timestamp a pseudo-unique identifier for
-outgoing packets, and then look for matches in the incoming packets. If a match
-is found, the RTT is simply calculated as the time difference between the
-current time and the stored timestamp.
+packets, and then look for matches in the reply packets. If a match is found,
+the RTT is simply calculated as the time difference between the current time and
+the stored timestamp.

 This tool, just as Kathie's original pping implementation, uses TCP timestamps
-as identifiers for TCP traffic. For outgoing packets, the TSval (which is a
-timestamp in and off itself) is timestamped. Incoming packets are then parsed
-for the TSecr, which are the echoed TSval values from the receiver. The TCP
-timestamps are not necessarily unique for every packet (they have a limited
-update frequency, appears to be 1000 Hz for modern Linux systems), so only the
-first instance of an identifier is timestamped, and matched against the first
-incoming packet with the identifier. The mechanism to ensure only the first
-packet is timestamped and matched differs from the one in Kathie's pping, and is
-further described in [SAMPLING_DESIGN](./SAMPLING_DESIGN.md).
+as identifiers for TCP traffic. The TSval (which is a timestamp in and off
+itself) is used as an identifier and timestamped. Reply packets in the reverse
+flow are then parsed for the TSecr, which are the echoed TSval values from the
+receiver. The TCP timestamps are not necessarily unique for every packet (they
+have a limited update frequency, appears to be 1000 Hz for modern Linux
+systems), so only the first instance of an identifier is timestamped, and
+matched against the first incoming packet with a matching reply identifier. The
+mechanism to ensure only the first packet is timestamped and matched differs
+from the one in Kathie's pping, and is further described in
+[SAMPLING_DESIGN](./SAMPLING_DESIGN.md).

 For ICMP echo, it uses the echo identifier as port numbers, and echo sequence
 number as identifer to match against. Linux systems will typically use different
@@ -48,7 +49,7 @@ single line per event.

 An example of the format is provided below:
 ```shell
-16:00:46.142279766 TCP 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from src
+16:00:46.142279766 TCP 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from dest
 16:00:46.147705205 5.425439 ms 5.425439 ms TCP 10.11.1.1:5201+10.11.1.2:59528
 16:00:47.148905125 5.261430 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528
 16:00:48.151666385 5.972284 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528
@@ -96,7 +97,7 @@ An example of a (pretty-printed) flow-event is provided below:
 "protocol": "TCP",
 "flow_event": "opening",
 "reason": "SYN-ACK",
-"triggered_by": "src"
+"triggered_by": "dest"
 }
 ```

@@ -114,7 +115,8 @@ An example of a (pretty-printed) RTT-even is provided below:
 "sent_packets": 9393,
 "sent_bytes": 492457296,
 "rec_packets": 5922,
-"rec_bytes": 37
+"rec_bytes": 37,
+"match_on_egress": false
 }
 ```

@@ -123,36 +125,33 @@ An example of a (pretty-printed) RTT-even is provided below:

 ### Files:
 - **pping.c:** Userspace program that loads and attaches the BPF programs, pulls
-the perf-buffer `rtt_events` to print out RTT messages and periodically cleans
+the perf-buffer `events` to print out RTT messages and periodically cleans
 up the hash-maps from old entries. Also passes user options to the BPF
 programs by setting a "global variable" (stored in the programs .rodata
 section).
-- **pping_kern.c:** Contains the BPF programs that are loaded on tc (egress) and
-XDP (ingress), as well as several common functions, a global constant `config`
-(set from userspace) and map definitions. The tc program `pping_egress()`
-parses outgoing packets for identifiers. If an identifier is found and the
-sampling strategy allows it, a timestamp for the packet is created in
-`packet_ts`. The XDP program `pping_ingress()` parses incomming packets for an
-identifier. If found, it looks up the `packet_ts` map for a match on the
-reverse flow (to match source/dest on egress). If there is a match, it
-calculates the RTT from the stored timestamp and deletes the entry. The
-calculated RTT (together with the flow-tuple) is pushed to the perf-buffer
-`events`. Both `pping_egress()` and `pping_ingress` can also push flow-events
-to the `events` buffer.
+- **pping_kern.c:** Contains the BPF programs that are loaded on egress (tc) and
+ingress (XDP or tc), as well as several common functions, a global constant
+`config` (set from userspace) and map definitions. Essentially the same pping
+program is loaded on both ingress and egress. All packets are parsed for both
+an identifier that can be used to create a timestamp entry `packet_ts`, and a
+reply identifier that can be used to match the packet with a previously
+timestamped one in the reverse flow. If a match is found, an RTT is calculated
+and an RTT-event is pushed to userspace through the perf-buffer `events`. For
+each packet with a valid identifier, the program also keeps track of and
+updates the state flow and reverse flow, stored in the `flow_state` map.
 - **pping.h:** Common header file included by `pping.c` and
 `pping_kern.c`. Contains some common structs used by both (are part of the
 maps).

 ### BPF Maps:
 - **flow_state:** A hash-map storing some basic state for each flow, such as the
 last seen identifier for the flow and when the last timestamp entry for the
-flow was created. Entries are created by `pping_egress()`, and can be updated
-or deleted by both `pping_egress()` and `pping_ingress()`. Leftover entries
-are eventually removed by `pping.c`.
+flow was created. Entries are created, updated and deleted by the BPF pping
+programs. Leftover entries are eventually removed by userspace (`pping.c`).
 - **packet_ts:** A hash-map storing a timestamp for a specific packet
-identifier. Entries are created by `pping_egress()` and removed by
-`pping_ingress()` if a match is found. Leftover entries are eventually removed
-by `pping.c`.
+identifier. Entries are created by the BPF pping program if a valid identifier
+is found, and removed if a match is found. Leftover entries are eventually
+removed by userspace (`pping.c`).
 - **events:** A perf-buffer used by the BPF programs to push flow or RTT events
 to `pping.c`, which continuously polls the map the prints them out.

@@ -222,9 +221,9 @@ additional map space and report some additional RTT(s) more than expected
 (however the reported RTTs should still be correct).

 If the packets have the same identifier, they must first have managed to bypass
-the previous check for unique identifiers (see [previous point](#Tracking last
-seen identifier)), and only one of them will be able to successfully store a
-timestamp entry.
+the previous check for unique identifiers (see [previous
+point](#tracking-last-seen-identifier)), and only one of them will be able to
+successfully store a timestamp entry.

 #### Matching against stored timestamps
 The XDP/ingress program could potentially match multiple concurrent packets with
@@ -246,8 +245,8 @@ if this is the lowest RTT seen so far for the flow. If multiple RTTs are
 calculated concurrently, then several could pass this check concurrently and
 there may be a lost update. It should only be possible for multiple RTTs to be
 calculated concurrently in case either the [timestamp rate-limit was
-bypassed](#Rate-limiting new timestamps) or [multiple packets managed to match
-against the same timestamp](#Matching against stored timestamps).
+bypassed](#rate-limiting-new-timestamps) or [multiple packets managed to match
+against the same timestamp](#matching-against-stored-timestamps).

 It's worth noting that with sampling the reported minimum-RTT is only an
 estimate anyways (may never calculate RTT for packet with the true minimum

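The README text in this diff says that only the first instance of an identifier is timestamped, and that `flow_state` keeps the last seen identifier and the time the last timestamp entry was created. A rough C sketch of such a check follows. It is an illustration under assumptions, not the mechanism documented in SAMPLING_DESIGN.md: `sim_flow_state`, `should_timestamp()` and `rate_limit_ns` are hypothetical names, and both the order of the two checks and recording the identifier even when the rate limit rejects the packet are guesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-flow state; the real flow_state entry holds more fields. */
struct sim_flow_state {
	bool has_last_id;
	uint32_t last_id;    /* last identifier seen for this flow */
	bool has_last_ts;
	uint64_t last_ts_ns; /* when the last timestamp entry was created */
};

/*
 * Returns true if a timestamp entry should be created for this packet.
 * rate_limit_ns is a hypothetical minimum spacing between new entries.
 */
static bool should_timestamp(struct sim_flow_state *fs, uint32_t tsval,
			     uint64_t now_ns, uint64_t rate_limit_ns)
{
	assert(fs != NULL);

	/* Only the first packet carrying a given identifier qualifies. */
	if (fs->has_last_id && fs->last_id == tsval)
		return false;
	fs->has_last_id = true;
	fs->last_id = tsval;

	/* Skip the packet if the previous entry was created too recently. */
	if (fs->has_last_ts && now_ns - fs->last_ts_ns < rate_limit_ns)
		return false;
	fs->has_last_ts = true;
	fs->last_ts_ns = now_ns;
	return true;
}
```

The sketch is single-threaded; the concurrency caveats discussed in the README (several packets passing the check at once) are exactly what it leaves out.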
pping/eBPF_pping_design.png

-8.14 KB

pping/pping.c

Lines changed: 18 additions & 15 deletions
@@ -15,11 +15,8 @@ static const char *__doc__ =
 #include <unistd.h>
 #include <getopt.h>
 #include <stdbool.h>
-#include <limits.h>
 #include <signal.h> // For detecting Ctrl-C
 #include <sys/resource.h> // For setting rlmit
-#include <sys/wait.h>
-#include <sys/stat.h>
 #include <time.h>
 #include <pthread.h>

@@ -99,6 +96,7 @@ static const struct option long_options[] = {
 { "cleanup-interval", required_argument, NULL, 'c' }, // Map cleaning interval in s
 { "format", required_argument, NULL, 'F' }, // Which format to output in (standard/json/ppviz)
 { "ingress-hook", required_argument, NULL, 'I' }, // Use tc or XDP as ingress hook
+{ "localfilt-off", no_argument, NULL, 'l' }, // Disable local filtering (will start to report "internal" RTTs)
 { 0, 0, NULL, 0 }
 };

@@ -163,11 +161,12 @@ static int parse_arguments(int argc, char *argv[], struct pping_config *config)
 double rate_limit_ms, cleanup_interval_s, rtt_rate;

 config->ifindex = 0;
+config->bpf_config.localfilt = true;
 config->force = false;
 config->json_format = false;
 config->ppviz_format = false;

-while ((opt = getopt_long(argc, argv, "hfi:r:R:T:c:F:I:", long_options,
+while ((opt = getopt_long(argc, argv, "hfli:r:R:T:c:F:I:", long_options,
 NULL)) != -1) {
 switch (opt) {
 case 'i':
@@ -245,6 +244,9 @@ static int parse_arguments(int argc, char *argv[], struct pping_config *config)
 return -EINVAL;
 }
 break;
+case 'l':
+config->bpf_config.localfilt = false;
+break;
 case 'f':
 config->force = true;
 config->xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
@@ -469,9 +471,9 @@ static bool flow_timeout(void *key_ptr, void *val_ptr, __u64 now)
 if (print_event_func) {
 fe.event_type = EVENT_TYPE_FLOW;
 fe.timestamp = now;
-fe.flow = *(struct network_tuple *)key_ptr;
-fe.event_info.event = FLOW_EVENT_CLOSING;
-fe.event_info.reason = EVENT_REASON_FLOW_TIMEOUT;
+reverse_flow(&fe.flow, key_ptr);
+fe.flow_event_type = FLOW_EVENT_CLOSING;
+fe.reason = EVENT_REASON_FLOW_TIMEOUT;
 fe.source = EVENT_SOURCE_USERSPACE;
 print_event_func(NULL, 0, &fe, sizeof(fe));
 }
@@ -622,6 +624,7 @@ static const char *flowevent_to_str(enum flow_event_type fe)
 case FLOW_EVENT_OPENING:
 return "opening";
 case FLOW_EVENT_CLOSING:
+case FLOW_EVENT_CLOSING_BOTH:
 return "closing";
 default:
 return "unknown";
@@ -639,8 +642,6 @@ static const char *eventreason_to_str(enum flow_event_reason er)
 return "first observed packet";
 case EVENT_REASON_FIN:
 return "FIN";
-case EVENT_REASON_FIN_ACK:
-return "FIN-ACK";
 case EVENT_REASON_RST:
 return "RST";
 case EVENT_REASON_FLOW_TIMEOUT:
@@ -653,9 +654,9 @@ static const char *eventreason_to_str(enum flow_event_reason er)
 static const char *eventsource_to_str(enum flow_event_source es)
 {
 switch (es) {
-case EVENT_SOURCE_EGRESS:
+case EVENT_SOURCE_PKT_SRC:
 return "src";
-case EVENT_SOURCE_INGRESS:
+case EVENT_SOURCE_PKT_DEST:
 return "dest";
 case EVENT_SOURCE_USERSPACE:
 return "userspace-cleanup";
@@ -705,8 +706,8 @@ static void print_event_standard(void *ctx, int cpu, void *data,
 printf(" %s ", proto_to_str(e->rtt_event.flow.proto));
 print_flow_ppvizformat(stdout, &e->flow_event.flow);
 printf(" %s due to %s from %s\n",
-flowevent_to_str(e->flow_event.event_info.event),
-eventreason_to_str(e->flow_event.event_info.reason),
+flowevent_to_str(e->flow_event.flow_event_type),
+eventreason_to_str(e->flow_event.reason),
 eventsource_to_str(e->flow_event.source));
 }
 }
@@ -755,18 +756,20 @@ static void print_rttevent_fields_json(json_writer_t *ctx,
 jsonw_u64_field(ctx, "sent_bytes", re->sent_bytes);
 jsonw_u64_field(ctx, "rec_packets", re->rec_pkts);
 jsonw_u64_field(ctx, "rec_bytes", re->rec_bytes);
+jsonw_bool_field(ctx, "match_on_egress", re->match_on_egress);
 }

 static void print_flowevent_fields_json(json_writer_t *ctx,
 const struct flow_event *fe)
 {
 jsonw_string_field(ctx, "flow_event",
-flowevent_to_str(fe->event_info.event));
+flowevent_to_str(fe->flow_event_type));
 jsonw_string_field(ctx, "reason",
-eventreason_to_str(fe->event_info.reason));
+eventreason_to_str(fe->reason));
 jsonw_string_field(ctx, "triggered_by", eventsource_to_str(fe->source));
 }

+// TODO - add field noting if RTT includes "internal" delays or not
 static void print_event_json(void *ctx, int cpu, void *data, __u32 data_size)
 {
 const union pping_event *e = data;

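The option handling added to `parse_arguments()` follows the usual `getopt_long()` pattern: the default is set before the loop and the flag only switches it off. The following standalone sketch isolates that pattern for `-l/--localfilt-off`. `sketch_options` and `parse_localfilt()` are hypothetical names, and resetting `optind` is only there so the function can be called more than once; the real function handles many more options.

```c
#include <assert.h>
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced option table containing only the flag added by this commit. */
static const struct option sketch_options[] = {
	{ "localfilt-off", no_argument, NULL, 'l' },
	{ 0, 0, NULL, 0 }
};

/* Returns 0 on success and -1 on an unknown option. */
static int parse_localfilt(int argc, char *argv[], bool *localfilt)
{
	int opt;

	assert(localfilt != NULL);
	*localfilt = true; /* local filtering stays on unless -l is given */
	optind = 1;        /* restart scanning so repeated calls work */

	while ((opt = getopt_long(argc, argv, "l", sketch_options, NULL)) != -1) {
		switch (opt) {
		case 'l':
			*localfilt = false;
			break;
		default:
			return -1;
		}
	}
	return 0;
}
```

Because `l` takes no argument, it appears in the option string without a trailing colon, matching the `"hfli:r:R:T:c:F:I:"` string in the diff.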
pping/pping.h

Lines changed: 26 additions & 11 deletions
@@ -22,30 +22,31 @@ typedef __u64 fixpoint64;
 enum __attribute__((__packed__)) flow_event_type {
 FLOW_EVENT_NONE,
 FLOW_EVENT_OPENING,
-FLOW_EVENT_CLOSING
+FLOW_EVENT_CLOSING,
+FLOW_EVENT_CLOSING_BOTH
 };

 enum __attribute__((__packed__)) flow_event_reason {
 EVENT_REASON_SYN,
 EVENT_REASON_SYN_ACK,
 EVENT_REASON_FIRST_OBS_PCKT,
 EVENT_REASON_FIN,
-EVENT_REASON_FIN_ACK,
 EVENT_REASON_RST,
 EVENT_REASON_FLOW_TIMEOUT
 };

 enum __attribute__((__packed__)) flow_event_source {
-EVENT_SOURCE_EGRESS,
-EVENT_SOURCE_INGRESS,
+EVENT_SOURCE_PKT_SRC,
+EVENT_SOURCE_PKT_DEST,
 EVENT_SOURCE_USERSPACE
 };

 struct bpf_config {
 __u64 rate_limit;
 fixpoint64 rtt_rate;
 bool use_srtt;
-__u8 reserved[7];
+bool localfilt;
+__u8 reserved[6];
 };

 /*
@@ -110,13 +111,14 @@ struct rtt_event {
 __u64 sent_bytes;
 __u64 rec_pkts;
 __u64 rec_bytes;
-__u32 reserved;
+bool match_on_egress;
+__u8 reserved[7];
 };

-struct flow_event_info {
-enum flow_event_type event;
-enum flow_event_reason reason;
-};
+/* struct flow_event_info { */
+/* enum flow_event_type event; */
+/* enum flow_event_reason reason; */
+/* }; */

 /*
 * A flow event message that can be passed from the bpf-programs to user-space.
@@ -128,7 +130,8 @@ struct flow_event {
 __u64 event_type;
 __u64 timestamp;
 struct network_tuple flow;
-struct flow_event_info event_info;
+enum flow_event_type flow_event_type;
+enum flow_event_reason reason;
 enum flow_event_source source;
 __u8 reserved;
 };
@@ -139,4 +142,16 @@ union pping_event {
 struct flow_event flow_event;
 };

+/*
+* Copies the src to dest, but swapping place on saddr and daddr
+*/
+static void reverse_flow(struct network_tuple *dest, struct network_tuple *src)
+{
+dest->ipv = src->ipv;
+dest->proto = src->proto;
+dest->saddr = src->daddr;
+dest->daddr = src->saddr;
+dest->reserved = 0;
+}
+
 #endif

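The `reverse_flow()` helper added to pping.h can be exercised on its own. The definition of `struct network_tuple` is not part of this diff, so the sketch below uses hypothetical, simplified stand-ins (`sim_address`, `sim_tuple`) and a copy of the helper named `sim_reverse_flow()`. Unlike the version in the diff, the sketch marks `src` as `const`, which is a choice of the sketch and not something the commit does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the address and tuple structs. */
struct sim_address {
	uint32_t ip;
	uint16_t port;
	uint16_t reserved;
};

struct sim_tuple {
	struct sim_address saddr;
	struct sim_address daddr;
	uint16_t proto;
	uint8_t ipv;
	uint8_t reserved;
};

/* Same steps as reverse_flow() in the diff, on the simplified types. */
static void sim_reverse_flow(struct sim_tuple *dest, const struct sim_tuple *src)
{
	assert(dest != NULL && src != NULL);
	assert(dest != src); /* an in-place swap would overwrite saddr too early */

	dest->ipv = src->ipv;
	dest->proto = src->proto;
	dest->saddr = src->daddr;
	dest->daddr = src->saddr;
	dest->reserved = 0;
}
```

Reversing a tuple twice gives back the original endpoints. The helper copies field by field, so `dest` and `src` have to be distinct objects, as they are in the `flow_timeout()` call site where `fe.flow` and `key_ptr` point to different memory.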