test improvements, and avoiding retries in event tests #103

kkourt · 2022-05-30T13:05:09Z

Please review commit-by-commit

This PR contains a number of improvements for testing. Most of them are small, but there is one which is notable: allowing testing without retries.

The idea is the following:

The typical structure of our tests is:

start observer
do some things on user-space
check that we get the expected events from tetragon

The above approach requires retries for step 3 if the events we are looking
for are not there. These retry counts are large so that we can deal with
worst case scenarios, and, subsequently, they induce a significant time cost
in failing tests. Furthermore, for some tests we also need to check for the
absence of events (e.g., when doing filtering), which also requires waiting
on timeouts.

This patch introduces the first steps to avoid having to wait on
timeouts.

We use pkg/sensors/test, which is a simpler sensor that hooks into the
lseek call, and generates a MSG_TEST event when lseek is called with
bogus arguments (fd=-1, whence=4444).

The idea is simple: after step 2 in our tests, can we trigger events
from test sensor on all CPUs. Once we have seen all the test events (on
all CPUs), then we know that if we expect any events, they are not
there, and, subsequently, there is no need for retries.

This patch:

introduces the user-space component for above functionality:
TestChecker
modifies the BPF hook to also record the current CPU
adds a trigger-test-events program that triggers the test events on
all CPUs

TestChecker works by wrapping another MultiEventChecker and looking for
TEST messages. Once it has seen one such message in all CPUs, it will
terminate the test by calling FinalCheck of the underlying
MultiEventChecker.

willfindlay

This looks awesome. Just left a couple comments/nits.

bpf/test/bpf_lseek.c

pkg/sensors/test/checker.go

olsajiri · 2022-05-30T18:38:32Z

looks good, there are tests like TestKprobeObjectFilterOpen which have empty checker object and IIUC (I might be missing something) we make them timeout as proof we did not receive an event, looks like this could help a lot

willfindlay · 2022-05-30T21:01:53Z

@olsajiri FWIW this PR #104 handles the problem you mentioned, but yes we should definitely combine both approaches.

Signed-off-by: Kornilios Kourtis <[email protected]>

The typical structure of our tests is: 1. start observer 2. do some things on user-space 3. check that we get the expected events from tetragon The above approach requires retries for step 3 if the events we are looking for are not there. These retry counts are large so that we can deal with worst case scenarios, and, subsequently, they induce a significant time cost in failing tests. Furthermore, for some tests we also need to check for the absence of events (e.g., when doing filtering), which also requires waiting on timeouts. This patch introduces the first steps to avoid having to wait on timeouts. We use pkg/sensors/test, which is a simpler sensor that hooks into the lseek call, and generates a MSG_TEST event when lseek is called with bogus arguments (fd=-1, whence=4444). The idea is simple: after step 2 in our tests, can we trigger events from test sensor on all CPUs. Once we have seen all the test events (on all CPUs), then we know that if we expect any events, they are not there, and, subsequently, there is no need for retries. This patch: * introduces the user-space component for above functionality: TestChecker * modifies the BPF hook to also record the current CPU * adds a trigger-test-events program that triggers the test events on all CPUs TestChecker works by wrapping another MultiEventChecker and looking for TEST messages. Once it has seen one such message in all CPUs, it will terminate the test by calling FinalCheck of the underlying MultiEventChecker. Signed-off-by: Kornilios Kourtis <[email protected]>

Change the value from 4444 to 4729 to avoid conflicts with tracepoint tests. Signed-off-by: Kornilios Kourtis <[email protected]>

This patch introduces a simple test to test the TestChecker. The code uses a dummy checker, which wraps to use the TestChecker. It can be used as example for other tests to remove the need for retries. Signed-off-by: Kornilios Kourtis <[email protected]>

When our json test check fails, it marks the export file containing the events so that it will not be deleted. In some cases (e.g., in TestTestChecker) , however, we expect the test to fail. Add a DontKeepExportFile for this case. Also, use DontKeepExportFile in TestTestChecker. Signed-off-by: Kornilios Kourtis <[email protected]>

doTestGenericTracepointPidFilter calls JsonTestCheck which will call KeepExportFile if the test fails. Remove unnecessary code. Signed-off-by: Kornilios Kourtis <[email protected]>

- use t.Logf instead of fmt.Printf - add comment to doTestGenericTracepointPidFilter Signed-off-by: Kornilios Kourtis <[email protected]>

Signed-off-by: Kornilios Kourtis <[email protected]>

We seem to be leaving /sys/fs/bpf/testObserver directories around after we finish the tests. Make sure they are removed. Also, use defaults.DefaultMapRoot to define the above directory. Signed-off-by: Kornilios Kourtis <[email protected]>

Signed-off-by: Kornilios Kourtis <[email protected]>

Add a LoadSensor function in testutils. Currently, this is only used by tests in pkg/sensors/test, but other tests might find it useful as well. Signed-off-by: Kornilios Kourtis <[email protected]>

This commit adds a helper sensor manager than can be used in testing. Signed-off-by: Kornilios Kourtis <[email protected]>

This patch adds the sensor/test TestChecker in the generic tracepoint tests. The immediate result is that failing the test can be failed quicker. For example, if we modify the code so that the ArgFilterLseek test fails: op := func() { t.Logf("Calling lseek...\n") - unix.Seek(fd, 0, whence) + // unix.Seek(fd, 0, whence) unix.Seek(fd, 0, whence+1) } Then, with this patch the test fails in 4.5s [1], while without in 46s [2]. Additionally, the idea here is to try and use this approach in the GH actions environment and see how well it works. If it does, we can use it to make our testing more reliable and predicatable. 1: observer_test_json.go:153: time="2022-05-30T14:28:47+02:00" level=info msg="seen events on all CPUs, finalizing test" ... --- FAIL: TestGenericTracepointArgFilterLseek (4.46s) 2: tracepoint_test.go:199: error: JsonTestCheck failed after 20 retries: JsonEOF: failed to match after 167 events: err:Got 0 events while expecting at least 1 ... --- FAIL: TestGenericTracepointArgFilterLseek (45.84s) Signed-off-by: Kornilios Kourtis <[email protected]>

kkourt requested a review from a team as a code owner May 30, 2022 13:05

kkourt changed the title ~~test improvements, and avoiding retries~~ test improvements, and avoiding retries in event tests May 30, 2022

kkourt force-pushed the pr/kkourt/test-improvements branch from aa70392 to e265817 Compare May 30, 2022 14:10

willfindlay approved these changes May 30, 2022

View reviewed changes

bpf/test/bpf_lseek.c Outdated Show resolved Hide resolved

pkg/sensors/test/checker.go Show resolved Hide resolved

kkourt added 17 commits May 31, 2022 09:21

pkg/bpf: remove unused GetMapPrefix function

db34a49

Signed-off-by: Kornilios Kourtis <[email protected]>

Makefile: make GO variable overridable

4a3e62f

Signed-off-by: Kornilios Kourtis <[email protected]>

contrib/tester-progs: clean up Makefile

72e731c

Signed-off-by: Kornilios Kourtis <[email protected]>

pkg/sensors: remove noisy message

3448de0

Signed-off-by: Kornilios Kourtis <[email protected]>

pkg/sensors/test: improvements

e8a206a

Signed-off-by: Kornilios Kourtis <[email protected]>

pkg/sensor: change whence value to trigger event

b9b9531

Change the value from 4444 to 4729 to avoid conflicts with tracepoint tests. Signed-off-by: Kornilios Kourtis <[email protected]>

tracepoint_test: no need to call KeepExportFile

b8f3414

doTestGenericTracepointPidFilter calls JsonTestCheck which will call KeepExportFile if the test fails. Remove unnecessary code. Signed-off-by: Kornilios Kourtis <[email protected]>

tracepoint_test: improvements

783fb51

- use t.Logf instead of fmt.Printf - add comment to doTestGenericTracepointPidFilter Signed-off-by: Kornilios Kourtis <[email protected]>

tracepoint_test: add var for whence bogus value

f5f6c3c

Signed-off-by: Kornilios Kourtis <[email protected]>

tracepoint_test: remove map dir after running

968e66b

We seem to be leaving /sys/fs/bpf/testObserver directories around after we finish the tests. Make sure they are removed. Also, use defaults.DefaultMapRoot to define the above directory. Signed-off-by: Kornilios Kourtis <[email protected]>

JsonCheck: remove trailing space from message

9d42e72

Signed-off-by: Kornilios Kourtis <[email protected]>

testutils: add LoadSensor testing helper

06d7f0a

Add a LoadSensor function in testutils. Currently, this is only used by tests in pkg/sensors/test, but other tests might find it useful as well. Signed-off-by: Kornilios Kourtis <[email protected]>

testutils: add test sensor manager

639696c

This commit adds a helper sensor manager than can be used in testing. Signed-off-by: Kornilios Kourtis <[email protected]>

kkourt force-pushed the pr/kkourt/test-improvements branch from e265817 to 0fd9845 Compare May 31, 2022 07:24

kkourt merged commit 2ace011 into main Jun 1, 2022

kkourt deleted the pr/kkourt/test-improvements branch June 1, 2022 05:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test improvements, and avoiding retries in event tests #103

test improvements, and avoiding retries in event tests #103

kkourt commented May 30, 2022

willfindlay left a comment

olsajiri commented May 30, 2022

willfindlay commented May 30, 2022

test improvements, and avoiding retries in event tests #103

test improvements, and avoiding retries in event tests #103

Conversation

kkourt commented May 30, 2022

willfindlay left a comment

Choose a reason for hiding this comment

olsajiri commented May 30, 2022

willfindlay commented May 30, 2022