Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test improvements, and avoiding retries in event tests #103

Merged
merged 17 commits into from
Jun 1, 2022

Conversation

kkourt
Copy link
Contributor

@kkourt kkourt commented May 30, 2022

Please review commit-by-commit

This PR contains a number of improvements for testing. Most of them are small, but there is one which is notable: allowing testing without retries.

The idea is the following:

The typical structure of our tests is:

  1. start observer
  2. do some things on user-space
  3. check that we get the expected events from tetragon

The above approach requires retries for step 3 if the events we are looking
for are not there. These retry counts are large so that we can deal with
worst case scenarios, and, subsequently, they induce a significant time cost
in failing tests. Furthermore, for some tests we also need to check for the
absence of events (e.g., when doing filtering), which also requires waiting
on timeouts.

This patch introduces the first steps to avoid having to wait on
timeouts.

We use pkg/sensors/test, which is a simpler sensor that hooks into the
lseek call, and generates a MSG_TEST event when lseek is called with
bogus arguments (fd=-1, whence=4444).

The idea is simple: after step 2 in our tests, can we trigger events
from test sensor on all CPUs. Once we have seen all the test events (on
all CPUs), then we know that if we expect any events, they are not
there, and, subsequently, there is no need for retries.

This patch:

  • introduces the user-space component for above functionality:
    TestChecker
  • modifies the BPF hook to also record the current CPU
  • adds a trigger-test-events program that triggers the test events on
    all CPUs

TestChecker works by wrapping another MultiEventChecker and looking for
TEST messages. Once it has seen one such message in all CPUs, it will
terminate the test by calling FinalCheck of the underlying
MultiEventChecker.

@kkourt kkourt requested a review from a team as a code owner May 30, 2022 13:05
@kkourt kkourt changed the title test improvements, and avoiding retries test improvements, and avoiding retries in event tests May 30, 2022
@kkourt kkourt force-pushed the pr/kkourt/test-improvements branch from aa70392 to e265817 Compare May 30, 2022 14:10
Copy link
Contributor

@willfindlay willfindlay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome. Just left a couple comments/nits.

bpf/test/bpf_lseek.c Outdated Show resolved Hide resolved
pkg/sensors/test/checker.go Show resolved Hide resolved
@olsajiri
Copy link
Contributor

looks good, there are tests like TestKprobeObjectFilterOpen which have empty checker object and IIUC (I might be missing something) we make them timeout as proof we did not receive an event, looks like this could help a lot

@willfindlay
Copy link
Contributor

@olsajiri FWIW this PR #104 handles the problem you mentioned, but yes we should definitely combine both approaches.

kkourt added 17 commits May 31, 2022 09:21
Signed-off-by: Kornilios Kourtis <[email protected]>
Signed-off-by: Kornilios Kourtis <[email protected]>
The typical structure of our tests is:
  1. start observer
  2. do some things on user-space
  3. check that we get the expected events from tetragon

The above approach requires retries for step 3 if the events we are looking
for are not there. These retry counts are large so that we can deal with
worst case scenarios, and, subsequently, they induce a significant time cost
in failing tests. Furthermore, for some tests we also need to check for the
absence of events (e.g., when doing filtering), which also requires waiting
on timeouts.

This patch introduces the first steps to avoid having to wait on
timeouts.

We use pkg/sensors/test, which is a simpler sensor that hooks into the
lseek call, and generates a MSG_TEST event when lseek is called with
bogus arguments (fd=-1, whence=4444).

The idea is simple: after step 2 in our tests, can we trigger events
from test sensor on all CPUs. Once we have seen all the test events (on
all CPUs), then we know that if we expect any events, they are not
there, and, subsequently, there is no need for retries.

This patch:
 * introduces the user-space component for above functionality:
 TestChecker
 * modifies the BPF hook to also record the current CPU
 * adds a trigger-test-events program that triggers the test events on
   all CPUs

TestChecker works by wrapping another MultiEventChecker and looking for
TEST messages. Once it has seen one such message in all CPUs, it will
terminate the test by calling FinalCheck of the underlying
MultiEventChecker.

Signed-off-by: Kornilios Kourtis <[email protected]>
Change the value from 4444 to 4729 to avoid conflicts with
tracepoint tests.

Signed-off-by: Kornilios Kourtis <[email protected]>
This patch introduces a simple test to test the TestChecker.

The code uses a dummy checker, which wraps to use the TestChecker. It
can be used as example for other tests to remove the need for retries.

Signed-off-by: Kornilios Kourtis <[email protected]>
When our json test check fails, it marks the export file containing the
events so that it will not be deleted. In some cases (e.g., in
TestTestChecker) , however, we expect the test to fail. Add a
DontKeepExportFile for this case.

Also, use DontKeepExportFile in TestTestChecker.

Signed-off-by: Kornilios Kourtis <[email protected]>
doTestGenericTracepointPidFilter calls JsonTestCheck which will call
KeepExportFile if the test fails. Remove unnecessary code.

Signed-off-by: Kornilios Kourtis <[email protected]>
- use t.Logf instead of fmt.Printf
- add comment to doTestGenericTracepointPidFilter

Signed-off-by: Kornilios Kourtis <[email protected]>
We seem to be leaving /sys/fs/bpf/testObserver directories around after
we finish the tests. Make sure they are removed. Also, use
defaults.DefaultMapRoot to define the above directory.

Signed-off-by: Kornilios Kourtis <[email protected]>
Add a LoadSensor function in testutils. Currently, this is only used by
tests in pkg/sensors/test, but other tests might find it useful as well.

Signed-off-by: Kornilios Kourtis <[email protected]>
This commit adds a helper sensor manager than can be used in testing.

Signed-off-by: Kornilios Kourtis <[email protected]>
This patch adds the sensor/test TestChecker in the generic tracepoint
tests.

The immediate result is that failing the test can be failed quicker.

For example, if we modify the code so that the ArgFilterLseek test fails:
        op := func() {
                t.Logf("Calling lseek...\n")
-               unix.Seek(fd, 0, whence)
+               // unix.Seek(fd, 0, whence)
                unix.Seek(fd, 0, whence+1)
        }

Then, with this patch the test fails in 4.5s [1], while without in 46s
[2].

Additionally, the idea here is to try and use this approach in the GH
actions environment and see how well it works. If it does, we can use it
to make our testing more reliable and predicatable.

1:
observer_test_json.go:153: time="2022-05-30T14:28:47+02:00" level=info msg="seen events on all CPUs, finalizing test"
...
--- FAIL: TestGenericTracepointArgFilterLseek (4.46s)

2:
tracepoint_test.go:199: error: JsonTestCheck failed after 20 retries: JsonEOF: failed to match after 167 events: err:Got 0 events while expecting at least 1
...
--- FAIL: TestGenericTracepointArgFilterLseek (45.84s)

Signed-off-by: Kornilios Kourtis <[email protected]>
@kkourt kkourt force-pushed the pr/kkourt/test-improvements branch from e265817 to 0fd9845 Compare May 31, 2022 07:24
@kkourt kkourt merged commit 2ace011 into main Jun 1, 2022
@kkourt kkourt deleted the pr/kkourt/test-improvements branch June 1, 2022 05:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants