Fix: Optimize single_process_type execution #2110

svteb · 2024-07-17T09:18:01Z

Description

The single_process_type / process_check spec tests took 1:30:00 on average to execute (in GitHub Actions); this has been reduced to around 10-15 minutes. The slowdown was caused by incorrect use of libraries, which checked unnecessary processes multiple times.

These two changes need to be merged first:
cluster_tools change: #27
k8s_kernel_introspection change: #4

Because this pull request makes changes in two libraries, it is rather difficult to verify it through actions. Currently there are changes to the shard.yml and shard.lock files, which will be removed once the reviewers approve the change and the individual library pull requests are merged.

For reviewers:
There are two big changes. The first is the replacement of workload_resource_test by cnf_workload_resources. Both yield resource names, but workload_resource_test also yields the container names for each resource, so the same resource is yielded repeatedly, like this (for some reason it also did not return all the containers):

CNFManager.workload_resource_test(args, config) do |resource, container, initialized|

resource: {kind: "Deployment", name: "nginx-webapp", namespace: "cnfspace"}
resource: {kind: "Deployment", name: "nginx-webapp", namespace: "cnfspace"}
resource: {kind: "Deployment", name: "nginx-webapp", namespace: "cnfspace"}
resource: {kind: "Pod", name: "sidecar-container-demo", namespace: "cnfspace"}
resource: {kind: "Pod", name: "sidecar-container-demo", namespace: "cnfspace"}

optimized version:

CNFManager.cnf_workload_resources(args, config) do |resource|

resource: {kind: "Deployment", name: "nginx-webapp", namespace: "cnfspace"}
resource: {kind: "Pod", name: "sidecar-container-demo", namespace: "cnfspace"}
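
To illustrate the redundancy, here is a minimal shell sketch (not the library code). It assumes the yields can be flattened to one kind/name/namespace line each. The function names per_container_yields and unique_resources are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Hypothetical sample of what the old iteration yields: one line per
# (resource, container) pair, so a resource repeats once per container.
per_container_yields() {
  cat <<'EOF'
Deployment/nginx-webapp/cnfspace
Deployment/nginx-webapp/cnfspace
Deployment/nginx-webapp/cnfspace
Pod/sidecar-container-demo/cnfspace
Pod/sidecar-container-demo/cnfspace
EOF
}

# Keep only the first occurrence of each resource line, preserving order.
unique_resources() {
  awk '!seen[$0]++'
}

per_container_yields | unique_resources
```

The five per-container yields collapse to two lines, one per resource, which is the set the optimized iteration visits.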

With this change, we do not iterate over every resource multiple times.

The second big change is the addition of pid filtering by cgroups. The previous version incorrectly checked the processes on the whole node instead of those in a container. This led to something like the following, which makes no sense considering that we want to verify the processes of a specific container:

for each container:
         container_node = get_container_node()
         get_all_pids_on_node(container_node)

optimized version:

for each container:
         container_node = get_container_node()
         get_all_pids_on_container(container, container_node)

The cgroup filtering currently depends on ctr purely because it was easier to write than runc filtering. Since the current testsuite depends on the containerd runtime (ctr), this should not be too problematic, and it can be changed in the future.

Finally, I noticed that the cluster_tools library provided a function that was a copy of the code in single_process_check. That same function is also shared by the zombie task, which has some unique quirks that I will not go into here. The shared function (ClusterTools.all_containers_by_resource?) had to be changed slightly so that it would not break the zombie task. A better refactor can be considered in the future, but it is outside the scope of this pull request.

Issues:

Refs: #2084

How has this been tested:

Types of changes:

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update

Checklist:

Documentation

My change requires a change to the documentation.
I have updated the documentation accordingly.
No updates required.

Code Review

Does the test handle fatal exceptions, i.e. with a rescue block

Issue

Tasks in issue are checked off

kosstennbl

Nice change, optimizations for this test are very welcome.
A few additional points:

Maybe we could use "crictl" for the k8s introspection change, to make it easier to support [Feature] Support multiple container engines #2103 in case that gets implemented?
The operation of workload_resource_test and cnf_workload_resources is complicated, and it feels like the design of workload_resource_test could be a cause of multiple slowdowns in our tests. We should probably look into fixing this at a larger scale. That is out of scope for this PR, though.

src/tasks/workload/microservice.cr

svteb · 2024-07-23T09:22:52Z

> Nice change, optimizations for this test are very welcome. A few additional points:
>
> Maybe we could use "crictl" for the k8s introspection change, to make it easier to support [Feature] Support multiple container engines #2103 in case that gets implemented?
>
> The operation of workload_resource_test and cnf_workload_resources is complicated, and it feels like the design of workload_resource_test could be a cause of multiple slowdowns in our tests. We should probably look into fixing this at a larger scale. That is out of scope for this PR, though.

Yes, after some discussion we have come to the conclusion that ctr might not be generic enough. The problem is that every container CLI tool has nuances that prevent easy access to cgroups and pid filtering. crictl in particular is too high-level and does not provide access to cgroups. ~~We've decided to use runc instead which is a lower level cli tool.~~

Edit:
After further thought I've decided to forgo any container CLI tool and instead piped a few bash commands together, which work like this:

"find /proc -maxdepth 1 -regex '/proc/[0-9]+' -exec grep -l '#{container_id}' {}/cgroup \\; 2>/dev/null | sed -e 's,/proc/\\([0-9]*\\)/cgroup,\\1,'"

Get all process directories from /proc.
Grep through each /proc/<pid>/cgroup file to check whether it mentions the container_id.
Return the matching pids.
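
The three steps above can also be written as a small standalone shell function. This is only a sketch under stated assumptions, not the code used in k8s_kernel_introspection. The function name pids_for_container and its PROC_ROOT argument are hypothetical (the argument is there so the function can be tried against a fake directory tree). It uses a shell glob instead of find -regex, and it matches the container id as a fixed string with grep -F.

```shell
#!/bin/sh
# pids_for_container PROC_ROOT CONTAINER_ID
# Prints, one per line, the PIDs whose cgroup file mentions CONTAINER_ID.
# PROC_ROOT is normally /proc; it is a parameter only for illustration.
pids_for_container() {
  proc_root=$1
  container_id=$2
  for dir in "$proc_root"/[0-9]*; do
    # Keep purely numeric entries only (this skips "self", "sys", and so on,
    # and also the unexpanded pattern when nothing matches the glob).
    case "${dir##*/}" in
      ''|*[!0-9]*) continue ;;
    esac
    # A process may exit while we scan; skip it if its cgroup file is gone.
    [ -r "$dir/cgroup" ] || continue
    # -F: treat the id as a fixed string; -q: only the exit status matters.
    if grep -q -F -- "$container_id" "$dir/cgroup" 2>/dev/null; then
      printf '%s\n' "${dir##*/}"
    fi
  done
}
```

Calling pids_for_container /proc "$container_id" on a node would list only the processes that belong to that container's cgroup, instead of every process on the node.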

The spec tests are passing.

Refs: cnti-testcatalog#2084 - Reduces resource iteration by only checking the container processes. - Also introduced changes to the cluster_tools and k8s_kernel_introspection libraries. Signed-off-by: svteb <[email protected]>

martin-mat

lgtm

kosstennbl

lgtm

kosstennbl · 2024-08-06T12:23:58Z

src/tasks/workload/microservice.cr

+              Log.for(t.name).info { "multiple proc types detected verified: #{verified}" }
+              fail_msg = "resource: #{resource} has more than one process type (#{container_proctree_statuses.map { |x| x["cmdline"]? }.compact.uniq.join(", ")})"
+              unless fail_msgs.find { |x| x == fail_msg }
+                puts fail_msg.colorize(:red)


That could be changed to stdout_error, but since CI is now fully successful, we can probably do that for the whole testsuite in another PR.

svteb requested review from denverwilliams and agentpoyo as code owners July 17, 2024 09:18

martin-mat requested a review from kosstennbl July 17, 2024 13:31

svteb marked this pull request as draft July 18, 2024 06:20

svteb force-pushed the process_check branch 4 times, most recently from ff9e3a1 to 2df2ce9 July 22, 2024 07:55

kosstennbl reviewed Jul 22, 2024

svteb force-pushed the process_check branch from 2df2ce9 to b72057e July 23, 2024 09:36

This was referenced Jul 23, 2024

Feat: Allow process filtering by container id cnf-testsuite/k8s_kernel_introspection#4

Merged

Feat: Add option to filter processes by both containers and nodes cnf-testsuite/cluster_tools#27

Merged

svteb force-pushed the process_check branch 3 times, most recently from 3d06250 to b161a6d July 28, 2024 18:15

Fix: Optimize single_process_type execution

0d1a76b

svteb force-pushed the process_check branch from b161a6d to 0d1a76b July 28, 2024 19:17

martin-mat added 3 commits August 2, 2024 16:48

Update shard.lock

a53295a

Update shard.yml

728001a

Update shard.yml

e374a09

martin-mat marked this pull request as ready for review August 2, 2024 14:52

martin-mat requested review from kosstennbl, taylor, martin-mat, rich-l, horecoli, collivier and Smitholi67 August 2, 2024 14:52

martin-mat approved these changes Aug 6, 2024

kosstennbl approved these changes Aug 6, 2024

kosstennbl reviewed Aug 6, 2024

taylor approved these changes Aug 6, 2024

martin-mat merged commit 66752b7 into cnti-testcatalog:main Aug 6, 2024
87 checks passed
