Fix: Optimize single_process_type execution #2110

Merged
merged 4 commits into from
Aug 6, 2024


@svteb svteb commented Jul 17, 2024

Description

The single_process_type / process_check spec tests took 1:30:00 on average to execute in GitHub Actions; this has been reduced to around 10-15 minutes. The slowdown was caused by incorrect use of libraries, which checked unnecessary processes multiple times.

These two changes need to be merged first:
cluster_tools change: #27
k8s_kernel_introspection change: #4

Because this pull request involves changes in two libraries, it is rather difficult to verify through Actions. The current changes to the shards.yaml and shards.lock files will be removed once the reviewers approve the change and the individual library pull requests are merged.

For reviewers:
There are two big changes. The first is the replacement of workload_resource_test with cnf_workload_resources. Both yield resource names, but workload_resource_test also yields the container names for each resource, so the same resource was yielded repeatedly, like this (for some reason it also did not return all the containers):

CNFManager.workload_resource_test(args, config) do |resource, container, initialized|

resource: {kind: "Deployment", name: "nginx-webapp", namespace: "cnfspace"}
resource: {kind: "Deployment", name: "nginx-webapp", namespace: "cnfspace"}
resource: {kind: "Deployment", name: "nginx-webapp", namespace: "cnfspace"}
resource: {kind: "Pod", name: "sidecar-container-demo", namespace: "cnfspace"}
resource: {kind: "Pod", name: "sidecar-container-demo", namespace: "cnfspace"}

optimized version:

CNFManager.cnf_workload_resources(args, config) do |resource|

resource: {kind: "Deployment", name: "nginx-webapp", namespace: "cnfspace"}
resource: {kind: "Pod", name: "sidecar-container-demo", namespace: "cnfspace"}
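
The difference in iteration count can be sketched in shell. The resource lines below are shortened from the listings above; the variable names and the use of awk for de-duplication are illustrative assumptions and are not how CNFManager works internally.

```shell
#!/bin/sh
# Illustrative only: these five lines mirror what the per-container
# iteration yielded above; this is not CNFManager code.
per_container='Deployment/nginx-webapp
Deployment/nginx-webapp
Deployment/nginx-webapp
Pod/sidecar-container-demo
Pod/sidecar-container-demo'

# Iterating per container visits every line, duplicates included.
visits_before=$(printf '%s\n' "$per_container" | wc -l | tr -d ' ')

# Iterating per resource visits each distinct resource once.
# awk keeps the first occurrence of each line and preserves order.
per_resource=$(printf '%s\n' "$per_container" | awk '!seen[$0]++')
visits_after=$(printf '%s\n' "$per_resource" | wc -l | tr -d ' ')

echo "before: $visits_before visits, after: $visits_after visits"
```

For this sample the per-container iteration makes 5 visits and the per-resource iteration makes 2.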

With this change, we no longer iterate over every resource multiple times. The second big change is the addition of PID filtering by cgroups. The previous version incorrectly checked processes on the node instead of in the container, which led to something like this (which makes no sense, considering we want to verify the processes of a specific container):

for each container:
         container_node = get_container_node()
         get_all_pids_on_node(container_node)

optimized version:

for each container:
         container_node = get_container_node()
         get_all_pids_on_container(container, container_node)

The cgroup filtering currently depends on ctr purely because it was easier to write than runc filtering. Since the current testsuite depends on the containerd runtime (ctr), this should not be too problematic and can be changed in the future.

Finally, I noticed that the cluster_tools library provides a function that is a copy of the code in single_process_check. That same function is also shared by the zombie task, which has some unique quirks I will not go into here. That shared function (ClusterTools.all_containers_by_resource?) had to be changed slightly so it would not break the zombie task. A better refactor can be considered in the future, but it is outside the scope of this pull request.

Issues:

Refs: #2084

How has this been tested:

  • Covered by existing integration testing
  • Added integration testing to cover
  • Verified all A/C passes
    • develop
    • master
    • tag/other branch
  • Test environment
    • Shared Packet K8s cluster
    • New Packet K8s cluster
    • Kind cluster
  • Have not tested

Types of changes:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Checklist:

Documentation

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • No updates required.

Code Review

  • Does the test handle fatal exceptions, ie. rescue block

Issue

  • Tasks in issue are checked off

@martin-mat martin-mat requested a review from kosstennbl July 17, 2024 13:31
@svteb svteb marked this pull request as draft July 18, 2024 06:20
@svteb svteb force-pushed the process_check branch 4 times, most recently from ff9e3a1 to 2df2ce9 July 22, 2024 07:55
@kosstennbl kosstennbl left a comment

Nice change, optimizations for this test are very welcome.
A few additional points:

  • Maybe we could use "crictl" for the k8s introspection change, for easier support of [Feature] Support multiple container engines #2103 in case it is implemented?
  • The operation of workload_resource_test and cnf_workload_resources is complicated, and it feels like the design of workload_resource_test could be a cause of multiple slowdowns in our tests. We should probably look into fixing this on a larger scale. Not in the scope of this PR, though.

src/tasks/workload/microservice.cr (outdated, resolved)
svteb commented Jul 23, 2024

Yes, after some discussions we have come to the conclusion that ctr might not be generic enough. The issue is that every container CLI tool has nuances that prevent easy access to cgroups and filtering of PIDs. crictl in particular is too high-level and does not provide access to cgroups. We've decided to use runc instead, which is a lower-level CLI tool.

Edit:
After further thought I've decided to forgo any container CLI tool and instead piped a few shell commands together, which work like this:

"find /proc -maxdepth 1 -regex '/proc/[0-9]+' -exec grep -l '#{container_id}' {}/cgroup \\; 2>/dev/null | sed -e 's,/proc/\\([0-9]*\\)/cgroup,\\1,'\""

  1. Get all process directories from /proc.
  2. Grep through each /proc/<pid>/cgroup file to check whether it contains the container_id.
  3. Return the matching PIDs.

The spec tests are passing.
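
The three steps listed above can be tried out safely against a throwaway directory that mimics /proc. This is a minimal sketch and not the code from the pull request: it swaps the find/grep/sed pipeline for a plain glob loop with grep -F for portability, and the function name pids_for_container, the container IDs, and the cgroup paths are all made up for illustration.

```shell
#!/bin/sh
# Sketch of the three steps against a fake proc tree.
# pids_for_container and all IDs below are hypothetical.

# Print the PIDs under proc root $1 whose cgroup file mentions container ID $2.
pids_for_container() {
  proc_root=$1
  container_id=$2
  # 1. Walk every entry whose name starts with a digit (PID directories).
  for dir in "$proc_root"/[0-9]*; do
    [ -f "$dir/cgroup" ] || continue
    # 2. Fixed-string match of the container ID in the cgroup file.
    if grep -Fq -- "$container_id" "$dir/cgroup" 2>/dev/null; then
      # 3. Emit the PID, which is the directory name.
      basename "$dir"
    fi
  done
}

# Fake proc tree: PIDs 101 and 102 belong to container "abc123",
# PID 200 belongs to a different container.
fake_proc=$(mktemp -d)
mkdir "$fake_proc/101" "$fake_proc/102" "$fake_proc/200"
echo '0::/kubepods/pod1/abc123' > "$fake_proc/101/cgroup"
echo '0::/kubepods/pod1/abc123' > "$fake_proc/102/cgroup"
echo '0::/kubepods/pod2/def456' > "$fake_proc/200/cgroup"

matched=$(pids_for_container "$fake_proc" abc123 | sort -n | xargs)
other=$(pids_for_container "$fake_proc" def456 | xargs)
echo "abc123: $matched"
echo "def456: $other"
rm -rf "$fake_proc"
```

Only the PIDs whose cgroup file names the requested container are returned, which is the point of filtering per container rather than listing every process on the node.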

Refs: cnti-testcatalog#2084
- Reduces resource iteration by only checking the container processes.
- Also introduces changes to the cluster_tools and k8s_kernel_introspection libraries.

Signed-off-by: svteb <[email protected]>
@martin-mat martin-mat marked this pull request as ready for review August 2, 2024 14:52
@martin-mat martin-mat left a comment

lgtm

@kosstennbl kosstennbl left a comment

lgtm

Log.for(t.name).info { "multiple proc types detected verified: #{verified}" }
fail_msg = "resource: #{resource} has more than one process type (#{container_proctree_statuses.map { |x| x["cmdline"]? }.compact.uniq.join(", ")})"
unless fail_msgs.find { |x| x == fail_msg }
puts fail_msg.colorize(:red)
That could be changed to stdout_error, but since CI is fully successful now, we can probably do that across the whole testsuite in another PR.

@martin-mat martin-mat merged commit 66752b7 into cnti-testcatalog:main Aug 6, 2024
87 checks passed