Skip to content

Conversation

@rbaturov
Copy link
Contributor

Kernel parameters that are not supported on RT kernel systems are being applied, causing errors to be logged to the tuned daemon, resulting in a degraded profile state.
Therefore, I added the openshift-node-performance-rt profile that would be included if an RT kernel is detected, thereby dropping the unsupported kernel parameters before they are applied.

@openshift-ci-robot openshift-ci-robot added the jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. label Feb 14, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 14, 2024
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Feb 14, 2024
@openshift-ci-robot
Copy link
Contributor

@rbaturov: This pull request references Jira Issue OCPBUGS-23167, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Kernel parameters that are not supported on RT kernel systems are being applied, causing errors to be logged to the tuned daemon, resulting in a degraded profile state.
Therefore, I added the openshift-node-performance-rt profile that would be included if an RT kernel is detected, thereby dropping the unsupported kernel parameters before they are applied.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

return nil, err
}
name := components.GetComponentName(profile.Name, components.ProfileNamePerformance)
RealTimeKernelProfileName := components.ProfileNamePerformanceRT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to use the GetComponentName method here too, otherwise two perf profiles will try to create the same profile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get why two perf profiles will try to create the same profile.
Anyway I tried to do that before, what happend was that the tuned profile got posted to the cluster with the name of the profile, i. e "-performance" suffix (openshift-node-performance-rt-performance) while the include section is looking for hardcoded "openshift-node-performance-rt"
2024-02-14 10:49:19,013 ERROR tuned.daemon.daemon: Cannot set initial profile. No tunings will be enabled: Cannot load profile(s) 'openshift-node-performance-performance': Cannot find profile 'openshift-node-performance-rt' in '['/etc/tuned', '/usr/lib/tuned']'.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think about the flow when you have two PerformanceProfiles in a cluster:

Each performance profile generates a new Tuned object that contains the tuned profiles. So two performance profiles result in two Tuned objects that will both contain the same openshift-node-performance-rt tuned profile. This results in NTO trying to save both of the defined profiles into the same filename.

You are right the include will have to be improved to refer to the proper name. Template variable with the perf profile name can do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed that.

@rbaturov rbaturov force-pushed the add-performance-rt-profile branch 2 times, most recently from bcbe9fb to 9f6360c Compare February 15, 2024 08:54
Copy link
Contributor

@MarSik MarSik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 15, 2024
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MarSik, rbaturov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 15, 2024
@MarSik
Copy link
Contributor

MarSik commented Feb 15, 2024

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Feb 15, 2024
@openshift-ci-robot
Copy link
Contributor

@MarSik: This pull request references Jira Issue OCPBUGS-23167, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 15, 2024
@rbaturov rbaturov force-pushed the add-performance-rt-profile branch 2 times, most recently from 5d0d263 to aba67d3 Compare February 18, 2024 20:00
rendersync bootstrap-cluster/performance pinned-cluster/default bootstrap/no-mcp
rendersync bootstrap-cluster/performance pinned-cluster/default bootstrap-cluster/extra-mcp bootstrap/extra-mcp No newline at end of file
rendersync bootstrap-cluster/performance pinned-cluster/default bootstrap-cluster/extra-mcp bootstrap/extra-mcp
rendersync base/performance manual-cluster/performance no-ref
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the no-ref set should not include any reference label, we need to parametrize the render to deal with this.
I'm not sure why we are adding the render-sync invocations in this PR btw.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will issue a different PR that would parametrize the render to deal with this.

@rbaturov
Copy link
Contributor Author

/hold depends on this PR to get merged

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 20, 2024
Kernel parameters that are not supported on RT kernel systems are being applied, causing errors to be logged to the tuned daemon, resulting in a degraded profile state.
Therefore, I added the openshift-node-performance-rt profile that would be included if an RT kernel is detected, thereby dropping the unsupported kernel parameters before they are applied.
This could be great to have a generic test that make sure that after
performance profile is applied the tuned profile is not degraded.
This would prevent future issues like OCPBUGS-23167.
Updated render-sync to include artifacts that are needed for the e2e tests.
Committing the rendered items to this commit.
@rbaturov rbaturov force-pushed the add-performance-rt-profile branch from aba67d3 to ba51af0 Compare February 20, 2024 18:59
@rbaturov rbaturov changed the title WIP: OCPBUGS-23167: Add performance real time tuned template OCPBUGS-23167: Add performance real time tuned template Feb 20, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 20, 2024
@rbaturov
Copy link
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 20, 2024
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 20, 2024

@rbaturov: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a good direction, few comments inside

}
})

It("Tuned profile shouldn't be degraded", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not be degraded after what? in this test you check a condition (this part LGTM) but there is no obvious link to how to get to this state. IOW, does this test depend on a specific order? which part of the system creates the conditions we're checking?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not be degraded after tuned profile has been applied to the system.
The object will be degraded if error messages are observed by the tuned daemon when applying the TuneD daemon profile.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ffromani Any error in the tuned log causes the state to switch to degraded. This fix removes a typical case - a sysctl that does not exist for the given kernel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, and what in the test causes any change in the system which could lead to Degraded state?
After reading the surrounding tests (something that could have pointed out earlier) I see other tests do in this debatable way, so this is a more general issue which is out of the scope here.

rendersync bootstrap-cluster/performance pinned-cluster/default bootstrap-cluster/extra-mcp bootstrap/extra-mcp No newline at end of file
rendersync bootstrap-cluster/performance pinned-cluster/default bootstrap-cluster/extra-mcp bootstrap/extra-mcp
rendersync --owner-ref none -- base/performance manual-cluster/performance no-ref
rendersync --owner-ref none -- base/performance manual-cluster/cpuFrequency default/cpuFrequency No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we want owner reference in this case? it should be only no-ref without the reference (hence the name)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This syncs the manifests for this test https://github.com/openshift/cluster-node-tuning-operator/blob/master/test/e2e/performanceprofile/functests-render-command/1_render_command/render_test.go#L115

And it uses no owner references. Probably copy&paste. I believe we should default to having the references most of the time, but I do not think we should fix that as part of this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 21, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 19686cd into openshift:master Feb 21, 2024
@openshift-ci-robot
Copy link
Contributor

@rbaturov: Jira Issue OCPBUGS-23167: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-23167 has been moved to the MODIFIED state.

Details

In response to this:

Kernel parameters that are not supported on RT kernel systems are being applied, causing errors to be logged to the tuned daemon, resulting in a degraded profile state.
Therefore, I added the openshift-node-performance-rt profile that would be included if an RT kernel is detected, thereby dropping the unsupported kernel parameters before they are applied.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@yanirq
Copy link
Contributor

yanirq commented Feb 21, 2024

/cherry-pick release-4.15

@openshift-cherrypick-robot

@yanirq: #954 failed to apply on top of branch "release-4.15":

Applying: Add performance real time tuned template
.git/rebase-apply/patch:49: trailing whitespace.
#The openshift-node-performance profile inherits these kernel parameters from the network-latency profile. 
warning: 1 line adds whitespace errors.
Using index info to reconstruct a base tree...
M	assets/performanceprofile/tuned/openshift-node-performance
M	pkg/performanceprofile/controller/performanceprofile/components/tuned/tuned.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/performanceprofile/controller/performanceprofile/components/tuned/tuned.go
CONFLICT (content): Merge conflict in pkg/performanceprofile/controller/performanceprofile/components/tuned/tuned.go
Auto-merging assets/performanceprofile/tuned/openshift-node-performance
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Add performance real time tuned template
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.16.0-202402211939.p0.g19686cd.assembly.stream.el9 for distgit cluster-node-tuning-operator.
All builds following this will include this PR.

@openshift-merge-robot
Copy link
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-02-22-021321

@rbaturov
Copy link
Contributor Author

/cherry-pick release-4.15

@openshift-cherrypick-robot

@rbaturov: #954 failed to apply on top of branch "release-4.15":

Applying: Add performance real time tuned template
.git/rebase-apply/patch:49: trailing whitespace.
#The openshift-node-performance profile inherits these kernel parameters from the network-latency profile. 
warning: 1 line adds whitespace errors.
Using index info to reconstruct a base tree...
M	assets/performanceprofile/tuned/openshift-node-performance
M	pkg/performanceprofile/controller/performanceprofile/components/tuned/tuned.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/performanceprofile/controller/performanceprofile/components/tuned/tuned.go
CONFLICT (content): Merge conflict in pkg/performanceprofile/controller/performanceprofile/components/tuned/tuned.go
Auto-merging assets/performanceprofile/tuned/openshift-node-performance
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Add performance real time tuned template
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

rbaturov added a commit to rbaturov/cluster-node-tuning-operator that referenced this pull request Mar 7, 2024
* Add performance real time tuned template

Kernel parameters that are not supported on RT kernel systems are being applied, causing errors to be logged to the tuned daemon, resulting in a degraded profile state.
Therefore, I added the openshift-node-performance-rt profile that would be included if an RT kernel is detected, thereby dropping the unsupported kernel parameters before they are applied.

* Added a test that make sure the tuned profile is not degraded

This could be great to have a generic test that make sure that after
performance profile is applied the tuned profile is not degraded.
This would prevent future issues like OCPBUGS-23167.

* render-sync update

Updated render-sync to include artifacts that are needed for the e2e tests.
Committing the rendered items to this commit.
rbaturov added a commit to rbaturov/cluster-node-tuning-operator that referenced this pull request Mar 7, 2024
* Add performance real time tuned template

Kernel parameters that are not supported on RT kernel systems are being applied, causing errors to be logged to the tuned daemon, resulting in a degraded profile state.
Therefore, I added the openshift-node-performance-rt profile that would be included if an RT kernel is detected, thereby dropping the unsupported kernel parameters before they are applied.

* Added a test that make sure the tuned profile is not degraded

This could be great to have a generic test that make sure that after
performance profile is applied the tuned profile is not degraded.
This would prevent future issues like OCPBUGS-23167.

* render-sync update

Updated render-sync to include artifacts that are needed for the e2e tests.
Committing the rendered items to this commit.
rbaturov added a commit to rbaturov/cluster-node-tuning-operator that referenced this pull request Mar 7, 2024
* Add performance real time tuned template

Kernel parameters that are not supported on RT kernel systems are being applied, causing errors to be logged to the tuned daemon, resulting in a degraded profile state.
Therefore, I added the openshift-node-performance-rt profile that would be included if an RT kernel is detected, thereby dropping the unsupported kernel parameters before they are applied.

* Added a test that make sure the tuned profile is not degraded

This could be great to have a generic test that make sure that after
performance profile is applied the tuned profile is not degraded.
This would prevent future issues like OCPBUGS-23167.

* render-sync update

Updated render-sync to include artifacts that are needed for the e2e tests.
Committing the rendered items to this commit.
rbaturov added a commit to rbaturov/cluster-node-tuning-operator that referenced this pull request Mar 7, 2024
* Add performance real time tuned template

Kernel parameters that are not supported on RT kernel systems are being applied, causing errors to be logged to the tuned daemon, resulting in a degraded profile state.
Therefore, I added the openshift-node-performance-rt profile that would be included if an RT kernel is detected, thereby dropping the unsupported kernel parameters before they are applied.

* Added a test that make sure the tuned profile is not degraded

This could be great to have a generic test that make sure that after
performance profile is applied the tuned profile is not degraded.
This would prevent future issues like OCPBUGS-23167.

* render-sync update

Updated render-sync to include artifacts that are needed for the e2e tests.
Committing the rendered items to this commit.

Signed-off-by: rbaturov <[email protected]>
rbaturov added a commit to rbaturov/cluster-node-tuning-operator that referenced this pull request Mar 19, 2024
* Add performance real time tuned template

Kernel parameters that are not supported on RT kernel systems are being applied, causing errors to be logged to the tuned daemon, resulting in a degraded profile state.
Therefore, I added the openshift-node-performance-rt profile that would be included if an RT kernel is detected, thereby dropping the unsupported kernel parameters before they are applied.

* Added a test that make sure the tuned profile is not degraded

This could be great to have a generic test that make sure that after
performance profile is applied the tuned profile is not degraded.
This would prevent future issues like OCPBUGS-23167.

* render-sync update

Updated render-sync to include artifacts that are needed for the e2e tests.
Committing the rendered items to this commit.

Signed-off-by: rbaturov <[email protected]>
Tal-or pushed a commit to Tal-or/cluster-node-tuning-operator that referenced this pull request Apr 1, 2024
* Add performance real time tuned template

Kernel parameters that are not supported on RT kernel systems are being applied, causing errors to be logged to the tuned daemon, resulting in a degraded profile state.
Therefore, I added the openshift-node-performance-rt profile that would be included if an RT kernel is detected, thereby dropping the unsupported kernel parameters before they are applied.

* Added a test that make sure the tuned profile is not degraded

This could be great to have a generic test that make sure that after
performance profile is applied the tuned profile is not degraded.
This would prevent future issues like OCPBUGS-23167.

* render-sync update

Updated render-sync to include artifacts that are needed for the e2e tests.
Committing the rendered items to this commit.
openshift-merge-bot bot pushed a commit that referenced this pull request Apr 4, 2024
…d template (#984)

* Backport: Add performance real time tuned template (#954)

* Add performance real time tuned template

Kernel parameters that are not supported on RT kernel systems are being applied, causing errors to be logged to the tuned daemon, resulting in a degraded profile state.
Therefore, I added the openshift-node-performance-rt profile that would be included if an RT kernel is detected, thereby dropping the unsupported kernel parameters before they are applied.

* Added a test that make sure the tuned profile is not degraded

This could be great to have a generic test that make sure that after
performance profile is applied the tuned profile is not degraded.
This would prevent future issues like OCPBUGS-23167.

* render-sync update

Updated render-sync to include artifacts that are needed for the e2e tests.
Committing the rendered items to this commit.

Signed-off-by: rbaturov <[email protected]>

* NO-JIRA: Update tuned profile degraded test (#1005)

* Added isVM func to util

Signed-off-by: Ronny Baturov <[email protected]>

* Update tuned profile degraded test

Updated the tuned degradation test to fail only when the tuned profile was found degraded
on BM host.
In case of tuned profile found degraded on a VM, only a warning would be reported in the logs.
The reason for that is that the fact CI is using VMs - and some
configurations like certain kernel args can't be applied and therby lead to
the tuned profile being degraded.

Signed-off-by: Ronny Baturov <[email protected]>

---------

Signed-off-by: Ronny Baturov <[email protected]>

---------

Signed-off-by: rbaturov <[email protected]>
Signed-off-by: Ronny Baturov <[email protected]>
rbaturov added a commit to rbaturov/cluster-node-tuning-operator that referenced this pull request Apr 9, 2024
…d template (openshift#984)

* Backport: Add performance real time tuned template (openshift#954)

* Add performance real time tuned template

Kernel parameters that are not supported on RT kernel systems are being applied, causing errors to be logged to the tuned daemon, resulting in a degraded profile state.
Therefore, I added the openshift-node-performance-rt profile that would be included if an RT kernel is detected, thereby dropping the unsupported kernel parameters before they are applied.

* Added a test that make sure the tuned profile is not degraded

This could be great to have a generic test that make sure that after
performance profile is applied the tuned profile is not degraded.
This would prevent future issues like OCPBUGS-23167.

* render-sync update

Updated render-sync to include artifacts that are needed for the e2e tests.
Committing the rendered items to this commit.

Signed-off-by: rbaturov <[email protected]>

* NO-JIRA: Update tuned profile degraded test (openshift#1005)

* Added isVM func to util

Signed-off-by: Ronny Baturov <[email protected]>

* Update tuned profile degraded test

Updated the tuned degradation test to fail only when the tuned profile was found degraded
on BM host.
In case of tuned profile found degraded on a VM, only a warning would be reported in the logs.
The reason for that is that the fact CI is using VMs - and some
configurations like certain kernel args can't be applied and therby lead to
the tuned profile being degraded.

Signed-off-by: Ronny Baturov <[email protected]>

---------

Signed-off-by: Ronny Baturov <[email protected]>

---------

Signed-off-by: rbaturov <[email protected]>
Signed-off-by: Ronny Baturov <[email protected]>
openshift-merge-bot bot pushed a commit that referenced this pull request Apr 17, 2024
…d template (#984) (#1025)

* Backport: Add performance real time tuned template (#954)

* Add performance real time tuned template

Kernel parameters that are not supported on RT kernel systems are being applied, causing errors to be logged to the tuned daemon, resulting in a degraded profile state.
Therefore, I added the openshift-node-performance-rt profile that would be included if an RT kernel is detected, thereby dropping the unsupported kernel parameters before they are applied.

* Added a test that make sure the tuned profile is not degraded

This could be great to have a generic test that make sure that after
performance profile is applied the tuned profile is not degraded.
This would prevent future issues like OCPBUGS-23167.

* render-sync update

Updated render-sync to include artifacts that are needed for the e2e tests.
Committing the rendered items to this commit.



* NO-JIRA: Update tuned profile degraded test (#1005)

* Added isVM func to util



* Update tuned profile degraded test

Updated the tuned degradation test to fail only when the tuned profile was found degraded
on BM host.
In case of tuned profile found degraded on a VM, only a warning would be reported in the logs.
The reason for that is that the fact CI is using VMs - and some
configurations like certain kernel args can't be applied and therby lead to
the tuned profile being degraded.



---------



---------

Signed-off-by: rbaturov <[email protected]>
Signed-off-by: Ronny Baturov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants