Skip to content

Conversation

@haircommander
Copy link
Member

@haircommander haircommander commented Jan 28, 2020

- What I did
Added a ctrcfg e2e test, to hopefully begin to prevent regressions
To get this test to pass, some changes were also needed:

  • refactor some of mcd_test.go (and move to utils_test.go) to not duplicate code between it and ctrcfg_test.go
  • pull in Bug 1794493: Update crio.yaml templates with correct values  #1405, because without it the test fails
  • revendor cri-o to have an updated config structure
  • change a check on LogSizeMax which prevents an empty value ending up in the crio.conf

- How to verify it
go test -v -run="TestContainerRuntimeConfigPidsLimit" ./test/e2e/

- Description for the changelog
added simple e2e test for ContainerRuntimeConfigs
revendor cri-o to have an updated config structure

Note: this is WIP because I haven't seen the test pass yet (as I was working without the required patch). Instead of waiting for a cluster to provision manually, I'll now start letting prow do it. I also want to refactor the changes a bit

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 28, 2020
@runcom
Copy link
Member

runcom commented Jan 28, 2020

awesome 💪

@runcom
Copy link
Member

runcom commented Jan 28, 2020

Added a ctrcfg e2e test, to hopefully begin to prevent regressions with it. also pull in #1405, because without it the test fails

I'd just close #1405 and retitle this PR to account for the Bug as well

@runcom
Copy link
Member

runcom commented Jan 29, 2020

uhm, not sure yet if failures in op tests are fully related

/retest

@runcom
Copy link
Member

runcom commented Jan 29, 2020

uhm I think e2e-gcp-op failures are now related

@haircommander
Copy link
Member Author

updated to have the verification the rollback worked, as well as a bit more debugging logs to test the gcp-op problems

@runcom
Copy link
Member

runcom commented Jan 29, 2020

/skip

@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 29, 2020
@haircommander
Copy link
Member Author

/retest

1 similar comment
@runcom
Copy link
Member

runcom commented Jan 30, 2020

/retest

@runcom runcom changed the title WIP: add ctrcfg e2e test WIP: Bug 1794493: add ctrcfg e2e test Jan 30, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jan 30, 2020
@openshift-ci-robot
Copy link
Contributor

@haircommander: This pull request references Bugzilla bug 1794493, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

WIP: Bug 1794493: add ctrcfg e2e test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@runcom
Copy link
Member

runcom commented Jan 30, 2020

/retest

@runcom
Copy link
Member

runcom commented Jan 30, 2020

e2e-gcp-op still related to this now that we're back with that

@haircommander
Copy link
Member Author

yeah I've observed similar things when testing on my own. somehow the LogSizeMax value is being set to 0, and cri-o complains. Still working on it...

@haircommander
Copy link
Member Author


level=error msg="Error: Error applying IAM policy to project \"openshift-gce-devel-ci\": Too many conflicts.  Latest error: Error setting IAM policy for project \"openshift-gce-devel-ci\": googleapi: Error 409: There were concurrent policy changes. Please retry the whole read-modify-write with exponential backoff., aborted"
level=error

/retest

Copy link
Contributor

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test LGTM (to the limited extent I understand MCO).

WRT explicitly setting all the CRI-O options:

  • Neither this nor #1405 provides any rationale. Is there a consensus on this being the right thing?
  • If it is necessary to explicitly set all of those options, is there a process in place to keep updating the templates as new options are added, or defaults change?

@haircommander
Copy link
Member Author

haircommander commented Jan 30, 2020

* Neither this nor #1405 provides any rationale. Is there a consensus on this being the right thing?

agreed, I am still working on figuring out all of the failures. Upon this being the final version (still wip), I'll update it for explainations, but as I can see it:
Right now, pkg/controller/container-runtime-config/* defines a custom struct built off of the crio config. This struct is needs to have all of the options populated (for now, because we base off of a single config file). However, if the template doesn't define a default value for a value, that value will be evaluated as empty ("', 0, [], etc) after we render the ctrcfg. The problem is CRI-O can't discern whether that empty value is a mistake or purposeful (for instance, we set stream_address as empty on purpose). That's the first problem that is being fixed.

The second problem is because we are using an outdated version of the CRI-O config, and there are now some values that, when populated, behave better (version file, for instance). So the vendor bump is to fix that.

Finally, there's some weirdness with LogSizeMax. Somehow, it ends up as 0 after an update is applied. I still haven't gotten to the bottom of that, and once I do I'll add a test for each of the options, update the PR description, and remove the WIP

* If it is necessary to explicitly set all of those options, is there a process in place to keep updating the templates as new options are added, or defaults change?

For now, it is necessary and there's no process. However, there's light at the end of the tunnel. What we really should do is utilize drop in config files and define a structure that is a subset of the options that openshift needs (only the ones we care about configuring). When a ctrcfg comes in, we update that file with its values (keeping most out of the config, and thus most defaulting to the system crio.conf)

@mtrmac
Copy link
Contributor

mtrmac commented Jan 30, 2020

* If it is necessary to explicitly set all of those options, is there a process in place to keep updating the templates as new options are added, or defaults change?

For now, it is necessary and there's no process. However, there's light at the end of the tunnel. What we really should do is utilize drop in config files and define a structure that is a subset of the options that openshift needs (only the ones we care about configuring).

That would be great, and is a good reason not to invest too much in the current hardcode-all-default design — if the drop in solution can happen soon enough.

If there’s a risk that this will persist over a CRI-O major release update (1.y to 1.y+1), it seems desirable to harden the MCO implementation, maybe to at build time call https://github.com/cri-o/cri-o/blob/eb14e0feb01971bf88b2591b744b64dd60272fe5/pkg/config/config.go#L485 to generate the templates.

Or maybe do that in updateCRIOConfig: start with the return value of DefaultConfig() (and apply any MCO-specific hard-coded edits in the code) instead of reading and decoding a template file.

@haircommander
Copy link
Member Author

 === RUN   TestContainerRuntimeConfigPidsLimit
--- PASS: TestContainerRuntimeConfigPidsLimit (215.71s) 

😄
PTAL @kikisdeliveryservice @runcom @umohnani8 @mtrmac

@openshift-ci-robot
Copy link
Contributor

@haircommander: This pull request references Bugzilla bug 1794493, which is valid.

Details

In response to this:

Bug 1794493: add ctrcfg e2e test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

2 similar comments
@openshift-ci-robot
Copy link
Contributor

@haircommander: This pull request references Bugzilla bug 1794493, which is valid.

Details

In response to this:

Bug 1794493: add ctrcfg e2e test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

@haircommander: This pull request references Bugzilla bug 1794493, which is valid.

Details

In response to this:

Bug 1794493: add ctrcfg e2e test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@runcom
Copy link
Member

runcom commented Feb 5, 2020

there seems to still be some kind of issue/race right? 🤔

@haircommander
Copy link
Member Author

there seems to still be some kind of issue/race right?

I don't know if it's an "issue". one could call the sleeps a bit hacky, but I don't know of a better way to make sure the operation has made it to the openshift api-server without them.

@runcom
Copy link
Member

runcom commented Feb 5, 2020

I don't know if it's an "issue". one could call the sleeps a bit hacky, but I don't know of a better way to make sure the operation has made it to the openshift api-server without them.

oh no sorry, I meant to say the gpg-op test is still failing with something that I think it's related to this change uhm

umohnani8 and others added 4 commits February 6, 2020 11:01
This reverts commit 69025e8.
as well as updates crio.yaml templates with correct values and fixes whitespace

Signed-off-by: Urvashi Mohnani <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
before, we weren't properly checking LogSizeMax was an empty value. As such, we incorrectly merged the config changes, which bricked the runtime when LogSizeMax wasn't defined.

Fix this by using LogSizeMax uniformly by checking against its Value() method

Signed-off-by: Peter Hunt <[email protected]>
this change includes:
add test/e2e/ctrcfg_test that includes three basic tests for deploying a ctrcfg and seeing it correctly changed the runtime configuration
move/refactor some code from test/e2e/mcd_test to test/e2e/utils_test to prevent duplication between the two files, while also being clear about what tests use what (utils currently houses functions used in both mcd and ctrcfg tests)

Signed-off-by: Peter Hunt <[email protected]>
@haircommander
Copy link
Member Author

Throttling: Rate exceeded\n

/retest

@haircommander
Copy link
Member Author

🎉

ok now it's ready (assuming the aws failure is unrelated, which I think it is?)

@runcom @kikisdeliveryservice @umohnani8 @mtrmac PTAL

@haircommander
Copy link
Member Author

/retest

@haircommander
Copy link
Member Author

/retest

🤞

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Feb 6, 2020

@haircommander: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-scaleup-rhel7 ff58ebb link /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@runcom
Copy link
Member

runcom commented Feb 7, 2020

ok now it's ready (assuming the aws failure is unrelated, which I think it is?)

🎉 yep, it looks like it's ok to merge - we'll fix any flake afterwards :trollface:

@runcom
Copy link
Member

runcom commented Feb 7, 2020

/skip
/approve
/lgtm

we need a backport for this for 4.3 as well right? @umohnani8 @haircommander

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 7, 2020
@openshift-merge-robot openshift-merge-robot merged commit ac44c3f into openshift:master Feb 7, 2020
@haircommander haircommander deleted the crio-config-test branch February 7, 2020 13:29
@haircommander
Copy link
Member Author

/skip
/approve
/lgtm

we need a backport for this for 4.3 as well right? @umohnani8 @haircommander

yep, backport here #1447

@runcom
Copy link
Member

runcom commented Feb 10, 2020

/bugzilla refresh

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants