Skip to content

Conversation

@mdbooth
Copy link
Contributor

@mdbooth mdbooth commented Feb 9, 2023

The resource and machine tickers are responsible for fetching logs every 5 and 10 seconds respectively during a test run. Each fetch overwrites the previous one. These are problematic because they cause failure state to be overwritten by logs captured during deletion. That is, if a test ends in failure, we continue to capture these logs regularly during cleanup. This causes the test failure state to be overwritten by the state during cleanup.

We already capture these resources after each test regardless of success or failure, so regular capture during the test is redundant. This change simply removes the tickers. We still get the same logs when we execute DumpSpecResourcesAndCleanup() after each test.

Additionally, if there is an error during cleanup we optionally capture that state too, but to a separate directory.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 9, 2023
@k8s-ci-robot k8s-ci-robot requested a review from lentzi90 February 9, 2023 16:16
@netlify
Copy link

netlify bot commented Feb 9, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Name Link
🔨 Latest commit 825ce43
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/63ee1c5658f34d0008081ff2
😎 Deploy Preview https://deploy-preview-1471--kubernetes-sigs-cluster-api-openstack.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 9, 2023
@jichenjc
Copy link
Contributor

should we make some error on purpose to prove those files are correctly dumped and get a sample output
then put those files into dev guide for further reference?

Copy link
Contributor

@lentzi90 lentzi90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! The logs are much cleaner with this!
One thing I have noticed (and I don't think it is anything new) is that we don't seem to be able to dump the openstack server instances. Maybe we could fix that at the same time?

Logs look like this:

   [2023-02-09T18:25:48Z] Dumping all OpenStack server instances in the "e2e-pe41vo" namespace
  error getting internal ip for server cluster-e2e-dajzgq-control-plane-7kf5p: internal ip doesn't exist (yet)
  error getting internal ip for server cluster-e2e-dajzgq-bastion: internal ip doesn't exist (yet) 

Perhaps there is something wrong with how we try to get the IP here

ip := instanceNS.IP(openStackCluster.Status.Network.Name)
if ip == "" {
_, _ = fmt.Fprintf(GinkgoWriter, "error getting internal ip for server %s: internal ip doesn't exist (yet)\n", srv.Name)

Any clues?

@mdbooth
Copy link
Contributor Author

mdbooth commented Feb 10, 2023

should we make some error on purpose to prove those files are correctly dumped and get a sample output then put those files into dev guide for further reference?

Yes, definitely. It's on my todo list to prove that errors during delete are captured. I haven't even tested it manually yet.

@mdbooth
Copy link
Contributor Author

mdbooth commented Feb 16, 2023

Nice! The logs are much cleaner with this! One thing I have noticed (and I don't think it is anything new) is that we don't seem to be able to dump the openstack server instances. Maybe we could fix that at the same time?

Logs look like this:

   [2023-02-09T18:25:48Z] Dumping all OpenStack server instances in the "e2e-pe41vo" namespace
  error getting internal ip for server cluster-e2e-dajzgq-control-plane-7kf5p: internal ip doesn't exist (yet)
  error getting internal ip for server cluster-e2e-dajzgq-bastion: internal ip doesn't exist (yet) 

Perhaps there is something wrong with how we try to get the IP here

ip := instanceNS.IP(openStackCluster.Status.Network.Name)
if ip == "" {
_, _ = fmt.Fprintf(GinkgoWriter, "error getting internal ip for server %s: internal ip doesn't exist (yet)\n", srv.Name)

Any clues?

I've also wondered about this. It's on my TODO list to investigate properly.

The resource and machine tickers are responsible for fetching logs every
5 and 10 seconds respectively during a test run. Each fetch overwrites
the previous one. These are problematic because they cause failure state
to be overwritten by logs captured during deletion. That is, if a test
ends in failure, we continue to capture these logs regularly during
cleanup. This causes the test failure state to be overwritten by the
state during cleanup.

We already capture these resources after each test regardless of success
or failure, so regular capture during the test is redundant. This change
simply removes the tickers. We still get the same logs when we execute
DumpSpecResourcesAndCleanup() after each test.

Additionally, if there is an error during cleanup we optionally capture
that state too, but to a separate directory.
@mdbooth
Copy link
Contributor Author

mdbooth commented Feb 16, 2023

/test pull-cluster-api-provider-openstack-e2e-test

@mdbooth
Copy link
Contributor Author

mdbooth commented Feb 16, 2023

We correctly dumped the error on deletion in:

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/1471/pull-cluster-api-provider-openstack-e2e-test/1626172319116300288

Note that there is an additional directory created for the deletion failure:

https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/1471/pull-cluster-api-provider-openstack-e2e-test/1626172319116300288/artifacts/clusters/bootstrap/resources/deletion-failure/e2e-ptzr4v/

Incidentally, I deliberately added the failure after cleanup had succeeded, so the contents of that directory are 'leaked' objects.

@mdbooth
Copy link
Contributor Author

mdbooth commented Feb 16, 2023

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 16, 2023
@mdbooth
Copy link
Contributor Author

mdbooth commented Feb 16, 2023

/retest-required

Copy link
Contributor

@lentzi90 lentzi90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 16, 2023
@k8s-ci-robot k8s-ci-robot merged commit 77cdd50 into kubernetes-sigs:main Feb 16, 2023
@mdbooth mdbooth deleted the tickers branch February 16, 2023 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants