Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leverage logstream to get better e2e diagnostic logs. #3388

Merged
merged 4 commits into from
Oct 14, 2020

Conversation

mattmoor
Copy link
Member

Changes

logstream is a package that we use in knative to inline the system component logs relevant to a particular test into the test logs. It relies on tests following the auto-naming convention we use with our helper method helpers.ObjectNameForTest(t), which attaches a common prefix to resource names allowing logstream to filter the relevant log lines.

By virtue of using the genreconciler and other knative.dev/pkg controller infrastructure, each reconciler using logging.FromContext(ctx) on the context passed to ReconcileKind already gets a logger that's infused with the "key" being reconciled, but there may be other shared libraries that may require further instrumentation for this to reach its full potential.

I'm instrumenting several tests where I have seen downstream flakes as a start, and hopefully this illuminated why those tests are flaking by shining a light on controller activities during the test's execution.

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

NONE

/assign @imjasonh @vdemeester @afrittoli

logstream is a package that we use in knative to inline the system component logs relevant to a particular test into the test logs.  It relies on tests following the auto-naming convention we use with our helper method `helpers.ObjectNameForTest(t)`, which attaches a common prefix to resource names allowing logstream to filter the relevant log lines.

By virtue of using the `genreconciler` and other knative.dev/pkg controller infrastructure, each reconciler using `logging.FromContext(ctx)` on the context passed to `ReconcileKind` already gets a logger that's infused with the "key" being reconciled, but there may be other shared libraries that may require further instrumentation for this to reach its full potential.

I'm instrumenting several tests where I have seen downstream flakes as a start, and hopefully this illuminated why those tests are flaking by shining a light on controller activities during the test's execution.
@tekton-robot tekton-robot added the release-note-none Denotes a PR that doesnt merit a release note. label Oct 14, 2020
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 14, 2020
@mattmoor
Copy link
Member Author

I may tack on some additional flaky tests here, but I wanted to stage one at a time to check that e2e tests pass (and I'm not making any bad assumptions)

@mattmoor
Copy link
Member Author

/kind feature

@tekton-robot tekton-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 14, 2020
@mattmoor
Copy link
Member Author

/test check-pr-has-kind-label

@mattmoor
Copy link
Member Author

First test passed, adding timeout_test.go as well.

@imjasonh
Copy link
Member

/lgtm

This looks really cool! I think it's going to make debugging flakes a lot simpler, by moving all the relevant information into the same place.

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 14, 2020
@mattmoor
Copy link
Member Author

Yeah, the intent of it was to try and pre-filter, join and interleave all of the relevant logs into t.Log for tests.

It can mean large logs, but it is generally pretty useful.

@mattmoor
Copy link
Member Author

I have a couple more queued up that I've seen flake downstream in the last ~24 hours: cancel_test.go and pipelinefinally_test.go.

I'll push a commit for those once the above comes back clean 🤞

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 14, 2020
@mattmoor
Copy link
Member Author

Next one passed. I've pushed two more commits with two more tests. I'll stop there, so we can hopefully checkpoint things and get some better diagnostics from failures 🤞

@imjasonh
Copy link
Member

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 14, 2020
Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, thank you!
The integration job is still running, once it's done I'll be able to see the difference :)
If we decide to settle for this it would be good to start documenting best practices when writing tests so that new tests will fit this approach.

@mattmoor
Copy link
Member Author

@afrittoli yup, for once we documented it in PKG 😅

https://github.com/knative/pkg/blob/master/test/logstream/README.md

@mattmoor
Copy link
Member Author

mattmoor commented Oct 14, 2020

Here's a sample log line from a serving e2e flake:

kubelogs.go:197: I 01:25:17.385 controller-588cb74895-r25bv [configuration-controller] [serving-tests/scale-to-n-scale-200-065-of-200-ekqhpiww] Reconcile succeeded. Time taken: 11.127µs

The format is {level} {timestamp} {pod-name} [{controller name}] [{resource key}] {log-line}

Full test output (that I picked that from): https://prow.knative.dev/view/gcs/knative-prow/logs/ci-knative-serving-contour-latest/1315090891001565185

@mattmoor
Copy link
Member Author

I'm not gonna lie either, I partly did this because JSON logs made my eyes bleed outside of something like stackdriver 🤣

@mattmoor
Copy link
Member Author

You can start to see it if you crack the raw logs of the running job too:

=== CONT  TestTaskRunPipelineRunCancel/retries=1
    TestTaskRunPipelineRunCancel/retries=0: cancel_test.go:73: Creating PipelineRun in namespace arendelle-9rws9
    TestTaskRunPipelineRunCancel/retries=1: cancel_test.go:73: Creating PipelineRun in namespace arendelle-6p6bp
    TestTaskRunPipelineRunCancel/retries=0: kubelogs.go:197: D 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [controller/controller.go:397] [arendelle-9rws9/task-run-pipeline-run-cancel-retries-0-unjegoml] Adding to queue arendelle-9rws9/task-run-pipeline-run-cancel-retries-0-unjegoml (depth: 1)
    TestTaskRunPipelineRunCancel/retries=1: kubelogs.go:197: D 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [controller/controller.go:397] [arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh] Adding to queue arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh (depth: 2)
    TestTaskRunPipelineRunCancel/retries=1: cancel_test.go:78: Waiting for Pipelinerun task-run-pipeline-run-cancel-retries-1-djhhcdvh in namespace arendelle-6p6bp to be started
    TestTaskRunPipelineRunCancel/retries=0: cancel_test.go:78: Waiting for Pipelinerun task-run-pipeline-run-cancel-retries-0-unjegoml in namespace arendelle-9rws9 to be started
    TestTaskRunPipelineRunCancel/retries=0: kubelogs.go:197: D 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [controller/controller.go:397] [arendelle-9rws9/task-run-pipeline-run-cancel-retries-0-unjegoml] Adding to queue arendelle-9rws9/task-run-pipeline-run-cancel-retries-0-unjegoml (depth: 3)
    TestTaskRunPipelineRunCancel/retries=0: kubelogs.go:197: I 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [pipelinerun/pipelinerun.go:649] [arendelle-9rws9/task-run-pipeline-run-cancel-retries-0-unjegoml] Creating a new TaskRun object task-run-pipeline-run-cancel-retries-0-unjegoml-task-d6s68
    TestTaskRunPipelineRunCancel/retries=1: kubelogs.go:197: D 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [controller/controller.go:397] [arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh] Adding to queue arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh (depth: 3)
    TestTaskRunPipelineRunCancel/retries=1: kubelogs.go:197: I 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [pipelinerun/pipelinerun.go:649] [arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh] Creating a new TaskRun object task-run-pipeline-run-cancel-retries-1-djhhcdvh-task-wfj26
    TestTaskRunPipelineRunCancel/retries=0: kubelogs.go:197: D 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [controller/controller.go:397] [arendelle-9rws9/task-run-pipeline-run-cancel-retries-0-unjegoml-task-d6s68] Adding to queue arendelle-9rws9/task-run-pipeline-run-cancel-retries-0-unjegoml-task-d6s68 (depth: 1)
    TestTaskRunPipelineRunCancel/retries=0: kubelogs.go:197: I 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [taskrun/taskrun.go:322] [arendelle-9rws9/task-run-pipeline-run-cancel-retries-0-unjegoml-task-d6s68] Cloud Events: []
    TestTaskRunPipelineRunCancel/retries=0: kubelogs.go:197: I 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [pipelinerun/pipelinerun.go:495] [arendelle-9rws9/task-run-pipeline-run-cancel-retries-0-unjegoml] PipelineRun task-run-pipeline-run-cancel-retries-0-unjegoml status is being set to &{Succeeded Unknown  {2020-10-14 19:49:59.214102597 +0000 UTC m=+60.078630826} Running Tasks Completed: 0 (Failed: 0, Cancelled 0), Incomplete: 1, Skipped: 0}
    TestTaskRunPipelineRunCancel/retries=1: kubelogs.go:197: D 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [controller/controller.go:397] [arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh-task-wfj26] Adding to queue arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh-task-wfj26 (depth: 1)
    TestTaskRunPipelineRunCancel/retries=1: kubelogs.go:197: I 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [taskrun/taskrun.go:322] [arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh-task-wfj26] Cloud Events: []
    TestTaskRunPipelineRunCancel/retries=1: kubelogs.go:197: I 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [pipelinerun/pipelinerun.go:495] [arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh] PipelineRun task-run-pipeline-run-cancel-retries-1-djhhcdvh status is being set to &{Succeeded Unknown  {2020-10-14 19:49:59.316978857 +0000 UTC m=+60.181507085} Running Tasks Completed: 0 (Failed: 0, Cancelled 0), Incomplete: 1, Skipped: 0}
    TestTaskRunPipelineRunCancel/retries=0: kubelogs.go:197: D 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [controller/controller.go:397] [arendelle-9rws9/task-run-pipeline-run-cancel-retries-0-unjegoml] Adding to queue arendelle-9rws9/task-run-pipeline-run-cancel-retries-0-unjegoml (depth: 3)
    TestTaskRunPipelineRunCancel/retries=0: kubelogs.go:197: D 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [pipelinerun/reconciler.go:322] [arendelle-9rws9/task-run-pipeline-run-cancel-retries-0-unjegoml] Updating status with:   v1beta1.PipelineRunStatus{
          	Status: v1beta1.Status{
          		ObservedGeneration: 0,
        - 		Conditions:         nil,
        + 		Conditions: v1beta1.Conditions{
        + 			{
        + 				Type:               "Succeeded",
        + 				Status:             "Unknown",
        + 				LastTransitionTime: apis.VolatileTime{Inner: s"2020-10-14 19:49:59.214102597 +0"...},
        + 				Reason:             "Running",
        + 				Message:            "Tasks Completed: 0 (Failed: 0, Cancelled 0), Incomplete: 1, Skip"...,
        + 			},
        + 		},
          		Annotations: nil,
          	},
          	PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
        - 		StartTime:      nil,
        + 		StartTime:      s"2020-10-14 19:49:58.914672461 +0000 UTC m=+59.779200693",
          		CompletionTime: nil,
        - 		TaskRuns:       nil,
        + 		TaskRuns: map[string]*v1beta1.PipelineRunTaskRunStatus{
        + 			"task-run-pipeline-run-cancel-retries-0-unjegoml-task-d6s68": &{PipelineTaskName: "task", Status: &v1beta1.TaskRunStatus{}},
        + 		},
          		PipelineResults: nil,
        - 		PipelineSpec:    nil,
        + 		PipelineSpec: &v1beta1.PipelineSpec{
        + 			Tasks: []v1beta1.PipelineTask{{Name: "task", TaskSpec: &v1beta1.EmbeddedTask{...}, Retries: 1}},
        + 		},
          		SkippedTasks: nil,
          	},
          }

Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this looks really good!
Something that is missing today in logs is a timestamp (this is a separate issue) but you can see it reflected here in the 00:00:00 timestamp:

    TestTaskRunPipelineRunCancel/retries=1: kubelogs.go:197: D 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [controller/controller.go:397] [arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh-task-wfj26] Adding to queue arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh-task-wfj26 (depth: 2)

/approve

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 14, 2020
@tekton-robot tekton-robot merged commit 2436f0d into tektoncd:master Oct 14, 2020
@mattmoor
Copy link
Member Author

@afrittoli 😅 yeah I think that's pretty easy to add. You folks don't put timestamps in your logs by default 🤯

@mattmoor mattmoor deleted the logstream-a-few branch October 14, 2020 20:18
@afrittoli
Copy link
Member

Yeah, I'm not sure why, and every time I forget to make the PR... Will do this time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesnt merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants