LOG-4928: Cluster logging next APIs #1537

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:master from jcantrill:log4928 on May 29, 2024

Conversation

@jcantrill
Contributor

This PR proposes the next version of logging APIs

cc @alanconway @xperimental @periklis @cahartma

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 10, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 10, 2024
@openshift-ci-robot

openshift-ci-robot commented Jan 10, 2024

@jcantrill: This pull request references LOG-4928 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.16.0" version, but no target version was set.


In response to this:

This PR proposes the next version of logging APIs

cc @alanconway @xperimental @periklis @cahartma

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from alanconway and periklis January 10, 2024 19:30
Contributor

@alanconway alanconway left a comment

The API is right; only one minor tweak and some clarifying text requested.
The main issues to resolve are how best to manage the codebase for two APIs, which upgrade paths we want to support, and which previews we do.

IMO the best outcome would be:

  • 5.8 supports v1 and v2-beta (tech preview)
  • 6.0 is v2 GA and cleanly drops v1 and deprecated dependencies.

However aligning with OCP extended support may make this more complicated. To be discussed.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 16, 2024
Contributor

@alanconway alanconway left a comment

Added new comments after today's meeting; if you agree with those, then LGTM.

@openshift-ci openshift-ci bot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 17, 2024
Contributor

@periklis periklis left a comment

A couple of general notes, mostly verifying my understanding of the goals and the future-proof integration with LokiStack:

  1. One of the goals says: "Support an API to spec a Red Hat managed LokiStack with the logging tenancy model". Is this still a work in progress, or is it missing? If this goal expresses the fact that we continue to use the reserved keyword default, I suggest sparing a small section mentioning its existence and an example CLF dedicated to this purpose. If, on the other hand, it means we want to expand to user-defined tenancies or augmented tenancy models (e.g. collected logs from OpenStack on OpenShift), we should dedicate some explanation to it.
  2. The section "Implementation Details/Notes/Constraints" mentions that "V2 of the ClusterLogForwarder is a cluster-wide resource". Is this a typo, or do we intend to shift back from the namespaced ClusterLogForwarder? If the latter is true, what does the transition path look like for the use cases covered by namespaced ClusterLogForwarders today?
  3. IIRC we are requested to support, in addition to the daemonset setup, a deployment-based setup for the collector. How is this reflected in the serviceAccount and collector specs in the proposed CRD?

@jcantrill
Contributor Author

A couple of general notes, mostly verifying my understanding of the goals and the future-proof integration with LokiStack:

  1. One of the goals says: "Support an API to spec a Red Hat managed LokiStack with the logging tenancy model". Is this still a work in progress, or is it missing? If this goal expresses the fact that we continue to use the reserved keyword default, I suggest sparing a small section mentioning its existence and an example CLF dedicated to this purpose. If, on the other hand, it means we want to expand to user-defined tenancies or augmented tenancy models (e.g. collected logs from OpenStack on OpenShift), we should dedicate some explanation to it.

Maybe still WIP. The intent is to drop the use of 'default', but we still require a mechanism to allow an admin to say "write logs to a lokiStack that uses the logging tenant model". My expectation is that you can write to a loki instance or a RH lokiStack, where for the former you define the tenancy and the latter only allows the logging tenancy. I welcome suggestions on making this a reality, given it appears not fully expressed here.
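To illustrate the split, a minimal sketch with two distinct output types; the type and field names below are hypothetical, not the proposed API:

// Hypothetical sketch of two output types separating the tenancy concerns
// above; names are illustrative only, not the proposed API.
type OutputSpec struct {
  Name string `json:"name"`

  // Loki is a generic Loki instance; the administrator defines the tenancy.
  Loki *LokiSpec `json:"loki,omitempty"`

  // LokiStack is a Red Hat managed LokiStack; tenancy is fixed to the
  // logging tenant model, so no tenant configuration is exposed here.
  LokiStack *LokiStackSpec `json:"lokiStack,omitempty"`
}

type LokiSpec struct {
  URL string `json:"url"`
  // TenantKey is the record key used to derive the tenant for a generic Loki.
  TenantKey string `json:"tenantKey,omitempty"`
}

type LokiStackSpec struct {
  // Name of the in-cluster LokiStack resource to write to.
  Name string `json:"name"`
}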

  1. The section "Implementation Details/Notes/Constraints" mentions that "V2 of the ClusterLogForwarder is a cluster-wide resource". Is this a typo, or do we intend to shift back from the namespaced ClusterLogForwarder? If the latter is true, what does the transition path look like for the use cases covered by namespaced ClusterLogForwarders today?

This is a summation of our previous discussions: to make it a true cluster-wide resource, as the name implies. We are not providing any auto-migration, but I believe v1 maps fairly easily to v2. I think we can still apply the same permission requirements that we have now, so I don't see it as different. We are also planning to deploy to the NS associated with the SA, but maybe it would be clearer to move NS out of the SA block and up a level, to make it more explicit where we intend to land the deployment. Lastly, this allows us to embrace a "namespace logforwarder" if we can ever figure out what that looks like.
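A minimal sketch of that alternative layout, assuming a hypothetical spec shape (not the final API):

// Hypothetical sketch: the deployment namespace lifted out of the SA block
// to the top level of the spec, per the comment above.
type ClusterLogForwarderSpec struct {
  // Namespace is where the collector deployment lands.
  Namespace string `json:"namespace,omitempty"`

  // ServiceAccount now carries only the identity used by the collector.
  ServiceAccount ServiceAccountName `json:"serviceAccount"`
}

type ServiceAccountName struct {
  Name string `json:"name"`
}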

  1. IIRC we are requested to support, in addition to the daemonset setup, a deployment-based setup for the collector. How is this reflected in the serviceAccount and collector specs in the proposed CRD?

I believe I answered this in my response to the previous question.

@dhellmann
Contributor

#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16 you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template.

@alanconway
Contributor

alanconway commented Feb 19, 2024

@jcantrill re loki vs. LokiStack outputs - I put this on the backlog centuries ago, and eventually closed it as too far in the future. I have re-opened and updated it to be a bit clearer about what is required: https://issues.redhat.com/browse/LOG-2811

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 19, 2024
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 26, 2024
@jcantrill
Contributor Author

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 26, 2024
@jcantrill
Contributor Author

  1. The section "Implementation Details/Notes/Constraints" mentions that "V2 of the ClusterLogForwarder is a cluster-wide resource". Is this a typo, or do we intend to shift back from the namespaced ClusterLogForwarder? If the latter is true, what does the transition path look like for the use cases covered by namespaced ClusterLogForwarders today?

This is a summation of our previous discussions: to make it a true cluster-wide resource, as the name implies. We are not providing any auto-migration, but I believe v1 maps fairly easily to v2. I think we can still apply the same permission requirements that we have now, so I don't see it as different. We are also planning to deploy to the NS associated with the SA, but maybe it would be clearer to move NS out of the SA block and up a level, to make it more explicit where we intend to land the deployment. Lastly, this allows us to embrace a "namespace logforwarder" if we can ever figure out what that looks like.

Following up on this statement: it is not possible to change the resource deployment scope between versions. This can only be accomplished by renaming the CRD and group. We are continuing as a namespaced resource, given we need a namespace. Additionally, the only way to offer two versions of the same API is to provide a migration path. The proposal is being updated accordingly.
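Concretely, the rename amounts to the same kind under a new group; expressed with apimachinery's GroupVersionKind:

import "k8s.io/apimachinery/pkg/runtime/schema"

// Same kind, different group: the API server treats these as two distinct
// CRDs, which is what allows old and new to be served side by side while
// users migrate.
var (
  oldGVK = schema.GroupVersionKind{Group: "logging.openshift.io", Version: "v1", Kind: "ClusterLogForwarder"}
  newGVK = schema.GroupVersionKind{Group: "observability.openshift.io", Version: "v1", Kind: "ClusterLogForwarder"}
)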


* "One click" deployment of a full logging stack as provided by **ClusterLogging** v1
* Complete backwards compatibility to **ClusterLogForwarder** v1
* Automated migration path from v1 to v2
Member


Are v1 and v2 CRDs supported at the same time or is there a hard cutover required?

Contributor Author

IMO, ideally a hard cutover, but there are open questions about what it means for users who did not heed the deprecation warnings and come to the next version of logging with Elasticsearch and fluentd. Mostly, the v2 API for vector deployments maps in a straightforward fashion from v1.
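As a sketch of how mechanical that mapping is, using local placeholder types rather than the real API packages:

// Placeholder types standing in for the v1 and v2 pipeline shapes; both
// reference named inputs and outputs, so conversion is a field-wise copy.
type V1Pipeline struct {
  Name       string
  InputRefs  []string
  OutputRefs []string
}

type V2Pipeline struct {
  Name       string
  InputRefs  []string
  OutputRefs []string
}

func convertPipeline(p V1Pipeline) V2Pipeline {
  return V2Pipeline{Name: p.Name, InputRefs: p.InputRefs, OutputRefs: p.OutputRefs}
}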

Contributor Author

@jewzaam Per further discussion, updated the proposal to land on a path that:

  • initially migrates the resource
  • drops support for the previous version in a subsequent release stream

Contributor Author

Updated again:

  • drops CL objects
  • allows v1 and v2 for vector CLF deployments
  • migrates v1 to v2
  • terminates new feature development for v1

@jcantrill
Contributor Author

The API is right; only one minor tweak and some clarifying text requested. The main issues to resolve are how best to manage the codebase for two APIs, which upgrade paths we want to support, and which previews we do.

IMO the best outcome would be:

  • 5.8 supports v1 and v2-beta (tech preview)
  • 6.0 is v2 GA and cleanly drops v1 and deprecated dependencies.

However aligning with OCP extended support may make this more complicated. To be discussed.

Updated the proposal to identify a migration path that supports both new and old and introduces deprecations/drops over multiple z-streams. @alanconway please review the latest.

@jcantrill
Contributor Author

/hold cancel

@openshift-ci
Contributor

openshift-ci bot commented May 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: periklis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2024

@jcantrill jcantrill force-pushed the log4928 branch 3 times, most recently from 3e35578 to 50b6766, on May 3, 2024 19:51
@jcantrill
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2024
Contributor

@alanconway alanconway left a comment

LGTM - this is great. I marked a couple of very minor typos and clarifications, but otherwise this is ready IMO.

* Support log forwarder API with minimal or no dependency upon reserved words (e.g. default)
* Support an API to spec a Red Hat managed LokiStack with the logging tenancy model
* Continue to allow deployment of a log forwarder to the output sinks of the administrator's choosing
* Automated migration path from *ClusterLogForwarder.logging.openshift.io/v1* to *ClusterLogForwarder.observability.openshift.io/v1*
Contributor

Suggested change
* Automated migration path from *ClusterLogForwarder.logging.openshift.io/v1* to *ClusterLogForwarder.observability.openshift.io/v1*
* Automated migration path from *ClusterLogForwarder.logging.openshift.io/v1* to *ClusterLogForwarder.observability.openshift.io/v2*

Contributor Author

This is the v1 API for this group. @xperimental and I briefly discussed the issue of calling it v2, with the consensus being that it would cause confusion for users who may ask where v1 was.

Comment on lines +93 to +96
1. Deploys the Red Hat **loki-operator**
1. Deploys an instance of **LokiStack** in the `openshift-logging` namespace
1. Deploys the Red Hat **cluster-logging-operator**
1. Creates a **ClusterLogForwarder** custom resource for the **LokiStack**
Contributor

The COO may automate some or all of this for typical deployments, details still being hashed out.
E.g. the user may enable "default-in-cluster-logging" in some COO resource, and COO automates installing loki and CLO operators, creating initial default loki & CLF resources etc.
The simplified user experience is up to the COO, it doesn't affect what logging delivers.

Contributor Author

Is there something specific you wish me to add here to integrate your comment?


#### User Experience

The product is no longer offering a "one-click" experience for deploying a full logging stack from collection to storage. Given we started moving away from this experience when Loki was introduced, this should be low risk. Many customers already have their own log storage solution, so they are only making use of log forwarding. Additionally, it is intended for the **cluster-observability-operator** to recognize the existence of the internally managed log storage and automatically deploy the view plugin. This should reduce the burden on administrators.
Contributor

The COO should mitigate this risk. The goal is for the COO to provide the easy out-of-box experience by installing & configuring multiple operators and resources automatically - not just for logging but for other obs. tools. Ideally COO will finally provide a real one-click setup for simple cases, but still allow users to configure the underlying resources if they need greater control.

Contributor Author

Is there specific verbiage you wish for me to add?

Contributor

Nope, you've already covered it. I think I didn't read carefully.

@alanconway
Contributor

/hold
/lgtm

@jcantrill LGTM from me bar minor typos; will let you /unhold when that's done.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2024
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 9, 2024
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 9, 2024
Contributor

@JoelSpeed JoelSpeed left a comment

I'd be interested to see if you have the full Go type that you're planning to use for the new API. I can review that in-depth if you'd be interested to provide specific feedback about the types and the validations that can/should be applied.

Otherwise, I have provided some general comments on the API and UX in this initial pass.

Comment on lines +54 to +57
* Drop support for the **ClusterLogging** custom resource
* Drop support for **ElasticSearch**, **Kibana** custom resources and the **elasticsearch-operator**
* Drop support for Fluentd collector implementation, Red Hat managed Elastic stack (e.g. Elasticsearch, Kibana)
* Drop support in the **cluster-logging-operator** for **logging-view-plugin** management
Contributor

Is there discussion anywhere on what it means to be supported for APIs and functionality within the operator? I know within core OCP we cannot drop support for custom resources, but I don't know what rules the logging operator is bound by

Contributor Author

These APIs have been marked as deprecated for several releases and we cannot continue to support them. The observability program is moving responsibilities from some operators to others. To my knowledge we have no formal rules in place, but maybe it should be something we consider. cc @eparis

serviceAccount:
  name:
collector:
  resources: # corev1.ResourceRequirements
Contributor

In general it's advised not to re-use upstream types. They may evolve underneath you. Is this passed directly through to a pod spec? If so, can probably excuse that

Contributor Author

In general it's advised not to re-use upstream types.

What is the recommended alternative? Alias the types? Re-implement? There seems to be a caveat to this statement, IMO. I would think the intent of "core" types is to have slow evolution. Also, there is always a danger of a dependency changing, so I'm not certain how you get away from that. We are fairly slow at bumping dependencies and most, if not all, of them are stable; we are not using anything with high churn.

Is this passed directly through to a pod spec? If so, can probably excuse that

It is directly passed through to help with scheduling.

Contributor

So the advice normally is to implement your own type. This comes up most commonly with corev1.ObjectReference and its usage, where people just want a secret name and namespace but get a number of other, unused fields as well.

Given this is a resource requirement, I expect you are just plumbing it through to a podSpec? If that's correct, then the usage here is fine. It would only be better to implement your own alternative if you are using it for some other purpose, and just used the type because it was handy that it looked like what you wanted already (like the object references)
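For instance, a narrow reference type along these lines (a sketch, not this proposal's actual type):

// Sketch of a purpose-built reference type of the kind suggested above,
// instead of corev1.ObjectReference with its many unused fields.
type SecretReference struct {
  // Name of the secret, resolved in the forwarder's own namespace.
  Name string `json:"name"`
  // Key within the secret that holds the value of interest.
  Key string `json:"key"`
}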

tolerations: # corev1.Toleration
inputs:
  - name:
    type: # enum: application,infrastructure,audit
Contributor

This looks like a discriminated union, which is good. One thing to consider is that embedded DUs (where there are other fields at the same level but outside of the union, e.g. receiver) have proven to be confusing to end users, and we currently advise that you do not embed DUs but keep them separate. It would mean an extra level of indentation for type, application, infrastructure and audit, but it provides extra clarity over the behaviour of the fields.

Contributor Author

This pattern is one that we are re-using based upon seeing it elsewhere. Can you point to an example or recommendation of what an alternative looks like?

Contributor

The pattern is correct; you would just move one indentation level down. There are some notes in our API conventions.

I think right now you probably have something like

type MyUnion struct {
  // MyDiscriminator selects which union member is populated.
  MyDiscriminator string `json:"myDiscriminator"`

  // MyMember is a placeholder for one of the union's member structs.
  MyMember *MyMemberSpec `json:"myMember,omitempty"`
}

type MyParent struct {
  MyUnion `json:",inline"`
}

Where we prefer not to have the inline there, so that the members of the union appear at a separate level from any fields adjacent to the union.
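That is, a sketch of the preferred shape, using the same placeholder names:

type MyParent struct {
  // MyUnion is a named field rather than inlined, so the discriminator and
  // member fields sit one level below any fields adjacent to the union.
  MyUnion MyUnion `json:"myUnion"`
}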

http:
  format: # enum: kubeAPIAudit, format of incoming data
tls:
  cacert:
Contributor

Should be camel case, so caCert

  key: # the key in the resource
keyPassphrase:
  key: # the key in the resource
insecureSkipVerify: # bool
Contributor

Avoid bools; they typically don't age well.
Do you really need this option still? Any chance to drop it in the new API?

Contributor Author

This is a common TLS setting which short-circuits certificate verification. I'm not certain it can be represented any other way, though I agree with you regarding bools.

Any chance to drop it in the new API?
@alanconway you and I have had this discussion in the past, and I believe we always land on the "development" scenario where admins want this capability. Our position has been to lean into security, but this is a knob that is useful during adoption.

Contributor

Are there other TLS verification options? Such as system roots or specific keys?

You could have it as a combined DU; this has the nice benefit of preventing folks from setting keys when they would otherwise be ignored, because you skip the verification anyway.

tlsVerification:
  method: SystemRoots | KeyFromSecret | InsecureSkipVerification
  keyFromSecret: <- only when method KeyFromSecret
    secret:
      name:
    key: ...
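Rendered as Go types, that union might look like this (names hypothetical):

// Hypothetical Go rendering of the tlsVerification union sketched above.
type TLSVerification struct {
  // Method discriminates the verification strategy.
  // +kubebuilder:validation:Enum=SystemRoots;KeyFromSecret;InsecureSkipVerification
  Method string `json:"method"`

  // KeyFromSecret is set only when Method is KeyFromSecret.
  KeyFromSecret *KeyFromSecret `json:"keyFromSecret,omitempty"`
}

type KeyFromSecret struct {
  // Secret names the secret containing the key material.
  Secret SecretName `json:"secret"`
  // Key within that secret.
  Key string `json:"key"`
}

type SecretName struct {
  Name string `json:"name"`
}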

* Support log forwarder API with minimal or no dependency upon reserved words (e.g. default)
* Support an API to spec a Red Hat managed LokiStack with the logging tenancy model
* Continue to allow deployment of a log forwarder to the output sinks of the administrator's choosing
* Automated migration path from *ClusterLogForwarder.logging.openshift.io/v1* to *ClusterLogForwarder.observability.openshift.io/v1*
Contributor

I expected to see this detailed, but didn't. How will you migrate users? Moving from one API group to another is complex if you must maintain the functionality and cannot have downtime. What's the plan here?

@jcantrill jcantrill force-pushed the log4928 branch 2 times, most recently from 47d9543 to 5150362, on May 22, 2024 13:24
@alanconway
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 29, 2024
@jcantrill
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2024
@openshift-ci
Contributor

openshift-ci bot commented May 29, 2024

@jcantrill: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 4668a08 into openshift:master May 29, 2024
@jcantrill jcantrill deleted the log4928 branch May 30, 2024 17:18