LOG-4928: Cluster logging next APIs #1537
openshift-merge-bot[bot] merged 1 commit into openshift:master from
Conversation
|
@jcantrill: This pull request references LOG-4928 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
API is right, only 1 minor tweak and clarifying text requested.
Main issues to resolve are how best to manage the codebase for 2 APIs, what upgrade paths we want to support and what previews we do.
IMO the best outcome would be:
- 5.8 supports v1 and v2-beta (tech preview)
- 6.0 is v2 GA: a clean drop of v1 and deprecated dependencies.
However aligning with OCP extended support may make this more complicated. To be discussed.
alanconway
left a comment
Added new comments after today's meeting; if you agree to those then LGTM.
periklis
left a comment
A couple of general notes, mostly verifying my understanding of the goals and future-proof integration with LokiStack:
- One of the goals says: "Support an API to spec a Red Hat managed LokiStack with the logging tenancy model". Is this still work in progress, or missing? If by this goal we mean continuing to use the reserved keyword
default, I suggest sparing a small section mentioning its existence and an example CLF dedicated to this purpose. If on the other hand it means we want to expand to user-defined tenancies or augmented tenancy models (e.g. collected logs from OpenStack on OpenShift), we should dedicate some explanation to it.
- The section "Implementation Details/Notes/Constraints" mentions that "V2 of the ClusterLogForwarder is a cluster-wide resource". Is this a typo, or do we intend to shift back from a namespaced ClusterLogForwarder? If the latter is true, what does the transition path look like for the use cases covered by namespaced ClusterLogForwarders today?
- IIRC we are requested to support, in addition to the daemonset setup, a deployment-based setup for the collector. How is this reflected in the
`serviceAccount` and `collector` specs in the proposed CRD?
Maybe still WIP. The intent is to drop the use of 'default', but we still require a mechanism to allow an admin to say "write logs to a LokiStack that uses the logging tenant model". My expectation is that you can write to a Loki instance or a Red Hat LokiStack, where for the former you define the tenancy and the latter only allows the logging tenancy. I welcome suggestions on making this a reality, given it appears not fully expressed here.
This is a summation of our previous discussions: to make it a true cluster-wide resource, as the name implies. We are not providing any auto-migration, but I believe v1 maps fairly easily to v2. I think we can still apply the same permission requirements that we have now, so I don't see it as different. Also planning to deploy to the namespace associated with the service account, but maybe it would be clearer to move the namespace out of the serviceAccount block and up a level, to make it more explicit where we intend to land the deployment. Lastly, this allows us to embrace a "namespaced log forwarder" if we can ever figure out what that looks like.
I believe I answered this in the response to the previous question. |
|
#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16 you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template. |
|
@jcantrill re loki vs. LokiStack outputs - I put this on the backlog centuries ago, and eventually closed it as too far in the future. I have re-opened and updated it to be a bit clearer on what is required: https://issues.redhat.com/browse/LOG-2811 |
|
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so. /lifecycle stale |
|
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle rotten. If this proposal is safe to close now please do so. /lifecycle rotten |
|
/remove-lifecycle rotten |
Following up on this statement: it is not possible to change the resource deployment scope between versions of a CRD. That can only be accomplished by renaming the CRD and group, so we are continuing with a namespaced resource, given we need a namespace. Additionally, the only way to offer two versions of the same API is to provide a migration path. The proposal is being updated accordingly. |
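For context, the constraint on scope is visible in the CRD API itself: in `apiextensions.k8s.io/v1`, `spec.scope` is declared once per CRD and applies to all served versions. A generic sketch (the group and names below follow the proposal's new API group, not a published CRD):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: clusterlogforwarders.observability.openshift.io
spec:
  group: observability.openshift.io
  scope: Namespaced   # set once for the whole CRD; cannot vary per version
  names:
    kind: ClusterLogForwarder
    plural: clusterlogforwarders
  versions:
  - name: v1
    served: true
    storage: true
```

Because `scope` is not part of the per-version list, flipping between Namespaced and Cluster requires a new CRD (i.e. a new name/group), which is why the proposal keeps the resource namespaced.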
|
|
| * "One click" deployment of a full logging stack as provided by **ClusterLogging** v1 |
| * Complete backwards compatibility with **ClusterLogForwarder** v1 |
| * Automated migration path from v1 to v2 |
Are v1 and v2 CRDs supported at the same time or is there a hard cutover required?
IMO, ideally a hard cutover, but there are open questions about what it means for users who did not heed the deprecation warnings and come to the next version of logging with Elasticsearch and fluentd. Mostly, the v2 API for vector deployments maps in a straightforward fashion from v1.
@jewzaam further discussion. Updated the proposal to land on a path that:
- initially migrates the resource
- drops support for previous version in subsequent release stream
Updated again:
- drops CL objects
- allows v1 and v2 for vector CLF deployments
- migrates v1 to v2
- terminates new feature development for v1
Updated the proposal to identify a migration path that supports new and old, and introduces deprecations/drops over multiple z-streams. @alanconway please review the latest. |
|
/hold cancel |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: periklis. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Force-pushed 3e35578 to 50b6766
|
/hold cancel |
alanconway
left a comment
LGTM - this is great. I marked a couple of very minor typos and clarifications, but otherwise this is ready IMO.
| * Support log forwarder API with minimal or no dependency upon reserved words (e.g. default) |
| * Support an API to spec a Red Hat managed LokiStack with the logging tenancy model |
| * Continue to allow deployment of a log forwarder to the output sinks of the administrator's choosing |
| * Automated migration path from *ClusterLogForwarder.logging.openshift.io/v1* to *ClusterLogForwarder.observability.openshift.io/v1* |
Suggested change:
| * Automated migration path from *ClusterLogForwarder.logging.openshift.io/v1* to *ClusterLogForwarder.observability.openshift.io/v1* |
| * Automated migration path from *ClusterLogForwarder.logging.openshift.io/v1* to *ClusterLogForwarder.observability.openshift.io/v2* |
This is the v1 API for this group. @xperimental and I briefly discussed the issue of calling it v2; the consensus was that it would cause confusion for users who may ask where v1 was.
| 1. Deploys the Red Hat **loki-operator** |
| 2. Deploys an instance of **LokiStack** in the `openshift-logging` namespace |
| 3. Deploys the Red Hat **cluster-logging-operator** |
| 4. Creates a **ClusterLogForwarder** custom resource for the **LokiStack** |
The COO may automate some or all of this for typical deployments; details are still being hashed out.
E.g. the user may enable "default-in-cluster-logging" in some COO resource, and the COO automates installing the Loki and CLO operators, creating the initial default Loki and CLF resources, etc.
The simplified user experience is up to the COO; it doesn't affect what logging delivers.
Is there something specifically you wish me to add here to integrate your comment?
enhancements/cluster-logging/logs-observability-openshift-io-apis.md
|
|
| #### User Experience |
| The product is no longer offering a "one-click" experience for deploying a full logging stack from collection to storage. Given we started moving away from this experience when Loki was introduced, this should be low risk. Many customers already have their own log storage solution, so they are only making use of log forwarding. Additionally, it is intended for the **cluster-observability-operator** to recognize the existence of the internally managed log storage and automatically deploy the view plugin. This should reduce the burden on administrators. |
The COO should mitigate this risk. The goal is for the COO to provide the easy out-of-box experience by installing & configuring multiple operators and resources automatically - not just for logging but for other obs. tools. Ideally COO will finally provide a real one-click setup for simple cases, but still allow users to configure the underlying resources if they need greater control.
Is there specific verbiage you wish for me to add?
Nope, you've already covered it. I think I didn't read carefully.
enhancements/cluster-logging/logs-observability-openshift-io-apis.md
|
/hold @jcantrill LGTM from me bar minor typos, will let you /unhold when that's done. |
JoelSpeed
left a comment
I'd be interested to see if you have the full Go type that you're planning to use for the new API. I can review that in-depth if you'd be interested to provide specific feedback about the types and the validations that can/should be applied.
Otherwise, have provided some general comments on the API and UX in this initial pass
| * Drop support for the **ClusterLogging** custom resource |
| * Drop support for the **ElasticSearch** and **Kibana** custom resources and the **elasticsearch-operator** |
| * Drop support for the Fluentd collector implementation and the Red Hat managed Elastic stack (e.g. Elasticsearch, Kibana) |
| * Drop support in the **cluster-logging-operator** for **logging-view-plugin** management |
Is there discussion anywhere on what it means to be supported for APIs and functionality within the operator? I know within core OCP we cannot drop support for custom resources, but I don't know what rules the logging operator is bound by
These APIs have been marked as deprecated for several releases; we cannot continue to support them. The observability program is moving responsibilities from some operators to others. To my knowledge we have no formal rules in place, but maybe it is something we should consider. cc @eparis
| serviceAccount: |
|   name: |
| collector: |
|   resources: #corev1.ResourceRequirements |
In general it's advised not to re-use upstream types. They may evolve underneath you. Is this passed directly through to a pod spec? If so, can probably excuse that
In general it's advised not to re-use upstream types.
What is the recommended alternative? Alias the types? Re-implement? There seems to be a caveat to this statement, IMO. I would think the intent of "core" types is to have slow evolution. Also, there is always a danger of a dependency changing, so I'm not certain how you get away from that. We are fairly slow at bumping dependencies, and most if not all of them are stable; we are not using anything with high churn.
Is this passed directly through to a pod spec? If so, can probably excuse that
It is directly passed through to help with scheduling.
So the advice normally is to implement your own type. This is most common with corev1.ObjectReference and its usage: people just want a secret name and namespace, but inherit a number of other unused fields as well.
Given this is a resource requirement, I expect you are just plumbing it through to a podSpec? If that's correct, then the usage here is fine. It would only be better to implement your own alternative if you were using it for some other purpose, and just picked the type because it conveniently looked like what you wanted already (like the object references).
| tolerations: #corev1.Toleration |
| inputs: |
| - name: |
|   type: #enum: application,infrastructure,audit |
This looks like a discriminated union, which is good. One thing to consider: embedded DUs (where there are other fields at the same level but outside of the union, e.g. receiver) have proven to be confusing to end users, and we currently advise that you do not embed DUs but keep them separate. It would mean an extra level of indentation for type, application, infrastructure and audit, but it provides extra clarity about the behaviour of the fields.
This pattern is one that we are re-using based upon seeing it elsewhere. Can you point to an example or recommendation of what an alternate looks like?
The pattern is correct; you would just move one indentation level down. There are some notes in our API conventions.
I think right now you probably have something like

type MyUnion struct {
    MyDiscriminator string
    MyMember        ...
}

type MyParent struct {
    MyUnion `json:",inline"`
}

where we prefer not to have the inline there, so that the members of the union appear on a separate level from any fields adjacent to the union.
| http: |
|   format: #enum: kubeAPIAudit, format of incoming data |
| tls: |
|   cacert: |
Should be camel case, so caCert
|   key: #the key in the resource |
| keyPassphrase: |
|   key: #the key in the resource |
| insecureSkipVerify: #bool |
Avoid bools, they typically don't age well.
Do you really need this option still? Any chance to drop it in the new API?
This is a common TLS setting which short-circuits certificate verification. I'm not certain it can be represented any other way, though I agree with you regarding bools.
Any chance to drop it in the new API?
@alanconway you and I have had this discussion in the past, and I believe we always land on the "development" scenario where admins want this capability. Our position has been to lean into security, but this is a knob that is useful during adoption.
Are there other TLS verification options, such as system roots or specific keys?
You could have it as a combined DU; this has the nice benefit of preventing folks from setting keys when they would otherwise be ignored, because you skip the verification anyway:

tlsVerification:
  method: SystemRoots | KeyFromSecret | InsecureSkipVerification
  keyFromSecret: # only when method is KeyFromSecret
    secret:
      name:
      key: ...
enhancements/cluster-logging/logs-observability-openshift-io-apis.md
| * Support log forwarder API with minimal or no dependency upon reserved words (e.g. default) |
| * Support an API to spec a Red Hat managed LokiStack with the logging tenancy model |
| * Continue to allow deployment of a log forwarder to the output sinks of the administrator's choosing |
| * Automated migration path from *ClusterLogForwarder.logging.openshift.io/v1* to *ClusterLogForwarder.observability.openshift.io/v1* |
I expected to see this detailed, but didn't. How will you migrate users? Moving from one API group to another is complex if you must maintain the functionality and cannot have downtime. What's the plan here?
Force-pushed 47d9543 to 5150362
|
/lgtm |
|
/hold cancel |
|
@jcantrill: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This PR proposes the next version of logging APIs
cc @alanconway @xperimental @periklis @cahartma