Fix migration metric registration #620

Merged
k8s-ci-robot merged 1 commit into kubernetes-csi:master from jsafrane:fix-migration-metrics on May 4, 2021

Conversation

@jsafrane
Contributor

@jsafrane jsafrane commented May 4, 2021

/kind bug
What this PR does / why we need it:
Don't register the process_start_time_seconds metric in the migration metrics manager. Registering it there as well causes a double registration, which results in this error:

gathered metric family process_start_time_seconds has help "[ALPHA] Start time of the process since unix epoch in seconds." but should have "Start time of the process since unix epoch in seconds."

This happens only when a migratable CSI driver is used. With a regular hostpath CSI driver, the metrics work without any issues.
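
A minimal sketch of why a double registration surfaces as a help-text mismatch, assuming the metrics endpoint merges the output of several gatherers and checks that a family reported more than once carries consistent help text. This is a simplified stdlib-only model, not the Prometheus client code; the `family` type and `mergeFamilies` function are hypothetical, and which source carries the "[ALPHA]" help string is an assumption made for illustration.

```go
package main

import "fmt"

// family is a simplified stand-in for a gathered metric family:
// just a name and its help text. (Hypothetical type.)
type family struct {
	Name string
	Help string
}

// mergeFamilies combines the output of several gatherers and reports an
// error whenever the same family name shows up with different help text.
func mergeFamilies(gatherers ...[]family) (map[string]string, []error) {
	merged := map[string]string{}
	var errs []error
	for _, g := range gatherers {
		for _, f := range g {
			existing, seen := merged[f.Name]
			if !seen {
				merged[f.Name] = f.Help
				continue
			}
			if existing != f.Help {
				errs = append(errs, fmt.Errorf(
					"gathered metric family %s has help %q but should have %q",
					f.Name, f.Help, existing))
			}
		}
	}
	return merged, errs
}

func main() {
	const name = "process_start_time_seconds"
	// Assumption: the default gatherer reports the plain help text and the
	// migration metrics manager reports the "[ALPHA]" variant.
	defaultGatherer := []family{{name, "Start time of the process since unix epoch in seconds."}}
	migrationManager := []family{{name, "[ALPHA] Start time of the process since unix epoch in seconds."}}

	_, errs := mergeFamilies(defaultGatherer, migrationManager)
	for _, err := range errs {
		fmt.Println(err)
	}
}
```

In this model the conflict disappears as soon as only one source reports the family, which is the approach the PR takes.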

Which issue(s) this PR fixes:

Fixes #619

Special notes for your reviewer:
cc @pohly @Jiawei0227

Does this PR introduce a user-facing change?:

Fixed reporting of metrics when a migratable CSI driver is used.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 4, 2021
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 4, 2021
Contributor

@pohly pohly left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 4, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane, pohly

@k8s-ci-robot
Contributor

k8s-ci-robot commented May 4, 2021

@jsafrane: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-csi-external-provisioner-1-21-on-kubernetes-1-21 | b2be509 | link | /test pull-kubernetes-csi-external-provisioner-1-21-on-kubernetes-1-21 |

@jsafrane
Contributor Author

jsafrane commented May 4, 2021

/retest

@k8s-ci-robot k8s-ci-robot merged commit 2890de3 into kubernetes-csi:master May 4, 2021
- metricsManager = metrics.NewCSIMetricsManagerWithOptions(provisionerName, metrics.WithMigration())
+ metricsManager = metrics.NewCSIMetricsManagerWithOptions(provisionerName,
+     // Will be provided via default gatherer.
+     metrics.WithProcessStartTime(false),
Collaborator

Should we remove the processstarttime metric from csi-lib-utils now that it's included by default?

Contributor

Some sidecars only use the csi-lib-utils code and nothing else, in which case processstarttime still needs to be added there.

Collaborator

I thought the plan was to also add the base metrics to all sidecars?

Contributor

Yes, but that's just a plan. I don't know who's working on it. We can only remove it once all sidecars are updated.
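
The trade-off discussed in this thread can be sketched with the functional options pattern: the metric stays on by default for callers that rely only on the library, and a caller that already gets it from another gatherer switches it off. The code below is a hypothetical, stdlib-only model and not the csi-lib-utils implementation; the `manager` type, the lowercase option names, and the default value of `true` are all assumptions made for illustration.

```go
package main

import "fmt"

// manager is a hypothetical metrics manager, not the csi-lib-utils type.
type manager struct {
	registerProcessStartTime bool
	migration                bool
}

// option mutates a manager during construction.
type option func(*manager)

// withProcessStartTime toggles registration of the process start time metric.
func withProcessStartTime(enabled bool) option {
	return func(m *manager) { m.registerProcessStartTime = enabled }
}

// withMigration marks the manager as serving a migratable driver.
func withMigration() option {
	return func(m *manager) { m.migration = true }
}

// newManager applies options on top of defaults. Assumption: the default
// keeps the metric enabled so that callers using only this package get it.
func newManager(opts ...option) *manager {
	m := &manager{registerProcessStartTime: true}
	for _, o := range opts {
		o(m)
	}
	return m
}

// registeredMetrics lists the families this manager would register itself.
func (m *manager) registeredMetrics() []string {
	var names []string
	if m.registerProcessStartTime {
		names = append(names, "process_start_time_seconds")
	}
	return names
}

func main() {
	// A sidecar that relies only on the library keeps the default.
	standalone := newManager()
	// A sidecar that also serves a default gatherer opts out.
	withDefaultGatherer := newManager(withProcessStartTime(false), withMigration())

	fmt.Println(standalone.registeredMetrics())
	fmt.Println(withDefaultGatherer.registeredMetrics())
}
```

Keeping the option rather than deleting the metric lets each sidecar decide, which matches the point that removal is only possible once all sidecars provide the base metrics themselves.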

@msau42
Collaborator

msau42 commented May 5, 2021

Should this fix be cherry-picked?

@pohly
Contributor

pohly commented May 5, 2021

> Should this fix be cherry-picked?

Right now it would be simpler to release 2.2.1 from master (all it has are the two bug fixes that need to be cherry-picked) and then fast-forward release-2.2, but if you prefer cherry-picking, then please merge #626.

@gnufied
Contributor

gnufied commented May 5, 2021

Is there a way to catch this from happening? We had some metric e2e tests - https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/volume_metrics.go . I don't think we ported these to CSI suite.

@pohly
Contributor

pohly commented May 6, 2021

> Is there a way to catch this from happening? We had some metric e2e tests - https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/volume_metrics.go .

That looks like it tests metrics data in kube-controller-manager. What we want to test are metrics exposed by the sidecars. Retrieving those from an e2e.test isn't easy because from outside of the cluster, port-forwarding has to be used to reach the metrics endpoint of each container. But I found a solution: https://github.com/intel/pmem-csi/blob/devel/test/e2e/pod/dial.go

Simply retrieving the metrics data would already be useful. But ideally we should also validate that the returned data meets expectations, which means we need to define what those expectations are.

But whether that would have caught this issue here is still uncertain. Do we have e2e tests involving a driver where migration is enabled?
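
As a rough illustration of "validate that the returned data meets expectations", the sketch below checks metrics text that has already been retrieved from a sidecar. It assumes the endpoint serves the Prometheus text exposition format and that one reasonable expectation is "each expected family is announced by exactly one `# HELP` line". The `checkExposition` function is hypothetical, and reaching the endpoint through port-forwarding is deliberately left out.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// checkExposition scans Prometheus text-format output and verifies that
// every expected metric family is announced by exactly one "# HELP" line.
func checkExposition(body string, expected []string) error {
	helpCount := map[string]int{}
	scanner := bufio.NewScanner(strings.NewReader(body))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// A help line looks like: # HELP <name> <free text>
		if len(fields) >= 3 && fields[0] == "#" && fields[1] == "HELP" {
			helpCount[fields[2]]++
		}
	}
	if err := scanner.Err(); err != nil {
		return err
	}
	for _, name := range expected {
		if n := helpCount[name]; n != 1 {
			return fmt.Errorf("metric family %s: want 1 HELP line, got %d", name, n)
		}
	}
	return nil
}

func main() {
	// Sample body for illustration only; a real test would use the
	// response fetched from the sidecar's metrics endpoint.
	body := "# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.\n" +
		"# TYPE process_start_time_seconds gauge\n" +
		"process_start_time_seconds 1.62e+09\n"
	fmt.Println(checkExposition(body, []string{"process_start_time_seconds"}))
}
```

Whether a check like this would have caught the bug depends on how the endpoint reports a gathering error, so it only covers the "define expectations" half of the problem.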

@msau42 msau42 mentioned this pull request Aug 11, 2021

Development

Successfully merging this pull request may close these issues.

Error collecting metrics with migratable CSI driver

5 participants