Skip to content

Bump azidentity#2387

Closed
2uasimojo wants to merge 1 commit intoopenshift:mce-2.5from
2uasimojo:azidentity-mce-2.5
Closed

Bump azidentity#2387
2uasimojo wants to merge 1 commit intoopenshift:mce-2.5from
2uasimojo:azidentity-mce-2.5

Conversation

@2uasimojo
Copy link
Member

@2uasimojo 2uasimojo commented Jul 31, 2024

@openshift-ci openshift-ci bot requested review from dlom and lleshchi July 31, 2024 19:16
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 31, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 31, 2024
@2uasimojo
Copy link
Member Author

/override ci/prow/security

#2353

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 31, 2024

@2uasimojo: Overrode contexts on behalf of 2uasimojo: ci/prow/security

Details

In response to this:

/override ci/prow/security

#2353

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@2uasimojo
Copy link
Member Author

/assign @lleshchi

Manual cherry-pick; you reviewed the original.

@codecov
Copy link

codecov bot commented Jul 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 57.92%. Comparing base (1c99e3b) to head (0fd9146).

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff            @@
##           mce-2.5    #2387   +/-   ##
========================================
  Coverage    57.92%   57.92%           
========================================
  Files          186      186           
  Lines        26080    26080           
========================================
  Hits         15106    15106           
  Misses        9711     9711           
  Partials      1263     1263           

@2uasimojo
Copy link
Member Author

/test e2e-azure

@2uasimojo
Copy link
Member Author

/override "Red Hat Konflux / hive-mce-25-on-pull-request"

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 1, 2024

@2uasimojo: Overrode contexts on behalf of 2uasimojo: Red Hat Konflux / hive-mce-25-on-pull-request

Details

In response to this:

/override "Red Hat Konflux / hive-mce-25-on-pull-request"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@2uasimojo 2uasimojo force-pushed the azidentity-mce-2.5 branch from b1ff741 to fcb6d75 Compare August 1, 2024 19:38
@2uasimojo
Copy link
Member Author

/override "Red Hat Konflux / hive-mce-25-on-pull-request"

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 1, 2024

@2uasimojo: Overrode contexts on behalf of 2uasimojo: Red Hat Konflux / hive-mce-25-on-pull-request

Details

In response to this:

/override "Red Hat Konflux / hive-mce-25-on-pull-request"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@2uasimojo
Copy link
Member Author

time="2024-08-01T23:03:08Z" level=fatal msg="Failed to write file" error="open /.azure/osServicePrincipal.json: permission denied" path=/.azure/osServicePrincipal.json

I want to say we've seen this before. Seems intermittent?

/test e2e-azure

Also going to make a no-op mce-2.5 PR to see if somehow this regressed it, though that doesn't seem possible.

@2uasimojo
Copy link
Member Author

Also going to make a no-op mce-2.5 PR to see if somehow this regressed it, though that doesn't seem possible.

#2399

@2uasimojo
Copy link
Member Author

/test e2e-azure

@2uasimojo
Copy link
Member Author

/test e2e-azure

@2uasimojo
Copy link
Member Author

/test e2e-azure

@2uasimojo
Copy link
Member Author

/test e2e-azure

@2uasimojo
Copy link
Member Author

/hold for debug

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 21, 2024
@2uasimojo
Copy link
Member Author

So the problem seems to be that that file already exists; and since we lay it down with 400 perms, the subsequent WriteFile gets EPERM.

Now... why does the file already exist??

@2uasimojo
Copy link
Member Author

Okay, sat in a head-scratching session with @dlom and we finally figured out what's going on. I'm going to write notes here for now, for lack of a better spot.

The uninstall job is actually running the hiveutil container more than once. If you look at the hive_uninstall_job.log:

time="2024-08-22T01:34:15Z" level=info msg="running file observer" files="[]"   <==EMPTY!
I0822 01:34:15.868695       1 observer_polling.go:159] Starting file observer
time="2024-08-22T01:34:15Z" level=info msg="Using loaded object" name=hive-66c257fd-32f9-418f-a27b-d08bd519332c-azure-creds namespace=cluster-test type="*v1.Secret"
time="2024-08-22T01:34:15Z" level=info msg="EFRIED: process file" error="stat /.azure/osServicePrincipal.json: no such file or directory" fileInfo="<nil>" path=/.azure/osServicePrincipal.json   <==ENOENT!
... normal startup of deprovision code ...
time="2024-08-22T01:34:17Z" level=info msg="Loading azure creds" credsFilePath=/.azure/osServicePrincipal.json
time="2024-08-22T01:34:17Z" level=debug msg="deleting public records"
time="2024-08-22T01:34:17Z" level=debug msg="dns.ZonesClient#ListByResourceGroup: Failure responding to request: StatusCode=401 -- Original Error: autorest/azure: Service returned an error. Status=401 Code=\"AuthenticationFailed\" Message=\"Authentication failed. The 'Authorization' header is missing.\""
time="2024-08-22T01:34:17Z" level=debug msg="deleting resource group"
time="2024-08-22T01:34:17Z" level=debug msg="failed to delete hive-66c257fd-32f9-41-x5vbz-rg: resources.GroupsClient#Delete: Failure sending request: StatusCode=401 -- Original Error: Code=\"AuthenticationFailed\" Message=\"Authentication failed. The 'Authorization' header is missing.\""
time="2024-08-22T01:34:17Z" level=debug msg="deleting application registrations"
time="2024-08-22T01:34:17Z" level=fatal msg="Runtime error" error="[unable to authenticate when deleting public DNS records: dns.ZonesClient#ListByResourceGroup: Failure responding to request: StatusCode=401 -- Original Error: autorest/azure: Service returned an error. Status=401 Code=\"AuthenticationFailed\" Message=\"Authentication failed. The 'Authorization' header is missing.\", unable to authenticate when deleting resource group: failed to delete hive-66c257fd-32f9-41-x5vbz-rg: resources.GroupsClient#Delete: Failure sending request: StatusCode=401 -- Original Error: Code=\"AuthenticationFailed\" Message=\"Authentication failed. The 'Authorization' header is missing.\"]"

(So the actual failure is there at the end, which we currently suspect has to do with an incompatibility between the bumped azidentity lib and the level of installer code that's vendored at this branch. Putting that aside for the moment...)

...versus the uninstall pod log, which we are now pretty sure represents a separate invocation of the container from the above:

time="2024-08-22T02:00:26Z" level=debug msg="Couldn't find install logs provider environment variable. Skipping."
time="2024-08-22T02:00:26Z" level=debug msg="no additional log fields found"
time="2024-08-22T02:00:26Z" level=info msg="running file observer" files="[/.azure/osServicePrincipal.json]"   <==L@@K
I0822 02:00:26.887510       1 observer_polling.go:159] Starting file observer
time="2024-08-22T02:00:26Z" level=info msg="Using loaded object" name=hive-66c257fd-32f9-418f-a27b-d08bd519332c-azure-creds namespace=cluster-test type="*v1.Secret"
time="2024-08-22T02:00:26Z" level=info msg="EFRIED: process file" error="<nil>" fileInfo="&{osServicePrincipal.json 233 256 {886325768 63859887255 0x15adae60} {2052 136625069 1 33024 1000630000 1000630000 0 0 233 4096 8 {1724292026 886358193} {1724290455 886325768} {1724292026 646344680} [0 0 0]}}" path=/.azure/osServicePrincipal.json   <== File exists!
time="2024-08-22T02:00:26Z" level=fatal msg="Failed to write file" error="open /.azure/osServicePrincipal.json: permission denied" path=/.azure/osServicePrincipal.json   <== So we fail to write it, because we originally laid it down with 400 perms

Per k8s docs, when a container fails, the Job may redrive it according to restartPolicy in the same pod -- and thus with the same file system structure intact.

It is worth noting that, without #1874, although we still would have hit the actual failure, we wouldn't have seen the EPERM because the creds Secret would have been attached to the pod as a volume mount with the creds file already present, so at the container level we wouldn't have been trying to create files. (By itself, this is not a reason to revert that change.)


So here are some things we can do:

  • Add code to ProjectToDir to delete/skip files if they already exist. I think delete would be better: it would more closely simulate the (usual) volume mount behavior where updates to mounted Secrets are propagated automatically into the pod.
  • Set restartPolicy: Never on Jobs. This would prevent rerunning the deprov process if it fails... but that has surely never happened, or we would have seen this manifestation before?? But if we do the above, this shouldn't be necessary.
  • Beef up e2e to capture a couple more things:
    • uninstall job and pod manifests
    • install pod manifest and logs even when install job has succeeded? Right now it looks like those get purged by the time we capture.

And of course:

  • Identify the root cause of the actual failure, and either fix it or decide to abandon these backports.

@2uasimojo
Copy link
Member Author

  • Add code to ProjectToDir to delete /skip files if they already exist.

#2425

@2uasimojo
Copy link
Member Author

/test e2e-azure

Now that #2427 has merged, the pod log should show the actual failure...

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 28, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 28, 2024
@2uasimojo
Copy link
Member Author

(rebase to remove my debug-by-printf commit)

@2uasimojo
Copy link
Member Author

Now that #2427 has merged, the pod log should show the actual failure...

Yup

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_hive/2387/pull-ci-openshift-hive-mce-2.5-e2e-azure/1828936466248503296/artifacts/e2e-azure/test/artifacts/cluster-test-hive-80774de8-b66c-412d-8f2a-e97c16be7773-uninstall-nknzm.log

This is most likely due to an incompatibility between vendored installer code and azidentity in this branch.

@2uasimojo
Copy link
Member Author

/test e2e-azure

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 31, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 3, 2024
@2uasimojo
Copy link
Member Author

Rebased to pick up #2436, a revendor, which might, if we're really lucky, fix this.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 3, 2024

@2uasimojo: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure 0fd9146 link true /test e2e-azure

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@2uasimojo
Copy link
Member Author

which might, if we're really lucky, fix this.

Nope.

@2uasimojo
Copy link
Member Author

This is most likely due to an incompatibility between vendored installer code and azidentity in this branch.

Abandoning this fix in mce-2.5 and earlier releases due to ⬆️. This is deemed acceptable per guidelines.

#2439 to ignore this CVE via snyk config.

@2uasimojo 2uasimojo closed this Sep 5, 2024
@2uasimojo 2uasimojo deleted the azidentity-mce-2.5 branch September 5, 2024 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants