Fix: Agent failed to upgrade from 8.4.2 to 8.5.0 BC1 for MAC 12 agent using agent binary. #1392

Merged
merged 2 commits into from
Oct 3, 2022

Conversation

aleksmaus
Member

What does this PR do?

  • Reverts the previous revert commit, restoring the elastic-agent app bundle for macOS.
  • Implements the workaround for upgrades from a pre-8.5 version to 8.5 (assuming that this fix can go into 8.5).

This workaround is ONLY activated when upgrading from a pre-8.5 version to 8.5+ on the macOS distribution.
The workaround is not used for, and does not affect, a clean install, nor is it used for any upgrades after 8.5.

For upgrades after 8.5 the symlink will be correctly established by the upgrade handler:
Screen Shot 2022-10-01 at 2 37 20 PM

Here is how the workaround works for, let's say, the upgrade from 8.4 to 8.5.

  1. The 8.5 package is bundled with an additional executable shell script that is unpacked into what was previously the elastic-agent binary location. The sole purpose of the shell script is to fix the symlink created by the pre-8.5 versions of the upgrade handler.

Screen Shot 2022-10-02 at 12 07 28 PM

  2. When the upgrade from 8.4 is initiated, the 8.5 package is unpacked, and the 8.4 upgrade handler creates the symlink to this new elastic-agent shell script.
  3. The 8.5 elastic agent is started: launchd executes the elastic-agent shell script, which fixes up the symlink to point correctly to the binary inside of the elastic agent app bundle, and exits.
  4. launchd restarts the 8.5 agent if it exits (KeepAlive is set to true), and this time the symlink correctly points to the binary inside of the bundle.
  5. The actual elastic-agent comes up and the upgrade completes.
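
The symlink fix-up in the steps above is done by a shell script in this PR. Purely as an illustration, the same idea can be sketched in Go. The names `fixSymlink` and `runDemo` and the temporary directory layout are assumptions for the example, not code from this change. The sketch creates the new link under a temporary name and renames it over the stale one, so the link path never goes missing in between.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// fixSymlink points linkPath at target. The new link is created under a
// temporary name and renamed over linkPath, replacing the old link in one step.
// Hypothetical helper for illustration; the PR performs this in a shell script.
func fixSymlink(linkPath, target string) error {
	tmp := linkPath + ".tmp"
	_ = os.Remove(tmp) // a leftover from an earlier attempt may or may not exist
	if err := os.Symlink(target, tmp); err != nil {
		return fmt.Errorf("create temporary symlink: %w", err)
	}
	if err := os.Rename(tmp, linkPath); err != nil {
		_ = os.Remove(tmp)
		return fmt.Errorf("replace symlink: %w", err)
	}
	return nil
}

// runDemo builds a throwaway layout in a temp directory: a link that points at
// a stand-in wrapper script, then repoints it at a stand-in binary inside a
// bundle-like directory. It returns the link target after the fix and the
// expected target.
func runDemo() (string, string, error) {
	dir, err := os.MkdirTemp("", "symlink-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	wrapper := filepath.Join(dir, "wrapper.sh")
	bundleDir := filepath.Join(dir, "elastic-agent.app", "Contents", "MacOS")
	binary := filepath.Join(bundleDir, "elastic-agent")
	link := filepath.Join(dir, "elastic-agent")

	if err := os.MkdirAll(bundleDir, 0o755); err != nil {
		return "", "", err
	}
	for _, f := range []string{wrapper, binary} {
		if err := os.WriteFile(f, []byte("stub"), 0o755); err != nil {
			return "", "", err
		}
	}
	// State left behind by the old upgrade path: the link targets the script.
	if err := os.Symlink(wrapper, link); err != nil {
		return "", "", err
	}
	if err := fixSymlink(link, binary); err != nil {
		return "", "", err
	}
	got, err := os.Readlink(link)
	return got, binary, err
}

func main() {
	got, want, err := runDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("link repointed:", got == want)
}
```

The rename-over approach is a design choice for the sketch; whether the actual script removes and recreates the link or replaces it in place is not stated in this conversation.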

I tested the upgrades on all 3 platforms: on macOS, and as a regression test on Linux and Windows:

  1. from 8.4 to 8.5
  2. from 8.5 to 8.6

This would need to be backported to the main branch as well.

Why is it important?

Addresses the issue: "Agent failed to upgrade from 8.4.2 to 8.5.0 BC1 for MAC 12 agent using agent binary."
#1298

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas

How to test this PR locally

I tested these changes with the local builds that had the artifactory download/verify code disabled so I could drop the newer locally built elastic-agent binaries into the downloads folder.

Related issues

Screenshots

MacOS upgrade from 8.4 to 8.5

Screen Shot 2022-10-01 at 8 53 43 PM

Screen Shot 2022-10-01 at 8 56 41 PM

Screen Shot 2022-10-01 at 8 57 55 PM

MacOS upgrade from 8.5 to 8.6
Screen Shot 2022-10-01 at 9 12 55 PM
Screen Shot 2022-10-01 at 9 27 52 PM

Linux (Ubuntu) upgrade from 8.4 to 8.5, regression test
Screen Shot 2022-10-02 at 9 44 10 AM
Screen Shot 2022-10-02 at 9 45 10 AM

Linux (Ubuntu) upgrade from 8.5 to 8.6, regression test
Screen Shot 2022-10-02 at 10 26 08 AM
Screen Shot 2022-10-02 at 10 27 16 AM

Windows (10) upgrade from 8.4 to 8.5, regression test
Screen Shot 2022-10-02 at 11 31 17 AM
Screen Shot 2022-10-02 at 11 32 10 AM

Windows (10) upgrade from 8.5 to 8.6, regression test
Screen Shot 2022-10-02 at 11 44 45 AM
Screen Shot 2022-10-02 at 11 45 37 AM

@aleksmaus aleksmaus added the bug Something isn't working label Oct 2, 2022
@aleksmaus aleksmaus requested a review from a team as a code owner October 2, 2022 16:33
@elasticmachine
Contributor

elasticmachine commented Oct 2, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-10-03T17:12:17.192+0000

  • Duration: 38 min 37 sec

Test stats 🧪

Test Results
Failed 0
Passed 4959
Skipped 17
Total 4976

💚 Flaky test report

Tests succeeded.

@elasticmachine
Contributor

elasticmachine commented Oct 2, 2022

🌐 Coverage report

Name           Metrics % (covered/total)   Diff
Packages       97.333% (73/75)             👍 0.036
Files          67.797% (160/236)           👎 -1.334
Classes        68.182% (315/462)           👎 -0.981
Methods        52.415% (890/1698)          👎 -0.759
Lines          38.703% (9589/24776)        👎 -0.545
Conditionals   100.0% (0/0)                💚

@aleksmaus
Member Author

The linter fails because I touched some files to refactor and extract the app bundle binary path building.

  Error: string `windows` has 3 occurrences, make it a constant (goconst)
  Error: string `.exe` has 2 occurrences, make it a constant (goconst)

Not going to introduce more changes into this PR, in order to keep its scope limited to only the relevant changes. Can fix the linter when backporting to main.
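
For reference, the kind of change goconst asks for can be sketched as follows. This is a minimal illustration, not the follow-up fix itself: the constant names `osWindows` and `exeSuffix` and the helper `binaryName` are assumptions made up for the example.

```go
package main

import "fmt"

// Named constants replace the repeated string literals that goconst flags.
// The names are hypothetical; only the literal values come from the lint output.
const (
	osWindows = "windows" // GOOS value for Windows
	exeSuffix = ".exe"    // executable extension on Windows
	agentName = "elastic-agent"
)

// binaryName returns the agent executable file name for the given GOOS value.
func binaryName(goos string) string {
	if goos == osWindows {
		return agentName + exeSuffix
	}
	return agentName
}

func main() {
	fmt.Println(binaryName("windows"))
	fmt.Println(binaryName("darwin"))
}
```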

@michalpristas
Contributor

/package

@michalpristas
Contributor

michalpristas commented Oct 3, 2022

can you verify

  • service manager does not have an issue executing script (as it is not a signed binary)
  • build signs/notarises binary at the new location properly
    you need to enable some setting in MacOS I forgot the name of. It will complain

errors.M("destination", paths.ShellWrapperPath))
}
}
err = os.Symlink("/Library/Elastic/Agent/elastic-agent", paths.ShellWrapperPath)
Contributor

@michalpristas michalpristas Oct 3, 2022

replace with:
filepath.Join(paths.InstallPath, paths.BinaryName)

Member Author

will leave as is for now, even though the change is trivial, because:

  1. previously this path was hardcoded in the shell wrapper script as is
	// ShellWrapper is the wrapper that is installed.
	ShellWrapper = `#!/bin/sh
exec /Library/Elastic/Agent/elastic-agent $@
`
  2. this is platform-specific for Darwin, so the path separators and the path are fixed

can change (and would have to retest) when merging to the main branch, not for the urgent fix
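
As a sketch of what the reviewer's suggestion amounts to, the symlink target can be assembled from two values instead of being spelled out as one literal. The constants below are stand-ins for `paths.InstallPath` and `paths.BinaryName`; their values are assumed from the wrapper script quoted above, and `agentBinaryPath` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Stand-ins for paths.InstallPath and paths.BinaryName.
// The values are assumed from the hardcoded wrapper script path.
const (
	installPath = "/Library/Elastic/Agent"
	binaryName  = "elastic-agent"
)

// agentBinaryPath builds the symlink target from its two parts.
func agentBinaryPath() string {
	return filepath.Join(installPath, binaryName)
}

func main() {
	fmt.Println(agentBinaryPath())
}
```

On Darwin the result is the same string as the hardcoded path, which is why the author treats the change as safe to defer.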

@michalpristas
Contributor

/packaging

@aleksmaus
Member Author

can you verify

  • service manager does not have an issue executing script (as it is not a signed binary)

launchd had no issues executing the unsigned shell script in any of the previous releases.
If you check the launchd plist for the elastic-agent service, what it executed was a shell script:

/usr/local/bin/elastic-agent

The content of that file was

#!/bin/sh
exec /Library/Elastic/Agent/elastic-agent $@

That was one of the earlier issues with Full Disk Access (FDA): FDA could not be granted by the agent signature, for example, because the root process for launch was /bin/sh.

The commit that was reverted and is restored here also fixed that: /usr/local/bin/elastic-agent is now a symlink instead of the shell script.

Also, in the upgrade testing that I have done so far, the newly introduced script that fixes up the symlink was working fine as well.

  • build signs/notarises binary at the new location properly
    you need to enable some setting in MacOS I forgot the name of. It will complain

The signature of the app bundle was already checked when the app bundle change was introduced back in August: I was able to download the snapshot build from artifactory, check the signature, and use that signature to give FDA to the agent with an MDM policy.

@michalpristas
Contributor

run end-to-end tests

Contributor

@michalpristas michalpristas left a comment

tested

  • upgrade from 8.4 to 8.5
  • from 8.5 to future release
  • rollback from 8.5 to 8.4
    seems ok

let's see how e2e test ends up
lint errors do not appear difficult to fix as well. please address them

@@ -151,6 +177,9 @@ func findDirectory() (string, error) {
// executable path is being reported as being down inside of data path
// move up to directories to perform the copy
sourceDir = filepath.Dir(filepath.Dir(sourceDir))
if runtime.GOOS == darwin {
sourceDir = filepath.Dir(filepath.Dir(filepath.Dir(sourceDir)))
Contributor

can we check we're inside *.app dir?

Member Author

why? the agent binary will always be inside of the app bundle with this change, it's not going to be installed outside of the app bundle.

Contributor

@michalpristas michalpristas Oct 3, 2022

yeah right, if it's inside data it cannot be the root executable.
I was just trying to prevent some unintentional mixup of executables, as this is taken from the os.Executable path. Being defensive on the change so it won't surprise us.

@aleksmaus
Member Author

lint errors do not appear difficult to fix as well. please address them

yeah, they are not difficult to fix, but fixing them widens the scope of the changes. Even seemingly trivial changes need full regression testing. I'm trying to keep the scope of the changes minimal here for 8.5.

@joshdover
Contributor

e2e tests

@aleksmaus
Member Author

Got in touch on the details of the JAMF install. It looks like it's always a new install with JAMF (at least with our IT), followed by upgrades through the Fleet path. So this approach should work OK without any updates for JAMF.

Member

@cmacknz cmacknz left a comment

LGTM, my primary concerns are mostly comments/documentation and testing. We can handle those as follow up changes if you want to get this merged to unblock further manual testing given where we are in the release cycle.

@@ -172,10 +177,15 @@ func SetInstall(path string) {
// initialTop returns the initial top-level path for the binary
//
// When nested in top-level/data/elastic-agent-${hash}/ the result is top-level/.
// The agent fexecutable for MacOS is wrappend in the bundle, so the path to the binary is
Member

Suggested change
// The agent fexecutable for MacOS is wrappend in the bundle, so the path to the binary is
// The agent executable for MacOS is wrapped in the bundle, so the path to the binary is

@@ -28,13 +37,21 @@ func RunningInstalled() bool {
// executable path is being reported as being down inside of data path
// move up to directories to perform the comparison
execDir = filepath.Dir(filepath.Dir(execDir))
if runtime.GOOS == darwin {
execDir = filepath.Dir(filepath.Dir(filepath.Dir(execDir)))
Member

Can you add a comment here documenting that you are stripping off the elastic-agent.app/Contents/MacOS paths? Either that or make a small helper function for this, I see the same logic in paths/common.go.

Those who haven't memorized the application bundle structure on Mac might not recognize what this is doing.
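
A small helper along the lines suggested here might look like the sketch below. It is an assumption-laden illustration, not the code that was merged: `stripBundleDir` and `bundleSuffix` are hypothetical names. Unlike three unconditional `filepath.Dir` calls, it only strips the path when the directory really ends in `elastic-agent.app/Contents/MacOS`, which also covers the earlier question about checking that the executable is inside the `*.app` directory.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// bundleSuffix is the directory chain the macOS app bundle adds above the binary.
var bundleSuffix = filepath.Join("elastic-agent.app", "Contents", "MacOS")

// stripBundleDir returns dir with a trailing elastic-agent.app/Contents/MacOS
// removed. A directory that does not end with that chain is returned unchanged.
// Hypothetical helper; the name is not taken from the PR.
func stripBundleDir(dir string) string {
	sep := string(filepath.Separator)
	suffix := sep + bundleSuffix
	clean := filepath.Clean(dir)
	trimmed := strings.TrimSuffix(clean, suffix)
	if trimmed == clean {
		return dir // not inside the app bundle
	}
	if trimmed == "" {
		return sep // the bundle sits directly under the root
	}
	return trimmed
}

func main() {
	inBundle := "/Library/Elastic/Agent/data/elastic-agent-abc123/elastic-agent.app/Contents/MacOS"
	fmt.Println(stripBundleDir(inBundle))
	fmt.Println(stripBundleDir("/opt/Elastic/Agent/data/elastic-agent-abc123"))
}
```

Keeping this in one function would let both call sites share it, so the bundle layout is documented in a single place.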

)

func TestIsInsideData(t *testing.T) {
validExePath := paths.BinaryDir(filepath.Join("data", fmt.Sprintf("elastic-agent-%s", release.ShortCommit())))
Member

Should you explicitly include a case that contains the MacOS bundle path?

@@ -23,7 +23,8 @@ func ChangeSymlink(ctx context.Context, log *logger.Logger, targetHash string) e
hashedDir := fmt.Sprintf("%s-%s", agentName, targetHash)

symlinkPath := filepath.Join(paths.Top(), agentName)
- newPath := filepath.Join(paths.Top(), "data", hashedDir, agentName)

+ newPath := paths.BinaryPath(filepath.Join(paths.Top(), "data", hashedDir), agentName)
Member

If we forget paths.BinaryPath here (or elsewhere) to wrap this is there a test that fails? If not, is there a unit test we could write?

I am concerned the need for this won't be obvious to the next engineer to touch this code, and it could be forgotten or removed in the future.
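
One possible shape for such a unit test is sketched below. It is a guess at the idea rather than the project's code: the real `paths.BinaryPath` takes a base directory and a name, while the hypothetical `binaryPath` here also takes the GOOS value explicitly so that the Darwin case can be exercised on any platform. The commit hash `abc123` is a placeholder.

```go
package main

import (
	"fmt"
	"path/filepath"
)

const goosDarwin = "darwin"

// binaryDir returns the directory that holds the agent binary under baseDir.
// On Darwin the binary lives inside the app bundle.
func binaryDir(goos, baseDir string) string {
	if goos == goosDarwin {
		return filepath.Join(baseDir, "elastic-agent.app", "Contents", "MacOS")
	}
	return baseDir
}

// binaryPath returns the full path of the named binary under baseDir.
func binaryPath(goos, baseDir, name string) string {
	return filepath.Join(binaryDir(goos, baseDir), name)
}

// checkBinaryPath is the kind of table-driven check a unit test could run.
func checkBinaryPath() error {
	base := filepath.Join("top", "data", "elastic-agent-abc123")
	cases := []struct {
		goos string
		want string
	}{
		{"darwin", filepath.Join(base, "elastic-agent.app", "Contents", "MacOS", "elastic-agent")},
		{"linux", filepath.Join(base, "elastic-agent")},
	}
	for _, c := range cases {
		if got := binaryPath(c.goos, base, "elastic-agent"); got != c.want {
			return fmt.Errorf("%s: got %q, want %q", c.goos, got, c.want)
		}
	}
	return nil
}

func main() {
	if err := checkBinaryPath(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```

A check like this pins down the helper itself. It would not catch a call site that forgets to use the helper, which is the gap the e2e upgrade tests currently cover.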

Member

I see the same thing in service_darwin.go, so this seems to be necessary in lots of places now but I am not sure if anything enforces the need for it.

func invokeCmd(topPath string) *exec.Cmd {
	homeExePath := paths.BinaryPath(topPath, agentName)

Member Author

Yeah, currently the logic is scattered, so we call these wrapper funcs in a few different places. We are relying on e2e tests to catch any issues, more or less.
It's the same issue as with the code before this change, actually: nothing enforces that the right path is used.
That's how I found all of the places: changing it in one place and finding the next places where it breaks while running e2e upgrades.
I'll address the comments when backporting to the main branch.

err,
fmt.Sprintf("failed to write shell wrapper (%s)", paths.ShellWrapperPath),
errors.M("destination", paths.ShellWrapperPath))
// Install symlink for darwin instead
Member

Can we explain why we need to do this? Let's document the reason we need this symlink in the code or in the templates/darwin/elastic-agent.tmpl script itself for future engineers to understand why this was necessary.

Member Author

Before this change, the agent used the shell script wrapper to start the agent. This posed an issue with FDA because the root process of the launchd service was sh instead of elastic-agent, so FDA had to be granted to /bin/sh. Here we replace that wrapper shell script with a symlink, so the root process of the service is elastic-agent. Will add more comments here for the main branch backports.

@aleksmaus
Member Author

looks like e2e failed on setup

[2022-10-03T15:36:34.247Z] make: *** [Makefile:150: setup-node] Error 2
Complete output from command python setup.py egg_info:

    =============================DEBUG ASSISTANCE==========================
    If you are seeing an error here please try the following to
    successfully install cryptography:

    Upgrade to the latest pip and try again. This will fix errors for most
    users. See: https://pip.pypa.io/en/stable/installing/#upgrading-pip
    =============================DEBUG ASSISTANCE==========================

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-build-oplzx3nh/bcrypt/setup.py", line 11, in <module>
    from setuptools_rust import RustExtension
ModuleNotFoundError: No module named 'setuptools_rust'

---------------

@aleksmaus
Member Author

/test
