@artem-shelkovnikov
Member

@artem-shelkovnikov artem-shelkovnikov commented Nov 10, 2025

What does this PR do?

See: https://elastic.slack.com/archives/C0JFN9HJL/p1762485477102909 - there is a problem producing DRA artefacts for agentless images. The bug happens because the connectors were restructured and some folder names changed. In this case the name of the directory extracted from the DRA changed - and this PR updates it.

Why is it important?

DRA is broken, and we need to fix it.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Unfortunately, I'm not sure.

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@mergify
Contributor

mergify bot commented Nov 10, 2025

This pull request does not have a backport label. Could you fix it @artem-shelkovnikov? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@pierrehilbert added the Team:Elastic-Agent-Control-Plane label Nov 10, 2025
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pierrehilbert
Contributor

I can still see a reference to the old naming here:

rootDir: elasticsearch_connectors-{{beat_version}}

Should we also change it?

@artem-shelkovnikov
Member Author

artem-shelkovnikov commented Nov 10, 2025

Thanks @pierrehilbert!

I'm actually not sure how this part works though - we do the unpacking of a zip in dev-tools/packaging/templates/docker/Dockerfile.elastic-agent.tmpl - what do we do in dev-tools/packaging/packages.yml?

I've made the change, but I'm not sure what effect it has.

@ebeahan
Member

ebeahan commented Nov 10, 2025

I believe the build is failing because .package-version on main is still pinned to a snapshot taken before the dir name changes were applied in connectors:

make: *** /usr/share/connectors: Not a directory. Stop.

The GH action won't trigger an update until a new unified release snapshot is produced on main. But we can't update Unified Release because we need to reflect the dir name changes. So, chicken/egg problem.

We could force this change through with broken CI (leaving main broken) if we feel very confident we've made all the changes necessary for the snapshot to succeed. Once the Unified Snapshot completes, we trigger a .package-version update, which should resolve the broken main.

@pierrehilbert @blakerouse @michalpristas @swiatekm @pchila - anyone have a better way to resolve this situation?

@artem-shelkovnikov
Member Author

Alternatively, on the connectors side we can revert the change, fix everything, and re-apply it once we're sure that it is compatible with agentless, but that won't be trivial either IMO.

@blakerouse
Contributor

@ebeahan I believe your path forward is the best path. I don't like it, but it seems like the simplest path without wasting time.

blakerouse
blakerouse previously approved these changes Nov 10, 2025
Contributor

@blakerouse blakerouse left a comment

This change looks correct. Won't really know until merged and a new DRA is produced, which then can bump the .package-version.

I am good with this change, to get this in.

@pchila
Member

pchila commented Nov 11, 2025

@pierrehilbert @blakerouse @michalpristas @swiatekm @pchila - anyone have a better way to resolve this situation?

@ebeahan
.package-version is meant to isolate elastic-agent CI from breakages happening in dependencies until a new unified build is produced and dependencies can be bumped (in this sense it worked perfectly as elastic-agent CI is still chugging along happily).

We can disable the pinning and package "bleeding edge" versions of the dependencies in CI builds by setting USE_PACKAGE_VERSION to false, then re-enable it after the next unified build.
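
As a rough illustration of the pinning toggle described above, here is a minimal Python sketch. It is not the elastic-agent implementation: the JSON shape of the pin file, the `build_id` key, the default when the variable is unset, and the function names are all assumptions made for the example.

```python
import json


def use_pinned_versions(env: dict) -> bool:
    """Return True when dependency pinning is enabled.

    Assumption: pinning stays on unless USE_PACKAGE_VERSION is set to
    something other than "true". The real default may differ.
    """
    return env.get("USE_PACKAGE_VERSION", "true").strip().lower() == "true"


def resolve_build(env: dict, pinned_json: str, latest_build: str) -> str:
    """Pick the dependency build to package.

    `pinned_json` stands in for the contents of a pin file such as
    .package-version; its "build_id" key is hypothetical.
    """
    if use_pinned_versions(env):
        return json.loads(pinned_json)["build_id"]
    # Pinning disabled: take whatever the newest available build is.
    return latest_build


pinned = '{"build_id": "pinned-build"}'
with_pin = resolve_build({}, pinned, "latest-build")
without_pin = resolve_build({"USE_PACKAGE_VERSION": "false"}, pinned, "latest-build")
```

With pinning on, CI keeps packaging the last known-good dependency build. With it off, a breaking change in a dependency (such as the connectors restructuring) reaches the agent build immediately.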

@pchila
Member

pchila commented Nov 11, 2025

Thanks @pierrehilbert!

I'm actually not sure how this part works though - we do the unpacking of a zip in dev-tools/packaging/templates/docker/Dockerfile.elastic-agent.tmpl - what do we do in dev-tools/packaging/packages.yml?

I've made the change, but I'm not sure what effect it has.

@artem-shelkovnikov
That block is meant to describe the elastic-agent dependencies for both downloading and (normally) extracting and repackaging.
The connectors dependency is special, as it's not extracted and repackaged the same way as all the other dependencies (so the rootDir is not as important). It would still be good to keep the YAML up to date.
It would be even better if the component YAML definition could be used directly during packaging (although this is more difficult due to the way onboarding connectors has been implemented with a very specific packaging process, different from the other components) but that would require a bigger effort.
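
To make the role of rootDir concrete, the following Python sketch checks whether an archive's top-level entries contain the directory name that a rootDir template would produce. It is an illustration only: the template is rendered with plain string substitution (the real packaging templates may use a different engine), and the archive contents, including `app/` and `libs/`, are invented for the example.

```python
import io
import zipfile


def render_root_dir(template: str, beat_version: str) -> str:
    # Plain substitution stands in for the real template engine (assumption).
    return template.replace("{{beat_version}}", beat_version)


def top_level_entries(archive_bytes: bytes) -> set:
    """Return the distinct first path components found in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {name.split("/", 1)[0] for name in zf.namelist() if name}


def build_archive(paths) -> bytes:
    """Build an in-memory zip containing empty files at the given paths."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for path in paths:
            zf.writestr(path, "")
    return buf.getvalue()


template = "elasticsearch_connectors-{{beat_version}}"

# Hypothetical layouts: one with the expected root dir, one restructured.
old_archive = build_archive(["elasticsearch_connectors-9.2.1-SNAPSHOT/Makefile"])
new_archive = build_archive(["app/Makefile", "libs/README.md"])

old_ok = render_root_dir(template, "9.2.1-SNAPSHOT") in top_level_entries(old_archive)
new_ok = render_root_dir(template, "9.3.0-SNAPSHOT") in top_level_entries(new_archive)
```

A check of this kind would flag a stale rootDir as soon as the archive layout changes, instead of surfacing later as a packaging failure.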

@artem-shelkovnikov
Member Author

@pchila thanks for the explanation!

What's our plan of action now? Is there anything I can do?

@pchila
Member

pchila commented Nov 11, 2025

@pchila thanks for the explanation!

What's our plan of action now? Is there anything I can do?

@artem-shelkovnikov
You can check whether packaging the latest connectors works by disabling USE_PACKAGE_VERSION, as I mentioned above. This will cause the agent build to pick up the latest connectors package it can find.

@ebeahan
Member

ebeahan commented Nov 11, 2025

Now seeing two different issues in CI:

  1. Also need to set USE_PACKAGE_VERSION="false" here: https://github.com/elastic/elastic-agent/blob/main/.buildkite/scripts/steps/integration-cloud-image-push.sh#L19

  2. By setting USE_PACKAGE_VERSION="false" here, we're now hitting a checksum error: https://github.com/elastic/elastic-agent/blob/main/dev-tools/mage/downloads/utils.go#L101.

Error: FetchProjectBinary failed for agentbeat on windows/amd64: checksum of file agentbeat-9.3.0-SNAPSHOT-windows-x86_64.zip does not match expected checksum of 12d8e8273bd74d05aab4682313225b9493bbb039a2122cf3eaee30f090458b2221eba62a84080cecad934f752d96197639b5cf3407617df97384f6d23bf18cec

edit: add error
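
For context on the checksum failure above: the expected value in the error is 128 hex digits long, which is consistent with SHA-512. The Python sketch below shows the general shape of such a check and how an archive paired with a checksum from a different build fails it. It is not the code in dev-tools/mage/downloads/utils.go; the file name and contents are made up, and the mixed-build scenario is only one possible cause of a mismatch.

```python
import hashlib
import os
import tempfile


def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: str, expected_hex: str) -> bool:
    """Compare the file's SHA-512 digest with the expected hex string."""
    return sha512_of_file(path) == expected_hex.strip().lower()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical artifact; stands in for a downloaded snapshot archive.
    artifact = os.path.join(tmp, "example-artifact.zip")
    with open(artifact, "wb") as fh:
        fh.write(b"contents from build A")

    checksum_build_a = hashlib.sha512(b"contents from build A").hexdigest()
    checksum_build_b = hashlib.sha512(b"contents from build B").hexdigest()

    same_build_ok = verify_checksum(artifact, checksum_build_a)
    mixed_build_ok = verify_checksum(artifact, checksum_build_b)
```

If the archive and its checksum file are fetched at different moments while "latest" moves to a newer snapshot, the pair can come from two builds and the comparison fails even though neither file is corrupt.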

@pchila
Member

pchila commented Nov 11, 2025

Created a PR that should remove the race conditions while downloading packages with USE_PACKAGE_VERSION=false
#11128

@ebeahan
Member

ebeahan commented Nov 11, 2025

We've addressed the bug on the Agent side. Now seeing a new error in packaging:

5.073 make: *** /usr/share/connectors: Not a directory.  Stop.
5.073 make: Entering directory '/'
5.073 make: Leaving directory '/'
------
Dockerfile:132
--------------------
 131 |     COPY --from=home /usr/share/elastic-agent/NOTICE.txt /licenses
 132 | >>> RUN apk add --no-cache git make python-3.11 py3.11-pip && \
 133 | >>>     unzip /usr/share/elastic-agent/data/service/connectors-*.zip -d /usr/share/elastic-agent/data/service && \
 134 | >>>     mv /usr/share/elastic-agent/data/service/connectors-* /usr/share/connectors && \
 135 | >>>     PYTHON=python3.11 make -C /usr/share/connectors clean install install-agent && \
 136 | >>>     chmod 0755 /usr/share/elastic-agent/data/elastic-agent-*/components/connectors
 137 |
--------------------

I compared the inflated zips for 9.2.1-SNAPSHOT and 9.3.0-SNAPSHOT here.

@artem-shelkovnikov It's not just renaming the directory, but the entire directory structure has changed. Is that expected?

Previously, we were able to move the entire dir using mv /<path>/elasticsearch_connectors-*, but since the structure has changed entirely, we'll need your direction on how to update this part of the build.
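
One possible reading of the "Not a directory" error, not confirmed in the thread: if the unzip step no longer produces a top-level connectors-* directory, the connectors-* glob on Dockerfile line 134 can match only the .zip archive itself, so mv renames the archive to /usr/share/connectors and make -C then finds a file where it expects a directory. The Python sketch below reproduces that situation in a temporary directory and shows a stricter move that accepts only a single matching directory. The archive name and the helper `move_extracted_dir` are hypothetical.

```python
import glob
import os
import shutil
import tempfile


def move_extracted_dir(service_dir: str, pattern: str, target: str) -> str:
    """Move the single extracted directory matching `pattern` to `target`.

    Unlike a bare `mv <pattern> <target>`, this ignores plain files (such
    as the zip archive itself) and fails loudly when zero or several
    directories match.
    """
    matches = glob.glob(os.path.join(service_dir, pattern))
    dirs = [m for m in matches if os.path.isdir(m)]
    if len(dirs) != 1:
        raise RuntimeError(
            f"expected exactly one directory for {pattern!r}, "
            f"found {dirs!r} among {matches!r}"
        )
    shutil.move(dirs[0], target)
    return target


with tempfile.TemporaryDirectory() as root:
    service = os.path.join(root, "service")
    os.makedirs(service)

    # Hypothetical state after unzip: the archive is present, but no
    # top-level "connectors-*" directory was extracted.
    archive = os.path.join(service, "connectors-9.3.0-SNAPSHOT.zip")
    open(archive, "w").close()

    matches = glob.glob(os.path.join(service, "connectors-*"))
    only_archive_matched = matches == [archive]

    try:
        move_extracted_dir(service, "connectors-*", os.path.join(root, "connectors"))
        guard_raised = False
    except RuntimeError:
        guard_raised = True
```

Failing at the move step, with the list of matches in the message, points directly at the layout change rather than at the later make invocation.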

@elasticmachine
Copy link
Contributor

elasticmachine commented Nov 12, 2025

💛 Build succeeded, but was flaky

cc @artem-shelkovnikov

@ebeahan
Member

ebeahan commented Nov 12, 2025

Packaging succeeded in CI. We have one remaining failure due to a known flaky test: #10917.

@artem-shelkovnikov any remaining work? I can merge with the unrelated flaky failure.

@artem-shelkovnikov
Member Author

@ebeahan thanks for giving it a look!

I think it's okay, but one question - earlier there was a failure because connectors-py was unable to start in the tests. I've seemingly fixed everything, but I'm worried that I've missed something and that things could break on deployment to Agentless.

Do you think the PR is safe to merge given that the tests for the image passed, or should we do manual testing in some manner?

@pchila
Member

pchila commented Nov 12, 2025

@ebeahan thanks for giving it a look!

I think it's okay, but one question - earlier there was a failure because connectors-py was unable to start in the tests. I've seemingly fixed everything, but I'm worried that I've missed something and that things could break on deployment to Agentless.

Do you think the PR is safe to merge given that the tests for the image passed, or should we do manual testing in some manner?

The only test that checks the connectors-py input type is https://github.com/elastic/elastic-agent/blob/main/testing/integration/k8s/kubernetes_agent_service_test.go, which deploys elastic-agent using the service docker image variant, defines a connectors input via an override, and checks that the agent starts correctly.

So, as far as support for the connectors-py input goes, a green TestKubernetesAgentService means that the agent started correctly. Beyond that, there's not much we can say with just the one test.

@artem-shelkovnikov
Member Author

@pchila thanks! If I want to do integration tests with the image, what's the best way forward? Pulling the artefact from the step locally and running Elasticsearch+Kibana+this image locally?

@pchila
Member

pchila commented Nov 12, 2025

@pchila thanks! If I want to do integration tests with the image, what's the best way forward? Pulling the artefact from the step locally and running Elasticsearch+Kibana+this image locally?

If you want to test the image manually, yes. I am not too sure what the exact setup looks like, but if you have already tested connectors included with elastic-agent in a k8s cluster, the setup should be the same.

@artem-shelkovnikov
Member Author

Thanks, I will give it a manual test tomorrow!

@ebeahan
Member

ebeahan commented Nov 13, 2025

@artem-shelkovnikov we're still blocking unified release snapshots until we fix the agent packaging failures. Are we able to merge these changes to unblock?

@artem-shelkovnikov
Member Author

@ebeahan I spent today trying to test this change with help from the #agentless channel, but was unable to fully test it.

Here are my current thoughts:

I was not able to test this change in a real Agentless environment or with a real Agentless image. If it's okay to merge regardless, let's do it. Otherwise, we need help from somebody who knows how to test it end-to-end.

@ebeahan
Member

ebeahan commented Nov 13, 2025

I was not able to test this change in a real Agentless environment or with a real Agentless image. If it's okay to merge regardless, let's do it. Otherwise, we need help from somebody who knows how to test it end-to-end.

The Agent team discussed internally and feels confident in merging. I'll see if we can get a green CI run and merge.

@ebeahan ebeahan merged commit 8ece666 into elastic:main Nov 13, 2025
21 checks passed
@artem-shelkovnikov
Member Author

@ebeahan how can I learn when the change is available in staging agentless so that I could give it a manual test?

hayotbisonai pushed a commit to hayotbisonai/elastic-agent that referenced this pull request Nov 23, 2025
* Fix dir name for moving elasticsearch connectors

* Also change the pattern in packages.yml

* Try USE_PACKAGE_VERSION=false

* set USE_PACKAGE_VERSION to false

* Update the unzip command to the correct folder

* Update connectors.sh to point to the right dir

---------

Co-authored-by: Eric Beahan <eric.beahan@elastic.co>

Labels

skip-changelog, Team:Elastic-Agent-Control-Plane

6 participants