barkbay commented Jul 30, 2025

Update stack versions in recipes/samples and e2e matrix.

barkbay added the >docs, exclude-from-release-notes, and v3.1.0 labels Jul 30, 2025
package:
  name: system
-  version: 9.0.0
+  version: 9.1.0
Contributor Author

I have to check this one, IIUC Agent packages have a different lifecycle.

Contributor Author

The latest version of the system package seems to be 2.5.1, so I'm not sure why we have the stack version here. I'm wondering whether this is a bug in update-stack-version.sh.
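
To illustrate how this kind of bug could arise, here is a minimal sketch. It is not the logic of update-stack-version.sh, which is not shown in this thread; the helper names and the sample manifest are hypothetical. A blanket substitution of the old stack version also rewrites any package version that happens to hold the same string, while a slightly more careful pass leaves lines nested under a `package:` key alone.

```python
def bump_naive(text: str, old: str, new: str) -> str:
    """Replace every occurrence of the old version string (hypothetical naive bump)."""
    return text.replace(old, new)


def bump_skipping_packages(text: str, old: str, new: str) -> str:
    """Replace the old version, except on lines nested under a `package:` key."""
    out = []
    package_indent = None  # indentation of the enclosing `package:` key, if any
    for line in text.splitlines():
        stripped = line.lstrip()
        indent = len(line) - len(stripped)
        # A non-empty line at the same or a shallower indent closes the package block.
        if package_indent is not None and stripped and indent <= package_indent:
            package_indent = None
        if stripped.startswith("package:"):
            package_indent = indent
        elif package_indent is None:
            line = line.replace(old, new)
        out.append(line)
    return "\n".join(out)


# Hypothetical manifest: one stack version and one integration package version.
MANIFEST = """\
spec:
  version: 9.0.0
  config:
    package:
      name: system
      version: 9.0.0
"""

naive = bump_naive(MANIFEST, "9.0.0", "9.1.0")
careful = bump_skipping_packages(MANIFEST, "9.0.0", "9.1.0")
```

With the sample above, the naive pass rewrites both `version:` lines, whereas the second pass only touches the one outside the `package:` block.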

image

prodsecmachine commented Jul 30, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

mixed:
- E2E_STACK_VERSION: "8.18.0"
# current stack version 9.0.0 is tested in all other tests no need to test it again
- E2E_STACK_VERSION: "8.19.0-SNAPSHOT"
Collaborator

Suggested change
-  - E2E_STACK_VERSION: "8.19.0-SNAPSHOT"
+  - E2E_STACK_VERSION: "8.19.0"

8.19.0 should have been released in tandem with 9.1.0 I believe.

pebrc commented Jul 30, 2025

I guess the other question is whether we would like to replace 8.18 with 8.19 on L6, and keep 8.19.0-SNAPSHOT here for future patch releases of the 8.19 branch.

Contributor Author

> I guess the other question is whether we would like to replace 8.18 with 8.19 on L6, and keep 8.19.0-SNAPSHOT here for future patch releases of the 8.19 branch.

This was my original plan; I forgot to update 8.18.0 on L6.

barkbay commented Jul 31, 2025

buildkite test this -f p=kind -m s=9.1.0,s=9.2.0-SNAPSHOT

barkbay commented Aug 1, 2025

TestFleetKubernetesNonRootIntegrationRecipe/ES_data_should_pass_validations ~ kind-9-1-0
=== RUN   TestFleetKubernetesNonRootIntegrationRecipe/ES_data_should_pass_validations
Retries (15m0s timeout): ..........................................................................................................................................................................................................................................................................................................
    step.go:51: 
        	Error Trace:	/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/utils.go:94
        	Error:      	Received unexpected error:
        	            	elasticsearch client failed for https://elasticsearch-rss8-es-default-2.elasticsearch-rss8-es-default.e2e-h4gpy-mercury:9200/_data_stream/logs-elastic_agent-default: 404 Not Found: {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [logs-elastic_agent-default] Type:index_not_found_exception StackTrace: RootCause:[{Reason:no such index [logs-elastic_agent-default] Type:index_not_found_exception}]}}
        	Test:       	TestFleetKubernetesNonRootIntegrationRecipe/ES_data_should_pass_validations
TestFleetKubernetesNonRootIntegrationRecipe/ES_data_should_pass_validations ~ kind-9-2-0-snaps
=== RUN   TestFleetKubernetesNonRootIntegrationRecipe/ES_data_should_pass_validations
Retries (15m0s timeout): ..........................................................................................................................................................................................................................................................................................................
    step.go:51: 
        	Error Trace:	/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/utils.go:94
        	Error:      	Received unexpected error:
        	            	elasticsearch client failed for https://elasticsearch-4mbv-es-default-1.elasticsearch-4mbv-es-default.e2e-86oa9-mercury:9200/_data_stream/logs-elastic_agent-default: 404 Not Found: {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [logs-elastic_agent-default] Type:index_not_found_exception StackTrace: RootCause:[{Reason:no such index [logs-elastic_agent-default] Type:index_not_found_exception}]}}
        	Test:       	TestFleetKubernetesNonRootIntegrationRecipe/ES_data_should_pass_validations

barkbay commented Aug 1, 2025

For 9.1.0, Fleet Server and Agents cannot connect to ES:

{
    "log.level": "error",
    "@timestamp": "2025-07-31T07:59:16.798Z",
    "message": "Error dialing x509: certificate signed by unknown authority",
    "component": {
        "binary": "metricbeat",
        "dataset": "elastic_agent.metricbeat",
        "id": "beat/metrics-monitoring",
        "type": "beat/metrics"
    },
    "log": {
        "source": "beat/metrics-monitoring"
    },
    "service.name": "metricbeat",
    "ecs.version": "1.6.0",
    "network.transport": "tcp",
    "server.address": "elasticsearch-rss8-es-http.e2e-h4gpy-mercury.svc:9200",
    "log.logger": "elasticsearch.esclientleg",
    "log.origin": {
        "file.line": 39,
        "file.name": "transport/logging.go",
        "function": "github.com/elastic/elastic-agent-libs/transport/httpcommon.(*HTTPTransportSettings).RoundTripper.LoggingDialer.func2"
    }
}
{
    "log.level": "error",
    "@timestamp": "2025-07-31T07:59:18.667Z",
    "message": "http: TLS handshake error from 10.244.3.1:51158: remote error: tls: bad certificate\n",
    "component": {
        "binary": "fleet-server",
        "dataset": "elastic_agent.fleet_server",
        "id": "fleet-server-default",
        "type": "fleet-server"
    },
    "log": {
        "source": "fleet-server-default"
    },
    "ecs.version": "1.6.0",
    "service.name": "fleet-server",
    "service.type": "fleet-server"
}

I'll check if it is also the case for 9.2.0, and double check if TestFleetKubernetesIntegrationRecipe is also failing.

Edit:

  • Same issue for 9.2.0
  • TestFleetKubernetesIntegrationRecipe does not seem to be affected

barkbay commented Aug 1, 2025

I can reproduce with 9.0.3 but I can't with 9.0.0, so something has changed either in Kibana or in Agent between these 2 versions. (I'll check other patch releases)

barkbay commented Aug 1, 2025

The problem

  • This config (in version 9.0.3) has 3 Agents + 1 Fleet server.
  • That exact same configuration seems to be working as expected with version 9.0.0
  • With 9.0.3 I can see all the Agents being healthy in Kibana while they are all endlessly reporting in their logs:
{
    "log.level": "error",
    "@timestamp": "2025-08-01T08:45:48.269Z",
    "message": "Failed to connect to backoff(elasticsearch(https://elasticsearch-nb68-es-http.e2e-mercury.svc:9200)): Get \"https://elasticsearch-nb68-es-http.e2e-mercury.svc:9200\": x509: certificate signed by unknown authority",
    "component": {
        "binary": "metricbeat",
        "dataset": "elastic_agent.metricbeat",
        "id": "kubernetes/metrics-default",
        "type": "kubernetes/metrics"
    },
    "log": {
        "source": "kubernetes/metrics-default"
    },
    "log.logger": "publisher_pipeline_output",
    "log.origin": {
        "file.line": 149,
        "file.name": "pipeline/client_worker.go",
        "function": "github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run"
    },
    "service.name": "metricbeat",
    "ecs.version": "1.6.0"
}
image

Side note, this warning is interesting:

image

Kibana configuration

Kibana is configured with:

    xpack.fleet.agentPolicies:
    - id: eck-fleet-server
      is_managed: true
      monitoring_enabled:
      - logs
      - metrics
      name: Fleet Server on ECK policy
      namespace: default
      package_policies:
      - id: fleet_server-1
        name: fleet_server-1
        package:
          name: fleet_server
      unenroll_timeout: 900
    - id: eck-agent
      is_managed: true
      monitoring_enabled:
      - logs
      - metrics
      name: Elastic Agent on ECK policy
      namespace: default
      package_policies:
      - name: system-1
        package:
          name: system
      - name: kubernetes-1
        package:
          name: kubernetes
      unenroll_timeout: 900
    xpack.fleet.agents.fleet_server.hosts:
    - https://fleet-server-nb68-agent-http.e2e-mercury.svc:8220
    xpack.fleet.outputs:
    - hosts:
      - https://elasticsearch-nb68-es-http.e2e-mercury.svc:9200
      id: eck-fleet-agent-output-elasticsearch
      is_default: true
      name: eck-elasticsearch
      ssl:
        certificate_authorities:
        - /mnt/elastic-internal/elasticsearch-association/e2e-mercury/elasticsearch-nb68/certs/ca.crt
      type: elasticsearch
    xpack.fleet.packages:
    - name: system
      version: latest
    - name: elastic_agent
      version: latest
    - name: fleet_server
      version: latest
    - name: kubernetes
      version: latest

This is what the output looks like in Kibana:

image

Elasticsearch CA in Kibana is valid

I checked the CA inside Fleet and it is valid:

curl -u elastic:REDACTED  --cacert /mnt/elastic-internal/elasticsearch-association/e2e-mercury/elasticsearch-nb68/certs/ca.crt https://elasticsearch-nb68-es-http.e2e-mercury.svc:9200
{
  "name" : "elasticsearch-nb68-es-default-0",
  "cluster_name" : "elasticsearch-nb68",
  "cluster_uuid" : "rqakRG9hSrSW0AUS6e-7Zg",
  "version" : {
    "number" : "9.0.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "cc7302afc8499e83262ba2ceaa96451681f0609d",
    "build_date" : "2025-06-18T22:09:56.772581489Z",
    "build_snapshot" : false,
    "lucene_version" : "10.1.0",
    "minimum_wire_compatibility_version" : "8.18.0",
    "minimum_index_compatibility_version" : "8.0.0"
  },
  "tagline" : "You Know, for Search"
}

If the configured CA is valid, why do we have these certificate errors?
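
One possible way to narrow this down is to check whether the CA bundle that Kibana is configured with and the bundle the failing client actually loads contain the same certificates. The following is only a sketch under that assumption; the function names are hypothetical and nothing here was run as part of this investigation. It compares SHA-256 fingerprints of the certificates in two PEM bundles using only the Python standard library.

```python
import hashlib
import re
import ssl

# Matches each PEM-encoded certificate inside a bundle.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def ca_fingerprints(pem_bundle: str) -> set:
    """Return the SHA-256 fingerprints of every certificate in a PEM bundle."""
    return {
        hashlib.sha256(ssl.PEM_cert_to_DER_cert(block)).hexdigest()
        for block in PEM_BLOCK.findall(pem_bundle)
    }


def same_trust_anchors(pem_a: str, pem_b: str) -> bool:
    """True when both bundles contain exactly the same certificates."""
    return ca_fingerprints(pem_a) == ca_fingerprints(pem_b)
```

In practice the two inputs would be the contents of the `ca.crt` referenced in the Fleet output above and whatever CA file the Agent container ends up using; the latter path is not known from this thread.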

barkbay commented Aug 4, 2025

For 9.0.x and 9.1.x I believe this is going to be fixed by elastic/kibana#230370 and elastic/kibana#230371. In the meantime I'm going to update the code to skip impacted versions for that test.
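
A minimal sketch of what skipping impacted versions could look like, written in Python for brevity; the actual e2e suite is Go and its helpers are not shown here. The function names are hypothetical, and the boundaries passed in the example (first failing version, first fixed version) are placeholders for illustration, not confirmed values.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH[-SNAPSHOT]' into a comparable tuple of ints."""
    core = version.split("-", 1)[0]  # drop a suffix such as -SNAPSHOT
    major, minor, patch = (int(part) for part in core.split("."))
    return (major, minor, patch)


def should_skip(
    stack_version: str, first_bad: str, first_fixed: Optional[str] = None
) -> bool:
    """Skip when first_bad <= stack_version < first_fixed (open-ended if no fix yet)."""
    current = parse_version(stack_version)
    if current < parse_version(first_bad):
        return False
    return first_fixed is None or current < parse_version(first_fixed)


# Placeholder boundary: 9.0.0 passed and 9.0.3 failed in the runs reported above.
skip_9_0_0 = should_skip("9.0.0", first_bad="9.0.3")
skip_snapshot = should_skip("9.2.0-SNAPSHOT", first_bad="9.0.3")
```

Keeping `first_fixed` optional lets the skip stay open-ended until a fixed release is known, at which point the upper bound can be filled in so the test runs again on newer versions.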

barkbay commented Aug 5, 2025

buildkite test this -f p=kind,t=TestFleetKubernetesNonRootIntegrationRecipe -m s=9.1.0,s=9.2.0-SNAPSHOT

barkbay commented Aug 6, 2025

My understanding is that the latest Kibana snapshot (docker.elastic.co/kibana/kibana:9.2.0-SNAPSHOT), built yesterday, should include elastic/kibana#230211.

I'll retry TestFleetKubernetesNonRootIntegrationRecipe.

barkbay commented Aug 6, 2025

buildkite test this -f p=kind,t=TestFleetKubernetesNonRootIntegrationRecipe -m s=9.1.0,s=9.2.0-SNAPSHOT

barkbay commented Aug 6, 2025

Looks like this is still not fixed 😞

barkbay commented Aug 7, 2025

We still have the same error with 9.2.0-SNAPSHOT:

{"log.level":"error","@timestamp":"2025-08-06T12:53:38.105Z","message":"Error dialing x509: certificate signed by unknown authority","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"ecs.version":"1.6.0","log.origin":{"file.line":39,"file.name":"transport/logging.go","function":"github.com/elastic/elastic-agent-libs/transport/httpcommon.(*HTTPTransportSettings).RoundTripper.LoggingDialer.func2"},"log.logger":"elasticsearch.esclientleg","service.name":"filebeat","network.transport":"tcp","server.address":"elasticsearch-b7m5-es-http.e2e-r2fx9-mercury.svc:9200","ecs.version":"1.6.0"}
...
{"log.level":"error","@timestamp":"2025-08-06T13:09:58.868Z","message":"Error dialing x509: certificate signed by unknown authority","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"elasticsearch.esclientleg","service.name":"filebeat","network.transport":"tcp","log.origin":{"file.line":39,"file.name":"transport/logging.go","function":"github.com/elastic/elastic-agent-libs/transport/httpcommon.(*HTTPTransportSettings).RoundTripper.LoggingDialer.func2"},"server.address":"elasticsearch-b7m5-es-http.e2e-r2fx9-mercury.svc:9200","ecs.version":"1.6.0","ecs.version":"1.6.0"}

Hi @juliaElastic 👋 , could you please confirm that this should be fixed in 9.2.0-SNAPSHOT? Thanks 🙇
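
When comparing runs like these, it can help to count the x509 errors per component rather than reading the raw stream. A minimal sketch, assuming the Agent logs are available one JSON object per line as in the excerpt above; the function name and the shortened sample lines are hypothetical.

```python
import json
from collections import Counter


def count_x509_errors(lines):
    """Count error-level log entries mentioning x509, keyed by component binary."""
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as "..."
        if not isinstance(entry, dict) or entry.get("log.level") != "error":
            continue
        if "x509" in entry.get("message", ""):
            binary = entry.get("component", {}).get("binary", "unknown")
            counts[binary] += 1
    return counts


# Shortened, hypothetical sample in the same shape as the log lines above.
SAMPLE_LINES = [
    '{"log.level":"error","message":"Error dialing x509: certificate signed by unknown authority","component":{"binary":"filebeat"}}',
    "...",
    '{"log.level":"info","message":"ok","component":{"binary":"metricbeat"}}',
    '{"log.level":"error","message":"Error dialing x509: certificate signed by unknown authority","component":{"binary":"filebeat"}}',
]

x509_counts = count_x509_errors(SAMPLE_LINES)
```

Duplicate keys such as the repeated `ecs.version` in the real log lines do not break parsing; Python's `json` module keeps the last value.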

barkbay commented Aug 8, 2025

buildkite test this -f p=kind,t=TestFleetKubernetesNonRootIntegrationRecipe -m s=9.1.1,s=9.2.0-SNAPSHOT

barkbay commented Aug 21, 2025

buildkite test this -f p=kind,t=TestFleetKubernetesNonRootIntegrationRecipe -m s=9.1.1,s=9.2.0-SNAPSHOT

barkbay force-pushed the 3.1.0/update-stack-version branch from 0df0068 to 62e9120 August 25, 2025 07:33
barkbay commented Aug 25, 2025

buildkite test this -f p=kind,t=TestFleetKubernetesNonRootIntegrationRecipe -m s=9.1.2,s=9.2.0-SNAPSHOT

barkbay commented Aug 25, 2025

buildkite test this -f p=kind,t=TestFleetKubernetesNonRootIntegrationRecipe -m s=9.0.5

barkbay commented Aug 25, 2025

buildkite test this -f p=kind,t=TestFleetKubernetesNonRootIntegrationRecipe -m s=9.1.2,9.0.5

barkbay commented Aug 25, 2025

buildkite test this -f p=kind,t=TestFleetKubernetesNonRootIntegrationRecipe -m s=9.1.2,s=9.0.5

barkbay commented Aug 25, 2025

buildkite test this -f p=kind,t=TestFleetKubernetesNonRootIntegrationRecipe -m s=9.0.5,s=9.1.2,s=9.2.0-SNAPSHOT

barkbay commented Aug 25, 2025

Running the tests again with the correct versions:

  • 9.0.5 ✅
  • 9.1.2 ❌
  • 9.2.0-SNAPSHOT (sha256:41d3bbaf75d7866dc4cb3a2914ce50ccd76df3d3854c3c87ff8e2e213097bdbe): ❌

For both 9.1.2 and 9.2.0 I can see the following in Agent logs:

Get \"https://fleet-server-nbkp-agent-http.e2e-i0oyf-mercury.svc:8220/api/status?\": x509: certificate signed by unknown authority"

This seems to be a known issue on the Kibana side, where the Elasticsearch SSL config is copied to the Agent (I'm not sure this issue is tracked on GitHub, though).

barkbay commented Sep 3, 2025

@eedugon kindly created elastic/kibana#233780 to track the bug in Kibana.

barkbay commented Sep 8, 2025

I'm going to disable this test. We are currently not running the e2e tests with the latest stack version, which is not great. I'll create an issue to make sure we enable it again before the next release.

barkbay commented Sep 8, 2025

buildkite test this -f p=kind,t=TestFleetKubernetesNonRootIntegrationRecipe -m s=9.0.5,s=9.1.2,s=9.2.0-SNAPSHOT

barkbay merged commit 64e41ad into elastic:main Sep 8, 2025
9 checks passed