revert: revert collector version - CORE-352 #3734

Merged
blumamir merged 3 commits into odigos-io:main from tamirdavid1:revert-collector-upgrade
Nov 3, 2025

Conversation

@tamirdavid1
Collaborator

Description

After upgrading the collector to v0.138.0, we started observing frequent out-of-memory (OOM) events that did not occur before the version bump. We are still investigating the root cause; in the meantime, this PR reverts the upgrade.

How Has This Been Tested?

  • Added Unit Tests
  • Updated e2e Tests
  • Manual Testing
  • Manual Load Test

Kubernetes Checklist

  • Changes how Odigos interacts with Kubernetes
  • Introduces additional calls to the API Server (potential performance impact)
  • New Query/feature supported in all the k8s versions supported by Odigos
  • Modifies Odigos manifests (addressed in both CLI and Helm)
  • Changes RBAC permissions

User Facing Changes

  • Users need to take action before upgrading
  • Automatic migration will modify existing objects (backward compatible)
  • Changes UI, CLI, or K8s Manifests aspects in a way that users need to be aware of
  • Documentation updated accordingly

@tamirdavid1 tamirdavid1 changed the title revert: revert collector version revert: revert collector version - CORE-352 Nov 3, 2025
@blumamir blumamir merged commit 27056f6 into odigos-io:main Nov 3, 2025
142 of 153 checks passed
blumamir pushed a commit to blumamir/odigos that referenced this pull request Nov 3, 2025
After upgrading the collector to v0.138.0, we started observing frequent
OOM situations that did not occur before the version bump. We are
continuing to investigate the root cause, but in the meantime, this
change has been reverted.

<!-- Describe the tests you ran and how you verified your changes. -->

- [ ] Added Unit Tests
- [ ] Updated e2e Tests
- [X] Manual Testing
- [X] Manual Load Test

<!-- If this PR affects how Odigos interacts with Kubernetes, check the
relevant boxes below and provide more details -->

- [ ] Changes how Odigos interacts with Kubernetes
- [ ] Introduces additional calls to the API Server (potential
performance impact)
- [ ] New Query/feature supported in all the k8s versions supported by
Odigos
- [ ] Modifies Odigos manifests (addressed in both CLI and Helm)
- [ ] Changes RBAC permissions

<!-- Any changes that users will notice or need to be aware of -->

- [ ] Users need to take action before upgrading
- [ ] Automatic migration will modify existing objects (backward
compatible)
- [ ] Changes UI, CLI, or K8s Manifests aspects in a way that users need
to be aware of
- [ ] Documentation updated accordingly
damemi added a commit that referenced this pull request Feb 4, 2026
…or/otel to 141 + Remove deprecated components + Bump k8s min version to 1.21 (#4111)

The clickhouse exporter supports TLS settings similar to otlp:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/945e5a71ef31793ff3280b28c1425086ea5332b6/exporter/clickhouseexporter/README.md#tls

Some users need these settings to connect to clickhouse, so this PR adds
them as options on the destination.

This adds:

* `insecure_skip_verify`
* `ca_file` (using the k8sconfig interface to mount the secret as a
file, similar to how the GCP exporter supports application default
credentials)

The direct string fields (such as CAPem, CertPem, KeyPem) aren't yet
supported in the clickhouse exporter, so it has to be a mounted file.
See
open-telemetry/opentelemetry-collector-contrib#43911 (comment)
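
As a rough illustration of the two options listed above, the sketch below builds the TLS part of a clickhouse exporter config from two inputs. It assumes the settings sit under a `tls` key, as in the README linked above; the function name and the mount path are hypothetical and are not taken from this PR.

```python
from typing import Optional


def clickhouse_tls_block(insecure_skip_verify: bool = False,
                         ca_file: Optional[str] = None) -> dict:
    """Build a `tls` section for a clickhouse exporter config (illustrative only).

    Only options that were explicitly set are emitted, so an empty dict
    means "leave the exporter's TLS defaults alone".
    """
    tls: dict = {}
    if insecure_skip_verify:
        # TLS stays enabled, but the server certificate is not verified.
        tls["insecure_skip_verify"] = True
    if ca_file:
        # Must be a file path: the CA is mounted from a secret, because
        # inline PEM fields are not yet supported by the exporter.
        tls["ca_file"] = ca_file
    return tls


# Hypothetical mount path, chosen for the example.
print(clickhouse_tls_block(ca_file="/etc/clickhouse/tls/ca.pem"))
```

Emitting only the keys that were set keeps the generated config unchanged for destinations that do not use TLS options.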

---

Doing this required bumping the collector/otel deps to 136, the release
in which TLS config support was added to the clickhouse exporter. That
bump required the following changes:

The settings actually need collector v0.136.0, from
open-telemetry/opentelemetry-collector-contrib#42581
(open-telemetry/opentelemetry-collector-contrib@d9769f7)

Moving to 136 also means removing the loki exporter (removed upstream in 131) 🙃
open-telemetry/opentelemetry-collector-contrib#41413,
see
https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/v0.130.0/exporter/lokiexporter#deprecation-notice
It is replaced with plain otlp. The only destination that appears to
be using the loki exporter is OpsVerse.

The opencensus exporter was likewise removed upstream in 133 by
open-telemetry/opentelemetry-collector-contrib#42239

The same applies to the routing processor:
open-telemetry/opentelemetry-collector-contrib#36616

See previous attempt #3669
(reverted in #3734)

---

Then it turned out that 136 had a bug and did not fully support TLS
settings such as `insecure_skip_verify`. This was fixed in 141, which
required the following extra changes:

This actually needs collector v141 because of a bug in the clickhouse
exporter, which did not handle all TLS settings:
open-telemetry/opentelemetry-collector-contrib#43911
fixed in
open-telemetry/opentelemetry-collector-contrib#44093

Remove deprecated carbon exporter support (unmaintained upstream):
open-telemetry/opentelemetry-collector-contrib#44532

Another upstream breaking change that caused go mod trouble:
open-telemetry/opentelemetry-collector#13948

A configgrpc update:
open-telemetry/opentelemetry-collector#13996

And metadata.yaml metrics now require stability levels:
open-telemetry/opentelemetry-collector#13756

```
Error: failed loading /app/collector/receivers/odigosebpfreceiver/metadata.yaml: decoding failed due to the following error(s):

'telemetry.metrics[ebpf_memory_pressure_wait_time_total]' missing required field: `stability`
'telemetry.metrics[ebpf_total_bytes_read]' missing required field: `stability`
'telemetry.metrics[ebpf_lost_samples]' missing required field: `stability`
Error: failed loading /app/collector/receivers/odigosebpfreceiver/metadata.yaml: decoding failed due to the following error(s):

'telemetry.metrics[ebpf_memory_pressure_wait_time_total]' missing required field: `stability`
'telemetry.metrics[ebpf_total_bytes_read]' missing required field: `stability`
'telemetry.metrics[ebpf_lost_samples]' missing required field: `stability`
Error: metadata.yaml ordering check failed: [telemetry metrics] keys are not sorted: [odigos_log_data_size odigos_metric_data_size odigos_trace_data_size odigos_accepted_spans odigos_accepted_metric_points odigos_accepted_log_records]
Error: metadata.yaml ordering check failed: [telemetry metrics] keys are not sorted: [odigos_log_data_size odigos_metric_data_size odigos_trace_data_size odigos_accepted_spans odigos_accepted_metric_points odigos_accepted_log_records]
```
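
The two failures in the output above (a missing `stability` field and unsorted metric keys) can be reproduced with a small standalone check. This is a minimal sketch, not the upstream generator's code: it assumes a plain lexicographic ordering, and the field values in the sample data are placeholders. Only the metric names are taken from the error output.

```python
def missing_stability(metrics: dict) -> list:
    """Return the names of metrics that have no `stability` field."""
    return [name for name, spec in metrics.items() if "stability" not in spec]


def is_sorted(keys: list) -> bool:
    """True when the keys are in ascending lexicographic order."""
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Placeholder specs; the shape of the `stability` value is an assumption.
ebpf_metrics = {
    "ebpf_lost_samples": {"description": "placeholder"},
    "ebpf_memory_pressure_wait_time_total": {"description": "placeholder"},
    "ebpf_total_bytes_read": {
        "description": "placeholder",
        "stability": {"level": "alpha"},
    },
}

# Key order exactly as reported by the ordering check.
odigos_keys = [
    "odigos_log_data_size",
    "odigos_metric_data_size",
    "odigos_trace_data_size",
    "odigos_accepted_spans",
    "odigos_accepted_metric_points",
    "odigos_accepted_log_records",
]

print("missing stability:", missing_stability(ebpf_metrics))
print("keys sorted:", is_sorted(odigos_keys))
print("expected order:", sorted(odigos_keys))
```

Running a check like this before the generator gives a quick list of which entries in a `metadata.yaml` still need attention.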

This bump also required adding the `endpointslices` permission to the
odiglet service account for the data-collection collector.

---

Finally, endpointslices was not GA in k8s 1.20. This PR bumps our
minimum supported k8s version to 1.21. Enterprise update in
odigos-io/odigos-enterprise#2117
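
A minimum-version gate like the one described above can be sketched as follows. This is illustrative only and is not how Odigos performs the check; the function names and the `MIN_SUPPORTED` constant are hypothetical.

```python
import re

# New minimum from this PR: Kubernetes 1.21.
MIN_SUPPORTED = (1, 21)


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as "v1.21.3" or "1.21+"."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported(version: str) -> bool:
    """True when the cluster version is at or above the minimum."""
    # Tuples of ints compare numerically, so 1.9 sorts below 1.21.
    return parse_major_minor(version) >= MIN_SUPPORTED


for v in ("v1.20.15", "v1.21.0", "1.30"):
    print(v, is_supported(v))
```

Comparing integer tuples rather than strings avoids the classic mistake of treating "1.9" as newer than "1.21".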