sdk/log: self observability: batch log processor metrics #7124
mahendrabishnoi2 wants to merge 35 commits into open-telemetry:main
Conversation
- setup and teardown for self observability. Includes setting up counters for queue capacity, size, and logs processed. Also includes callback registration for queue capacity and size - doc comments pending - has some TODO comments that I plan to address through discussion on PR review
…ation - increment logProcessedCounter on push to exporter or on drops from the queue - pending: flush methods where we push to the exporter
Codecov Report: ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7124 +/- ##
=====================================
Coverage 82.9% 82.9%
=====================================
Files 314 316 +2
Lines 24911 24995 +84
=====================================
+ Hits 20661 20738 +77
- Misses 3878 3882 +4
- Partials 372 375 +3
…eflect what it means for a log to be processed
@mahendrabishnoi2 please take a look at #7017. I have updated the issue with a checklist of items from our project Observability guidelines that need to be completed in this PR.
@mahendrabishnoi2 I just wanted to check in on this PR. Are you still able to update it, or open a new PR to address the instrumentation requirements?
Yes @MrAlias. I can finish it in 2-3 days per the updated guidelines. It would be great if I could get some feedback on the approach I'm taking for this PR. Since we need to record the metric just before logs are passed to the exporter, I have created another wrapper (similar to timeoutExporter or chunkedExporter) to achieve this. Just trying to avoid rework.
…ty and integrate it with processor.
dashpole left a comment
Not sure if you are still working on this, but this would be relevant if someone else picks this up.
I'll raise an updated PR today. I was just waiting for confirmation on my approach, as I mentioned in the issue as well.
Please remove WIP from the title and mark it ready for review when it is ready for others to take a look.
```markdown
- Add experimental support for splitting metric data across multiple batches in `go.opentelemetry.io/otel/sdk/metric`.
  Set `OTEL_GO_X_METRIC_EXPORT_BATCH_SIZE=<max_size>` to enable for all periodic readers.
  See `go.opentelemetry.io/otel/sdk/metric/internal/x` for feature documentation. (#8071)
- Add experimental observability metrics in `go.opentelemetry.io/otel/sdk/log`. (#7124)
```
```diff
- - Add experimental observability metrics in `go.opentelemetry.io/otel/sdk/log`. (#7124)
+ - Add experimental observability metrics to the BatchProcessor in `go.opentelemetry.io/otel/sdk/log`. (#7124)
```
Pull request overview
This PR adds experimental self-observability metrics for `sdk/log.BatchProcessor` (batch log processor) behind the `OTEL_GO_X_OBSERVABILITY` feature flag, using the global MeterProvider and aligning with the semantic conventions for OTel SDK metrics.
Changes:
- Introduces `sdk/log/internal/observ.BLP` instrumentation emitting `otel.sdk.processor.log.processed`, `otel.sdk.processor.log.queue.size`, and `otel.sdk.processor.log.queue.capacity`.
- Wires the instrumentation into `BatchProcessor` (export path + dropped-records path) and adds/updates tests and benchmarks.
- Adds a generated internal counter utility to produce deterministic component IDs in tests.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| sdk/log/batch.go | Creates BLP instrumentation in NewBatchProcessor, records queue-full drops, shuts down instrumentation. |
| sdk/log/exporter.go | Adds metricsExporter wrapper to count processed log records right before export. |
| sdk/log/batch_test.go | Adds tests asserting metrics presence/absence and correctness when observability is enabled/disabled. |
| sdk/log/internal/observ/batch_log_processor.go | New BLP instrumentation implementation (queue size/capacity callback + processed counter). |
| sdk/log/internal/observ/batch_log_processor_test.go | New tests/benchmarks for BLP instrumentation behavior. |
| sdk/log/internal/observ/observ_test.go | Shared test helpers + erroring meter provider/meter fakes. |
| sdk/log/internal/counter/counter.go | Generated atomic counter for unique IDs (resettable for tests). |
| sdk/log/internal/counter/counter_test.go | Generated tests for the counter behavior and concurrency-safety. |
| sdk/log/internal/gen.go | Adds go:generate targets for the new internal counter. |
| sdk/log/internal/x/README.md | Documents the newly added BatchProcessor observability metrics. |
| sdk/log/logger_bench_test.go | Updates benchmark MeterProvider setup to drop internal observ scope metrics via a view. |
| CHANGELOG.md | Adds a changelog entry for the new experimental log SDK observability metrics. |
```go
	}, err
}

func (b *BLP) Shutdown() error {
```
NewBLP can return a non-nil *BLP even when meter.RegisterCallback(...) fails. In that case reg is likely nil, and Shutdown() will panic when calling b.reg.Unregister(). Consider initializing reg to a no-op registration (e.g., from go.opentelemetry.io/otel/metric/noop) and only overwriting it on success, or guard Shutdown() with a nil check so it becomes safe/idempotent even when callback registration failed.
```diff
 func (b *BLP) Shutdown() error {
+	if b == nil || b.reg == nil {
+		return nil
+	}
```
Here is how Codex found the same issue:
If meter.RegisterCallback returns an error, reg is left nil here but NewBLP still returns a non-nil *BLP, so NewBatchProcessor installs it and later (*BLP).Shutdown calls b.reg.Unregister() on a nil registration. In any process that enables OTEL_GO_X_OBSERVABILITY with a custom or partially configured MeterProvider, shutdown turns a handled instrumentation error into a panic.
I have not added the `b == nil` check since this is an internal API and at all call sites we already check for `b.inst == nil`.
Let me know if this check also needs to be added.
```go
var err error
b.inst, err = observ.NewBLP(
	counter.NextExporterID(),
	func() int64 { return int64(b.q.Len()) },
	int64(cfg.maxQSize.Value),
)
if err != nil {
	otel.Handle(err)
}

// Wrap exporter with metrics recording if observability is enabled.
// This must be the innermost wrapper (closest to user exporter) to record
// metrics just before calling the actual exporter.
if b.inst != nil {
	exporter = newMetricsExporter(exporter, b.inst)
}
```
observ.NewBLP(...) may return a non-nil b.inst together with a non-nil err (best-effort init). Right now the code logs the error via otel.Handle(err) but still wraps the exporter and later calls b.inst.Shutdown(), which can lead to runtime panics if the instrumentation is only partially initialized (e.g., callback registration failed). Consider treating any non-nil error as a signal to disable this instrumentation instance (b.inst = nil), or adjust NewBLP so it only returns a non-nil instance when it is safe to use/shutdown.
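The "treat any error as disable" option could look like the sketch below. It is stdlib-only; `newBLP` and `otelHandle` are simplified stand-ins for `observ.NewBLP` and `otel.Handle`, not the real SDK symbols:

```go
package main

import (
	"errors"
	"fmt"
)

// inst is a stand-in for the observ.BLP instrumentation handle.
type inst struct{ partiallyInitialized bool }

// newBLP mimics best-effort init: it can return both a non-nil handle
// and a non-nil error (e.g., callback registration failed).
func newBLP(fail bool) (*inst, error) {
	if fail {
		return &inst{partiallyInitialized: true}, errors.New("callback registration failed")
	}
	return &inst{}, nil
}

// otelHandle stands in for otel.Handle: report and continue.
func otelHandle(err error) { fmt.Println("handled:", err) }

func main() {
	b, err := newBLP(true)
	if err != nil {
		otelHandle(err)
		// Treat any init error as "instrumentation disabled" so later
		// code never wraps the exporter with, or calls Shutdown on, a
		// partially initialized handle.
		b = nil
	}
	fmt.Println("instrumentation enabled:", b != nil)
}
```

The trade-off versus making `NewBLP` internally safe is that this discards partially working instrumentation (e.g., a counter that did register) along with the broken part.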
```go
// Flush remaining queued before exporter shutdown.
err := b.exporter.Export(ctx, b.q.Flush())
if b.inst != nil {
	err = errors.Join(err, b.inst.Shutdown())
```
When records have overflowed the ring buffer since the last poll iteration, Shutdown closes pollKill and can exit the poll goroutine without ever consuming q.Dropped(). Because ProcessedQueueFull is only driven from that goroutine, an application that shuts down under backpressure can silently miss otel.sdk.processor.log.processed{error.type="queue_full"} increments for the last dropped log records, so the new observability metric under-reports exactly the shutdown path users care about most.
I think that addressing this would be very hard and I would rather create an issue for tracking this.
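For reference, one conceivable shape for such a fix is a final drain of the queue's dropped counter on the shutdown path, before the poll goroutine's state is discarded. This is a stdlib-only sketch; `queue`, `Dropped`, and `ProcessedQueueFull` are simplified stand-ins for the SDK's real types:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// queue models the ring buffer's dropped-record counter.
type queue struct{ dropped atomic.Uint64 }

// Dropped returns and resets the number of records dropped since the
// last call, mirroring a poll-and-reset counter.
func (q *queue) Dropped() uint64 { return q.dropped.Swap(0) }

// blp models the instrumentation's queue_full processed counter.
type blp struct{ queueFull uint64 }

func (b *blp) ProcessedQueueFull(n uint64) { b.queueFull += n }

// shutdown drains drops accumulated since the last poll iteration so the
// processed{error.type="queue_full"} metric is not under-reported when
// the process exits under backpressure.
func shutdown(q *queue, inst *blp) {
	if inst != nil {
		if n := q.Dropped(); n > 0 {
			inst.ProcessedQueueFull(n)
		}
	}
}

func main() {
	q := &queue{}
	q.dropped.Add(3) // dropped under backpressure, not yet polled
	inst := &blp{}
	shutdown(q, inst)
	fmt.Println("queue_full recorded at shutdown:", inst.queueFull)
}
```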
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
```go
var err error
b.inst, err = observ.NewBLP(
	counter.NextExporterID(),
	func() int64 { return int64(b.q.Len()) },
	int64(cfg.maxQSize.Value),
)
```
counter.NextExporterID() is evaluated unconditionally before observ.NewBLP(...) can check whether observability is enabled, so the global ID counter is incremented (atomic op) even when OTEL_GO_X_OBSERVABILITY is disabled. This adds avoidable overhead in the default (disabled) path and makes component IDs depend on how many batch processors were created while disabled. Consider gating the ID generation and NewBLP call behind the feature-flag check (or moving ID generation inside NewBLP after it confirms observability is enabled).
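The gating this comment suggests could be sketched as follows. It is stdlib-only; `enabled` and `nextExporterID` are simplified stand-ins for `x.Observability.Enabled()` and `counter.NextExporterID()`:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var exporterID atomic.Int64

// nextExporterID stands in for counter.NextExporterID(): an atomic,
// monotonically increasing component ID.
func nextExporterID() int64 { return exporterID.Add(1) - 1 }

// enabled stands in for the x.Observability.Enabled() feature-flag
// check; here OTEL_GO_X_OBSERVABILITY is treated as unset.
func enabled() bool { return false }

func newProcessor() {
	// Only pay for the atomic increment (and keep component IDs
	// deterministic) when observability is actually enabled.
	if !enabled() {
		return
	}
	id := nextExporterID()
	fmt.Println("instrumented with component ID", id)
}

func main() {
	newProcessor()
	newProcessor()
	fmt.Println("IDs consumed while disabled:", exporterID.Load())
}
```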
```go
func TestBatchProcessorMetricsDisabled(t *testing.T) {
	t.Setenv("OTEL_GO_X_OBSERVABILITY", "false")
```
This test mutates the global exporter ID counter via counter.SetExporterID(...) but does not restore the previous value. Since Go test execution order across tests in the package is not guaranteed, this can leak state into other tests that create batch processors and makes failures order-dependent. Consider saving the previous counter value and restoring it in t.Cleanup(...).
```go
origExporterID := counter.ExporterID()
t.Cleanup(func() { counter.SetExporterID(origExporterID) })
```
Same pattern as the other package tests: setting it to 0.
```go
func TestBatchProcessorMetrics(t *testing.T) {
	counter.SetExporterID(blpComponentID)

	t.Setenv("OTEL_GO_X_OBSERVABILITY", "true")
```
This test mutates the global exporter ID counter via counter.SetExporterID(...) but does not restore the previous value, which can leak global state into other tests in the package depending on execution order. Consider saving the previous counter value and restoring it in t.Cleanup(...).
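The save-and-restore pattern this comment asks for, demonstrated outside of `testing.T` with a plain `defer` (the package-level `exporterID` here is a stand-in for the real counter; in a test, the restore would go in `t.Cleanup(...)`):

```go
package main

import "fmt"

// exporterID stands in for the package-global counter mutated by tests.
var exporterID int64

func setExporterID(v int64) { exporterID = v }

// withExporterID runs fn with the counter pinned to v, then restores the
// previous value, mirroring t.Cleanup-based restoration in real tests.
func withExporterID(v int64, fn func()) {
	orig := exporterID
	setExporterID(v)
	defer setExporterID(orig)
	fn()
}

func main() {
	setExporterID(7) // state left behind by some earlier test
	withExporterID(0, func() {
		fmt.Println("inside test, counter =", exporterID)
	})
	fmt.Println("after test, counter restored to", exporterID)
}
```

This keeps test results independent of execution order, at the cost of a few extra lines per test.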
Fixes #7017

Adds support for the following metrics for `sdk/log.BatchProcessor`:

- `otel.sdk.processor.log.processed`
- `otel.sdk.processor.log.queue.size`
- `otel.sdk.processor.log.queue.capacity`

These metrics are defined at https://github.com/open-telemetry/semantic-conventions/blob/v1.36.0/docs/otel/sdk-metrics.md and are experimental. Because of this, they use the global MeterProvider and are behind the feature flag `OTEL_GO_X_OBSERVABILITY`.
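Conceptually, the queue metrics are callback-driven (read at collection time) while `processed` is a plain counter. A stdlib-only sketch of that split; `observableGauge` and `newQueueGauges` are illustrative names, not the SDK's API:

```go
package main

import "fmt"

// observableGauge models a callback-driven instrument: the value is read
// lazily at collection time instead of being recorded on every enqueue.
type observableGauge struct {
	name    string
	observe func() int64
}

// newQueueGauges wires size/capacity gauges to a live length getter, the
// same shape as the callback the processor instrumentation registers.
func newQueueGauges(lenFn func() int64, capacity int64) []observableGauge {
	return []observableGauge{
		{"otel.sdk.processor.log.queue.size", lenFn},
		{"otel.sdk.processor.log.queue.capacity", func() int64 { return capacity }},
	}
}

func main() {
	queueLen := int64(0)
	gauges := newQueueGauges(func() int64 { return queueLen }, 2048)

	queueLen = 17 // records currently buffered
	for _, g := range gauges {
		fmt.Printf("%s = %d\n", g.name, g.observe())
	}
}
```

Because the size gauge closes over the live length getter, collection always reports the current queue depth without any per-record recording cost.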
Observability Implementation Checklist

Based on the project Observability guidelines, ensure the following are completed:

Environment Variable Activation
- Gated behind the `OTEL_GO_X_OBSERVABILITY` environment variable
- Uses the `x.Observability.Enabled()` check [1]

Encapsulation
- Instrumentation encapsulated in its own `struct` (e.g., `Instrumentation`)

Initialization
- Uses the global MeterProvider (`otel.GetMeterProvider()`)
- Sets the instrumentation scope version (`Version`)
- Sets the schema URL (`SchemaURL`)

Performance

Attribute and Option Allocation Management
- Uses `sync.Pool` for attribute slices and options with dynamic attributes

Caching

Benchmarking
- Reports allocations (`b.ReportAllocs()` in benchmarks)

Error Handling and Robustness
- Reports errors via `otel.Handle()`
- Uses `otel.Handle()` only when the component cannot report the error to the user

Context Propagation
- Uses a fresh context where appropriate (`context.Background()`)

Semantic Conventions Compliance
- Uses the `otelconv` convenience package for metric semantic conventions

Testing
- Registers test cleanup (`t.Cleanup()`)
- Uses `t.Setenv()` for environment variable testing

Footnotes
1. https://github.com/open-telemetry/opentelemetry-go/blob/e4ab3141123d0811125a69823dbbe4d9ec5a9b8f/exporters/stdout/stdouttrace/internal/observ/instrumentation.go#L101-L103 ↩
2. https://github.com/open-telemetry/opentelemetry-go/blob/e4ab3141123d0811125a69823dbbe4d9ec5a9b8f/exporters/stdout/stdouttrace/internal/x/x.go ↩
3. https://github.com/open-telemetry/opentelemetry-go/blob/e4ab3141123d0811125a69823dbbe4d9ec5a9b8f/sdk/internal/x/x.go ↩
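The `sync.Pool` item on the checklist refers to reusing attribute slices on the hot path so metric recording does not allocate per call. A generic, stdlib-only sketch of the pattern; `KeyValue` and `record` are simplified stand-ins, not the SDK's types:

```go
package main

import (
	"fmt"
	"sync"
)

// KeyValue is a simplified stand-in for the real attribute type.
type KeyValue struct{ Key, Value string }

// attrPool reuses attribute slices (stored as pointers to avoid
// allocating on Put) so the hot path does not allocate a slice per call.
var attrPool = sync.Pool{
	New: func() any {
		s := make([]KeyValue, 0, 4)
		return &s
	},
}

func record(componentName string) {
	sp := attrPool.Get().(*[]KeyValue)
	defer func() {
		*sp = (*sp)[:0] // reset length, keep capacity for reuse
		attrPool.Put(sp)
	}()
	*sp = append(*sp, KeyValue{"otel.component.name", componentName})
	// ...pass *sp to the measurement call here...
	fmt.Println("recorded with", len(*sp), "attribute(s)")
}

func main() {
	record("batching_log_processor/0")
	record("batching_log_processor/1")
}
```

The pool trades a small amount of bookkeeping for steady-state zero allocations, which is why the checklist pairs it with `b.ReportAllocs()` benchmarks.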