Skip to content

sdk/log: self observability: batch log processor metrics#7124

Open
mahendrabishnoi2 wants to merge 35 commits intoopen-telemetry:mainfrom
mahendrabishnoi2:logs-batch-processor-metrics
Open

sdk/log: self observability: batch log processor metrics#7124
mahendrabishnoi2 wants to merge 35 commits intoopen-telemetry:mainfrom
mahendrabishnoi2:logs-batch-processor-metrics

Conversation

@mahendrabishnoi2
Copy link
Copy Markdown
Member

@mahendrabishnoi2 mahendrabishnoi2 commented Aug 4, 2025

Fixes #7017

Adds support for below metrics for sdk/log#BatchProcessor

  • otel.sdk.processor.log.processed
  • otel.sdk.processor.log.queue.size
  • otel.sdk.processor.log.queue.capacity

These metrics are defined at https://github.com/open-telemetry/semantic-conventions/blob/v1.36.0/docs/otel/sdk-metrics.md and are experimental. Because of this, these use global MeterProvider and are behind a feature flag OTEL_GO_X_OBSERVABILITY.

Observability Implementation Checklist

Observability Implementation Checklist

Based on the project Observability guidelines, ensure the following are completed:

Environment Variable Activation

  • Observability features are disabled by default
  • Features are activated through the OTEL_GO_X_OBSERVABILITY environment variable
  • Use consistent pattern with x.Observability.Enabled() check 1
  • Follow established experimental feature pattern 23

Encapsulation

  • Instrumentation is encapsulated within a dedicated struct (e.g., Instrumentation)
  • Instrumentation is not mixed into the instrumented component
  • Instrumentation code is in its own file or package if complex/reused
  • Instrumentation setup doesn't bloat the main component code

Initialization

  • Initialization is only done when observability is enabled
  • Setup is explicit and side-effect free
  • Return errors from initialization when appropriate
  • Use the global Meter provider (e.g., otel.GetMeterProvider())
  • Include proper meter configuration with:
  • Instrumentation package name is used for the Meter
  • Instrumentation version (e.g. Version)
  • Schema URL (e.g. SchemaURL)

Performance

  • Little to no overhead when observability is disabled
  • Expensive operations are only executed when observability is enabled
  • When enabled, instrumentation code paths are optimized to reduce allocation/computation overhead

Attribute and Option Allocation Management

  • Use sync.Pool for attribute slices and options with dynamic attributes
  • Pool objects are properly reset before returning to pool
  • Pools are scoped for maximum efficiency while ensuring correctness

Caching

  • Static attribute sets known at compile time are pre-computed and cached
  • Common attribute combinations use lookup tables/maps

Benchmarking

  • Benchmarks provided for all instrumentation code
  • Benchmark scenarios include both enabled and disabled observability
  • Benchmark results show impact on allocs/op, B/op, and ns/op (use b.ReportAllocs() in benchmarks)

Error Handling and Robustness

  • Errors are reported back to caller when possible
  • Partial failures are handled gracefully
  • Use partially initialized components when available
  • Return errors to caller instead of only using otel.Handle()
  • Use otel.Handle() only when component cannot report error to user

Context Propagation

  • Observability measurements receive the context from the function being measured (don't break context propagation by using context.Background())

Semantic Conventions Compliance

  • All metrics follow OpenTelemetry Semantic Conventions
  • Use the otelconv convenience package for metric semantic conventions
  • Component names follow semantic conventions
  • Use package path scope type as stable identifier for non-standard components
  • Component names are stable unique identifiers
  • Use global counter for uniqueness if necessary
  • Component ID counter is resettable for deterministic testing

Testing

  • Use deterministic testing with isolated state
  • Restore previous state after tests (t.Cleanup())
  • Isolate meter provider for testing
  • Use t.Setenv() for environment variable testing
  • Reset component ID counter for deterministic component names
  • Test order doesn't affect results

Footnotes

  1. https://github.com/open-telemetry/opentelemetry-go/blob/e4ab3141123d0811125a69823dbbe4d9ec5a9b8f/exporters/stdout/stdouttrace/internal/observ/instrumentation.go#L101-L103

  2. https://github.com/open-telemetry/opentelemetry-go/blob/e4ab3141123d0811125a69823dbbe4d9ec5a9b8f/exporters/stdout/stdouttrace/internal/x/x.go

  3. https://github.com/open-telemetry/opentelemetry-go/blob/e4ab3141123d0811125a69823dbbe4d9ec5a9b8f/sdk/internal/x/x.go

- setup and teardown for self observability. Includes setting up counters for queue capacity, size, log processed. Also includes callback registration for queue cap and size
- doc comments pending
- has some todo comments that I plan to address by discussion on PR review
…ation

- increment logProcessedCounter on push to exporter or drops from queue
- pending: flush methods where we are pushing to exporter
@codecov
Copy link
Copy Markdown

codecov Bot commented Aug 4, 2025

Codecov Report

❌ Patch coverage is 95.55556% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.9%. Comparing base (468c62d) to head (170e336).

Files with missing lines Patch % Lines
sdk/log/batch.go 89.4% 1 Missing and 1 partial ⚠️
sdk/log/internal/observ/batch_log_processor.go 96.6% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@          Coverage Diff          @@
##            main   #7124   +/-   ##
=====================================
  Coverage   82.9%   82.9%           
=====================================
  Files        314     316    +2     
  Lines      24911   24995   +84     
=====================================
+ Hits       20661   20738   +77     
- Misses      3878    3882    +4     
- Partials     372     375    +3     
Files with missing lines Coverage Δ
sdk/log/exporter.go 96.5% <100.0%> (+0.2%) ⬆️
sdk/log/internal/counter/counter.go 100.0% <100.0%> (ø)
sdk/log/batch.go 99.0% <89.4%> (-1.0%) ⬇️
sdk/log/internal/observ/batch_log_processor.go 96.6% <96.6%> (ø)

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@MrAlias
Copy link
Copy Markdown
Contributor

MrAlias commented Sep 15, 2025

@mahendrabishnoi2 please take a look at #7017. I have updated the issue with a check list of items from our project Observability guidelines that need to be completed in this PR.

@MrAlias
Copy link
Copy Markdown
Contributor

MrAlias commented Oct 7, 2025

@mahendrabishnoi2 I just wanted to check in with this PR. Are you still able to update this PR or open a new PR to address the instrumentation requirements?

@mahendrabishnoi2
Copy link
Copy Markdown
Member Author

@mahendrabishnoi2 I just wanted to check in with this PR. Are you still able to update this PR or open a new PR to address the instrumentation requirements?

Yes @MrAlias. I can finish it in 2-3 days as per the updated guidelines. It would be great if I could get some feedback on the approach I'm taking for this PR. Since we need to record the metric just before logs are passed to exporter, I have crated another wrapper (similar to timeoutExporter or chunkedExporter) to achieve this. Just trying to avoid the rework.
If this approach seems fine, I can go ahead and updated it as per guidelines, add tests.

Copy link
Copy Markdown
Contributor

@dashpole dashpole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you are still working on this, but this would be relevant if someone else picks this up.

Comment thread sdk/log/internal/observ/batch_log_processor.go Outdated
Comment thread sdk/log/internal/observ/batch_log_processor.go Outdated
Comment thread sdk/log/internal/observ/batch_log_processor.go Outdated
@mahendrabishnoi2
Copy link
Copy Markdown
Member Author

Not sure if you are still working on this, but this would be relevant if someone else picks this up.

I'll raise an updated PR today. Was just waiting for confirmation on my approach as I mentioned in issue as well.
Thanks @dashpole

@dashpole
Copy link
Copy Markdown
Contributor

Please remove WIP from the title and mark it ready for review when it is ready for others to take a look

@mahendrabishnoi2
Copy link
Copy Markdown
Member Author

goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/otel/sdk/log/internal/observ
cpu: Apple M1 Pro
                         │ blp_bench.txt │
                         │    sec/op     │
BLP/Processed-8             0.4520n ± 3%
BLP/ProcessedQueueFull-8    0.4463n ± 1%
BLP/Callback-8               2.692µ ± 8%
geomean                      8.159n

                         │ blp_bench.txt  │
                         │      B/op      │
BLP/Processed-8              0.000 ± 0%
BLP/ProcessedQueueFull-8     0.000 ± 0%
BLP/Callback-8             2.814Ki ± 0%
geomean                                 ¹
¹ summaries must be >0 to compute geomean

                         │ blp_bench.txt │
                         │   allocs/op   │
BLP/Processed-8             0.000 ± 0%
BLP/ProcessedQueueFull-8    0.000 ± 0%
BLP/Callback-8              26.00 ± 0%
geomean                                ¹
¹ summaries must be >0 to compute geomean

@mahendrabishnoi2 mahendrabishnoi2 marked this pull request as ready for review April 12, 2026 04:39
@mahendrabishnoi2 mahendrabishnoi2 changed the title [WIP] sdk/log: self observability: batch log processor metrics sdk/log: self observability: batch log processor metrics Apr 12, 2026
Comment thread CHANGELOG.md Outdated
- Add experimental support for splitting metric data across multiple batches in `go.opentelemetry.io/otel/sdk/metric`.
Set `OTEL_GO_X_METRIC_EXPORT_BATCH_SIZE=<max_size>` to enable for all periodic readers.
See `go.opentelemetry.io/otel/sdk/metric/internal/x` for feature documentation. (#8071)
- Add experimental observability metrics in `go.opentelemetry.io/otel/sdk/log`. (#7124)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Add experimental observability metrics in `go.opentelemetry.io/otel/sdk/log`. (#7124)
- Add experimental observability metrics to the BatchProcessor in `go.opentelemetry.io/otel/sdk/log`. (#7124)

pellared
pellared previously approved these changes Apr 14, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds experimental self-observability metrics for sdk/log.BatchProcessor (BatchLogProcessor) behind the OTEL_GO_X_OBSERVABILITY feature flag, using the global MeterProvider, aligning with the semantic conventions for OTel SDK metrics.

Changes:

  • Introduces sdk/log/internal/observ.BLP instrumentation emitting otel.sdk.processor.log.processed, otel.sdk.processor.log.queue.size, and otel.sdk.processor.log.queue.capacity.
  • Wires the instrumentation into BatchProcessor (export path + dropped-records path) and adds/updates tests and benchmarks.
  • Adds a generated internal counter utility to produce deterministic component IDs in tests.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
sdk/log/batch.go Creates BLP instrumentation in NewBatchProcessor, records queue-full drops, shuts down instrumentation.
sdk/log/exporter.go Adds metricsExporter wrapper to count processed log records right before export.
sdk/log/batch_test.go Adds tests asserting metrics presence/absence and correctness when observability is enabled/disabled.
sdk/log/internal/observ/batch_log_processor.go New BLP instrumentation implementation (queue size/capacity callback + processed counter).
sdk/log/internal/observ/batch_log_processor_test.go New tests/benchmarks for BLP instrumentation behavior.
sdk/log/internal/observ/observ_test.go Shared test helpers + erroring meter provider/meter fakes.
sdk/log/internal/counter/counter.go Generated atomic counter for unique IDs (resettable for tests).
sdk/log/internal/counter/counter_test.go Generated tests for the counter behavior and concurrency-safety.
sdk/log/internal/gen.go Adds go:generate targets for the new internal counter.
sdk/log/internal/x/README.md Documents the newly added BatchProcessor observability metrics.
sdk/log/logger_bench_test.go Updates benchmark MeterProvider setup to drop internal observ scope metrics via a view.
CHANGELOG.md Adds a changelog entry for the new experimental log SDK observability metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}, err
}

func (b *BLP) Shutdown() error {
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NewBLP can return a non-nil *BLP even when meter.RegisterCallback(...) fails. In that case reg is likely nil, and Shutdown() will panic when calling b.reg.Unregister(). Consider initializing reg to a no-op registration (e.g., from go.opentelemetry.io/otel/metric/noop) and only overwriting it on success, or guard Shutdown() with a nil check so it becomes safe/idempotent even when callback registration failed.

Suggested change
func (b *BLP) Shutdown() error {
func (b *BLP) Shutdown() error {
if b == nil || b.reg == nil {
return nil
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is how Codex found the same issue:

If meter.RegisterCallback returns an error, reg is left nil here but NewBLP still returns a non-nil *BLP, so NewBatchProcessor installs it and later (*BLP).Shutdown calls b.reg.Unregister() on a nil registration. In any process that enables OTEL_GO_X_OBSERVABILITY with a custom or partially configured MeterProvider, shutdown turns a handled instrumentation error into a panic.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not added b == nil check as this is internal api and on all callsites, we already check for b.inst == nil.
Do let me know if this check also needs to be added.

Comment thread sdk/log/internal/observ/batch_log_processor.go
Comment thread sdk/log/batch.go
Comment on lines +128 to +143
var err error
b.inst, err = observ.NewBLP(
counter.NextExporterID(),
func() int64 { return int64(b.q.Len()) },
int64(cfg.maxQSize.Value),
)
if err != nil {
otel.Handle(err)
}

// Wrap exporter with metrics recording if observability is enabled.
// This must be the innermost wrapper (closest to user exporter) to record
// metrics just before calling the actual exporter.
if b.inst != nil {
exporter = newMetricsExporter(exporter, b.inst)
}
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observ.NewBLP(...) may return a non-nil b.inst together with a non-nil err (best-effort init). Right now the code logs the error via otel.Handle(err) but still wraps the exporter and later calls b.inst.Shutdown(), which can lead to runtime panics if the instrumentation is only partially initialized (e.g., callback registration failed). Consider treating any non-nil error as a signal to disable this instrumentation instance (b.inst = nil), or adjust NewBLP so it only returns a non-nil instance when it is safe to use/shutdown.

Copilot uses AI. Check for mistakes.
Comment thread sdk/log/batch.go
@pellared pellared dismissed their stale review April 14, 2026 18:02

AI agents found issues.

Comment thread sdk/log/batch.go
// Flush remaining queued before exporter shutdown.
err := b.exporter.Export(ctx, b.q.Flush())
if b.inst != nil {
err = errors.Join(err, b.inst.Shutdown())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When records have overflowed the ring buffer since the last poll iteration, Shutdown closes pollKill and can exit the poll goroutine without ever consuming q.Dropped(). Because ProcessedQueueFull is only driven from that goroutine, an application that shuts down under backpressure can silently miss otel.sdk.processor.log.processed{error.type="queue_full"} increments for the last dropped log records, so the new observability metric under-reports exactly the shutdown path users care about most.

I think that addressing this would be very hard and I would rather create an issue for tracking this.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread sdk/log/batch.go
Comment on lines +128 to +133
var err error
b.inst, err = observ.NewBLP(
counter.NextExporterID(),
func() int64 { return int64(b.q.Len()) },
int64(cfg.maxQSize.Value),
)
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

counter.NextExporterID() is evaluated unconditionally before observ.NewBLP(...) can check whether observability is enabled, so the global ID counter is incremented (atomic op) even when OTEL_GO_X_OBSERVABILITY is disabled. This adds avoidable overhead in the default (disabled) path and makes component IDs depend on how many batch processors were created while disabled. Consider gating the ID generation and NewBLP call behind the feature-flag check (or moving ID generation inside NewBLP after it confirms observability is enabled).

Copilot uses AI. Check for mistakes.
Comment thread sdk/log/batch_test.go

func TestBatchProcessorMetricsDisabled(t *testing.T) {
t.Setenv("OTEL_GO_X_OBSERVABILITY", "false")

Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test mutates the global exporter ID counter via counter.SetExporterID(...) but does not restore the previous value. Since Go test execution order across tests in the package is not guaranteed, this can leak state into other tests that create batch processors and makes failures order-dependent. Consider saving the previous counter value and restoring it in t.Cleanup(...).

Suggested change
origExporterID := counter.ExporterID()
t.Cleanup(func() { counter.SetExporterID(origExporterID) })

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar pattern as other package tests, setting to 0

Comment thread sdk/log/batch_test.go
Comment on lines +727 to +730
func TestBatchProcessorMetrics(t *testing.T) {
counter.SetExporterID(blpComponentID)

t.Setenv("OTEL_GO_X_OBSERVABILITY", "true")
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test mutates the global exporter ID counter via counter.SetExporterID(...) but does not restore the previous value, which can leak global state into other tests in the package depending on execution order. Consider saving the previous counter value and restoring it in t.Cleanup(...).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Logs SDK observability - batch processor metrics

5 participants