micrometer http.server.requests metric breaks metrics UI with IndexOutOfRangeException #6962

Open
1 task done
stevesea opened this issue Dec 17, 2024 · 6 comments · May be fixed by #6998

Comments

@stevesea

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

With an http.server.requests metric exported via the OTel agent's Micrometer bridge, the Aspire dashboard can get into an odd state.

If I navigate to the Micrometer metric http.server.requests, I get an IndexOutOfRangeException in the Aspire dashboard's container logs.

standalone aspire dashboard started with:

docker run --rm -it -p 8000:18888 -p 4317:18889 -p 4318:18890 --name aspire-dashboard -e DOTNET_DASHBOARD_UNSECURED_ALLOW_ANONYMOUS="true" mcr.microsoft.com/dotnet/aspire-dashboard

Enabled the Micrometer metrics bridge with:
otel.instrumentation.micrometer.enabled=true in my JVM settings

I can navigate via the left-nav to logs or traces, and the UI is responsive.
If I navigate back to metrics, the UI is unresponsive. I am unable to select other metrics.

To recover, I have to manually remove the query parameters from the URI in the browser and then navigate to the resulting URL.

For example, change from this:

http://localhost:8000/metrics/resource/my-application-name?meter=io.opentelemetry.micrometer-1.5&instrument=http.server.requests&duration=5

to

http://localhost:8000/metrics/resource/my-application-name

once I manually remove those query parameters, I can select other (non-micrometer) metrics in the UI
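
The manual workaround above amounts to dropping the query string from the dashboard URL. A minimal sketch of that step, for anyone who wants to script it; `strip_query` is a hypothetical helper name, not part of Aspire:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL with its query string and fragment removed.

    Hypothetical helper illustrating the manual workaround; it only
    rewrites the string and does not talk to the dashboard.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


stuck = (
    "http://localhost:8000/metrics/resource/my-application-name"
    "?meter=io.opentelemetry.micrometer-1.5"
    "&instrument=http.server.requests&duration=5"
)
print(strip_query(stuck))
```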

Expected Behavior

No response

Steps To Reproduce

No response

Exceptions (if any)

warn: Microsoft.AspNetCore.Components.Server.Circuits.RemoteRenderer[100]
Unhandled exception rendering component: Index was outside the bounds of the array.
System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at Aspire.Dashboard.Components.Controls.Chart.ChartBase.CalculatePercentile(Int32 percentile, UInt64[] counts, Double[] explicitBounds) in /_/src/Aspire.Dashboard/Components/Controls/Chart/ChartBase.cs:line 346
   at Aspire.Dashboard.Components.Controls.Chart.ChartBase.TryCalculateHistogramPoints(List`1 dimensions, DateTimeOffset start, DateTimeOffset end, Dictionary`2 traces, List`1 exemplars) in /_/src/Aspire.Dashboard/Components/Controls/Chart/ChartBase.cs:line 276
   at Aspire.Dashboard.Components.Controls.Chart.ChartBase.CalculateHistogramValues(List`1 dimensions, Int32 pointCount, Boolean tickUpdate, DateTimeOffset inProgressDataTime, String yLabel) in /_/src/Aspire.Dashboard/Components/Controls/Chart/ChartBase.cs:line 147
   at Aspire.Dashboard.Components.Controls.Chart.ChartBase.UpdateChartAsync(Boolean tickUpdate, DateTimeOffset inProgressDataTime) in /_/src/Aspire.Dashboard/Components/Controls/Chart/ChartBase.cs:line 506
   at Aspire.Dashboard.Components.Controls.Chart.ChartBase.OnAfterRenderAsync(Boolean firstRender) in /_/src/Aspire.Dashboard/Components/Controls/Chart/ChartBase.cs:line 107
   at Aspire.Dashboard.Components.PlotlyChart.OnAfterRenderAsync(Boolean firstRender) in /_/src/Aspire.Dashboard/Components/Controls/Chart/PlotlyChart.razor.cs:line 177
fail: Microsoft.AspNetCore.Components.Server.Circuits.CircuitHost[111]
Unhandled exception in circuit '2YmzVDSrBI8PnW2kK1Vw-iP_YeL1r-p14788V170uBw'.
System.AggregateException: One or more errors occurred. (Index was outside the bounds of the array.)
 ---> System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at Aspire.Dashboard.Components.Controls.Chart.ChartBase.CalculatePercentile(Int32 percentile, UInt64[] counts, Double[] explicitBounds) in /_/src/Aspire.Dashboard/Components/Controls/Chart/ChartBase.cs:line 346
   at Aspire.Dashboard.Components.Controls.Chart.ChartBase.TryCalculateHistogramPoints(List`1 dimensions, DateTimeOffset start, DateTimeOffset end, Dictionary`2 traces, List`1 exemplars) in /_/src/Aspire.Dashboard/Components/Controls/Chart/ChartBase.cs:line 276
   at Aspire.Dashboard.Components.Controls.Chart.ChartBase.CalculateHistogramValues(List`1 dimensions, Int32 pointCount, Boolean tickUpdate, DateTimeOffset inProgressDataTime, String yLabel) in /_/src/Aspire.Dashboard/Components/Controls/Chart/ChartBase.cs:line 147
   at Aspire.Dashboard.Components.Controls.Chart.ChartBase.UpdateChartAsync(Boolean tickUpdate, DateTimeOffset inProgressDataTime) in /_/src/Aspire.Dashboard/Components/Controls/Chart/ChartBase.cs:line 506
   at Aspire.Dashboard.Components.Controls.Chart.ChartBase.OnAfterRenderAsync(Boolean firstRender) in /_/src/Aspire.Dashboard/Components/Controls/Chart/ChartBase.cs:line 107
   at Aspire.Dashboard.Components.PlotlyChart.OnAfterRenderAsync(Boolean firstRender) in /_/src/Aspire.Dashboard/Components/Controls/Chart/PlotlyChart.razor.cs:line 177
   --- End of inner exception stack trace ---

.NET Version info

No response

Anything else?

aspire dashboard 9.0.0 (standalone)

otel java agent 2.10.0

java app - spring boot 3.4.0, temurin java 21.0.1
configured for jetty, not tomcat.

@joperezr added the untriaged, area-dashboard, and area-telemetry labels on Dec 17, 2024
@adamint
Member

adamint commented Dec 18, 2024

For this to happen, I think explicitBounds has to be empty?

@stevesea
Author

I turned on the console metrics exporter so I could see some of the data being sent via OTLP.

The misbehaving Micrometer metric, http.server.requests:

[otel.javaagent 2024-12-18 14:37:23:216 -0700] [PeriodicMetricReader-1] INFO io.opentelemetry.exporter.logging.LoggingMetricExporter - metric: ImmutableMetricData{resource=Resource{schemaUrl=https://opentelemetry.io/schemas/1.24.0, attributes={deployment.environment="stevec-local", host.arch="aarch64", host.name="M-C44LGCWY6V", os.description="Mac OS X 15.2", os.type="darwin", process.executable.path="/Users/schristensen/.asdf/installs/java/temurin-21.0.1+12.0.LTS/bin/java", process.runtime.description="Eclipse Adoptium OpenJDK 64-Bit Server VM 21.0.1+12-LTS", process.runtime.name="OpenJDK Runtime Environment", process.runtime.version="21.0.1+12-LTS", service.instance.id="stevec-ide", service.name="my-application-name", telemetry.distro.name="opentelemetry-java-instrumentation", telemetry.distro.version="2.10.0", telemetry.sdk.language="java", telemetry.sdk.name="opentelemetry", telemetry.sdk.version="1.44.1"}}, instrumentationScopeInfo=InstrumentationScopeInfo{name=io.opentelemetry.micrometer-1.5, version=null, schemaUrl=null, attributes={}}, name=http.server.requests, description=, unit=s, type=HISTOGRAM, data=ImmutableHistogramData{aggregationTemporality=CUMULATIVE, points=[ImmutableHistogramPointData{getStartEpochNanos=1734557768207179000, getEpochNanos=1734557843212655000, getAttributes={error="none", exception="none", method="GET", outcome="REDIRECTION", status="302", uri="/cookie_cleanser"}, getSum=0.102902916, getCount=1, hasMin=true, getMin=0.102902916, hasMax=true, getMax=0.102902916, getBoundaries=[], getCounts=[1], getExemplars=[ImmutableDoubleExemplarData{filteredAttributes={}, epochNanos=1734557841199000000, spanContext=ImmutableSpanContext{traceId=1aa992b096da38d717911437027aef70, spanId=c3c2a0caf4dede8b, traceFlags=01, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=true}, value=0.102902916}]}]}}

I assume the empty getBoundaries field relates to your observation about empty explicitBounds?

Compare that with the equivalent OTel metric, http.server.request.duration:

[otel.javaagent 2024-12-18 14:40:23:217 -0700] [PeriodicMetricReader-1] INFO io.opentelemetry.exporter.logging.LoggingMetricExporter - metric: ImmutableMetricData{resource=Resource{schemaUrl=https://opentelemetry.io/schemas/1.24.0, attributes={deployment.environment="stevec-local", host.arch="aarch64", host.name="M-C44LGCWY6V", os.description="Mac OS X 15.2", os.type="darwin", process.executable.path="/Users/schristensen/.asdf/installs/java/temurin-21.0.1+12.0.LTS/bin/java", process.runtime.description="Eclipse Adoptium OpenJDK 64-Bit Server VM 21.0.1+12-LTS", process.runtime.name="OpenJDK Runtime Environment", process.runtime.version="21.0.1+12-LTS", service.instance.id="stevec-ide", service.name="my-application-name", telemetry.distro.name="opentelemetry-java-instrumentation", telemetry.distro.version="2.10.0", telemetry.sdk.language="java", telemetry.sdk.name="opentelemetry", telemetry.sdk.version="1.44.1"}}, instrumentationScopeInfo=InstrumentationScopeInfo{name=io.opentelemetry.jetty-12.0, version=2.10.0-alpha, schemaUrl=null, attributes={}}, name=http.server.request.duration, description=Duration of HTTP server requests., unit=s, type=HISTOGRAM, data=ImmutableHistogramData{aggregationTemporality=CUMULATIVE, points=[ImmutableHistogramPointData{getStartEpochNanos=1734557768207179000, getEpochNanos=1734558023213100000, getAttributes=FilteredAttributes{http.request.method=GET,http.response.status_code=302,http.route=/cookie_cleanser,network.protocol.version=1.1,url.scheme=http}, getSum=0.181517666, getCount=1, hasMin=true, getMin=0.181517666, hasMax=true, getMax=0.181517666, getBoundaries=[0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0], getCounts=[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], getExemplars=[]}]}}

Here, getBoundaries shows the explicit bucket boundaries.
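
To make the difference between the two data points concrete, here is a rough sketch of how a boundaries list and a counts list pair up into buckets, assuming the usual OTLP convention of one more count than boundaries. `bucket_ranges` is a hypothetical helper written for illustration; it is not code from the OTel SDK or from Aspire.

```python
import math


def bucket_ranges(boundaries, counts):
    """Pair each count with the (lower, upper] range it covers.

    Hypothetical helper. Assumes len(counts) == len(boundaries) + 1,
    with the first bucket open below and the last bucket open above.
    """
    if len(counts) != len(boundaries) + 1:
        raise ValueError("expected exactly one more count than boundaries")
    edges = [-math.inf, *boundaries, math.inf]
    return [(edges[i], edges[i + 1], count) for i, count in enumerate(counts)]


# Values copied from the http.server.request.duration point above.
otel_boundaries = [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25,
                   0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0]
otel_counts = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
occupied = [(lo, hi) for lo, hi, c in bucket_ranges(otel_boundaries, otel_counts) if c]

# Values copied from the Micrometer http.server.requests point above.
micrometer = bucket_ranges([], [1])
```

With the OTel point, the single request (0.181517666 s) lands in the 0.1 to 0.25 bucket. With the Micrometer point, the same arithmetic yields one bucket spanning negative to positive infinity, so the counts say nothing about the distribution, and any code that indexes into the boundaries list has nothing to index.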

Feels like the root cause could be a bug in the Micrometer export in the OTel Java agent?

To clarify the ask in this ticket: I don't expect Aspire to work for this funky malformed metric.
But it'd be nice if the Aspire dashboard handled the exception in a way that didn't require me to manually correct the URI in my browser.

@stevesea
Author

I also tried the other configuration knobs on the Java agent's Micrometer bridge:

otel.instrumentation.micrometer.prometheus-mode.enabled=true
otel.instrumentation.micrometer.histogram-gauges.enabled=true

Setting either or both to true didn't result in a usable http.server.requests within Aspire.

@adamint
Member

adamint commented Dec 19, 2024

@stevesea if possible, could you share a minimal spring boot repo that I could use to debug? Thanks

But, it'd be nice if the Aspire dashboard handled the exception in a way that didn't require me to manually correct the URI in my browser.

I agree. As a general rule, I'd like to gracefully handle malformed data if possible, or at least share a more detailed exception message

@JamesNK
Member

JamesNK commented Dec 30, 2024

It looks like the data from micrometer is invalid. I see:

getBoundaries=[], getCounts=[1]

Edit: I was wrong below. Having one more bucket count than boundaries is correct. But having bucket counts without any boundaries is wrong.

I'm assuming getBoundaries and getCounts map to explicit_bounds and bucket_counts on a histogram data point. That would line up with the error you see where the code assumes that each bucket count has a bounds. Aspire makes this assumption because the docs on histogram data point say that bounds must be at least one more than count:

https://github.com/open-telemetry/opentelemetry-proto/blob/2bd940b2b77c1ab57c27166af21384906da7bb2b/opentelemetry/proto/metrics/v1/metrics.proto#L463-L465

Without boundaries there is no way to understand what a count means. I think the only thing Aspire can do here is log that data is bad and return a response to the sender that the data point was rejected.
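
As an illustration of the failure mode discussed here, the sketch below shows a bucket-based percentile lookup that returns nothing instead of indexing past the end of an empty bounds list. It is not the dashboard's actual `CalculatePercentile` and not necessarily what the linked PR does; the function name, the `None` return, and the clamping of the overflow bucket are all assumptions made for the example.

```python
def calculate_percentile(percentile, counts, explicit_bounds):
    """Return the upper bound of the bucket holding the given percentile.

    Hypothetical sketch. Returns None when the point cannot be
    interpreted: no bounds, mismatched lengths, or no samples.
    """
    if not explicit_bounds or len(counts) != len(explicit_bounds) + 1:
        return None  # nothing safe to index into
    total = sum(counts)
    if total == 0:
        return None
    target = percentile / 100 * total
    running = 0
    for i, count in enumerate(counts):
        running += count
        if running >= target:
            # The final (overflow) bucket has no upper bound of its own,
            # so this sketch clamps it to the last explicit boundary.
            return explicit_bounds[min(i, len(explicit_bounds) - 1)]
    return explicit_bounds[-1]


# The Micrometer point from this issue: counts without any bounds.
print(calculate_percentile(50, [1], []))
```

A caller that receives `None` can draw an empty chart and log the rejected point rather than letting the exception reach the render loop.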

@JamesNK self-assigned this on Dec 30, 2024
@JamesNK removed the untriaged label on Dec 30, 2024
@JamesNK added this to the 9.1 milestone on Dec 30, 2024
@JamesNK linked a pull request on Dec 30, 2024 that will close this issue
@JamesNK
Member

JamesNK commented Dec 30, 2024

PR: #6998. What you'll see now is an empty graph. But it shouldn't error.

4 participants