Skip to content

Add token acquisition metrics that cover both successful and failed token acquisition attempts (opt-in)#6004

Merged
ssmelov merged 15 commits into
mainfrom
feature/add_extended_duration_metrics_that_include_success_and_failures
May 27, 2026
Merged

Add token acquisition metrics that cover both successful and failed token acquisition attempts (opt-in)#6004
ssmelov merged 15 commits into
mainfrom
feature/add_extended_duration_metrics_that_include_success_and_failures

Conversation

@ssmelov

@ssmelov ssmelov commented May 13, 2026

Copy link
Copy Markdown
Collaborator

Add token acquisition metrics that cover both successful and failed token acquisition outcomes and close an observability gap by recording IDP network latency for background refresh.

Copilot AI review requested due to automatic review settings May 13, 2026 15:35
@ssmelov ssmelov requested a review from a team as a code owner May 13, 2026 15:35

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds opt-in “extended” token acquisition OpenTelemetry histograms that capture both success and failure latency, and fills a gap by ensuring background proactive refresh contributes IDP network latency metrics.

Changes:

  • Introduces extended metrics mode (via MSAL_ENABLE_EXTENDED_TOKEN_METRICS) emitting MsalTotalDuration.2 (tagged with success/failure) and MsalDurationInHttp.2 (tagged with HTTP status code), instead of the V1 equivalents.
  • Records HTTP duration for background proactive refresh (success and failure) so IDP latency is visible even when the foreground request is served from cache.
  • Extends failure telemetry plumbing to pass ApiEvent, HTTP status code, and total duration into OTEL instrumentation, and ensures HTTP duration is captured even when HTTP calls throw.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/Microsoft.Identity.Test.Unit/TelemetryTests/OTelInstrumentationTests.cs Adds/updates tests for extended metrics and background refresh HTTP duration, plus tag verification helpers.
src/client/Microsoft.Identity.Client/TelemetryCore/TelemetryConstants.cs Adds HttpStatusCode tag constant for new histogram dimensions.
src/client/Microsoft.Identity.Client/TelemetryCore/OpenTelemetry/NullOtelInstrumentation.cs Updates no-op implementation to match new OTEL instrumentation method signatures.
src/client/Microsoft.Identity.Client/TelemetryCore/OpenTelemetry/IOtelInstrumentation.cs Extends OTEL interface with new failure params and LogSuccessHttpDuration.
src/client/Microsoft.Identity.Client/TelemetryCore/Internal/Events/ApiEvent.cs Adds TokenSource to ApiEvent for richer failure telemetry tagging.
src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs Implements extended metrics mode and new histograms, plus updated failure logging pipeline.
src/client/Microsoft.Identity.Client/OAuth2/OAuth2Client.cs Accumulates HTTP duration into ApiEvent even when the HTTP request throws.
src/client/Microsoft.Identity.Client/Internal/Requests/SilentRequestHelper.cs Logs background refresh HTTP duration on success; passes status code/duration into failure telemetry for background refresh.
src/client/Microsoft.Identity.Client/Internal/Requests/Silent/SilentRequest.cs Sets ApiEvent.TokenSource when attempting broker-based silent auth.
src/client/Microsoft.Identity.Client/Internal/Requests/RequestBase.cs Measures request duration for failures and forwards HTTP status + total duration into OTEL failure logging.
src/client/Microsoft.Identity.Client/Http/HttpManager.cs Ensures LastRequestDurationInMs is set even when SendAsync throws.
Comments suppressed due to low confidence (2)

src/client/Microsoft.Identity.Client/Internal/Requests/Silent/SilentRequest.cs:85

  • TokenSource is set to Broker before attempting broker auth, but it is not reset when broker returns null and the code falls back to the local cache. If the cache path later throws, failure telemetry will incorrectly report TokenSource=Broker. Consider setting TokenSource only when returning a broker result, or resetting it before executing the cache strategy.
                if (isBrokerConfigured && !isAccountSourceDeviceCodeFlow)
                {
                    _logger.Info("Broker is configured and enabled, attempting to use broker instead.");
                    AuthenticationRequestParameters.RequestContext.ApiEvent.TokenSource = TokenSource.Broker;
                    var brokerResult = await _brokerStrategyLazy.Value.ExecuteAsync(cancellationToken).ConfigureAwait(false);

                    // fallback to local cache if broker fails
                    if (brokerResult != null)
                    {
                        _logger.Verbose(() => "Broker responded to silent request.");
                        return brokerResult;
                    }
                }

                _logger.Verbose(() => "Attempting to acquire token using local cache.");

                return await _clientStrategy.ExecuteAsync(cancellationToken).ConfigureAwait(false);
            }

src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:366

  • In the extended-metrics failure path, CacheLevel is emitted as a string (string.Empty) while the success path emits CacheLevel as the CacheLevel enum. Mixing types for the same tag key on the same instrument can lead to inconsistent time series/dimension handling. Emit a consistent CacheLevel value (e.g., CacheLevel.None/Unknown) or remove this dimension from MsalTotalDuration.2 if it isn’t intended.
                    s_durationTotalV2.Value.Record(totalDurationInMs,
                        new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
                        new(TelemetryConstants.Platform, platform),
                        new(TelemetryConstants.ApiId, apiEvent.ApiId),
                        new(TelemetryConstants.CallerSdkId, callerSdkId ?? string.Empty + "," + callerSdkVersion ?? string.Empty),
                        new(TelemetryConstants.CacheRefreshReason, apiEvent.CacheInfo),
                        new(TelemetryConstants.CacheLevel, string.Empty),
                        new(TelemetryConstants.TokenSource, apiEvent.TokenSource),
                        new(TelemetryConstants.TokenType, apiEvent.TokenType),

Comment thread src/client/Microsoft.Identity.Client/Internal/Requests/RequestBase.cs Outdated
Comment thread tests/Microsoft.Identity.Test.Unit/TelemetryTests/OTelInstrumentationTests.cs Outdated
Copilot AI review requested due to automatic review settings May 14, 2026 13:46

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.

Comments suppressed due to low confidence (3)

src/client/Microsoft.Identity.Client/Internal/Requests/Silent/SilentRequest.cs:80

  • After setting ApiEvent.TokenSource = TokenSource.Broker, the code may still fall back to the local cache strategy when brokerResult is null. In that fallback (and on subsequent exceptions), ApiEvent.TokenSource is left as Broker, so any failure-path V2 metric (MsalTotalDuration.2) emitted via LogFailureMetrics will be incorrectly tagged with TokenSource=Broker even though the request was ultimately satisfied (or failed) by the local cache / IdP path. Consider resetting TokenSource if the broker returns null, or setting it only after a successful broker result.
                    AuthenticationRequestParameters.RequestContext.ApiEvent.TokenSource = TokenSource.Broker;
                    var brokerResult = await _brokerStrategyLazy.Value.ExecuteAsync(cancellationToken).ConfigureAwait(false);

                    // fallback to local cache if broker fails
                    if (brokerResult != null)
                    {
                        _logger.Verbose(() => "Broker responded to silent request.");
                        return brokerResult;
                    }
                }

src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:368

  • For MsalTotalDuration.2 the CacheLevel tag is emitted as string.Empty here, whereas on the success path it is emitted as a CacheLevel enum value (see LogSuccessMetrics). Recording the same tag key with two different value types (string vs enum) leads to inconsistent attribute schemas across metric points and can break downstream aggregation/visualization in OTel backends. Use a CacheLevel value (e.g. CacheLevel.None) instead of string.Empty.
                    s_durationTotalV2.Value.Record(totalDurationInMs,
                        new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
                        new(TelemetryConstants.Platform, platform),
                        new(TelemetryConstants.ApiId, apiEvent.ApiId),
                        new(TelemetryConstants.TokenSource, apiEvent.TokenSource),
                        new(TelemetryConstants.CacheLevel, string.Empty),
                        new(TelemetryConstants.CacheRefreshReason, apiEvent.CacheInfo),
                        new(TelemetryConstants.TokenType, apiEvent.TokenType),
                        new(TelemetryConstants.ErrorCode, errorCode),
                        new(TelemetryConstants.Succeeded, false));
                }

src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:379

  • When the failure is a request timeout (isTimeout && apiEvent.DurationInHttpInMs > 0), MsalDurationInHttp.2 is still recorded with HttpStatusCode = httpStatusCode, which in the timeout path is 0. Recording a histogram point with HttpStatusCode=0 is misleading (the user-facing status indicates "no HTTP response received"). Consider either omitting the HttpStatusCode tag for the timeout case or using a sentinel/string value that is clearly distinguishable from a real status code in the backend.
                bool isTimeout = string.Equals(errorCode, MsalError.RequestTimeout, StringComparison.OrdinalIgnoreCase);
                if ((httpStatusCode > 0 || (isTimeout && apiEvent.DurationInHttpInMs > 0)) && s_durationInHttpV2.Value.Enabled)
                {
                    s_durationInHttpV2.Value.Record(apiEvent.DurationInHttpInMs,
                        new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
                        new(TelemetryConstants.Platform, platform),
                        new(TelemetryConstants.ApiId, apiEvent.ApiId),
                        new(TelemetryConstants.TokenType, apiEvent.TokenType),
                        new(TelemetryConstants.HttpStatusCode, httpStatusCode));
                }

Comment thread src/client/Microsoft.Identity.Client/Internal/Requests/SilentRequestHelper.cs Outdated
Comment thread src/client/Microsoft.Identity.Client/TelemetryCore/Internal/Events/ApiEvent.cs Outdated
Comment thread src/client/Microsoft.Identity.Client/OAuth2/OAuth2Client.cs
@ssmelov ssmelov changed the title Add token acquisition metrics that cover both successful and failed token acquisition attempts Add token acquisition metrics that cover both successful and failed token acquisition attempts (opt-in) May 14, 2026
Copilot AI review requested due to automatic review settings May 14, 2026 14:37

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (4)

src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:364

  • The V2 total-duration histogram tags CacheRefreshReason from apiEvent.CacheInfo, while the failure counter built just above uses the cacheRefreshReason parameter. For background proactive refresh failures, callers pass CacheRefreshReason.ProactivelyRefreshed as the parameter but apiEvent.CacheInfo typically reflects the foreground request's cache info (e.g. NoCachedAccessToken / Expired / NotApplicable), so the same failure will report two different cache-refresh-reason values across these two metrics. Either drop the parameter and use apiEvent.CacheInfo everywhere, or use the cacheRefreshReason parameter consistently in both the counter and the V2 histogram so emitted metrics stay aligned.
                    { TelemetryConstants.CacheRefreshReason, cacheRefreshReason },
                    { TelemetryConstants.TokenType, tokenType }
                };

                if (!string.IsNullOrEmpty(rawStsErrorCode))
                    tags.Add(TelemetryConstants.RawStsErrorCode, rawStsErrorCode);

                s_failureCounter.Value.Add(1, in tags);
            }

            if (_isExtendedMetricsEnabled)
            {
                if (totalDurationInMs > 0 && s_durationTotalV2.Value.Enabled)
                {
                    s_durationTotalV2.Value.Record(totalDurationInMs,
                        new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
                        new(TelemetryConstants.Platform, platform),
                        new(TelemetryConstants.ApiId, apiEvent.ApiId),
                        new(TelemetryConstants.TokenSource, apiEvent.TokenSource),
                        new(TelemetryConstants.CacheLevel, string.Empty),
                        new(TelemetryConstants.CacheRefreshReason, apiEvent.CacheInfo),

src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:379

  • When httpStatusCode == 0 and errorCode is MsalError.RequestTimeout, this branch still records MsalDurationInHttp.2 with HttpStatusCode = 0. A status-code tag of 0 is not a real HTTP status and is indistinguishable from "unset"; downstream dashboards that bucket by status code will see a meaningless 0 series. Consider using a sentinel like -1 or omitting the metric (the V1 HTTP duration histogram intentionally did not include a status code tag at all). Also, for non-timeout transport failures (DNS, TLS, connection reset) httpStatusCode will be 0 and the metric will be skipped, even though apiEvent.DurationInHttpInMs was measured — which means the only no-status path that gets recorded is the timeout path, which is an inconsistent contract.
                bool isTimeout = string.Equals(errorCode, MsalError.RequestTimeout, StringComparison.OrdinalIgnoreCase);
                if ((httpStatusCode > 0 || (isTimeout && apiEvent.DurationInHttpInMs > 0)) && s_durationInHttpV2.Value.Enabled)
                {
                    s_durationInHttpV2.Value.Record(apiEvent.DurationInHttpInMs,
                        new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
                        new(TelemetryConstants.Platform, platform),
                        new(TelemetryConstants.ApiId, apiEvent.ApiId),
                        new(TelemetryConstants.TokenType, apiEvent.TokenType),
                        new(TelemetryConstants.HttpStatusCode, httpStatusCode));
                }

src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:318

  • LogSuccessHttpDuration always tags HttpStatusCode = 200 for the V2 histogram, but the same metric records HttpStatusCode = 0 (or other values) on the failure path in LogFailureMetrics. Hard-coding 200 here is a guess — successful responses on this code path could in principle be other 2xx values, and more importantly, the IDP HTTP duration is already recorded inside the success path of LogSuccessMetrics for TokenSource == IdentityProvider. For a foreground IDP success this method is not called (only background refresh calls it), but LogSuccessMetrics already records the same V2 histogram for foreground IDP successes — please verify there is no double-recording for a non-cache foreground success, and prefer reading the actual status code from the response (or from ApiEvent/AuthenticationResultMetadata) rather than hard-coding 200.
                    s_durationInHttpV2.Value.Record(authResultMetadata.DurationInHttpInMs,
                        new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
                        new(TelemetryConstants.Platform, platform),
                        new(TelemetryConstants.ApiId, apiId),
                        new(TelemetryConstants.TokenType, authResultMetadata.TelemetryTokenType),
                        new(TelemetryConstants.HttpStatusCode, 200));

src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:367

  • MsalTotalDuration.2 records the raw errorCode string as a tag value. Error codes can be unbounded in practice (e.g. exception type names from catch (Exception ex) branches, raw STS error strings) and using them as histogram dimensions risks tag-cardinality explosion in metric backends. Consider bounding the set of error codes recorded as a tag (e.g. only well-known MsalError constants) or move the raw error code to a separate, lower-cardinality channel.
                    s_durationTotalV2.Value.Record(totalDurationInMs,
                        new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
                        new(TelemetryConstants.Platform, platform),
                        new(TelemetryConstants.ApiId, apiEvent.ApiId),
                        new(TelemetryConstants.TokenSource, apiEvent.TokenSource),
                        new(TelemetryConstants.CacheLevel, string.Empty),
                        new(TelemetryConstants.CacheRefreshReason, apiEvent.CacheInfo),
                        new(TelemetryConstants.TokenType, apiEvent.TokenType),
                        new(TelemetryConstants.ErrorCode, errorCode),
                        new(TelemetryConstants.Succeeded, false));

Comment thread src/client/Microsoft.Identity.Client/Internal/Requests/SilentRequestHelper.cs Outdated
Comment thread src/client/Microsoft.Identity.Client/Http/HttpManager.cs Outdated

@neha-bhargava neha-bhargava left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's combine the platform and msal version tags to keep the tag limit to 8 in this PR. Also added comments to rename the histogram as per the convention.

Comment thread src/client/Microsoft.Identity.Client/TelemetryCore/Internal/Events/ApiEvent.cs Outdated
Comment thread src/client/Microsoft.Identity.Client/Http/HttpManager.cs
Copilot AI review requested due to automatic review settings May 26, 2026 12:00

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@ssmelov ssmelov merged commit 85cacbc into main May 27, 2026
15 checks passed
@ssmelov ssmelov deleted the feature/add_extended_duration_metrics_that_include_success_and_failures branch May 27, 2026 17:40
This was referenced Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants