Add token acquisition metrics that cover both successful and failed token acquisition attempts (opt-in)#6004
Conversation
There was a problem hiding this comment.
Pull request overview
Adds opt-in “extended” token acquisition OpenTelemetry histograms that capture both success and failure latency, and fills a gap by ensuring background proactive refresh contributes IDP network latency metrics.
Changes:
- Introduces extended metrics mode (via
MSAL_ENABLE_EXTENDED_TOKEN_METRICS) emittingMsalTotalDuration.2(tagged with success/failure) andMsalDurationInHttp.2(tagged with HTTP status code), instead of the V1 equivalents. - Records HTTP duration for background proactive refresh (success and failure) so IDP latency is visible even when the foreground request is served from cache.
- Extends failure telemetry plumbing to pass
ApiEvent, HTTP status code, and total duration into OTEL instrumentation, and ensures HTTP duration is captured even when HTTP calls throw.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/Microsoft.Identity.Test.Unit/TelemetryTests/OTelInstrumentationTests.cs | Adds/updates tests for extended metrics and background refresh HTTP duration, plus tag verification helpers. |
| src/client/Microsoft.Identity.Client/TelemetryCore/TelemetryConstants.cs | Adds HttpStatusCode tag constant for new histogram dimensions. |
| src/client/Microsoft.Identity.Client/TelemetryCore/OpenTelemetry/NullOtelInstrumentation.cs | Updates no-op implementation to match new OTEL instrumentation method signatures. |
| src/client/Microsoft.Identity.Client/TelemetryCore/OpenTelemetry/IOtelInstrumentation.cs | Extends OTEL interface with new failure params and LogSuccessHttpDuration. |
| src/client/Microsoft.Identity.Client/TelemetryCore/Internal/Events/ApiEvent.cs | Adds TokenSource to ApiEvent for richer failure telemetry tagging. |
| src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs | Implements extended metrics mode and new histograms, plus updated failure logging pipeline. |
| src/client/Microsoft.Identity.Client/OAuth2/OAuth2Client.cs | Accumulates HTTP duration into ApiEvent even when the HTTP request throws. |
| src/client/Microsoft.Identity.Client/Internal/Requests/SilentRequestHelper.cs | Logs background refresh HTTP duration on success; passes status code/duration into failure telemetry for background refresh. |
| src/client/Microsoft.Identity.Client/Internal/Requests/Silent/SilentRequest.cs | Sets ApiEvent.TokenSource when attempting broker-based silent auth. |
| src/client/Microsoft.Identity.Client/Internal/Requests/RequestBase.cs | Measures request duration for failures and forwards HTTP status + total duration into OTEL failure logging. |
| src/client/Microsoft.Identity.Client/Http/HttpManager.cs | Ensures LastRequestDurationInMs is set even when SendAsync throws. |
Comments suppressed due to low confidence (2)
src/client/Microsoft.Identity.Client/Internal/Requests/Silent/SilentRequest.cs:85
- TokenSource is set to Broker before attempting broker auth, but it is not reset when broker returns null and the code falls back to the local cache. If the cache path later throws, failure telemetry will incorrectly report TokenSource=Broker. Consider setting TokenSource only when returning a broker result, or resetting it before executing the cache strategy.
if (isBrokerConfigured && !isAccountSourceDeviceCodeFlow)
{
_logger.Info("Broker is configured and enabled, attempting to use broker instead.");
AuthenticationRequestParameters.RequestContext.ApiEvent.TokenSource = TokenSource.Broker;
var brokerResult = await _brokerStrategyLazy.Value.ExecuteAsync(cancellationToken).ConfigureAwait(false);
// fallback to local cache if broker fails
if (brokerResult != null)
{
_logger.Verbose(() => "Broker responded to silent request.");
return brokerResult;
}
}
_logger.Verbose(() => "Attempting to acquire token using local cache.");
return await _clientStrategy.ExecuteAsync(cancellationToken).ConfigureAwait(false);
}
src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:366
- In the extended-metrics failure path, CacheLevel is emitted as a string (string.Empty) while the success path emits CacheLevel as the CacheLevel enum. Mixing types for the same tag key on the same instrument can lead to inconsistent time series/dimension handling. Emit a consistent CacheLevel value (e.g., CacheLevel.None/Unknown) or remove this dimension from MsalTotalDuration.2 if it isn’t intended.
s_durationTotalV2.Value.Record(totalDurationInMs,
new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
new(TelemetryConstants.Platform, platform),
new(TelemetryConstants.ApiId, apiEvent.ApiId),
new(TelemetryConstants.CallerSdkId, callerSdkId ?? string.Empty + "," + callerSdkVersion ?? string.Empty),
new(TelemetryConstants.CacheRefreshReason, apiEvent.CacheInfo),
new(TelemetryConstants.CacheLevel, string.Empty),
new(TelemetryConstants.TokenSource, apiEvent.TokenSource),
new(TelemetryConstants.TokenType, apiEvent.TokenType),
…nclude_success_and_failures
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.
Comments suppressed due to low confidence (3)
src/client/Microsoft.Identity.Client/Internal/Requests/Silent/SilentRequest.cs:80
- After setting
ApiEvent.TokenSource = TokenSource.Broker, the code may still fall back to the local cache strategy whenbrokerResultis null. In that fallback (and on subsequent exceptions),ApiEvent.TokenSourceis left asBroker, so any failure-path V2 metric (MsalTotalDuration.2) emitted viaLogFailureMetricswill be incorrectly tagged withTokenSource=Brokereven though the request was ultimately satisfied (or failed) by the local cache / IdP path. Consider resettingTokenSourceif the broker returns null, or setting it only after a successful broker result.
AuthenticationRequestParameters.RequestContext.ApiEvent.TokenSource = TokenSource.Broker;
var brokerResult = await _brokerStrategyLazy.Value.ExecuteAsync(cancellationToken).ConfigureAwait(false);
// fallback to local cache if broker fails
if (brokerResult != null)
{
_logger.Verbose(() => "Broker responded to silent request.");
return brokerResult;
}
}
src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:368
- For
MsalTotalDuration.2theCacheLeveltag is emitted asstring.Emptyhere, whereas on the success path it is emitted as aCacheLevelenum value (seeLogSuccessMetrics). Recording the same tag key with two different value types (string vs enum) leads to inconsistent attribute schemas across metric points and can break downstream aggregation/visualization in OTel backends. Use aCacheLevelvalue (e.g.CacheLevel.None) instead ofstring.Empty.
s_durationTotalV2.Value.Record(totalDurationInMs,
new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
new(TelemetryConstants.Platform, platform),
new(TelemetryConstants.ApiId, apiEvent.ApiId),
new(TelemetryConstants.TokenSource, apiEvent.TokenSource),
new(TelemetryConstants.CacheLevel, string.Empty),
new(TelemetryConstants.CacheRefreshReason, apiEvent.CacheInfo),
new(TelemetryConstants.TokenType, apiEvent.TokenType),
new(TelemetryConstants.ErrorCode, errorCode),
new(TelemetryConstants.Succeeded, false));
}
src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:379
- When the failure is a request timeout (
isTimeout && apiEvent.DurationInHttpInMs > 0),MsalDurationInHttp.2is still recorded withHttpStatusCode = httpStatusCode, which in the timeout path is0. Recording a histogram point withHttpStatusCode=0is misleading (the user-facing status indicates "no HTTP response received"). Consider either omitting theHttpStatusCodetag for the timeout case or using a sentinel/string value that is clearly distinguishable from a real status code in the backend.
bool isTimeout = string.Equals(errorCode, MsalError.RequestTimeout, StringComparison.OrdinalIgnoreCase);
if ((httpStatusCode > 0 || (isTimeout && apiEvent.DurationInHttpInMs > 0)) && s_durationInHttpV2.Value.Enabled)
{
s_durationInHttpV2.Value.Record(apiEvent.DurationInHttpInMs,
new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
new(TelemetryConstants.Platform, platform),
new(TelemetryConstants.ApiId, apiEvent.ApiId),
new(TelemetryConstants.TokenType, apiEvent.TokenType),
new(TelemetryConstants.HttpStatusCode, httpStatusCode));
}
…ess_and_failures' of https://github.com/AzureAD/microsoft-authentication-library-for-dotnet into feature/add_extended_duration_metrics_that_include_success_and_failures
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (4)
src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:364
- The V2 total-duration histogram tags
CacheRefreshReasonfromapiEvent.CacheInfo, while the failure counter built just above uses thecacheRefreshReasonparameter. For background proactive refresh failures, callers passCacheRefreshReason.ProactivelyRefreshedas the parameter butapiEvent.CacheInfotypically reflects the foreground request's cache info (e.g.NoCachedAccessToken/Expired/NotApplicable), so the same failure will report two different cache-refresh-reason values across these two metrics. Either drop the parameter and useapiEvent.CacheInfoeverywhere, or use thecacheRefreshReasonparameter consistently in both the counter and the V2 histogram so emitted metrics stay aligned.
{ TelemetryConstants.CacheRefreshReason, cacheRefreshReason },
{ TelemetryConstants.TokenType, tokenType }
};
if (!string.IsNullOrEmpty(rawStsErrorCode))
tags.Add(TelemetryConstants.RawStsErrorCode, rawStsErrorCode);
s_failureCounter.Value.Add(1, in tags);
}
if (_isExtendedMetricsEnabled)
{
if (totalDurationInMs > 0 && s_durationTotalV2.Value.Enabled)
{
s_durationTotalV2.Value.Record(totalDurationInMs,
new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
new(TelemetryConstants.Platform, platform),
new(TelemetryConstants.ApiId, apiEvent.ApiId),
new(TelemetryConstants.TokenSource, apiEvent.TokenSource),
new(TelemetryConstants.CacheLevel, string.Empty),
new(TelemetryConstants.CacheRefreshReason, apiEvent.CacheInfo),
src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:379
- When
httpStatusCode == 0anderrorCodeisMsalError.RequestTimeout, this branch still recordsMsalDurationInHttp.2withHttpStatusCode = 0. A status-code tag of0is not a real HTTP status and is indistinguishable from "unset"; downstream dashboards that bucket by status code will see a meaningless 0 series. Consider using a sentinel like-1or omitting the metric (the V1 HTTP duration histogram intentionally did not include a status code tag at all). Also, for non-timeout transport failures (DNS, TLS, connection reset)httpStatusCodewill be 0 and the metric will be skipped, even thoughapiEvent.DurationInHttpInMswas measured — which means the only no-status path that gets recorded is the timeout path, which is an inconsistent contract.
bool isTimeout = string.Equals(errorCode, MsalError.RequestTimeout, StringComparison.OrdinalIgnoreCase);
if ((httpStatusCode > 0 || (isTimeout && apiEvent.DurationInHttpInMs > 0)) && s_durationInHttpV2.Value.Enabled)
{
s_durationInHttpV2.Value.Record(apiEvent.DurationInHttpInMs,
new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
new(TelemetryConstants.Platform, platform),
new(TelemetryConstants.ApiId, apiEvent.ApiId),
new(TelemetryConstants.TokenType, apiEvent.TokenType),
new(TelemetryConstants.HttpStatusCode, httpStatusCode));
}
src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:318
LogSuccessHttpDurationalways tagsHttpStatusCode = 200for the V2 histogram, but the same metric recordsHttpStatusCode = 0(or other values) on the failure path inLogFailureMetrics. Hard-coding200here is a guess — successful responses on this code path could in principle be other 2xx values, and more importantly, the IDP HTTP duration is already recorded inside the success path ofLogSuccessMetricsforTokenSource == IdentityProvider. For a foreground IDP success this method is not called (only background refresh calls it), butLogSuccessMetricsalready records the same V2 histogram for foreground IDP successes — please verify there is no double-recording for a non-cache foreground success, and prefer reading the actual status code from the response (or fromApiEvent/AuthenticationResultMetadata) rather than hard-coding 200.
s_durationInHttpV2.Value.Record(authResultMetadata.DurationInHttpInMs,
new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
new(TelemetryConstants.Platform, platform),
new(TelemetryConstants.ApiId, apiId),
new(TelemetryConstants.TokenType, authResultMetadata.TelemetryTokenType),
new(TelemetryConstants.HttpStatusCode, 200));
src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs:367
MsalTotalDuration.2records the rawerrorCodestring as a tag value. Error codes can be unbounded in practice (e.g. exception type names fromcatch (Exception ex)branches, raw STS error strings) and using them as histogram dimensions risks tag-cardinality explosion in metric backends. Consider bounding the set of error codes recorded as a tag (e.g. only well-known MsalError constants) or move the raw error code to a separate, lower-cardinality channel.
s_durationTotalV2.Value.Record(totalDurationInMs,
new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()),
new(TelemetryConstants.Platform, platform),
new(TelemetryConstants.ApiId, apiEvent.ApiId),
new(TelemetryConstants.TokenSource, apiEvent.TokenSource),
new(TelemetryConstants.CacheLevel, string.Empty),
new(TelemetryConstants.CacheRefreshReason, apiEvent.CacheInfo),
new(TelemetryConstants.TokenType, apiEvent.TokenType),
new(TelemetryConstants.ErrorCode, errorCode),
new(TelemetryConstants.Succeeded, false));
neha-bhargava
left a comment
There was a problem hiding this comment.
Let's combine the platform and msal version tags to keep the tag limit to 8 in this PR. Also added comments to rename the histogram as per the convention.
…nclude_success_and_failures
Add token acquisition metrics that cover both successful and failed token acquisition outcomes and close an observability gap by recording IDP network latency for background refresh.