Direct: Fixes OpenTcpConnectionTimeout silent truncation of sub-second and fractional values#5873
Conversation
|
@microsoft-github-policy-service agree company="Microsoft" |
2b171e6 to
e391b37
Compare
26654f8 to
1e331b4
Compare
…d and fractional values Fixes Azure#4500. Previously, DocumentClient cast `connectionPolicy.OpenTcpConnectionTimeout.Value.TotalSeconds` directly to int, which silently truncated: * Sub-second values (e.g. `TimeSpan.FromMilliseconds(500)`) to 0, causing the Direct StoreClientFactory to fall back to the request timeout instead of the customer's intent. * Fractional values >= 1s (e.g. `TimeSpan.FromSeconds(2.5)`) down to the next whole second, silently shortening the customer's requested timeout. * Negative values to 0, again silently falling back. This change formalizes the public contract on `CosmosClientOptions.OpenTcpConnectionTimeout`, `ConnectionPolicy.OpenTcpConnectionTimeout`, and `CosmosClientBuilder.WithConnectionModeDirect`: * Negative TimeSpan -> ArgumentOutOfRangeException at the public setter (fail fast). * The customer-supplied TimeSpan is preserved unchanged on the SDK-level properties. * At the int-typed Direct boundary (DocumentClient.cs): * Values in [TimeSpan.Zero, 1 second) normalize to 0 (documented "use request timeout" sentinel). * Values >= 1 second round UP via Math.Ceiling to the nearest whole second so the SDK never silently SHORTENS the customer's requested timeout (e.g. 2.5s -> 3s, never 2s). XML docs on all three public surfaces explicitly call out this contract. Note: the downstream Microsoft.Azure.Cosmos.Direct StoreClientFactory constructor only accepts `int openTimeoutInSeconds` for this parameter, so whole-second resolution at the transport boundary is an external constraint; Math.Ceiling is the closest faithful interpretation of "leave the customer's value unchanged". Tests (CosmosClientOptionsUnitTests): * T1 OpenTcpConnectionTimeout_NegativeTimeSpan_Throws * T2 OpenTcpConnectionTimeout_Zero_IsAllowed_AndRoundTripsThroughConnectionPolicy * T3 OpenTcpConnectionTimeout_SubSecond_NormalizesToZero_InRntbdConfig * T4 OpenTcpConnectionTimeout_WholeSeconds_PreservedInRntbdConfig * T5 WithConnectionModeDirect_NegativeOpenTcpTimeout_Throws * T6 OpenTcpConnectionTimeout_Fractional_RoundsUpInRntbdConfig (2.5s -> 3) * T7 OpenTcpConnectionTimeout_JustOverOneSecond_RoundsUpInRntbdConfig (1.001s -> 2) * T8 OpenTcpConnectionTimeout_Fractional_PreservedOnConnectionPolicyTimeSpan (2.5s round-trips unchanged through ConnectionPolicy) Out of scope: the same (int)TotalSeconds truncation pattern applies to IdleTcpConnectionTimeout at DocumentClient.cs:925. A follow-up issue mirroring Azure#4500 will be filed. Changelog: entry added under Unreleased / Bugs Fixed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1e331b4 to
0dd910c
Compare
Pilchie
left a comment
There was a problem hiding this comment.
Great job! Happy to see a first PR.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| { | ||
| if (value.HasValue && value.Value < TimeSpan.Zero) | ||
| { | ||
| throw new ArgumentOutOfRangeException( |
There was a problem hiding this comment.
Wouldn't throwing here be a breaking change for customers ? The old behavior used to just use the request timeout.
There was a problem hiding this comment.
Changed to a warning log instead of throwing
… warning log, adds int overflow guard - Negative values now reset to null (transport default 5s) with DefaultTrace.TraceWarning - Removed inline ArgumentOutOfRangeException throws (non-breaking) - Added int.MaxValue clamp to prevent overflow at transport boundary - Simplified ConnectionPolicy.OpenTcpConnectionTimeout to get;set; pattern - Updated tests to verify negative values reset to null - Updated XML docs and changelog Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Head branch was pushed to by a user without write access
136ed8f to
3cf2498
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…fRangeException Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kushagraThapar
left a comment
There was a problem hiding this comment.
Thanks for picking up the truncation half of this fix, @yumnahussain — the Math.Ceiling round-up direction is exactly right (always conservative-safe, never silently shortens a customer-requested timeout). Requesting changes only on the negative-TimeSpan throw, which I think is a behavioral breaking change in disguise — same concern we just raised on PR #5887.
Context: the open @tvaron3 thread
@tvaron3 already asked "Wouldn't throwing here be a breaking change for customers?" and your reply said "Changed to a warning log instead of throwing" — but the final commit on this branch is literally titled "Change negative OpenTcpConnectionTimeout from warning to ArgumentOutOfRangeException". The thread is still unresolved on the PR. I don't think you intended to leave that mismatch — could you please loop back to @tvaron3 either way?
Why we'd land on clamp + warn
Before this PR, a customer passing TimeSpan.FromSeconds(-1) got silent acceptance — the value cast to a negative int and flowed into the RNTBD transport. After this PR, those same customers will get ArgumentOutOfRangeException at CosmosClient construction on upgrade.
For SDK config that customers may have hardcoded years ago, throwing is the riskiest possible upgrade behavior — we'd turn a previously-working client construction into a startup crash with no opt-out. Even if the original value was semantically nonsense, the upgrade breaks customers who have been running successfully.
We landed on the same recommendation for PR #5887 (review here). Doing the same thing here keeps both sibling fixes consistent.
Proposal: clamp + warn instead of throw
Could you please switch the negative-value handling to clamp to TimeSpan.Zero and emit a warning trace, instead of throwing? Something like:
if (value.HasValue && value.Value < TimeSpan.Zero)
{
DefaultTrace.TraceWarning(
"OpenTcpConnectionTimeout value {0} is negative; clamping to TimeSpan.Zero (RequestTimeout will be used as the open-connection timeout).",
value.Value);
value = TimeSpan.Zero;
}This way:
- Existing apps that today pass a negative
TimeSpankeep starting up on upgrade (no live-site impact from a customer SDK bump). - Customers still get pushed toward a sane configuration via the warning trace.
- We avoid having to move this entry under
### Breaking Changes. - The clamped value
TimeSpan.Zeroflows through the existing sub-second-to-0 normalization path inDocumentClient.csand lands on the documented "use RequestTimeout" sentinel — so the behavior gets more correct, not less.
Asks (in priority order)
- (blocking) Switch the negative-throw to clamp + warn in
CosmosClientOptions.cs(and inline the check at the setter rather thanValidateDirectTCPSettings— see inline comment). - (blocking) Update the negative-throw tests to assert the clamp + warning trace instead.
- (blocking) Reply to @tvaron3's outstanding thread.
- (nice-to-have) Fix the duplicate
[5870]changelog entry — see inline. PR #5870 already shipped in3.60.0(line 58 onmain).
If you feel strongly about keeping the throw, happy to discuss — but in that case we'd need to move the changelog entry under ### Breaking Changes and add explicit migration guidance. Thanks for the careful work on the truncation half — the round-up direction is genuinely the right call.
| if (this.OpenTcpConnectionTimeout.HasValue && this.OpenTcpConnectionTimeout.Value < TimeSpan.Zero) | ||
| { | ||
| throw new ArgumentOutOfRangeException( | ||
| nameof(this.OpenTcpConnectionTimeout), | ||
| this.OpenTcpConnectionTimeout.Value, | ||
| $"{nameof(this.OpenTcpConnectionTimeout)} must not be negative."); | ||
| } |
There was a problem hiding this comment.
Could you please switch this from throwing to clamp + warn? Something like:
if (this.OpenTcpConnectionTimeout.HasValue && this.OpenTcpConnectionTimeout.Value < TimeSpan.Zero)
{
DefaultTrace.TraceWarning(
"OpenTcpConnectionTimeout value {0} is negative; clamping to TimeSpan.Zero (RequestTimeout will be used as the open-connection timeout).",
this.OpenTcpConnectionTimeout.Value);
this.OpenTcpConnectionTimeout = TimeSpan.Zero;
}Two structural notes while we're here:
- Wrong method for this check.
ValidateDirectTCPSettings's actual purpose is the Direct-mode coupling invariant ("setting requiresConnectionMode = Direct"). The range check belongs at the property setter, validate-then-assign style. Today, putting it here means:- It fires redundantly on every Direct-mode TCP setter call (idle timeout, open timeout, max requests, port reuse, connection mode), not just on
OpenTcpConnectionTimeout. - If a caller calls
opts.ConnectionMode = Directafter a badOpenTcpConnectionTimeout, the error message will referenceOpenTcpConnectionTimeouteven though they were touchingConnectionMode. MisleadingParamName.
- It fires redundantly on every Direct-mode TCP setter call (idle timeout, open timeout, max requests, port reuse, connection mode), not just on
- Wrong order with the existing checks. Right now,
new CosmosClientOptions { ConnectionMode = Gateway, OpenTcpConnectionTimeout = TimeSpan.FromSeconds(-1) }throwsArgumentException("requires Direct") instead of the documentedArgumentOutOfRangeException. The xmldoc + changelog promise AOORE categorically for negatives. Moving to the setter (per Thanks! #1) also fixes this.
Clean exemplar in the same file: GatewayModeMaxConnectionLimit setter at line 293-310 — validate-then-assign at the setter, with the bound named via a constant.
|
|
||
| #### Bugs Fixed | ||
|
|
||
| - [#5873](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5873) Direct: Fixes silent truncation of `OpenTcpConnectionTimeout`. Sub-second values normalize to `TimeSpan.Zero` (request-timeout fallback). Fractional values greater than or equal to 1 second round up to the nearest whole second at the transport boundary (for example, 2.3 seconds becomes 3 seconds). Negative values now throw an `ArgumentOutOfRangeException`. |
There was a problem hiding this comment.
Two small changelog things while we're in here:
- The
[5870]entry on the next line is a duplicate — PR CrossRegionHedgingAvailabilityStrategy: FixesStackOverflowinCrossRegionHedgingAvailabilityStrategyObserved in .NET Framework 4.7.2 #5870 already shipped in3.60.0(line 58 ofmain:changelog.md). Could you please delete that line? Looks like a stale merge artifact. - Once we switch the implementation to clamp + warn (see inline on
CosmosClientOptions.cs), could you please rephrase this entry to match? Suggested wording:
5873 Direct: Fixes silent truncation of
OpenTcpConnectionTimeout. Sub-second values fall back to the configuredRequestTimeoutat the transport boundary (the property itself preserves the suppliedTimeSpan). Fractional values greater than or equal to 1 second round up to the nearest whole second (for example, 2.3 seconds becomes 3 seconds — rounding up can increase the effective per-attempt connect-phase timeout by up to ~1 second for non-whole-second values). Negative values are clamped toTimeSpan.Zerowith a warning trace.
Also note the format inconsistency with neighbors: this entry uses [#5873] (leading #) while every neighbor uses [5783], [5870] (no #). Could you please drop the # to match?
If you decide to keep the throw, this entry would need to move under ### Breaking Changes instead, since existing customer apps passing negative TimeSpan values will start failing at CosmosClient construction on upgrade.
|
Why negative values are preserved (not thrown/clamped): The full chain for negative values:
Net effect: RNTBD uses Enforcing bounds (throwing or clamping) would be a behavioral breaking change for existing customers who have configured negative values that have been working via this fallback path for a long time. |
|
Why negative values are preserved (not thrown/clamped): The full chain for negative values:
Net effect: RNTBD uses Clamping on the SDK level would not have changed the underlying behavior by the time the value reached RNTBD (it would still end up as Enforcing bounds (throwing or clamping) would be a behavioral breaking change for existing customers who have configured negative values that have been working via this fallback path for a long time. |
…ng-only for backward compatibility - Negative values emit DefaultTrace.TraceWarning but are preserved unchanged - Removes clamping to TimeSpan.Zero and ArgumentOutOfRangeException for negatives - Sub-second values in [0, 1s) explicitly normalize to 0 in DocumentClient - Fractional values >= 1s round up via Math.Ceiling (unchanged) - Updated XML docs, tests, and changelog to reflect warn-only behavior Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
41f0dae to
660bd2e
Compare
| #### Bugs Fixed | ||
|
|
||
| - [5873](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5873) Direct: Fixes silent truncation of `OpenTcpConnectionTimeout`. Sub-second values in [0, 1s) are now explicitly normalized to 0 and fractional values ≥ 1 second are rounded up to the nearest whole second (for example, 2.3s becomes 3s). Negative values emit a warning trace but are left unchanged for backward compatibility. | ||
| - [5870](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5870) CrossRegionHedgingAvailabilityStrategy: Fixes StackOverflow in CrossRegionHedgingAvailabilityStrategy Observed in .NET Framework 4.7.2. |
There was a problem hiding this comment.
| - [5870](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5870) CrossRegionHedgingAvailabilityStrategy: Fixes StackOverflow in CrossRegionHedgingAvailabilityStrategy Observed in .NET Framework 4.7.2. |
tvaron3
left a comment
There was a problem hiding this comment.
LGTM besides small changelog update
Fixes #4500
Previously,
DocumentClientcastconnectionPolicy.OpenTcpConnectionTimeout.Value.TotalSecondsdirectly toint, which silently truncated sub-second values to 0 and fractional values >= 1s down to the next whole second.This change formalizes the public contract for the (0, 1) second range and documents negative value behavior:
Sub-second and fractional values (the fix in this PR):
TimeSpanis preserved unchanged on SDK-level properties.DocumentClient.cs):Math.Ceilingso the SDK never silently shortens the requested timeout (e.g. 2.3s -> 3s, never 2s).Negative values (warning only, no enforcement):
DefaultTrace.TraceWarningadvising they are not recommended.DocumentClient, negative values are truncated via(int)TotalSeconds(e.g. -5.7s -> -5), preserving the pre-PR cast behavior.StoreClientFactorywraps this back toTimeSpan.FromSeconds(-5)and setsTransportClient.Options.OpenTimeout.TransportClient.Options.OpenTimeoutgetter checksif (openTimeout > TimeSpan.Zero); since negative values fail that check, it returnsRequestTimeoutinstead — RNTBD never sees the negative value.Why we do NOT throw or clamp negative values:
Enforcing bounds on negative
OpenTcpConnectionTimeoutvalues (either by throwingArgumentOutOfRangeExceptionor clamping to zero/default) would create a behavioral breaking change for existing customers. Customers who have had negative values configured for a long time are already working correctly because theTransportClient.Options.OpenTimeoutgetter returnsRequestTimeoutfor any value that is not >TimeSpan.Zero. Changing this on SDK upgrade — with no code change on the customer's part — would alter the effective timeout they experience. The warn-only approach documents that negative values are not recommended without disrupting existing configurations.XML docs updated on
CosmosClientOptions,ConnectionPolicy, andCosmosClientBuilder.Unit tests covering negative pass-through, zero allowed, sub-second normalization, whole-second preservation, fractional ceiling rounding, and property round-trip.
Out of scope:
IdleTcpConnectionTimeouthas the same truncation bug; follow-up issue planned.