Skip to content

Direct: Fixes OpenTcpConnectionTimeout silent truncation of sub-second and fractional values#5873

Open
yumnahussain wants to merge 7 commits into
Azure:mainfrom
yumnahussain:users/t-yhussain/fix-open-tcp-connection-timeout-subsecond
Open

Direct: Fixes OpenTcpConnectionTimeout silent truncation of sub-second and fractional values#5873
yumnahussain wants to merge 7 commits into
Azure:mainfrom
yumnahussain:users/t-yhussain/fix-open-tcp-connection-timeout-subsecond

Conversation

@yumnahussain
Copy link
Copy Markdown
Member

@yumnahussain yumnahussain commented May 15, 2026

Fixes #4500

Previously, DocumentClient cast connectionPolicy.OpenTcpConnectionTimeout.Value.TotalSeconds directly to int, which silently truncated sub-second values to 0 and fractional values >= 1s down to the next whole second.

This change formalizes the public contract for the (0, 1) second range and documents negative value behavior:

Sub-second and fractional values (the fix in this PR):

  • The customer-supplied TimeSpan is preserved unchanged on SDK-level properties.
  • At the int-typed Direct boundary (DocumentClient.cs):
    • Values in [TimeSpan.Zero, 1 second) explicitly normalize to 0 (use request timeout sentinel).
    • Values >= 1 second round UP via Math.Ceiling so the SDK never silently shortens the requested timeout (e.g. 2.3s -> 3s, never 2s).

Negative values (warning only, no enforcement):

  • Negative values emit a DefaultTrace.TraceWarning advising they are not recommended.
  • The value is preserved on the SDK property unchanged — no exception, no clamping.
  • In DocumentClient, negative values are truncated via (int)TotalSeconds (e.g. -5.7s -> -5), preserving the pre-PR cast behavior.
  • Downstream, StoreClientFactory wraps this back to TimeSpan.FromSeconds(-5) and sets TransportClient.Options.OpenTimeout.
  • The TransportClient.Options.OpenTimeout getter checks if (openTimeout > TimeSpan.Zero); since negative values fail that check, it returns RequestTimeout instead — RNTBD never sees the negative value.

Why we do NOT throw or clamp negative values:
Enforcing bounds on negative OpenTcpConnectionTimeout values (either by throwing ArgumentOutOfRangeException or clamping to zero/default) would create a behavioral breaking change for existing customers. Customers who have had negative values configured for a long time are already working correctly because the TransportClient.Options.OpenTimeout getter returns RequestTimeout for any value that is not > TimeSpan.Zero. Changing this on SDK upgrade — with no code change on the customer's part — would alter the effective timeout they experience. The warn-only approach documents that negative values are not recommended without disrupting existing configurations.

XML docs updated on CosmosClientOptions, ConnectionPolicy, and CosmosClientBuilder.
Unit tests covering negative pass-through, zero allowed, sub-second normalization, whole-second preservation, fractional ceiling rounding, and property round-trip.

Out of scope: IdleTcpConnectionTimeout has the same truncation bug; follow-up issue planned.

@yumnahussain
Copy link
Copy Markdown
Member Author

@microsoft-github-policy-service agree company="Microsoft"

@yumnahussain yumnahussain force-pushed the users/t-yhussain/fix-open-tcp-connection-timeout-subsecond branch from 2b171e6 to e391b37 Compare May 15, 2026 23:56
NaluTripician
NaluTripician previously approved these changes May 15, 2026
Copy link
Copy Markdown
Contributor

@NaluTripician NaluTripician left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yumnahussain yumnahussain force-pushed the users/t-yhussain/fix-open-tcp-connection-timeout-subsecond branch 2 times, most recently from 26654f8 to 1e331b4 Compare May 16, 2026 00:16
…d and fractional values

Fixes Azure#4500.

Previously, DocumentClient cast `connectionPolicy.OpenTcpConnectionTimeout.Value.TotalSeconds`
directly to int, which silently truncated:
* Sub-second values (e.g. `TimeSpan.FromMilliseconds(500)`) to 0, causing the Direct
  StoreClientFactory to fall back to the request timeout instead of the customer's
  intent.
* Fractional values >= 1s (e.g. `TimeSpan.FromSeconds(2.5)`) down to the next whole
  second, silently shortening the customer's requested timeout.
* Negative values to 0, again silently falling back.

This change formalizes the public contract on `CosmosClientOptions.OpenTcpConnectionTimeout`,
`ConnectionPolicy.OpenTcpConnectionTimeout`, and `CosmosClientBuilder.WithConnectionModeDirect`:
* Negative TimeSpan -> ArgumentOutOfRangeException at the public setter (fail fast).
* The customer-supplied TimeSpan is preserved unchanged on the SDK-level properties.
* At the int-typed Direct boundary (DocumentClient.cs):
  * Values in [TimeSpan.Zero, 1 second) normalize to 0 (documented "use request timeout"
    sentinel).
  * Values >= 1 second round UP via Math.Ceiling to the nearest whole second so the
    SDK never silently SHORTENS the customer's requested timeout (e.g. 2.5s -> 3s,
    never 2s).

XML docs on all three public surfaces explicitly call out this contract. Note: the
downstream Microsoft.Azure.Cosmos.Direct StoreClientFactory constructor only accepts
`int openTimeoutInSeconds` for this parameter, so whole-second resolution at the
transport boundary is an external constraint; Math.Ceiling is the closest faithful
interpretation of "leave the customer's value unchanged".

Tests (CosmosClientOptionsUnitTests):
* T1 OpenTcpConnectionTimeout_NegativeTimeSpan_Throws
* T2 OpenTcpConnectionTimeout_Zero_IsAllowed_AndRoundTripsThroughConnectionPolicy
* T3 OpenTcpConnectionTimeout_SubSecond_NormalizesToZero_InRntbdConfig
* T4 OpenTcpConnectionTimeout_WholeSeconds_PreservedInRntbdConfig
* T5 WithConnectionModeDirect_NegativeOpenTcpTimeout_Throws
* T6 OpenTcpConnectionTimeout_Fractional_RoundsUpInRntbdConfig (2.5s -> 3)
* T7 OpenTcpConnectionTimeout_JustOverOneSecond_RoundsUpInRntbdConfig (1.001s -> 2)
* T8 OpenTcpConnectionTimeout_Fractional_PreservedOnConnectionPolicyTimeSpan
  (2.5s round-trips unchanged through ConnectionPolicy)

Out of scope: the same (int)TotalSeconds truncation pattern applies to
IdleTcpConnectionTimeout at DocumentClient.cs:925. A follow-up issue mirroring Azure#4500
will be filed.

Changelog: entry added under Unreleased / Bugs Fixed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread changelog.md Outdated
Copy link
Copy Markdown
Member

@Pilchie Pilchie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! Happy to see a first PR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@NaluTripician NaluTripician enabled auto-merge (squash) May 18, 2026 18:46
@NaluTripician
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread Microsoft.Azure.Cosmos/src/ConnectionPolicy.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs Outdated
Comment thread changelog.md Outdated
{
if (value.HasValue && value.Value < TimeSpan.Zero)
{
throw new ArgumentOutOfRangeException(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't throwing here be a breaking change for customers ? The old behavior used to just use the request timeout.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to a warning log instead of throwing

… warning log, adds int overflow guard

- Negative values now reset to null (transport default 5s) with DefaultTrace.TraceWarning
- Removed inline ArgumentOutOfRangeException throws (non-breaking)
- Added int.MaxValue clamp to prevent overflow at transport boundary
- Simplified ConnectionPolicy.OpenTcpConnectionTimeout to get;set; pattern
- Updated tests to verify negative values reset to null
- Updated XML docs and changelog

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
auto-merge was automatically disabled May 18, 2026 22:47

Head branch was pushed to by a user without write access

@yumnahussain yumnahussain force-pushed the users/t-yhussain/fix-open-tcp-connection-timeout-subsecond branch from 136ed8f to 3cf2498 Compare May 18, 2026 22:47
@NaluTripician
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

NaluTripician
NaluTripician previously approved these changes May 19, 2026
…fRangeException

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NaluTripician
NaluTripician previously approved these changes May 19, 2026
Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking up the truncation half of this fix, @yumnahussain — the Math.Ceiling round-up direction is exactly right (always conservative-safe, never silently shortens a customer-requested timeout). Requesting changes only on the negative-TimeSpan throw, which I think is a behavioral breaking change in disguise — same concern we just raised on PR #5887.

Context: the open @tvaron3 thread

@tvaron3 already asked "Wouldn't throwing here be a breaking change for customers?" and your reply said "Changed to a warning log instead of throwing" — but the final commit on this branch is literally titled "Change negative OpenTcpConnectionTimeout from warning to ArgumentOutOfRangeException". The thread is still unresolved on the PR. I don't think you intended to leave that mismatch — could you please loop back to @tvaron3 either way?

Why we'd land on clamp + warn

Before this PR, a customer passing TimeSpan.FromSeconds(-1) got silent acceptance — the value cast to a negative int and flowed into the RNTBD transport. After this PR, those same customers will get ArgumentOutOfRangeException at CosmosClient construction on upgrade.

For SDK config that customers may have hardcoded years ago, throwing is the riskiest possible upgrade behavior — we'd turn a previously-working client construction into a startup crash with no opt-out. Even if the original value was semantically nonsense, the upgrade breaks customers who have been running successfully.

We landed on the same recommendation for PR #5887 (review here). Doing the same thing here keeps both sibling fixes consistent.

Proposal: clamp + warn instead of throw

Could you please switch the negative-value handling to clamp to TimeSpan.Zero and emit a warning trace, instead of throwing? Something like:

if (value.HasValue && value.Value < TimeSpan.Zero)
{
    DefaultTrace.TraceWarning(
        "OpenTcpConnectionTimeout value {0} is negative; clamping to TimeSpan.Zero (RequestTimeout will be used as the open-connection timeout).",
        value.Value);
    value = TimeSpan.Zero;
}

This way:

  • Existing apps that today pass a negative TimeSpan keep starting up on upgrade (no live-site impact from a customer SDK bump).
  • Customers still get pushed toward a sane configuration via the warning trace.
  • We avoid having to move this entry under ### Breaking Changes.
  • The clamped value TimeSpan.Zero flows through the existing sub-second-to-0 normalization path in DocumentClient.cs and lands on the documented "use RequestTimeout" sentinel — so the behavior gets more correct, not less.

Asks (in priority order)

  1. (blocking) Switch the negative-throw to clamp + warn in CosmosClientOptions.cs (and inline the check at the setter rather than ValidateDirectTCPSettings — see inline comment).
  2. (blocking) Update the negative-throw tests to assert the clamp + warning trace instead.
  3. (blocking) Reply to @tvaron3's outstanding thread.
  4. (nice-to-have) Fix the duplicate [5870] changelog entry — see inline. PR #5870 already shipped in 3.60.0 (line 58 on main).

If you feel strongly about keeping the throw, happy to discuss — but in that case we'd need to move the changelog entry under ### Breaking Changes and add explicit migration guidance. Thanks for the careful work on the truncation half — the round-up direction is genuinely the right call.

Comment on lines +1348 to +1354
if (this.OpenTcpConnectionTimeout.HasValue && this.OpenTcpConnectionTimeout.Value < TimeSpan.Zero)
{
throw new ArgumentOutOfRangeException(
nameof(this.OpenTcpConnectionTimeout),
this.OpenTcpConnectionTimeout.Value,
$"{nameof(this.OpenTcpConnectionTimeout)} must not be negative.");
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please switch this from throwing to clamp + warn? Something like:

if (this.OpenTcpConnectionTimeout.HasValue && this.OpenTcpConnectionTimeout.Value < TimeSpan.Zero)
{
    DefaultTrace.TraceWarning(
        "OpenTcpConnectionTimeout value {0} is negative; clamping to TimeSpan.Zero (RequestTimeout will be used as the open-connection timeout).",
        this.OpenTcpConnectionTimeout.Value);
    this.OpenTcpConnectionTimeout = TimeSpan.Zero;
}

Two structural notes while we're here:

  1. Wrong method for this check. ValidateDirectTCPSettings's actual purpose is the Direct-mode coupling invariant ("setting requires ConnectionMode = Direct"). The range check belongs at the property setter, validate-then-assign style. Today, putting it here means:
    • It fires redundantly on every Direct-mode TCP setter call (idle timeout, open timeout, max requests, port reuse, connection mode), not just on OpenTcpConnectionTimeout.
    • If a caller calls opts.ConnectionMode = Direct after a bad OpenTcpConnectionTimeout, the error message will reference OpenTcpConnectionTimeout even though they were touching ConnectionMode. Misleading ParamName.
  2. Wrong order with the existing checks. Right now, new CosmosClientOptions { ConnectionMode = Gateway, OpenTcpConnectionTimeout = TimeSpan.FromSeconds(-1) } throws ArgumentException ("requires Direct") instead of the documented ArgumentOutOfRangeException. The xmldoc + changelog promise AOORE categorically for negatives. Moving to the setter (per Thanks! #1) also fixes this.

Clean exemplar in the same file: GatewayModeMaxConnectionLimit setter at line 293-310 — validate-then-assign at the setter, with the bound named via a constant.

Comment thread changelog.md Outdated

#### Bugs Fixed

- [#5873](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5873) Direct: Fixes silent truncation of `OpenTcpConnectionTimeout`. Sub-second values normalize to `TimeSpan.Zero` (request-timeout fallback). Fractional values greater than or equal to 1 second round up to the nearest whole second at the transport boundary (for example, 2.3 seconds becomes 3 seconds). Negative values now throw an `ArgumentOutOfRangeException`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small changelog things while we're in here:

  1. The [5870] entry on the next line is a duplicate — PR CrossRegionHedgingAvailabilityStrategy: Fixes StackOverflow in CrossRegionHedgingAvailabilityStrategy Observed in .NET Framework 4.7.2 #5870 already shipped in 3.60.0 (line 58 of main:changelog.md). Could you please delete that line? Looks like a stale merge artifact.
  2. Once we switch the implementation to clamp + warn (see inline on CosmosClientOptions.cs), could you please rephrase this entry to match? Suggested wording:

5873 Direct: Fixes silent truncation of OpenTcpConnectionTimeout. Sub-second values fall back to the configured RequestTimeout at the transport boundary (the property itself preserves the supplied TimeSpan). Fractional values greater than or equal to 1 second round up to the nearest whole second (for example, 2.3 seconds becomes 3 seconds — rounding up can increase the effective per-attempt connect-phase timeout by up to ~1 second for non-whole-second values). Negative values are clamped to TimeSpan.Zero with a warning trace.

Also note the format inconsistency with neighbors: this entry uses [#5873] (leading #) while every neighbor uses [5783], [5870] (no #). Could you please drop the # to match?

If you decide to keep the throw, this entry would need to move under ### Breaking Changes instead, since existing customer apps passing negative TimeSpan values will start failing at CosmosClient construction on upgrade.

@yumnahussain
Copy link
Copy Markdown
Member Author

Why negative values are preserved (not thrown/clamped):

The full chain for negative values:

  1. CosmosClientOptions.OpenTcpConnectionTimeout preserves the TimeSpan as-is.
  2. DocumentClient converts to (int)TotalSeconds (e.g. -5s → -5).
  3. StoreClientFactory wraps it back: TimeSpan.FromSeconds(-5), sets TransportClient.Options.OpenTimeout.
  4. TransportClient.Options.OpenTimeout getter checks if (openTimeout > TimeSpan.Zero) — which fails for values ≤ 0 — so it returns RequestTimeout instead.

Net effect: RNTBD uses RequestTimeout — same as historical behavior.

Enforcing bounds (throwing or clamping) would be a behavioral breaking change for existing customers who have configured negative values that have been working via this fallback path for a long time.

@yumnahussain
Copy link
Copy Markdown
Member Author

Why negative values are preserved (not thrown/clamped):

The full chain for negative values:

  1. CosmosClientOptions.OpenTcpConnectionTimeout preserves the TimeSpan as-is.
  2. DocumentClient truncates to (int)TotalSeconds (e.g. -5s -> -5).
  3. StoreClientFactory wraps it back: TimeSpan.FromSeconds(-5), sets TransportClient.Options.OpenTimeout.
  4. TransportClient.Options.OpenTimeout getter checks if (openTimeout > TimeSpan.Zero) — which fails for values <= 0 — so it returns RequestTimeout instead.

Net effect: RNTBD uses RequestTimeout — same as historical behavior.

Clamping on the SDK level would not have changed the underlying behavior by the time the value reached RNTBD (it would still end up as RequestTimeout), but it would cause a behavioral change on the customer end — their stored property value would be different from what they set, which could break assumptions in their code.

Enforcing bounds (throwing or clamping) would be a behavioral breaking change for existing customers who have configured negative values that have been working via this fallback path for a long time.

…ng-only for backward compatibility

- Negative values emit DefaultTrace.TraceWarning but are preserved unchanged
- Removes clamping to TimeSpan.Zero and ArgumentOutOfRangeException for negatives
- Sub-second values in [0, 1s) explicitly normalize to 0 in DocumentClient
- Fractional values >= 1s round up via Math.Ceiling (unchanged)
- Updated XML docs, tests, and changelog to reflect warn-only behavior

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yumnahussain yumnahussain force-pushed the users/t-yhussain/fix-open-tcp-connection-timeout-subsecond branch from 41f0dae to 660bd2e Compare May 22, 2026 22:16
Comment thread changelog.md
#### Bugs Fixed

- [5873](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5873) Direct: Fixes silent truncation of `OpenTcpConnectionTimeout`. Sub-second values in [0, 1s) are now explicitly normalized to 0 and fractional values ≥ 1 second are rounded up to the nearest whole second (for example, 2.3s becomes 3s). Negative values emit a warning trace but are left unchanged for backward compatibility.
- [5870](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5870) CrossRegionHedgingAvailabilityStrategy: Fixes StackOverflow in CrossRegionHedgingAvailabilityStrategy Observed in .NET Framework 4.7.2.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- [5870](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5870) CrossRegionHedgingAvailabilityStrategy: Fixes StackOverflow in CrossRegionHedgingAvailabilityStrategy Observed in .NET Framework 4.7.2.

Copy link
Copy Markdown
Member

@tvaron3 tvaron3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM besides small changelog update

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TcpOpenConnectionTimeout TimeSpan contract not honoring values less than a second

6 participants