[azcosmos] tighten default HTTP client timeouts and increase pool size #26856

Merged: tvaron3 merged 7 commits into Azure:main from tvaron3:tvaron3/cosmos-default-timeouts on May 20, 2026
@tvaron3 tvaron3 commented May 20, 2026

Cosmos accounts are reachable from the preferred region in well under a second and have a server-side request budget of ~60s, but the SDK has been inheriting azcore's general-purpose HTTP defaults (30s dial timeout, unbounded http.Client.Timeout, 100/10 idle pool). That means a dead endpoint blocks every request for 30s, a runaway request can hang a caller indefinitely if the caller forgot to set a context deadline, and any workload with more than ~10 concurrent in-flight requests per host forces needless TCP+TLS reconnects. This PR installs Cosmos-tuned defaults to fail fast on dead endpoints, backstop runaway requests, and sustain reasonable concurrency.

Defaults changed

| Setting | Old (azcore default) | New |
| --- | --- | --- |
| Connect (dial) timeout | 30s | 5s |
| http.Client.Timeout (per HTTP attempt) | unbounded | 65s |
| MaxIdleConns / MaxIdleConnsPerHost | 100 / 10 | 1000 / 100 |
| HTTP/2 ReadIdleTimeout / PingTimeout | 10s / 5s | 2s / 1s |

About the 65s http.Client.Timeout

This is a wall-clock cap on a single HTTP attempt (dial + request write + header read + body read). It is intentionally set as http.Client.Timeout so it survives custom per-call policies and acts as a hard safety backstop. Trade-offs to know:

  • A caller-supplied context.WithTimeout shorter than 65s still wins; longer deadlines are truncated by the HTTP client. This is the intended safety property: no Cosmos request should hang for an unbounded amount of time even if the caller forgot to set a deadline.
  • The azcore retry policy sits above the transport, so the 65s cap applies per attempt; the policy can still issue additional retries when one attempt exceeds the cap.
  • 65s was chosen to slightly exceed the Cosmos gateway's server-side request budget (~60s) so the server has a chance to return a structured error before the client gives up locally.
  • Callers that legitimately need to drain very large query/change-feed pages in a single attempt should supply their own Transport via azcore.ClientOptions.

About the idle pool sizing

MaxIdleConnsPerHost = 100 keeps idle sockets well below the per-process file-descriptor (FD) ceiling on typical multi-host deployments while comfortably handling normal Cosmos concurrency. The total cap of MaxIdleConns = 1000 bounds pool growth across many hosts (e.g., replica endpoints routed by the global endpoint manager, GEM) on multi-region accounts.

Approach

  • New file cosmos_default_http_client.go builds a package-level defaultCosmosHTTPClient that mirrors the azcore default transport (TLS 1.2+, ForceAttemptHTTP2, keep-alive 30s, etc.) but overrides the settings above.
  • New build-tagged files cosmos_default_http_client_dialer_{other,wasm}.go mirror azcore's WASM/wasip1 handling so the Cosmos defaults remain portable across build targets.
  • New helper withDefaultTransport() in cosmos_client.go is invoked exactly once at the top of NewClient and NewClientWithKey (NewClientFromConnectionString delegates to NewClientWithKey). The normalized options are reused for both the user-facing pipeline and the global-endpoint-manager pipeline. The helper installs the default only when the caller has not supplied a Transport, and shallow-clones the caller's ClientOptions before assigning so the top-level Transport field never leaks back to the caller. Slice fields (PreferredRegions, PerCallPolicies, etc.) still share backing arrays with the caller, as is conventional for Go config structs.
  • Caller-supplied transports (including the mock.Server used in tests) flow through untouched.

Validation

  • go build ./..., go vet ./..., and go test -count=1 -short ./... in sdk/data/azcosmos pass locally and on CI (ubuntu/windows, go 1.25/1.26).
  • New unit tests cover: default timeout constants are wired into the client, nil options get the default, non-nil options without a Transport get the default without mutating the caller, and a caller-supplied Transport is preserved.

Checklist

  • The purpose of this PR is explained in this or a referenced issue.
  • The PR does not update generated files.
  • Tests are included and/or updated for code changes.
  • Updates to module CHANGELOG.md are included.
  • MIT license headers are included in each file.

tvaron3 and others added 2 commits May 20, 2026 01:24
Override azcore's default HTTP client with Cosmos-specific defaults so
that connect failures surface quickly and high-concurrency workloads
against the gateway do not bottleneck on a small idle-connection pool:

- Connect (dial) timeout: 30s -> 5s
- Overall request timeout: unbounded -> 65s (http.Client.Timeout)
- Idle connection pool: 100 total / 10 per host -> 1000 / 1000
- HTTP/2 health check: ReadIdleTimeout 2s, PingTimeout 1s

The new client is only installed when the caller has not supplied a
custom Transport via azcore.ClientOptions, and caller-supplied options
are never mutated (shallow-cloned before assigning Transport).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 20, 2026 08:26
@tvaron3 tvaron3 requested a review from a team as a code owner May 20, 2026 08:26
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

Adds Cosmos-specific HTTP transport defaults (timeouts + connection pool sizing) instead of inheriting azcore’s general-purpose defaults, and wires these defaults into Cosmos client construction only when callers haven’t provided a custom transport.

Changes:

  • Introduces a package-level default http.Client configured with Cosmos-specific dial/request timeouts, idle pool sizing, and HTTP/2 ping settings.
  • Routes newClient and newInternalPipeline through a helper that applies the default transport without mutating caller-supplied ClientOptions.
  • Adds unit tests and a CHANGELOG entry documenting the new defaults.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| sdk/data/azcosmos/cosmos_default_http_client.go | Adds Cosmos-specific default HTTP client/transport configuration. |
| sdk/data/azcosmos/cosmos_default_http_client_test.go | Adds tests for default transport wiring and non-mutation behavior. |
| sdk/data/azcosmos/cosmos_client.go | Applies default transport logic in client/pipeline creation via withDefaultTransport(). |
| sdk/data/azcosmos/CHANGELOG.md | Documents the default HTTP client behavior changes. |

Comment thread sdk/data/azcosmos/cosmos_default_http_client.go
Comment thread sdk/data/azcosmos/cosmos_client.go
tvaron3 and others added 4 commits May 20, 2026 01:39
* Revert idle connection pool sizes to the azcore defaults (100 / 10).
  The earlier bump to 1000 / 1000 conflated Java's total pool sizing
  with per-host sizing in Go's Transport and would have authorized
  N*1000 idle sockets per process for multi-host (GEM) deployments.

* Call withDefaultTransport once per NewClient*/NewClientFromConnectionString
  entry point instead of cloning options twice (once in newClient and
  once in newInternalPipeline). Removes a redundant *ClientOptions clone
  and keeps option normalization in a single, obvious place.

* Rename defaultRequestTimeout -> defaultHTTPRoundTripTimeout to reflect
  the actual layer it bounds (a single HTTP attempt at the http.Client
  level, not a per-Cosmos-operation timeout).

* Expand the constant godoc to explain the reasoning for choosing
  http.Client.Timeout as the safety knob: it is a wall-clock cap on a
  single HTTP attempt; shorter caller context deadlines still win;
  longer caller context deadlines are truncated; the azcore retry
  policy sits above it and can still issue additional attempts; 65s
  was picked to slightly exceed the gateway's ~60s server-side budget
  so the server can return a structured error first.

* Mirror the same reasoning in the CHANGELOG entry so consumers
  understand the new safety behavior before upgrading.

* Run 'go mod tidy' to promote golang.org/x/net from indirect to direct
  since cosmos_default_http_client.go now imports golang.org/x/net/http2
  directly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix staticcheck QF1008 lint failure: drop the redundant embedded
  ClientOptions selector in withDefaultTransport.

* Add WASM/wasip1 build-tagged dialer helper (Copilot review): mirror
  azcore's defaultTransportDialContext so the Cosmos default transport
  uses dialer.DialContext on normal platforms and nil on (js && wasm)
  || wasip1, where the runtime supplies its own HTTP transport.

* Narrow the withDefaultTransport godoc (Copilot review): clone is a
  shallow copy, so slice fields like PreferredRegions and embedded
  azcore PerCallPolicies still share backing arrays with the caller.
  Only the top-level Transport assignment is guaranteed not to leak
  into the caller's struct.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Raise MaxIdleConns to 1000 (from azcore's 100) and MaxIdleConnsPerHost
to 100 (from azcore's 10). Cosmos clients typically talk to a small
number of regional gateway/replica hosts under high concurrency, and
the azcore defaults force needless TCP+TLS reconnects for any workload
beyond ~10 concurrent in-flight requests per host. 100 per host is well
below the per-process FD ceiling on typical multi-host deployments,
while the total cap of 1000 prevents unbounded growth across many
hosts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@kushagraThapar kushagraThapar left a comment


Thanks for this — really clean implementation and the tuning aligns well with peer SDKs (cross-checked against .NET / Java / Python / Rust and the 65s / 5s / pool numbers all match up). Have one thing I'd love to discuss before merge, dropped inline on the new 65s constant. Curious to hear your thinking!

Comment thread sdk/data/azcosmos/cosmos_default_http_client.go
@tvaron3 tvaron3 merged commit 76d4ae5 into Azure:main May 20, 2026
20 checks passed

4 participants