[azcosmos] tighten default HTTP client timeouts and increase pool size#26856
Merged
Conversation
Override azcore's default HTTP client with Cosmos-specific defaults so that connect failures surface quickly and high-concurrency workloads against the gateway do not bottleneck on a small idle-connection pool: - Connect (dial) timeout: 30s -> 5s - Overall request timeout: unbounded -> 65s (http.Client.Timeout) - Idle connection pool: 100 total / 10 per host -> 1000 / 1000 - HTTP/2 health check: ReadIdleTimeout 2s, PingTimeout 1s The new client is only installed when the caller has not supplied a custom Transport via azcore.ClientOptions, and caller-supplied options are never mutated (shallow-cloned before assigning Transport). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds Cosmos-specific HTTP transport defaults (timeouts + connection pool sizing) instead of inheriting azcore’s general-purpose defaults, and wires these defaults into Cosmos client construction only when callers haven’t provided a custom transport.
Changes:
- Introduces a package-level default
http.Clientconfigured with Cosmos-specific dial/request timeouts, idle pool sizing, and HTTP/2 ping settings. - Routes
newClientandnewInternalPipelinethrough a helper that applies the default transport without mutating caller-suppliedClientOptions. - Adds unit tests and a CHANGELOG entry documenting the new defaults.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| sdk/data/azcosmos/cosmos_default_http_client.go | Adds Cosmos-specific default HTTP client/transport configuration. |
| sdk/data/azcosmos/cosmos_default_http_client_test.go | Adds tests for default transport wiring and non-mutation behavior. |
| sdk/data/azcosmos/cosmos_client.go | Applies default transport logic in client/pipeline creation via withDefaultTransport(). |
| sdk/data/azcosmos/CHANGELOG.md | Documents the default HTTP client behavior changes. |
* Revert idle connection pool sizes to the azcore defaults (100 / 10). The earlier bump to 1000 / 1000 conflated Java's total pool sizing with per-host sizing in Go's Transport and would have authorized N*1000 idle sockets per process for multi-host (GEM) deployments. * Call withDefaultTransport once per NewClient*/NewClientFromConnectionString entry point instead of cloning options twice (once in newClient and once in newInternalPipeline). Removes a redundant *ClientOptions clone and keeps option normalization in a single, obvious place. * Rename defaultRequestTimeout -> defaultHTTPRoundTripTimeout to reflect the actual layer it bounds (a single HTTP attempt at the http.Client level, not a per-Cosmos-operation timeout). * Expand the constant godoc to explain the reasoning for choosing http.Client.Timeout as the safety knob: it is a wall-clock cap on a single HTTP attempt; shorter caller context deadlines still win; longer caller context deadlines are truncated; the azcore retry policy sits above it and can still issue additional attempts; 65s was picked to slightly exceed the gateway's ~60s server-side budget so the server can return a structured error first. * Mirror the same reasoning in the CHANGELOG entry so consumers understand the new safety behavior before upgrading. * Run 'go mod tidy' to promote golang.org/x/net from indirect to direct since cosmos_default_http_client.go now imports golang.org/x/net/http2 directly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix staticcheck QF1008 lint failure: drop the redundant embedded ClientOptions selector in withDefaultTransport. * Add WASM/wasip1 build-tagged dialer helper (Copilot review): mirror azcore's defaultTransportDialContext so the Cosmos default transport uses dialer.DialContext on normal platforms and nil on (js && wasm) || wasip1, where the runtime supplies its own HTTP transport. * Narrow the withDefaultTransport godoc (Copilot review): clone is a shallow copy, so slice fields like PreferredRegions and embedded azcore PerCallPolicies still share backing arrays with the caller. Only the top-level Transport assignment is guaranteed not to leak into the caller's struct. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Raise MaxIdleConns to 1000 (from azcore's 100) and MaxIdleConnsPerHost to 100 (from azcore's 10). Cosmos clients typically talk to a small number of regional gateway/replica hosts under high concurrency, and the azcore defaults force needless TCP+TLS reconnects for any workload beyond ~10 concurrent in-flight requests per host. 100 per host is well below the per-process FD ceiling on typical multi-host deployments, while the total cap of 1000 prevents unbounded growth across many hosts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
kushagraThapar
left a comment
There was a problem hiding this comment.
Thanks for this — really clean implementation and the tuning aligns well with peer SDKs (cross-checked against .NET / Java / Python / Rust and the 65s / 5s / pool numbers all match up). Have one thing I'd love to discuss before merge, dropped inline on the new 65s constant. Curious to hear your thinking!
simorenoh
approved these changes
May 20, 2026
This was referenced May 26, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Cosmos accounts are reachable from the preferred region in well under a second and have a server-side request budget of ~60s, but the SDK has been inheriting azcore's general-purpose HTTP defaults (30s dial timeout, unbounded
http.Client.Timeout, 100/10 idle pool). That means a dead endpoint blocks every request for 30s, a runaway request can hang a caller indefinitely if the caller forgot to set a context deadline, and any workload with more than ~10 concurrent in-flight requests per host forces needless TCP+TLS reconnects. This PR installs Cosmos-tuned defaults to fail fast on dead endpoints, backstop runaway requests, and sustain reasonable concurrency.Defaults changed
http.Client.Timeout(per HTTP attempt)MaxIdleConns/MaxIdleConnsPerHostReadIdleTimeout/PingTimeoutAbout the 65s
http.Client.TimeoutThis is a wall-clock cap on a single HTTP attempt (dial + request write + header read + body read). It is intentionally set as
http.Client.Timeoutso it survives custom per-call policies and acts as a hard safety backstop. Trade-offs to know:context.WithTimeoutshorter than 65s still wins; longer deadlines are truncated by the HTTP client. This is the intended safety property: no Cosmos request should hang for an unbounded amount of time even if the caller forgot to set a deadline.Transportviaazcore.ClientOptions.About the idle pool sizing
MaxIdleConnsPerHost = 100is well below the per-process FD ceiling on typical multi-host deployments while comfortably handling normal Cosmos concurrency. The total cap of 1000 prevents unbounded pool growth across many hosts (e.g., GEM-routed replica endpoints) on multi-region accounts.Approach
cosmos_default_http_client.gobuilds a package-leveldefaultCosmosHTTPClientthat mirrors the azcore default transport (TLS 1.2+,ForceAttemptHTTP2, keep-alive 30s, etc.) but overrides the settings above.cosmos_default_http_client_dialer_{other,wasm}.gomirror azcore's WASM/wasip1 handling so the Cosmos defaults remain portable across build targets.withDefaultTransport()incosmos_client.gois invoked exactly once at the top ofNewClientandNewClientWithKey(NewClientFromConnectionStringdelegates toNewClientWithKey). The normalized options are reused for both the user-facing pipeline and the global-endpoint-manager pipeline. The helper installs the default only when the caller has not supplied aTransport, and shallow-clones the caller'sClientOptionsbefore assigning so the top-levelTransportfield never leaks back to the caller. Slice fields (PreferredRegions,PerCallPolicies, etc.) still share backing arrays with the caller, as is conventional for Go config structs.mock.Serverused in tests) flow through untouched.Validation
go build ./...,go vet ./..., andgo test -count=1 -short ./...insdk/data/azcosmospass locally and on CI (ubuntu/windows, go 1.25/1.26).Transportget the default without mutating the caller, and a caller-suppliedTransportis preserved.Checklist