[GATEWAY V2]: Bifurcate connect / connection-acquire timeout between Gateway V1 and Gateway V2 endpoints. #48174
Pull request overview
This PR improves Azure Cosmos DB Gateway mode behavior when “thin client” (Gateway V2, port 10250) is enabled by applying a shorter per-request TCP connect timeout for data-plane requests, while keeping the existing longer timeout for Gateway V1 metadata requests (port 443). It also expands diagnostics to surface HTTP/2 channel identity and request timeout details, and adds targeted tests (including manual network-manipulation suites) to validate the behavior.
Changes:
- Apply a per-request `CONNECT_TIMEOUT_MILLIS` in `ReactorNettyClient` based on whether the request targets the thin client proxy.
- Introduce thin-client-specific timeout policy wiring and new config plumbing (`COSMOS.THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS`) plus diagnostics output updates.
- Add/extend unit and fault-injection/manual tests and accompanying docs to validate connect-timeout bifurcation and HTTP/2 connection lifecycle behavior.
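The per-request selection in the first bullet can be pictured with a minimal, dependency-free sketch. The class `ConnectTimeoutSketch`, its nested `Request` type, and the constant names are hypothetical stand-ins for illustration, not the SDK's `ReactorNettyClient` or `HttpRequest`. The 45s and 5s values come from the PR description.

```java
import java.time.Duration;

// Hypothetical illustration only; these are not the SDK's classes.
final class ConnectTimeoutSketch {
    // Values quoted in the PR description: 45s for Gateway V1, 5s default for Gateway V2.
    static final Duration GATEWAY_V1_CONNECT_TIMEOUT = Duration.ofSeconds(45);
    static final Duration GATEWAY_V2_CONNECT_TIMEOUT = Duration.ofSeconds(5);

    // Minimal request model carrying only the flag that drives the decision.
    static final class Request {
        private final boolean thinClientRequest;

        Request(boolean thinClientRequest) {
            this.thinClientRequest = thinClientRequest;
        }

        boolean isThinClientRequest() {
            return thinClientRequest;
        }
    }

    // Picks the connect timeout (in milliseconds) for a single request.
    static int resolveConnectTimeoutMs(Request request) {
        Duration timeout = request.isThinClientRequest()
            ? GATEWAY_V2_CONNECT_TIMEOUT
            : GATEWAY_V1_CONNECT_TIMEOUT;
        return Math.toIntExact(timeout.toMillis());
    }

    public static void main(String[] args) {
        System.out.println(resolveConnectTimeoutMs(new Request(false))); // 45000
        System.out.println(resolveConnectTimeoutMs(new Request(true)));  // 5000
    }
}
```

In the real client, the resolved value would then be passed to the transport as the connect-timeout option for that one request. The sketch stops at the selection step.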
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ResponseTimeoutAndDelays.java | Adds Duration-based delay representation alongside seconds. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyRequestRecord.java | Captures HTTP/2/channel identifiers for richer diagnostics. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyClient.java | Implements per-request connect timeout selection; captures channel IDs via connection observer. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/HttpTimeoutPolicyForGatewayV2.java | Adds Gateway V2 timeout policy class for thin client document operations. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/HttpTimeoutPolicy.java | Routes eligible thin-client document operations to Gateway V2 timeout policies. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/HttpRequest.java | Adds isThinClientRequest flag + fluent setter for transport customization. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/HttpClientConfig.java | Emits gwV2Cto in diagnostics when thin client is enabled. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/WebExceptionRetryPolicy.java | Switches retry backoff handling to use Duration from timeout policy. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ThinClientStoreModel.java | Marks thin-client requests and hardens ByteBuf handling for released buffers. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxGatewayStoreModel.java | Ensures request record is available on success/error; sets request URI on cancellation for diagnostics. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentServiceRequest.java | Adds useThinClientMode flag and ensures it is preserved during cloning. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java | Sets useThinClientMode when routing to thin client store model. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/DocumentServiceRequestContext.java | Stores per-attempt ReactorNettyRequestRecord for diagnostics enrichment. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java | Introduces thin client connect-timeout config (sysprop/env) with default value. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ClientSideRequestStatistics.java | Adds HTTP response timeout, channel IDs, HTTP/2 flag, and e2e policy config to gateway stats serialization. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the new Gateway V2 connect-timeout behavior and timeout policies. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/TestSuiteBase.java | Minor import/format cleanup. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/WebExceptionRetryPolicyTest.java | Extends test coverage for thin-client timeout policies and write behavior. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/ConfigsTests.java | Adds unit tests for thin client timeout config parsing and request flag defaulting. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/faultinjection/Http2ConnectionLifecycleTests.java | Adds manual tc netem tests to validate HTTP/2 parent connection survival across real delays/timeouts. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/faultinjection/Http2ConnectTimeoutBifurcationTests.java | Adds manual iptables/tc tests to validate connect-timeout bifurcation by port/path. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/faultinjection/FaultInjectionServerErrorRuleOnGatewayV2Tests.java | Updates thin-client FI tests to account for new Gateway V2 timeout behavior. |
| sdk/cosmos/azure-cosmos-tests/NETWORK_DELAY_TESTING_README.md | Documents how to run the new manual network-delay lifecycle tests. |
| sdk/cosmos/azure-cosmos-tests/CONNECT_TIMEOUT_TESTING_README.md | Documents how to run the new manual connect-timeout bifurcation tests. |
Comments suppressed due to low confidence (1)
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/faultinjection/Http2ConnectTimeoutBifurcationTests.java:300
- Method name says “1sFiresOnDroppedSyn”, but the test description and assertions are written for a 5s default connect timeout. Rename the test to match the actual expected behavior (or adjust the timeout setup) so the name stays accurate over time.
@Test(groups = {TEST_GROUP}, timeOut = TEST_TIMEOUT)
public void connectTimeout_GwV2_DataPlane_1sFiresOnDroppedSyn() throws Exception {
// Close and recreate client to ensure no pooled connections exist —
// we need to force a NEW TCP connection which will hit the iptables DROP.
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
Problem
When thin client (Gateway V2) is enabled, both metadata requests (port 443, GW V1) and data-plane requests (port 10250, GW V2 HTTP/2) share the same `CONNECT_TIMEOUT_MILLIS` of 45s. If the thin client proxy on port 10250 is unreachable, the SDK waits 45s per connect attempt before failing, which is far too long for a data-plane path that should fail fast and trigger regional failover.
Solution
Bifurcate `CONNECT_TIMEOUT_MILLIS` at the Reactor Netty level based on request type: Gateway V1 metadata requests (port 443) keep the existing 45s timeout, while Gateway V2 data-plane requests (port 10250) use a shorter timeout that defaults to 5s.
The timeout is applied per-request via Reactor Netty's immutable `HttpClient.option()`, which returns a new config snapshot without mutating the shared client.
Diagnostic proof: `connectTimeout_Bifurcation_DelayBased`
Both ports receive the same 7s SYN-only delay via Linux `tc netem` + `iptables mangle`. The only difference is `CONNECT_TIMEOUT_MILLIS`.
Full CosmosDiagnostics (data plane failure):
Reading the diagnostic:
- `CONTAINER_LOOK_UP: 7512ms`: metadata on port 443 took 7.5s (absorbed the 7s SYN delay, succeeded)
- `gatewayStatisticsList[0]`: `connectionAcquired: 5028ms`, `"connection timed out after 5000 ms: ...eastus2...:10250"`: the 5s timeout fired
- `gatewayStatisticsList[1]`: `connectionAcquired: 5004ms`, same `connection timed out after 5000 ms`: second attempt, same 5s
- `gatewayStatisticsList[2]`: `connectionAcquired: 5005ms`, same: third attempt on eastus2
- `gatewayStatisticsList[3]`: `connectionAcquired: 5228ms`, `"...centralus...:10250"`: failover to Central US, still a 5s timeout
- `gatewayStatisticsList[4]`: `408/20008`: the e2e timeout (30s budget) cancelled the fifth attempt
- `connCfg.gw`: `cto:PT45S, gwV2Cto:PT5S`: both timeouts visible
The bifurcation proof:
- `cto:PT45S` (metadata, port 443): `buildAsyncClient()` succeeded
- `gwV2Cto:PT5S` (data plane, port 10250): `connection timed out after 5000 ms`
Same network condition. Same delay. Different timeouts. Different outcomes.
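As a worked check of the reasoning above, the two values in `cto:PT45S, gwV2Cto:PT5S` are ISO-8601 durations, so they can be parsed with `java.time.Duration` and compared against the 7s SYN delay. This is a minimal sketch: the class and method names are hypothetical, and the parser only handles the two-field fragment quoted in this description, not the full diagnostics string.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the SDK.
final class DiagnosticTimeoutSketch {
    // Parses a fragment such as "cto:PT45S, gwV2Cto:PT5S" into name -> Duration.
    static Map<String, Duration> parseTimeouts(String fragment) {
        Map<String, Duration> timeouts = new LinkedHashMap<>();
        for (String field : fragment.split(",")) {
            String[] nameAndValue = field.trim().split(":", 2);
            if (nameAndValue.length == 2) {
                timeouts.put(nameAndValue[0].trim(), Duration.parse(nameAndValue[1].trim()));
            }
        }
        return timeouts;
    }

    // Simplified model: a connect attempt completes only if the SYN delay
    // is shorter than the connect timeout.
    static boolean connectSucceeds(Duration synDelay, Duration connectTimeout) {
        return synDelay.compareTo(connectTimeout) < 0;
    }

    public static void main(String[] args) {
        Map<String, Duration> timeouts = parseTimeouts("cto:PT45S, gwV2Cto:PT5S");
        Duration synDelay = Duration.ofSeconds(7); // delay injected on both ports in the test

        System.out.println(connectSucceeds(synDelay, timeouts.get("cto")));     // true
        System.out.println(connectSucceeds(synDelay, timeouts.get("gwV2Cto"))); // false
    }
}
```

The 7s delay fits inside the 45s budget but exceeds the 5s one, which matches the observed outcomes on ports 443 and 10250.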
CosmosDiagnostics — what changes for customers
The `clientCfgs.connCfg.gw` diagnostic string gains the new `gwV2Cto` field:
Before:
After:
When data-plane connect times out, each `gatewayStatisticsList` entry shows:
Production code changes
| File | Change |
|---|---|
| Configs.java | `COSMOS.THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS` (env: `COSMOS_THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS`), default 5s. New method `getThinClientConnectionTimeoutInSeconds()`. |
| HttpRequest.java | `isThinClientRequest` flag + fluent `withThinClientRequest(boolean)` setter. |
| ThinClientStoreModel.java | `.withThinClientRequest(true)` on thin client path requests. |
| ReactorNettyClient.java | `resolveConnectTimeoutMs(HttpRequest)`: applies per-request via `.option(ChannelOption.CONNECT_TIMEOUT_MILLIS, connectTimeoutMs)`. |
| HttpClientConfig.java | `toDiagnosticsString()` emits `gwV2Cto` alongside `cto`. |
Testing
| Test | Technique |
|---|---|
| `connectTimeout_GwV2_DataPlane_1sFiresOnDroppedSyn` | `iptables -j DROP` SYN on :10250 |
| `connectTimeout_GwV1_Metadata_UnaffectedByGwV2Drop` | `iptables -j DROP` SYN on :10250 only |
| `connectTimeout_GwV2_PreciseTiming` | `iptables -j DROP` SYN, 12s e2e |
| `connectTimeout_Bifurcation_DelayBased_...` | `tc netem` SYN-only 7s delay on both ports |
| `ConfigsTests` | |
Test infra: Added `manual-thinclient-network-delay` to `@BeforeSuite`/`@AfterSuite` in TestSuiteBase.java.
Configuration
| System property | Environment variable | Default |
|---|---|---|
| `COSMOS.THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS` | `COSMOS_THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS` | `5` |
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines
closes #48092
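For reference, the lookup described under Configuration can be sketched as follows. The property and environment-variable names and the default of 5 come from this description. The precedence (system property first, then environment variable) and the handling of blank values are assumptions, and `ThinClientTimeoutConfigSketch` is a hypothetical class, not the SDK's `Configs`.

```java
// Hypothetical illustration; the SDK's Configs class may resolve values differently.
final class ThinClientTimeoutConfigSketch {
    static final String SYSTEM_PROPERTY = "COSMOS.THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS";
    static final String ENV_VARIABLE = "COSMOS_THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS";
    static final int DEFAULT_SECONDS = 5;

    // Pure function so the precedence can be exercised without touching the real environment.
    // Assumption: the system property wins over the environment variable, and blank values
    // are treated as unset. Non-numeric values propagate NumberFormatException in this sketch.
    static int resolveSeconds(String systemPropertyValue, String envValue) {
        if (isSet(systemPropertyValue)) {
            return Integer.parseInt(systemPropertyValue.trim());
        }
        if (isSet(envValue)) {
            return Integer.parseInt(envValue.trim());
        }
        return DEFAULT_SECONDS;
    }

    private static boolean isSet(String value) {
        return value != null && !value.trim().isEmpty();
    }

    public static void main(String[] args) {
        int seconds = resolveSeconds(System.getProperty(SYSTEM_PROPERTY), System.getenv(ENV_VARIABLE));
        System.out.println("thin client connect timeout (s): " + seconds);
    }
}
```

Running it with `-DCOSMOS.THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS=3` on the `java` command line would print 3 under these assumptions. With nothing set, it prints the default of 5.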