
Fix transient gRPC "Connection refused" errors in integration tests by adding HTTP reachability check to DaprdContainer startup #1782

Merged
WhitWaldo merged 4 commits into master from copilot/fix-integration-test-subchannel-connection on Apr 14, 2026



Copilot AI commented Apr 13, 2026

Intermittent Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Error connecting to subchannel.") failures occurred because DaprdContainer.StartAsync returned as soon as the TCP ports accepted a connection. Dapr continues initializing after that point (connecting to Placement/Scheduler, loading components, starting the workflow engine), which leaves a brief window in which the gRPC port accepts TCP connections but its request handlers are not yet installed.
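For illustration, a TCP-only probe of the kind described above might look like the following. This is a minimal sketch, not the code in this PR: the class name `TcpProbeSketch`, the parameter list, and the 200 ms poll interval are assumptions, and it relies on the `TcpClient.ConnectAsync` overload that takes a `CancellationToken` (.NET 5 or later). It returns as soon as the socket is accepted, which is exactly why it cannot tell whether the gRPC handlers are installed.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a TCP-only readiness probe (names are illustrative).
public static class TcpProbeSketch
{
    // Polls until a TCP connection to host:port is accepted or the timeout elapses.
    public static async Task WaitForTcpPortAsync(
        string host, int port, TimeSpan timeout, CancellationToken cancellationToken = default)
    {
        // Combine the caller's token with a timeout so one token covers both.
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        timeoutCts.CancelAfter(timeout);

        try
        {
            while (true)
            {
                try
                {
                    using var client = new TcpClient();
                    await client.ConnectAsync(host, port, timeoutCts.Token);
                    // The socket was accepted. This says nothing about whether
                    // the server behind it is ready to handle requests.
                    return;
                }
                catch (SocketException)
                {
                    // Nothing is listening yet (e.g. connection refused); retry.
                }

                await Task.Delay(TimeSpan.FromMilliseconds(200), timeoutCts.Token);
            }
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // The timeout fired rather than the caller cancelling.
            throw new TimeoutException($"Timed out waiting for TCP port {host}:{port}.");
        }
    }
}
```

A probe like this is still useful as a first gate, but on its own it leaves the window described above open.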

Description

  • New ContainerReadinessProbe (internal static) — consolidates the existing TCP port probe (moved from DaprdContainer) and adds two new HTTP polling methods:

    • WaitForHttpReachableAsync — returns as soon as any HTTP response is received (including 5xx), only retrying when the connection itself is refused. Used by DaprdContainer as the final readiness gate.
    • WaitForHttpHealthAsync — stricter check that retries until a 2xx response is received. Retained for use cases that require full application health confirmation.
  • DaprdContainer.StartAsync — replaced inline WaitForTcpPortAsync with ContainerReadinessProbe calls and appended WaitForHttpReachableAsync as the final readiness gate; also removed a dead catch block that was swallowing-and-rethrowing with no effect.

  • AssemblyInfo.cs — adds InternalsVisibleTo("Dapr.Testcontainers.Test") so the internal probe class is directly testable.

  • ContainerReadinessProbeTests — 20 unit tests covering TCP, HTTP reachability, and HTTP health probe behaviors: success on any response code, retry-until-reachable, connection-refused retry, timeout expiry, and cancellation across all target frameworks.

The key distinction between the two HTTP methods: Dapr returns HTTP 500 from /v1.0/healthz while components are loading or while the connected app hasn't started yet (e.g. in resources-first startup mode where DaprdContainer.StartAsync completes before the web app is launched). Using WaitForHttpReachableAsync ensures the gRPC handlers are installed without requiring full application health, avoiding false timeouts.
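The difference between the two HTTP methods can be sketched roughly as below. This is an illustrative approximation and not the implementation merged in this PR: the method names come from the description, the parameter order follows the signature visible in the stack trace later in this thread, and everything else (the `HttpProbeSketch` class, the shared `PollAsync` helper, the poll interval, the exception messages) is assumed.

```csharp
#nullable enable
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch; only the two public method names come from the PR.
public static class HttpProbeSketch
{
    private static readonly TimeSpan PollInterval = TimeSpan.FromMilliseconds(200);

    // Succeeds on ANY HTTP response (including 5xx); retries only on connection failure.
    public static Task WaitForHttpReachableAsync(
        string url, TimeSpan timeout, CancellationToken cancellationToken, HttpClient httpClient)
        => PollAsync(url, timeout, false, cancellationToken, httpClient);

    // Stricter: keeps retrying until a 2xx response is received.
    public static Task WaitForHttpHealthAsync(
        string url, TimeSpan timeout, CancellationToken cancellationToken, HttpClient httpClient)
        => PollAsync(url, timeout, true, cancellationToken, httpClient);

    private static async Task PollAsync(
        string url, TimeSpan timeout, bool requireSuccess,
        CancellationToken cancellationToken, HttpClient httpClient)
    {
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        timeoutCts.CancelAfter(timeout);
        Exception? lastError = null;

        try
        {
            while (true)
            {
                try
                {
                    using var response = await httpClient.GetAsync(url, timeoutCts.Token);
                    if (!requireSuccess || response.IsSuccessStatusCode)
                    {
                        return; // reachable (any status) or healthy (2xx)
                    }
                    lastError = new HttpRequestException(
                        $"Endpoint {url} returned HTTP {(int)response.StatusCode}.");
                }
                catch (HttpRequestException ex)
                {
                    // Connection-level failure (e.g. refused); remember it and retry.
                    lastError = ex;
                }

                await Task.Delay(PollInterval, timeoutCts.Token);
            }
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // The timeout fired rather than the caller cancelling.
            throw new TimeoutException($"Timed out waiting for {url}.", lastError);
        }
    }
}
```

Under these assumptions, a sidecar that answers /v1.0/healthz with HTTP 500 passes the reachability probe on the first request, while the health probe keeps polling until the timeout elapses.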

Issue reference

Checklist

  • Code compiles correctly
  • Created/updated tests
  • Extended the documentation

Copilot AI and others added 2 commits April 13, 2026 09:10
…nection refused errors

Agent-Logs-Url: https://github.com/dapr/dotnet-sdk/sessions/68d785b9-a186-4909-8300-41dafe05dbc0

Co-authored-by: WhitWaldo <2238529+WhitWaldo@users.noreply.github.com>
Agent-Logs-Url: https://github.com/dapr/dotnet-sdk/sessions/68d785b9-a186-4909-8300-41dafe05dbc0

Co-authored-by: WhitWaldo <2238529+WhitWaldo@users.noreply.github.com>

codecov Bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 81.48148% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.18%. Comparing base (4d7a6be) to head (e792aad).
⚠️ Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...r.Testcontainers/Common/ContainerReadinessProbe.cs | 85.93% | 4 Missing and 5 partials ⚠️ |
| ...r.Testcontainers/Containers/Dapr/DaprdContainer.cs | 64.70% | 4 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1782      +/-   ##
==========================================
+ Coverage   60.95%   61.18%   +0.23%     
==========================================
  Files         284      285       +1     
  Lines        8358     8405      +47     
  Branches      965      974       +9     
==========================================
+ Hits         5095     5143      +48     
+ Misses       3030     3026       -4     
- Partials      233      236       +3     

| Flag | Coverage Δ |
|---|---|
| net10.0 | 61.18% <81.48%> (+0.23%) ⬆️ |


@WhitWaldo

@copilot This does not appear to have improved the situation as most integration tests are failing. For example:

[xUnit.net 00:16:06.33]     Dapr.IntegrationTest.DistributedLock.DistributedLockTests.ShouldAllowAcquireAfterExpiry [FAIL]

Error: System.InvalidOperationException : Failed to start resource-first Dapr test application after 5 attempts.
---- System.TimeoutException : Timed out waiting for health endpoint http://127.0.0.1:32824/v1.0/healthz to return a successful response.
-------- System.Net.Http.HttpRequestException : Health endpoint at http://127.0.0.1:32824/v1.0/healthz returned HTTP 500.

  Failed Dapr.IntegrationTest.DistributedLock.DistributedLockTests.ShouldAllowAcquireAfterExpiry [5 m 17 s]
  Error Message:
   System.InvalidOperationException : Failed to start resource-first Dapr test application after 5 attempts.
---- System.TimeoutException : Timed out waiting for health endpoint http://127.0.0.1:32824/v1.0/healthz to return a successful response.
-------- System.Net.Http.HttpRequestException : Health endpoint at http://127.0.0.1:32824/v1.0/healthz returned HTTP 500.
  Stack Trace:
     at Dapr.Testcontainers.Common.Testing.DaprTestApplicationBuilder.BuildAndStartAsync() in /home/runner/work/dotnet-sdk/dotnet-sdk/src/Dapr.Testcontainers/Common/Testing/DaprTestApplicationBuilder.cs:line 116
   at Dapr.IntegrationTest.DistributedLock.DistributedLockTests.ShouldAllowAcquireAfterExpiry() in /home/runner/work/dotnet-sdk/dotnet-sdk/test/Dapr.IntegrationTest.DistributedLock/DistributedLockTests.cs:line 113
   at Dapr.IntegrationTest.DistributedLock.DistributedLockTests.ShouldAllowAcquireAfterExpiry() in /home/runner/work/dotnet-sdk/dotnet-sdk/test/Dapr.IntegrationTest.DistributedLock/DistributedLockTests.cs:line 146
--- End of stack trace from previous location ---
----- Inner Stack Trace -----
   at Dapr.Testcontainers.Common.ContainerReadinessProbe.WaitForHttpHealthAsync(String url, TimeSpan timeout, CancellationToken cancellationToken, HttpClient httpClient) in /home/runner/work/dotnet-sdk/dotnet-sdk/src/Dapr.Testcontainers/Common/ContainerReadinessProbe.cs:line 132
   at Dapr.Testcontainers.Containers.Dapr.DaprdContainer.StartAsync(CancellationToken cancellationToken) in /home/runner/work/dotnet-sdk/dotnet-sdk/src/Dapr.Testcontainers/Containers/Dapr/DaprdContainer.cs:line 200
   at Dapr.Testcontainers.Harnesses.BaseHarness.<>c__DisplayClass52_0.<<InitializeAsync>b__0>d.MoveNext() in /home/runner/work/dotnet-sdk/dotnet-sdk/src/Dapr.Testcontainers/Harnesses/BaseHarness.cs:line 229
--- End of stack trace from previous location ---
   at Dapr.Testcontainers.Harnesses.BaseHarness.InitializeAsync(CancellationToken cancellationToken) in /home/runner/work/dotnet-sdk/dotnet-sdk/src/Dapr.Testcontainers/Harnesses/BaseHarness.cs:line 243
   at Dapr.Testcontainers.Common.Testing.DaprTestApplicationBuilder.BuildAndStartAsync() in /home/runner/work/dotnet-sdk/dotnet-sdk/src/Dapr.Testcontainers/Common/Testing/DaprTestApplicationBuilder.cs:line 88
----- Inner Stack Trace -----

…rHttpHealthAsync

Agent-Logs-Url: https://github.com/dapr/dotnet-sdk/sessions/cce14547-9974-4e12-b562-54edf931d423

Co-authored-by: WhitWaldo <2238529+WhitWaldo@users.noreply.github.com>

Copilot AI commented Apr 13, 2026

@copilot This does not appear to have improved the situation as most integration tests are failing. For example:

[xUnit.net 00:16:06.33]     Dapr.IntegrationTest.DistributedLock.DistributedLockT...

The root cause was that my original fix conflated two distinct conditions:

  • Reachable — Dapr's HTTP server is accepting and processing requests (any response, including 500)
  • Healthy — Dapr returns 204 from /v1.0/healthz (all components loaded, app connected, etc.)

My original fix waited for healthy, but Dapr returns 500 from /v1.0/healthz while components are still loading or while the app hasn't started yet (which is exactly the case in resources-first mode, where DaprdContainer.StartAsync completes before the web app is launched). This caused the consistent timeouts you saw.

The correct check to eliminate the "Connection refused" race window is reachable: once Dapr's HTTP server processes any request (even one that returns 500), both the HTTP and gRPC servers have their handlers installed and are stable. Updated in commit cea5689.

New method WaitForHttpReachableAsync only retries when the connection itself is refused; any HTTP response (2xx or 5xx) is treated as "server is up and routing requests."
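As a rough illustration of the reachable/healthy distinction described in this comment, a single request to the health endpoint could be classified as follows. The `SidecarState` enum and `ProbeOnceAsync` helper are hypothetical names introduced only for this sketch; they are not part of the PR or of the Dapr SDK.

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical three-way classification of one health request.
public enum SidecarState
{
    Unreachable, // no HTTP response at all (e.g. connection refused)
    Reachable,   // the server answered, but not with a success status (e.g. 500)
    Healthy      // the server answered with a 2xx status (e.g. 204)
}

public static class SidecarStateSketch
{
    // Sends one GET and reports the strongest state that the response supports.
    public static async Task<SidecarState> ProbeOnceAsync(
        HttpClient httpClient, string healthzUrl, CancellationToken cancellationToken = default)
    {
        try
        {
            using var response = await httpClient.GetAsync(healthzUrl, cancellationToken);
            return response.IsSuccessStatusCode
                ? SidecarState.Healthy
                : SidecarState.Reachable;
        }
        catch (HttpRequestException)
        {
            // Connection-level failure: nothing answered the request.
            return SidecarState.Unreachable;
        }
    }
}
```

In these terms, a reachability check stops polling at `Reachable` or `Healthy`, whereas a health check stops only at `Healthy`.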

Copilot AI changed the title on Apr 13, 2026
From: Fix transient gRPC "Connection refused" errors in integration tests by adding Dapr HTTP health check to DaprdContainer startup
To: Fix transient gRPC "Connection refused" errors in integration tests by adding HTTP reachability check to DaprdContainer startup
Copilot AI requested a review from WhitWaldo April 13, 2026 17:06