Skip to content

Conversation

jacob-pro
Copy link

@jacob-pro jacob-pro commented Jul 16, 2025

This PR replaces the implementation of the doExtensiveHealthChecks() condition, which is used to decide whether to health-check pooled connections prior to re-use.

Currently the socket health check is enabled for all non-GET requests, afterwards the deep health check is only enabled for requests where isOneShot()=true.

See comment: #7007 (comment)

The docs for isOneShot say:

Returns true if this body expects at most one call to writeTo and can be transmitted at most once. This is typically used when writing the request body is destructive and it is not possible to recreate the request body after it has been sent.

By default OkHttp will attempt to retransmit request bodies when the original request fails due to any of:

A stale connection. The request was made on a reused connection and that reused connection has since been closed by the server.

Why

The extensive health check is expensive, and requires a 1ms read timeout check, it should be avoided where possible.

See similar PR: #5795

We were also doing the awkward and expensive read timeout
check on all non-GET requests on pooled connections. Now
we only perform these checks if the connection was idle
for 10 seconds or more.

@jacob-pro
Copy link
Author

@yschimke @swankjesse

@yschimke
Copy link
Collaborator

I'll defer to @swankjesse

My concerns would be

  • whether it's a net performance loss for some conditions, such as slow mobile networks, because of sending expensive but also retryable POSTs that fail
  • whether the existing logic is because GET is assume idempotent, but it's not clear it's about that.

@yschimke
Copy link
Collaborator

Failing on formatting

Run './gradlew :spotlessApply' to fix these violations.

@jacob-pro
Copy link
Author

Thanks @yschimke I have applied the formatting changes

@swankjesse
Copy link
Collaborator

Unfortunately, isOneShot() is concerned with producing the response body (can we re-create the request body bytes if we need to re-transmit) rather than a defensive-check (is the action idempotent). I’d prefer to not conflate the two, though I may need to improve the documentation!

The extensive health check is expensive, and requires a 1ms read timeout check, it should be avoided where possible.

Agreed that it’s expensive!

We currently don’t bother for connections used within the last 10 seconds.

    val idleDurationNs = withLock { nowNs - idleAtNs }
    if (idleDurationNs >= IDLE_CONNECTION_HEALTHY_NS && doExtensiveChecks) {
      return socket.isHealthy(source)
    }

Is this adequate? Would bumping it to 20 or 30 seconds help?

(Could you tell me a bit more about your use case?)

@jacob-pro
Copy link
Author

jacob-pro commented Jul 23, 2025

@swankjesse maybe I am missing something, but I don't understand what the relevance of idempotency is here?

First chain.request.method != "GET" is not very good test for idempotency in the first place, what about PUT or DELETE ?

Idempotency determines whether a request is safe to be retried, but that is not how it is used here, consider what would happen if you had:

  • A non-idempotent POST
  • On a connection last used 9s ago
  • The connection has gone unhealthy
  • The healthcheck would still be skipped due to idleDurationNs
  • If idempotency was actually a concern then the behaviour would be unsafe depending on a 9s or 11s idle?

My understanding of the check is this should be a cost-benefit analysis on whether the extensive health check is likely to be worthwhile:

  • There is low-value in doing the expensive health check on connections that were recently used (<10s) because they are unlikely to have become unhealthy
  • Similarly there would be low-value in doing the expensive health check when the request could easily be retried by the RetryAndFollowUpInterceptor

// We can't send the request body again.
if (requestSendStarted && requestIsOneShot(e, userRequest)) return false

@swankjesse
Copy link
Collaborator

I don’t think we can reliably tell whether a call on a reused connection failed before the server processed the request, or after the server processed the request. If it failed after the server processed the request, we have a correctness problem, not a performance problem.

So OkHttp is trying to be conservative here. 10 seconds is probably way too conservative, and we could probably bump it to 60 seconds without problem, but also I think the net number of health checks saved by that change would be negligible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants