Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: make Transport's idle connection management aware of DNS changes? #23427

Open
szuecs opened this issue Jan 12, 2018 · 51 comments · May be fixed by #46714
Open

net/http: make Transport's idle connection management aware of DNS changes? #23427

szuecs opened this issue Jan 12, 2018 · 51 comments · May be fixed by #46714
Labels
NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@szuecs
Copy link

szuecs commented Jan 12, 2018

What version of Go are you using (go version)?

go version go1.9 linux/amd64

and

go version go1.9.2 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

% go env
GOARCH="amd64"
GOBIN="/home/sszuecs/go/bin"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/sszuecs/go"
GORACE=""
GOROOT="/usr/share/go"
GOTOOLDIR="/usr/share/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build505089582=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

I am running go run main.go and change the /etc/hosts to change www.google.de to 127.0.0.1.
Depending on the order it fails to switch DNS after IdleConnTimeout or it will never request DNS again. DNS will be tried to lookup in case you first target it to 127.0.0.1 and after that comment the entry in /etc/hosts. The problem is that if you want to change your target loadbalancers via DNS lookup, this will not be done. The workaround is commented in the code, which reliably will do the DNS failover. The problem seems to be that IdleConnTimeout is bigger then the time.Sleep duration in the code, which you can also change to see that this works. In case of being an edge proxy with high number of requests, the case IdleConnTimeout < process-next-request will never happen.

package main

import (
	"log"
	"net"
	"net/http"
	"time"
)

func main() {

	tr := &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   5 * time.Second,
			KeepAlive: 30 * time.Second,
			DualStack: true,
		}).DialContext,
		TLSHandshakeTimeout: 5 * time.Second,
		IdleConnTimeout:     5 * time.Second,
	}

	go func(rt http.RoundTripper) {
		for {
			time.Sleep(1 * time.Second)

			req, err := http.NewRequest("GET", "https://www.google.de/", nil)
			if err != nil {
				log.Printf("Failed to do request: %v", err)
				continue
			}

			resp, err := rt.RoundTrip(req)
			if err != nil {
				log.Printf("Failed to do roundtrip: %v", err)
				continue
			}
			//io.Copy(ioutil.Discard, resp.Body)
			resp.Body.Close()
			log.Printf("resp status: %v", resp.Status)
		}
	}(tr)

	// FIX:
	// go func(transport *http.Transport) {
	// 	for {
	// 		time.Sleep(3 * time.Second)
	// 		transport.CloseIdleConnections()
	// 	}
	// }(tr)

	ch := make(chan struct{})
	<-ch
}

What did you expect to see?

I want to to see that IdleConnTimeout will reliably close idle connections, such that DNS is queried for new connections again, similar to the goroutine case in the code. We need to slowly being able to fade traffic.

What did you see instead?

In case you start the application with /etc/hosts entry is not set, and then change it, it will never fail the request, so the new DNS lookup is not being made.

@bradfitz bradfitz changed the title http client and http.Transport IdleConnTimeout is not reliably doing DNS resolve net/http: make Transport's idle connection management aware of DNS changes? Jan 12, 2018
@bradfitz bradfitz added NeedsFix The path to resolution is known, but the work has not been done. NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. labels Jan 12, 2018
@gopherbot gopherbot removed the NeedsFix The path to resolution is known, but the work has not been done. label Jan 12, 2018
@bradfitz bradfitz added this to the Go1.11 milestone Jan 12, 2018
@bradfitz bradfitz added the NeedsFix The path to resolution is known, but the work has not been done. label Jan 12, 2018
@gopherbot gopherbot removed the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Jan 12, 2018
@bradfitz
Copy link
Contributor

/cc @tombergan

@meirf
Copy link
Contributor

meirf commented Feb 27, 2018

Possible Signals

If persistConn.conn is dialed via hostname:

  • TTL expiration. If TTL expires, take action. This would require an API change since net's TTL doesn't appear to be exposed.
  • Poll LookupIP. Store initial ip set and keep updating with new polling response; if ip set changes (or more strictly if dialed ip is not longer in ip set), take action. How frequently would we poll? Are we willing to store this extra state, spend this time polling, using system resourced even with optimizations?

Action

  • Close/mark for closure. If the connection is idle, close it. If the connection is in use, mark it as not reusable. This would be similar to current usage of persistConn.broken though "broken" maybe too harsh language for this case. (The OP uses CloseIdleConnections, which "does not interrupt any connections currently in use", so calling CloseIdleConnections once could leave many connections on the old ips.)

I guess the use of "?" in the title was due to implementation complexity/propriety of addressing this in the standard library. Please let me know if I've gotten something wrong or if there are other issues I'm missing.

@szuecs szuecs changed the title net/http: make Transport's idle connection management aware of DNS changes? net/http: make Transport's idle connection management aware of DNS changes Feb 27, 2018
@szuecs
Copy link
Author

szuecs commented Feb 27, 2018

@meirf For us CloseIdleConnections works pretty well, we use it in our internet facing central ingress http proxy and it's fast enough to do the DNS lookup while we do DNS based traffic switching (AWS route53 Weighted DNS records). If you would use IdleConnTimeout to call CloseIdleConnections would not do a breaking change of the API. I bet it would be in 99.999% of the cases fine, because applications, also proxies, are most of the time CPU bound.
My question would be why you want to expose additional data if you already have IdleConnTimeout?

To answer your question about "?" in the title: I don't remember the why. I deleted it from the title. I am only digging from time to time into the Go src code, mostly in the net package. I guess with the "?" I wanted to reflect that. I hope it's now more clear. :)

@meirf
Copy link
Contributor

meirf commented Feb 28, 2018

tl;dr (1) Idle connections' awareness of dns changes appears to be absent from the API. (2) The easy part is forcing a redial via closure. (3) IMO the hard part is figuring out what signal the code should use to redial.

@szuecs
My first comment was directed to bradfitz, who set the title to "net/http: make Transport's idle connection management aware of DNS changes?".

Based on my reading of your most recent comment, you think idle connections' awareness of dns changes is already promised by the current API. I (and bradfitz based on his title) do not agree that idle connections' awareness of dns changes is already promised.

I think we agree that currently the only way to be aware of a dns change is to force a redial by not having connections kept alive, but IdleConnTimeout won't force a keep-alive connection to be closed unless your situation meet its definition. If you have a keep-alive connection and it never becomes idle, by definition it will never close itself due to IdleConnTimeout. The cumulative amount of time a keep-alive connection is idle doesn't matter. The only way the keep-alive connection will be closed due to IdleConnTimeout is if it's idle for IdleConnTimeout consecutive/contiguous time.

  1. To prevent any connections from being kept alive you can set DisableKeepAlives.
  2. To increase the likelihood of a keep-alive connection being closed, you can decrease IdleConnTimeout.
  3. To close all currently idle keep-alive connections, you can call CloseIdleConnections.

It's clear that all these would be helpful to you in redialing and therefore connecting to the new ip. This indirectly results in dns awareness (and possibly a lot of inefficiency). But in terms of the standard library, there is no promise of direct dns awareness.

If you would use IdleConnTimeout to call CloseIdleConnections would not do a breaking change of the API.

Briefly touching upon implementation, I don't think dns awareness in the standard library would use CloseIdleConnections since that blindly closes all keep-alive connections and I'd guess that we'd want dns change detection to be done per connection.

@szuecs szuecs changed the title net/http: make Transport's idle connection management aware of DNS changes net/http: make Transport's idle connection management aware of DNS changes? Feb 28, 2018
@szuecs
Copy link
Author

szuecs commented Feb 28, 2018

@meirf Ok I was not sure if questions were asked to me. As far as I understand now you did not asked me. :)
I put again the "?" in the title, because the change was not done by me (I should read the events).
Thanks for your comments!

josephcopenhaver added a commit to josephcopenhaver/go that referenced this issue Jun 13, 2021
Existing implementation appeared to have no way to set a connection
max lifetime. A max lifetime allows for refreshing connection lifecycle
concerns such as dns resolutions for those that need it.

When initialized to a non-zero value the connection will be closed after
the duration has passed. This change is backwards compatible.

Fixes golang#23427
@neild
Copy link
Contributor

neild commented Aug 10, 2021

https://golang.org/cl/341329 has a possible approach that @bradfitz and I came up with.

The idea is to add a ReuseConn hook to http.Transport that permits user-specified policy decisions on when to reuse a connection from an idle pool. Setting a max lifetime would be something like:

transport.ReuseConn = func(info httptrace.ReuseConnInfo) bool {
  return time.Since(info.CreationTime) < MaxConnLifetime
}

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/341329 mentions this issue: net/http: add ReuseConn hook

@jpcope
Copy link

jpcope commented Aug 11, 2021

I like the simplicity of the extra hook. I think I prefer the deadline approach which is used in the idle timeout purely because I know the client is not going to keep the connection around. So don't need a usage attempt to prompt the connection to close.

Would you need to add similar functionality in http2's pool?

@aojea
Copy link
Contributor

aojea commented Aug 27, 2021

https://golang.org/cl/341329 has a possible approach that @bradfitz and I came up with.

The idea is to add a ReuseConn hook to http.Transport that permits user-specified policy decisions on when to reuse a connection from an idle pool. Setting a max lifetime would be something like:

transport.ReuseConn = func(info httptrace.ReuseConnInfo) bool {
  return time.Since(info.CreationTime) < MaxConnLifetime
}

are you targeting 1.18 @neild ? , that hook looks like a really nice addition

@szuecs
Copy link
Author

szuecs commented Aug 30, 2021

@neild for me personally transport.ReuseConn is great and ClientTrace.ReuseConn is not really required, because you would already see in the ClientTrace, that a new conn is created (DNSStart, DNSDone, ConnectStart, ConnectDone and potentially also the TLS parts). It feels also a bit weird to me that clienttrace can now influence the connection pool, which seems more the responsibility of the http.Transport, but maybe it's just my first thought and it is just fine.

Thanks for working on it!

josephcopenhaver added a commit to josephcopenhaver/go that referenced this issue Sep 12, 2021
Existing implementation appeared to have no way to set a connection
max lifetime. A max lifetime allows for refreshing connection lifecycle
concerns such as dns resolutions for those that need it.

When initialized to a non-zero value the connection will be closed after
the duration has passed. This change is backwards compatible.

Fixes golang#23427
josephcopenhaver added a commit to josephcopenhaver/go that referenced this issue Sep 12, 2021
Existing implementation appeared to have no way to set a connection
max lifetime. A max lifetime allows for refreshing connection lifecycle
concerns such as dns resolutions for those that need it.

When initialized to a non-zero value the connection will be closed after
the duration has passed. This change is backwards compatible.

Fixes golang#23427
@jpcope
Copy link

jpcope commented Sep 30, 2021

@neild One of the reasons I am all for adding a MaxConnLifespan to the Transport and socket wait logic is that a given http service which brokers connections to other services can start truncating connections without requiring a new connection/request to be made to a service. This in theory would free up connections for any backing services as horizontal scale-out ( runtime replication ) events occur and potentially reduce the number of idle connections at any given time across the gamut of services. We've implemented this logic before in other protocols and saw clear performance gains as well as the dns re-resolution benefits.

Please consider the PR I have out there as well. It's my first one so I don't mind it ultimately being closed because it has style issues. Would be ecstatic to see the capability in the standard lib even if my name is not attached. 😄

Gonna resolve the conflict in the api/next.txt file again in a sec.

Edit: I am also aware that there is already an idle timeout setting, but I assume the connection reuse logic is FIFO. In such a queuing strategy we'd need to have a solid "idle timeout" period of no new requests for the idle connections to trend towards zero. Depending on traffic patterns this can be quite a rare event.

@aojea
Copy link
Contributor

aojea commented Nov 10, 2021

https://golang.org/cl/341329 has a possible approach that @bradfitz and I came up with.

The idea is to add a ReuseConn hook to http.Transport that permits user-specified policy decisions on when to reuse a connection from an idle pool. Setting a max lifetime would be something like:

transport.ReuseConn = func(info httptrace.ReuseConnInfo) bool {
  return time.Since(info.CreationTime) < MaxConnLifetime
}

@neild assuming http2 is the most impacted by this issue since it will multiplex the requests over a single connection, not dialing again, and that the ClientConnPool already is an interface

https://github.com/golang/net/blob/ef0fda0de5086a0a3c6721b466f49e1f36188234/http2/client_conn_pool.go#L17-L27

and it can be set in the http2 transport

	// ConnPool optionally specifies an alternate connection pool to use.
	// If nil, the default is used.
	ConnPool ClientConnPool

what if you can expose the clientConnPool type so people can implement their own logic on GetClientConn()

type myClientConnPool {
       net.ConnectionPool
}

func (p *myClientConnPool) GetClientConn(req *http.Request, addr string) (*ClientConn, error) {
        // do things here
	return p.GetClientConn(req,addr)
}

func (p *myClientConnPool) MarkDead(c *ClientConn) {
	p.MarkDead(c)
}

I think that the the connection pool type can benefit other issues like #37373 , this last issue may require some additional changes since the conn poll is keyed by host:port but just wanted to know your thoughts

EDIT

I see my mistake, the connection pool used in the std lib is

// noDialClientConnPool is an implementation of http2.ClientConnPool
// which never dials. We let the HTTP/1.1 client dial and use its TLS
// connection instead.
type http2noDialClientConnPool struct{ *http2clientConnPool }

@chengzhicn
Copy link

2. TTL based expiry
   Con: if for some reason, I have set high TTL in my DNS. Connections won't be rotated ever, which again might raise problems in a scenario where your client assumes that the connection is open but the server has actually closed the connection. Whenever we try to use that half-open connection, we get socket timeout.

@rohitkeshwani07 In case of high TTL, redial probably won't work because of DNS cache in host/resolver. If the main concern is DNS change, then it's better to use TTL because:

  1. It's a standard way to control how long the information should stay valid.
  2. Both TTL and DNS change is done by the domain owner, and he can decide how long the TTL will be since different value will have different impact on authoritative DNS server(more/less query because of cache expires) and application server(more/less redial and handshake). He can set a large TTL if he want less load on his servers and accept the fact that DNS change will take a long time to take effect.

But if the main problem resides on client side(eg: firewall drops longstanding connections), then it make sense to take client side approach like configure max connection lifetime or hooks proposed by @neild on client side.

josephcopenhaver added a commit to josephcopenhaver/go that referenced this issue Feb 3, 2022
Existing implementation appeared to have no way to set a connection
max lifetime. A max lifetime allows for refreshing connection lifecycle
concerns such as dns resolutions for those that need it.

When initialized to a non-zero value the connection will be closed after
the duration has passed. This change is backwards compatible.

Fixes golang#23427
@josephcopenhaver
Copy link

Hello again!

I still really like adding logic to the transport layers so system resources can close in advance of trying to pull a connection from the pool should a deadline be reached: #46714

This behavior is more ideal for me just because then I don't have system nodes keeping connections alive just so the pool layer on top can intercept an established connection candidate, see it is too old, and then kill it. May as well let it take care of itself in that regard and short circuit earlier.

Any way I can help ensure an approach of some kind makes it into the standard lib? 😄 I think it would be very powerful to get the standard http kit up to par with or allow for the creation of the same knobs we have in https://pkg.go.dev/database/sql

josephcopenhaver added a commit to josephcopenhaver/go that referenced this issue May 4, 2022
Existing implementation appeared to have no way to set a connection
max lifetime. A max lifetime allows for refreshing connection lifecycle
concerns such as dns resolutions for those that need it.

When initialized to a non-zero value the connection will be closed after
the duration has passed. This change is backwards compatible.

Fixes golang#23427
@josephcopenhaver
Copy link

josephcopenhaver commented May 4, 2022

I wonder how much of this problem is purely one of loadbalancing concepts people may be trying to implement.

If a host record resolves to 3 ip nodes and suddenly we add or remove one, then respectively we need to ensure the new node starts taking requests as soon as we know about it so load spreads ( no need to close connections until we approach the max connection limit or idle timeouts happen ) or remove connections tied to the deregistering ( potentially already deregistered ) ip address.

The latter case of removing a node is trivial and fairly self-correcting before any DNS TTL and refresh cycle would inform us on average.

In the same token, idle connection timeout can create "randomly hot" nodes over time by just the probability of FIFO rotating through connections cyclicly. A max lifespan is a fairly nuclear solution to this problem when perhaps envoy proxy should be the full-featured solution for balancing out-of-app.

For people wanting a solution in the same binary they really probably want LIFO connection queuing under a round-robin ( or some distance/efficiency-aware weighted function ) connection broker over the LIFO queues partitioned by IP. This would allow for any "logically extra" connections the service can live without to end up dying off just by virtue of the connection timeouts.

If DNS appears to remove a host we can assume the DNS source has determined that the host should get phased out and schedule the relevant connections to die as soon as their transactions finish or immediately inform the broker to kill that pool of trusted connections.

As a DNS entry is discovered for a hostname a LIFO queue would be created and added to the connection broker. Load would spread almost immediately across the upstream offering, and old connections no longer getting any use on the other nodes would start to time out and die off, expensive new connection flows continue to be avoided.

Connection max lifetime/reuse is likely not the feature people need or want. ( If you do need/want it cool, my PR offers it! ) It just looks like what we're reaching for in this conversation when we really want to observe+schedule DNS resolution and use that to inform a connection pooling strategy where we can change the implementation of popping and pushing to a queue we maintain externally that is LIFO or FIFO or anything else - but note we need to know the connection's target IP to achieve these goals. It's worth discussing in more detail by persons probably smarter than myself. Typically when users of one of my libs want agency I will happily give them the simplest interface they need to achieve their goals and default the behavior to off. Perhaps the maintainers are willing to do similar here?

@szuecs
Copy link
Author

szuecs commented May 4, 2022

Thanks @josephcopenhaver for your answer. A lot of things make sense to me. The problem is not proxy specific, more http upstream specific.
If you have an url as target and the target is not an anycast ip, then you have some number of IPs and if the target needs to scale out it needs more IPs, so if clients stay the new ones won't be used. The lifo idea might be fixing this problem, so maybe worth a try.

@marcind
Copy link

marcind commented May 4, 2022

I wonder how much of this problem is purely one of loadbalancing concepts people may be trying to implement.

Load balancing is not the reason why I'm interested in this issue. My use case: a long-running workload that has a steady cadence of requests to a HTTP 1.1 service Foo (a few/second, maybe it's logs getting uploaded, etc). Foo's operator has configured a low DNS TTL to ensure traffic can be moved quickly if needed. At the same time, the workload uses HTTP keep-alive to avoid redialing for each request. Foo's operator updates their DNS in hopes of inducing a graceful blue/green move to new infrastructure and yet observes little to no traffic shifting in response, because the steady request cadence ensures no connections go idle. Ideally, the old Foo endpoints would start responding with Connection: close headers, but... Foo is fronted by an AWS Application Load Balancer and there exists no mechanism to induce an ALB to start closing client connections.

If DNS appears to remove a host we can assume the DNS source has determined that the host should get phased out and schedule the relevant connections to die as soon as their transactions finish or immediately inform the broker to kill that pool of trusted connections.

That is not a safe assumption. For example, DNS responses sent by an AWS ALB will cycle through a subset of the existing IP addresses. Just because an IP does not appear in a subsequent DNS query does not mean that the corresponding instance no longer accepts traffic. In fact, that IP will most likely reappear in a following DNS response.

Connection max lifetime/reuse is likely not the feature people need or want. ( If you do need/want it cool, my PR offers it! ) It just looks like what we're reaching for in this conversation when we really want to observe+schedule DNS resolution and use that to inform a connection pooling strategy where we can change the implementation of popping and pushing to a queue we maintain externally that is LIFO or FIFO or anything else - but note we need to know the connection's target IP to achieve these goals. It's worth discussing in more detail by persons probably smarter than myself. Typically when users of one of my libs want agency I will happily give them the simplest interface they need to achieve their goals and default the behavior to off. Perhaps the maintainers are willing to do similar here?

Yes and no. Years of "best practices" have conditioned everyone to expect that setting a low DNS TTL will allow service operators to induce HTTP clients to shift traffic with minimum delay. Many browsers and software SDKs behave that way (even though there's nothing in the HTTP specification that mandates keep-alives honor DNS TTLs), but net/http does not. So on the one hand I would be happy to have the HTTP client track the remaining TTL deadline that was associated with the DNS response used to establish a particular connection. On the other, I would still like the ability to extend that connection's life (assuming the server indicates that this is OK by not returning Connection: close) beyond the 30 seconds that's typically configured for these sort of applications. Setting a MaxLIfetime of say 5 minutes would strike a good balance between "immediately" (30 seconds) and "indefinitely" (i.e. whenever the service operator finally decides to just shut down the "old" deployment since the existing clients don't seem to want to shift gracefully).

@szuecs
Copy link
Author

szuecs commented Apr 2, 2024

okhttp suggested to do this on the server side and send Connection: close response header regularly, maybe similar to my time based suggestion on the client: square/okhttp#3374 (comment)
In the end I think both (client and server side solutions) have their advantages and disadvantages.

@awnumar
Copy link
Contributor

awnumar commented Apr 2, 2024

@szuecs while it is possible to solve this server-side, sometimes the server-side application is owned by a third party so it's not within our control to make changes to it.

Our solution for working around the lack of a client-side connection-lifetime timeout is:

  • have a global http client
  • have a routine that regularly initialises a new client and replaces the global one
  • new requests use the new client which opens a new connection

@szuecs
Copy link
Author

szuecs commented Apr 3, 2024

@awnumar this is also possible and I think it would be beneficial to have it on the client but I wanted also to show that there are others that show server side solutions in this context. A chosen solution depends on the context.

@andrebrait
Copy link

@szuecs just contributing here with my 2 cents: I recently attempted the suggested solution by the OkHttp authors, and it does work, but not with OkHttp because it doesn't actually seem to run name resolution again until it actually gets an error from the server (e.g. a 503).

The server-side approach seems to still require some proper handling on the client to achieve awareness of DNS record changes.

@szuecs
Copy link
Author

szuecs commented Oct 7, 2024

@andrebrait thanks! We also added zalando/skipper#3246 to fix this similar to other proxies. okhttp seems to be unfortunate in handling this case. :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

Successfully merging a pull request may close this issue.