Serialize embedded resolver Start and Stop #1561
Conversation
```go
// valid before Activating the DNS server. This avoids a panic in dns
// library code because of a missing nil check for the interface's data.
// github.com/docker/docker/issues/28112
if r.conn == nil {
```
This reduces the chances of hitting the bug but does not eliminate it. We need to lock the resolver before accessing or modifying its members, and we need to pass a reference to `r.conn` acquired under the lock.

Also, wouldn't the same issue be there when we construct the tcpServer?
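A minimal sketch of the locking pattern suggested above (`startServers` is a hypothetical helper and the surrounding details are assumptions, not the PR's actual diff):

```go
import (
	"net"
	"sync"

	"github.com/miekg/dns"
)

type resolver struct {
	sync.Mutex
	conn      *net.UDPConn
	tcpListen *net.TCPListener
	server    *dns.Server
	tcpServer *dns.Server
}

// ServeDNS makes resolver satisfy dns.Handler (body elided).
func (r *resolver) ServeDNS(w dns.ResponseWriter, q *dns.Msg) {}

// startServers copies the connection references under the resolver lock
// and builds the servers from the copies, so the values that were
// nil-checked are the values actually used.
func (r *resolver) startServers() {
	r.Lock()
	conn, tcpListen := r.conn, r.tcpListen
	r.Unlock()

	if conn == nil || tcpListen == nil {
		return // Stop() won the race; nothing to start
	}
	r.server = &dns.Server{Handler: r, PacketConn: conn}
	r.tcpServer = &dns.Server{Handler: r, Listener: tcpListen, Net: "tcp"}
}
```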
Yes, it doesn't eliminate the bug completely. But `setupIPTable()` re-execs and programs the iptables rules in the container's net namespace, which is an expensive operation, so adding the check after `setupIPTable()` significantly reduces the chance of hitting this bug. I prefer not to add more locks in our code. Instead I am going to push a PR to miekg/dns to check whether the interface's data is nil; that will avoid the panic completely.
> wouldn't the same issue be there when we construct the tcpServer?

With this change we will return from `Start()` if the resolver's UDP conn is nil. It's not possible to have `resolver.conn != nil && resolver.tcpListen == nil`. But I think there is no harm in checking `resolver.tcpListen` as well; will change it.
> It's not possible to have resolver.conn != nil && resolver.tcpListen == nil

Yes it is: `stop()` may be called after `start()` has started the UDP listener but before it has started the TCP listener.

> Instead I am going to push a PR to miekg/dns to check whether the interface's data is nil

I do not think it is the library's responsibility if we nil the `net.PacketConn` after initializing its server structure.

> I prefer not to add more locks in our code

I am not sure I understand the opposition to locks when we have the same structure being accessed and modified by different threads.
> Yes it is: stop() may be called after start() has started the UDP listener but before it has started the TCP listener.

Both the UDP and TCP sockets are created in `SetupFunc`, before `Start()`. Unless both are created, `resolver.err` will be non-nil, which is already checked. So it's not possible to have `resolver.conn != nil && resolver.tcpListen == nil`.
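A rough sketch of the ordering being described; the field and function names follow the thread, but the bodies are assumptions rather than the actual libnetwork code:

```go
import "net"

type resolver struct {
	conn      *net.UDPConn
	tcpListen *net.TCPListener
	err       error
	// ... other fields elided
}

// Both sockets are created here, in the setup step that runs before
// Start(). Any failure is recorded in r.err, which Start() checks first,
// so r.err == nil together with r.conn != nil implies r.tcpListen != nil.
func (r *resolver) SetupFunc() func() {
	return func() {
		var err error
		r.conn, err = net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 11)})
		if err != nil {
			r.err = err
			return
		}
		r.tcpListen, err = net.ListenTCP("tcp", &net.TCPAddr{IP: net.IPv4(127, 0, 0, 11)})
		if err != nil {
			r.err = err
		}
	}
}
```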
> I do not think it is the library's responsibility if we nil the net.PacketConn after initializing its server structure.

We are not setting `net.PacketConn` to nil after creating `dns.Server`; rather, `PacketConn` is set to `r.conn`, which happens to be nil. Since `net.PacketConn` is an interface type, the library should check the interface's value before accessing it, i.e. the following is not safe:
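A minimal sketch of that unsafe pattern (illustrative types, not the library's actual code):

```go
package main

import "net"

// server mimics a struct with an interface-typed field, like dns.Server's
// PacketConn.
type server struct {
	PacketConn net.PacketConn
}

func main() {
	var conn *net.UDPConn // a typed nil pointer, like r.conn in this bug
	s := &server{PacketConn: conn}

	// The interface itself is non-nil (it carries the *net.UDPConn type),
	// so this check passes even though the data it holds is nil ...
	if s.PacketConn != nil {
		// ... and any method call then panics with a nil pointer dereference.
		s.PacketConn.LocalAddr()
	}
}
```

Catching this on the library side would require inspecting the interface's dynamic value (for example with the `reflect` package), not just comparing the interface to nil.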
> I am not sure I understand the opposition to locks when we have the same structure being accessed and modified by different threads.

For the reasons mentioned earlier, it's not needed for this issue. But to be safe we can lock around the resolver fields.
This also fixes #1405
Force-pushed from afccf9c to c5bb4cd
```go
@@ -140,38 +142,67 @@ func (r *resolver) Start() error {
	if r.err != nil {
		return r.err
	}
	// startCh is to serialize resolver Start and Stop
	r.Lock()
	r.startCh = make(chan struct{})
```
Can we initialize the channel in the constructor `NewResolver()`? That way I think we can avoid adding the mutex.
`NewResolver()` is called once for a sandbox, but for a given resolver `Start` and `Stop` can happen more than once, i.e. once per container restart. The intent of the change is to serialize a pair of `Start` and `Stop` calls; that is the reason for not creating `startCh` in the constructor.
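One way to read that design (an assumed sketch; the PR's exact mechanics may differ): each `Start()` creates a fresh `startCh` and closes it when it finishes, and `Stop()` waits on the channel of its paired `Start()` before tearing anything down.

```go
import "sync"

type resolver struct {
	sync.Mutex
	startCh chan struct{}
	// ... other fields elided
}

func (r *resolver) Start() error {
	// ... r.err check elided ...
	r.Lock()
	r.startCh = make(chan struct{}) // fresh channel for this Start/Stop pair
	r.Unlock()
	defer close(r.startCh) // signal that this Start has completed
	// ... create servers, program iptables, activate DNS ...
	return nil
}

func (r *resolver) waitForStart() {
	r.Lock()
	ch := r.startCh
	r.Unlock()
	if ch != nil {
		<-ch // blocks until the paired Start() has finished (channel closed)
	}
}

func (r *resolver) Stop() {
	r.waitForStart()
	// ... shut down the UDP and TCP servers ...
}
```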
I do not understand the problem in serializing any `Start()/Stop()` sequence, even one which spans container restarts. Given the base code does not support concurrent `Start()/Stop()`, the safest approach is in fact to serialize this pair of functions, regardless.

Also, I am not comfortable with the amount of code refactoring in this PR, given it targets a panic fix for docker 1.13 while we are in the 1.13 RC process. In less abstract terms, I am comparing these changes with the following minimal change: https://github.com/aboch/libnetwork_new/commit/e424dce7938d2cbc5ec003354597c3e98a343939
```go
}
if r.tcpServer != nil {
	r.tcpServer.Shutdown()

func (r *resolver) waitForStart() {
```
Why do we need this function? It should be enough to call

```go
r.startCh <- true
defer func() { <-r.startCh }()
```

at the beginning of `Start()` and `Stop()`.
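For context, the suggestion amounts to using `startCh` as a binary semaphore. A sketch, assuming a buffered channel of capacity 1 created once in the constructor:

```go
type resolver struct {
	startCh chan bool
	// ... other fields elided
}

func NewResolver( /* ... */ ) *resolver {
	return &resolver{startCh: make(chan bool, 1)}
}

func (r *resolver) Start() error {
	r.startCh <- true              // acquire: blocks while Stop() holds the slot
	defer func() { <-r.startCh }() // release when Start() returns
	// ... set up and activate the DNS servers ...
	return nil
}

func (r *resolver) Stop() {
	r.startCh <- true
	defer func() { <-r.startCh }()
	// ... shut down the servers ...
}
```

This buffered-channel form is what the PR eventually switched to (see the PTAL comment below).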
As clarified earlier, there can be more than one Start and Stop call for a resolver; this change serializes a pair of Start and Stop calls. With the suggested change, a subsequent Start can block on a previous Stop, which is to be avoided.

Besides, there is a personal-preference rather than technical-correctness aspect here: to me, having a function `waitForStart` conveys the intent more clearly and is easier to follow.
LGTM
Signed-off-by: Santhosh Manohar <[email protected]>
@aboch Changed to the buffered channel approach. PTAL
Thanks @sanimej LGTM
If a container comes up and is brought down quickly, it's possible for resolver.Stop() to get called before resolver.Start() completes. In that case r.conn can be nil in

```go
s := &dns.Server{Handler: r, PacketConn: r.conn}
```

The dns library checks only whether PacketConn is nil, but PacketConn is an interface type; because of the missing check for the interface's data being nil, this results in a panic.

https://github.com/miekg/dns/blob/3f1f7c8ec9ead89493df11f2c3d8bec353a2c2c0/server.go#L388
related to
docker #28112
docker #28465
Signed-off-by: Santhosh Manohar [email protected]