ipam/aws: fixed a bug causing the operator to hang indefinitely when the ENI limits for an instance type could not be determined #14905
Conversation
Thanks for the PR! The core idea seems good to me; however, I have some comments on how it's implemented.
Also, the commit msg seems better suited as a release note. The commit msg should briefly describe the code change, to conform to https://docs.cilium.io/en/v1.9/contributing/development/contributing_guide/#submitting-a-pull-request. Concretely, the commit msg could be:
ipam/aws: Provide mitigation if ENI limits not found
The elaboration that it fixes a bug would be better moved into the commit msg body for context.
Feel free to add to / modify it as you see fit as well.
pkg/aws/eni/node.go
Outdated
@@ -494,9 +495,9 @@ func (n *Node) ResyncInterfacesAndIPs(ctx context.Context, scopedLog *logrus.Ent
 // GetMaximumAllocatableIPv4 returns the maximum amount of IPv4 addresses
 // that can be allocated to the instance
-func (n *Node) GetMaximumAllocatableIPv4() int {
+func (n *Node) GetMaximumAllocatableIPv4(scopedLog *logrus.Entry) int {
Why was this needed? It seems to me that we can already access the logger via n.loggerLocked()
Plus, this modifies the interface of these functions for no concrete benefit.
So this is exactly where the whole situation resides. By calling n.loggerLocked() within this scope, the function gets blocked: it calls n.node.InstanceID(), which requires a read lock over the node mutex that is already write-locked at this point.
Do you know of a better way to work around this issue?
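To illustrate the pattern being described, here is a minimal standalone Go sketch (hypothetical names, not the actual Cilium code) of a goroutine that holds the write lock and then calls a helper which tries to read-lock the same sync.RWMutex:

package main

import "sync"

type node struct {
	mutex      sync.RWMutex
	instanceID string
}

// getInstanceID mirrors a getter like n.node.InstanceID(), which takes a read lock.
func (n *node) getInstanceID() string {
	n.mutex.RLock()
	defer n.mutex.RUnlock()
	return n.instanceID
}

// recalculate mirrors a function that holds the write lock and then,
// indirectly (e.g. via a scoped logger), calls the read-locking getter.
func (n *node) recalculate() {
	n.mutex.Lock()
	defer n.mutex.Unlock()

	// sync.RWMutex is not reentrant: RLock() blocks while the write lock is
	// held, and the write lock is held by this very goroutine, so this call
	// never returns.
	_ = n.getInstanceID()
}

func main() {
	n := &node{instanceID: "i-0123456789abcdef0"}
	n.recalculate() // hangs; the Go runtime reports "all goroutines are asleep - deadlock!"
}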
Are you saying that you found a deadlock in the prior code and this change fixes it? Because the prior code is still using loggerLocked()
yes, absolutely!
Ah, I see! I didn't notice that the driver behind this PR is this deadlock. Nice catch!
However, I still think we can do better here than to pollute the interface with a logger; it shouldn't need to be passed around, and the fact that we rely on it to avoid a deadlock is a very strong indicator that we need to rethink the flow and dependencies of this code.
Here's my proposal:
Let's add instanceID string as a field to the pkg/aws/eni.Node struct. This would allow us to avoid depending on n.node.InstanceID() as was done previously (which evidently cannot be called when n.node is already locked).
We can initialize the field when we call CreateNode, and we can rely on it being set because the instance ID is always set when the resource is created.
Something like:
func (m *InstancesManager) CreateNode(obj *v2.CiliumNode, n *ipam.Node) ipam.NodeOperations {
	return &Node{k8sObj: obj, manager: m, node: n, instanceID: n.InstanceID()}
}
Or feel free to create a constructor for Node.
If you think this is a good idea, then please implement in a separate commit. WDYT?
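For reference, the constructor variant could look roughly like this (a sketch only, reusing the types from the snippet above; the function name NewNode is an assumption, not the merged code):

// NewNode captures the instance ID once at construction time, so later code
// can log it without having to re-acquire the ipam.Node lock.
func NewNode(obj *v2.CiliumNode, n *ipam.Node, m *InstancesManager) *Node {
	return &Node{
		k8sObj:     obj,
		manager:    m,
		node:       n,
		instanceID: n.InstanceID(),
	}
}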
Indeed, this is much cleaner and future-proof. Thanks for the guidance on this @christarazi 🙇!
Thanks for the 👀 feedback @christarazi! Can you please double-check a couple of my subsequent comments? 🙇
Looks much better! Could you separate out the instanceID change into a separate commit? The log msg should be in its own commit. This way we have clear logical commits implementing independent behavior.
Awesome! I have 2 more pieces of feedback:
- Minor fixup to the log msg
- Could you update the commit msg of the second commit to elaborate a bit more on the deadlock issue? Maybe describe that in a particular function (I forgot the name of it), a write lock was obtained and, while it was held, calling n.loggerLocked() would call n.node.InstanceID(), which then attempts to take a read lock to extract the instanceID.
🎉
LGTM, thanks for the PR! Minor feedback:
- Minor nit below
- In the 2nd commit msg, I'd avoid referencing exact line numbers as they'll get out of date quickly. Referencing the function, as you have already done, is good enough. :)
When instance limits are not found, we will now advise the end user to set --update-ec2-adapter-limit-via-api=true to allow the operator to dynamically fetch the limits from the EC2 API.
Signed-off-by: Maxime VISONNEAU <[email protected]>
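A rough, self-contained sketch of the kind of advisory message that commit describes (the helper name and fields are illustrative assumptions, not the merged code; only the flag name comes from the commit message above):

package main

import "github.com/sirupsen/logrus"

// warnMissingLimits sketches the advisory described in the commit message:
// when ENI limits for an instance type are unknown, point the user at the
// --update-ec2-adapter-limit-via-api flag instead of silently hanging.
func warnMissingLimits(scopedLog *logrus.Entry, instanceType string) {
	scopedLog.WithField("instanceType", instanceType).Warning(
		"Unable to determine ENI limits for this instance type; set " +
			"--update-ec2-adapter-limit-via-api=true to allow the operator " +
			"to dynamically fetch the limits from the EC2 API")
}

func main() {
	warnMissingLimits(logrus.WithField("subsys", "example"), "m5.unknown")
}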
This change ensures that we do not end up with a deadlock of the goroutines when the EC2 instance limits are not available.
The initial lock is taken in pkg/ipam/node.go, where "n.mutex.Lock()" is acquired in "func (n *Node) recalculate()". If the EC2 instance type limits cannot be determined, in cilium/pkg/aws/eni/node.go a call to "n.loggerLocked()" is made, which subsequently calls "n.node.InstanceID()". However, this function also expects a lock over the same mutex ("n.mutex.RLock()" in pkg/ipam/node.go), leading to a deadlock of the goroutine and preventing the operator from functioning normally for any instance in the cluster.
Fixes cilium#14802
Signed-off-by: Maxime VISONNEAU <[email protected]>
🚀
test-me-please
Only one failure, the k8s-1.20-kernel-4.9 job, which failed on a test that doesn't make sense to fail given the changes. It must be a flake, so I filed #15103 to follow up. No need to block this PR on flaky CI.
Fixes: #14802
With the current implementation, there is a deadlock which causes the operator to completely hang if one of the EC2 instances of the cluster has a type for which the operator cannot determine the limits. This PR resolves the situation by removing the attempt to take a read lock over the node mutex, which is already write-locked at that point.
Here is the pprof trace of the hanging goroutine:
As discussed with @christarazi in the related issue, I also took this opportunity to improve the associated logging!