ipam/aws: fixed a bug causing the operator to hang indefinitely when the ENI limits for an instance type could not be determined #14905
Conversation
Thanks for the PR! The core idea seems good to me; however, I have some comments on how it's implemented.
Also, the commit msg seems better suited as a release note. The commit msg should briefly describe the code change, to conform to https://docs.cilium.io/en/v1.9/contributing/development/contributing_guide/#submitting-a-pull-request. Concretely, the commit msg could be:
ipam/aws: Provide mitigation if ENI limits not found
The elaboration that it fixes a bug would be better moved into the commit msg body for context.
Feel free to add to / modify it as you see fit as well.
pkg/aws/eni/node.go
Outdated
@@ -494,9 +495,9 @@ func (n *Node) ResyncInterfacesAndIPs(ctx context.Context, scopedLog *logrus.Ent
 // GetMaximumAllocatableIPv4 returns the maximum amount of IPv4 addresses
 // that can be allocated to the instance
-func (n *Node) GetMaximumAllocatableIPv4() int {
+func (n *Node) GetMaximumAllocatableIPv4(scopedLog *logrus.Entry) int {
Why was this needed? It seems to me that we can already access the logger via n.loggerLocked()
Plus, this modifies the interface of these functions for no concrete benefit.
So this is exactly where the whole situation resides. By calling n.loggerLocked() within this scope, the function gets blocked: it calls n.node.InstanceID(), which requires a read lock over the node mutex that is already write-locked at this point.
Do you know of a better way to work around this issue?
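To illustrate the pattern being described, here is a minimal standalone Go sketch (hypothetical names, not the actual Cilium code) of a goroutine that holds the write lock and then calls a helper which tries to read-lock the same sync.RWMutex:

package main

import "sync"

type node struct {
	mutex      sync.RWMutex
	instanceID string
}

// getInstanceID mirrors a getter like n.node.InstanceID(), which takes a read lock.
func (n *node) getInstanceID() string {
	n.mutex.RLock()
	defer n.mutex.RUnlock()
	return n.instanceID
}

// recalculate mirrors a function that holds the write lock and then,
// indirectly (e.g. via a scoped logger), calls the read-locking getter.
func (n *node) recalculate() {
	n.mutex.Lock()
	defer n.mutex.Unlock()

	// sync.RWMutex is not reentrant: RLock() blocks while the write lock is
	// held, and the write lock is held by this very goroutine, so this call
	// never returns.
	_ = n.getInstanceID()
}

func main() {
	n := &node{instanceID: "i-0123456789abcdef0"}
	n.recalculate() // hangs; the Go runtime reports "all goroutines are asleep - deadlock!"
}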
Are you saying that you found a deadlock in the prior code and this change fixes it? Because the prior code is still using loggerLocked()
yes, absolutely!
Ah, I see! I didn't notice that the driver behind this PR is this deadlock. Nice catch!
However, I still think we can do better here than to pollute the interface with a logger; it shouldn't need to be passed around, and the fact that we rely on it to avoid a deadlock is a very strong indicator that we need to rethink the flow and dependencies of this code.
Here's my proposal:
Let's add instanceID string as a field to the pkg/aws/eni.Node struct. This would allow us to avoid depending on n.node.InstanceID() as was done previously (which evidently cannot be called when n.node is already locked).
We can initialize the field when we call CreateNode, and we can rely on it being set because the instance ID is always set when the resource is created.
Something like:
func (m *InstancesManager) CreateNode(obj *v2.CiliumNode, n *ipam.Node) ipam.NodeOperations {
	return &Node{k8sObj: obj, manager: m, node: n, instanceID: n.InstanceID()}
}
Or feel free to create a constructor for Node.
If you think this is a good idea, then please implement in a separate commit. WDYT?
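For reference, the constructor variant could look roughly like this (a sketch only, reusing the types from the snippet above; the function name NewNode is an assumption, not the merged code):

// NewNode captures the instance ID once at construction time, so later code
// can log it without having to re-acquire the ipam.Node lock.
func NewNode(obj *v2.CiliumNode, n *ipam.Node, m *InstancesManager) *Node {
	return &Node{
		k8sObj:     obj,
		manager:    m,
		node:       n,
		instanceID: n.InstanceID(),
	}
}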
Indeed, this is much cleaner and future-proof. Thanks for the guidance on this @christarazi 🙇!
Thanks for the 👀 feedback @christarazi! Can you please double-check a couple of my subsequent comments? 🙇
Looks much better! Could you separate out the instanceID change into a separate commit? The log msg should be in its own commit. This way we have clear logical commits implementing independent behavior.
Awesome! I have 2 more pieces of feedback:
- Minor fixup to the log msg
- Could you update the commit msg of the second commit to elaborate a bit more on the deadlock issue? Maybe describe that in a particular function (I forgot the name of it), a write lock was obtained and, while it was held, calling n.loggerLocked() would call n.node.InstanceID(), which then attempts to take a read lock to extract the instanceID.
🎉
LGTM, thanks for the PR! Minor feedback:
- Minor nit below
- In the 2nd commit msg, I'd avoid referencing exact line numbers as they'll get out of date quickly. Referencing the function, as you have already done, is good enough. :)
When instance limits are not found, we will now advise the end user to set --update-ec2-adapter-limit-via-api=true to allow the operator to dynamically fetch the limits from the EC2 API.
Signed-off-by: Maxime VISONNEAU <[email protected]>
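A rough, self-contained sketch of the kind of advisory message that commit describes (the helper name and fields are illustrative assumptions, not the merged code; only the flag name comes from the commit message above):

package main

import "github.com/sirupsen/logrus"

// warnMissingLimits sketches the advisory described in the commit message:
// when ENI limits for an instance type are unknown, point the user at the
// --update-ec2-adapter-limit-via-api flag instead of silently hanging.
func warnMissingLimits(scopedLog *logrus.Entry, instanceType string) {
	scopedLog.WithField("instanceType", instanceType).Warning(
		"Unable to determine ENI limits for this instance type; set " +
			"--update-ec2-adapter-limit-via-api=true to allow the operator " +
			"to dynamically fetch the limits from the EC2 API")
}

func main() {
	warnMissingLimits(logrus.WithField("subsys", "example"), "m5.unknown")
}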
This change ensures that we do not end up with a deadlock of the goroutines when the EC2 instance limits are not available.
The initial lock is taken in pkg/ipam/node.go, where "n.mutex.Lock()" is acquired in "func (n *Node) recalculate()". If the EC2 instance type limits cannot be determined, in cilium/pkg/aws/eni/node.go a call to "n.loggerLocked()" is made, which subsequently calls "n.node.InstanceID()". However, this function also expects a lock over the same mutex ("n.mutex.RLock()" in pkg/ipam/node.go), leading to a deadlock of the goroutine and preventing the operator from functioning normally for any instance in the cluster.
Fixes cilium#14802
Signed-off-by: Maxime VISONNEAU <[email protected]>
🚀
test-me-please
Only one failure, the k8s-1.20-kernel-4.9 job, which failed on a test that doesn't make sense to fail given the changes. It must be a flake, so I filed #15103 to follow up. No need to block this PR on flaky CI.
Fixes: #14802
With the current implementation, there is a deadlock which causes the operator to completely hang if one of the EC2 instances of the cluster has a type for which the operator cannot determine the limits. This PR resolves the situation by removing the attempt to take a read lock over the node mutex, which is already write-locked at that point.
Here is the pprof trace of the hanging goroutine:
As discussed with @christarazi in the related issue, I also took this opportunity to improve the associated logging!