
ipam/aws: fixed a bug causing the operator to hang indefinitely when the ENI limits for an instance type could not be determined #14905

Merged

Conversation

mvisonneau
Contributor

Fixes: #14802

With the current implementation, there is a deadlock that causes the operator to hang completely if one of the cluster's EC2 instances has a type for which the operator cannot determine the limits. This PR resolves the situation by removing the attempt to take a read lock on the node mutex, which is already write-locked at that point.

Here is the pprof trace of the hanging goroutine:

1 @ 0x43a4a5 0x44afc5 0x44afae 0x46c0e7 0x25c769a 0x25c75e2 0x2649a7b 0x264da08 0x25c7302 0x25c7a6c 0x25c7758 0x25ce2bc 0x25cc525 0x25cc375 0x2660af1 0x19ec99a 0x1b1b61f 0x19dece2 0x19dc3a2 0x1116ebf 0x1115ced 0x1115c18 0x19dc105 0x19dc0a6 0x46fe81
#	0x46c0e6	sync.runtime_SemacquireMutex+0x46						/usr/local/go/src/runtime/sema.go:71
#	0x25c7699	sync.(*RWMutex).RLock+0xd9							/usr/local/go/src/sync/rwmutex.go:50
#	0x25c75e1	github.com/cilium/cilium/pkg/ipam.(*Node).InstanceID+0x21			/go/src/github.com/cilium/cilium/pkg/ipam/node.go:311
#	0x2649a7a	github.com/cilium/cilium/pkg/aws/eni.(*Node).loggerLocked+0x7a			/go/src/github.com/cilium/cilium/pkg/aws/eni/node.go:76
#	0x264da07	github.com/cilium/cilium/pkg/aws/eni.(*Node).GetMaximumAllocatableIPv4+0x67	/go/src/github.com/cilium/cilium/pkg/aws/eni/node.go:503
#	0x25c7301	github.com/cilium/cilium/pkg/ipam.(*Node).getMaxAllocate+0x41			/go/src/github.com/cilium/cilium/pkg/ipam/node.go:222
#	0x25c7a6b	github.com/cilium/cilium/pkg/ipam.(*Node).recalculate+0x18b			/go/src/github.com/cilium/cilium/pkg/ipam/node.go:378
#	0x25c7757	github.com/cilium/cilium/pkg/ipam.(*Node).UpdatedResource+0x97			/go/src/github.com/cilium/cilium/pkg/ipam/node.go:337
#	0x25ce2bb	github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update.func1+0x5b		/go/src/github.com/cilium/cilium/pkg/ipam/node_manager.go:263
#	0x25cc524	github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update+0x184			/go/src/github.com/cilium/cilium/pkg/ipam/node_manager.go:321
#	0x25cc374	github.com/cilium/cilium/pkg/ipam.(*NodeManager).Create+0x34			/go/src/github.com/cilium/cilium/pkg/ipam/node_manager.go:251
#	0x2660af0	main.startSynchronizingCiliumNodes.func1+0x70					/go/src/github.com/cilium/cilium/operator/cilium_node.go:59
#	0x19ec999	k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd+0x59		/go/src/github.com/cilium/cilium/vendor/k8s.io/client-go/tools/cache/controller.go:231
#	0x1b1b61e	github.com/cilium/cilium/pkg/k8s/informer.NewInformerWithStore.func1+0x25e	/go/src/github.com/cilium/cilium/pkg/k8s/informer/informer.go:119

As discussed with @christarazi in the related issue, I also took this opportunity to improve the associated logging!
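
For illustration, here is a minimal, self-contained Go sketch of the locking pattern behind the hang. The type and method names are simplified stand-ins (the real code spans pkg/ipam and pkg/aws/eni), not the actual Cilium implementation: a goroutine that already holds the write lock on a sync.RWMutex calls a helper that takes the read lock on the same mutex, and because Go's RWMutex is not reentrant, it blocks forever.

package main

import "sync"

// node is a simplified stand-in for the ipam node that owns the mutex.
type node struct {
	mutex      sync.RWMutex
	instanceID string
}

// InstanceID mirrors ipam.(*Node).InstanceID: it takes the read lock.
func (n *node) InstanceID() string {
	n.mutex.RLock() // blocks while a writer holds the lock
	defer n.mutex.RUnlock()
	return n.instanceID
}

// recalculate mirrors ipam.(*Node).recalculate: it takes the write lock and,
// on the limits-not-found logging path, ends up calling InstanceID again.
func (n *node) recalculate() {
	n.mutex.Lock()
	defer n.mutex.Unlock()
	_ = n.InstanceID() // read-locking a mutex this goroutine already write-locked: deadlock
}

func main() {
	n := &node{instanceID: "i-0123456789abcdef0"}
	// In the operator this goroutine simply hangs; in this single-goroutine
	// sketch the Go runtime aborts with "all goroutines are asleep - deadlock!".
	n.recalculate()
}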


Member

@christarazi christarazi left a comment

Thanks for the PR! The core idea of the PR seems good to me; I have comments on how it's implemented, however.

Also, the commit msg seems to be better suited as a release note. The commit msg should briefly describe the code change, to conform to https://docs.cilium.io/en/v1.9/contributing/development/contributing_guide/#submitting-a-pull-request. Concretely, the commit msg can be:

ipam/aws: Provide mitigation if ENI limits not found

Elaborating that it fixes a bug would be better placed in the commit message context instead.

Feel free to add / modify it as you see fit as well.

@@ -494,9 +495,9 @@ func (n *Node) ResyncInterfacesAndIPs(ctx context.Context, scopedLog *logrus.Ent

 // GetMaximumAllocatableIPv4 returns the maximum amount of IPv4 addresses
 // that can be allocated to the instance
-func (n *Node) GetMaximumAllocatableIPv4() int {
+func (n *Node) GetMaximumAllocatableIPv4(scopedLog *logrus.Entry) int {
Member

Why was this needed? It seems to me that we can already access the logger via n.loggerLocked()

Member

Plus, this modifies the functions' interface for no concrete benefit

Contributor Author

So this is exactly where the whole situation resides. By calling n.loggerLocked() within this scope, the function blocks: loggerLocked() calls n.node.InstanceID(), which requires a read lock on the node mutex that is already write-locked at this point.

https://github.com/cilium/cilium/blob/e2676483a51b7b089cd6a476b83038b9a1ec6169/pkg/aws/eni/node.go#L76

https://github.com/cilium/cilium/blob/e2676483a51b7b089cd6a476b83038b9a1ec6169/pkg/ipam/node.go#L311

https://github.com/cilium/cilium/blob/e2676483a51b7b089cd6a476b83038b9a1ec6169/pkg/ipam/node.go#L364

Do you know if there is a better way to work around this issue?

Member

Are you saying that you found a deadlock in the prior code and this change fixes it? Because the prior code still uses loggerLocked()

Contributor Author

yes, absolutely!

Member

@christarazi christarazi Feb 16, 2021

Ah, I see! I didn't notice that the driver behind this PR is this deadlock. Nice catch!

However, I still think we can do better here than polluting the interface with a logger; it shouldn't need to be passed around, and the fact that we rely on it to avoid a deadlock is a very strong indicator that we need to rethink the flow and dependencies of this code.

Here's my proposal:

Let's add instanceID string as a field to the pkg/aws/eni.Node struct. This would allow us to avoid depending on n.node.InstanceID(), as was done previously (and which evidently cannot be called while n.node is already locked).

We can initialize the field when we call CreateNode, and we can rely on this to be set because the instance ID is always set when the resource is created.

Something like:

func (m *InstancesManager) CreateNode(obj *v2.CiliumNode, n *ipam.Node) ipam.NodeOperations {
	return &Node{k8sObj: obj, manager: m, node: n, instanceID: n.InstanceID()}
}

or feel free to create a constructor for Node.
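
For instance, a minimal sketch of such a constructor could look like the following; the NewNode name is just a suggestion, and the field set is inferred from the CreateNode snippet above rather than the final implementation:

// NewNode builds an eni.Node and caches the instance ID up front, so later
// callers never need to read-lock the underlying ipam.Node to retrieve it.
func NewNode(obj *v2.CiliumNode, n *ipam.Node, m *InstancesManager) *Node {
	return &Node{
		k8sObj:     obj,
		manager:    m,
		node:       n,
		instanceID: n.InstanceID(),
	}
}

CreateNode would then simply return NewNode(obj, n, m).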

If you think this is a good idea, then please implement in a separate commit. WDYT?

Contributor Author

Indeed, this is much cleaner and more future-proof. Thanks for the guidance on this, @christarazi 🙇!

Contributor Author

@mvisonneau mvisonneau left a comment

Thanks for the 👀 feedback, @christarazi! Can you please double-check a couple of my subsequent comments? 🙇

@joestringer joestringer added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Feb 11, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 11, 2021
@mvisonneau mvisonneau force-pushed the ipam_invalid_ec2_limits_hang_fix branch 3 times, most recently from 1525fdc to 2c1da7f on February 16, 2021 10:29
@mvisonneau mvisonneau requested a review from a team as a code owner February 16, 2021 10:29
@mvisonneau mvisonneau requested review from christarazi and removed request for a team February 16, 2021 10:32
@mvisonneau mvisonneau force-pushed the ipam_invalid_ec2_limits_hang_fix branch from 2c1da7f to 22f21fa on February 16, 2021 10:33
Member

@christarazi christarazi left a comment

Looks much better! Could you separate out the instanceID change into a separate commit? The log msg should be in its own commit. This way we have clear, logical commits implementing independent behavior.

@mvisonneau mvisonneau force-pushed the ipam_invalid_ec2_limits_hang_fix branch 2 times, most recently from 5aac0ad to 2e47ceb on February 17, 2021 11:25
Member

@christarazi christarazi left a comment

Awesome! I have 2 more pieces of feedback:

  • Minor fixup to the log msg
  • Could you update the commit msg of the second commit to elaborate a bit more on the deadlock issue? Maybe describe that in a particular function (I forgot the name of it), a write lock was obtained and, while it was held, calling n.loggerLocked() would call n.node.InstanceID(), which then attempts to take a read lock to extract the instanceID.

@mvisonneau mvisonneau force-pushed the ipam_invalid_ec2_limits_hang_fix branch from 2e47ceb to 87c2d70 on February 17, 2021 20:36
Member

@christarazi christarazi left a comment

🎉

LGTM, thanks for the PR! Minor feedback:

  • Minor nit below
  • In the 2nd commit msg, I'd avoid referencing exact line numbers as they'll get out of date quickly. Referencing the function, as you have already done, is good enough. :)

When instance limits are not found, we will now advise the end user to
set --update-ec2-adapter-limit-via-api=true to allow the operator to
dynamically fetch the limits from the EC2 API.

Signed-off-by: Maxime VISONNEAU <[email protected]>

This change ensures that we do not end up with a goroutine deadlock
when the EC2 instance limits are not available.

The initial lock is taken in pkg/ipam/node.go, where "n.mutex.Lock()"
is acquired in "func (n *Node) recalculate()".

If the EC2 instance type limits cannot be determined, a call to
"n.loggerLocked()" is made in pkg/aws/eni/node.go, which subsequently
calls "n.node.InstanceID()".

However, this function also expects to take a lock on the same mutex:
"n.mutex.RLock()" in pkg/ipam/node.go.

This leads to a deadlock of the goroutine, preventing the operator
from functioning nominally for any instance in the cluster.

Fixes cilium#14802

Signed-off-by: Maxime VISONNEAU <[email protected]>
@mvisonneau mvisonneau force-pushed the ipam_invalid_ec2_limits_hang_fix branch from 87c2d70 to 3cfde64 on February 17, 2021 20:56
Member

@christarazi christarazi left a comment

🚀

@aanm
Member

aanm commented Feb 18, 2021

test-me-please

@joestringer
Member

Only one failure: the k8s-1.20-kernel-4.9 job, which failed on a test that doesn't make sense to fail given the changes. It must be a flake, so I filed #15103 to follow up. No need to block this PR on flaky CI.

@joestringer joestringer merged commit 440c896 into cilium:master Feb 24, 2021
@pchaigno pchaigno added area/eni Impacts ENI based IPAM. sig/ipam IP address management, including cloud IPAM labels Feb 26, 2021
@mvisonneau mvisonneau deleted the ipam_invalid_ec2_limits_hang_fix branch February 1, 2022 16:59
Labels
  • area/eni: Impacts ENI based IPAM.
  • release-note/bug: This PR fixes an issue in a previous release of Cilium.
  • sig/ipam: IP address management, including cloud IPAM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ipam/aws/eni: IP allocation silently failing when using "unsupported" instance types
6 participants