Skip to content

cluster: unstuck cluster manager when update the initializing cluster#13875

Merged
mattklein123 merged 20 commits intoenvoyproxy:masterfrom
lambdai:warmingclusterfix
Nov 9, 2020
Merged

cluster: unstuck cluster manager when update the initializing cluster#13875
mattklein123 merged 20 commits intoenvoyproxy:masterfrom
lambdai:warmingclusterfix

Conversation

@lambdai
Copy link
Contributor

@lambdai lambdai commented Nov 3, 2020

Commit Message:
Remove existing initializing cluster when updating secondary clusters.

  • Return existing cluster in ClusterManagerImp::loadCluster() to delay
    the destroy of existing cluster.
  • Remove the existing cluster from the secondary cluster list to accounting
    the warming secondary cluster correctly.

Additional Description:
Risk Level: LOW
Testing: integration test
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
fix #13874

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@lambdai lambdai changed the title Warmingclusterfix cluster: unstuck cluster manager when update the initializing cluster Nov 3, 2020
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@lambdai
Copy link
Contributor Author

lambdai commented Nov 3, 2020

The new test AdsClusterV3Test.ClusterUpdateWhenWarming is failing on CI but I cannot reproduce locally

Is it another #11877 ? It seems CDS and EDS update on Envoy are concurrent executing

@lizan lizan self-assigned this Nov 3, 2020
@mattklein123
Copy link
Member

The PR that is being reverted is causing production issues, so let's land that revert, and then work on re-applying with better tests? cc @Shikugawa

/wait

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@lambdai
Copy link
Contributor Author

lambdai commented Nov 3, 2020

linux_x64 release test passed!

@mattklein123
Copy link
Member

Sorry is this still valid after reverting the other PR? Or should I wait until that issue is resolved?

/wait-any

@lambdai
Copy link
Contributor Author

lambdai commented Nov 4, 2020

@mattklein123
There are two PR
#12783 and #13344

My fix is based on #12783 which is not revered.

BTW: I just realized #13344 is the only one reverted but I suspect the problem is introduced in #12783

@lambdai
Copy link
Contributor Author

lambdai commented Nov 4, 2020

@mattklein123
There are two PR
#12783 and #13344

My fix is based on #12783 which is not revered.

BTW: I just realized #13344 is the only one reverted but I suspect the problem is introduced in #12783

Because the major risky change in #13344 is protected by the default off runtime key.

See https://github.com/envoyproxy/envoy/pull/13886/files#diff-c904f9b0d49cdd938ffaa952192372529415afa8602d7dc6acb904e61111d8a5L433-L438

@lizan
Copy link
Member

lizan commented Nov 4, 2020

Based on my conversation with @rgs1 the crash that they are seeing is likely due to #13344.

@rgs1
Copy link
Member

rgs1 commented Nov 4, 2020

I can try this one tomorrow, we should spot the problem immediately.

@rgs1
Copy link
Member

rgs1 commented Nov 4, 2020

I can try this one tomorrow, we should spot the problem immediately.

That is, on top of master with #13344 already reverted.

@rgs1
Copy link
Member

rgs1 commented Nov 4, 2020

I can try this one tomorrow, we should spot the problem immediately.

That is, on top of master with #13344 already reverted.

Oh nvm, this is not for the crash.

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks in general this makes sense to me with a few small comments.

/wait

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
…erfix

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks LGTM pending potentially a better integration test for the primary case. Can you also merge main to pick up the CI fix?

/wait

Comment on lines +3152 to +3154
// The override cluster is added. Manually drop the previous cluster. In production flow this is
// achieved by ClusterManagerImpl.
sds.reset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add a cluster manager test of this case that uses "real" DNS clusters? I think the main sequence that can actually happen here is:

  1. CDS adds DNS cluster
  2. DNS cluster begins init
  3. CDS updates DNS cluster?

I think you could also pretty easily test this in your integration test below by doing:

  1. Have CDS deliver a DNS cluster configured with health checking.
  2. DNS cluster won't initialize pending health checking.
  3. Update DNS cluster

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the DNS with health check!

After a quick search I didn't find an existing example but this test case
TEST_F(ClusterManagerImplTest, RemoveWarmingCluster) seem putting a defaultStaticCluster("fake_cluster") in warm.

Writing a test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A slight difference is that DNS cluster is always depending on resolver.

I modified the cluster to STATIC type to make it immediate ready.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this all looks good, but can you add a real integration test for this in ADS integration test? I think it's really easy based on what you already have. Just use a static cluster with health checking, it will be stuck initializing, then you can update it?

/wait

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the impression that static cluster with bad health check doesn't block initialization...
Yeah, adding integrate test to confirm initialization and the behavior health check behavior. Will udpate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Also confirmed that failed health check in static cluster also turn the warm cluster to ready.
And STRICT DNS resolve failure has the same scary side effect.

I end up not accept the tcp connection but not write back to hold the DNS request.

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
}
}

AssertionResult compareSets(const std::set<std::string>& set1, const std::set<std::string>& set2,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this function up so that more functions defined in this file can call it.

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thank you!

@mattklein123 mattklein123 merged commit 7a15e5d into envoyproxy:master Nov 9, 2020
@lambdai lambdai deleted the warmingclusterfix branch November 9, 2020 20:25
lambdai added a commit to lambdai/envoy-dai that referenced this pull request Nov 10, 2020
istio-testing pushed a commit to istio/envoy that referenced this pull request Nov 11, 2020
* cluster manager: avoid immediate activation for dynamic inserted cluster when initialize (envoyproxy#12783)

Signed-off-by: Shikugawa <rei@tetrate.io>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* cluster: unstuck cluster manager when update the initializing cluster (envoyproxy#13875)

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

Co-authored-by: Rei Shimizu <rei@tetrate.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cluster initialization issue. Envoy start could be blocked forever

4 participants