Make wait-for-cluster logic more robust #676
Conversation
Thanks for the PR! 🚀
Thanks for the PR. This makes sense, although I am curious: the provider should ensure that we are always past the PROVISIONING state. Did you actually encounter a case where Terraform exited successfully but the cluster was still in the PROVISIONING state?
If the initial cluster node pool is removed, the cluster will report RUNNING for a second and then go back to RECONCILING while the node pool is removed. This might cause issues as well.
We wait for the new node pool to be provisioned before running the wait-for-cluster check. Does that help?
You are right, my bad. I reused the same wait-for-cluster logic in a different module of mine, mixed the two up, and thought I had encountered this error in the GKE module. It works well: it only starts checking after the node pool is created. I still think the PR is worth considering, because it waits for the cluster to be RUNNING rather than waiting for it to not be RECONCILING. At least to me, that makes more sense.
LGTM
/cc @morgante
This logic looks more robust, though I see tests are failing?
@bharathkkb Sure, I’ll do it today :)
The status is PROVISIONING at the beginning, thus the original logic will always exit and say "Cluster is ready!". If we wait for the RUNNING status, we will know for sure the cluster is up.
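For illustration, the wait-for-RUNNING approach being discussed might look like the sketch below. This is a hedged sketch, not the module's actual script: the function names, argument order, retry count, and polling interval are all my assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the wait-for-cluster logic discussed above; names,
# flags, and intervals are assumptions, not the module's actual script.
set -u

# Factored into a function so it can be stubbed out for testing; the real
# check would shell out to gcloud like this.
get_cluster_status() {
  gcloud container clusters describe "$1" --zone "$2" --format 'value(status)'
}

# Poll until the cluster reports RUNNING (the behavior this PR proposes),
# instead of exiting as soon as the status is not RECONCILING (the old
# logic, which passes while the cluster is still PROVISIONING).
wait_for_cluster() {
  local cluster="$1" zone="$2" status attempt
  for attempt in $(seq 1 90); do
    status="$(get_cluster_status "$cluster" "$zone")"
    if [ "$status" = "RUNNING" ]; then
      echo "Cluster is ready!"
      return 0
    fi
    echo "Attempt ${attempt}: cluster status is ${status}; waiting..."
    sleep 10
  done
  echo "Timed out waiting for cluster to reach RUNNING" >&2
  return 1
}
```

With this shape, a status of PROVISIONING simply causes another polling round rather than an early "Cluster is ready!" exit.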
/gcbrun
I don't know if it's somehow just me, but this change to make the script more robust breaks functionality that worked on Terraform Cloud. I've tried a variety of workarounds, including "skip_provisioners", and nothing worked. Looking at the changes again, I may be confused: the gcloud requirement already existed at this point. Apologies... And the version that fails:
The specific line that changed and has caused an infinite loop is:

I did try setting "skip_provisioners = true" in the private cluster module, but it still called

Additional reference: https://xkcd.com/1172/
@greg-bumped gcloud is a dependency, your issue isn't related to this change. Please review the update notes about how to handle forcing the install of gcloud: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/master/docs/upgrading_to_v12.0.md#dropped-support-for-gcloud_skip_download-variable |
The other modules seem to interact with gcloud just fine, but this script, for some reason beyond me, doesn't seem to run in a shell that has access to it. And the nature of Terraform Cloud isn't such that I can remote in for diagnostics to see why. This issue was swallowed previously, but this fix seems to have brought it to the forefront, and I've had to pin to a previous version.

How could it deploy clusters previously, yet this script causes an issue? Does that still mean the gcloud installation is the problem? It gets the service account credentials from an environment variable for the other infrastructure being deployed. I apologize for my lack of experience in this regard... When I did give the script access to gcloud, it generated a different set of errors:

Is this a bug in Terraform Cloud? I can't emphasize enough that this script is the only failure point. Thanks for your time.
@greg-bumped This script uses You need to do two things:
GOOGLE_APPLICATION_CREDENTIALS is set; otherwise none of the infrastructure would be deployed.
Everything else works. Wasn't gcloud always a requirement? I don't think anything would deploy in the situation above if the two suggestions you made weren't already moot. The authentication issue seems unique to this script: even when I set GCLOUD_TF_DOWNLOAD, the version of gcloud the script runs doesn't seem to leverage the GOOGLE_APPLICATION_CREDENTIALS environment variable, whereas the rest of the Terraform scripting does. I hope I'm making sense; I can tell I wasn't being clear earlier. I should have mentioned explicitly that I had already done those steps, and that's what caused the auth errors. I don't think Terraform would get past the first line of configuration without the credentials set, since it uses them to read state, doesn't it? How could it even reach the script at all if they weren't configured?
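One detail that may explain the symptom above: the Terraform Google provider reads GOOGLE_APPLICATION_CREDENTIALS directly, while gcloud keeps its own credential store, so a script run from a provisioner can see the variable and still be unauthenticated. A hedged sketch of a workaround — the helper name is made up, though `gcloud auth activate-service-account` is a real gcloud command:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the module: hand the service-account key
# that Terraform already uses over to gcloud's own credential store.
activate_gcloud_from_adc() {
  if [ -n "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
    gcloud auth activate-service-account \
      --key-file="${GOOGLE_APPLICATION_CREDENTIALS}"
  else
    echo "GOOGLE_APPLICATION_CREDENTIALS is not set; skipping gcloud auth" >&2
    return 1
  fi
}
```

Calling something like this before the wait-for-cluster check would let gcloud use the same key as the provider; whether that applies in Terraform Cloud's execution environment is an assumption.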
@greg-bumped We changed how gcloud is installed/used so that might be why. Terraform and Can you try setting the |
The status is PROVISIONING at the beginning, thus the original logic will always exit and say "Cluster is ready!". If we wait for the RUNNING status, we will know for sure the cluster is up. Co-authored-by: Bharath KKB <[email protected]>
Note: if the initial cluster node pool is removed it will say RUNNING for a second then go back to RECONCILING again while the node pool is removed. This might cause issues as well.
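If the brief RUNNING → RECONCILING flap noted above ever became a problem, one option (a sketch under my own assumptions, not the module's code) would be to require several consecutive RUNNING reads before declaring the cluster ready:

```shell
#!/usr/bin/env bash
# Sketch only: demand N consecutive RUNNING reads so a brief
# RUNNING -> RECONCILING flap (e.g. while the default node pool is being
# deleted) does not count as ready. Names and intervals are assumptions.
set -u

get_cluster_status() {
  gcloud container clusters describe "$1" --zone "$2" --format 'value(status)'
}

wait_for_stable_cluster() {
  local cluster="$1" zone="$2" needed="${3:-3}" streak=0 status
  for _ in $(seq 1 90); do
    status="$(get_cluster_status "$cluster" "$zone")"
    if [ "$status" = "RUNNING" ]; then
      streak=$((streak + 1))
      if [ "$streak" -ge "$needed" ]; then
        echo "Cluster is stable and ready!"
        return 0
      fi
    else
      streak=0  # any non-RUNNING read resets the run of successes
    fi
    sleep 10
  done
  echo "Timed out waiting for a stable RUNNING status" >&2
  return 1
}
```

The trade-off is a longer minimum wait (N polling intervals) in exchange for not declaring readiness during a transient RUNNING window.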