RHCOS via ibm_pi_instance timeout waiting for network #1620

Closed
yussufsh opened this issue Jul 1, 2020 · 19 comments

Comments

@yussufsh
Collaborator

yussufsh commented Jul 1, 2020

Hi team,

I am trying to ignite an RHCOS instance using the Power Systems resources. When creating the resource, it times out waiting for a network connection.

Console:
image

I am trying to configure a DHCP server to provide an address to the machine. For that to work, I need the network information of the instance, such as its IP and MAC address. Terraform won't let me fetch these details because the instance resource never completes. The instance stays in the 'Warning' state and never turns 'Active'.

Terraform Version

Terraform v0.12.20

Affected Resource(s)

Please list the resources as a list, for example:

  • ibm_pi_instance

Terraform Configuration Files

Debug Output

RHCOS node times out waiting for network.

Error: Failed to provision the instance

  on modules/4_nodes/nodes.tf line 104, in resource "ibm_pi_instance" "master":
 104: resource "ibm_pi_instance" "master" {

Panic Output

Expected Behavior

The resource needs to complete so that the network information can be read and fed to an internal DHCP server.
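
To illustrate what "fed to an internal DHCP server" could look like, here is a minimal sketch that turns each instance's name, MAC and IP into a static host entry. It assumes an ISC dhcpd-style configuration, which the issue does not specify; the `Node` type, the `dhcpHostEntries` function and the sample addresses are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Node holds the network details the issue wants to read back from each
// instance. The type and its field names are hypothetical.
type Node struct {
	Name string
	MAC  string
	IP   string
}

// dhcpHostEntries renders one ISC dhcpd-style host stanza per node.
// The choice of dhcpd syntax is an assumption; the issue only says "DHCP server".
func dhcpHostEntries(nodes []Node) string {
	var b strings.Builder
	for _, n := range nodes {
		fmt.Fprintf(&b, "host %s {\n  hardware ethernet %s;\n  fixed-address %s;\n}\n", n.Name, n.MAC, n.IP)
	}
	return b.String()
}

func main() {
	// Sample values for illustration only.
	nodes := []Node{
		{Name: "master-0", MAC: "fa:16:3e:00:00:01", IP: "192.168.25.11"},
	}
	fmt.Print(dhcpHostEntries(nodes))
}
```

This only works once Terraform can read the MAC and IP, which is why the resource has to finish provisioning first.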

Actual Behavior

Error: timeout while waiting for state to become 'OK, ACTIVE, OK' (last state: 'BUILD', timeout: 1h0m0s)

Steps to Reproduce

Important Factoids

References

@hkantare
Collaborator

hkantare commented Jul 2, 2020

@surajsub Can you have a look and update with your findings

@yussufsh
Collaborator Author

yussufsh commented Jul 2, 2020

One possible solution would be to offer an option to skip the health check for users who are OK with using the instance right away (in my case, to get the network details).

If the health check is not so important, could we do away with it? With it, the instance resource takes 20+ minutes to complete.

Without the health check, the instance resource completes in around 3-5 minutes, plus another 5-7 minutes until an SSH connection to the server is possible. Skipping the health check therefore also saves time.

@surajsub
Contributor

surajsub commented Jul 2, 2020

Setting the status to OK is important because other functions, like CPU mods, rely on the LPAR being in the OK state. Let me evaluate.

@surajsub
Contributor

surajsub commented Jul 2, 2020

I forgot to add: it will turn active when the LPAR is able to make a connection to NovaLink.

@yussufsh
Collaborator Author

yussufsh commented Jul 2, 2020

It is not working from the RHCOS point of view, where I would like the details. I am sure there could be other support scenarios. What I am saying is that the Terraform timeout could be avoided, since the instance status is Active but the health status is Warning.

I have tested the plugin by removing the health check from here and it works fine for both RHEL and RHCOS. I am not sure about the dependencies, though.

@surajsub
Contributor

surajsub commented Jul 2, 2020

Are you using cloud-init? If so, add these 3 lines to cloud-init in the right format:

  • [ '/opt/rsct/bin/rmcctrl', '-z' ]
  • [ '/opt/rsct/bin/rmcctrl', '-A' ]
  • [ '/opt/rsct/bin/rmcctrl', '-p' ]

Yes, and I have to take care of the dependencies :)

@yussufsh
Collaborator Author

yussufsh commented Jul 3, 2020

RHCOS works on Ignition config files.

@bpradipt

bpradipt commented Jul 3, 2020

@surajsub one of the challenges is that RHCOS (CoreOS) doesn't have rsct yet, and hence the health status shows Warning. We are trying to figure out the best way to fix it; however, it's a pretty long-term item :-(

In the meantime, would it be possible to evaluate whether we can introduce a flag to ignore the health status? It can default to false to retain the existing behaviour, while still giving users the flexibility to skip the check when desired.
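
As a rough sketch of the idea (not the provider's actual code), the wait loop could take a boolean that decides whether the health status is part of the readiness check. All names here (`instanceState`, `isReady`, `waitForInstance`, `ignoreHealth`) are hypothetical, and the status strings are taken from the messages quoted in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// instanceState is a hypothetical snapshot of what the API reports.
type instanceState struct {
	Status string // e.g. "BUILD", "ACTIVE"
	Health string // e.g. "WARNING", "OK"
}

// isReady reports whether provisioning can be treated as complete.
// With ignoreHealth false (the proposed default), both status and health
// must be good, which keeps the existing behaviour.
func isReady(s instanceState, ignoreHealth bool) bool {
	if s.Status != "ACTIVE" {
		return false
	}
	return ignoreHealth || s.Health == "OK"
}

// waitForInstance polls fetch until the instance is ready or attempts run out.
func waitForInstance(fetch func() (instanceState, error), ignoreHealth bool, attempts int, delay time.Duration) error {
	for i := 0; i < attempts; i++ {
		s, err := fetch()
		if err != nil {
			return err
		}
		if isReady(s, ignoreHealth) {
			return nil
		}
		time.Sleep(delay)
	}
	return errors.New("timeout while waiting for instance to become ready")
}

func main() {
	// Simulated RHCOS instance: active, but health stays WARNING without rsct.
	fetch := func() (instanceState, error) {
		return instanceState{Status: "ACTIVE", Health: "WARNING"}, nil
	}
	fmt.Println(waitForInstance(fetch, false, 3, time.Millisecond)) // times out
	fmt.Println(waitForInstance(fetch, true, 3, time.Millisecond))  // succeeds
}
```

Keeping the default at false means callers that depend on the OK state, such as the CPU changes mentioned earlier, would see no change unless they opt in.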

For our use case, we just want to get the allocated MAC and IP address of the instance so that we can use this information to create a private DHCP server for the instances.

I understand there is no easy way. I am just sharing a few thoughts to start the discussion and figure out a way forward.

@surajsub
Contributor

surajsub commented Jul 3, 2020

Absolutely. I'm working on a fix and testing it. Give me a day or two, please.

@surajsub
Contributor

surajsub commented Jul 7, 2020

Code fix is in. Waiting to be merged by the Cloud team

@bpradipt

bpradipt commented Jul 7, 2020

Code fix is in. Waiting to be merged by the Cloud team

Awesome. Thanks @surajsub

@surajsub
Contributor

surajsub commented Jul 7, 2020

Cool. Let me know if there are additional issues. Appreciate the patience

@yussufsh
Collaborator Author

yussufsh commented Jul 8, 2020

Thanks @surajsub for the patch. I have tried it today and it works perfectly for our use-case.

There is another issue I am facing after this fix. I am trying to create around 6 LPARs in parallel and get the error below for a random number of instances.

Error: Failed to provision unknown error (status 504): {resp:0xc000c801b0}

  on modules/4_nodes/nodes.tf line 93, in resource "ibm_pi_instance" "master":
  93: resource "ibm_pi_instance" "master" {

Error: Failed to provision unknown error (status 504): {resp:0xc000c80120}

  on modules/4_nodes/nodes.tf line 134, in resource "ibm_pi_instance" "worker":
 134: resource "ibm_pi_instance" "worker" {

But the instances are actually created and running in WARNING status. To re-create them, I have to delete them manually from the console, since Terraform does not pick them up.
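
One way to avoid the manual cleanup, sketched here purely as an illustration, would be to look the instance up by name when the create call fails and adopt it if it exists. The `client` interface, `createOrAdopt` and `fakeClient` are hypothetical and do not reflect the provider's real API.

```go
package main

import (
	"errors"
	"fmt"
)

// client is a hypothetical stand-in for the Power IaaS API.
type client interface {
	Create(name string) (id string, err error)
	FindByName(name string) (id string, found bool, err error)
}

// createOrAdopt calls Create and, if that fails, checks whether the
// instance exists anyway before reporting the error.
func createOrAdopt(c client, name string) (string, error) {
	id, err := c.Create(name)
	if err == nil {
		return id, nil
	}
	// The request may have timed out after the backend accepted it.
	existing, found, lookupErr := c.FindByName(name)
	if lookupErr != nil {
		return "", fmt.Errorf("create failed (%v) and lookup failed: %w", err, lookupErr)
	}
	if found {
		return existing, nil
	}
	return "", err
}

// fakeClient simulates a gateway timeout on create while the instance
// still gets built in the background.
type fakeClient struct{ instances map[string]string }

func (f *fakeClient) Create(name string) (string, error) {
	f.instances[name] = "id-" + name
	return "", errors.New("gateway timeout")
}

func (f *fakeClient) FindByName(name string) (string, bool, error) {
	id, ok := f.instances[name]
	return id, ok, nil
}

func main() {
	c := &fakeClient{instances: map[string]string{}}
	id, err := createOrAdopt(c, "master-0")
	fmt.Println(id, err)
}
```

This assumes instance names are unique within the cloud instance; if they are not, a lookup by name could adopt the wrong LPAR.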

Let me know if you want me to create another issue for this.

@yussufsh
Collaborator Author

yussufsh commented Jul 8, 2020

I have the actual error message printed after recompiling the code...

Error: Failed to provision the instance: Post "https://<region>.power-iaas.cloud.ibm.com/pcloud/v1/cloud-instances/<uuid>/pvm-instances": context deadline exceeded

@surajsub
Contributor

surajsub commented Jul 9, 2020

I have seen this issue happen with the cloud, and sometimes with PowerVC as well. When we provision multiple instances at once, PowerVC chokes at the back end and the cloud API responds with the "context deadline exceeded" message.
Which DC are you trying this in?
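
If the back end struggles with many simultaneous requests, one mitigation is to cap how many create calls are in flight at once. The sketch below shows that pattern with a buffered channel used as a semaphore; `provisionAll` and its parameters are hypothetical. Terraform's own `-parallelism=n` option on `terraform apply` applies the same idea to the whole run.

```go
package main

import (
	"fmt"
	"sync"
)

// provisionAll runs provision for every name, with at most limit calls
// in flight at any time, and returns the per-name result.
func provisionAll(names []string, limit int, provision func(string) error) map[string]error {
	if limit < 1 {
		limit = 1 // avoid blocking forever on an unbuffered semaphore
	}
	sem := make(chan struct{}, limit)
	var mu sync.Mutex
	var wg sync.WaitGroup
	results := make(map[string]error, len(names))
	for _, name := range names {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			err := provision(n)
			mu.Lock()
			results[n] = err
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	return results
}

func main() {
	names := []string{"master-0", "master-1", "master-2", "worker-0", "worker-1", "worker-2"}
	results := provisionAll(names, 2, func(name string) error {
		// A real implementation would issue the create request here.
		return nil
	})
	fmt.Println(len(results), "instances requested")
}
```

A lower limit trades total provisioning time for fewer concurrent requests against the API.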

@yussufsh
Collaborator Author

yussufsh commented Jul 9, 2020

Thanks for confirming. I am trying on "frankfurt1".

Just to add, I have tried multiple deployments on PowerVC with similar automation, where we create around 10 VMs in one go, and have never seen such an issue there. However, I have seen "context deadline exceeded" errors in one of the private OpenStack setups.

@bpradipt

bpradipt commented Jul 9, 2020

@yussufsh there will be other users provisioning in the same region as well, so it could very well be that we are hitting the limits.

@surajsub
Contributor

I presume the original issue has been resolved. Can you close this out, please?

@yussufsh
Collaborator Author

Thanks a lot for helping us get this done!
