RHCOS via ibm_pi_instance timeout waiting for network #1620

yussufsh · 2020-07-01T14:39:52Z

Hi team,

I am trying to ignite an RHCOS instance using the Power Systems resources. When creating the resource, it times out waiting for a network connection.

Console:

I am trying to configure a DHCP server to provide an address to the machine. For that to work, I need the network information of the instance, such as its IP and MAC address. Terraform won't let me fetch these details because the instance resource never completes. The instance stays in the 'Warning' state and never turns 'Active'.

Terraform Version

Terraform v0.12.20

Affected Resource(s)


ibm_pi_instance

Terraform Configuration Files

Debug Output

RHCOS node times out waiting for network.

Error: Failed to provision the instance

  on modules/4_nodes/nodes.tf line 104, in resource "ibm_pi_instance" "master":
 104: resource "ibm_pi_instance" "master" {

Panic Output

Expected Behavior

The resource needs to complete so that the network information can be read and fed to an internal DHCP server.
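To illustrate what "fed to an internal DHCP server" could look like once the instance attributes are readable, here is a minimal sketch that turns (name, MAC, IP) triples into static host entries. It assumes an ISC dhcpd-style configuration, which this issue does not specify; the `Node` type, the `render_dhcp_hosts` helper, and any node names or addresses passed to it are hypothetical.

```python
import ipaddress
import re
from typing import Iterable, NamedTuple


class Node(NamedTuple):
    """Network details read back from a provisioned instance (hypothetical type)."""
    name: str
    mac: str
    ip: str


_MAC_RE = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$")


def render_dhcp_hosts(nodes: Iterable[Node]) -> str:
    """Render one ISC dhcpd-style host stanza per node (assumed target format)."""
    stanzas = []
    for node in nodes:
        mac = node.mac.strip().lower()
        if not _MAC_RE.match(mac):
            raise ValueError(f"invalid MAC address for {node.name}: {node.mac!r}")
        # ip_address() raises ValueError if the string is not a valid IP.
        ip = ipaddress.ip_address(node.ip)
        stanzas.append(
            f"host {node.name} {{\n"
            f"  hardware ethernet {mac};\n"
            f"  fixed-address {ip};\n"
            f"}}"
        )
    return "\n".join(stanzas) + "\n"
```

Validating the MAC and IP before rendering keeps a half-provisioned instance, with empty network fields, from producing a broken DHCP configuration.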

Actual Behavior

Error: timeout while waiting for state to become 'OK, ACTIVE, OK' (last state: 'BUILD', timeout: 1h0m0s)

Steps to Reproduce

Important Factoids

References


hkantare · 2020-07-02T10:28:38Z

@surajsub Can you have a look and update with your findings

yussufsh · 2020-07-02T11:41:58Z

One possible solution would be to let users opt out of the health check if they are OK with using the instance right away (in my case, to get the network details).

If the health check is not so important, let's do away with it? With it, the instance resource takes 20+ minutes to complete.

Without the health check, the instance resource takes around 3-5 minutes to complete, plus another 5-7 minutes before an SSH connection to the server is possible. Not waiting for health checks also saves time.

surajsub · 2020-07-02T13:07:23Z

Setting the status to OK is important because other functions, like CPU mods, rely on the LPAR being in the OK state. Let me evaluate.

surajsub · 2020-07-02T13:58:29Z

I forgot to add: it will turn active when the LPAR is able to make a connection to NovaLink.

yussufsh · 2020-07-02T14:33:23Z

It is not working from the RHCOS point of view, where I need those details. I am sure there could be other support scenarios. What I am saying is that the TF timeout could be avoided, since the instance status is Active but the health status is Warning.

I have tested the plugin by removing the health check from here and it works fine for RHEL and RHCOS as well. Not sure about the dependencies.

surajsub · 2020-07-02T17:53:23Z

Are you using cloud-init? If so, add these 3 lines to cloud-init in the right format:

[ '/opt/rsct/bin/rmcctrl', '-z' ]
[ '/opt/rsct/bin/rmcctrl', '-A' ]
[ '/opt/rsct/bin/rmcctrl', '-p' ]

Yes, and I have to care about the dependencies :)

yussufsh · 2020-07-03T06:29:54Z

RHCOS works on Ignition config files.

bpradipt · 2020-07-03T12:29:57Z

@surajsub one of the challenges is that RHCOS (CoreOS) doesn't have rsct yet, and hence the health status shows Warning. We are trying to figure out the best possible way to fix it. However, it's a pretty long-term item :-(

In the meantime, would it be possible to evaluate whether we can introduce a flag to ignore the health status? The default can be set to false to retain the existing behaviour, while the flag gives the flexibility to use the resource in the desired fashion.

For our use case, we just want to get the allocated MAC and IP address of the instance so that we can use this info to create a private DHCP server for the instances.

I understand there is no easy way. Just sharing a few thoughts to start the discussion and figure out a way forward.
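To make the proposal concrete, the following is a rough sketch of a wait loop in which the caller chooses which health states count as "done". It is not the provider's actual code; `wait_for_instance`, `get_state`, and `accepted_health` are hypothetical names, and the status and health strings are taken from the messages in this thread.

```python
import time
from typing import Callable, Iterable, Tuple


def wait_for_instance(
    get_state: Callable[[], Tuple[str, str]],
    accepted_health: Iterable[str] = ("OK",),
    timeout_s: float = 3600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Poll until status is ACTIVE and health is one of accepted_health.

    get_state is a hypothetical callable returning (status, health).
    The default keeps the strict behaviour (health must be OK).
    """
    accepted = {h.upper() for h in accepted_health}
    waited = 0.0
    while True:
        status, health = get_state()
        if status.upper() == "ACTIVE" and health.upper() in accepted:
            return status, health
        if waited >= timeout_s:
            raise TimeoutError(
                f"last state: {status}/{health}, timeout: {timeout_s}s"
            )
        sleep(interval_s)
        waited += interval_s
```

With the default, an RHCOS instance without rsct would keep polling until the timeout. Passing `accepted_health=("OK", "WARNING")` returns as soon as the instance is Active, which is the opt-in behaviour suggested above.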

surajsub · 2020-07-03T20:04:04Z

Absolutely. I'm working on a fix and testing it. Give me a day or two, please.

surajsub · 2020-07-07T15:12:43Z

Code fix is in. Waiting to be merged by the Cloud team

bpradipt · 2020-07-07T16:24:28Z

> Code fix is in. Waiting to be merged by the Cloud team

Awesome. Thanks @surajsub

surajsub · 2020-07-07T16:30:38Z

Cool. Let me know if there are additional issues. Appreciate the patience

yussufsh · 2020-07-08T12:01:39Z

Thanks @surajsub for the patch. I have tried it today and it works perfectly for our use-case.

There is another issue I am facing after this fix. I am trying to create around 6 LPARs in parallel and am getting the error below for a random number of instances.

Error: Failed to provision unknown error (status 504): {resp:0xc000c801b0}

  on modules/4_nodes/nodes.tf line 93, in resource "ibm_pi_instance" "master":
  93: resource "ibm_pi_instance" "master" {

Error: Failed to provision unknown error (status 504): {resp:0xc000c80120}

  on modules/4_nodes/nodes.tf line 134, in resource "ibm_pi_instance" "worker":
 134: resource "ibm_pi_instance" "worker" {

But the instance is actually created and running in WARNING status. To re-create the instances, I have to delete them manually from the console, since TF does not pick them up.

Let me know if you want me to create another issue for this.

yussufsh · 2020-07-08T12:05:01Z

I have the actual error message printed after recompiling the code...

Error: Failed to provision the instance: Post "https://<region>.power-iaas.cloud.ibm.com/pcloud/v1/cloud-instances/<uuid>/pvm-instances": context deadline exceeded

surajsub · 2020-07-09T02:35:12Z

I have seen this issue happen with the cloud, and sometimes with PowerVC as well. It happens because, when we provision multiple instances, PowerVC chokes at the back end and the cloud API responds with the context deadline exceeded message.
Which DC are you trying this in?

yussufsh · 2020-07-09T04:35:34Z

Thanks for confirming. I am trying on "frankfurt1".

Just to add, I have tried multiple deployments on PowerVC with similar automation, where we create around 10 VMs in one go, and have never seen such an issue there. However, I have seen "context deadline exceeded" errors in one private OpenStack setup.

bpradipt · 2020-07-09T04:43:24Z

@yussufsh there will be other users provisioning in the same region as well, so it could very well be that we are hitting the limits.
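Since the create call can time out even though the instance was actually created, a blind retry risks duplicates. One way to handle that, sketched here under assumptions, is to look the instance up by name before each attempt and back off between attempts. `provision_idempotent`, `create`, and `find_by_name` are hypothetical stand-ins for whatever client is in use, not functions of the provider or the Power Cloud API.

```python
import time
from typing import Callable, Optional


def provision_idempotent(
    name: str,
    create: Callable[[str], str],
    find_by_name: Callable[[str], Optional[str]],
    attempts: int = 3,
    base_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Create an instance, tolerating timeouts that hide a successful create.

    create and find_by_name are hypothetical callables; create is assumed to
    raise TimeoutError on a gateway timeout or exceeded deadline.
    Returns the instance ID.
    """
    for attempt in range(attempts):
        # A previous attempt may have succeeded on the backend despite the error.
        existing = find_by_name(name)
        if existing is not None:
            return existing
        try:
            return create(name)
        except TimeoutError:
            # Exponential backoff gives a busy backend time to catch up.
            sleep(base_delay_s * 2 ** attempt)
    existing = find_by_name(name)
    if existing is not None:
        return existing
    raise TimeoutError(f"could not provision {name!r} after {attempts} attempts")
```

On the Terraform side, lowering concurrency with `terraform apply -parallelism=n` may also reduce how often the backend is overloaded, though that was not tested in this thread.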

surajsub · 2020-07-10T12:51:27Z

I presume the original issue has been resolved. Can you close this out, please?

yussufsh · 2020-07-10T13:53:25Z

Thanks a lot for helping us get this done!

yussufsh closed this as completed Jul 10, 2020
