Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows Buildkite agents are taking new jobs before the VM has rebooted #42

Open
DilumAluthge opened this issue Jan 13, 2023 · 8 comments

Comments

@DilumAluthge
Copy link
Member

No description provided.

@DilumAluthge
Copy link
Member Author

This often manifests itself with the following error message:

"Secret private key path 'C:\secrets/agent.key' is not a valid private RSA key!"

@staticfloat
Copy link
Member

Thanks to the debugging code added in, we can see in this example that the previous run was canceled by a SIGINT, presumably sent because the entire job was canceled due to a new commit being pushed before the old commit was done testing:

Old run: https://buildkite.com/julialang/julia-master/builds/20727#0186216b-d26a-4bde-b924-469346201a0b/4392-4408
New run: https://buildkite.com/julialang/julia-master/builds/20728#01862174-1b9b-4a4a-8aa9-a548b4eb7635/6-23

This shouldn't cause a problem, but clearly we do have a problem, so let's trace through and see if we can find any issues with our reboot functionality:

  • First, buildkite itself should exit after running. Given that we have new PIDs for the buildkite-agent processes, this seems to be likely working as intended.
  • Next, nssm should be invoking shutdown immediately after the buildkite-agent process exits. It's unclear if this is happening or not.
  • Once the VM shuts down, the state of the VM should be reset. Looking through the VM logs, it is not restarting.

So the smoking gun points to, unsurprisingly, the NSSM script that should be invoking shutdown, but it's failing for some reason.

@Keno
Copy link
Member

Keno commented Jul 26, 2024

  • Next, nssm should be invoking shutdown immediately after the buildkite-agent process exits. It's unclear if this is happening or not.

I'm always nervous about shell quoting, particularly on Windows. Are we sure that this ends up getting parsed appropriately by cmd and not appended as garbage to the end of the buildkite invocation or to some internal command line inside nssm?

@staticfloat
Copy link
Member

I logged into a machine running a job, and used a powershell script to tail the windows event log for nssm and for User32, this is what I saw:

   ProviderName: nssm

TimeCreated                      Id LevelDisplayName Message
-----------                      -- ---------------- -------
7/26/2024 5:52:02 PM           1040 Information      Service buildkite-agent received START control, which will be handled.
7/26/2024 5:52:04 PM           1008 Information      Started C:\Windows\System32\cmd.exe /C C:\buildkite-agent\bin\buildkite-agent.exe start & shutdown /s /t 0 /f /d p:4:1 for service buildkite-agent in C:\W... 
7/26/2024 5:52:17 PM           1040 Information      Service buildkite-agent received SHUTDOWN control, which will be handled.
7/26/2024 5:52:17 PM           1011 Information      Killing process C:\Windows\System32\cmd.exe because service buildkite-agent is stopping.
7/26/2024 5:52:31 PM           1040 Information      Service buildkite-agent received START control, which will be handled.
7/26/2024 5:52:32 PM           1008 Information      Started C:\Windows\System32\cmd.exe /C C:\buildkite-agent\bin\buildkite-agent.exe start & shutdown /s /t 0 /f /d p:4:1 for service buildkite-agent in C:\W... 
7/26/2024 5:52:02 PM           1040 Information      Service buildkite-agent received START control, which will be handled.                                                                                       
7/26/2024 5:52:04 PM           1008 Information      Started C:\Windows\System32\cmd.exe /C C:\buildkite-agent\bin\buildkite-agent.exe start & shutdown /s /t 0 /f /d p:4:1 for service buildkite-agent in C:\W... 
7/26/2024 5:52:17 PM           1040 Information      Service buildkite-agent received SHUTDOWN control, which will be handled.
7/26/2024 5:52:17 PM           1011 Information      Killing process C:\Windows\System32\cmd.exe because service buildkite-agent is stopping.
7/26/2024 5:52:31 PM           1040 Information      Service buildkite-agent received START control, which will be handled.
7/26/2024 5:52:32 PM           1008 Information      Started C:\Windows\System32\cmd.exe /C C:\buildkite-agent\bin\buildkite-agent.exe start & shutdown /s /t 0 /f /d p:4:1 for service buildkite-agent in C:\W...
7/26/2024 6:21:31 PM           1013 Information      Program C:\Windows\System32\cmd.exe for service buildkite-agent exited with return code 0.
7/26/2024 6:21:31 PM           1023 Information      Killing process tree of process 4048 for service buildkite-agent with exit code 0
7/26/2024 6:21:31 PM           1027 Information      Killing PID 4048 in process tree of PID 4048 because service buildkite-agent is stopping.
7/26/2024 6:21:31 PM           1016 Information      Service buildkite-agent action for exit code 0 is Exit. Exiting.
7/26/2024 6:21:31 PM           1012 Information      Requested stop of service buildkite-agent.  No action is required as program C:\Windows\System32\cmd.exe is not running.
   ProviderName: User32

TimeCreated                      Id LevelDisplayName Message                                                                                                                                                      
-----------                      -- ---------------- -------                                                                                                                                                      
6/26/2024 1:37:33 AM           1074 Information      The process C:\Windows\system32\winlogon.exe (MINWINPC) has initiated the restart of computer WIN-HKTBISMGI26 on behalf of user NT AUTHORITY\SYSTEM for th...
6/25/2024 6:46:41 PM           1074 Information      The process C:\Windows\system32\wbem\wmiprvse.exe (WINBASE) has initiated the restart of computer WINBASE on behalf of user WINBASE\julia for the followin...
6/25/2024 7:04:29 PM           1074 Information      The process C:\Windows\system32\wbem\wmiprvse.exe (WINBASE) has initiated the restart of computer WINBASE on behalf of user WINBASE\julia for the followin...
6/25/2024 7:17:11 PM           1074 Information      The process wininit.exe (WINBASE) has initiated the power off of computer WINBASE on behalf of user WINBASE\Administrator for the following reason: Applic...
6/25/2024 7:24:09 PM           1074 Information      The process wininit.exe (WINBASE) has initiated the power off of computer WINBASE on behalf of user WINBASE\Administrator for the following reason: Applic...
7/26/2024 5:52:08 PM           1074 Information      The process C:\Windows\system32\winlogon.exe (WIN2K22-AMDCI6-) has initiated the restart of computer WIN2K22-AMDCI6- on behalf of user NT AUTHORITY\SYSTEM...
6/26/2024 1:37:33 AM           1074 Information      The process C:\Windows\system32\winlogon.exe (MINWINPC) has initiated the restart of computer WIN-HKTBISMGI26 on behalf of user NT AUTHORITY\SYSTEM for th...
6/25/2024 6:46:41 PM           1074 Information      The process C:\Windows\system32\wbem\wmiprvse.exe (WINBASE) has initiated the restart of computer WINBASE on behalf of user WINBASE\julia for the followin...
6/25/2024 7:04:29 PM           1074 Information      The process C:\Windows\system32\wbem\wmiprvse.exe (WINBASE) has initiated the restart of computer WINBASE on behalf of user WINBASE\julia for the followin...
6/25/2024 7:17:11 PM           1074 Information      The process wininit.exe (WINBASE) has initiated the power off of computer WINBASE on behalf of user WINBASE\Administrator for the following reason: Applic...
6/25/2024 7:24:09 PM           1074 Information      The process wininit.exe (WINBASE) has initiated the power off of computer WINBASE on behalf of user WINBASE\Administrator for the following reason: Applic...
7/26/2024 5:52:08 PM           1074 Information      The process C:\Windows\system32\winlogon.exe (WIN2K22-AMDCI6-) has initiated the restart of computer WIN2K22-AMDCI6- on behalf of user NT AUTHORITY\SYSTEM...
7/26/2024 6:21:31 PM           1074 Information      The process C:\Windows\System32\shutdown.exe (WIN2K22-AMDCI6-) has initiated the shutdown of computer WIN2K22-AMDCI6- on behalf of user WIN2K22-AMDCI6-\ju...

So it looks to me like it is correctly invoking the shutdown command. In any case, if it weren't, our bots wouldn't work at all, rather than not working sporadically.

@staticfloat
Copy link
Member

I wonder if shutdown can ever fail? Is that possible?

@Keno
Copy link
Member

Keno commented Jul 26, 2024

Those are all the success case though. Do we have any logs for the failure case?

@staticfloat
Copy link
Member

No, I have to login and instrument the VM beforehand, as all state is lost after each job finishes.

@Keno
Copy link
Member

Keno commented Jul 27, 2024

Can we dump the logs in one of the early scripts if C:\secrets/agent.key is missing, which is one of the indicators of this failure?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants