Task stuck in running when replacing deployment #60

Open
cattuz opened this issue Jan 27, 2022 · 7 comments

cattuz commented Jan 27, 2022

Issue

I have a problem where tasks occasionally get stuck when updating:

[image]

I think the issue occurs when cleaning up the app pool fails for any reason. It's possible to manually remove the app pool and site, but the Nomad status remains stuck in running. This might be related to #41.

IIS/VM state

Event log:
[image]

Contents of \\?\C:\inetpub\temp\apppools\b5f132be-a977-3106-ef71-b936ad90c24b\b5f132be-a977-3106-ef71-b936ad90c24b.config:
<!-- ERROR: There's been an error reading or processing the applicationhost.config file. Line number: 0 Error message: Cannot read configuration file -->

IIS app pools:
[image]

IIS sites:
[image]

Logs

Output from nomad alloc status b5f132be-a977-3106-ef71-b936ad90c24b

ID                   = b5f132be-a977-3106-ef71-b936ad90c24b
Eval ID              = 698c9204
Name                 = xxx.xxx[0]
Node ID              = 1b82983e
Node Name            = VMSDFWIN00000E
Job ID               = xxx
Job Version          = 26
Client Status        = running
Client Description   = Tasks are running
Desired Status       = stop
Desired Description  = alloc is being updated due to job update
Created              = 1d1h ago
Modified             = 17h4m ago
Deployment ID        = 7338489c
Deployment Health    = unhealthy
Replacement Alloc ID = 0a3dac7b

Allocation Addresses
Label       Dynamic  Address
*xxx yes      11.0.2.4:28248

Task "setup" (prestart) is "dead"
Task Resources
CPU     Memory        Disk     Addresses
10 MHz  10 MiB        300 MiB
        Max: 500 MiB

Task Events:
Started At     = 2022-01-26T07:03:18Z
Finished At    = 2022-01-26T07:03:25Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type                   Description
2022-01-26T16:21:15+01:00  Killing                Sent interrupt. Waiting 5s before force killing
2022-01-26T08:03:24+01:00  Terminated             Exit Code: 0
2022-01-26T08:03:18+01:00  Started                Task started by client
2022-01-26T08:01:20+01:00  Downloading Artifacts  Client is downloading artifacts
2022-01-26T08:01:16+01:00  Task Setup             Building Task Directory
2022-01-26T08:01:16+01:00  Received               Task received by client

Task "xxx" is "running"
Task Resources
CPU        Memory        Disk     Addresses
0/100 MHz  0 B/200 MiB   300 MiB
           Max: 500 MiB

Task Events:
Started At     = 2022-01-26T15:10:00Z
Finished At    = N/A
Total Restarts = 2
Last Restart   = 2022-01-26T15:08:50Z

Recent Events:
Time                       Type        Description
2022-01-26T16:21:15+01:00  Killing     Sent interrupt. Waiting 5s before force killing
2022-01-26T16:10:00+01:00  Started     Task started by client
2022-01-26T16:08:50+01:00  Restarting  Task restarting in 1m9.801230412s
2022-01-26T16:08:50+01:00  Terminated  Exit Code: 0
2022-01-26T15:52:48+01:00  Started     Task started by client
2022-01-26T15:51:43+01:00  Restarting  Task restarting in 1m4.151514396s
2022-01-26T08:03:29+01:00  Started     Task started by client
2022-01-26T08:03:25+01:00  Task Setup  Building Task Directory
2022-01-26T08:01:16+01:00  Received    Task received by client

Output from nomad alloc status 0a3dac7b-8377-d0f2-8eb6-9b85bd0c6d39

ID                  = 0a3dac7b-8377-d0f2-8eb6-9b85bd0c6d39
Eval ID             = 187ba4e3
Name                = xxx.xxx[0]
Node ID             = 1b82983e
Node Name           = VMSDFWIN00000E
Job ID              = xxx
Job Version         = 32
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 17h7m ago
Modified            = 18m41s ago
Deployment ID       = 7338489c
Deployment Health   = unset

Allocation Addresses
Label       Dynamic  Address
*xxx yes      11.0.2.4:2829
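
The two status outputs above differ in one telling way: the old allocation reports Client Status = running while its Desired Status is stop, and the replacement sits in pending. A minimal Python sketch, assuming the header of `nomad alloc status` keeps the `Key = Value` layout shown above, could flag that combination. The names `parse_alloc_status` and `is_stuck` are hypothetical helpers, not part of Nomad or the driver, and the sample strings are abbreviated from the outputs above.

```python
# Hypothetical helpers for illustration; not part of Nomad or the IIS driver.

def parse_alloc_status(text):
    """Collect the 'Key = Value' header lines of `nomad alloc status` output."""
    fields = {}
    for line in text.strip().splitlines():
        if not line.strip():
            break  # the header block ends at the first blank line
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_stuck(fields):
    """True when the client still reports running although a stop was requested."""
    return (
        fields.get("Client Status") == "running"
        and fields.get("Desired Status") == "stop"
    )


# Abbreviated from the two outputs above.
OLD_ALLOC = """
ID                   = b5f132be-a977-3106-ef71-b936ad90c24b
Client Status        = running
Desired Status       = stop
Replacement Alloc ID = 0a3dac7b
"""

NEW_ALLOC = """
ID                  = 0a3dac7b-8377-d0f2-8eb6-9b85bd0c6d39
Client Status       = pending
Desired Status      = run
"""

if __name__ == "__main__":
    for name, text in (("old", OLD_ALLOC), ("new", NEW_ALLOC)):
        print(name, is_stuck(parse_alloc_status(text)))
```

The check only reads the header fields, so it says nothing about why the allocation is stuck; it is meant as a quick way to spot affected allocations before any manual cleanup.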

Vulfox (Contributor) commented Jan 27, 2022

I would agree with your assumption that it is related to #41. This version of the driver is a little bit wonky in how it goes about performing these status changes. I have an unfinished v0.2.0 version of this driver in a different branch that should address this problem, but I haven't touched it in a long while. I can try and allocate some time in the coming days to help address this problem for ya.

cattuz (Author) commented Jan 28, 2022

Yeah, I suspected it was the same root cause. If the v0.2.0 driver is working, I can test it out on my workload as it is, if that would help.

cattuz (Author) commented Jan 29, 2022

I'm also having a few issues with raw_exec on Windows VMs, which might be related: hashicorp/nomad#11939

Vulfox (Contributor) commented Jan 29, 2022

Hmmm, possibly, but I am inclined to believe it's more of a problem on this code base's end than in HashiCorp's Nomad code. I could be wrong, and judging from the response you got, this is a known issue that's hard to replicate. I'll try and keep an eye on that issue and its potential root cause as things progress.

cattuz (Author) commented Feb 1, 2022

The Windows desktop heap configuration change that solved hashicorp/nomad#11939 (comment) appears to have mitigated this issue as well. The IIS processes no longer get "stuck" as they did previously. I'm cautiously optimistic after seeing no problems on the Windows VMs for a few days now :)

cattuz (Author) commented Feb 14, 2022

The issue with tasks getting stuck has resurfaced, although much more rarely than before. Is there any quick workaround for getting the tasks unstuck manually? It happens so rarely that doing it manually is not out of the question.

I've tried various commands with the Nomad CLI (nomad alloc stop, etc.), and I've tried manually removing the offending app pool and site directly on the VM, but the task remains stuck and the pending task refuses to start.

Vulfox (Contributor) commented Feb 14, 2022

Yea, this is what I have experienced on some occasions with test automation. I am not 100% sure whether the fault lies with Nomad or with the driver itself. This kinda goes back to me thinking it is a logic problem in how the driver tries to handle its state in this version vs the v0.2. To guarantee a clean slate for a single node, I ended up clearing all allocs manually in IIS and deleting the Nomad db/data files/dirs on the client. I have not done so in a clustered environment, and I don't know how the Nomad servers themselves would treat the scenario (unchanged, dead, failed states for allocs). It may require a forced garbage collect of the system afterwards (https://www.nomadproject.io/docs/commands/system/gc).
