Jobs stuck in inactive state #130
Comments
Yes, I am having the same problem... do you have a patch? |
We do not have a patch yet - as we're not sure of the root cause. Right now we are wondering if there is an intermittent problem with Job.prototype.state:

Job.prototype.state = function(state){
  var client = this.client;
  this.removeState();
  this._state = state;
  this.set('state', state);
  client.zadd('q:jobs', this._priority, this.id);
  client.zadd('q:jobs:' + state, this._priority, this.id);
  client.zadd('q:jobs:' + this.type + ':' + state, this._priority, this.id);
  // increase available jobs, used by Worker#getJob()
  if ('inactive' == state) client.lpush('q:' + this.type + ':jobs', 1);
  return this;
};

We have added some diagnostics in the job.js constructor to log errors for the client and are waiting for a repro:

this.client.on('error', function (err) {
  console.log('redis job client error ' + err);
});

This may not be the cause; if anyone else has ideas or a patch we would love to know. They get stuck often, so I was surprised that more people have not run into this. |
It is very strange... I am trying to find the cause... When I call jobs.inactive() I get the job IDs: [ '147', '149', '144', '164', '168', '172', '176' ]. But for some reason jobs.process() doesn't see them or process them. |
I am able to reproduce the problem when stopping a worker... tasks tacked by the worker will be stuck in the inactive state. If I run: jobs.process 'email', 4, (job, done) -> then 1 task remains unfinished in the active state and 3 more remain stuck in the inactive state forever. If I restart the worker, all the other pending tasks get processed, but the ones I mention stay stuck forever. |
We have just seen this issue as well (job stuck in the inactive state). |
@sebicas do you have more details on your repro? Not sure what it means to be "tacked" by a worker while still inactive? |
@mikemoser not sure if "tacked" was the right word... what I tried to say is that for some reason the amount of stuck tasks is somehow related to the number of simultaneous tasks indicated in the job. For example if I do: jobs.process 'email', 4, (job, done) -> 4 tasks will be stuck; jobs.process 'email', 6, (job, done) -> 6 tasks will be stuck, and so on... |
We use kue to manage somewhere between 1k-20k jobs per day and see the same problems. For us sometimes it's once a week. Other times multiple per day. Unfortunately, the root cause of these issues is likely fundamental to the way kue is written - since changes are applied serially in kue, not as an atomic transaction, any little glitch / crash can cause the items in a job to get partially applied, leading to the need to manually repair "broken" jobs. We're at the stage where we're deciding whether to rewrite the innards of kue to be more reliable, or whether to move to something else. Any thoughts would be appreciated. |
Unfortunately we are in the same situation as @dfoody :( |
It should be pretty trivial to make things atomic; I don't have time to look at it right now, but even at worst we could use a little Lua script. Though even if this portion is fully atomic there's always the chance of something being stuck, if the process is killed etc.. Really I think the bigger problem is that we need to recover from half-processed jobs etc. |
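For context, a minimal sketch of the Lua-script idea mentioned above: the zadd/lpush sequence from the Job#state snippet earlier is sent as a single EVAL so Redis applies it atomically. The helper name, the ARGV layout, and the standalone client are illustrative assumptions, not part of kue.

var redis = require('redis');
var client = redis.createClient();

// Single EVAL so all of the index updates happen atomically inside Redis.
// ARGV: 1 = priority, 2 = job id, 3 = state, 4 = job type.
var setStateScript = [
  "redis.call('zadd', 'q:jobs', ARGV[1], ARGV[2])",
  "redis.call('zadd', 'q:jobs:' .. ARGV[3], ARGV[1], ARGV[2])",
  "redis.call('zadd', 'q:jobs:' .. ARGV[4] .. ':' .. ARGV[3], ARGV[1], ARGV[2])",
  "if ARGV[3] == 'inactive' then",
  "  redis.call('lpush', 'q:' .. ARGV[4] .. ':jobs', 1)",
  "end",
  "return 1"
].join('\n');

// Hypothetical helper; kue itself does not expose this.
function setStateAtomically(priority, id, state, type, cb) {
  client.eval(setStateScript, 0, String(priority), String(id), state, type, cb);
}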
I agree; besides making things atomic... if the process is killed in the middle of a job's execution, that causes the job to get stuck... @visionmedia any suggestions on how to solve that? |
Off the top of my head I can't think of any way really to differentiate between an active job and an active job whose process died. We could possibly tie PIDs into the whole thing, or alternatively just "timeout" those jobs: if it's been active for N minutes and it's not complete, kill it and retry. |
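A rough sketch of that timeout idea, assuming each job hash carries an updated_at timestamp and that the state() transition shown earlier in the thread is available; the 10-minute threshold and 60-second scan interval are arbitrary examples, not kue defaults.

var kue = require('kue');
var jobs = kue.createQueue();

var ACTIVE_TIMEOUT_MS = 10 * 60 * 1000; // treat a job as dead after 10 minutes

setInterval(function () {
  // jobs.active() returns the ids currently in q:jobs:active
  jobs.active(function (err, ids) {
    if (err) return console.error('watchdog error', err);
    ids.forEach(function (id) {
      kue.Job.get(id, function (err, job) {
        if (err || !job) return;
        var age = Date.now() - parseInt(job.updated_at, 10);
        if (age > ACTIVE_TIMEOUT_MS) {
          // push it back to inactive so a worker picks it up again,
          // reusing the state() transition from the snippet above
          job.state('inactive');
        }
      });
    });
  });
}, 60 * 1000);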
I do think they could be separated a bit; it's a pretty huge patch. I don't think some of that belongs in core, and it takes more time to review really big pull requests that have a larger scope. |
@dfoody just mentioned he still has stuck jobs, so I guess his patch didn't solve the problem completely. |
There are really two separate issues here: (1) What do you do with jobs that legitimately fail (this is where the watchdog enhancement I put in does work well - as long as you're sure that, when the watchdog fires, it's really failed and not just slow - so set your timeouts appropriately). The only alternative to really know if jobs have died or not is to use something like ZooKeeper under the covers (which has a nice feature that locks can automatically be released when a process dies). (2) What happens when kue's data structures get corrupted. This is happening to us a lot right now, due to a combination of factors we believe: we're now doing calls across Amazon availability zones (Redis in a different AZ from the Kue servers - increasing the latency between a series of redis requests) and we're now running significantly more kue servers than we were before. We think it's this combination of factors causing us to see the corruptions much more often. This is where moving to atomic redis operations (with appropriate use of 'watch') will hopefully help. |
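A minimal sketch of the WATCH + MULTI approach mentioned in (2), using plain node_redis calls; the function name is hypothetical, and the key layout simply mirrors the Job#state snippet earlier in the thread rather than kue's actual implementation.

// Watch the job hash, queue all index writes in one MULTI, and retry if
// another client touched the job while the transaction was being built.
function moveToInactive(client, job, done) {
  var jobKey = 'q:job:' + job.id;
  client.watch(jobKey, function (err) {
    if (err) return done(err);
    client.multi()
      .hset(jobKey, 'state', 'inactive')
      .zadd('q:jobs', job._priority, job.id)
      .zadd('q:jobs:inactive', job._priority, job.id)
      .zadd('q:jobs:' + job.type + ':inactive', job._priority, job.id)
      .lpush('q:' + job.type + ':jobs', 1)
      .exec(function (err, replies) {
        if (err) return done(err);
        // a null reply set means the watched key changed; try again
        if (replies === null) return moveToInactive(client, job, done);
        done();
      });
  });
}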
@dfoody - thanks for clarifying. To be clear, this issue represents (2): we have a very basic setup and we see the redis indexes described above get out of sync before a job is ever processed, so jobs just stay in the inactive state. It happens a lot, however we cannot get a consistent repro. Does anyone have a repro? |
@mikemoser given what you describe - does your process that queues the job quit soon after queuing it? |
We've also seen the same issue; our queuing processes are live ones. Any workarounds? Is this issue finally clarified? |
@dfoody our worker process is always running, so kue should have all the time it needs to finish the operation of adding a job (e.g. call line 447 of kue/lib/jobs.js to increment the index by one for the new job). We are not able to get a consistent repro, so it's proving hard to fix, however we see it happen all the time. I want to reiterate that this issue is about "new" jobs never getting out of the inactive state, not jobs in process that get stuck. Those of you that said you've seen the same behavior: is it the "new jobs stuck in inactive state and never getting processed" case? |
I've not seen the case on our side where jobs get stuck in inactive without something else happening around the same time (e.g. a crash that corrupts the data, AWS having "minor" issues like all of EBS going down, etc.). But, when a job does get stuck, we have to manually correct things before it starts moving again. That said, we're running off my fork, not the original (which has lots of QoS and reliability changes). One other thing to try: have you restarted Redis recently? We have seen that sometimes redis does need a restart and that fixes some things. |
We're seeing similar behavior as well. What we see is that new jobs are stuck in an inactive state. We have concurrency set to 1, but have a cluster of 4 processes. Looking at Redis, we currently have two 'inactive' jobs. When a new job is created, the oldest of the two inactive jobs suddenly gets processed. So, we have, essentially, the two newest jobs always stuck - until they're displaced by new jobs. |
There seem to be two causes for new jobs to never get processed and to stay stuck in the inactive state.
@edwardmsmith it sounds like your symptoms are related to #1. You can verify this by checking the indexes. After correcting #1, we still noticed jobs stuck in the inactive state. It seems that BLPOP becomes unresponsive for certain job types and those jobs never process, even though the redis indexes look good. We don't have a high volume of jobs for these types and our theory is that something goes wrong with the redis connection, but it fails silently and BLPOP just remains blocking and doesn't process any more jobs of that type. We have to restart our worker process and it starts processing all the jobs properly. Has anyone seen BLPOP exhibit this behavior? We're considering switching to LPOP and adding a setTimeout to throttle the loop, however we'd prefer to keep BLPOP and not add what is essentially a high-frequency polling solution. |
This might help you. Here's the rough set of steps we typically follow to repair various kue issues we see regularly:

Failed Jobs Not Showing

Run redis-cli zrange q:jobs:failed 0 -1. For each id, do hget q:job:NUM type until you find one that has 'type' null (or no 'type' field shows up). If there is no 'data' json blob, you can't recover - just delete the job as follows: That should make the jobs now appear. If that doesn't work (e.g. it corrupts the failed queue again), here's how to manually delete a job: Even if there is a 'data' json blob, other fields might be messed up. It's best to find out what type of job it is and who it applies to (by looking in the log files), do the above procedure, and then kick off a new job (via the admin UI) to replace the corrupt one.

Jobs Staying in Queued

First, find the queue that's misbehaving. Find out how many jobs are queued. There are two possible problems here:
Jobs Staying in Staged

Assuming this shows a job number, get that job's current state. If its current state is complete, you just need to delete the job and that should get the queue flowing. You may also need to repair the staged queue if it's corrupt after deleting the job. If you can't get to the specific job, try clearing the completed queue. If the current state of the job that has the lock is 'staged', then you should move that job directly to 'inactive' manually in the UI (since it already has the lock, it can go ahead and be moved to execute).
|
@mikemoser - Thanks for the reply - interestingly, I don't have a key (my job type is 'email')
So I had two stuck jobs:
So, that seems to have cleared out the stuck items for now. @dfoody - Wow, thanks for that! |
@edwardmsmith looks like your key was empty and it does seem that the indexes were out of sync. You can add a watchdog for each type to check this and correct it, like we have. @dfoody thanks for sharing - looks like y'all are having a lot of issues. We hope this is not a sign of things to come for us as we get more volume through kue. You state only 2 reasons for "Jobs Staying in Queued", however we have seen a third, where the numbers match on the indexes and they are greater than zero. This is where we just see the worker for that type sitting on the BLPOP command even though we are pushing new jobs to the key it's blocking on (e.g. |
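A minimal sketch of the per-type watchdog mentioned above, under the assumption that the correction is simply topping up the q:[type]:jobs list until it matches the inactive sorted set; the function name is hypothetical and the key names follow the Job#state snippet from earlier in the thread.

// Compare the inactive index with the BLPOP list for one job type and push
// placeholder entries for any shortfall so blocked workers wake up.
function repairInactiveIndex(client, type, cb) {
  client.zcard('q:jobs:' + type + ':inactive', function (err, inactiveCount) {
    if (err) return cb(err);
    client.llen('q:' + type + ':jobs', function (err, listLength) {
      if (err) return cb(err);
      var missing = inactiveCount - listLength;
      for (var i = 0; i < missing; i++) {
        client.lpush('q:' + type + ':jobs', 1);
      }
      cb(null, missing > 0 ? missing : 0);
    });
  });
}

// Example usage: check the 'email' type once a minute.
setInterval(function () {
  repairInactiveIndex(client, 'email', function (err, fixed) {
    if (fixed) console.log('repaired ' + fixed + ' missing index entries');
  });
}, 60 * 1000);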
We've not seen issues with BLPOP. We host redis ourselves, and it's entirely possible - if you're not local to your redis server - that that could be the cause of issues (though I've not looked at the underlying redis protocol layer to see how they implement it to know more concretely if that type of thing could be an issue - e.g. does it heartbeat the connection to detect failures, etc.).
|
We're thinking about changing the kue/lib/queue/worker.js getJob() function to no longer use BLPOP and just use LPOP with a setTimeout. Here is a change that we've been testing locally. Any thoughts?

/**
 * Attempt to fetch the next job.
 *
 * @param {Function} fn
 * @api private
 */

Worker.prototype.getJob = function(fn){
  var self = this;

  // alloc a client for this job type
  var client = clients[self.type]
    || (clients[self.type] = redis.createClient());

  // BLPOP indicates we have a new inactive job to process
  // client.blpop('q:' + self.type + ':jobs', 0, function(err, result) {
  //   self.zpop('q:jobs:' + self.type + ':inactive', function(err, id){
  //     if (err) return fn(err);
  //     if (!id) return fn();
  //     Job.get(id, fn);
  //   });
  // });

  client.lpop('q:' + self.type + ':jobs', function(err, result) {
    setTimeout(function () {
      self.zpop('q:jobs:' + self.type + ':inactive', function(err, id){
        if (err) return fn(err);
        if (!id) return fn();
        Job.get(id, fn);
      });
    }, result ? 0 : self.interval);
  });
}; |
Any news on this? Looking to run > 200k jobs/day and need something stable, since it will be kind of impossible to handle errors/stuck jobs manually. |
We have determined and fixed the cause of BLPOP not responding. There were a few factors in play:
So, the reason the BLPOP appeared unresponsive was because it had connected to the wrong database instance (e.g. back to index 0). We fixed this by changing:
kue/lib/queue/worker.js getJob()
This is not the best place for this logic; I'm assuming we'd want to make the change in the core reconnect logic and ensure it does not execute the BLPOP until we're sure the database has been selected. However, we have had this fix in place for several weeks and things are looking much better for us. We continue to have a watchdog to fix the indexes, however we're observing to see if that issue is related to the selected-db-on-reconnect issue as well. |
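The actual patch isn't reproduced above; as a speculative sketch of a getJob() workaround along those lines, the worker could re-select its target database before every BLPOP so a silent reconnect that falls back to db 0 cannot leave it blocked on the wrong database. The database index (3) is just an example, and the surrounding names mirror the getJob() snippet earlier in the thread.

Worker.prototype.getJob = function (fn) {
  var self = this;
  var client = clients[self.type]
    || (clients[self.type] = redis.createClient());

  // Re-issue SELECT on every fetch so that, even after a silent reconnect,
  // the BLPOP below runs against the intended database rather than db 0.
  client.select(3, function (err) {
    if (err) return fn(err);
    client.blpop('q:' + self.type + ':jobs', 0, function (err) {
      if (err) return fn(err);
      self.zpop('q:jobs:' + self.type + ':inactive', function (err, id) {
        if (err) return fn(err);
        if (!id) return fn();
        Job.get(id, fn);
      });
    });
  });
};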
I've not posted any related issues yet; it should be investigated more first to get a better understanding of the problem @tobalsgithub |
I am having this issue and it behaves as described by @knation -- when the queue is inactive for too long (e.g. the test environment is up over the weekend), any new jobs to that worker will get stuck. When I restart the worker, it processes the job. Is there any more information about this? I'll try @knation's solution, but maybe a keep-alive should be standard. |
If your case is when Kue workers are idle for a long time, that is a different issue, and could be related to node_redis and its connection properties. It may be the connection which is dropped for inactivity. |
Thanks for pointing me in a direction. I suspect using a
|
I recently moved my Kue workers from Heroku to an Ubuntu machine on Azure. On Heroku everything works fine, but on Azure I have the same issue of workers stopping taking jobs as described above. As soon as the Azure workers are idle for about 5 minutes, they stop taking jobs forever. If I restart them, they start processing jobs again. I do not see anything in the Azure or Redis logs saying anything is wrong, but I suspect it is indeed the BLPOP which is never responding. Since both workers are running the exact same code, I would believe this could be linked to a configuration on the Ubuntu Server 14.04.4 machine or the environment. Any idea where to search? |
This may mean that your redis client connections are being closed or dropped after some idle time; some saw this on cloud deployments. You can monitor your redis instance connections, or increase the idle connection timeout. I don't remember the option name in node_redis; you can search Kue issues for that. |
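The option isn't named above; one guess is node_redis's socket_keepalive flag, which enables TCP keep-alive on the connection so idle sockets are less likely to be silently dropped. A sketch, with the port, host, and error handling as placeholder assumptions:

var redis = require('redis');

// Enable TCP keep-alive on the client connection (assumed to be the relevant
// node_redis option here) and surface connection errors instead of failing silently.
var client = redis.createClient(6379, '127.0.0.1', {
  socket_keepalive: true
});

client.on('error', function (err) {
  console.error('redis connection error', err);
});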
I noticed this mentioned in the features list as "Graceful workers shutdown", but I don't see how this is handled by only a queue-process kill signal listener (workers can be in dozens of different processes connected to redis). I ran into something with gearman workers/jobs a few years back (@enobrev wrestled this issue). I believe it could be related, as I can consistently recreate stuck jobs in kue.js by queueing them up and killing the worker. Whenever our workers (which aren't running in the same process as the queue) restart or are killed (any deployment), they are not cleanly deregistering:
something like this snippet from kue.js, but just for a given process' workers and active jobs
or even better this!
This happens on every code push when I'm deploying (to heroku, locally, ec2, etc.) and this is highly correlated with where I run into the stuck jobs/queue. I noticed the reference to gracefully shutting down the queue or restarting, but nothing about the workers that execute the jobs themselves. What I believe is needed is something like the below code in each and every worker process (I'm testing today locally since I can recreate the stuck queue - version -> master git+https://[email protected]/Automattic/kue.git )
If memory serves, there may be other signals we'll need to listen to as well. Please let me know if there's a simple wrapper I can add somewhere, I'm hoping I can add it safely to my Job class. |
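The code referenced in the comment above isn't included in this thread; as a stand-in, here is a minimal per-process sketch built on kue's Queue#shutdown. The 5000 ms grace period and the specific signal choices are example assumptions, not the commenter's actual snippet.

var kue = require('kue');
var queue = kue.createQueue();

function gracefulExit(signal) {
  return function () {
    // Ask this process' workers to stop, giving active jobs a grace period
    // before the process exits.
    queue.shutdown(5000, function (err) {
      if (err) console.error('kue shutdown error on ' + signal, err);
      process.exit(0);
    });
  };
}

process.once('SIGTERM', gracefulExit('SIGTERM'));
process.once('SIGINT', gracefulExit('SIGINT'));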
OK, I put together a gist with graceful queue and worker shutdown. I'm still seeing a stuck active job, so I think worker pause is not triggering active jobs into an inactive state. I'm working on that last bit now. Here's the gist: |
Updated the gist above to check if a job is active, and set its state to inactive so other workers or this worker can pick it up when they resume. Trying to ensure now that there are no race conditions, and that I don't have too many process SIGINT/SIGTERM listeners (the default is 10; it can be bumped higher within reason). The timing is a little weird: I want to deregister all the workers so they don't grab any jobs, and then I want to make all current jobs inactive. But after the workers are shut down, the ctx for the job makes setting its state raise an error. After ctx.pause() you can't make a kue.Job.get call and use the returned job to set it inactive, if it's still active.
Ideally a single SIGINT/SIGTERM listener pair would cover all workers and active jobs per process, so those particular workers can gracefully shut down, and any active jobs with those workerIds are made inactive after the workers are made inactive (so they don't try to grab the freshly made inactive jobs). update |
If you really care about this issue, the latest bull is on par feature-wise with kue but with a non-polling, and mostly atomic, design. Why wait for kue 1.0 when you can use bull? :) https://www.npmjs.com/package/bull DISCLAIMER: I am the author of the package. I started it out of frustration with some of the long-standing issues with kue, which still today are not completely fixed, and I can tell from experience that it is not completely trivial to rewrite everything using lua scripts and blocking redis calls... |
To remedy this issue, just create another job that periodically wakes the queue roughly every 1 minute, then when it completes, just remove it. This way all jobs will get woken up and never get truly stuck. |
Who watches the Watchmen? @Caspain |
@knation do you have a complete snippet for the keep-alive interval? We're still seeing stuck inactive jobs, only in our dev environment, likely related to long periods of inactivity. Is this sufficient? I noticed your ... above

queue.process('stuck_queue', 10, function(job, done) {
  if (job.data.keepAlive) return done();
});

setInterval(function() { queue.create('stuck_queue', { keepAlive: true }).save(); }, 300000);

It looks like it just queues a job every 300000 ms / 5 min. Any word back from redis @behrad ? |
@victusfate Honestly, we abandoned this quite some time ago and migrated to a pubsub message system. That said, I believe it's close to what we had. |
Thanks @knation I ended up not needing it (just had a worker issue). |
Hey @victusfate, maybe you can comment on what kind of issues? I sometimes get stuck jobs and could not find the reason yet. |
I do a graceful job-to-inactive shift on process kill or term and that normally handles any stuck jobs. My issue was just a faulty worker. |
Still facing this issue! I have all the error handlers, etc. as mentioned in the documentation. There was no failed job event raised either. I wasted the last three weeks implementing a solution with Kue/Redis that's completely unreliable! Going to switch to something else... will try Bull and if that doesn't work, I will move to RabbitMQ. |
The number of jobs that are hung every time is equal to the number of parallel threads processing jobs. |
@theoutlander try bull which has a similar API to Kue, and if you need help ask in the gitter channel: https://gitter.im/OptimalBits/bull |
@manast Thanks for creating this. I'm loving it so far. Ran into a weird issue today (https://github.com/OptimalBits/bull/issues/170)... not sure why. It went away after a while / restarting the IDE (Webstorm). I haven't faced any issues with stuck jobs so far! Good work! And great job keeping a similar API... the transition was seamless! |
Jobs get stuck in the inactive state fairly often for us. We noticed that the length of q:[type]:jobs is zero, even when there are inactive jobs of that type, so when getJob calls blpop, there is nothing to process. It looks like this gets set when a job is saved and the state is set to inactive using lpush q:[type]:jobs 1. We're wondering if this is failing in some cases and once the count is off, jobs remain unprocessed. Has anyone else seen this issue?