
Memory Leak? #88

Closed
sschepens opened this issue Apr 8, 2015 · 81 comments

Comments

@sschepens
Contributor

Memory seems to grow slowly but without limit when using Shoryuken with a Rails environment, a single worker, and waiting on an empty queue.

Any ideas what could be causing this behaviour?

@phstc
Collaborator

phstc commented Apr 8, 2015

hey @sschepens which version of Shoryuken are you using? Old versions of Shoryuken had a problem with message_attributes, but @elsurudo fixed that in d0e69dc

@sschepens
Contributor Author

@phstc I'm using latest version

@leemhenson
Contributor

ruby -v?

@sschepens
Contributor Author

ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-linux-gnu]

@leemhenson
Contributor

What sort of growth are you talking about? Can you give numbers / new relic graphs etc?

@sschepens
Contributor Author

I'll leave a process running for a few hours recording heap metrics and will come back with that info.

@phstc
Collaborator

phstc commented Apr 8, 2015

Any chance it's your worker itself causing the memory leak? Could you try to run Shoryuken with an empty perform?

Some questions for troubleshooting:

  • How many Shoryuken processes do you have?
  • What's the concurrency configured for each?
  • What are your queues' priorities?
  • What's your host/container/whatever CPU (cores) and memory?

@sschepens
Contributor Author

Only one process with 5 concurrency, only one queue with a priority of 2.
This behaviour happens on my laptop and on an EC2 instance. The process starts with around 150-160MB of memory according to htop, and on the EC2 instance the process has been running for a day and now consumes 700MB. That instance has 2000MB of RAM and 1 core.
On my laptop (16GB RAM and 4 cores), after an hour it now consumes 200MB and continues to grow, but very slowly. I have recorded heap metrics every two minutes on my laptop but can't seem to find anything that would indicate the growth of 40MB.

I don't think running with an empty perform will do much, as I'm hanging on an empty queue.

@leemhenson
Contributor

What happens if you stick in an occasional GC.start?

@sschepens
Contributor Author

That is exactly what I was doing just now, recording metrics before and after GC.start.
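
For reference, a minimal sketch of that kind of check (the sampling loop and labels are made up; only GC.start and GC.stat are real APIs):

def gc_snapshot(label)
  s = GC.stat
  puts format('%s: live=%d free=%d allocated=%d',
              label, s[:heap_live_slots], s[:heap_free_slots], s[:total_allocated_objects])
end

loop do
  gc_snapshot('before GC')
  GC.start
  gc_snapshot('after GC')
  sleep 120 # sample every two minutes, as described earlier in the thread
end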

@phstc
Collaborator

phstc commented Apr 9, 2015

@sschepens sorry I missed that part:

waiting on an empty queue.

You are right, no point in testing an empty perform.

I will try to run the same test here to see how it goes.

@phstc
Collaborator

phstc commented Apr 12, 2015

I kept it running for a few hours with default options (checking an empty queue) and the real memory size was consistently around 52MB.

dotenv bin/shoryuken -q default -r ./examples/default_worker.rb -p shoryuken.pid -l shoryuken.log -d

top -pid $(cat shoryuken.pid)

@sschepens are you using the default delay?

My Ruby version is ruby 2.0.0p451. I will bump my Ruby to try it with 2.2.

@sschepens
Contributor Author

Yes, I'm using the default delay.
The symptom I'm having is that on weekends (when we don't deploy and we don't do processing) instances die because Shoryuken uses up all the memory.
I have two different Shoryuken processes on two separate machines and both die.
The things they have in common:

  • Same Ruby version
  • Both include Rails (but only one actually uses the Rails stack (ActiveRecord) during processing.)
  • Both use aws-sdk during processing
  • Both use auto_delete and body_parser with MultiJson + Oj (see the example below)

I still have to record some useful heap information to get a clue of what's going on.
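
For context, a worker using the options listed above might look roughly like this (the queue name and the MultiJson wiring are assumptions, not code from the thread):

require 'multi_json'
require 'oj' # MultiJson picks Oj as its adapter when it is available

class ExampleWorker
  include Shoryuken::Worker

  # auto_delete removes the message after a successful perform;
  # body_parser delegates parsing of the raw body to MultiJson.load
  shoryuken_options queue: 'default',
                    auto_delete: true,
                    body_parser: MultiJson

  def perform(sqs_msg, body)
    # body arrives here already parsed into a Hash
  end
end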

@phstc
Collaborator

phstc commented Apr 13, 2015

Hm, I will try to run that with Rails, because running with Ruby 2.2.1 the real memory size was consistently around 30MB (better than with 2.0).

@sschepens
Contributor Author

Ok, I'm using Ruby 2.2.0 and it seems to have a memory leak in symbol GC: https://bugs.ruby-lang.org/issues/10686
Gonna try updating to 2.2.1 to see if it gets fixed.

@curtislinden

We are also experiencing memory run-out problems

We are running a worker process in docker with a memory cap of 500MB.

We have filled our queue with fake data that can be partially processed and will generate an exception after a few lines of code. This keeps the worker busy and makes it easier to monitor its memory usage behavior.

After 12 hours the process hits the memory cap and is shutdown.

We are running MRI Ruby 2.1.5 and a Rails 4.1 environment with Shoryuken 1.0.1.

The same issue occurs with JRuby 9.0.0.0.pre1 and Shoryuken 1.0.2.

Essentially the process grows over time, then hits some upper bound, and that's all she wrote.

@phstc
Collaborator

phstc commented Apr 18, 2015

I'm using Ruby 2.2.0 and it seems to have a memory leak in symbol GC: https://bugs.ruby-lang.org/issues/10686

@curtislinden could you try ruby 2.2.1?

@sschepens
Contributor Author

I still have leaks on Ruby 2.2.1 and 2.2.2 :(
Didn't have time to get good traces of where the memory is leaking.

@curtislinden

Same issues with 2.2.2

I found a similar conversation in celluloid and have some insight to share.

I believe this will affect Shoryuken: in particular, when an exception is caught and the Celluloid actor is killed, it doesn't have its memory cleaned up. It may be true for actors that complete a task successfully as well.


celluloid/celluloid#463 (comment)

@curtislinden try to delete actor instance variables in the celluloid finalizer. The "celluloized" object is known to be never collected by GC.
You may check a w/a for DCell: niamster/dcell@61d4e0d
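
A minimal sketch of that workaround with a Celluloid 0.16-style actor (the class and instance variables here are illustrative, not Shoryuken's actual processor):

require 'celluloid'

class Processor
  include Celluloid

  # Celluloid runs this when the actor terminates, normally or after a crash
  finalizer :release_references

  def process(sqs_msg)
    @sqs_msg = sqs_msg
    # ... do the work ...
  end

  private

  # Drop every reference held by the "celluloided" object so it can be GC'd,
  # following the DCell workaround linked above
  def release_references
    instance_variables.each { |ivar| instance_variable_set(ivar, nil) }
  end
end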

@sschepens
Contributor Author

The problem, I believe, is that Shoryuken doesn't recreate actors; they are long-lived and only recreated if an exception happened (and only workers, I believe).

@curtislinden

Since Celluloid was written in the spirit of Erlang, I assume that "fail-fast" and "let it die" were good patterns to follow with our workers. In that spirit, when our workers can't process a job, they throw exceptions for various cases. In practice these exceptions are common but temporary.

Since we are throwing exceptions often with our current data set, I suspect, given @sschepens' observation about actor recreation, that exception handling in Shoryuken may be the place to add cleanup.

@sschepens
Contributor Author

I know I've seen exceptions being swallowed in Shoryuken, for example here: fetcher
Also, the watchdog annotation swallows exceptions and only logs them, and it seems to be used a lot.

Don't know if there are more occurrences of these, and I'm not really sure if those are causing issues.
Maybe Shoryuken should just let exceptions kill all actors and use supervise or Celluloid::SupervisionGroup to autostart actors.
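
For illustration, a supervision setup along those lines could look like this with Celluloid 0.16 (the actor names come from the repo, but this is only a sketch, not how Shoryuken is actually wired):

require 'celluloid'

# Crashed actors are respawned by the group instead of exceptions being
# swallowed and dead actors lingering around
class RunnerGroup < Celluloid::SupervisionGroup
  supervise Shoryuken::Fetcher, as: :fetcher
  supervise Shoryuken::Manager, as: :manager
end

RunnerGroup.run! # boots the actors and restarts any that die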

@sschepens
Contributor Author

@phstc @curtislinden @leemhenson
Just wanting to let you guys know that I created a new project. I took some code from Shoryuken but simplified the architecture. It's not a fork because some things are just a complete rewrite.
I saw Shoryuken as too complex for what it does or should do (in my opinion :)) and was trying to hunt for possible memory leaks, so I made a few simplifications:

  • I removed priority-based load balancing
  • Spawn one long-polling fetcher per queue
  • Fetchers get messages according to the number of idle processors and wait until at least one is available before getting more messages.
  • Added a configurable pool of processors per queue, which should give a desirable amount of load balancing.
  • Added a new actor, Scheduler, which is now spawned for each processor; its task is just to take care of the processor's timers (when a processor is processing, it's most definitely doing blocking calls, so it can't trigger timers while it's processing; this breaks auto_visibility_timeout, see the sketch after this list). I wanted to go for one unique Scheduler, but the Amazon SDK uses Net::HTTP, which is blocking, so multiple timers would not trigger concurrently, leaving a good chance of breaking auto_visibility_timeout.
  • Usage of the Actor Registry instead of passing actors on initialization
  • Extensive usage of linking and supervising; every actor is restarted when an exception is raised.
  • Actors free instance variables on shutdown to prevent memory leaks.
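
A rough sketch of the per-processor Scheduler idea mentioned in the list (the SQS call and method names are assumptions; the point is that the timer lives outside the blocking processor):

class Scheduler
  include Celluloid

  # Keep extending the message's visibility while the paired processor is busy
  # with blocking work, so auto_visibility_timeout keeps working
  def keep_visible(sqs_msg, interval)
    @timer = every(interval) do
      sqs_msg.change_visibility(visibility_timeout: interval * 2)
    end
  end

  def done
    @timer.cancel if @timer
  end
end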

I've also removed some code which I didn't need right now, such as the ActiveJob integration.

Anyway, I have been testing and my processes have now been running for more or less a week, and I have not had any memory issues.
The project is not documented at all right now and has no tests, but I just wanted to let you know that it exists.
Please tell me if you have any issues with me taking code from Shoryuken, and if you like it, maybe we can work on merging both of them.

Cheers!
Sebastian

@phstc
Collaborator

phstc commented Apr 28, 2015

I know I've seen exceptions being swallowed in Shoryuken, for example here: fetcher
Also, the watchdog annotation swallows exceptions and only logs them, and it seems to be used a lot.

Don't know if there are more occurrences of these, and I'm not really sure if those are causing issues.
Maybe Shoryuken should just let exceptions kill all actors and use supervise or Celluloid::SupervisionGroup to autostart actors.

Fetcher exceptions should be rare. And if a processor dies, it will be handled by manager/trap_exit.

I totally believe you, and I'm sure this issue is happening for you. But I couldn't reproduce it in my production and development environments.

Please tell me if you have any issues with me taking code from Shoryuken, and if you like it, maybe we can work on merging both of them.

👍 for the initiative! Feel free to copy any code you need! I did the same with Sidekiq. Remember to keep the LGPLv3 License.

@sschepens @curtislinden as you can reproduce the issue, could you try 🙏

# https://github.com/phstc/shoryuken/blob/master/lib/shoryuken/manager.rb#L78-79
@threads.delete(processor.object_id)
@busy.delete processor
remove_instance_variable processor

???

Following niamster/dcell@61d4e0d?

@curtislinden

Hey! Yeah, we can try this.


@phstc
Collaborator

phstc commented May 8, 2015

hey @curtislinden did you have a chance to try that?

@curtislinden

Hey - super interested in trying this out - just have to get some time set aside for this. I'll see if I can get to it today.


@curtislinden

Hi - tried it out and it appears that processor isn't an instance variable, and also remove_instance_variable is not an expected method... I figured this might be because of the 'poetic mode', but it doesn't make sense to call remove_instance_variable(:@processor) as processor is a local var.

2015-05-08T20:42:57.461708742Z 20:42:57 worker.1 | 2015-05-08T20:42:57Z 14 TID-otkh73en8 ERROR: /usr/local/bundle/gems/celluloid-0.16.0/lib/celluloid/proxies/sync_proxy.rb:23:in `method_missing'
2015-05-08T20:42:57.461798959Z 20:42:57 worker.1 | /usr/local/bundle/gems/celluloid-0.16.0/lib/celluloid/proxies/sync_proxy.rb:18:in `respond_to?'
2015-05-08T20:42:57.462189657Z 20:42:57 worker.1 | /usr/local/bundle/bundler/gems/shoryuken-49574533fa26/lib/shoryuken/manager.rb:80:in `remove_instance_variable'

@phstc
Collaborator

phstc commented May 10, 2015

Good point! The processors are elements inside the arrays @ready and @busy (which are instance variables).

The processors flow:

Based on what I understood from this thread, the processors might be kept in memory when they die, increasing the memory usage. The remove_instance_variable was an attempt to make sure that we remove them from memory when they die. But we probably need to find a different solution for that.
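
A minimal sketch of what that cleanup could look like instead (an assumption, not the eventual fix; build_processor is a made-up helper): make sure every collection holding the dead processor drops its reference so the GC can reclaim it.

# Hypothetical manager-side handler for a processor that crashed
def processor_died(processor, reason)
  logger.info "Processor died: #{reason}"

  @busy.delete processor               # forget the crashed actor...
  @threads.delete processor.object_id  # ...and its thread bookkeeping entry

  @ready << build_processor            # spawn a replacement instead of reusing the dead actor
end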

@curtislinden

Interesting... I think the hypothesis is indeed that the processors might be kept in memory after they die.

This is an interesting problem.


@phstc
Collaborator

phstc commented Jan 21, 2016

@Senjai Not resolved yet. But this is tricky; for example, I couldn't reproduce it. Did you? I closed it because of the lack of activity on this thread, but maybe we can re-open it - do you have thoughts on it?

@sschepens
Contributor Author

I believe Sidekiq is in the process of migrating, or has already migrated, to concurrent-ruby instead of Celluloid, because it was having similar issues; Rails 5 is also going to depend on concurrent-ruby.
I did a small SQS consumer project using concurrent-ruby and found no issues whatsoever.
@phstc maybe you should consider dropping Celluloid.
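
To make the suggestion concrete, a consumer built on concurrent-ruby can lean on a plain thread pool instead of actors (a rough sketch with made-up names such as fetch_batch, not Shoryuken code):

require 'concurrent'

pool = Concurrent::FixedThreadPool.new(5) # matches a concurrency of 5

loop do
  messages = fetch_batch(queue)           # hypothetical SQS receive call
  messages.each do |sqs_msg|
    pool.post { MyWorker.new.perform(sqs_msg, sqs_msg.body) }
  end
end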

@Senjai
Contributor

Senjai commented Jan 21, 2016

@phstc We've run into the issue. We don't really care if Heroku kills our dyno due to it, as our system is designed around that possibility, so it's rather low priority. But it still appears to be a problem. 👍 for dropping Celluloid though, but that's a larger discussion.

@gustavogsimas

Any progress on this? I currently have the same issue; Shoryuken gets to 60% of memory usage just by listening for new messages in SQS...

@phstc
Collaborator

phstc commented Oct 7, 2016

@twizzzlers there's a big chance it's more related to what you are doing in your worker rather than to Shoryuken itself. What are you processing?

@gustavogsimas

I'm executing some jobs with normal database functions; they could be heavy while executing. But my question is why the memory is so high even when it's only listening for messages?

@Senjai
Contributor

Senjai commented Oct 7, 2016

It's just Shoryuken; you can get the same issue with a worker that does nothing.

@gustavogsimas

So what are my options here? Is there a workaround for this? Should I change to Sidekiq or something else?

@phstc
Collaborator

phstc commented Oct 8, 2016

@twizzzlers you can try Sidekiq, it's using concurrent-ruby; if you do, please let me know the results. I couldn't easily reproduce the memory issue, but it's happening for sure for some people. A long time ago, when I was using Sidekiq (Celluloid implementation), I used to auto-restart Sidekiq using monit service checks, but my workers were doing some (headless) browser operations with libraries known for leaking memory.

@Senjai
Contributor

Senjai commented Oct 8, 2016

Either way, it's an issue, and until it can be nailed down to a particular library or code path, the issue should be open.

@kwokster10

I agree with @Senjai. Since this is an outstanding issue, it should remain open regardless of the amount of activity. We are seeing this in our app but I did not realize the gem had a memory leak. Our process has been dying in the perform method due to some data missing in the body that we were expecting.

Error on Heroku: ERROR: Process died, reason: undefined method `[]' for nil:NilClass. We are correcting this now but I am wondering what we can do to prevent this memory leak.

It's causing our memory quota to reach 1GB within 6 hours. See image below.

(image: shoryuken-memory-graph)

@phstc phstc reopened this Nov 17, 2016
@phstc
Collaborator

phstc commented Nov 17, 2016

@Senjai @kwokster10 I don't mind reopening it, but unfortunately I can't work on this at the moment. Not sure if we can do much with Celluloid; I guess the fix would be to migrate to concurrent-ruby as the other cool guys did.

@gustavogsimas

We are living with this issue at my company for now while we study other options, although we'd like to continue with Shoryuken since it works so much better with AWS SQS...

@phstc
Collaborator

phstc commented Dec 5, 2016

Hey!

I started to work on the concurrent-ruby migration 🍻

@phstc phstc mentioned this issue Dec 5, 2016
@mariokostelac
Contributor

mariokostelac commented Dec 13, 2016

@Senjai @twizzzlers @kwokster10 @gshutler are you using any middleware? From shoryuken or any other repo?

@mariokostelac
Contributor

Does not matter. I found an issue. It was not Celluloid, it was not Ruby (or at least not just Celluloid and not just Ruby); it was the way we handled the bookkeeping of thread objects.
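
A simplified illustration of that kind of bookkeeping leak (not the actual patch): if entries keyed by the processor are only removed on the happy path, the thread objects pile up every time a processor crashes.

@threads = {} # object_id => Thread

def assign(processor, thread)
  @threads[processor.object_id] = thread
end

def processor_done(processor)
  @threads.delete(processor.object_id)
end

def processor_died(processor, reason)
  # if this path forgets to delete the entry, the Thread (and whatever it
  # references) stays reachable forever -> slow, unbounded memory growth
  @threads.delete(processor.object_id)
end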

@phstc
Collaborator

phstc commented Dec 23, 2016

@mariokostelac amazing work on this 👏

Shoryuken 2.1.2 is out with your fix. For more details: CHANGELOG.

@kwokster10

@mariokostelac I don't believe we are using any middleware, only our custom WorkerClass.

@mariokostelac
Contributor

mariokostelac commented Jan 6, 2017

@kwokster10 It does not matter. We've explained what the problem was and how we fixed it. Thanks for replying.

@phstc
Collaborator

phstc commented Mar 12, 2017

Hey all,

Shoryuken v3 is out 🎉 using concurrent-ruby instead of Celluloid; it should fix this leaking issue. Based on this (thanks @paul), the overall memory footprint is lower and more consistent.

Besides that, I added some cool/handy SQS commands, such as ls to list queues and messages (available/in flight), mv to move messages from one queue to another, dump to dump messages, and requeue to enqueue messages from a dump file. Have a look at shoryuken help sqs for some guidance.

Full CHANGELOG.

@elsurudo
Contributor

That's great! Will have to try this one out. Btw @phstc, does Shoryuken use semver? i.e. when you incremented the major version, does that mean it's not backwards-compatible?

@phstc
Collaborator

phstc commented Mar 13, 2017

Hi @elsurudo

Shoryuken uses semver. Version 3.0.0 introduces some incompatible API changes depending on how you use Shoryuken. For instance, deprecation warnings and setting up AwsConfig through Shoryuken were removed. For most projects, I believe people can just bump Shoryuken.

@phstc
Collaborator

phstc commented Jul 3, 2017

Hi all

Just released a new version of Shoryuken 3.1.0 with a new concurrent-ruby implementation using Futures and Promises.

Shoryuken now also supports processing groups (set of queues), which means it supports concurrency, exclusive fetcher, long polling etc per group 🤘
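
A rough sketch of that Futures/Promises style using concurrent-ruby's Promise API (fetcher, process_async and friends are illustrative names, not Shoryuken's internals):

require 'concurrent'

fetch = Concurrent::Promise.execute { fetcher.fetch(queue, batch_size) }

fetch
  .then { |messages| messages.each { |sqs_msg| process_async(sqs_msg) } }
  .rescue { |error| logger.error("Fetch failed: #{error.message}") }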
