
Memory Leak? #88

Closed
sschepens opened this issue Apr 8, 2015 · 81 comments

Comments

@sschepens
Contributor

Memory seems to grow slowly but without limit when using Shoryuken with a Rails environment, a single worker, and waiting on an empty queue.

Any ideas what could be causing this behaviour?

@phstc
Collaborator

phstc commented Apr 8, 2015

hey @sschepens which version of Shoryuken are you using? Old versions of Shoryuken had a problem with message_attributes, but @elsurudo fixed that in d0e69dc

@sschepens
Contributor Author

@phstc I'm using latest version

@leemhenson
Contributor

ruby -v?

@sschepens
Contributor Author

ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-linux-gnu]

@leemhenson
Contributor

What sort of growth are you talking about? Can you give numbers / new relic graphs etc?

@sschepens
Contributor Author

I'll leave a process running for a few hours recording heap metrics and will come back with that info.

@phstc
Collaborator

phstc commented Apr 8, 2015

Any chance it's your worker itself causing the memory leak? Could you try to run Shoryuken with an empty perform?

Some questions for troubleshooting:

  • How many Shoryuken processes do you have?
  • What's the concurrency configured for each?
  • What are your queues' priorities?
  • What's your host/container/whatever CPU (cores) and memory?

@sschepens
Contributor Author

Only one process with 5 concurrency, only one queue with a priority of 2.
This behaviour happens on my laptop and on an EC2 instance. The process starts with around 150-160MB of memory according to htop, and on the EC2 instance the process has been running for a day and now consumes 700MB. That instance has 2000MB of RAM and 1 core.
On my laptop (16GB RAM and 4 cores), after an hour it now consumes 200MB and continues to grow, but very slowly. I have recorded heap metrics every two minutes on my laptop but can't seem to find anything that would indicate the growth of 40MB.

I don't think running with an empty perform will do much, as I'm hanging on an empty queue.

@leemhenson
Contributor

What happens if you stick in an occasional GC.start?

@sschepens
Contributor Author

That is exactly what I was doing just now, recording metrics before and after GC.start.
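
For reference, a minimal sketch of that kind of check (the sampling loop and labels are made up; only GC.start and GC.stat are real APIs):

def gc_snapshot(label)
  s = GC.stat
  puts format('%s: live=%d free=%d allocated=%d',
              label, s[:heap_live_slots], s[:heap_free_slots], s[:total_allocated_objects])
end

loop do
  gc_snapshot('before GC')
  GC.start
  gc_snapshot('after GC')
  sleep 120 # sample every two minutes, as described earlier in the thread
end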

@phstc
Collaborator

phstc commented Apr 9, 2015

@sschepens sorry I missed that part:

waiting on an empty queue.

You are right, no point in testing an empty perform.

I will try to run the same test here to see how it goes.

@phstc
Collaborator

phstc commented Apr 12, 2015

I kept it running for a few hours with default options (checking an empty queue) and the real memory size was consistently around 52MB.

dotenv bin/shoryuken -q default -r ./examples/default_worker.rb -p shoryuken.pid -l shoryuken.log -d

top -pid $(cat shoryuken.pid)

@sschepens are you using the default delay?

My Ruby version is ruby 2.0.0p451. I will bump my Ruby to try it with 2.2.

@sschepens
Contributor Author

Yes, I'm using the default delay.
The symptom I'm having is that on weekends (when we don't deploy and we don't do processing) instances die because Shoryuken uses up all the memory.
I have two different Shoryuken processes on two separate machines and both die.
The things they have in common:

  • Same Ruby version
  • Both include Rails (but only one actually uses the Rails stack (ActiveRecord) during processing.)
  • Both use aws-sdk during processing
  • Both use auto_delete and body_parser with MultiJson + Oj (see the example below)

I still have to record some useful heap information to get a clue of what's going on.
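
For context, a worker using the options listed above might look roughly like this (the queue name and the MultiJson wiring are assumptions, not code from the thread):

require 'multi_json'
require 'oj' # MultiJson picks Oj as its adapter when it is available

class ExampleWorker
  include Shoryuken::Worker

  # auto_delete removes the message after a successful perform;
  # body_parser delegates parsing of the raw body to MultiJson.load
  shoryuken_options queue: 'default',
                    auto_delete: true,
                    body_parser: MultiJson

  def perform(sqs_msg, body)
    # body arrives here already parsed into a Hash
  end
end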

@phstc
Collaborator

phstc commented Apr 13, 2015

Hm, I will try to run that with Rails, because running with Ruby 2.2.1 the real memory size was consistently around 30MB (better than with 2.0).

@sschepens
Contributor Author

Ok, I'm using Ruby 2.2.0 and it seems to have a memory leak in symbol GC: https://bugs.ruby-lang.org/issues/10686
Gonna try updating to 2.2.1 to see if it gets fixed.

@curtislinden

We are also experiencing memory run-out problems

We are running a worker process in docker with a memory cap of 500MB.

We have filled our queue with fake data that can be partially processed and will generate an exception after a few lines of code. This keeps the worker busy and makes it easier to monitor its memory usage behavior.

After 12 hours the process hits the memory cap and is shutdown.

We are running MRI Ruby 2.1.5 and a Rails 4.1 environment with Shoryuken 1.0.1.

The same issue occurs with JRuby 9.0.0.0.pre1 and Shoryuken 1.0.2.

Essentially the process grows over time, then hits some upper bound, and that's all she wrote.

@phstc
Collaborator

phstc commented Apr 18, 2015

I'm using Ruby 2.2.0 and it seems to have a memory leak in symbol GC: https://bugs.ruby-lang.org/issues/10686

@curtislinden could you try ruby 2.2.1?

@sschepens
Contributor Author

I still have leaks on Ruby 2.2.1 and 2.2.2 :(
Didn't have time to get good traces of where the memory is leaking.

@curtislinden

Same issues with 2.2.2

I found a similar conversation in celluloid and have some insight to share.

I believe this will affect Shoryuken: in particular, when an exception is caught and the Celluloid actor is killed, it doesn't have its memory cleaned up. It may be true for actors that complete a task successfully as well.


celluloid/celluloid#463 (comment)

@curtislinden try to delete actor instance variables in the celluloid finalizer. The "celluloized" object is known to be never collected by GC.
You may check a w/a for DCell: niamster/dcell@61d4e0d
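
A minimal sketch of that workaround with a Celluloid 0.16-style actor (the class and instance variables here are illustrative, not Shoryuken's actual processor):

require 'celluloid'

class Processor
  include Celluloid

  # Celluloid runs this when the actor terminates, normally or after a crash
  finalizer :release_references

  def process(sqs_msg)
    @sqs_msg = sqs_msg
    # ... do the work ...
  end

  private

  # Drop every reference held by the "celluloided" object so it can be GC'd,
  # following the DCell workaround linked above
  def release_references
    instance_variables.each { |ivar| instance_variable_set(ivar, nil) }
  end
end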

@sschepens
Contributor Author

The problem, I believe, is that Shoryuken doesn't recreate actors; they are long-lived and only recreated if an exception happened (and only workers, I believe).

@curtislinden

Since Celluloid was written in the spirit of Erlang, I assume that "fail-fast" and "let it die" were good patterns to follow with our workers. In that spirit, when our workers can't process a job, they throw exceptions for various cases. In practice these exceptions are common but temporary.

Since we are throwing exceptions often with our current data set, I suspect, given @sschepens' observation about actor recreation, that exception handling in Shoryuken may be the place to add cleanup.

@sschepens
Contributor Author

I know I've seen exceptions being swallowed in Shoryuken, for example here: fetcher
Also, the watchdog annotation swallows exceptions and only logs them, and it seems to be used a lot.

Don't know if there are more occurrences of these, and I'm not really sure if those are causing issues.
Maybe Shoryuken should just let exceptions kill all actors and use supervise or Celluloid::SupervisionGroup to autostart actors.
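
For illustration, a supervision setup along those lines could look like this with Celluloid 0.16 (the actor names come from the repo, but this is only a sketch, not how Shoryuken is actually wired):

require 'celluloid'

# Crashed actors are respawned by the group instead of exceptions being
# swallowed and dead actors lingering around
class RunnerGroup < Celluloid::SupervisionGroup
  supervise Shoryuken::Fetcher, as: :fetcher
  supervise Shoryuken::Manager, as: :manager
end

RunnerGroup.run! # boots the actors and restarts any that die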

@sschepens
Contributor Author

@phstc @curtislinden @leemhenson
Just wanting to let you guys know that I created a new project. I took some code from Shoryuken but simplified the architecture. It's not a fork because some things are just a complete rewrite.
I saw Shoryuken as too complex for what it does or should do (in my opinion :)) and was trying to hunt for possible memory leaks, so I made a few simplifications:

  • I removed priority-based load balancing
  • Spawn one long-polling fetcher per queue
  • Fetchers get messages according to the number of idle processors and wait until at least one is available before getting more messages.
  • Added a configurable pool of processors per queue, which should give a desirable amount of load balancing.
  • Added a new actor, Scheduler, which is now spawned for each processor; its task is just to take care of the processor's timers (when a processor is processing, it's most definitely doing blocking calls, so it can't trigger timers while it's processing; this breaks auto_visibility_timeout, see the sketch after this list). I wanted to go for one unique Scheduler, but the Amazon SDK uses Net::HTTP, which is blocking, so multiple timers would not trigger concurrently, leaving a good chance of breaking auto_visibility_timeout.
  • Usage of the Actor Registry instead of passing actors on initialization
  • Extensive usage of linking and supervising; every actor is restarted when an exception is raised.
  • Actors free instance variables on shutdown to prevent memory leaks.
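
A rough sketch of the per-processor Scheduler idea mentioned in the list (the SQS call and method names are assumptions; the point is that the timer lives outside the blocking processor):

class Scheduler
  include Celluloid

  # Keep extending the message's visibility while the paired processor is busy
  # with blocking work, so auto_visibility_timeout keeps working
  def keep_visible(sqs_msg, interval)
    @timer = every(interval) do
      sqs_msg.change_visibility(visibility_timeout: interval * 2)
    end
  end

  def done
    @timer.cancel if @timer
  end
end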

I've also removed some code which I didn't need right now, such as the ActiveJob integration.

Anyway, I have been testing and my processes have now been running for more or less a week, and I have not had any memory issues.
The project is not documented at all right now and has no tests, but I just wanted to let you know that it exists.
Please tell me if you have any issues with me taking code from Shoryuken, and if you like it, maybe we can work on merging both of them.

Cheers!
Sebastian

@phstc
Collaborator

phstc commented Apr 28, 2015

I know I've seen exceptions being swallowed in Shoryuken, for example here: fetcher
Also, the watchdog annotation swallows exceptions and only logs them, and it seems to be used a lot.

Don't know if there are more occurrences of these, and I'm not really sure if those are causing issues.
Maybe Shoryuken should just let exceptions kill all actors and use supervise or Celluloid::SupervisionGroup to autostart actors.

Fetcher exceptions should be rare. And if a processor dies, it will be handled by manager/trap_exit.

I totally believe you, and I'm sure this issue is happening for you. But I couldn't reproduce it in my production and development environments.

Please tell me if you have any issues with me taking code from Shoryuken, and if you like it, maybe we can work on merging both of them.

👍 for the initiative! Feel free to copy any code you need! I did the same with Sidekiq. Remember to keep the LGPLv3 License.

@sschepens @curtislinden as you can reproduce the issue, could you try 🙏

# https://github.com/phstc/shoryuken/blob/master/lib/shoryuken/manager.rb#L78-79
@threads.delete(processor.object_id)
@busy.delete processor
remove_instance_variable processor

???

Following niamster/dcell@61d4e0d?

@curtislinden

Hey! Yeah, we can try this.


@phstc
Collaborator

phstc commented May 8, 2015

hey @curtislinden did you have a chance to try that?

@curtislinden

Hey - super interested in trying this out - just have to get some time set aside for this. I'll see if I can get to it today.


@curtislinden

Hi - tried it out and it appears that processor isn't an instance variable, and also remove_instance_variable is not an expected method... I figured this might be because of the 'poetic mode', but it doesn't make sense to call remove_instance_variable(:@processor) as processor is a local var.

2015-05-08T20:42:57.461708742Z 20:42:57 worker.1 | 2015-05-08T20:42:57Z 14 TID-otkh73en8 ERROR: /usr/local/bundle/gems/celluloid-0.16.0/lib/celluloid/proxies/sync_proxy.rb:23:in `method_missing'
2015-05-08T20:42:57.461798959Z 20:42:57 worker.1 | /usr/local/bundle/gems/celluloid-0.16.0/lib/celluloid/proxies/sync_proxy.rb:18:in `respond_to?'
2015-05-08T20:42:57.462189657Z 20:42:57 worker.1 | /usr/local/bundle/bundler/gems/shoryuken-49574533fa26/lib/shoryuken/manager.rb:80:in `remove_instance_variable'

@phstc
Collaborator

phstc commented May 10, 2015

Good point! The processors are elements inside the arrays @ready and @busy (which are instance variables).

The processors flow:

Based on what I understood from this thread, the processors might be kept in memory when they die, increasing the memory usage. The remove_instance_variable was an attempt to make sure that we remove them from memory when they die. But we probably need to find a different solution for that.
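
A minimal sketch of what that cleanup could look like instead (an assumption, not the eventual fix; build_processor is a made-up helper): make sure every collection holding the dead processor drops its reference so the GC can reclaim it.

# Hypothetical manager-side handler for a processor that crashed
def processor_died(processor, reason)
  logger.info "Processor died: #{reason}"

  @busy.delete processor               # forget the crashed actor...
  @threads.delete processor.object_id  # ...and its thread bookkeeping entry

  @ready << build_processor            # spawn a replacement instead of reusing the dead actor
end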

@curtislinden

Interesting... I think the hypothesis is indeed that the processors might be kept in memory after they die.

This is an interesting problem.


@phstc
Collaborator

phstc commented Jan 21, 2016

@Senjai Not resolved yet. But this is tricky; for example, I couldn't reproduce it. Did you? I closed it because of the lack of activity on this thread, but maybe we can re-open it - do you have thoughts on it?

@sschepens
Contributor Author

I believe Sidekiq is in the process of migrating, or has already migrated, to concurrent-ruby instead of Celluloid, because it was having similar issues; Rails 5 is also going to depend on concurrent-ruby.
I did a small SQS consumer project using concurrent-ruby and found no issues whatsoever.
@phstc maybe you should consider dropping Celluloid.
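
To make the suggestion concrete, a consumer built on concurrent-ruby can lean on a plain thread pool instead of actors (a rough sketch with made-up names such as fetch_batch, not Shoryuken code):

require 'concurrent'

pool = Concurrent::FixedThreadPool.new(5) # matches a concurrency of 5

loop do
  messages = fetch_batch(queue)           # hypothetical SQS receive call
  messages.each do |sqs_msg|
    pool.post { MyWorker.new.perform(sqs_msg, sqs_msg.body) }
  end
end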

@Senjai
Contributor

Senjai commented Jan 21, 2016

@phstc We've run into the issue. We don't really care if Heroku kills our dyno due to it, as our system is designed around that possibility, so it's rather low priority. But it still appears to be a problem. 👍 for dropping Celluloid though, but that's a larger discussion.

@gustavogsimas

Any progress on this? I currently have the same issue; Shoryuken gets to 60% of memory usage just by listening for new messages in SQS...

@phstc
Collaborator

phstc commented Oct 7, 2016

@twizzzlers there's a big chance it's more related to what you are doing in your worker rather than to Shoryuken itself. What are you processing?

@gustavogsimas

I'm executing some jobs with normal database functions; they could be heavy while executing. But my question is why the memory is so high even when it's only listening for messages?

@Senjai
Contributor

Senjai commented Oct 7, 2016

It's just Shoryuken; you can get the same issue with a worker that does nothing.

@gustavogsimas

So what are my options here? Is there a workaround for this? Should I change to Sidekiq or something else?

@phstc
Collaborator

phstc commented Oct 8, 2016

@twizzzlers you can try Sidekiq, it's using concurrent-ruby; if you do, please let me know the results. I couldn't easily reproduce the memory issue, but it's happening for sure for some people. A long time ago, when I was using Sidekiq (Celluloid implementation), I used to auto-restart Sidekiq using monit service checks, but my workers were doing some (headless) browser operations with libraries known for leaking memory.

@Senjai
Contributor

Senjai commented Oct 8, 2016

Either way, it's an issue, and until it can be nailed down to a particular library or code path, the issue should be open.

@kwokster10

I agree with @Senjai. Since this is an outstanding issue, it should remain open regardless of the amount of activity. We are seeing this in our app but I did not realize the gem had a memory leak. Our process has been dying in the perform method due to some data missing in the body that we were expecting.

Error on Heroku: ERROR: Process died, reason: undefined method `[]' for nil:NilClass. We are correcting this now but I am wondering what we can do to prevent this memory leak.

It's causing our memory quota to reach 1GB within 6 hours. See image below.

(image: shoryuken-memory-graph)

@phstc phstc reopened this Nov 17, 2016
@phstc
Collaborator

phstc commented Nov 17, 2016

@Senjai @kwokster10 I don't mind reopening it, but unfortunately I can't work on this at the moment. Not sure if we can do much with Celluloid; I guess the fix would be to migrate to concurrent-ruby as the other cool guys did.

@gustavogsimas

We are living with this issue at my company for now while we study other options, although we'd like to continue with Shoryuken since it works so much better with AWS SQS...

@phstc
Collaborator

phstc commented Dec 5, 2016

Hey!

I started to work on the concurrent-ruby migration 🍻

@phstc phstc mentioned this issue Dec 5, 2016
@mariokostelac
Contributor

mariokostelac commented Dec 13, 2016

@Senjai @twizzzlers @kwokster10 @gshutler are you using any middleware? From shoryuken or any other repo?

@mariokostelac
Contributor

Does not matter. I found an issue. It was not Celluloid, it was not Ruby (or at least not just Celluloid and not just Ruby); it was the way we handled the bookkeeping of thread objects.
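
A simplified illustration of that kind of bookkeeping leak (not the actual patch): if entries keyed by the processor are only removed on the happy path, the thread objects pile up every time a processor crashes.

@threads = {} # object_id => Thread

def assign(processor, thread)
  @threads[processor.object_id] = thread
end

def processor_done(processor)
  @threads.delete(processor.object_id)
end

def processor_died(processor, reason)
  # if this path forgets to delete the entry, the Thread (and whatever it
  # references) stays reachable forever -> slow, unbounded memory growth
  @threads.delete(processor.object_id)
end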

@phstc
Collaborator

phstc commented Dec 23, 2016

@mariokostelac amazing work on this 👏

Shoryuken 2.1.2 is out with your fix. For more details: CHANGELOG.

@kwokster10

@mariokostelac I don't believe we are using any middleware, only our custom WorkerClass.

@mariokostelac
Contributor

mariokostelac commented Jan 6, 2017

@kwokster10 It does not matter. We've explained what the problem was and how we fixed it. Thanks for replying.

@phstc
Collaborator

phstc commented Mar 12, 2017

Hey all,

Shoryuken v3 is out 🎉 using concurrent-ruby instead of Celluloid; it should fix this leaking issue. Based on this (thanks @paul), the overall memory footprint is lower and more consistent.

Besides that, I added some cool/handy SQS commands, such as ls to list queues and messages (available/in flight), mv to move messages from one queue to another, dump to dump messages, and requeue to enqueue messages from a dump file. Have a look at shoryuken help sqs for some guidance.

Full CHANGELOG.

@elsurudo
Contributor

That's great! Will have to try this one out. Btw @phstc, does Shoryuken use semver? i.e. when you incremented the major version, does that mean it's not backwards-compatible?

@phstc
Collaborator

phstc commented Mar 13, 2017

Hi @elsurudo

Shoryuken uses semver. Version 3.0.0 introduces some incompatible API changes depending on how you use Shoryuken. For instance, deprecation warnings and setting up AwsConfig through Shoryuken were removed. For most projects, I believe people can just bump Shoryuken.

@phstc
Collaborator

phstc commented Jul 3, 2017

Hi all

Just released a new version of Shoryuken 3.1.0 with a new concurrent-ruby implementation using Futures and Promises.

Shoryuken now also supports processing groups (set of queues), which means it supports concurrency, exclusive fetcher, long polling etc per group 🤘
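
A rough sketch of that Futures/Promises style using concurrent-ruby's Promise API (fetcher, process_async and friends are illustrative names, not Shoryuken's internals):

require 'concurrent'

fetch = Concurrent::Promise.execute { fetcher.fetch(queue, batch_size) }

fetch
  .then { |messages| messages.each { |sqs_msg| process_async(sqs_msg) } }
  .rescue { |error| logger.error("Fetch failed: #{error.message}") }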
