
Massive CPU usage #266

Closed
0atman opened this issue Sep 16, 2016 · 29 comments

0atman commented Sep 16, 2016

I tried the latest version of alda today (alda update), and found that running alda up maxed out my machine's CPU and basically locked it up, with load averages over 80(!).

Here are the offending threads:

[screenshot: 2016-09-16-132039_2020x123_scrot]

What's happened in the last 10 days? My machine is very powerful, and more importantly, it didn't have this problem before.

My version:

$ alda repl
Sep 16, 2016 1:25:00 PM com.jsyn.engine.SynthesisEngine start
INFO: Pure Java JSyn from www.softsynth.com, rate = 44100, RT, V16.7.3 (build 457, 2014-12-25)
Preparing MIDI system... done.

 █████╗ ██╗     ██████╗  █████╗ 
██╔══██╗██║     ██╔══██╗██╔══██╗
███████║██║     ██║  ██║███████║
██╔══██║██║     ██║  ██║██╔══██║
██║  ██║███████╗██████╔╝██║  ██║
╚═╝  ╚═╝╚══════╝╚═════╝ ╚═╝  ╚═╝

            v1.0.0-rc38
         repl session

My machine:

$ uname -a
Linux T420 4.7.2-1-ARCH #1 SMP PREEMPT Sat Aug 20 23:02:56 CEST 2016 x86_64 GNU/Linux

My jvm:

$ java -version
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
@daveyarwood
Member

You're not alone... I've started to notice similar issues on my machine, which is also fairly powerful (it's a 2014 MacBook Pro).

The problem seems to be something about the worker processes as they're starting up. With my system monitor open, I see each worker process spike to 100-200% CPU for about 10-15 seconds, then drop to 2-3% and stay there.

I'll do some experiments with commenting out blocks of code and see if I can figure out the cause of the high CPU usage.

@0atman
Member Author

0atman commented Sep 16, 2016

Thanks Dave, maybe they are doing compilation? I don't know the details.

Maybe you could tweak the niceness of the processes while they're starting up, so they are more polite? No idea how to do that in Java, I'm afraid; maybe some JVM options?

@daveyarwood
Member

The code for forking the worker processes is here.

There is no compilation involved -- Alda basically runs from a jar file (it's wrapped in an executable which essentially just runs java -jar followed by the contents of the jar file). The jar file contains all of the compiled code for the client, server, and workers. So to start a worker process, the server runs a util function to determine the path of the program you're running (e.g. /usr/local/bin/alda), then starts a new process of the same program, but with different command-line args so that it starts up a worker.
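
As a rough illustration (not the actual alda.util code -- the names here are made up), spawning a worker that way might look something like this in Clojure:

(ns example.spawn)

;; Hypothetical sketch of the approach described above: re-invoke the same
;; executable with worker-specific args. Not the actual Alda implementation.
(defn spawn-worker!
  [program-path port]
  ;; e.g. program-path = "/usr/local/bin/alda"
  (-> (ProcessBuilder. [program-path "-p" (str port) "worker"])
      (.redirectErrorStream true)
      (.start)))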

I've been googling around and not really finding any definitive solutions for limiting CPU usage in a JVM process (apparently there isn't a convenient JVM option for that)... tweaking niceness might be something we can do.

This looks promising, though: http://stackoverflow.com/a/2877813/2338327

Apparently spawning a child process in Java causes high memory usage in general, and a workaround is to spawn one "exec server" process early on in the application, then have that secondary process be the one to spawn your external processes. Because the second process is much smaller, the external processes end up consuming less memory. It's something to do with child processes using the same amount of memory as the parent process, or something to that effect. I'll play around with this.
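
For reference, here's the idea from that answer sketched in Clojure (a hypothetical helper, not Alda code): the big JVM forks one tiny helper process early on, and from then on asks the helper to spawn external processes on its behalf.

(ns example.exec-server
  (:require [clojure.string :as str]))

;; Sketch of the tiny "exec server" process: it reads one command per line on
;; stdin and spawns it, so the large parent JVM never has to fork directly.
(defn -main [& _args]
  (doseq [line (line-seq (java.io.BufferedReader. *in*))]
    (.start (ProcessBuilder. (str/split line #"\s+")))))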

@0atman
Member Author

0atman commented Sep 16, 2016

Got it, but be careful of memory being a red herring. See my screenshot; I in no way ran out of memory:

[screenshot: 2016-09-16-132039_2020x123_scrot]

My load was 80 - EIGHTY! - before I managed to kill the server: no swapping or out-of-memory errors, and not even any I/O!

@0atman
Member Author

0atman commented Sep 16, 2016

As we are friends, I know you would tell me if you were mining bitcoin on my computer, so it can't be that...

@daveyarwood
Member

Hmm... perhaps memory is not the issue after all. I'll try the change I mentioned anyway and see if it helps at all -- if not, then back to the drawing board.

@daveyarwood
Member

I tried the ProcessHelper (intermediate process-spawning process) approach -- here's my code on a branch. I'm having trouble getting it to work, though. The child processes are not being spawned.

I don't think the issue is memory, anyway -- it's unconstrained CPU use -- so I think I'll abandon that approach.

@daveyarwood
Member

I'm not sure if we can tweak niceness in a cross-platform way, but we could at least shell out and use Unix commands to set the niceness of each worker process pid immediately after spawning it... it's better than nothing.
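
Something along these lines, perhaps (a sketch only -- renice exists on Unix-like systems, and the worker pid is assumed to already be known):

(ns example.renice
  (:require [clojure.java.shell :as sh]))

;; Sketch: lower the priority of a freshly spawned worker by shelling out to
;; the Unix `renice` command. Assumes we already know the worker's pid.
(defn renice-worker!
  [worker-pid]
  (sh/sh "renice" "-n" "10" "-p" (str worker-pid)))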

@0atman
Member Author

0atman commented Sep 16, 2016

It would be perfect to profile the code, see what is spinning, and optimise it - solve it at the source. But first we'd need to replicate it.


@daveyarwood
Member

I did take a stab at profiling using jvisualvm, but I wasn't really able to find a single source of high CPU usage... although the problem there could definitely be lack of experience with profiling on my part.

I am reproducing this issue every time I run alda up. I can see the worker processes start up in Activity Monitor, and CPU spikes to around 200% for about 10 seconds before dropping back down to a reasonable level (2-3%) and staying there.

[screenshot: Activity Monitor, 2016-09-16 11:29:08 AM]

[screenshot: Activity Monitor, 2016-09-16 11:29:19 AM]

I think I might be getting somewhere by commenting out code as a hacky attempt at profiling... will keep digging!

@daveyarwood
Member

daveyarwood commented Sep 16, 2016

Simply requiring our logging library appears to cause CPU to spike >200% in the worker processes briefly when they start up... not sure what that's all about. Maybe we could try a build where workers do no logging at all?

Currently the only time the user ever sees worker logs is if they log an error to ~/.alda/logs/error.log, but we could easily get the same functionality without requiring the whole log library.
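
Something as simple as this sketch could cover that case without the logging library (hypothetical names; Alda's actual error log format may differ):

(ns example.error-log
  (:require [clojure.java.io :as io]))

;; Sketch: append errors to ~/.alda/logs/error.log directly, no log library.
(def error-log
  (io/file (System/getProperty "user.home") ".alda" "logs" "error.log"))

(defn log-error!
  [^Throwable e]
  (io/make-parents error-log)
  (spit error-log
        (format "%s ERROR %s%n" (java.util.Date.) (.getMessage e))
        :append true))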

@refi64

refi64 commented Sep 16, 2016

@daveyarwood What if you compile the library AOT?

@daveyarwood
Member

AOT compiling does not appear to help.

@elydpg
Contributor

elydpg commented Sep 17, 2016

Hmm......

I wonder if this is why my computer is freezing beyond use? Of course, this is from leaving the server running by default, so I don't know if it's the cause of the issue.

@daveyarwood
Member

daveyarwood commented Sep 17, 2016

@elyisgreat Yikes, sorry about that! I noticed on my computer that the problem gets worse when the computer goes into sleep mode and then comes back*, but I think that high CPU usage is the root cause.

*What happens is: the existing worker processes hang around briefly while they're shutting down (see #160), and meanwhile the server is starting N more workers (4 by default, or a different number if you started the server with the --workers N option), so you have a total of N x 2 worker processes active at once. Because of the CPU usage issue, this can really slow down even a high-powered system.

I've been thinking about how to handle this in terms of Alda releases... It seems like fixing the immediate CPU usage issue is not the easiest thing in the world. One thing I can do, though, is push out a release ASAP that disables the "immediately start more workers after coming out of sleep" behavior and makes it opt-in via an option when you run alda up. Once the CPU usage issue is fixed, I can make it the default again. This will at least make it safe to have your Alda server always running in the background.

@daveyarwood
Member

1.0.0-rc39 makes the worker-cycling behavior introduced in 1.0.0-rc38 an opt-in feature. To opt in, include --cycle-workers when starting a server, i.e. alda --cycle-workers up. See the release notes for more details.

Once we are able to fix this CPU issue, we can remove the --cycle-workers option and go back to it happening by default.

@daveyarwood
Member

So far, I've been able to determine that requiring timbre and requiring instaparse both cause the CPU to spike. The way I'm testing this is by commenting out code so that only certain code paths run; each time I test, I build alda to /tmp and run /tmp/alda -p 12345 worker. I also commented out nearly all the code in the alda.worker namespace, so that all it's doing is requiring alda.now, waiting 20 seconds, and then exiting.

I'm not sure yet what we can do about this. We definitely need instaparse, at least. I wonder if there is some way we can get the Java process to take its time loading the dependencies, so it doesn't blow through all the CPU at once?

@daveyarwood
Member

I created a branch with this change to alda.worker where it just loads timbre and instaparse, sleeps for 20 seconds, then exits. With the requires, I'm seeing my CPU spike to 150-200%. Without them, CPU usage is < 5%.
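
For anyone who wants to reproduce this, the stripped-down worker looks approximately like the following (an approximation of the idea, not the exact branch code):

(ns example.worker-test
  (:require [taoensso.timbre :as log]
            [instaparse.core :as insta]))

;; The CPU spike (if any) happens while the two requires above are loading.
(defn -main [& _args]
  (Thread/sleep 20000)
  (System/exit 0))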

@daveyarwood added this to the 1.0.0 milestone Sep 17, 2016
@daveyarwood
Member

An interesting observation: when starting 1 worker, my CPU spikes to ~30-40%. When starting 4 workers, it spikes to 175-200%. Could there be something off about the way we start a worker process?

Code for starting a worker process is here -- basically we just call alda.worker/start-worker! from Java using Clojure's Java API.

@daveyarwood
Member

daveyarwood commented Sep 18, 2016

I had an idea for a workaround this morning -- regardless of what is causing the spike in CPU, we can work around it by not trying to start up all 4 workers at the same time. We really only need 1 worker to be available for the server to be considered "up," and the other 3 workers can take their time. I don't think it will be a major inconvenience if all 4 workers are not available right away.

So what we can do is stagger the spawning of the workers, i.e.:

  • spawn worker 1
  • wait 5 seconds; spawn worker 2
  • wait 5 seconds; spawn worker 3
  • wait 5 seconds; spawn worker 4

We can do this on a background thread so it doesn't hold up the server, and this won't feel like it takes any longer than it does currently to spawn the workers. (In fact, it might even be faster, because the first worker won't be competing with 3 others for CPU.)
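
A sketch of what that staggering could look like (assuming a spawn-worker! function like the one sketched earlier; the 5-second delay and worker count come from the list above, not from the actual Alda implementation):

;; Sketch: spawn n workers on a background thread, 5 seconds apart, so the
;; server call returns immediately and workers don't all compete for CPU.
(defn spawn-workers-staggered!
  [n spawn-worker!]
  (future
    (dotimes [i n]
      (when (pos? i)
        (Thread/sleep 5000))
      (spawn-worker! i))))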

@0atman
Member Author

0atman commented Sep 18, 2016

nice workaround!

@daveyarwood
Member

I also think the default number of workers should maybe be 2 instead of 4. For the most part, you shouldn't need to play more than 1 score at a time, and having 2 workers allows you to start playing the next score as the one before it is ending.

(Remember that you can always add more workers when starting the server by running e.g. alda --workers 4 up)

@daveyarwood
Member

daveyarwood commented Sep 18, 2016

I believe 1.0.0-rc41 fixes this issue. @0atman @kirbyfan64 @elyisgreat please test and let me know what you think!

To summarize the changes in this release:

  • There is now a 10 second delay between starting workers. The server is still considered "up" as soon as there is at least 1 worker available.
  • The default number of workers is now 2.
  • The server now checks on (and adjusts) the number of workers every 60 seconds, instead of every 30 seconds. (This is not exactly relevant to the high CPU usage, but worth mentioning)
  • Removed the --cycle-workers option -- cycling workers is now the default behavior again, like it was prior to 1.0.0-rc39.

@daveyarwood self-assigned this Sep 18, 2016
@0atman
Member Author

0atman commented Sep 19, 2016

Awesome, rc41 works great on my Linux machine.

However, it doesn't spawn workers on my Windows machine:

λ alda up
[27713] Starting Alda server...
[27713] Server up ?
[27713] Starting worker processes...
[ this hangs, I waited for 5 minutes ]
^C

λ alda status
[27713] Server up (0/2 workers available)

@daveyarwood
Member

That's strange... was it working before this release? I can't think of what might have changed between 1.0.0-rc38 and 41 that would affect workers spawning on Windows.

@0atman
Member Author

0atman commented Sep 19, 2016

Ah, I had upgraded from rc31, sorry I can't be more specific!

Interestingly, alda repl works fine.

@daveyarwood
Member

It makes sense that the REPL works -- the REPL doesn't use a server or worker processes at all; it's just a standalone program.

Can you try alda -p 12345 worker and see if any errors come up? 12345 is a made-up number, but we can at least see if the workers are crashing right when they start up.

@daveyarwood
Member

daveyarwood commented Sep 20, 2016

I'm going to close this -- I think the problem is solved by 1.0.0-rc41, and workers not spawning on Windows seems like a separate issue. @0atman I'll be happy to continue to troubleshoot that with you -- I want to make sure Alda works on Windows!

@elydpg
Contributor

elydpg commented Sep 20, 2016

Makes sense. If I'm still seeing freezing issues I will comment here.
