Houdini "java.net.SocketException: Too many open files" #978
Comments
It seems like Drupal may be too slow compared to convert, and we need to limit the number of messages it tries to process at one time. Perhaps setting a limit on the number of ActiveMQ consumers is the way to go.
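Not the project's actual route, but a minimal sketch of how that cap could look on the Camel side; the downstream endpoint is a placeholder, while the queue name is the one mentioned later in this thread:

```java
import org.apache.camel.builder.RouteBuilder;

// Sketch: restrict the route to a single JMS consumer so messages are
// pulled off the islandora-connector-houdini queue one at a time.
public class HoudiniConsumerCap extends RouteBuilder {
    @Override
    public void configure() {
        from("activemq:queue:islandora-connector-houdini"
                + "?concurrentConsumers=1&maxConcurrentConsumers=1")
            .to("http://localhost:8000/convert"); // placeholder for the Houdini endpoint
    }
}
```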
@seth-shaw-unlv @whikloj I need to verify we're not leaking files first, which is difficult. I've been pretty careful about it and I'd like to say it's not that, but it's the most probable reason for hitting the limit. Regardless, we'll have to throttle in some way. Due to PHP peculiarities, we're forced to use a temp stream. As @whikloj suggests, we'll want to make sure messages are not being processed in parallel with ActiveMQ. If we still hit the limit after dropping the number of concurrent consumers to 1, then we'll have to use Camel's Throttler on the derivative generation routes. Basically, just throw a throttle step into each of them.
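A sketch of what that throttle could look like, building on the single-consumer endpoint above; the rate and the downstream endpoint are illustrative, not values taken from Alpaca:

```java
import org.apache.camel.builder.RouteBuilder;

// Sketch: cap the derivative route at roughly 5 exchanges per second,
// no matter how quickly messages arrive from the queue.
public class HoudiniThrottle extends RouteBuilder {
    @Override
    public void configure() {
        from("activemq:queue:islandora-connector-houdini")
            .throttle(5).timePeriodMillis(1000)  // at most 5 messages per second
            .to("http://localhost:8000/convert"); // placeholder for the Houdini endpoint
    }
}
```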
To slow my derivative-toolkit down and make it more careful, I set up a pool of ActiveMQ consumers, which limits how many we open at a time. I used a block in my blueprint config for this.
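Not the blueprint block itself (the actual config is linked in the next comment), but a rough Java sketch of the same idea; the broker URL and pool size are assumptions:

```java
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.pool.PooledConnectionFactory;

// Sketch: wrap the ActiveMQ connection factory in a pool so only a bounded
// number of connections (and the consumers on them) can be open at once.
public class PooledFactory {
    public static PooledConnectionFactory create() {
        ActiveMQConnectionFactory amq =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        PooledConnectionFactory pooled = new PooledConnectionFactory(amq);
        pooled.setMaxConnections(8); // cap on simultaneous broker connections
        return pooled;
    }
}
```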
I'll also note that I discovered that, by default, each ActiveMQ consumer pre-fetches about 1,000 messages. I had a problem with OpenStack clients losing the connection to push work back, so I limited the pre-fetch to 1; that way, if a client failed, I lost 1 message instead of 1,000. Blueprint config: https://github.com/whikloj/islandora-1x-derivative-toolkit/blob/master/islandora-1x-derivative-worker/src/main/resources/OSGI-INF/blueprint/blueprint.xml#L17
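The linked blueprint sets this in XML; for reference, here is a sketch of the same setting on the Java side (the broker URL is an assumption, the prefetch API is standard ActiveMQ):

```java
import org.apache.activemq.ActiveMQConnectionFactory;

// Sketch: drop the consumer prefetch from the ~1,000-message default to 1,
// so a failed worker orphans a single message rather than a thousand.
public class PrefetchOneFactory {
    public static ActiveMQConnectionFactory create() {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        factory.getPrefetchPolicy().setAll(1); // applies to queue and topic consumers
        return factory;
    }
}
```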
@whikloj Yeah, I've been bitten by ActiveMQ's prefetch before, too. Thanks for bringing that up and for the examples; it's pretty straightforward from there.
So, I did some more looking around at where others have run into this type of error. The near-universal solution was to increase the system's maximum number of open file descriptors. Several blog posts talk about this, although the Nuxeo one is probably the simplest and not cluttered by ads. In short, you raise the per-user and system-wide open-file limits.
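A sketch of the kind of changes those posts describe; the values here are illustrative rather than the ones from the Nuxeo post:

```
# /etc/security/limits.conf -- raise the per-user open file limits
*    soft    nofile    65536
*    hard    nofile    65536

# /etc/sysctl.conf -- raise the system-wide file handle limit, then run `sysctl -p`
fs.file-max = 2097152
```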
I made the stated changes to our CentOS 7 box and kicked off the media migrations again. The media migration completed without any Camel errors. The ActiveMQ islandora-connector-houdini queue shows that Houdini is still chugging happily along, progressively generating service images, and thumbnail messages are being added to the queue without problems.
Basically, what it boils down to is having too many services running on the same machine, each with its own open files and socket connections. It just gets too crowded under the default system settings. So, either allow users more open files or subdivide your setup across multiple boxes. I think we can probably close this issue with the workaround; the prefetch and pooling are probably a separate ticket. Thoughts on that, @dannylamb and @whikloj?
@seth-shaw-unlv That's great news. Proactively, we could bake that into claw-playbook, or maybe the karaf role directly, since it's the biggest offender? I imagine even if you subdivide your setup across multiple boxes, you'll still want to max out those numbers. Otherwise I'd consider this a documentation issue, I guess, although I'm not sure where the best place to put it would be. Prefetch and pooling will definitely come into play later if we ever reach the new limits, so a separate ticket is a good idea.
After running into this again, I have some additional notes: using * in limits.conf will not apply to root. To adjust root, give it its own lines:
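Something along these lines, using the 900k value mentioned in the next comment:

```
# /etc/security/limits.conf -- root needs explicit entries; the * wildcard does not cover it
root    soft    nofile    900000
root    hard    nofile    900000
```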
Also, Karaf has been using more than 500k open "files" at a time. I didn't realize this at first because I was running lsof as an unprivileged user, which showed Karaf with fewer than 200 open files, but running the command as root revealed Karaf using ~512k. I'm hoping my new 900k limit will avoid any more problems.
Nope, still hitting the wall. It looks like we have a problem with connections not closing. In a recent test I started seeing the errors again, so I checked lsof: Karaf had 518,906 file descriptors open, and 460,320 of them (88.7%) were TCP connections in the CLOSE_WAIT state. That means Karaf (or the code it is executing) is holding onto connections much longer than it should. Which connections (Houdini, Fedora, Drupal...) is unknown.
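For reference, a sketch of the kind of lsof checks involved, assuming Karaf runs as a user named karaf (adjust -u to match your setup):

```
# run these as root; an unprivileged lsof badly undercounts (see the earlier comment)
sudo lsof -u karaf | wc -l                            # total open file descriptors
sudo lsof -u karaf -i TCP | grep CLOSE_WAIT | wc -l   # TCP connections stuck in CLOSE_WAIT
```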
@seth-shaw-unlv It's hard to tell with Karaf because Karaf connects to everything. My suspicion is the
That seems like a lot of
Resolved via Islandora/Alpaca@67eac07
Trying to create derivatives for a new collection of 1734 images is failing. Camel is reporting a "java.net.SocketException: Too many open files".
The Houdini log doesn't show any problems, other than that it stopped abruptly a while ago (presumably because no more calls were being issued).
It looks like we either aren't closing connections or need to rate-limit requests somehow.
Unfortunately, I am heading out of the office now and won't be back until 2018-11-26. I'll try to debug some more then.