Houdini "java.net.SocketException: Too many open files" #978

Closed
seth-shaw-unlv opened this issue Nov 15, 2018 · 12 comments

@seth-shaw-unlv (Contributor)

Trying to create derivatives for a new collection of 1734 images is failing. Camel is reporting a SocketException: "Too many open files":

2018-11-15 15:40:22,854 | DEBUG | nnector-houdini] | DefaultErrorHandler              | 84 - org.apache.camel.camel-core - 2.19.2 | Failed delivery for (MessageId: ID-dams-37173-1542151183356-7-22070 on ExchangeId: ID-dams-37173-1542151183356-7-22060). On delivery attempt: 10 caught: java.net.SocketException: Too many open files
2018-11-15 15:40:22,855 | ERROR | nnector-houdini] | DefaultErrorHandler              | 84 - org.apache.camel.camel-core - 2.19.2 | Failed delivery for (MessageId: ID-dams-37173-1542151183356-7-22070 on ExchangeId: ID-dams-37173-1542151183356-7-22060). Exhausted after delivery attempt: 11 caught: java.net.SocketException: Too many open files. Processed by failure processor: FatalFallbackErrorHandler[Channel[Log(ca.islandora.alpaca.connector.houdini.HoudiniConnector)[Error connecting generating derivative with Houdini: ${exception.message}

${exception.stacktrace}]]]

Message History
---------------------------------------------------------------------------------------------------------------------------------------
RouteId              ProcessorId          Processor                                                                        Elapsed (ms)
[IslandoraConnector] [IslandoraConnector] [activemq://queue:islandora-connector-houdini                                  ] [     16975]
[IslandoraConnector] [unmarshal6        ] [unmarshal[org.apache.camel.model.dataformat.JsonDataFormat@28a80bd5]          ] [         1]
[IslandoraConnector] [setProperty15     ] [setProperty[event]                                                            ] [         0]
[IslandoraConnector] [removeHeaders10   ] [removeHeaders[*]                                                              ] [         0]
[IslandoraConnector] [setHeader35       ] [setHeader[CamelHttpMethod]                                                    ] [         0]
[IslandoraConnector] [setHeader36       ] [setHeader[Accept]                                                             ] [         0]
[IslandoraConnector] [setHeader37       ] [setHeader[X-Islandora-Args]                                                   ] [         0]
[IslandoraConnector] [setHeader38       ] [setHeader[Apix-Ldp-Resource]                                                  ] [         0]
[IslandoraConnector] [setBody12         ] [setBody[simple{Simple: ${null}}]                                              ] [         0]
[IslandoraConnector] [to42              ] [{{houdini.convert.url}}                                                       ] [      6946]
[IslandoraConnector] [removeHeaders11   ] [removeHeaders[*]                                                              ] [         0]
[IslandoraConnector] [setHeader39       ] [setHeader[Content-Location]                                                   ] [         0]
[IslandoraConnector] [setHeader40       ] [setHeader[CamelHttpMethod]                                                    ] [         1]
[IslandoraConnector] [toD7              ] [                                                                              ] [     10027]
[IslandoraConnector] [log13             ] [log                                                                           ] [         1]

Stacktrace
---------------------------------------------------------------------------------------------------------------------------------------
java.net.SocketException: Too many open files
	at java.net.Socket.createImpl(Socket.java:460)[:1.8.0_191]
	at java.net.Socket.getImpl(Socket.java:520)[:1.8.0_191]
	at java.net.Socket.setSoTimeout(Socket.java:1141)[:1.8.0_191]
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:120)[158:org.apache.httpcomponents.httpclient:4.5.3]
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)[158:org.apache.httpcomponents.httpclient:4.5.3]
	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)[158:org.apache.httpcomponents.httpclient:4.5.3]
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)[158:org.apache.httpcomponents.httpclient:4.5.3]
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)[158:org.apache.httpcomponents.httpclient:4.5.3]
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)[158:org.apache.httpcomponents.httpclient:4.5.3]
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)[158:org.apache.httpcomponents.httpclient:4.5.3]
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)[158:org.apache.httpcomponents.httpclient:4.5.3]
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)[158:org.apache.httpcomponents.httpclient:4.5.3]
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)[158:org.apache.httpcomponents.httpclient:4.5.3]
	at org.apache.camel.component.http4.HttpProducer.executeMethod(HttpProducer.java:334)[86:org.apache.camel.camel-http4:2.19.2]
	at org.apache.camel.component.http4.HttpProducer.process(HttpProducer.java:193)[86:org.apache.camel.camel-http4:2.19.2]
	at org.apache.camel.util.AsyncProcessorConverterHelper$ProcessorToAsyncProcessorBridge.process(AsyncProcessorConverterHelper.java:61)[84:org.apache.camel.camel-core:2.19.2]
	at org.apache.camel.processor.SendDynamicProcessor$1.doInAsyncProducer(SendDynamicProcessor.java:124)[84:org.apache.camel.camel-core:2.19.2]
	at org.apache.camel.impl.ProducerCache.doInAsyncProducer(ProducerCache.java:436)[84:org.apache.camel.camel-core:2.19.2]
	at org.apache.camel.processor.SendDynamicProcessor.process(SendDynamicProcessor.java:119)[84:org.apache.camel.camel-core:2.19.2]
	at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:77)[84:org.apache.camel.camel-core:2.19.2]
	at org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:541)[84:org.apache.camel.camel-core:2.19.2]
	at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:198)[84:org.apache.camel.camel-core:2.19.2]
	at org.apache.camel.processor.Pipeline.process(Pipeline.java:120)[84:org.apache.camel.camel-core:2.19.2]
	at org.apache.camel.processor.Pipeline.process(Pipeline.java:83)[84:org.apache.camel.camel-core:2.19.2]
	at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:198)[84:org.apache.camel.camel-core:2.19.2]
	at org.apache.camel.processor.DelegateAsyncProcessor.process(DelegateAsyncProcessor.java:97)[84:org.apache.camel.camel-core:2.19.2]
	at org.apache.camel.component.jms.EndpointMessageListener.onMessage(EndpointMessageListener.java:112)[90:org.apache.camel.camel-jms:2.19.2]
	at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:721)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
	at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:681)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
	at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:651)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
	at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:313)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
	at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:251)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
	at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1164)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
	at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1156)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
	at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1053)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[:1.8.0_191]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[:1.8.0_191]
	at java.lang.Thread.run(Thread.java:748)[:1.8.0_191]
2018-11-15 15:40:22,855 | DEBUG | nnector-houdini] | Pipeline                         | 84 - org.apache.camel.camel-core - 2.19.2 | Message exchange has failed: so breaking out of pipeline for exchange: Exchange[ID-dams-37173-1542151183356-7-22060] Exception: java.net.SocketException: Too many open files
2018-11-15 15:40:22,855 | WARN  | nnector-houdini] | EndpointMessageListener          | 84 - org.apache.camel.camel-core - 2.19.2 | Execution of JMS message listener failed. Caused by: [org.apache.camel.RuntimeCamelException - java.net.SocketException: Too many open files]
org.apache.camel.RuntimeCamelException: java.net.SocketException: Too many open files
	at org.apache.camel.util.ObjectHelper.wrapRuntimeCamelException(ObjectHelper.java:1774)[84:org.apache.camel.camel-core:2.19.2]
... trimmed

The Houdini log doesn't show any problems, other than stopping abruptly a while ago (presumably because no more calls were being issued).

It looks like either we aren't closing connections or we need to rate-limit requests somehow.

Unfortunately, I am heading out of the office now and won't be back until 2018-11-26. I'll try to debug some more then.

@whikloj (Member) commented Nov 16, 2018

Seems like maybe Drupal is too slow compared to convert, and we need to limit the number of messages it tries to process at one time. Perhaps setting a limit on the number of ActiveMQ consumers is the way to go.

@dannylamb (Contributor) commented Nov 16, 2018

@seth-shaw-unlv @whikloj I need to verify we're not leaking files first, which is difficult. I've been pretty careful about it and I'd like to say it's not that, but it's the most probable reason for hitting the limit.

Regardless, we'll have to throttle in some way. Due to PHP peculiarities, we're forced to use a temp stream in CmdExecuteService for output, which opens a file if the output is > 4 KB. So all of the derivative generation microservices are likely to hit the open file limit under load, and if everything's on the same box, then... 💣💣💣

As @whikloj suggests, we'll want to make sure messages are not being processed in parallel with ActiveMQ. If we still hit the limit after dropping the number of concurrent consumers to 1, then we'll have to use Camel's Throttler on the derivative generation routes. Basically, just throw a .throttle(100) at the beginning of the route.
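
For illustration only, here is a minimal sketch of that kind of throttled route in Camel's Java DSL. The queue name matches the one in the logs above, but the converter URL and class name are placeholders, not the actual Alpaca route:

    import org.apache.camel.builder.RouteBuilder;

    public class ThrottledHoudiniRoute extends RouteBuilder {
        @Override
        public void configure() throws Exception {
            from("activemq:queue:islandora-connector-houdini")
                // Let at most 100 exchanges through per time period
                // (the Throttler's default period is 1000 ms).
                .throttle(100)
                .to("http4://localhost:8000/convert"); // placeholder Houdini endpoint
        }
    }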

@whikloj (Member) commented Nov 16, 2018

To slow my derivative-toolkit down and make it more careful, I set up a pool of ActiveMQ consumers, which limits how many messages we open at a time.

I used this block, then left the ${concurrent.consumers} as a configurable option so you can increase or decrease the number of consumers.
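
The toolkit itself does this in blueprint XML; a rough Java DSL equivalent of the same idea, with placeholder names, might look like the following (the property name and log endpoint are just stand-ins):

    import org.apache.camel.builder.RouteBuilder;

    public class PooledConsumerRoute extends RouteBuilder {
        @Override
        public void configure() throws Exception {
            // {{concurrent.consumers}} resolves from a properties/blueprint file;
            // setting it to 1 effectively serializes processing on this route.
            from("activemq:queue:islandora-connector-houdini"
                    + "?concurrentConsumers={{concurrent.consumers}}")
                .to("log:ca.islandora.derivatives?level=INFO"); // placeholder processing step
        }
    }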

@whikloj (Member) commented Nov 16, 2018

I'll also note that I discovered that, by default, each ActiveMQ consumer pre-fetches about 1,000 messages. I had a problem with OpenStack clients losing their connection before they could push work back, so I limited the prefetch to 1; that way, if a consumer failed I lost 1 message instead of 1,000.

blueprint config - https://github.com/whikloj/islandora-1x-derivative-toolkit/blob/master/islandora-1x-derivative-worker/src/main/resources/OSGI-INF/blueprint/blueprint.xml#L17
Add the parameter to the route - https://github.com/whikloj/islandora-1x-derivative-toolkit/blob/master/islandora-1x-derivative-worker/src/main/resources/OSGI-INF/blueprint/blueprint.xml#L17
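
For context, here is a hedged sketch of one way to set that prefetch limit from Java when you build the ActiveMQ connection factory yourself (the broker URL is a placeholder; the same effect can also be had with the jms.prefetchPolicy.queuePrefetch option on the broker URL):

    import org.apache.activemq.ActiveMQConnectionFactory;

    public class LowPrefetchFactory {
        public static ActiveMQConnectionFactory create() {
            // Placeholder broker URL; substitute your own.
            ActiveMQConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://localhost:61616");
            // Prefetch one message at a time instead of the default 1,000, so a
            // consumer that dies strands at most one message.
            factory.getPrefetchPolicy().setQueuePrefetch(1);
            return factory;
        }
    }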

@dannylamb (Contributor)

@whikloj Yeah, I've been bitten by ActiveMQ's prefetch before, too. Thanks for bringing that up and for the examples. It's pretty straightforward from those.

@seth-shaw-unlv (Contributor, Author)

So, I did some more looking around at where others have run into this type of error. The near-universal solution was to increase the system's maximum number of open file descriptors. Several blog posts cover this; the Nuxeo one is probably the simplest and isn't cluttered with ads.

In short:

  1. sudo vi /etc/security/limits.conf
  2. Add * soft nofile 4096 and * hard nofile 8192 to the bottom. The stars can be replaced with the specific users whose limits you want to change (karaf seems to be the biggest offender), but nobody explained why you wouldn't want to apply it to everyone.
  3. Restart the system so that all users have the new limit applied.
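
For reference, the resulting lines at the bottom of /etc/security/limits.conf (using the values suggested above) would look like this; you can confirm the new limit after logging back in with ulimit -n:

    # /etc/security/limits.conf
    *   soft   nofile   4096
    *   hard   nofile   8192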

I made the stated changes to our CentOS 7 box and kicked off the media migrations again. The migration completed without any Camel errors. The ActiveMQ islandora-connector-houdini queue shows that Houdini is still chugging happily along, progressively generating service images, and thumbnail messages are being added to the queue without problems.

@seth-shaw-unlv (Contributor, Author) commented Nov 29, 2018

Basically, what it boils down to is too many services running on the same machine, each with its own open files and socket connections. It just gets too crowded under the default system settings. So either allow users more open file descriptors or subdivide your setup across multiple boxes.

I think we can probably close this issue with the workaround. The prefetch and pooling are probably a separate ticket. Thoughts on that @dannylamb and @whikloj ?

@dannylamb (Contributor)

@seth-shaw-unlv That's great news. Proactively, we could bake that into claw-playbook, or maybe the karaf role directly since it's the biggest offender? I imagine even if you subdivide your setup across multiple boxes, you'll still wanna max out those numbers. Otherwise, I'd consider this a documentation issue, I guess, although I'm not sure where the best place to put it would be.

Prefetch and pooling will definitely come into play later if we ever reach the new limits, so a separate ticket is a good idea.

@seth-shaw-unlv (Contributor, Author)

After running into this again, I have some additional notes:

Using the * in limits.conf will not apply to root. To adjust root, give it its own lines:

root soft nofile 900000
root hard nofile 900000

Also, karaf has been using more than 500k open "files" at a time. I didn't realize this at first because I was running lsof as an unprivileged user, which showed karaf with < 200 open files; running the command as root revealed karaf using ~512k. I'm hoping the new 900k limit will avoid any more problems.

@seth-shaw-unlv (Contributor, Author)

Nope, still hitting the wall. It looks like we have a problem with connections not closing. In a recent test I started seeing the errors again, so I checked lsof. Karaf had 518,906 file descriptors open; 460,320 of them (88.7%) were TCP connections in the CLOSE_WAIT state. That means karaf (or the code it is executing) is holding onto connections much longer than it should. Which connections these are (Houdini, Fedora, Drupal, ...) is unknown.
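
For anyone trying to reproduce these numbers, here is a rough sketch of the commands involved; it assumes you run lsof as root (an unprivileged run undercounts, as noted earlier) and that the pgrep pattern actually matches your Karaf JVM:

    # Find the Karaf JVM and count its descriptors / CLOSE_WAIT sockets.
    KARAF_PID=$(pgrep -f 'Dkaraf.home' | head -n 1)
    sudo lsof -p "$KARAF_PID" | wc -l
    sudo lsof -p "$KARAF_PID" | grep -c CLOSE_WAIT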

@dannylamb (Contributor)

@seth-shaw-unlv It's hard to tell from Karaf's side because Karaf connects to everything. My suspicion is the CmdExecuteService in Crayfish Commons. That would make Apache hold on to the files too long, but I'm not sure how to confirm it.

lsof wants a PID and I'm not entirely sure how to get one for Apache.

vagrant@claw:~$ ps ax | grep apache
 1092 ?        Sl     0:00 /opt/apache-karaf-4.0.8/bin/karaf-wrapper /opt/karaf/etc/karaf-wrapper.conf wrapper.syslog.ident=karaf wrapper.pidfile=/opt/apache-karaf-4.0.8/data/karaf.pid wrapper.daemonize=TRUE wrapper.lockfile=/var/lock/subsys/karaf
 1164 ?        Sl     0:28 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dkaraf.home=/opt/apache-karaf-4.0.8 -Dkaraf.base=/opt/apache-karaf-4.0.8 -Dkaraf.data=/opt/apache-karaf-4.0.8/data -Dkaraf.etc=/opt/karaf/etc -Dcom.sun.management.jmxremote -Dkaraf.startLocalConsole=false -Dkaraf.startRemoteShell=true -Djava.endorsed.dirs=/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/endorsed:/usr/lib/jvm/java-8-openjdk-amd64/lib/endorsed:/opt/apache-karaf-4.0.8/lib/endorsed -Djava.ext.dirs=/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext:/usr/lib/jvm/java-8-openjdk-amd64/lib/ext:/opt/apache-karaf-4.0.8/lib/ext -Xmx512m -Djava.library.path=/opt/apache-karaf-4.0.8/lib/wrapper/ -classpath /opt/apache-karaf-4.0.8/lib/boot/org.apache.karaf.main-4.0.8.jar:/opt/apache-karaf-4.0.8/lib/boot/org.apache.karaf.jaas.boot-4.0.8.jar:/opt/apache-karaf-4.0.8/lib/boot/org.apache.karaf.diagnostic.boot-4.0.8.jar:/opt/apache-karaf-4.0.8/lib/boot/org.osgi.core-6.0.0.jar:/opt/apache-karaf-4.0.8/lib/wrapper/karaf-wrapper.jar:/opt/apache-karaf-4.0.8/lib/wrapper/karaf-wrapper-main.jar -Dwrapper.key=KnCm_3MUMtEPLNqU -Dwrapper.port=32000 -Dwrapper.jvm.port.min=31000 -Dwrapper.jvm.port.max=31999 -Dwrapper.pid=1092 -Dwrapper.version=3.2.3 -Dwrapper.native_library=wrapper -Dwrapper.service=TRUE -Dwrapper.cpu.timeout=10 -Dwrapper.jvmid=1 org.apache.karaf.wrapper.internal.service.Main
 1476 ?        Sl     1:08 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djava.util.logging.config.file=/var/lib/tomcat8/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server -Xms512m -Xmx1024m -XX:NewSize=256m -XX:MaxNewSize=256m -XX:PermSize=256m -XX:MaxPermSize=256m -Dfcrepo.home=/opt/fcrepo -Dfcrepo.modeshape.configuration=file:///opt/fcrepo/configs/repository.json -Dfcrepo.activemq.configuration=file:///opt/fcrepo/configs/activemq.xml -Dfcrepo.spring.configuration=file:///opt/fcrepo/configs/fcrepo-config.xml -Dcom.bigdata.rdf.sail.webapp.ConfigParams.propertyFile=/opt/blazegraph/conf/RWStore.properties -Dcantaloupe.config=/opt/cantaloupe/cantaloupe.properties -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true -Djava.endorsed.dirs=/usr/share/tomcat8/endorsed -classpath /usr/share/tomcat8/bin/bootstrap.jar:/usr/share/tomcat8/bin/tomcat-juli.jar -Dcatalina.base=/var/lib/tomcat8 -Dcatalina.home=/usr/share/tomcat8 -Djava.io.tmpdir=/tmp/tomcat8-tomcat8-tmp org.apache.catalina.startup.Bootstrap start
 1510 ?        Ss     0:00 /usr/sbin/apache2 -k start
 1611 ?        S      0:00 /usr/sbin/apache2 -k start
 1612 ?        S      0:00 /usr/sbin/apache2 -k start
 1615 ?        S      0:00 /usr/sbin/apache2 -k start
 1616 ?        S      0:00 /usr/sbin/apache2 -k start
 1617 ?        S      0:00 /usr/sbin/apache2 -k start
 2350 pts/0    S+     0:00 grep --color=auto apache

That seems like a lot of starts for /usr/sbin/apache2

@dannylamb (Contributor)

Resolved via Islandora/Alpaca@67eac07
