Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xtdb can get into a zombie state after out-of-memory (OOM) #652

Closed
praseodym opened this issue Apr 3, 2023 · 8 comments
Closed

xtdb can get into a zombie state after out-of-memory (OOM) #652

praseodym opened this issue Apr 3, 2023 · 8 comments
Assignees
Labels
bug Something isn't working xtdb

Comments

@praseodym
Copy link
Contributor

xtdb can get into a zombie state after the JVM being out-of-memory (OOM). In my local environment with a lot of findings and OOIs, xtdb logged the following message:

11:41:04.133 [qtp1169020492-23] WARN  o.e.j.util.thread.QueuedThreadPool - 
java.lang.OutOfMemoryError: Java heap space

after which for every query the following message was logged:

11:41:06.761 [qtp1169020492-28] WARN  org.eclipse.jetty.server.HttpChannel - /_xtdb/zeef/query
java.lang.IllegalStateException: COMPLETED
        at org.eclipse.jetty.server.HttpChannelState.sendError(HttpChannelState.java:915)
        at org.eclipse.jetty.server.Response.sendError(Response.java:471)
        at juxt.clojars_mirrors.ring_jetty_adapter.v0v14v2.ring.adapter.jetty9$proxy_handler$fn__16448.invoke(jetty9.clj:59)
        at juxt.clojars_mirrors.ring_jetty_adapter.v0v14v2.ring.adapter.jetty9.proxy$org.eclipse.jetty.server.handler.AbstractHandler$ff19274a.handle(Unknown Source)
        at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
        at org.eclipse.jetty.server.Server.handle(Server.java:516)
        at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
        at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
        at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:386)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
        at java.base/java.lang.Thread.run(Unknown Source)

Octopoes logged the follow messages; note that xtdb has trouble parsing a completely valid query (Query didn't match expected structure):

[2023-04-03 11:51:48 +0000] [8] [ERROR] [client] http://crux:3000/_xtdb/zeef/query?valid-time=2023-04-03T11%3A51%3A48.433097%2B00%3A00
[2023-04-03 11:51:48 +0000] [8] [ERROR] [client] 
                {
                    :query {
                        :find [(count ?e)]
                        :in [[_object_type ...] [_scan_level ...] [_scan_profile_type ...]]
                        :where [[?e :object_type _object_type]
                                [?scan_profile :type "ScanProfile"]
                                [?scan_profile :reference ?e]
                                [?scan_profile :level _scan_level]
                                [?scan_profile :scan_profile_type _scan_profile_type]]
                    }
                    :in-args [["Finding"], [0 1 2 3 4], ["inherited" "empty" "declared"]]
                }
                
[2023-04-03 11:51:48 +0000] [8] [ERROR] [client] {"xtdb.error/error-type":"illegal-argument","xtdb.error/error-key":"query-spec-failed","xtdb.error/message":"Query didn't match expected structure","explain":{"clojure.spec.alpha/problems":[{"path":["where","triple","a"],"pred":"xtdb.codec/valid-id?","val":"object_type","via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/triple"],"in":["where",0,1]},{"path":["where","not"],"pred":"clojure.core/seq?","val":["?e","object_type","_object_type"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/not"],"in":["where",0]},{"path":["where","not-join"],"pred":"clojure.core/seq?","val":["?e","object_type","_object_type"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/not-join"],"in":["where",0]},{"path":["where","or"],"pred":"clojure.core/seq?","val":["?e","object_type","_object_type"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/or"],"in":["where",0]},{"path":["where","or-join"],"pred":"clojure.core/seq?","val":["?e","object_type","_object_type"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/or-join"],"in":["where",0]},{"path":["where","range"],"pred":["clojure.core/=",["clojure.core/count","%"],1],"val":["?e","object_type","_object_type"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/range"],"in":["where",0]},{"path":["where","rule"],"pred":"clojure.core/list?","val":["?e","object_type","_object_type"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/rule"],"in":["where",0]},{"path":["where","pred","pred"],"pred":"clojure.core/seq?","val":"?e","via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/pred"],"in":["where",0,0]},{"path":["where","triple","a"],"pred":"xtdb.codec/valid-id?","val":"type","via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/triple"],"in":["where",1,1]},{"path":["where","not"],"pred":"clojure.core/seq?","val":["?scan_profile","type","ScanProfile"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/not"],"in":["where",1]},{"path":["where","not-join"],"pred":"clojure.core/seq?","val":["?scan_profile","type","ScanProfile"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/not-join"],"in":["where",1]},{"path":["where","or"],"pred":"clojure.core/seq?","val":["?scan_profile","type","ScanProfile"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/or"],"in":["where",1]},{"path":["where","or-join"],"pred":"clojure.core/seq?","val":["?scan_profile","type","ScanProfile"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/or-join"],"in":["where",1]},{"path":["where","range"],"pred":["clojure.core/=",["clojure.core/count","%"],1],"val":["?scan_profile","type","ScanProfile"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/range"],"in":["where",1]},{"path":["where","rule"],"pred":"clojure.core/list?","val":["?scan_profile","type","ScanProfile"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/rule"],"in":["where",1]},{"path":["where","pred","pred"],"pred":"clojure.core/seq?","val":"?scan_profile","via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/pred"],"in":["where",1,0]},{"path":["where","triple","a"],"pred":"xtdb.codec/valid-id?","val":"reference","via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/triple"],"in":["where",2,1]},{"path":["where","not"],"pred":"clojure.core/seq?","val":["?scan_profile","reference","?e"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/not"],"in":["where",2]},{"path":["where","not-join"],"pred":"clojure.core/seq?","val":["?scan_profile","reference","?e"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/not-join"],"in":["where",2]},{"path":["where","or"],"pred":"clojure.core/seq?","val":["?scan_profile","reference","?e"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/or"],"in":["where",2]},{"path":["where","or-join"],"pred":"clojure.core/seq?","val":["?scan_profile","reference","?e"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/or-join"],"in":["where",2]},{"path":["where","range"],"pred":["clojure.core/=",["clojure.core/count","%"],1],"val":["?scan_profile","reference","?e"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/range"],"in":["where",2]},{"path":["where","rule"],"pred":"clojure.core/list?","val":["?scan_profile","reference","?e"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/rule"],"in":["where",2]},{"path":["where","pred","pred"],"pred":"clojure.core/seq?","val":"?scan_profile","via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/pred"],"in":["where",2,0]},{"path":["where","triple","a"],"pred":"xtdb.codec/valid-id?","val":"level","via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/triple"],"in":["where",3,1]},{"path":["where","not"],"pred":"clojure.core/seq?","val":["?scan_profile","level","_scan_level"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/not"],"in":["where",3]},{"path":["where","not-join"],"pred":"clojure.core/seq?","val":["?scan_profile","level","_scan_level"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/not-join"],"in":["where",3]},{"path":["where","or"],"pred":"clojure.core/seq?","val":["?scan_profile","level","_scan_level"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/or"],"in":["where",3]},{"path":["where","or-join"],"pred":"clojure.core/seq?","val":["?scan_profile","level","_scan_level"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/or-join"],"in":["where",3]},{"path":["where","range"],"pred":["clojure.core/=",["clojure.core/count","%"],1],"val":["?scan_profile","level","_scan_level"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/range"],"in":["where",3]},{"path":["where","rule"],"pred":"clojure.core/list?","val":["?scan_profile","level","_scan_level"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/rule"],"in":["where",3]},{"path":["where","pred","pred"],"pred":"clojure.core/seq?","val":"?scan_profile","via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/pred"],"in":["where",3,0]},{"path":["where","triple","a"],"pred":"xtdb.codec/valid-id?","val":"scan_profile_type","via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/triple"],"in":["where",4,1]},{"path":["where","not"],"pred":"clojure.core/seq?","val":["?scan_profile","scan_profile_type","_scan_profile_type"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/not"],"in":["where",4]},{"path":["where","not-join"],"pred":"clojure.core/seq?","val":["?scan_profile","scan_profile_type","_scan_profile_type"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/not-join"],"in":["where",4]},{"path":["where","or"],"pred":"clojure.core/seq?","val":["?scan_profile","scan_profile_type","_scan_profile_type"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/or"],"in":["where",4]},{"path":["where","or-join"],"pred":"clojure.core/seq?","val":["?scan_profile","scan_profile_type","_scan_profile_type"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/or-join"],"in":["where",4]},{"path":["where","range"],"pred":["clojure.core/=",["clojure.core/count","%"],1],"val":["?scan_profile","scan_profile_type","_scan_profile_type"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/range"],"in":["where",4]},{"path":["where","rule"],"pred":"clojure.core/list?","val":["?scan_profile","scan_profile_type","_scan_profile_type"],"via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/rule"],"in":["where",4]},{"path":["where","pred","pred"],"pred":"clojure.core/seq?","val":"?scan_profile","via":["xtdb.query/query","xtdb.query/where","xtdb.query/term","xtdb.query/pred"],"in":["where",4,0]}],"clojure.spec.alpha/spec":"xtdb.query/query","clojure.spec.alpha/value":{"find":[["count","?e"]],"in":[["_object_type","..."],["_scan_level","..."],["_scan_profile_type","..."]],"where":[["?e","object_type","_object_type"],["?scan_profile","type","ScanProfile"],["?scan_profile","reference","?e"],["?scan_profile","level","_scan_level"],["?scan_profile","scan_profile_type","_scan_profile_type"]]}}}
[2023-04-03 11:51:48 +0000] [8] [CRITICAL] [api] 400 Client Error: Bad Request for url: http://crux:3000/_xtdb/zeef/query?valid-time=2023-04-03T11%3A51%3A48.433097%2B00%3A00
[2023-04-03 11:51:48 +0000] [8] [INFO] [h11_impl] 172.18.0.13:47418 - "GET /zeef/objects?types=Finding&valid_time=2023-04-03+11%3A51%3A48.433097%2B00%3A00&offset=0&limit=99999999 HTTP/1.1" 502

After restarting the xtdb container, it is functioning properly again.

This issue is primarily meant to document this xtdb zombie state so we know how to handle it if it occurs in the future. I haven't gotten around to debugging xtdb internals to properly resolve it. As a workaround, we could start the xtdb JVM using the ExitOnOutOfMemoryError (exit JVM on the first occurrence of an out-of-memory error) or CrashOnOutOfMemoryError (which produces crash files for debugging) flag.

Relates to #429

@praseodym praseodym added bug Something isn't working xtdb labels Apr 3, 2023
@praseodym praseodym self-assigned this Apr 3, 2023
@praseodym
Copy link
Contributor Author

Alternative to #676 is to fix it upstream: dekkers/xtdb-http-multinode#3

@praseodym
Copy link
Contributor Author

dekkers/xtdb-http-multinode#3 is merged, we'll need to update docs so that xtdb-http-multinode v1.0.4 is used.

@ammar92
Copy link
Contributor

ammar92 commented Apr 10, 2023

This issue can be closed now, right?

@praseodym
Copy link
Contributor Author

I think there is still some docs work to do, also see #681 (comment)

@praseodym
Copy link
Contributor Author

Created #703 to improve docs, so everything is this issue is resolved now.

@praseodym
Copy link
Contributor Author

praseodym commented May 9, 2023

In a new setup crux ended up in a zombie state again. It looks like the -XX:+ExitOnOutOfMemoryError JVM option was not properly passed to crux, as the first line in the log was this:

crux_1                 | Picked up JAVA_TOOL_OPTIONS: -Xms128M -Xmx512M -XX:MaxDirectMemorySize=512M

And after increasing memory it was this:

crux_1                 | Picked up JAVA_TOOL_OPTIONS: -Xms128M -Xmx8G -XX:MaxDirectMemorySize=8G -XX:+ExitOnOutOfMemoryError

Let's investigate why the OOM option is missing in the default setup.

@praseodym praseodym reopened this May 9, 2023
@praseodym
Copy link
Contributor Author

Apparently we forgot to add the option in the Dockerfile, dekkers/xtdb-http-multinode#8 fixes this.

@praseodym
Copy link
Contributor Author

dekkers/xtdb-http-multinode#8 was merged and because we use the main Docker image tag this is resolved after a Docker image pull.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working xtdb
Projects
Archived in project
Development

No branches or pull requests

2 participants