Slow couchdb after many queries #5044
Comments
Your expectation is well-founded, and CouchDB does behave that way, so something must be wrong somewhere. Could it be that the number of in-progress requests keeps climbing? That is, are you making more and more queries even though previous ones have not yet completed? If so, increased latency on the later requests is expected due to internal queues. I suggest using a tool like ApacheBench to confirm or refute this: set it to some modest concurrency and point it at a view with appropriate query parameters (i.e., don't read the entire view on each request). Note that a view query operates in two phases. The first phase ensures the view is up to date with respect to any database updates (by running any new updates through the map and reduce functions). The second phase is the actual query. It will help us to know whether the database in question is being written to (and, if so, how intensely) during these queries; e.g., we would expect a view query to take longer if a large number of updates were made to the database just beforehand. |
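To separate queueing effects from a genuine per-request slowdown, it can help to issue requests strictly one after another and watch the latency percentiles over time. The following is a minimal sketch of that idea, not a replacement for ApacheBench. The helper names `timeSequential` and `percentile` are made up for illustration, and the view URL in the comment (database `mydb`, design document `ddoc`, view `by_token`, `limit=10`) is a hypothetical example. A local stand-in server is used so the sketch runs without CouchDB.

```go
package main

import (
	"fmt"
	"io"
	"math"
	"net/http"
	"net/http/httptest"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0-100) of samples using the
// nearest-rank method. It returns 0 for an empty slice.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// timeSequential issues n GET requests one after another (never in
// parallel) and returns each request's latency in milliseconds.
func timeSequential(client *http.Client, url string, n int) ([]float64, error) {
	latencies := make([]float64, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		resp, err := client.Get(url)
		if err != nil {
			return latencies, err
		}
		// Drain and close the body so the connection can be reused.
		_, _ = io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		latencies = append(latencies, float64(time.Since(start).Microseconds())/1000)
	}
	return latencies, nil
}

func main() {
	// Stand-in server so the sketch runs without CouchDB. Replace srv.URL
	// with a real view URL, for example (hypothetical names):
	// http://localhost:5984/mydb/_design/ddoc/_view/by_token?limit=10
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, `{"rows":[]}`)
	}))
	defer srv.Close()

	lat, err := timeSequential(srv.Client(), srv.URL, 200)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Printf("p50=%.2fms p99=%.2fms\n", percentile(lat, 50), percentile(lat, 99))
}
```

If the percentiles climb from one batch to the next even though only one request is ever in flight, client-side queueing is unlikely to be the explanation.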
Thanks for the reply! We run the queries in series, i.e. we wait for one response before sending the next query. From an outside point of view, it looks like there's some kind of short-term memory cache that keeps growing, and each query traverses the whole thing, which gets bigger all the time. For the record: requests right after the service restart take something like a few milliseconds, whereas requests after a few thousand take up to 120,000 ms (the maximum we wait before calling it a day and raising an exception). We'll try benchmarking with |
There is no cache of query results inside CouchDB, for sure. Is the database growing during this? We'd expect longer query times for larger views, even if the view is up to date when queried (it's ultimately B+trees, so the more data, the taller the tree, and the larger the database, the more likely we have to read more nodes from disk). At this point it would help to see your specific map/reduce functions and an example document. |
also check |
No, the database is not growing during the read operations (the actual platform is a QA environment where the only operation is the "mega read"). We're talking about 100,000 reads.
I can show you the Go code for the chaincode that does the reads, though:
func (c *xxxxx) Token(ctx contractapi.TransactionContextInterface, tokenId string) (string, error) {
	token, err := GetTokenData(ctx, tokenId)
	if err != nil {
		return "", err
	}
	tokenxxxxx := xxxxxx
	iterator, err := ctx.GetStub().GetStateByPartialCompositeKey(PrefixTokenxxxxxx, []string{tokenId})
	if err != nil {
		return "", fmt.Errorf("error creating asset chaincode: %v", err)
	}
	// Close the iterator on every return path, including early error returns.
	defer iterator.Close()
	for iterator.HasNext() {
		queryResponse, err := iterator.Next()
		if err != nil {
			return "", err
		}
		// Retrieve the tokenId
		_, compositeKeyParts, err := ctx.GetStub().SplitCompositeKey(queryResponse.Key)
		if err != nil {
			return "", err
		}
		/// OTHER CODE REDACTED BUT OF NO CONCERN
	}
	resultByte, err := json.Marshal(tokenWithxxxxxxx)
	if err != nil {
		return "", err
	}
	return string(resultByte), nil
} |
Can you express that as an http request? |
I think that gets translated into two kinds of queries:
The first is the one I'm benchmarking now (-c=10 -n=100000). It is sometimes really slow, but I still don't understand why the requests influence each other when I run them one after another rather than in parallel. |
here's the ab result
|
browsing through old issues I found this one, which sounds a lot like our problem: |
Hi, yes, it could be. A typical HTTP client would reuse connections (indeed, would manage a pool of them); check that you're doing that first. Otherwise you could use up all available ports and have to wait for one to become free. Try your |
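One way to check connection reuse from a Go client is to share a single `http.Client` across all requests and count how many requests ran over an already-open connection, using the standard `net/http/httptrace` hooks. This is only a sketch under assumptions: the helper names (`newPooledClient`, `countReused`, `reuseAgainstStub`) and the pool limits are illustrative, and a local stand-in server replaces CouchDB so the example runs on its own.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"time"
)

// newPooledClient returns one client meant to be shared by all requests.
// The limits are illustrative values, not CouchDB recommendations.
func newPooledClient() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			MaxIdleConnsPerHost: 10, // Go's default is 2
			IdleConnTimeout:     90 * time.Second,
		},
	}
}

// countReused sends n sequential GETs through the given client and
// returns how many of them ran over an already-open connection.
func countReused(client *http.Client, url string, n int) (int, error) {
	reused := 0
	for i := 0; i < n; i++ {
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			return reused, err
		}
		trace := &httptrace.ClientTrace{
			GotConn: func(info httptrace.GotConnInfo) {
				if info.Reused {
					reused++
				}
			},
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := client.Do(req)
		if err != nil {
			return reused, err
		}
		// An unread or unclosed body prevents the connection from being reused.
		_, _ = io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return reused, nil
}

// reuseAgainstStub runs countReused against a local stand-in server.
func reuseAgainstStub(n int) (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, `{"ok":true}`)
	}))
	defer srv.Close()
	return countReused(newPooledClient(), srv.URL, n)
}

func main() {
	reused, err := reuseAgainstStub(20)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Printf("%d of 20 requests reused a connection\n", reused)
}
```

With a shared client and fully drained bodies, most requests after the first would be expected to reuse a connection. A count near zero against the real server would suggest that a new connection is being opened per request.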
With That spawns 10 background workers at the fabric (cluster) level to open 10 documents in parallel, but that puts more work on the cluster. You can try decreasing or increasing it a bit and see if it changes anything. |
I tried with
we'll definitely try this!! |
@luca-simonetti thanks for trying the concurrency setting. Yeah, a memory leak is plausible; we have debugged a number of those over the last few years, some in OTP 25 and 24:
Some increase in memory usage is perhaps expected if it's the page cache that's using the memory, though that wouldn't explain the slowdown... Another thing to tweak might be max_dbs_open. That helped in another issue related to memory usage: #4992. Another interesting one might be: #4988 (comment). Though, again, in those cases performance didn't seem to be much of an issue. Are there any exceptions or errors showing in the logs? To get more details about the internals, you can try getting the output from |
which version of erlang are you using? |
{
"javascript_engine": {
"version": "78",
"name": "spidermonkey"
},
"erlang": {
"version": "24.3.4.15",
"supported_hashes": [
"blake2s",
"blake2b",
"sha3_512",
"sha3_384",
"sha3_256",
"sha3_224",
"sha512",
"sha384",
"sha256",
"sha224",
"sha",
"ripemd160",
"md5",
"md4"
]
},
"collation_driver": {
"name": "libicu",
"library_version": "70.1",
"collator_version": "153.112",
"collation_algorithm_version": "14"
}
} This is the full _node/_local/_versions output. As for the memory leaks you linked: we looked into those, but they involve leaks of around 50 GB, whereas in our case the memory used is around 200 MB, which is only 15% of the total available. For this reason I don't think it's related to those problems. Also, we noticed that CPU usage is around 25%, which is really low, and raising all_docs_concurrency to 100 didn't change much: the time taken still increases over time, and CPU and RAM usage are the same as before. |
and this is the current state of _node/_local/_system {
"uptime": 77078,
"memory": {
"other": 26982505,
"atom": 631001,
"atom_used": 605053,
"processes": 170112432,
"processes_used": 170076040,
"binary": 474168,
"code": 13917182,
"ets": 2267344
},
"run_queue": 0,
"run_queue_dirty_cpu": 0,
"ets_table_count": 157,
"context_switches": 851304076,
"reductions": 3036158123783,
"garbage_collection_count": 20109323,
"words_reclaimed": 35778913031,
"io_input": 144808316,
"io_output": 1204370753,
"os_proc_count": 0,
"stale_proc_count": 0,
"process_count": 444,
"process_limit": 262144,
"message_queues": {
"couch_file": {
"50": 0,
"90": 0,
"99": 0,
"count": 38,
"min": 0,
"max": 0
},
"couch_db_updater": {
"50": 0,
"90": 0,
"99": 0,
"count": 38,
"min": 0,
"max": 0
},
// a bunch of keys with value:0
} |
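A dump like the one above can be checked mechanically for the two signals discussed in this thread: total Erlang VM memory and message queue backlogs. The sketch below is one possible way to do that in Go. The `systemStats` type and the `queueMax` and `summarize` helpers are hypothetical names, and the code assumes that `atom_used` and `processes_used` are subsets of `atom` and `processes` while the remaining memory categories do not overlap. The `couch_server` entry in the sample stands in for the "bunch of keys with value 0".

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// systemStats mirrors only the fields of _node/_local/_system used below.
type systemStats struct {
	Memory        map[string]int64           `json:"memory"`
	RunQueue      int64                      `json:"run_queue"`
	MessageQueues map[string]json.RawMessage `json:"message_queues"`
}

// queueMax extracts the backlog from one message_queues entry, which is
// either a plain number or an object with a "max" field.
func queueMax(raw json.RawMessage) int64 {
	var n int64
	if json.Unmarshal(raw, &n) == nil {
		return n
	}
	var agg struct {
		Max int64 `json:"max"`
	}
	if json.Unmarshal(raw, &agg) == nil {
		return agg.Max
	}
	return 0
}

// summarize returns the summed memory in bytes and the sorted names of
// message queues that currently hold at least one message.
func summarize(body []byte) (int64, []string, error) {
	var s systemStats
	if err := json.Unmarshal(body, &s); err != nil {
		return 0, nil, err
	}
	var total int64
	for name, bytes := range s.Memory {
		// Assumed to be subsets of "atom" and "processes"; skip them
		// to avoid double counting.
		if name == "atom_used" || name == "processes_used" {
			continue
		}
		total += bytes
	}
	var busy []string
	for name, raw := range s.MessageQueues {
		if queueMax(raw) > 0 {
			busy = append(busy, name)
		}
	}
	sort.Strings(busy)
	return total, busy, nil
}

// sample is a trimmed copy of the values posted in this thread.
const sample = `{
  "memory": {"other": 26982505, "atom": 631001, "atom_used": 605053,
             "processes": 170112432, "processes_used": 170076040,
             "binary": 474168, "code": 13917182, "ets": 2267344},
  "run_queue": 0,
  "message_queues": {"couch_file": {"50": 0, "max": 0, "count": 38},
                     "couch_server": 0}
}`

func main() {
	total, busy, err := summarize([]byte(sample))
	if err != nil {
		fmt.Println("cannot parse stats:", err)
		return
	}
	fmt.Printf("memory: %.1f MiB, queues with backlog: %v\n",
		float64(total)/(1<<20), busy)
}
```

For the values posted here, the sum is roughly 214 MB with no queue backlog, which matches the "around 200MB" figure mentioned earlier and does not point to a process mailbox piling up.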
For an extreme test we tried to set UPDATE: I actually restarted the process to verify that this value fixes the problem, and it does not. UPDATE2: I don't know if this is the case, but it could just be related to some Hyperledger internal caching. I'm really struggling to understand how the whole system works. |
all_docs_concurrency limits how many parallel doc reads are done at the cluster level. That speeds up the Another factor to play with is Q. What is your Q value for the dbs? By default it's 2. You can try experimenting with higher values like 8, 16, 64.
If CPU and memory usage on your containers are not very high, it does seem like a slow-disk issue, but with a page cache or some other limited, faster cache in front of the disk. Try a faster local disk or another setup as an experiment with the same container and see what numbers you get. |
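For experimenting with a different Q, CouchDB accepts a `q` query parameter when a database is created (`PUT /{db}?q=N`), so a separate test database can be set up without touching the existing one. The sketch below only shows how such a request could be built in Go. The host, the database name `tokens_q8`, and the helper names `createDBURL` and `createDB` are assumptions for illustration, and authentication is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
	"strings"
)

// createDBURL builds the URL for PUT /{db}?q=N, which sets the shard
// count of a database at creation time.
func createDBURL(base, name string, q int) string {
	query := url.Values{"q": {strconv.Itoa(q)}}
	return strings.TrimRight(base, "/") + "/" + url.PathEscape(name) + "?" + query.Encode()
}

// createDB sends the PUT request and treats 201 Created and
// 202 Accepted as success.
func createDB(client *http.Client, base, name string, q int) error {
	req, err := http.NewRequest(http.MethodPut, createDBURL(base, name, q), nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated && resp.StatusCode != http.StatusAccepted {
		return fmt.Errorf("creating %s: unexpected status %s", name, resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical host and database name; authentication is omitted.
	fmt.Println(createDBURL("http://localhost:5984", "tokens_q8", 8))
	// To actually create the database, call:
	//   err := createDB(http.DefaultClient, "http://localhost:5984", "tokens_q8", 8)
}
```

Running the same read benchmark against a copy of the data in such a test database would show whether the shard count matters for this workload.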
I don't think the disk is the issue here:
As you can see, the system doesn't spend any time waiting for I/O operations. What I do see, though, is a memory cache increasing. After a restart that same cache shrinks a lot and then starts growing again. One more note: the slow requests are not the "include_documents=true" ones but the "attachments=true" ones, and we make one request for each of those. So if something is cached, it's of no use, or even a problem in this case, since each request accesses a different document from the one before. Also, we cannot change the Q factor since the database is already created and we don't want to risk resharding and losing the whole thing... |
a memory leak around attachments is feasible, as they are handled as large binaries in erlang, and there have been some leaks in BEAM around that. What version of erlang is couchdb using here? |
as for resharding, our advice to increase |
the Erlang version inside couchdb is: |
ok, let's get you onto 24.3.4.17, as we know there are bugs in earlier point releases of that series. Let's rule those out first. |
ok, cool! How do I do that? |
you'll need to download or build 24.3.4.17 (https://www.erlang.org/patches/otp-24.3.4.17) and build couchdb against it. How did you install couchdb initially? We should update our binary packages with 24.3.4.17 for the same reason. |
We installed the current CouchDB version (3.3.3) from the official repository for Ubuntu. I think Erlang came bundled with it. |
I just rebuilt all the 3.3.3 packages to use the latest Erlang patch version 24.3.4.17. If you update from the deb repo you should get the latest version from there: 3.3.3-2 |
thank you for the help! We updated couchdb as requested: {
"javascript_engine": {
"version": "78",
"name": "spidermonkey"
},
"erlang": {
"version": "24.3.4.17",
"supported_hashes": [
"blake2s",
"blake2b",
"sha3_512",
"sha3_384",
"sha3_256",
"sha3_224",
"sha512",
"sha384",
"sha256",
"sha224",
"sha",
"ripemd160",
"md5",
"md4"
]
},
"collation_driver": {
"name": "libicu",
"library_version": "70.1",
"collator_version": "153.112",
"collation_algorithm_version": "14"
}
} but unfortunately it didn't help. The problem is still the same 😢 |
hello @luca-simonetti , thanks |
No, we changed which CouchDB API we call. Instead of requesting each attachment individually, we fetch the doc with its attachments included (attachments=true), and that does the trick. |
One note about an earlier comment, "What I see though is a memory cache increasing. After restart that very same cache decreases a lot and starts growing back again.": this refers to the kernel's disk page cache, which is a good thing (critical for performance) and definitely not a sign of a problem. That it goes away when you restart makes sense, as restarting invalidates the cache. |
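For reference, when a document is fetched with `?attachments=true` and `Accept: application/json`, CouchDB returns the attachment bodies inline as base64 strings under `_attachments`, so one request replaces one request per attachment. The sketch below shows how such a response could be decoded in Go. It is not the Hyperledger Fabric code path, and the `docWithAttachments` type, the `decodeAttachments` helper, the document ID, and the attachment name are all made up for illustration.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// docWithAttachments models only the _attachments part of a document
// fetched with ?attachments=true and "Accept: application/json".
type docWithAttachments struct {
	Attachments map[string]struct {
		ContentType string `json:"content_type"`
		Data        string `json:"data"` // base64-encoded body
	} `json:"_attachments"`
}

// decodeAttachments returns every inline attachment keyed by name.
func decodeAttachments(body []byte) (map[string][]byte, error) {
	var doc docWithAttachments
	if err := json.Unmarshal(body, &doc); err != nil {
		return nil, err
	}
	out := make(map[string][]byte, len(doc.Attachments))
	for name, att := range doc.Attachments {
		raw, err := base64.StdEncoding.DecodeString(att.Data)
		if err != nil {
			return nil, fmt.Errorf("attachment %q: %w", name, err)
		}
		out[name] = raw
	}
	return out, nil
}

func main() {
	// Sample body shaped like the response to
	// GET /{db}/{docid}?attachments=true (names are hypothetical).
	body := []byte(`{"_id":"token-1","_attachments":{"note.txt":{"content_type":"text/plain","data":"aGVsbG8="}}}`)
	atts, err := decodeAttachments(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for name, data := range atts {
		fmt.Printf("%s: %d bytes\n", name, len(data))
	}
}
```

The trade-off is a larger single response, since base64 adds about a third to the attachment size, in exchange for far fewer round trips.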
the thing is that when we restart the couch service the performance goes back to normal. In this case the first thing that comes to my mind is some cache somewhere. Not necessarily as part of couch itself but also as part of the underlying OS or something else. |
thanks for your answer |
I checked the management of attachments and they are retrieved in batches and not one by one. how can we see our version of erlang? |
visit this |
Thanks, so we have version 24.3.4.14.
The OS is Debian; the package details:
|
i see there are two new versions of packages 3.3.3-1 and 3.3.3-2
|
I am having the same issue, but I don't do anything with attachments. |
Description
CouchDB gets tremendously slow after many queries are made. The situation gets better after some pause in the process. But as soon as the queries resume the system gets slow really really fast.
Steps to Reproduce
note: we are using couchdb as part of our Hyperledger Fabric Cluster. The queries are then made through blockchain requests.
Expected Behaviour
We expect the same query to take roughly the same amount of time, regardless of how many queries were made in the preceding period (say, the previous 5 minutes).
Your Environment
Additional Context
We're using Hyperledger Fabric. After the CouchDB service is restarted with
/etc/init.d/couchdb restart
the situation goes back to normal, but it only takes a couple of minutes (which is something like 5 x 5 x 60 = 1500 queries) before performance starts to degrade again very quickly.