Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow Queries After Doing A Bulk Load On Nightly Build #1168

Closed
willcj33 opened this issue Jul 10, 2017 · 8 comments
Closed

Slow Queries After Doing A Bulk Load On Nightly Build #1168

willcj33 opened this issue Jul 10, 2017 · 8 comments
Assignees
Labels
kind/bug Something is broken.
Milestone

Comments

@willcj33
Copy link

willcj33 commented Jul 10, 2017

Version:

Dgraph version   : v0.7.7-dev
Commit SHA-1     : 07abc9a2
Commit timestamp : 2017-07-10 19:24:11 +1000
Branch           : HEAD

I am experiencing LONG delays (3.5-10s) on queries (usually heavily filtered ones that access a lot of related data) within hours of restoring data to the newest release of Dgraph. My CPU seems lost (< 10%) pretty much at all times. I try to restart it, and it does nothing to help. When I run it a second time (same query), What took 5s, takes 35ms (which is how fast it ran on 0.7.7). Seemingly, after a day or so of this instance running (untouched) the speeds are faster that 0.7.7, where after a restart (to ensure no caching or anything), what took 35ms on 0.7.7, takes around 8ms on master, but it seems I have to wait some random amount of time. UID queries are blazing fast always. Tagging @janardhan1993 and @pawanrawal as they are most familiar with my issue.

@willcj33
Copy link
Author

profile (19).zip

Added a profile for when these long queries happen. This one actually times out through the UI. This is hours after my data load when making this ticket.

@willcj33
Copy link
Author

It finally returned in 37 seconds. This is a millisecond response on 0.7.7, so something is going on here for sure. This is after I bulk loaded data still, 2 hours after. I have also restarted the server once to try to fix the speed issues. Attached my /debug/vars at the time this is happening.

vars.pdf

@manishrjain manishrjain added the kind/bug Something is broken. label Jul 11, 2017
@manishrjain manishrjain added this to the v0.8 milestone Jul 11, 2017
@pawanrawal
Copy link
Contributor

pawanrawal commented Jul 11, 2017

(pprof) top10       
1680ms of 2660ms total (63.16%)
Dropped 38 nodes (cum <= 13.30ms)
Showing top 10 nodes out of 109 (cum >= 80ms)
      flat  flat%   sum%        cum   cum%
     240ms  9.02%  9.02%      260ms  9.77%  github.com/AndreasBriese/bbloom.Bloom.Has
     230ms  8.65% 17.67%      230ms  8.65%  runtime.usleep
     200ms  7.52% 25.19%      200ms  7.52%  runtime.cmpbody
     180ms  6.77% 31.95%      390ms 14.66%  github.com/dgraph-io/badger/table.(*BlockIterator).parseKV
     170ms  6.39% 38.35%      560ms 21.05%  github.com/dgraph-io/badger/table.(*BlockIterator).Next
     160ms  6.02% 44.36%      160ms  6.02%  runtime.memmove
     150ms  5.64% 50.00%      350ms 13.16%  runtime.mallocgc
     140ms  5.26% 55.26%      160ms  6.02%  syscall.Syscall6
     130ms  4.89% 60.15%      130ms  4.89%  runtime.futex
      80ms  3.01% 63.16%       80ms  3.01%  runtime.memclrNoHeapPointers
(pprof) top10 --cum
0.05s of 2.66s total ( 1.88%)
Dropped 38 nodes (cum <= 0.01s)
Showing top 10 nodes out of 109 (cum >= 0.88s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%      2.35s 88.35%  runtime.goexit
     0.01s  0.38%  0.38%      1.77s 66.54%  github.com/dgraph-io/badger.(*KV).get
     0.01s  0.38%  0.75%      1.67s 62.78%  github.com/dgraph-io/badger.(*logFile).iterate
         0     0%  0.75%      1.67s 62.78%  github.com/dgraph-io/badger.(*valueLog).doRunGC
         0     0%  0.75%      1.67s 62.78%  github.com/dgraph-io/badger.(*valueLog).runGCInLoop
         0     0%  0.75%      1.60s 60.15%  github.com/dgraph-io/badger.(*valueLog).doRunGC.func1
     0.03s  1.13%  1.88%      1.56s 58.65%  github.com/dgraph-io/badger.(*levelHandler).get
         0     0%  1.88%      1.56s 58.65%  github.com/dgraph-io/badger.(*levelsController).get
         0     0%  1.88%      0.88s 33.08%  github.com/dgraph-io/badger/table.(*TableIterator).Seek
         0     0%  1.88%      0.88s 33.08%  github.com/dgraph-io/badger/table.(*TableIterator).seek


pprof001

@pawanrawal pawanrawal self-assigned this Jul 11, 2017
@manishrjain
Copy link
Contributor

We can make our GC smarter. Right now, it just starts testing if a value log needs to be rewritten every 10 mins. We need to expose some vars from Badger to be able to really figure out what to optimize there. Maybe we can allow a user to turn it off entirely. Or, reduce how many iterations it does before deciding to rewrite the file.

@manishrjain
Copy link
Contributor

manishrjain commented Jul 13, 2017

So, @willcj33 : This randomness of being slow, sounds to me like it is a GC problem. And the cpu profiling indicates that well. We could just reduce the amount of GC work going on in the background, and that should be sufficient here.

Is this blocking for you? Do you see this often?

Also, @janardhan1993 : Didn't you change the table mapping to memory? I think that would affect this heavily. We should just set it back to load to RAM (and provide a flag to allow users to change it to memory map or nothing, with a warning that things would be slow).

@manishrjain
Copy link
Contributor

manishrjain commented Jul 13, 2017

So, to summarize, I think the real change that's required here is to have LSM tree be in RAM. That would speed up both the GC effort in background and the startup slowness mentioned in #1180 .

Having said that, we can also reduce the megabytes of data we comb through in value log for GC (reduce from 100 to 50 MB, @pawanrawal ).

@pawanrawal
Copy link
Contributor

@willcj33 confirmed that after switching to using SSD this is not an issue anymore.

@manishrjain should we still reduce the window size?

@manishrjain
Copy link
Contributor

manishrjain commented Jul 17, 2017 via email

@manishrjain manishrjain added the kind/bug Something is broken. label Mar 22, 2018
jarifibrahim pushed a commit that referenced this issue Mar 16, 2020
Important changes
```
 - Changes to overlap check in compaction.
 - Remove 'this entry should've been caught' log.
 - Changes to write stalling on levels 0 and 1.
 - Compression is disabled by default in Badger.
 - Bloom filter caching in a separate ristretto cache.
 - Compression/Encryption in background.
 - Disable cache by default in badger.
```

The following new changes are being added from badger
`git log ab4352b00a17...91c31ebe8c22`

```
91c31eb Disable cache by default (#1257)
eaf64c0 Add separate cache for bloom filters (#1260)
1bcbefc Add BypassDirLock option (#1243)
c6c1e5e Add support for watching nil prefix in subscribe API (#1246)
b13b927 Compress/Encrypt Blocks in the background (#1227)
bdb2b13 fix changelog for v2.0.2 (#1244)
8dbc982 Add Dkron to README (#1241)
3d95b94 Remove coveralls from Travis Build(#1219)
5b4c0a6 Fix ValueThreshold for in-memory mode (#1235)
617ed7c Initialize vlog before starting compactions in db.Open (#1226)
e908818 Update CHANGELOG for Badger 2.0.2 release. (#1230)
bce069c Fix int overflow for 32bit (#1216)
e029e93 Remove ExampleDB_Subscribe Test (#1214)
8734e3a Add missing package to README for badger.NewEntry (#1223)
78d405a Replace t.Fatal with require.NoError in tests (#1213)
c51748e Fix flaky TestPageBufferReader2 test (#1210)
eee1602 Change else-if statements to idiomatic switch statements. (#1207)
3e25d77 Rework concurrency semantics of valueLog.maxFid (#1184) (#1187)
4676ca9 Add support for caching bloomfilters (#1204)
c3333a5 Disable compression and set ZSTD Compression Level to 1 (#1191)
0acb3f6 Fix L0/L1 stall test (#1201)
7e5a956 Support disabling the cache completely. (#1183) (#1185)
82381ac Update ristretto to version  8f368f2 (#1195)
3747be5 Improve write stalling on level 0 and 1
5870b7b Run all tests on CI (#1189)
01a00cb Add Jaegar to list of projects (#1192)
9d6512b Use fastRand instead of locked-rand in skiplist (#1173)
2698bfc Avoid sync in inmemory mode (#1190)
2a90c66 Remove the 'this entry should've caught' log from value.go (#1170)
0a06173 Fix checkOverlap in compaction (#1166)
0f2e629 Fix windows build (#1177)
03af216 Fix commit sha for WithInMemory in CHANGELOG. (#1172)
23a73cd Update CHANGELOG for v2.0.1 release. (#1181)
465f28a Cast sz to uint32 to fix compilation on 32 bit (#1175)
ea01d38 Rename option builder from WithInmemory to WithInMemory. (#1169)
df99253 Remove ErrGCInMemoryMode in CHANGELOG. (#1171)
8dfdd6d Adding changes for 2.0.1 so far (#1168)
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something is broken.
Development

No branches or pull requests

3 participants