fillseq throughput is 13% slower from PR 6862 #9351
Using smaller SST files to get "enough" files with a smaller database might help repro this faster. |
This shows the insert rate every 20 seconds. For the diff immediately prior to the PR 6862 diff the rate starts at 840k/s and drops to 761k. For the problem diff the rate starts at 822k and drops to 617k. So the difference is larger when there are more files (at test end). |
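To make the start versus end comparison concrete, here is a small sketch that computes how far the problem diff trails the prior diff at each point. The function name `SlowdownPercent` is made up for illustration; the rates are the ones quoted above.

```cpp
#include <cassert>

// Percentage by which `regressed_rate` trails `baseline_rate`.
// Hypothetical helper, not part of db_bench.
double SlowdownPercent(double baseline_rate, double regressed_rate) {
  assert(baseline_rate > 0.0);
  return (baseline_rate - regressed_rate) / baseline_rate * 100.0;
}

// Start of test: SlowdownPercent(840000, 822000) is roughly 2%.
// End of test:   SlowdownPercent(761000, 617000) is roughly 19%.
```

With these inputs the gap grows from about 2% at the start to about 19% at the end, which is consistent with the overhead scaling with the number of files.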
The PR in question does not affect the application traffic directly; the runtime cost it adds is essentially one hash operation per file per background job (flush/compaction). Do you see any write stalls in your tests? If yes, what kind? |
The code appears to be on the user's Put path because CPU per insert is up, throughput is down, and stalls are down. Results from test end were posted for the diff immediately prior to this one and for this diff. |
You mentioned stall counts are down; what about total stall time? |
Looking at your configuration options, |
For cumulative stall: For max_background_jobs, the server has 36 HW threads and 18 cores. With 8 background jobs, almost half of the CPU cores are in use when CPU bound, which isn't very low. If RocksDB hardwires the number of background flushes to jobs/4, then that sounds like a feature that went missing when the code was changed to remove max_background_flushes. Regardless, the slowdown doesn't appear to be from write stalls. |
Ran the test for v6.26.1 with max_jobs=16. That did not help and performance was similar to max_jobs=8. |
The root cause appears to be that this diff increases the amount of time the global mutex is held by background threads, so the user thread waits longer when it tries to lock that mutex during writes. The evidence is from a test that inserts 1B KV pairs via fillseq, with PMP sampling every ~60 seconds, run with this diff (e3f953a) and the one immediately prior to it (bcefc59). I counted the number of samples in which the user thread was blocked in _lll_lock_wait (pthread mutex internals) in PipelinedWriteImpl: there were 3 for bcefc59 vs 5 for e3f953a. When the user thread blocks on _lll_lock_wait, it is here (see "line 477" below)
And this is the flattened PMP stack trace for the blocked user thread:
And these are the 3 instances for bcefc59 and 5 for e3f953a, that include the filenames I used to save them:
Finally, the stack traces for the thread that I suspect holds the global mutex for e3f953a. I list 6 here (and above): 5 where the user thread is blocked on _lll_lock_wait and one where it is still in pthread_mutex_lock. They are in a gist. |
Some browsing suggests that the (eventual) users of |
Flamegraphs are here. GitHub didn't allow me to attach them directly. The 1550 directory is for e3f953a and the 1551 directory is for bcefc59. I was unable to make differential flamegraphs, but comparing them side by side I see that TryInstallMemtableFlushResults takes more time with e3f953a. Also, one of the stack traces for e3f953a shows that the Compaction destructor runs longer because of the deallocation it must do for the STL container holding file_locations_, and I think the global mutex was held during that. |
It looks like most of the work in flush/compaction threads that can't be parallelized is in VersionBuilder::Rep::SaveTo(), and a good share of it is hash table operations. If we can move them out from under the DB mutex, it's possible that the throughput will recover. |
Thanks for the detailed results @mdcallag. Similarly to #9354, it was a conscious decision to make the consistency checks that rely on this hash structure mandatory. For most workloads, the overhead is negligible; it seems |
Looking at the code, there is one simple optimization we can do to reduce the overhead. We can preallocate the space for the hash, which should eliminate the costly reallocation/rehashing as the hash expands. Will put up a patch for this in the next couple of days. I expect this to significantly reduce the performance overhead of the consistency checks but in case it doesn't, we can also look into moving the version saving logic out from under the mutex lock, although I would expect that to be a much more involved change. |
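As a generic illustration of why preallocating helps, the sketch below counts how often a `std::unordered_map` grows its bucket array with and without `reserve()`. It is not the RocksDB patch; `FileIndex` and `CountRehashes` are made-up names, and the map is only a stand-in for the hash structure discussed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for a per-file index keyed by file number.
using FileIndex = std::unordered_map<std::uint64_t, int>;

// Inserts n file numbers and returns how many times the bucket array grew.
std::size_t CountRehashes(std::size_t n, bool preallocate) {
  FileIndex index;
  if (preallocate) {
    index.reserve(n);  // size the bucket array once, up front
  }
  std::size_t rehashes = 0;
  std::size_t buckets = index.bucket_count();
  for (std::uint64_t file_number = 0; file_number < n; ++file_number) {
    index.emplace(file_number, 0);
    if (index.bucket_count() != buckets) {
      // The container rehashed: every existing element was re-bucketed.
      buckets = index.bucket_count();
      ++rehashes;
    }
  }
  assert(index.size() == n);
  return rehashes;
}
```

With `reserve(n)` the standard guarantees no rehash until the map holds more than `n` elements, so `CountRehashes(n, true)` returns 0, while the version without `reserve()` rehashes repeatedly as the map grows.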
... actually, it might not be that involved after all. Planning to post a patch in the next few days |
@ltamasi any update on the evaluation of whether "it might not be that involved after all" is true? |
Yeah :) this was fixed in #9504 in February. |
How about "moving the version saving logic out from under the mutex lock"? |
The PR I mentioned moved the hash building part out (which is what was causing the regression and what I had in mind). Not sure how easy it would be to do more of the version saving outside the mutex. |
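For readers unfamiliar with the pattern, here is a minimal sketch of building a hash outside a mutex and publishing it under the lock. This is not the code from #9504; `SharedState`, `InstallFileIndex`, and `file_to_level` are hypothetical names, and a plain `std::mutex` stands in for the DB mutex.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical shared state guarded by a single mutex.
struct SharedState {
  std::mutex mu;  // stand-in for a global DB mutex
  std::unordered_map<std::uint64_t, int> file_to_level;
};

void InstallFileIndex(
    SharedState& state,
    const std::vector<std::pair<std::uint64_t, int>>& files) {
  // Expensive part: allocate and fill the hash without holding the mutex.
  std::unordered_map<std::uint64_t, int> fresh;
  fresh.reserve(files.size());
  for (const auto& entry : files) {
    fresh.emplace(entry.first, entry.second);
  }
  assert(fresh.size() <= files.size());  // duplicate file numbers collapse

  // Cheap part: swap the finished hash in while holding the mutex.
  std::lock_guard<std::mutex> lock(state.mu);
  state.file_to_level.swap(fresh);
  // `lock` is destroyed before `fresh`, so the old table (now owned by
  // `fresh`) is deallocated after the mutex has been released.
}
```

The declaration order matters here: because `fresh` is declared before `lock`, both the construction of the new table and the destruction of the old one happen outside the critical section, leaving only the swap under the mutex.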
Thanks for the clarification. |
fillseq throughput is 13% slower after PR 6862 with git hash e3f953a. The problem is new CPU overhead (user, not system). The diff landed in v6.11.
The test server is a spare UDB host (many core, fast SSD) and fillseq throughput drops from ~800k/s to ~700k/s with this diff. One example of throughput by version (6.0 to 6.22) is here. There is also a regression in 6.14 for which I am still searching.
The problem is new CPU overhead that shows up as more user CPU time as measured via /bin/time db_bench ...
The test takes ~1000 seconds and prior to this diff uses ~3000 seconds of user CPU time vs ~3500 seconds of user CPU time at this diff.
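To put those totals on a per-operation basis, here is a small sketch that converts user CPU seconds into CPU microseconds per insert. It assumes the run inserts 800M keys (the `--num=800000000` value from the command line); `CpuMicrosPerInsert` is a made-up helper name.

```cpp
#include <cassert>

// User CPU microseconds spent per inserted key.
// Hypothetical helper, not part of db_bench.
double CpuMicrosPerInsert(double user_cpu_seconds, double inserts) {
  assert(inserts > 0.0);
  return user_cpu_seconds * 1e6 / inserts;
}

// Before the diff: CpuMicrosPerInsert(3000, 800e6) is 3.75 us.
// At the diff:     CpuMicrosPerInsert(3500, 800e6) is 4.375 us.
```

Under that assumption the diff adds roughly 0.6 microseconds of user CPU per insert, an increase of about 17%.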
I am not sure whether this depends more on the number of files or concurrency because it doesn't show up as a problem on IO-bound or CPU-bound configs on a small server, nor does it show up on a CPU-bound config on this server. The repro here is what I call an IO-bound config and has ~20X more data (and files) than the CPU-bound config.
I don't see an increase in the context switch rate, so mutex contention does not appear to be a problem.
The command line is:
/usr/bin/time -f '%e %U %S' -o bm.lc.nt16.cm1.d0/1550.e3f953a/benchmark_fillseq.wal_disabled.v400.log.time numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1641213884