This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

db: more cache budget for BODIES and EXTRA columns #11548

Open
wants to merge 2 commits into
base: master
from

Conversation

ordian
Collaborator

@ordian ordian commented Mar 5, 2020

It seems that our assumption about DB sizes did not match reality.

** Compaction Stats [col0] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      1/0   16.52 MB   0.5      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L3      1/0   28.62 MB   0.1      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L4      3/0   150.05 MB   0.8      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L5     35/0    1.04 GB   0.6      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L6    811/0   49.12 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
 Sum    851/0   50.35 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0

** Compaction Stats [col2] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      2/2   315.72 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L2      1/1   13.96 MB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L3      1/0   30.18 MB   0.1      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L4     13/0   628.76 MB   0.2      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L5    121/0    6.31 GB   0.3      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L6   1025/0   63.35 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
 Sum   1163/3   70.31 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0

** Compaction Stats [col3] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      1/0   330.66 KB   0.5      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L2      1/0    5.23 MB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L3      1/0   62.90 MB   0.2      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L4     15/0   772.35 MB   0.3      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L5    150/0    7.70 GB   0.3      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L6   1272/0   78.24 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
 Sum   1440/0   86.76 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0

Changing the default memory distribution seems to help with #11494.

@ordian ordian added A0-pleasereview 🤓 Pull request needs code review. M4-core ⛓ Core client code / Rust. labels Mar 5, 2020
@dvdplm
Collaborator

dvdplm commented Mar 17, 2020

Help for reviewers:
col0 is COL_STATE, where all the 80M+ accounts and their balances/code are stored; 50.3 GB in the DB above.
col2 is COL_BODIES, which stores block bodies; 70.3 GB (this is indeed surprising to me).
col3 is COL_EXTRA, which stores block "details" and receipts and is also where we keep track of the current best/oldest known block; 86.7 GB. I suspect that this column should not be allowed to grow unbounded, and afaict it is never pruned, which makes no sense to me (bug?).
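
To make those proportions concrete, here is a minimal Rust sketch that splits a cache budget across the three columns in proportion to their on-disk size. The sizes are the "Sum" rows from the stats above; the helper `proportional_budget` and the constant `COLUMN_SIZES_GB` are hypothetical names used for illustration only, and this is not how the client actually assigns budgets.

```rust
/// Column sizes in GB, taken from the "Sum" rows of the compaction stats above.
const COLUMN_SIZES_GB: [(&str, f64); 3] = [
    ("COL_STATE", 50.35),
    ("COL_BODIES", 70.31),
    ("COL_EXTRA", 86.76),
];

/// Splits `total_budget_mb` across columns in proportion to their on-disk size.
/// Hypothetical helper for illustration; not taken from the client code.
fn proportional_budget(
    sizes: &[(&'static str, f64)],
    total_budget_mb: f64,
) -> Vec<(&'static str, f64)> {
    let total: f64 = sizes.iter().map(|(_, gb)| gb).sum();
    sizes
        .iter()
        .map(|(name, gb)| (*name, total_budget_mb * gb / total))
        .collect()
}
```

Calling `proportional_budget(&COLUMN_SIZES_GB, 100.0)` yields each column's share as a percentage of the total, which is one way to compare the data layout against whatever fixed split is currently configured.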

Collaborator

@dvdplm dvdplm left a comment

If I understand you correctly this PR stems from the observation that the column sizes do not match the cache memory allocated to them; I think you're saying "the three columns are roughly of the same size, they should have roughly the same amount of memory assigned"?

It is a really good question but I have several hard-to-answer questions:

  • isn't it true that the read/write pattern to COL_STATE is much less regular than to the others? It is really tricky to make any kind of caching efficient when users control the kinds of queries by deploying Solidity code and making token transfers from addresses that can be anywhere in the DB. Allocating as much memory as possible to speed up seeks in COL_STATE still seems like a smart move.
  • why isn't COL_BODIES ever pruned? This might be the dumbest question ever, but when a node warp syncs it doesn't have all block bodies back to genesis, does it? We backfill ancient blocks, but what is the actual purpose of that: crucial for security or "nice to have"?
  • I was under the illusion that COL_EXTRA was a column where we tossed random bits and pieces we didn't have a better place for, e.g. the best block etc. TIL that is not at all the case, but why do we need to store all transaction receipts forever? Can't we prune this? From a cursory glance at the code using COL_EXTRA it seems like we're mostly writing to it, and most reads seem to be for the "first", "best" and "ancient" keys; if it is indeed mostly appended to, spending cache on it is likely wasted. Maybe we need a COL_RECEIPT?

EDIT: There is no benchmarking data here – do you have any? What changes with the cache redistribution?

@ordian
Collaborator Author

ordian commented Mar 17, 2020

> I think you're saying "the three columns are roughly of the same size, they should have roughly the same amount of memory assigned"

The problem is that it's not just the memory assigned, it's the size of levels L0 and L1 in rocksdb, which affects the overall db (column) layout.

@dvdplm
Collaborator

dvdplm commented Mar 17, 2020

> The problem is that it's not just the memory assigned, it's the size of levels L0 and L1 in rocksdb, which affects the overall db (column) layout.

I don't know what you mean, ELI5 pls?

@AtkinsChang
Contributor

@ordian FYI, here are the stats of an overlayrecent database synced with this patch:

** Compaction Stats [col0] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       3.1      3.1       0.0   1.0      0.0     88.5     36.02             32.43        16    2.251       0      0
  L4     11/0   679.83 MB   1.0      7.7     3.1      4.6       6.2      1.5       0.0   2.0     87.6     69.7     90.54             64.85         8   11.317     69M  2006K
  L5     45/0    1.77 GB   1.0      3.7     1.3      2.4       3.4      1.0       0.0   2.5     75.7     69.3     50.11             39.12        21    2.386     30M   876K
  L6    867/0   52.10 GB   0.0     17.6     1.1     16.5      16.5      0.1       0.0  15.1     60.9     57.4    295.08            256.43        23   12.830     17M    13M
 Sum    923/0   54.54 GB   0.0     29.0     5.5     23.5      29.2      5.7       0.0   9.4     63.0     63.4    471.75            392.83        68    6.937    117M    16M
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0

** Compaction Stats [col2] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      1/0    1.97 MB   0.5      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0    114.7      0.02              0.00         1    0.017       0      0
  L2     25/0    1.55 GB   0.9      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L3      1/0   64.24 MB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L4     13/0   690.88 MB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L5    143/0    7.23 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L6   1193/0   72.86 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
 Sum   1376/0   82.38 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0    114.7      0.02              0.00         1    0.017       0      0
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0

** Compaction Stats [col3] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      1/0    2.53 MB   0.5      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     74.5      0.03              0.00         1    0.034       0      0
  L2     12/0   780.07 MB   0.5      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L3     23/0    1.45 GB   0.9      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L4     18/0   923.59 MB   0.1      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L5    173/0    9.01 GB   0.1      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L6   1466/0   90.62 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
 Sum   1693/0   102.74 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     74.5      0.03              0.00         1    0.034       0      0
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0

@dvdplm this may help
https://github.com/facebook/rocksdb/blob/master/options/options.cc#L525

@dvdplm
Collaborator

dvdplm commented Mar 24, 2020

> @dvdplm this may help
> https://github.com/facebook/rocksdb/blob/master/options/options.cc#L525

Sort of, I've read that file many times but I still have a hard time getting an intuition for what config values are relevant to us. https://github.com/facebook/rocksdb/wiki/Leveled-Compaction is a good read too, but only somewhat related to this PR.

Do you have similar stats for a DB without this patch? Do you expect the level distribution to be different? Why?

@ordian
Collaborator Author

ordian commented Mar 24, 2020

What I meant is that it's not just the memory assigned. If you look at how we use the memory budget,
https://github.com/paritytech/parity-common/blob/939151e23b132110628739e8458e6cece1f1c8d0/kvdb-rocksdb/src/lib.rs#L208
we call optimize_level_style_compaction, which in turn sets the sizes of the L0 and L1 levels of the RocksDB column, which in turn affects the whole DB layout. So https://github.com/facebook/rocksdb/wiki/Leveled-Compaction is actually relevant here.
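
For readers following along, the dependency on the budget can be sketched in Rust roughly as follows. The struct `LevelStyleSettings` and the function `level_style_settings` are hypothetical mirrors written for this illustration, not RocksDB or kvdb-rocksdb types. The field values reflect my reading of RocksDB's `OptimizeLevelStyleCompaction` and may differ between versions, so treat them as assumptions to check against the linked source.

```rust
/// Subset of the column options that RocksDB's `OptimizeLevelStyleCompaction`
/// derives from a memtable memory budget. Hypothetical mirror for illustration;
/// this is not a RocksDB type.
#[derive(Debug, PartialEq)]
struct LevelStyleSettings {
    write_buffer_size: u64,
    min_write_buffer_number_to_merge: u32,
    level0_file_num_compaction_trigger: u32,
    target_file_size_base: u64,
    max_bytes_for_level_base: u64,
}

/// Assumed derivation; verify against the RocksDB version in use.
fn level_style_settings(memtable_memory_budget: u64) -> LevelStyleSettings {
    LevelStyleSettings {
        // Each memtable gets a quarter of the budget.
        write_buffer_size: memtable_memory_budget / 4,
        // Two memtables are merged per flush, so an L0 file is about budget / 2.
        min_write_buffer_number_to_merge: 2,
        // Two L0 files (about one full budget) trigger an L0 -> L1 compaction.
        level0_file_num_compaction_trigger: 2,
        target_file_size_base: memtable_memory_budget / 8,
        // The L1 target size equals the budget, matching the L0 trigger size.
        max_bytes_for_level_base: memtable_memory_budget,
    }
}
```

Under these assumptions, raising a column's budget raises its L0 file size and its L1 target together, which is why the change is not only about cache memory.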

@AtkinsChang
Contributor

AtkinsChang commented Mar 25, 2020

@dvdplm Sorry that I only left a few words without any descriptive information. I just wanted to explain why it changes the level distribution.
As ordian said, the function optimize_level_style_compaction that we use actually sets the file size of L0 to memory_budget / 2 and the total size of L1 to memory_budget.
If I understand correctly, it changes the whole level distribution because it modifies max_bytes_for_level_base.
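
A small Rust sketch of why max_bytes_for_level_base shifts every level: under static leveled compaction, each level's target size is the base multiplied by a fixed factor per level. This assumes level_compaction_dynamic_level_bytes is off and uses RocksDB's default max_bytes_for_level_multiplier of 10. The function name `level_target_bytes` is hypothetical.

```rust
/// Target size in bytes of level `level` (1-based) under static leveled
/// compaction: base * multiplier^(level - 1).
/// Hypothetical helper; assumes dynamic level sizing is disabled.
fn level_target_bytes(max_bytes_for_level_base: u64, multiplier: u64, level: u32) -> u64 {
    assert!(level >= 1, "L0 is governed by file count, not a byte target");
    max_bytes_for_level_base * multiplier.pow(level - 1)
}
```

For example, doubling the base from 256 MB to 512 MB doubles the target of every level from L1 down, so the same data ends up spread over the levels differently.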

But I have another question. optimize_level_style_compaction aims to achieve:
2 memtables -> L0
2 L0 -> L1
So it turns off compression for L0 and L1 and tunes their sizes. But these settings seem to be overwritten. I cannot find the reason in upstream or in the official RocksDB wiki.

I don't have a non-patched state DB right now.

@dvdplm dvdplm added A1-onice 🌨 Pull request is reviewed well, but should not yet be merged. and removed A0-pleasereview 🤓 Pull request needs code review. labels Apr 15, 2020
Labels
A1-onice 🌨 Pull request is reviewed well, but should not yet be merged. M4-core ⛓ Core client code / Rust.

3 participants