forked from radii/undupfs
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathNOTES
114 lines (90 loc) · 4.24 KB
/
NOTES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
Expected use case: 100GB storage under mgmt, 300GB virtual space.
Max use case: 1PB storage.
Block size: 4096 bytes.
100GB = 25M blocks.
1TB = 260M blocks
10TB = 2 billion blocks
100TB = 20 billion blocks
1PB = 200 billion blocks
We'll store multiple blocks per bundle. How many bundles do we need?
| bl/bu
size | 512 4096 16384
------+--------------------
100GB | 50k 6400 1600
1TB | 500k 64k 16k
10TB | 5M 600k 160k
100TB | 50M 6M 1.6M
1PB | 500M 60M 16M
What's the storage overhead of 16k blocks per bundle -- at 4k with 8 byte CID,
128k per bundle of overhead. First 10 blocks in the FS requires 10 bundles
on average with 128k of overhead per, ~1MB for first 40k written.
Idea! Let bundle allocations scale without bound starting from a minimum.
Every undupfs will have bundles 0 .. f, for a minimum storage overhead
(supposing 16k blocks per bundle) of
16384 * 8 * 16 = 2MB
(blocks/bundle) * (bytes/CID) * minimum bundles
When a bundle fills up, it splits for further allocations. So if bundle `a`
fills up, then `a/{0..f}` are created to hold further allocations. Existing
blocks are not moved, so finding blocks a4ee... requires checking bundle `a`
then looking in bundle `a/4` then potentially `a/4/e` and so on.
Problem. we need some way to delete blocks. Blocks that are referred to from a
single file are easy (if we know so); when the file content changes, the
previous block becomes unreferenced and is marked to be overwritten.
Multireference blocks are harder. We could keep a reference count in the bundle
header; giving up 2 bytes of the CID (down to 48 bits of ID) is probably doable
but would kinda suck.
Could use a single bit to denote "single reference" versus "multi reference",
on the theory that multireference blocks are less likely to be deleted. Then
freeing single reference blocks is trivial, while orphaned multiref blocks would
need a mark and sweep GC or similar.
It would be really nice to avoid needing a GC.
The single bit idea could be extended to a saturating counter of any bittiness,
4 bits perhaps. Still requires a GC though, and I suspect the rewards are very
diminishing -- single versus multi is probably a big win, but 2->bigN is
probably a smaller ramp. Could use some real world numbers here, but just
consider the target use case of multiple VMs running the same guest OS.
-----------
Performance
-----------
To achieve 100 MB/sec must do 25,600 4k pages per second, 40 usec per page.
----------
Tree Shape
----------
Overall Bloom filter tree size, and filter query count per lookup, is affected
by the number of blocks stored and by the fan-out at each level of the tree.
Suppose we store 100 GB (26 million blocks, which at 128 blocks per
first-level filter gives 204,800 first-level filters). Then various fanouts
give the following tree values:
Fanout Height Filters
------ ------ -------
2 18 204800 + 204800
4 9 204800 + 68266
8 6 204800 + 29257
16 5 204800 + 13653
32 4 204800 + 6606
64 3 204800 + 3250
128 3 204800 + 1612
Suppose we store 1TB (250M blocks, 2 million first-level filters). Then we
have
2 21 2000000 + 2000000
4 11 2000000 + 666666
8 7 2000000 + 285714
16 6 2000000 + 133333
32 5 2000000 + 64516
64 4 2000000 + 31746
128 3 2000000 + 15748
How often does a higher level tree entry rule out a branch? This is a
function of the FP rate, which is determined by the Bloom parameters. The
following table provides the accuracy rate (computed as 1 - FP rate) at the
second level, for various accuracy rates and fanout sizes. Obviously larger
fanouts require higher accuracy (more bits per filter), but it's enlightening
to see how high accuracy is required to get any value at all out of a
multilevel high fanout Bloom filter tree.
accuracy | fanout
| 2 4 8 16 32 64 128
---------+------------------------------------------
0.900000 | 0.656 0.185 0.001 0.000 0.000 0.000 0.000
0.990000 | 0.961 0.851 0.526 0.076 0.000 0.000 0.000
0.999000 | 0.996 0.984 0.938 0.774 0.359 0.017 0.000
0.999900 | 1.000 0.998 0.994 0.975 0.903 0.664 0.194
0.999990 | 1.000 1.000 0.999 0.997 0.990 0.960 0.849