Repos were created by doing the following:
```
1000x: random 4000 > afile && ipfs add afile
100x:  random-files -q -dirs 4 -files 35 -depth 5 -random-fanout -random-size d && ipfs add -r d
100x:  random-files -q -dirs 4 -files 35 -depth 5 -random-fanout -random-size d && ipfs add --pin=false -r d
10x:   random-files -q -dirs 4 -files 100 -depth 7 -random-fanout -random-size d && ipfs add -r d
10x:   random-files -q -dirs 4 -files 100 -depth 7 -random-fanout -random-size d && ipfs add --pin=false -r d
2x:    random-files -q -dirs 4 -files 100 -depth 10 -random-fanout -random-size d && ipfs add -r d
2x:    random-files -q -dirs 4 -files 100 -depth 10 -random-fanout -random-size d && ipfs add --pin=false -r d
```
Other notes:
- The `-seed` option was used in `random` and `random-files` in order to get deterministic results.
- Repos were created both with and without the `--raw-leaves` option.
- Tests were done before and after any of my optimizations landed.
Implementation notes:
- Due to XXX, three IPFS repos were created: one with normal leaves, one with the incorrect CID for raw leaves, and one with the correct CID.
- All three repos were on a 32 GiB XFS partition on Linux so that I could easily back up the partition to undo the effects of a GC run.
This creates a repo with around 1.6 million small blocks (under 4k), perfect for stress testing the pinner and filesystem. About 50% of the content is pinned.
The GC is done in basically two passes: the first pass collects the colored set, that is, the set of blocks to keep; the second pass scans the blockstore and removes any blocks not in that set. I measured each pass separately. For the first pass the only thing that made any measurable difference was whether raw leaves were used, which led to a speedup of over an order of magnitude.
| What | Normal Leaves | Raw Leaves | Speedup |
|---|---|---|---|
| Get Colored Set (First Time) | 282s | 20.7s | 13.6x |
| Get Colored Set (In Cache) | 26.3s | 2.1s | 12.5x |
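For reference, here is a minimal sketch of the two-pass structure described above, including the raw-leaves shortcut that accounts for most of the speedup in the table: a raw-leaf CID already tells the traversal that the block has no links, so the block never has to be read during the first pass. The `Blockstore` interface and `GetLinks` helper are hypothetical stand-ins, not the actual go-ipfs types; only the `github.com/ipfs/go-cid` calls are real.

```go
// Sketch of the two GC passes. Blockstore and GetLinks are illustrative
// stand-ins for the real go-ipfs interfaces.
package gcsketch

import (
	"context"

	cid "github.com/ipfs/go-cid"
)

type Blockstore interface {
	AllKeys(ctx context.Context) (<-chan cid.Cid, error)        // enumerate every stored block
	GetLinks(ctx context.Context, c cid.Cid) ([]cid.Cid, error) // read a block and return its child links
	DeleteBlock(ctx context.Context, c cid.Cid) error
}

// GC keeps every block reachable from the pin roots and deletes the rest.
func GC(ctx context.Context, bs Blockstore, pinRoots []cid.Cid) error {
	// Pass 1: build the colored set (blocks to keep) by walking from the pins.
	colored := cid.NewSet()
	var walk func(c cid.Cid) error
	walk = func(c cid.Cid) error {
		if !colored.Visit(c) {
			return nil // already colored
		}
		if c.Type() == cid.Raw {
			return nil // raw leaf: no links, so the block never needs to be read
		}
		links, err := bs.GetLinks(ctx, c)
		if err != nil {
			return err
		}
		for _, l := range links {
			if err := walk(l); err != nil {
				return err
			}
		}
		return nil
	}
	for _, root := range pinRoots {
		if err := walk(root); err != nil {
			return err
		}
	}

	// Pass 2: scan the blockstore and delete anything not in the colored set.
	keys, err := bs.AllKeys(ctx)
	if err != nil {
		return err
	}
	for c := range keys {
		if !colored.Has(c) {
			if err := bs.DeleteBlock(ctx, c); err != nil {
				return err
			}
		}
	}
	return nil
}
```

With normal leaves, every block in the pinned DAGs has to be read in pass 1 just to discover that it has no children, which is the difference the table shows.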
For the second pass, when there is a significant number of blocks to delete, the time spent deleting the files outweighs all other factors. In addition, the time spent is highly dependent on the filesystem used. For the XFS partition I got different numbers depending on which repo was used; the time was anywhere from 20 to 75 minutes. I have no idea what can be done about deletion time as long as we stick to one file per block. Increasing the prefix length in flatfs might help (there were on average 3,100 files per directory), but I have not investigated that.
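As a rough, hedged illustration of why a longer prefix might help: assuming flatfs shards keys over a base-32 alphabet (so a prefix of length N gives 32^N shard directories) and that these repos used a prefix length of 2, the observed 3,100 files per directory implies roughly 3.2 million files, and each extra prefix character divides the per-directory count by 32. The prefix-length-of-2 assumption has not been verified against these repos.

```go
// Back-of-the-envelope estimate of flatfs files per directory versus
// sharding prefix length. Assumes a base-32 alphabet and that the observed
// ~3,100 files/dir corresponds to a prefix length of 2.
package main

import "fmt"

// pow32 returns 32^n, the number of shard directories for a prefix of length n.
func pow32(n int) int {
	out := 1
	for i := 0; i < n; i++ {
		out *= 32
	}
	return out
}

func main() {
	const observedFilesPerDir = 3100.0
	totalFiles := observedFilesPerDir * float64(pow32(2)) // ~3.2 million

	for prefixLen := 2; prefixLen <= 4; prefixLen++ {
		dirs := pow32(prefixLen)
		fmt.Printf("prefix %d: %7d dirs, ~%.0f files/dir\n",
			prefixLen, dirs, totalFiles/float64(dirs))
	}
}
```

Whether fewer files per directory actually speeds up unlinking on XFS is the part that would need measuring.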
After the initial GC was run and there was nothing left to delete, my optimizations had an effect on the second pass:
| What | Pre Opt | Post Opt | Speedup |
|---|---|---|---|
| Normal Leaves, First Time | 5.7s | 2.6s | 2.2x |
| Normal Leaves, Second Time | 5.8s | 2.5s | 2.3x |
| Raw Leaves, First Time | 217s | 4.4s | 49x |
| Raw Leaves, Second Time | 5.8s | 2.2s | 2.6x |
For the case with normal leaves, since everything is pinned, the first pass to get the colored set already loaded everything into the OS cache, so the second-time numbers are about the same as the first.
When raw leaves are used the story is very different: only a small fraction of the blocks needed to be read from the store for the first pass, so during the second pass there was still a lot of disk I/O.
I double-checked the first-time results with raw leaves; there really is that large of a speedup after my optimizations. This is most likely because, after my optimizations, the inodes for the files did not need to be read from disk just to list the blocks in the flatfs datastore.
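A hedged sketch of why that can matter: if the block keys can be recovered from file names alone, enumerating the datastore only needs directory entries and never has to stat each file, so the per-file inodes stay untouched. The two-level layout and the `.data` suffix below are assumptions about the flatfs on-disk format, used purely for illustration.

```go
// List block keys from a flatfs-style directory tree using only directory
// entries. os.ReadDir returns names without stat-ing each file, so the
// per-file inodes never have to be read just to enumerate keys.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func listKeys(root string) ([]string, error) {
	var keys []string
	shards, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	for _, shard := range shards {
		if !shard.IsDir() {
			continue
		}
		entries, err := os.ReadDir(filepath.Join(root, shard.Name()))
		if err != nil {
			return nil, err
		}
		for _, e := range entries {
			// The key is recovered from the file name alone; no stat/inode read.
			keys = append(keys, strings.TrimSuffix(e.Name(), ".data"))
		}
	}
	return keys, nil
}

func main() {
	keys, err := listKeys("/path/to/repo/blocks") // hypothetical repo path
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("blocks:", len(keys))
}
```

Anything that stats each entry (for example, to read its size) would force those inode reads back in.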
In all cases, when everything is in the OS cache, the speedup from my optimizations is somewhere between 2x and 3x.