
Eventually validators run out of disk space, shut down and become unrecoverable. #4517

@haight6716

Description

Describe the bug
The database grows without bound and eventually consumes most of the disk space. Recovery requires pruning and compacting the database, but the attempt itself consumes the remaining disk space and then fails.

To Reproduce

  1. Start a node.
  2. Wait a long time (more than a year, depending on the disk; I used a 1.8TB SSD).

Expected behavior
The node should be self-maintaining. It should run indefinitely if the hardware is sufficient and reliable.

Logs

Nov 12 19:13:46 sam avalanchego[459459]: [11-12|19:13:46.774] FATAL node/node.go:1437 low on disk space. Shutting down... {"remainingDiskBytes": 534765568}
...
Nov 12 19:13:48 sam systemd[1]: avalanchego.service: Main process exited, code=exited, status=1/FAILURE
Nov 12 19:13:48 sam systemd[1]: avalanchego.service: Failed with result 'exit-code'.
Nov 12 19:13:48 sam systemd[1]: avalanchego.service: Consumed 3d 17h 48min 4.952s CPU time.
Nov 12 19:13:49 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 1.
...
Nov 12 19:18:28 sam avalanchego[476120]: [11-12|19:18:28.499] FATAL node/node.go:1437 low on disk space. Shutting down... {"remainingDiskBytes": 536281088}
...
Nov 12 19:18:28 sam systemd[1]: avalanchego.service: Main process exited, code=exited, status=1/FAILURE
Nov 12 19:18:28 sam systemd[1]: avalanchego.service: Failed with result 'exit-code'.
Nov 12 19:18:28 sam systemd[1]: avalanchego.service: Consumed 3min 34.223s CPU time.
Nov 12 19:18:29 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 2.
...
Nov 12 19:18:37 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 3.
...
Nov 12 19:18:44 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 4.
...
Nov 14 07:19:02 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 1113.
...
Nov 12 19:18:28 sam avalanchego[476120]: [11-12|19:18:28.499] FATAL node/node.go:1437 low on disk space. Shutting down... {"remainingDiskBytes": 536281088}
Nov 14 07:34:54 sam avalanchego[649091]: [11-14|07:34:54.206] INFO node/node.go:1684 shutting down node {"exitCode": 0}
...
Nov 13 00:07:10 sam avalanchego[525565]: [11-13|00:07:10.510] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data                       nodes=1,397,335 skipped=597,716 size=445.63MiB elapsed=8.000s     eta=3h50m26.255s
Nov 13 00:07:17 sam avalanchego[525565]: [11-13|00:07:17.958] WARN health/worker.go:252 check started failing {"name": "health", "name": "diskspace", "tags": ["application"], "error": "remaining available disk space (546373632) is below the warning threshold of disk space (1073741824)"}
Nov 13 00:07:18 sam avalanchego[525565]: [11-13|00:07:18.510] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data                       nodes=2,790,874 skipped=1,195,235 size=890.45MiB elapsed=16.000s    eta=3h50m28.554s
Nov 13 00:07:26 sam avalanchego[525565]: [11-13|00:07:26.510] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data                       nodes=4,155,672 skipped=1,780,698 size=1.29GiB   elapsed=24.000s    eta=3h51m56.112s
Nov 13 00:07:34 sam avalanchego[525565]: [11-13|00:07:34.511] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data                       nodes=5,546,536 skipped=2,376,216 size=1.73GiB   elapsed=32.000s    eta=3h51m37.708s
Nov 13 00:07:42 sam avalanchego[525565]: [11-13|00:07:42.511] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data                       nodes=6,902,015 skipped=2,957,578 size=2.15GiB   elapsed=40.001s    eta=3h52m34.325s
Nov 13 00:07:47 sam avalanchego[525565]: [11-13|00:07:47.959] FATAL node/node.go:1437 low on disk space. Shutting down... {"remainingDiskBytes": 347484160}
Nov 13 00:07:47 sam avalanchego[525565]: [11-13|00:07:47.959] INFO node/node.go:1684 shutting down node {"exitCode": 1}
Nov 13 00:07:47 sam avalanchego[525565]: [11-13|00:07:47.959] INFO health/worker.go:115 registered new check and initialized its state to failing {"name": "health", "name": "shuttingDown", "tags": ["application"]}
Nov 13 00:07:47 sam avalanchego[525565]: [11-13|00:07:47.959] INFO chains/manager.go:1523 shutting down chain manager
Nov 13 00:07:50 sam avalanchego[525565]: [11-13|00:07:50.511] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data                       nodes=8,289,390 skipped=3,550,890 size=2.58GiB   elapsed=48.001s    eta=3h52m15.32s
Nov 13 00:07:58 sam avalanchego[525565]: [11-13|00:07:58.511] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data                       nodes=9,672,450 skipped=4,144,695 size=3.02GiB   elapsed=56.001s    eta=3h52m5.032s
Nov 13 00:08:06 sam avalanchego[525565]: [11-13|00:08:06.518] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data                       nodes=10,820,482 skipped=4,638,590 size=3.37GiB   elapsed=1m4.007s   eta=3h56m58.779s
Nov 13 00:08:14 sam avalanchego[525565]: [11-13|00:08:14.518] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data                       nodes=11,550,248 skipped=4,951,899 size=3.60GiB   elapsed=1m12.007s  eta=4h9m42.402s
Nov 13 00:08:17 sam avalanchego[525565]: [11-13|00:08:17.959] FATAL node/node.go:1437 low on disk space. Shutting down... {"remainingDiskBytes": 347484160}
Nov 13 00:08:22 sam avalanchego[525565]: [11-13|00:08:22.518] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data                       nodes=12,302,704 skipped=5,273,684 size=3.84GiB   elapsed=1m20.007s  eta=4h20m24.815s
Nov 13 00:08:30 sam avalanchego[525565]: [11-13|00:08:30.518] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data                       nodes=13,129,570 skipped=5,629,254 size=4.09GiB   elapsed=1m28.008s  eta=4h28m20.582s
Nov 13 00:08:38 sam avalanchego[525565]: [11-13|00:08:38.518] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data                       nodes=13,931,657 skipped=5,972,134 size=4.34GiB   elapsed=1m36.008s  eta=4h35m47.129s
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.396] ERROR corruptabledb/db.go:146 closing database to avoid possible corruption {"error": "write /home/julian/.avalanchego/db/mainnet/v1.4.5/32830132.log: no space left on device"}
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.397] FATAL chains/manager.go:392 error creating required chain {"subnetID": "11111111111111111111111111111111LpoYY", "chainID": "2q9e4r6Mu3U68nU1fYjgbR6JvwrRx36CohpAX5UQxse55x1Q5", "vmID": "mgj786NP7uDwBCcq6YwThhaN8FLyybkCa4zBWTQbNgmK6k9A6", "error": "error while >
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.397] INFO router/chain_router.go:374 shutting down chain router
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.397] INFO <P Chain> bootstrap/bootstrapper.go:805 shutting down bootstrapper
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.399] ERROR <P Chain> handler/handler.go:1019 failed while shutting down the chain {"error": "closed to avoid possible corruption, init error: write /home/julian/.avalanchego/db/mainnet/v1.4.5/32830132.log: no space left on device"}
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.399] INFO <P Chain> router/chain_router.go:398 chain shutdown {"shutdownDuration": "2.37771ms"}
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.399] INFO network/network.go:1142 shutting down the p2p networking
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.446] INFO nat/nat.go:193 Unmapped all ports
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.446] INFO node/node.go:1738 cleaning up plugin runtimes
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.446] ERROR node/node.go:1743 failed to delete ungraceful shutdown key {"error": "closed to avoid possible corruption, init error: write /home/julian/.avalanchego/db/mainnet/v1.4.5/32830132.log: no space left on device"}
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.459] WARN node/node.go:1750 error during DB shutdown {"error": "write /home/julian/.avalanchego/db/mainnet/v1.4.5/32830133.ldb: no space left on device"}
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.459] INFO node/node.go:1766 finished node shutdown
Nov 13 00:08:46 sam systemd[1]: avalanchego.service: Main process exited, code=exited, status=1/FAILURE
Nov 13 00:08:46 sam systemd[1]: avalanchego.service: Failed with result 'exit-code'.
Nov 13 00:08:46 sam systemd[1]: avalanchego.service: Consumed 2h 45min 21.193s CPU time.
Nov 13 00:08:47 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 1.
...
Nov 13 00:08:54 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 2.
...
Nov 13 00:09:00 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 3.
...
Nov 13 00:11:23 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 25.

Metrics
Physical disk was a 1.8TB SSD
Partition size was 1.8TB
Freed space to effect recovery was 100GB

Operating System
Pop!_OS 22.04 LTS (Ubuntu-based), Linux kernel 6.9.3

Additional context

In a Discord thread (https://discord.com/channels/578992315641626624/1277371673322061834/1438910336945098864), several possible fixes were discussed, in increasing order of (assumed) difficulty:

  1. Raise the FATAL shutdown threshold to 100GB free and the warning threshold to 150GB. This should leave enough free space to perform pruning/compacting (a rough sketch of this appears below).
  2. Perform pruning/compacting in small batches, staying conscious of available space: prune while there is free space, then compact to free some more, then prune some more, and so on.
  3. Perform pruning/compacting periodically and automatically while the node is running.
  4. Implement the Firewood DB, which shouldn't need pruning at all.

Only the last two proposals feel like "real" solutions to me -- otherwise, extensive downtime is required during the expected operation of a node; there is no way to achieve 99.999% (five 9s) reliability.
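
As a rough sketch of proposal 1 (values are in bytes, and the flag names below are my reading of avalanchego's system-tracker options -- verify them against `avalanchego --help` for your version):

# Assumed flags: shut down only below ~100GiB free, warn below ~150GiB free.
avalanchego \
  --system-tracker-disk-required-available-space=107374182400 \
  --system-tracker-disk-warning-threshold-available-space=161061273600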

Workaround

If this situation is encountered, recovery is possible:

  1. Connect additional storage (1TB USB disk).
  2. Shut down the avalanche service.
  3. Symlink-move enough files from the database folder to the new disk to create 100GB of free space on the main disk. The small script below does that; it may need to be run (from the database folder, .avalanchego/db/mainnet/###/) more than once to free enough space:
# Pick up to 10,000 of the oldest database files (skip the LOCK file), move
# each one to the spare disk, and leave a symlink behind in its place.
FILES=$(ls -t | tail -10000 | grep -v LOCK)

for i in ${FILES}; do
    mv "./$i" /mnt/ldb/
    ln -s "/mnt/ldb/$i" "./$i"
done
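
Step 4 below prunes via the C-Chain chain config. As a rough illustration of that edit (the offline-pruning key names reflect my understanding of the coreth options, and the data directory is just an example path on the spare disk -- treat the official pruning instructions as authoritative):

# Hypothetical C-Chain config enabling offline pruning; the pruning working
# directory is pointed at the spare disk so it doesn't eat main-disk space.
# Note: this overwrites any existing C-Chain config.json.
mkdir -p ~/.avalanchego/configs/chains/C /mnt/ldb/offline-pruning
cat > ~/.avalanchego/configs/chains/C/config.json <<'EOF'
{
  "offline-pruning-enabled": true,
  "offline-pruning-data-directory": "/mnt/ldb/offline-pruning"
}
EOF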
  4. Prune the database according to the instructions (edit the C-Chain config.json, restart, etc. -- see the sketch above).
  5. Wait until pruning/compaction is complete - this may take 24 hours or so.
  6. Shut down the service again.
  7. Restore the moved database files to their normal location:
# For every symlink in the database folder, copy the target's contents back
# (cp follows the symlink), then replace the symlink with the real file.
FILES=$(find . -type l)

for i in ${FILES}; do
    cp "$i" temp
    mv temp "$i"
done
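
Before restarting in step 8, it's worth confirming that every symlink was actually replaced by a regular file (run from the database folder):

# Should print 0 once all symlinks have been replaced:
find . -type l | wc -l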
  8. Restart the service. It should resume normally.
  9. Unmount and disconnect the additional storage from step 1.
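
For step 9, detaching the spare disk is the usual unmount (mount point as used in the scripts above):

sudo umount /mnt/ldb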
