Describe the bug
The database grows without bound and eventually consumes most of the disk space. Recovery requires pruning and compacting the database, but pruning itself needs working space on disk, so when attempted it consumes what little remains and then fails.
To Reproduce
- Start a node.
- Wait a long time (more than a year, depending on the disk used; I used a 1.8TB SSD).
Expected behavior
The node should be self-maintaining. It should run indefinitely if the hardware is sufficient and reliable.
Logs
Nov 12 19:13:46 sam avalanchego[459459]: [11-12|19:13:46.774] FATAL node/node.go:1437 low on disk space. Shutting down... {"remainingDiskBytes": 534765568}
...
Nov 12 19:13:48 sam systemd[1]: avalanchego.service: Main process exited, code=exited, status=1/FAILURE
Nov 12 19:13:48 sam systemd[1]: avalanchego.service: Failed with result 'exit-code'.
Nov 12 19:13:48 sam systemd[1]: avalanchego.service: Consumed 3d 17h 48min 4.952s CPU time.
Nov 12 19:13:49 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 1.
...
Nov 12 19:18:28 sam avalanchego[476120]: [11-12|19:18:28.499] FATAL node/node.go:1437 low on disk space. Shutting down... {"remainingDiskBytes": 536281088}
...
Nov 12 19:18:28 sam systemd[1]: avalanchego.service: Main process exited, code=exited, status=1/FAILURE
Nov 12 19:18:28 sam systemd[1]: avalanchego.service: Failed with result 'exit-code'.
Nov 12 19:18:28 sam systemd[1]: avalanchego.service: Consumed 3min 34.223s CPU time.
Nov 12 19:18:29 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 2.
...
Nov 12 19:18:37 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 3.
...
Nov 12 19:18:44 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 4.
...
Nov 14 07:19:02 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 1113.
...
Nov 12 19:18:28 sam avalanchego[476120]: [11-12|19:18:28.499] FATAL node/node.go:1437 low on disk space. Shutting down... {"remainingDiskBytes": 536281088}
Nov 14 07:34:54 sam avalanchego[649091]: [11-14|07:34:54.206] INFO node/node.go:1684 shutting down node {"exitCode": 0}
...
Nov 13 00:07:10 sam avalanchego[525565]: [11-13|00:07:10.510] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data nodes=1,397,335 skipped=597,716 size=445.63MiB elapsed=8.000s eta=3h50m26.255s
Nov 13 00:07:17 sam avalanchego[525565]: [11-13|00:07:17.958] WARN health/worker.go:252 check started failing {"name": "health", "name": "diskspace", "tags": ["application"], "error": "remaining available disk space (546373632) is below the warning threshold of disk space (1073741824)"}
Nov 13 00:07:18 sam avalanchego[525565]: [11-13|00:07:18.510] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data nodes=2,790,874 skipped=1,195,235 size=890.45MiB elapsed=16.000s eta=3h50m28.554s
Nov 13 00:07:26 sam avalanchego[525565]: [11-13|00:07:26.510] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data nodes=4,155,672 skipped=1,780,698 size=1.29GiB elapsed=24.000s eta=3h51m56.112s
Nov 13 00:07:34 sam avalanchego[525565]: [11-13|00:07:34.511] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data nodes=5,546,536 skipped=2,376,216 size=1.73GiB elapsed=32.000s eta=3h51m37.708s
Nov 13 00:07:42 sam avalanchego[525565]: [11-13|00:07:42.511] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data nodes=6,902,015 skipped=2,957,578 size=2.15GiB elapsed=40.001s eta=3h52m34.325s
Nov 13 00:07:47 sam avalanchego[525565]: [11-13|00:07:47.959] FATAL node/node.go:1437 low on disk space. Shutting down... {"remainingDiskBytes": 347484160}
Nov 13 00:07:47 sam avalanchego[525565]: [11-13|00:07:47.959] INFO node/node.go:1684 shutting down node {"exitCode": 1}
Nov 13 00:07:47 sam avalanchego[525565]: [11-13|00:07:47.959] INFO health/worker.go:115 registered new check and initialized its state to failing {"name": "health", "name": "shuttingDown", "tags": ["application"]}
Nov 13 00:07:47 sam avalanchego[525565]: [11-13|00:07:47.959] INFO chains/manager.go:1523 shutting down chain manager
Nov 13 00:07:50 sam avalanchego[525565]: [11-13|00:07:50.511] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data nodes=8,289,390 skipped=3,550,890 size=2.58GiB elapsed=48.001s eta=3h52m15.32s
Nov 13 00:07:58 sam avalanchego[525565]: [11-13|00:07:58.511] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data nodes=9,672,450 skipped=4,144,695 size=3.02GiB elapsed=56.001s eta=3h52m5.032s
Nov 13 00:08:06 sam avalanchego[525565]: [11-13|00:08:06.518] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data nodes=10,820,482 skipped=4,638,590 size=3.37GiB elapsed=1m4.007s eta=3h56m58.779s
Nov 13 00:08:14 sam avalanchego[525565]: [11-13|00:08:14.518] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data nodes=11,550,248 skipped=4,951,899 size=3.60GiB elapsed=1m12.007s eta=4h9m42.402s
Nov 13 00:08:17 sam avalanchego[525565]: [11-13|00:08:17.959] FATAL node/node.go:1437 low on disk space. Shutting down... {"remainingDiskBytes": 347484160}
Nov 13 00:08:22 sam avalanchego[525565]: [11-13|00:08:22.518] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data nodes=12,302,704 skipped=5,273,684 size=3.84GiB elapsed=1m20.007s eta=4h20m24.815s
Nov 13 00:08:30 sam avalanchego[525565]: [11-13|00:08:30.518] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data nodes=13,129,570 skipped=5,629,254 size=4.09GiB elapsed=1m28.008s eta=4h28m20.582s
Nov 13 00:08:38 sam avalanchego[525565]: [11-13|00:08:38.518] INFO <C Chain> core/state/pruner/pruner.go:191 Pruning state data nodes=13,931,657 skipped=5,972,134 size=4.34GiB elapsed=1m36.008s eta=4h35m47.129s
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.396] ERROR corruptabledb/db.go:146 closing database to avoid possible corruption {"error": "write /home/julian/.avalanchego/db/mainnet/v1.4.5/32830132.log: no space left on device"}
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.397] FATAL chains/manager.go:392 error creating required chain {"subnetID": "11111111111111111111111111111111LpoYY", "chainID": "2q9e4r6Mu3U68nU1fYjgbR6JvwrRx36CohpAX5UQxse55x1Q5", "vmID": "mgj786NP7uDwBCcq6YwThhaN8FLyybkCa4zBWTQbNgmK6k9A6", "error": "error while >
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.397] INFO router/chain_router.go:374 shutting down chain router
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.397] INFO <P Chain> bootstrap/bootstrapper.go:805 shutting down bootstrapper
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.399] ERROR <P Chain> handler/handler.go:1019 failed while shutting down the chain {"error": "closed to avoid possible corruption, init error: write /home/julian/.avalanchego/db/mainnet/v1.4.5/32830132.log: no space left on device"}
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.399] INFO <P Chain> router/chain_router.go:398 chain shutdown {"shutdownDuration": "2.37771ms"}
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.399] INFO network/network.go:1142 shutting down the p2p networking
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.446] INFO nat/nat.go:193 Unmapped all ports
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.446] INFO node/node.go:1738 cleaning up plugin runtimes
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.446] ERROR node/node.go:1743 failed to delete ungraceful shutdown key {"error": "closed to avoid possible corruption, init error: write /home/julian/.avalanchego/db/mainnet/v1.4.5/32830132.log: no space left on device"}
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.459] WARN node/node.go:1750 error during DB shutdown {"error": "write /home/julian/.avalanchego/db/mainnet/v1.4.5/32830133.ldb: no space left on device"}
Nov 13 00:08:46 sam avalanchego[525565]: [11-13|00:08:46.459] INFO node/node.go:1766 finished node shutdown
Nov 13 00:08:46 sam systemd[1]: avalanchego.service: Main process exited, code=exited, status=1/FAILURE
Nov 13 00:08:46 sam systemd[1]: avalanchego.service: Failed with result 'exit-code'.
Nov 13 00:08:46 sam systemd[1]: avalanchego.service: Consumed 2h 45min 21.193s CPU time.
Nov 13 00:08:47 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 1.
...
Nov 13 00:08:54 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 2.
...
Nov 13 00:09:00 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 3.
...
Nov 13 00:11:23 sam systemd[1]: avalanchego.service: Scheduled restart job, restart counter is at 25.
Metrics
- Physical disk: 1.8TB SSD
- Partition size: 1.8TB
- Space freed to effect recovery: 100GB
Operating System
Pop!_OS 22.04 LTS (Ubuntu-based), Linux kernel 6.9.3
Additional context
After some discussion in a Discord thread (https://discord.com/channels/578992315641626624/1277371673322061834/1438910336945098864), some possible fixes were discussed, in increasing order of (assumed) difficulty:
- Raise the FATAL shutdown threshold to 100GB free and the warning threshold to 150GB (see the config sketch after this list). That should leave enough free space to perform pruning/compacting.
- While pruning/compacting, perform the operation in small batches, staying conscious of available space: prune only while there is free space, then compact to free some more, then prune some more, and so on.
- Perform pruning/compacting periodically and automatically while the node is running.
- Adopt the Firewood database, which shouldn't need pruning.
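A minimal sketch of the first option: the two cutoffs visible in the logs above (a 512MiB fatal cutoff, a 1GiB warning threshold) appear to correspond to AvalancheGo's system-tracker disk flags. Assuming those flag names are still current, raising them to 100GiB/150GiB in the node's JSON config file (passed via --config-file; values are in bytes) would look like:

{
  "system-tracker-disk-required-available-space": 107374182400,
  "system-tracker-disk-warning-threshold-available-space": 161061273600
}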
Only the last two proposals feel like "real" solutions to me -- otherwise, extensive downtime is required during the expected operation of a node; there is no way to achieve 99.999% (five 9s) reliability.
Workaround
If this situation is encountered, recovery is possible:
- Connect additional storage (1TB USB disk).
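For example, assuming the USB disk appears as /dev/sdb1 (a hypothetical device name), mount it at /mnt/ldb, the mount point the scripts below expect:

sudo mkdir -p /mnt/ldb
sudo mount /dev/sdb1 /mnt/ldb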
- Shut down the avalanche service.
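Using the unit name shown in the logs above:

sudo systemctl stop avalanchego.service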
- Symlink-move enough files from the database folder to the new disk to create 100GB of free space on the main disk. Here is a small script to do that; it may need to be run more than once (from the database folder, .avalanchego/db/mainnet/###/) to free enough space:

# Move the 10,000 oldest database files to the spare disk, leaving
# symlinks behind so the database can still find them (the LOCK file
# must stay in place).
FILES=$(ls -t | tail -10000 | grep -v LOCK)
for i in ${FILES} ; do
  mv "./$i" /mnt/ldb/
  ln -s "/mnt/ldb/$i" "./$i"
done
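Between runs, check how much space has been freed on the main disk (run from the database folder):

df -h .

Repeat the script until roughly 100GB is free.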
- Prune the database according to the instructions (edit the C-Chain's config.json to enable offline pruning, restart, etc); a sketch follows.
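A hedged sketch of that config: the key names below are coreth's documented offline-pruning options, but the chain-config path (~/.avalanchego/configs/chains/C/config.json) and the data-directory value are assumptions to adapt:

{
  "offline-pruning-enabled": true,
  "offline-pruning-data-directory": "/mnt/ldb/pruning"
}

Once pruning has completed successfully, set offline-pruning-enabled back to false.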
- Wait for pruning/compaction to complete; it may take 24 hours or so.
- Shut down the service again.
- Restore the moved database files to their normal location:
# Replace each symlink with a regular copy of its target: cp follows
# the link and copies the real file, then mv overwrites the link.
FILES=$(find . -type l)
for i in ${FILES} ; do
  cp "$i" temp
  mv temp "$i"
done
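To verify that nothing was missed before restarting:

find . -type l | wc -l   # should print 0 once all links are replaced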
- Restart the service. It should resume normally.
- Unmount and disconnect the additional storage from step 1.