Extremely slow mainnet sync with pruning enabled #3333

blampe · 2022-01-23T19:58:25Z

System information

erigon version 2022.01.2-beta (sync started on v2021.12.03)
OS & Version: Linux

Observed behaviour

I've been waiting several weeks for a mainnet node to sync using --prune=htc --prune.r.before=11184524 as recommended for validators. It's now to a point where it's stalled on stage 5 / execution, and seems to be taking 5-10s to execute individual blocks. At this rate with ~3.6M blocks left to go this will take almost year to finish!

My disk is not saturated at all: it's seeing a sustained ~0.37MB/s for reads and no write activity.

Commands:

erigon --prune.r.before=11184524 --prune=htc --datadir=/data --chain=mainnet --state.stream.disable
rpcdaemon --private.api.addr=127.0.0.1:9090 --state.cache=0 --http.addr=0.0.0.0 --ws --http.api=eth,net,erigon --http.vhosts=*

Logs:

erigon [INFO] [01-23|18:37:36.021] Initialising Ethereum protocol           network=1
erigon [INFO] [01-23|18:37:36.050] Effective                                prune="--prune=htc --prune.r.before=11184524"
erigon [INFO] [01-23|18:37:52.190] Starting private RPC server              on=127.0.0.1:9090
erigon [INFO] [01-23|18:37:52.190] [1/15 Headers] Waiting for headers...    from=14063381
erigon [INFO] [01-23|18:37:53.087] Started P2P networking                   version=66 self=enode://[email protected]:30303 name=erigon/v2022.01.2-beta-5d3c7761/linux-amd64/go1.17.6
erigon [INFO] [01-23|18:37:53.530] event subscription channel established with the RPC daemon
erigon [INFO] [01-23|18:37:55.499] New txs subscriber joined
erigon [INFO] [01-23|18:38:22.191] [1/15 Headers] Wrote block headers       number=14063381 blk/second=0.000 alloc="97.73 MiB" sys="132.47 MiB"
erigon [INFO] [01-23|18:38:52.191] [1/15 Headers] Wrote block headers       number=14063381 blk/second=0.000 alloc="104.09 MiB" sys="140.47 MiB"
erigon [INFO] [01-23|18:39:22.191] [1/15 Headers] Wrote block headers       number=14063381 blk/second=0.000 alloc="112.18 MiB" sys="150.60 MiB"
erigon [INFO] [01-23|18:39:36.051] [p2p] GoodPeers                          eth66=2
erigon [WARN] [01-23|18:39:36.198] Invalidating anchor for suspected unavailability height=14063572
erigon [INFO] [01-23|18:39:52.191] [1/15 Headers] Wrote block headers       number=14063381 blk/second=0.000 alloc="75.41 MiB" sys="155.16 MiB"
erigon [INFO] [01-23|18:40:24.861] Generating ethash verification cache     epoch=468 percentage=96 elapsed=3.000s
erigon [INFO] [01-23|18:40:25.021] [1/15 Headers] Inserting headers         progress=14063382
erigon [INFO] [01-23|18:40:27.970] Generating ethash verification cache     epoch=469 percentage=96 elapsed=3.001s
erigon [INFO] [01-23|18:40:38.475] [1/15 Headers] Processed                 highest inserted=14063582 age=42s
erigon [INFO] [01-23|18:40:38.475] [1/15 Headers] DONE                      in=2m46.284907383s
erigon [INFO] [01-23|18:40:42.559] [3/15 Bodies] Processing bodies...       from=14063381 to=14063582
erigon [INFO] [01-23|18:41:12.559] [3/15 Bodies] Wrote block bodies         block_num=14063405 delivery/sec="253.11 KiB" wasted/sec="0.00 B" alloc="276.10 MiB" sys="335.79 MiB"
erigon [INFO] [01-23|18:41:19.632] [3/15 Bodies] Processed                  highest=14063582
erigon [INFO] [01-23|18:41:19.632] [4/15 Senders] Started                   from=14063381 to=14063582
erigon [INFO] [01-23|18:41:20.964] [5/15 Execution] Blocks execution        from=10469143 to=14063582
erigon [INFO] [01-23|18:41:36.051] [p2p] GoodPeers                          eth66=9
erigon [INFO] [01-23|18:41:57.545] [5/15 Execution] Executed blocks         number=10469145 blk/s=0.055 tx/s=8.584 Mgas/s=0.637 batch="80.80 KiB" alloc="439.80 MiB" sys="484.79 MiB"
erigon [INFO] [01-23|18:42:29.058] [5/15 Execution] Executed blocks         number=10469148 blk/s=0.095 tx/s=17.136 Mgas/s=1.134 batch="188.74 KiB" alloc="266.50 MiB" sys="501.29 MiB"
erigon [INFO] [01-23|18:42:55.334] [5/15 Execution] Executed blocks         number=10469151 blk/s=0.114 tx/s=20.513 Mgas/s=1.361 batch="275.89 KiB" alloc="300.03 MiB" sys="501.29 MiB"
erigon [INFO] [01-23|18:43:30.429] [5/15 Execution] Executed blocks         number=10469157 blk/s=0.171 tx/s=28.807 Mgas/s=2.036 batch="415.87 KiB" alloc="370.34 MiB" sys="501.48 MiB"
erigon [INFO] [01-23|18:43:36.051] [p2p] GoodPeers                          eth66=15
erigon [INFO] [01-23|18:43:52.281] [5/15 Execution] Executed blocks         number=10469161 blk/s=0.183 tx/s=23.796 Mgas/s=1.635 batch="506.01 KiB" alloc="405.55 MiB" sys="501.54 MiB"
erigon [INFO] [01-23|18:44:22.310] [5/15 Execution] Executed blocks         number=10469165 blk/s=0.133 tx/s=24.576 Mgas/s=1.587 batch="618.29 KiB" alloc="273.21 MiB" sys="501.79 MiB"
erigon [INFO] [01-23|18:44:55.907] [5/15 Execution] Executed blocks         number=10469170 blk/s=0.149 tx/s=23.008 Mgas/s=1.774 batch="762.54 KiB" alloc="332.25 MiB" sys="501.85 MiB"
erigon [INFO] [01-23|18:45:21.501] [5/15 Execution] Executed blocks         number=10469175 blk/s=0.195 tx/s=34.110 Mgas/s=2.328 batch="883.07 KiB" alloc="367.52 MiB" sys="501.98 MiB"
erigon [INFO] [01-23|18:45:36.051] [p2p] GoodPeers                          eth66=18
erigon [INFO] [01-23|18:45:57.434] [5/15 Execution] Executed blocks         number=10469182 blk/s=0.195 tx/s=29.750 Mgas/s=1.990 batch="1.01 MiB" alloc="427.50 MiB" sys="502.10 MiB"
erigon [INFO] [01-23|18:46:23.798] [5/15 Execution] Executed blocks         number=10469187 blk/s=0.190 tx/s=31.254 Mgas/s=2.262 batch="1.12 MiB" alloc="349.47 MiB" sys="548.73 MiB"
erigon [INFO] [01-23|18:46:51.025] [5/15 Execution] Executed blocks         number=10469192 blk/s=0.184 tx/s=31.918 Mgas/s=2.188 batch="1.24 MiB" alloc="411.84 MiB" sys="548.73 MiB"
erigon [INFO] [01-23|18:47:23.621] [5/15 Execution] Executed blocks         number=10469200 blk/s=0.245 tx/s=37.244 Mgas/s=2.173 batch="1.40 MiB" alloc="314.18 MiB" sys="560.86 MiB"
erigon [INFO] [01-23|18:47:36.050] [p2p] GoodPeers                          eth66=22
erigon [INFO] [01-23|18:47:53.831] [5/15 Execution] Executed blocks         number=10469205 blk/s=0.166 tx/s=32.141 Mgas/s=1.971 batch="1.52 MiB" alloc="413.85 MiB" sys="560.86 MiB"
erigon [INFO] [01-23|18:48:21.807] [5/15 Execution] Executed blocks         number=10469210 blk/s=0.179 tx/s=34.566 Mgas/s=2.126 batch="1.66 MiB" alloc="475.26 MiB" sys="560.86 MiB"
erigon [INFO] [01-23|18:48:53.177] [5/15 Execution] Executed blocks         number=10469215 blk/s=0.159 tx/s=34.460 Mgas/s=1.902 batch="1.79 MiB" alloc="531.15 MiB" sys="589.67 MiB"
erigon [INFO] [01-23|18:49:25.025] [5/15 Execution] Executed blocks         number=10469222 blk/s=0.220 tx/s=36.329 Mgas/s=2.247 batch="1.94 MiB" alloc="312.33 MiB" sys="601.67 MiB"
erigon [INFO] [01-23|18:49:36.051] [p2p] GoodPeers                          eth66=27
erigon [INFO] [01-23|18:49:55.530] [5/15 Execution] Executed blocks         number=10469227 blk/s=0.164 tx/s=25.930 Mgas/s=1.923 batch="2.08 MiB" alloc="367.56 MiB" sys="601.67 MiB"
erigon [INFO] [01-23|18:50:22.150] [5/15 Execution] Executed blocks         number=10469233 blk/s=0.225 tx/s=32.006 Mgas/s=2.238 batch="2.20 MiB" alloc="411.54 MiB" sys="601.67 MiB"
erigon [INFO] [01-23|18:50:54.556] [5/15 Execution] Executed blocks         number=10469239 blk/s=0.185 tx/s=35.827 Mgas/s=2.182 batch="2.35 MiB" alloc="496.23 MiB" sys="601.86 MiB"
erigon [INFO] [01-23|18:51:21.056] [5/15 Execution] Executed blocks         number=10469244 blk/s=0.189 tx/s=37.321 Mgas/s=2.187 batch="2.46 MiB" alloc="320.23 MiB" sys="601.86 MiB"

The text was updated successfully, but these errors were encountered:

AndreaLanfranchi · 2022-01-24T08:52:34Z

Please provide more info about your hardware configuration. Total RAM, storage kind (hdd SSD nvme) cpu

blampe · 2022-01-25T01:34:54Z

@AndreaLanfranchi sure thing!

Important to mention this is with the thorax/erigon:v2022.01.02 container running on GCP.

I scaled up the available resources to 6 vCPU and 8GB RAM (e2-standard-8) and still saw ~blk/s=0.15. Most of the resources were idle: CPU utilization was at 8%; memory utilization was at 13%; disk read still ~0.4MB/s (100 IOPS) and write 0MB/s (0 IOPS).

The disk is 512GB pd-standard. Performance is somewhat complicated, but it's roughly equivalent to an HDD. In this configuration it's capable of 60MB/s sustained R/W (and ~400/800 R/W IOPS), which I've confirmed by reading and writing /dev/urandom to the drive and observing throughput in the 50-100MB/s range.

I wouldn't be surprised if the disk is the bottleneck, but what is surprising it how little load erigon is putting on it. If I was seeing this slowness and the disk maxed out, then upgrading to a faster disk would be a no-brainer. Strange that the disk is barely showing any activity, though!

Slightly unrelated -- I'd still like to get this node up and running and am trying to think of some creative workarounds to this problem. I found some some full archive torrents -- if I were to grab one of those, would it be compatible with the prune settings I'm running with?

AskAlexSharov · 2022-01-25T02:52:29Z

HDD - no, don’t expect enough good performance of Erigon on HDD.
Don’t use Network-based Disks in cloud - they are bandwidth-optimized, you need latency-optimized real disk. Bottleneck of genesis sync: disk latency and RAM size.
“Compatible with pruning settings” - prune settings can’t be changed after db creation.

blampe · 2022-01-30T21:39:11Z

Indeed, after digging a bit more it seems this was a re-hash of #1516.

HDD - no, don’t expect enough good performance of Erigon on HDD.

My goal was only to do the the slow sync on the cheaper HDD drive class, and then upgrade to pd-balanced or pd-ssd once up and running. I thought this strategy would work because I've done it successfully with Nethermind, but clearly Erigon has very different disk access patterns during sync!

Don’t use Network-based Disks in cloud - they are bandwidth-optimized, you need latency-optimized real disk. Bottleneck of genesis sync: disk latency and RAM size.

I did a quick experiment to test this: I upgraded a snapshot to pd-ssd (which provides maybe 1/10th-ish the latency?) and re-ran the sync from the same starting point. That got things up to ~8blk/s, which does confirm latency as a bottleneck.

That's still super slow and doesn't justify the cost of running the expensive cloud drive, so I'm going to do this locally and then upload to my instance when that's ready.

Snapshotting is sure to be a killer feature for Erigon -- getting the DB into a workable state has been fairly finicky and resource intensive, but once it's sync'd I've found it to be very reliable without requiring much resources. Snapshots will allow it to more easily run on a larger variety of hardware.

“Compatible with pruning settings” - prune settings can’t be changed after db creation.

Super helpful, thanks!

AskAlexSharov · 2022-01-31T01:39:29Z

Have fun

AskAlexSharov closed this as completed Jan 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extremely slow mainnet sync with pruning enabled #3333

Extremely slow mainnet sync with pruning enabled #3333

blampe commented Jan 23, 2022

AndreaLanfranchi commented Jan 24, 2022

blampe commented Jan 25, 2022

AskAlexSharov commented Jan 25, 2022

blampe commented Jan 30, 2022

AskAlexSharov commented Jan 31, 2022

Extremely slow mainnet sync with pruning enabled #3333

Extremely slow mainnet sync with pruning enabled #3333

Comments

blampe commented Jan 23, 2022

System information

Observed behaviour

AndreaLanfranchi commented Jan 24, 2022

blampe commented Jan 25, 2022

AskAlexSharov commented Jan 25, 2022

blampe commented Jan 30, 2022

AskAlexSharov commented Jan 31, 2022