Replies: 1 comment
@vedantroy Sorry for the slow response; I somehow missed this earlier. I do believe this would work well with multi-terabyte datasets. I have consistently used it with 100GB+ datasets (not quite multi-terabyte, but still significantly larger than the RAM available on the servers it was running on) and had great results. Naturally, as with any other database, when the database is larger than available memory, many more operations (random lookups) will require slower disk access, but LMDB is still very efficient at this compared to other databases. Symas has published benchmarks on this as well (the read benchmarks are probably the most relevant, and lmdb-js has vastly better write performance than a write-per-transaction model due to batching). One change in strategy that may be appropriate for very large databases is to use asynchronous gets via the …
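To make that concrete, here is a minimal sketch of the batched-write and asynchronous-read pattern described above, using lmdb-js's `open`, `put`, and `getMany` calls. The path, keys, and record shapes are made up for illustration, and `getMany` is just one asynchronous read option, not necessarily the method the reply was going to name:

```js
import { open } from 'lmdb';

const db = open({
  path: './big-dataset.lmdb', // placeholder path
  compression: true,          // optional; shrinks the on-disk / page-cache footprint
});

async function loadRecords(records) {
  let lastPut;
  for (const [id, value] of records) {
    // put() queues the write; lmdb-js batches queued writes into a single
    // transaction instead of committing one transaction per put
    lastPut = db.put(id, value);
  }
  // the promise from the last put resolves once the whole batch is committed
  await lastPut;
}

async function readRecords(ids) {
  // get() is synchronous, so a cold page means a blocking page fault;
  // getMany() resolves asynchronously, which helps when the database is
  // much larger than RAM and many lookups have to hit disk
  return db.getMany(ids);
}
```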
I used this earlier for a web development project, and the experience was pretty smooth, so I was thinking of using it for an ML project!
Would this be good for handling multi-terabyte datasets?
E.g., a simple schema of:
`id => (img_bytes, img_caption)`
or something similar (a sketch of what I have in mind is below). Or do you think performance would degrade and random lookups would be very slow?
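Something like this minimal sketch is roughly what I am picturing; the store path and field names are just placeholders I am assuming:

```js
import { open } from 'lmdb';

// placeholder path; one record per image id
const images = open({ path: './images.lmdb' });

async function addImage(id, imgBytes, imgCaption) {
  // the default msgpack encoding can serialize an object holding a Buffer,
  // so the raw bytes and the caption can live in a single record
  await images.put(id, { imgBytes, imgCaption });
}

function getImage(id) {
  // returns { imgBytes, imgCaption } or undefined if the id is absent
  return images.get(id);
}
```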