LMDB can overflow #552

Closed
mo-fu opened this issue Jan 13, 2022 · 2 comments · Fixed by #554

@mo-fu
Contributor

mo-fu commented Jan 13, 2022

When training an nn_ensemble project with a large number of documents, LMDB throws a size-related error:

lmdb.MapFullError: mdb_put: MDB_MAP_FULL: Environment mapsize limit reached

I propose three different solutions to get around this issue:

  • Make the size configurable using an environment variable.
    Probably the easiest solution, but users may still run into the problem if the size is not set correctly.
  • When the DB is full, end the transaction and double the size of the DB.
    This would need some refactoring of the transaction handling and would probably lead to more complicated code, since transactions are managed at a different level of the call hierarchy than the actual writing (see the sketch below).
  • Use the dataset functionality of TensorFlow: TFRecordWriter and TFRecordDataset.
    This would introduce a new technology for handling the input data of nn_ensemble. The required effort is probably somewhere between the first two proposals. It also seems like you want to use LMDB more in the future (see "Use LMDB to store vectors in PAV backend" #378), so this may not align with your roadmap.

What do you think about the options? I could work on this.
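
For illustration, a rough sketch of what the first two options could look like with the py-lmdb API. The ANNIF_LMDB_MAP_SIZE variable name and the put_with_resize helper are hypothetical, not existing Annif code:

```python
import os
import lmdb

# Option 1: read the map size from an environment variable
# (variable name is hypothetical; 1 GB matches the current hardcoded limit).
map_size = int(os.environ.get("ANNIF_LMDB_MAP_SIZE", 1024 ** 3))
env = lmdb.open("train-data.mdb", map_size=map_size)

# Option 2: retry the write and double the map size when MDB_MAP_FULL is hit.
def put_with_resize(env, key, value):
    while True:
        try:
            with env.begin(write=True) as txn:
                txn.put(key, value)
            return
        except lmdb.MapFullError:
            # set_mapsize() must be called while no write transaction is open,
            # which is why the transaction handling becomes more complicated.
            env.set_mapsize(env.info()["map_size"] * 2)
```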

@osma
Member

osma commented Jan 13, 2022

Thanks for the report. I guess this is a "640kB ought to be enough for anyone" type bug. The current, hardcoded maximum size is 1GB. I never expected that anyone would get even close to the limit; we typically train nn_ensemble models with at most tens of thousands of samples.

Make the size configurable using an environment variable.

Why not simply a parameter for the nn_ensemble backend?

As for the other options, I would slightly prefer switching to TFRecordWriter/TFRecordDataset instead of the "double the size" approach with LMDB. As I understand it, this functionality is already included in TensorFlow, so it would allow dropping LMDB as a dependency, at least for now.

While we've had some thoughts about using LMDB more in the future (as in #378), this is not a goal in itself - rather, LMDB can be an elegant solution for storing large amounts of data on disk, but other options are possible too and this can be chosen on a case-by-case basis.
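
To make the comparison concrete, a minimal sketch of the TFRecord route, assuming one record per sample with "scores" and "targets" float features (the feature names and the fixed length of 4 are purely illustrative, not Annif's actual data layout):

```python
import numpy as np
import tensorflow as tf

def serialize_sample(scores, targets):
    # Pack one (scores, targets) pair into a tf.train.Example.
    feature = {
        "scores": tf.train.Feature(float_list=tf.train.FloatList(value=scores)),
        "targets": tf.train.Feature(float_list=tf.train.FloatList(value=targets)),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

# Write the training samples to disk; no map size to worry about.
with tf.io.TFRecordWriter("train-data.tfrecord") as writer:
    writer.write(serialize_sample(np.random.rand(4), [0.0, 1.0, 0.0, 1.0]))

def parse_sample(record):
    spec = {
        "scores": tf.io.FixedLenFeature([4], tf.float32),
        "targets": tf.io.FixedLenFeature([4], tf.float32),
    }
    parsed = tf.io.parse_single_example(record, spec)
    return parsed["scores"], parsed["targets"]

# Stream the samples back for training via tf.data.
dataset = tf.data.TFRecordDataset("train-data.tfrecord").map(parse_sample).batch(32)
```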

@mo-fu
Contributor Author

mo-fu commented Jan 13, 2022

Why not simply a parameter for the nn_ensemble backend?

Mainly because it is not a parameter of the algorithm itself, only of the data handling.
