LMDB can overflow #552

Closed
mo-fu opened this issue Jan 13, 2022 · 2 comments · Fixed by #554

@mo-fu
Contributor

mo-fu commented Jan 13, 2022

When training an nn_ensemble project with a large number of documents, LMDB throws a size-related error:

lmdb.MapFullError: mdb_put: MDB_MAP_FULL: Environment mapsize limit reached

I propose three different solutions to get around this issue:

  • Make the size configurable using an environment variable.
    Probably the easiest solution, but users may still run into the problem if the size is not set correctly.
  • When the DB is full, end the transaction and double the size of the DB.
    This would need some refactoring of the transaction handling and would probably lead to more complicated code, since transactions are managed at a different level of the call hierarchy than the actual writing (see the sketch below).
  • Use the dataset functionality of TensorFlow: TFRecordWriter and TFRecordDataset.
    This would introduce a new technology for handling the input data of nn_ensemble. The required effort is probably somewhere between the first two proposals. It also seems like you want to use LMDB more in the future (see "Use LMDB to store vectors in PAV backend" #378), so this may not align with your roadmap.

What do you think about the options? I could work on this.
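
For illustration, a rough sketch of what the first two options could look like with the py-lmdb API. The ANNIF_LMDB_MAP_SIZE variable name and the put_with_resize helper are hypothetical, not existing Annif code:

```python
import os
import lmdb

# Option 1: read the map size from an environment variable
# (variable name is hypothetical; 1 GB matches the current hardcoded limit).
map_size = int(os.environ.get("ANNIF_LMDB_MAP_SIZE", 1024 ** 3))
env = lmdb.open("train-data.mdb", map_size=map_size)

# Option 2: retry the write and double the map size when MDB_MAP_FULL is hit.
def put_with_resize(env, key, value):
    while True:
        try:
            with env.begin(write=True) as txn:
                txn.put(key, value)
            return
        except lmdb.MapFullError:
            # set_mapsize() must be called while no write transaction is open,
            # which is why the transaction handling becomes more complicated.
            env.set_mapsize(env.info()["map_size"] * 2)
```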

@osma
Member

osma commented Jan 13, 2022

Thanks for the report. I guess this is a "640kB ought to be enough for anyone" type bug. The current, hardcoded maximum size is 1GB. I never expected that anyone would get even close to the limit; we typically train nn_ensemble models with at most tens of thousands of samples.

Make the size configurable using an environment variable.

Why not simply a parameter for the nn_ensemble backend?

As for the other options, I would slightly prefer switching to TFRecordWriter/TFRecordDataset instead of the "double the size" approach with LMDB. As I understand it, this functionality is already included in TensorFlow, so it would allow dropping LMDB as a dependency, at least for now.

While we've had some thoughts about using LMDB more in the future (as in #378), this is not a goal in itself - rather, LMDB can be an elegant solution for storing large amounts of data on disk, but other options are possible too and this can be chosen on a case-by-case basis.
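
To make the comparison concrete, a minimal sketch of the TFRecord route, assuming one record per sample with "scores" and "targets" float features (the feature names and the fixed length of 4 are purely illustrative, not Annif's actual data layout):

```python
import numpy as np
import tensorflow as tf

def serialize_sample(scores, targets):
    # Pack one (scores, targets) pair into a tf.train.Example.
    feature = {
        "scores": tf.train.Feature(float_list=tf.train.FloatList(value=scores)),
        "targets": tf.train.Feature(float_list=tf.train.FloatList(value=targets)),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

# Write the training samples to disk; no map size to worry about.
with tf.io.TFRecordWriter("train-data.tfrecord") as writer:
    writer.write(serialize_sample(np.random.rand(4), [0.0, 1.0, 0.0, 1.0]))

def parse_sample(record):
    spec = {
        "scores": tf.io.FixedLenFeature([4], tf.float32),
        "targets": tf.io.FixedLenFeature([4], tf.float32),
    }
    parsed = tf.io.parse_single_example(record, spec)
    return parsed["scores"], parsed["targets"]

# Stream the samples back for training via tf.data.
dataset = tf.data.TFRecordDataset("train-data.tfrecord").map(parse_sample).batch(32)
```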

@mo-fu
Contributor Author

mo-fu commented Jan 13, 2022

Why not simply a parameter for the nn_ensemble backend?

Mainly because it is not a parameter of the algorithm itself, only of the data handling.
