When training a nn_ensemble project with a large number of documents, LMDB throws an error about its size limit. I propose three different solutions to get around this issue:
1. Make the size configurable using an environment variable. Probably the easiest solution, but users may still run into the problem if the size is not set correctly.
2. When the DB is full, end the transaction and double the size of the DB. This would need some refactoring of the transaction handling and would probably lead to more complicated code, since the transaction handling sits at a different level of the call hierarchy than the actual writing (see the sketch after this list).
3. Use the dataset functionality of TensorFlow: TFRecordWriter and TFRecordDataset. This would introduce a new technology for handling the input data of nn_ensemble; the required effort is probably somewhere between the first two proposals. It also seems like you want to use LMDB more in the future (see "Use LMDB to store vectors in PAV backend", #378), so this may not align with your roadmap.
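To illustrate why option 2 touches the transaction handling, here is a rough sketch (not existing Annif code; names and structure are assumptions) of how doubling the LMDB map size on demand could look with the py-lmdb API:

```python
# Rough sketch of option 2: grow the LMDB map size when a write overflows it.
# py-lmdb raises lmdb.MapFullError when the current map size is exceeded, and
# Environment.set_mapsize() can enlarge it, but the failed transaction has to
# be aborted and retried, which is why this leaks into the transaction layer.
import lmdb

def put_with_resize(env, key, value):
    while True:
        try:
            with env.begin(write=True) as txn:
                txn.put(key, value)
            return
        except lmdb.MapFullError:
            # Double the map size and retry the write in a fresh transaction.
            env.set_mapsize(env.info()["map_size"] * 2)
```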
What do you think about the options? I could work on this.
Thanks for the report. I guess this is a "640kB ought to be enough for anyone" type bug. The current, hardcoded maximum size is 1GB. I never expected that anyone would get even close to the limit; we typically train nn_ensemble models with at most tens of thousands of samples.
> Make the size configurable using an environment variable.
Why not simply a parameter for the nn_ensemble backend?
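For illustration, a minimal sketch of what such a backend parameter could look like, assuming a hypothetical parameter name `lmdb_map_size` (not an existing Annif option):

```python
# Hypothetical sketch: read an optional "lmdb_map_size" backend parameter
# (e.g. set per project in projects.cfg) instead of a hardcoded 1 GB limit.
import lmdb

DEFAULT_LMDB_MAP_SIZE = 1024 * 1024 * 1024  # the current hardcoded 1 GB

def open_lmdb(path, params):
    # "params" stands for the backend's configuration parameters;
    # the parameter name is an assumption, chosen for illustration only.
    map_size = int(params.get("lmdb_map_size", DEFAULT_LMDB_MAP_SIZE))
    return lmdb.open(path, map_size=map_size)
```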
As for the other options, I would slightly prefer switching to TFRecordWriter/TFRecordDataset over the "double the size" approach with LMDB. As I understand it, this functionality is already included in TensorFlow, so it would allow dropping LMDB as a dependency, at least for now.
While we've had some thoughts about using LMDB more in the future (as in #378), this is not a goal in itself - rather, LMDB can be an elegant solution for storing large amounts of data on disk, but other options are possible too and this can be chosen on a case-by-case basis.
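For reference, a minimal, self-contained sketch of the TFRecordWriter/TFRecordDataset approach, independent of how nn_ensemble currently represents its training data (the feature layout below is an assumption):

```python
# Minimal sketch: write (input, target) float vectors to a TFRecord file and
# read them back as a tf.data pipeline for training.
import tensorflow as tf

def write_records(path, inputs, targets):
    # Serialize each sample as a tf.train.Example with two float features.
    with tf.io.TFRecordWriter(path) as writer:
        for x, y in zip(inputs, targets):
            example = tf.train.Example(features=tf.train.Features(feature={
                "input": tf.train.Feature(
                    float_list=tf.train.FloatList(value=list(x))),
                "target": tf.train.Feature(
                    float_list=tf.train.FloatList(value=list(y))),
            }))
            writer.write(example.SerializeToString())

def load_dataset(path, input_dim, target_dim, batch_size=32):
    feature_spec = {
        "input": tf.io.FixedLenFeature([input_dim], tf.float32),
        "target": tf.io.FixedLenFeature([target_dim], tf.float32),
    }

    def parse(record):
        parsed = tf.io.parse_single_example(record, feature_spec)
        return parsed["input"], parsed["target"]

    return tf.data.TFRecordDataset(path).map(parse).batch(batch_size)
```

The resulting dataset yields (input, target) batches and can be fed directly to Keras `model.fit()`.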