Prevent SimpleDirectoryReader from excessive memory consumption #18983
…st, and not after, to prevent excessive memory consumption
This reverts commit df5bb77.
    if limit:
        c += 1
        if c > limit:
            break
Does this actually fix the issue? I think we still load all file-refs into memory 🤔
Maybe a better fix is something like
    for _ref in os.glob(...):
        if len(all_files) > limit:
            break
        # Manually check the ref...
        ...
tbh it would be nice if we could pass the limit directly into glob 🤔 But I didn't see a limit in their docs
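For what it's worth, the standard-library `glob.iglob` returns an iterator instead of a full list, so a cap can be applied from outside with `itertools.islice`. A minimal sketch, assuming a local filesystem rather than the reader's `self.fs` (whose glob may behave differently); `first_n_matches` is a hypothetical helper name, not something in the codebase:

```python
import glob
import itertools


def first_n_matches(pattern, limit=None, recursive=False):
    """Return at most `limit` paths matching `pattern` (hypothetical helper)."""
    # iglob yields matches one by one instead of returning them all at once.
    matches = glob.iglob(pattern, recursive=recursive)
    # islice with a stop of None takes everything, so limit=None means "no cap".
    return list(itertools.islice(matches, limit))
```

Calling `first_n_matches("data/*.txt", limit=3)` would stop pulling matches after the third one, so the returned list never grows past the limit.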
We could use something like:
    import os
    counter = 0
    for root, dirs, files in self.fs.walk():
        for file in files:
            counter += 1
            if counter > limit:
                break
            refs.append(os.path.join(root, file))
ah fs.walk is a nice approach yes
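One caveat with a nested-loop walk: `break` only leaves the inner loop, so the traversal itself keeps going after the limit is hit. Wrapping the walk in a generator and capping it with `itertools.islice` stops the traversal as soon as enough refs are collected. A rough sketch, using `os.walk` as a stand-in for `self.fs.walk` (an assumption, so that it runs without the filesystem object); `iter_file_refs` and `collect_refs` are hypothetical names:

```python
import itertools
import os


def iter_file_refs(root):
    """Yield file paths under `root` one at a time (os.walk stands in for fs.walk)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def collect_refs(root, limit=None):
    """Collect at most `limit` file paths; limit=None collects everything."""
    # islice stops pulling from the generator once `limit` items are taken,
    # so the underlying walk is not advanced any further.
    return list(itertools.islice(iter_file_refs(root), limit))
```

With this shape the limit check lives in one place and no counter variable is needed.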
        if limit and c > limit:
            break
        file_refs.append(os.path.join(root, file))
    else:
Since these two codepaths are the same, we can reduce code duplication by just changing the args, I think?
    depth = None if self.recursive else 1
(I'm actually also hesitant to pass in None to begin with; we should have some upper bound, like 1000?)
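The suggestion above can be sketched as a single code path where recursion is just a depth setting. This assumes `os.walk` in place of the filesystem object's walk, and `MAX_DEPTH` and `walk_depth_limited` are hypothetical names; the value 1000 is only the figure floated in the comment, not a value from the codebase:

```python
import os

MAX_DEPTH = 1000  # illustrative upper bound, taken from the review comment


def walk_depth_limited(root, recursive, max_depth=MAX_DEPTH):
    """Yield file paths under `root`: top level only, or up to `max_depth` levels."""
    depth = max_depth if recursive else 1
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # The root directory is level 1, its children are level 2, and so on.
        level = 1 if rel == os.curdir else rel.count(os.sep) + 2
        if level >= depth:
            dirnames[:] = []  # prune in place so os.walk does not descend further
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Both the recursive and the non-recursive case then go through the same loop, and an unbounded `None` depth never has to be passed.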
Description
With this PR, we prevent SimpleDirectoryReader from first loading every file into a list and only then limiting their number, which causes memory exhaustion in resource-intensive edge cases. Instead, the limit is enforced inside the loop that adds files to the all_files list.
Fixes this Huntr issue