Skip to content

Conversation

@AstraBert
Copy link
Member

Description

With this PR, we prevent SimpleDirectoryReader from loading all the files within a list and then limiting their number (which causes memory exhaustion in resource-intense edge use cases), by implementing the limitation within the loop that add files to the all_files list.

Fixes this Huntr issue

@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jun 5, 2025
if limit:
c += 1
if c > limit:
break
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually fix the issue? I think we still load all file-refs into memory 🤔

Maybe a better fix is something like


for _ref in os.glob(...):
  if len(all_files) > limit:
    break
  # Manually check the ref...
  ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh it would be nice if we could pass the limit directly into glob 🤔 But I didn't see a limit in their docs

Copy link
Member Author

@AstraBert AstraBert Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use something as:

import os 

counter = 0
for root, dirs, files in self.fs.walk():
    for file in files:
        counter += 1
        if counter > limit:
            break
        refs.append(os.path.join(root, file))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah fs.walk is a nice approach yes

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Jun 5, 2025
if limit and c > limit:
break
file_refs.append(os.path.join(root, file))
else:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these two codepaths are the same, we can reduce code duplication by just change the args I think?

depth = None if self.recursive else 1

(I'm actually also hesitant to pass in None to begin with (we should have some upper bound like 1000 ?)

@logan-markewich logan-markewich enabled auto-merge (squash) June 6, 2025 00:31
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jun 6, 2025
@logan-markewich logan-markewich merged commit 53614e2 into main Jun 6, 2025
10 checks passed
@logan-markewich logan-markewich deleted the clelia/simple-directory-reader-memory branch June 6, 2025 01:13
@colca colca mentioned this pull request Jun 9, 2025
18 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm This PR has been approved by a maintainer size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants