Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-89727: Fix os.walk to handle late editing of dirs #100703

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

jonburdo
Copy link
Contributor

@jonburdo jonburdo commented Jan 3, 2023

Allow modification of the dirs returned by an os.walk entry to affect which subdirectories are walked, even when the modification occurs after iteration over dirs has begun. This was the behavior for a long time before it was changed in #100703

@jonburdo jonburdo changed the title adopt old os.walk behavior to handle changes to dirs list after itera… gh-89727: Fix os.walk to handle late editing of dirs Jan 3, 2023
@jonburdo
Copy link
Contributor Author

jonburdo commented Jan 3, 2023

to do:

  • clean up code
  • add some tests

@jonburdo
Copy link
Contributor Author

jonburdo commented Jan 3, 2023

The BytesWalkTests.walk and BytesFwalkTests.walk wrapper functions actually prevent late modification (after iteration of dirs has started) because of the encoding/decoding logic they use which sets dirs before iterating over it. These are of course only for testing purposes, but it means that if we want to run test_walk_late_dirs_modification in them as well as in WalkTests, we'd need an alternate walk wrapper which this test uses and some logic to handle the bytes results in the Bytes... classes - maybe not worth it?

Lib/os.py Outdated
Comment on lines 360 to 364
# We may not have read permission for top, in which case we can't
# get a list of the files the directory contains.
# We suppress the exception here, rather than blow up for a
# minor reason when (say) a thousand readable directories are still
# left to visit.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment belongs with the try: scandir block, which is moved from here and now exists in three separate locations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. It can probably just go on the first instance of this block

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually in the latest commit where this block appears in separate functions, I decided to just remove this comment. The behavior is described in the docstring and I think the try-except block is pretty clear. Happy to add it or part of it back if deemed necessary

Lib/os.py Show resolved Hide resolved
Lib/os.py Outdated
@@ -408,27 +404,61 @@ def walk(top, topdown=True, onerror=None, followlinks=False):

if walk_into:
walk_dirs.append(entry.path)
if cont:
continue

if topdown:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such a high percentage of code in this function is now under either if topdown or the reverse that I'm kind of curious what it would look like to just have separate (internal) functions for topdown vs bottom-up. But this would still increase code duplication significantly, and move the duplicated code/structure further apart, so I suspect it's still better this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually thinking about this. I just gave it a shot to see. This seems to me like one of those cases where there are enough little differences in logic that it's much cleaner to separate the two. If we didn't support dir modification for topdown or didn't care about performance it might be simpler, but when you have a lot of the same logic interspersed by a lot of little differences it gets messy. In these kinds of situations I also tend to prefer some duplication between functions over one big function with conditions inside of loops.

I also find it easier to look at the two sets of logic separately, but either way works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think the split version seems fine. Will wait a bit and see if some core devs have opinions.

Lib/os.py Show resolved Hide resolved
@jonburdo
Copy link
Contributor Author

jonburdo commented Jan 6, 2023

A few more notes for context:

The first commit has more minimal changes similar to this suggestion. In that commit, because we're using an iterator, the implementation is a bit awkward and inefficient (i.e. only ever calling next(dirs) once per loop). There's two cases in which we ignore the value given by next(dirs) and move on:

  1. With topdown=True, the followlinks or not islink(top) check fails
  2. The line scandir_it = scandir(top) throws an OSError which is caught (and not re-raised by onerror)
    The subsequent changes handle these more naturally by looping through the iterator instead of calling continue to go through the outer loop again.

The subsequent changes also separate the top down and bottom up logic further (even before splitting into separate functions). Note that even before this PR, there was a fair amount of code that only applied when we have topdown=False:

  • The if isinstance(top, tuple): call at the top of the outer loop
  • walk_dirs creation and if isinstance(top, tuple): inside the scandir_it loop
  • if topdown: block at the bottom

@zmievsa
Copy link
Contributor

zmievsa commented Jan 6, 2023

Do we really want to introduce this much complexity for a completely undocumented and possibly unused feature?

@jonburdo
Copy link
Contributor Author

jonburdo commented Jan 6, 2023

Do we really want to introduce this much complexity for a completely undocumented and possibly unused feature?

Maybe not. I made this PR to see what it would look like, but am not attached to it. If I were designing os.walk from scratch, I wouldn't support this late editing of dirs - at least not in this way where the list can be edited during its iteration. I think this could be reasonably considered behavior that produces an undefined result, or simply considered a bug, although it was never documented as such.

But I strongly suspect it is used out there (maybe even by accident), and consistent behavior is valuable. If we don't support this, it'd be nice to have some sort of warning about the behavior change but I also don't see a reasonable way to do that. (Checking if dirs was modified would be complex or expensive too and kind of weird).

I'd suggest at least noting the change in behavior in release notes if we don't support the old behavior. I also do prefer separating the top down and bottom up logic as in this PR, although that could be a separate matter.

Other considerations:

  • Complexity (or at least code change) is a bit less in the first commit here if that seems any better.
  • I don't like the idea of copying this mess over to pathlib.Path.walk, and I also don't like the idea of having different behavior, especially if this is considered a bug or undefined behavior. It's also useful to keep the code similar between the two for maintenance, understanding logic, future changes, etc

@barneygale
Copy link
Contributor

I don't like the idea of copying this mess over to pathlib.Path.walk, and I also don't like the idea of having different behavior, especially if this is considered a bug or undefined behavior. It's also useful to keep the code similar between the two for maintenance, understanding logic, future changes, etc

pathlib's version of walk() was merged only because it improved the ergonomics of os.walk(), including handling symlinks to directories more consistently, which isn't possible in os.walk() for backwards compatibility.

Mutating dirnames after resuming the walk isn't included or even hinted at in the pathlib docs so I don't want to expend any effort supporting it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants