Work in progress: Foreach to iterate over files in a directory to support incremental processing #10042
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
This pull request is a rough sketch of a feature to help address #331 and is not intended to be merged as is. I have not written tests, this code breaks existing tests, and there are issues with the current implementation, but it serves as a proof of concept that does actually work at least on a toy example. I'm making this pull request to gather feedback from the DVC team about whether this feature would be taken seriously and discuss its impacts and how it should best be designed and implemented before I go to the effort of refining it.
There is a long discussion about supporting incremental changes to directories in #331, which got closed with discussion moved here. I recently made a post in the discussion thread and this PR picks up from there.
Motivation
A common scenario machine learning practitioners find themselves in is working on a project with incrementally changing datasets. Datasets change and grow as a result of annotation and data-gathering efforts to improve model performance. However, with typical use of DVC pipelines, any incremental change to a dataset requires the entire dataset to be re-processed, even when individual examples can be processed independently of one another. For long-running or costly preprocessing jobs, this can make the workflow slow enough that it's just not feasible to use DVC pipelines, forcing the team to use ad hoc and error-prone ways of processing the data incrementally.
This PR proposes an approach to support incremental processing of data in directories by processing only those files that need to be processed.
Consider the typical way to express a preprocessing stage in `dvc.yaml`: a single stage that runs a script (in this case, `process.py`) operating at the directory level, iterating over files and producing the output directory. The problem with this approach is that it's impossible for DVC to reason about the internals of the script and realize that each example in the source directory is processed independently.
An alternative approach is to make `process.py` operate at the level of an individual file and express the stage in the `dvc.yaml` file as one that iterates over an explicit list of files. Using this approach, the dataset can grow and previously processed files are not processed again. The main issue is that this requires keeping a list of files in the `dvc.yaml`, which makes it unreadable in any situation where the list isn't trivially small. To solve the readability issue, you can put that list in a `params.yaml` file and automatically generate it with another script that scans the directory. But either way, you still need to ensure that the file list is updated before you run `dvc repro`.
Proposal
The proposal is for `foreach` to support directories, with DVC iterating over the files in the directories. In this PR, you can use `dep_files` in your `dvc.yaml` to specify a directory that DVC uses to grab a list of files and fill in the variable that `foreach` points to. The files are expressed as objects with `.name` and `.stem` attributes. E.g. if the filename is 'some_file.txt' then `stem` is 'some_file'.
This is not the most elegant way of expressing it, and it is error-prone in several ways (one example: if several stages use the `${file}` variable as a target then they will interfere with one another). But it was a quick hack that illustrates the core of the concept. Better might be for `foreach` to iterate directly over `data/raw` and not expose the user to a confusing intermediate variable. Or perhaps something else entirely. There are other ways it could be made more concise. But regardless of the specifics of the syntax, for many workflows it would be valuable to be able to specify directories for which processing should happen incrementally and independently for the files in those directories.
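As a rough illustration of the per-file expansion described above, here is a minimal Python sketch. It is not the code in this PR and does not use DVC's templating: the helper names `list_files` and `expand_commands` are hypothetical, and the `{file.name}` / `{file.stem}` placeholders use Python's `str.format` syntax purely as a stand-in for the `${file}` variable. It relies on the fact that `pathlib.Path` already exposes `.name` and `.stem` with the meaning given above.

```python
from pathlib import Path


def list_files(directory):
    """Return the regular files directly inside `directory`, sorted by path.

    Subdirectories are skipped; this is an assumption of the sketch,
    not a statement about how DVC treats nested directories.
    """
    return sorted(p for p in Path(directory).iterdir() if p.is_file())


def expand_commands(cmd_template, directory):
    """Fill `cmd_template` once per file found in `directory`.

    The template may reference `{file.name}` (e.g. 'some_file.txt')
    and `{file.stem}` (e.g. 'some_file').
    """
    return [cmd_template.format(file=path) for path in list_files(directory)]
```

For example, calling `expand_commands("python process.py data/raw/{file.name} data/processed/{file.stem}.json", "data/raw")` would yield one command per file in `data/raw`, which is the list a per-file `foreach` would otherwise have to be given by hand. The script arguments and the `.json` suffix in that call are made up for illustration.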