Work in progress: Foreach to iterate over files in a directory to support incremental processing #10042

oadams · 2023-10-20T06:06:12Z

❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

This pull request is a rough sketch of a feature to help address #331 and is not intended to be merged as is. I have not written tests, this code breaks existing tests, and there are issues with the current implementation, but it serves as a proof of concept that does actually work at least on a toy example. I'm making this pull request to gather feedback from the DVC team about whether this feature would be taken seriously and discuss its impacts and how it should best be designed and implemented before I go to the effort of refining it.

There is a long discussion about supporting incremental changes to directories in #331, which got closed with discussion moved here. I recently made a post in the discussion thread and this PR picks up from there.

Motivation

A common scenario machine practitioners find themselves in is working on a project with incrementally changing datasets. Datasets change and grow as the result of annotation and data gathering efforts to improve model performance. However, typical use of DVC pipelines results in incremental changes to datasets requiring the entire dataset to be re-processed even when processing individual examples can occur independently of one another. For long-running or costly preprocessing jobs this can make the workflow slow enough that it's just not feasible to use DVC pipelines, forcing the team to use ad-hoc and error-prone ways of processing the data incrementally.

This PR proposes an approach to support incremental processing of data in directories by processing only those files that need to be processed.

Consider this typical way to express preprocessing stages in the dvc.yaml:

stages:
  process:
      cmd: python process.py data/raw/ data/processed/
      deps:
        - data/raw/
      outs:
        - data/processed/

with a script (in this case, process.py) that operates at the directory level, iterating over files and producing the output directory. The problem with this approach is that it's impossible for DVC to reason about the internals of the script and realize that processing happens independently to each example in the source directory.

An alternative approach is to make process.py happen at the level of an individual file and express the stage in the dvc.yaml file as:

stages:
  process:
    foreach:
       - file_a_stem
       - file_b_stem
       - file_c_stem
    do:
      cmd: python process.py data/raw/${item}.txt data/processed/${item}.txt
      deps:
        - data/raw/${item}.txt
      outs:
        - data/processed/${item}.txt

Using this approach, the dataset can grow and previously processed files are not processed again. The main issue is that this requires keeping a list of files in the dvc.yaml, which makes it unreadable in any situation where the list isn't trivially small. Solving the readability issue, you can put that list in a params.yaml file and automatically generate it with another script that scans the directory. But either way you're still left with a need to ensure that one way or another the file list is updated before you run dvc repro.

Proposal

The proposal is for foreach to support directories, with DVC iterating over the files in the directories. In this PR, you can do something like this in your dvc.yaml:

stages:
  increment:
    dep_files: data/raw
    foreach: ${file}
    do:
      cmd: python process.py data/raw/${item.stem}.txt data/processed/${item.stem}.txt
      deps:
        - data/raw/${item.stem}.txt
      outs:
        - data/processed/${item.stem}.txt

where dep_files specifies a directory that DVC uses to grab a list of files and fill in the variable foreach points to. The files are expressed as objects with a .name and .stem attributes. E.g. if filename is 'some_file.txt' then stem is 'some_file'.

This is not the most elegant way of expressing it, and it is error prone in several ways (one example: if several stages use the ${file} variable as a target then they will interfere with one another). But it was a quick hack that illustrates the core of the concept. Better might be something like:

stages:
  increment:
    foreach: data/raw
    do:
      cmd: python process.py data/raw/${item.stem}.txt data/processed/${item.stem}.txt
      deps:
        - data/raw/${item.stem}.txt
      outs:
        - data/processed/${item.stem}.txt

to get foreach to directly iterate over data/raw and not expose the user to a confusing intermediate variable. Or perhaps something else entirely. There are other ways it could be made more concise. But regardless of the specifics of the syntax, for many workflows it would be valuable to be able to specify directories for which processing should happen incrementally and independently to files in those directories.

…ory from which to read files, but need to circle back to this.

…tes.

…ectory containing files that foreach iterates over.

for more information, see https://pre-commit.ci

dberenbaum · 2023-11-07T15:39:21Z

Thanks for the PR @oadams! I think the problem statement makes a lot of sense. Your last suggestion doesn't look too bad, but it feels like it would introduce lot of new logic: foreach would have to have some logic for when to iterate over a directory, plus it would require string manipulation of the file paths to get the stem. I also worry about whether using foreach-like syntax with such a large number of files and stages will make dvc.lock cumbersome.

I agree with you that #9431 is not always ideal since it means DVC logic leaks into your code, but it also has more general use outside of this scenario, and for incremental processing it gives you more flexibility to manage things like parallelizing the jobs without depending on DVC which doesn't yet have support for that.

I think it's worth exploring both directions more but would like to consider if there are any ideas for how to simplify it.

dberenbaum · 2023-12-12T13:09:30Z

Closing as we are unlikely to merge this as is, but let's keep discussion going in the linked issues.

oadams and others added 7 commits October 20, 2023 12:09

starting on the path of changing the stage schema to specify a direct…

7b1ef88

…ory from which to read files, but need to circle back to this.

Hardcoding code to search a directory for prefixes.

a674c9b

Replaced prefixes with "file" objects that have name and stem attribu…

cc32728

…tes.

Merge branch 'main' into schema_change

bc9ef61

Can specify a dep_files variable with a foreach which specifies a dir…

11b804f

…ectory containing files that foreach iterates over.

Merge remote-tracking branch 'origin/main' into foreach_file_in_dir

3d7ceef

[pre-commit.ci] auto fixes from pre-commit.com hooks

f8def4e

for more information, see https://pre-commit.ci

dberenbaum closed this Dec 12, 2023

dberenbaum mentioned this pull request Jan 4, 2024

Granular pipeline dependency status #9431

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Work in progress: Foreach to iterate over files in a directory to support incremental processing #10042

Work in progress: Foreach to iterate over files in a directory to support incremental processing #10042

Uh oh!

oadams commented Oct 20, 2023

Uh oh!

dberenbaum commented Nov 7, 2023

Uh oh!

dberenbaum commented Dec 12, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Work in progress: Foreach to iterate over files in a directory to support incremental processing #10042

Work in progress: Foreach to iterate over files in a directory to support incremental processing #10042

Uh oh!

Conversation

oadams commented Oct 20, 2023

Uh oh!

dberenbaum commented Nov 7, 2023

Uh oh!

dberenbaum commented Dec 12, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants