Skip to content

Conversation

@oadams
Copy link

@oadams oadams commented Oct 20, 2023

This pull request is a rough sketch of a feature to help address #331 and is not intended to be merged as is. I have not written tests, this code breaks existing tests, and there are issues with the current implementation, but it serves as a proof of concept that does actually work at least on a toy example. I'm making this pull request to gather feedback from the DVC team about whether this feature would be taken seriously and discuss its impacts and how it should best be designed and implemented before I go to the effort of refining it.

There is a long discussion about supporting incremental changes to directories in #331, which got closed with discussion moved here. I recently made a post in the discussion thread and this PR picks up from there.

Motivation

A common scenario machine practitioners find themselves in is working on a project with incrementally changing datasets. Datasets change and grow as the result of annotation and data gathering efforts to improve model performance. However, typical use of DVC pipelines results in incremental changes to datasets requiring the entire dataset to be re-processed even when processing individual examples can occur independently of one another. For long-running or costly preprocessing jobs this can make the workflow slow enough that it's just not feasible to use DVC pipelines, forcing the team to use ad-hoc and error-prone ways of processing the data incrementally.

This PR proposes an approach to support incremental processing of data in directories by processing only those files that need to be processed.

Consider this typical way to express preprocessing stages in the dvc.yaml:

stages:
  process:
      cmd: python process.py data/raw/ data/processed/
      deps:
        - data/raw/
      outs:
        - data/processed/

with a script (in this case, process.py) that operates at the directory level, iterating over files and producing the output directory. The problem with this approach is that it's impossible for DVC to reason about the internals of the script and realize that processing happens independently to each example in the source directory.

An alternative approach is to make process.py happen at the level of an individual file and express the stage in the dvc.yaml file as:

stages:
  process:
    foreach:
       - file_a_stem
       - file_b_stem
       - file_c_stem
    do:
      cmd: python process.py data/raw/${item}.txt data/processed/${item}.txt
      deps:
        - data/raw/${item}.txt
      outs:
        - data/processed/${item}.txt

Using this approach, the dataset can grow and previously processed files are not processed again. The main issue is that this requires keeping a list of files in the dvc.yaml, which makes it unreadable in any situation where the list isn't trivially small. Solving the readability issue, you can put that list in a params.yaml file and automatically generate it with another script that scans the directory. But either way you're still left with a need to ensure that one way or another the file list is updated before you run dvc repro.

Proposal

The proposal is for foreach to support directories, with DVC iterating over the files in the directories. In this PR, you can do something like this in your dvc.yaml:

stages:
  increment:
    dep_files: data/raw
    foreach: ${file}
    do:
      cmd: python process.py data/raw/${item.stem}.txt data/processed/${item.stem}.txt
      deps:
        - data/raw/${item.stem}.txt
      outs:
        - data/processed/${item.stem}.txt

where dep_files specifies a directory that DVC uses to grab a list of files and fill in the variable foreach points to. The files are expressed as objects with a .name and .stem attributes. E.g. if filename is 'some_file.txt' then stem is 'some_file'.

This is not the most elegant way of expressing it, and it is error prone in several ways (one example: if several stages use the ${file} variable as a target then they will interfere with one another). But it was a quick hack that illustrates the core of the concept. Better might be something like:

stages:
  increment:
    foreach: data/raw
    do:
      cmd: python process.py data/raw/${item.stem}.txt data/processed/${item.stem}.txt
      deps:
        - data/raw/${item.stem}.txt
      outs:
        - data/processed/${item.stem}.txt

to get foreach to directly iterate over data/raw and not expose the user to a confusing intermediate variable. Or perhaps something else entirely. There are other ways it could be made more concise. But regardless of the specifics of the syntax, for many workflows it would be valuable to be able to specify directories for which processing should happen incrementally and independently to files in those directories.

@dberenbaum
Copy link
Contributor

Thanks for the PR @oadams! I think the problem statement makes a lot of sense. Your last suggestion doesn't look too bad, but it feels like it would introduce lot of new logic: foreach would have to have some logic for when to iterate over a directory, plus it would require string manipulation of the file paths to get the stem. I also worry about whether using foreach-like syntax with such a large number of files and stages will make dvc.lock cumbersome.

I agree with you that #9431 is not always ideal since it means DVC logic leaks into your code, but it also has more general use outside of this scenario, and for incremental processing it gives you more flexibility to manage things like parallelizing the jobs without depending on DVC which doesn't yet have support for that.

I think it's worth exploring both directions more but would like to consider if there are any ideas for how to simplify it.

@dberenbaum
Copy link
Contributor

Closing as we are unlikely to merge this as is, but let's keep discussion going in the linked issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants