Make DataConverter accept a list of files to convert instead of only a directory #699

Open
RasmusOrsoe opened this issue Apr 25, 2024 · 1 comment
Labels
feature (New feature or request), good first issue (Good for newcomers)

Comments

@RasmusOrsoe
Collaborator

Is your feature request related to a problem? Please describe.
DataConverter accepts a single argument, input_dir: Union[str, List[str]], which points to one or more directories. These directories are searched using the GraphNeTFileReader.find_files() method to create a list of file paths for conversion.

This design suits a workflow where the data files of interest are copied to a separate directory and all of them are intended for conversion.

There are examples of use cases where converting all files in a directory is unwanted behavior.

Describe the solution you'd like
Make DataConverter accept a list of user-generated file paths for conversion, instead of assuming all files in the input_dir: Union[str, List[str]] should be converted.

We rename input_dir: Union[str, List[str]] to input: Union[str, List[str]], a list of files and/or directories to convert,

and then slightly adjust the DataConverter from

@final
    def __call__(self, input_dir: Union[str, List[str]]) -> None:
        """Extract data from files in `input_dir` and save to disk.

        Args:
            input_dir: A directory that contains the input files.
                        The directory will be searched recursively for files
                        matching the file extension.
        """
        # Get the file reader to produce a list of input files
        # in the directory
        input_files = self._file_reader.find_files(path=input_dir)
        self._launch_jobs(input_files=input_files)

to

from os.path import isdir, isfile

@final
    def __call__(self, input: Union[str, List[str]]) -> None:
        """Extract data from files in `input` and save to disk.

        Args:
            input: A file path or directory, or a list of these, selected
                for conversion. Directories are searched recursively, and
                all files found in them will be converted.
        """
        # A bare string is iterable too, so wrap it in a list first;
        # otherwise the comprehensions below would walk it character by character.
        if isinstance(input, str):
            input = [input]

        # Split the input into explicit files and directories to search
        input_files = [path for path in input if isfile(path)]
        directories_to_search = [path for path in input if isdir(path)]

        # Get the file reader to produce a list of input files
        # in the directories
        files_from_directories = self._file_reader.find_files(path=directories_to_search)
        input_files.extend(files_from_directories)
        self._launch_jobs(input_files=input_files)
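
One edge case the proposed code leaves open: if a file is passed explicitly and also lives in one of the passed directories, it would end up in input_files twice. Below is a minimal sketch of an order-preserving de-duplication step that could run before the jobs are launched. The helper name deduplicate_paths is hypothetical and not part of GraphNeT, and comparing paths by their normalized absolute form is an assumption about what should count as "the same file".

```python
import os
from typing import Dict, List


def deduplicate_paths(paths: List[str]) -> List[str]:
    """Drop repeated paths while keeping the first-seen order.

    Hypothetical helper for illustration only. Two entries are treated
    as the same file if their normalized absolute paths are equal.
    """
    seen: Dict[str, str] = {}
    for path in paths:
        key = os.path.normpath(os.path.abspath(path))
        # Keep the spelling the user passed first; ignore later repeats.
        seen.setdefault(key, path)
    return list(seen.values())
```

Symbolic links are not resolved here; os.path.realpath could be swapped in for os.path.abspath if that matters for the data layout.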

Additional context
Multiple people have asked for this feature.

RasmusOrsoe added the feature and good first issue labels on Apr 25, 2024
@kayla-leonard
Collaborator

For the cases where input is a file instead of a directory, it would be nice to allow wildcards and/or regex in the file path. Something like glob.glob() could handle the wildcard case.
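
A minimal sketch of what the wildcard handling could look like, assuming patterns are expanded before the file/directory split. The helper name expand_inputs is hypothetical and not part of GraphNeT, and raising on a pattern with no matches is one possible design choice, not something the issue specifies. Regex is not covered; glob only understands shell-style wildcards.

```python
import glob
from typing import List, Union


def expand_inputs(inputs: Union[str, List[str]]) -> List[str]:
    """Expand shell-style wildcards in `inputs` into concrete paths.

    Hypothetical helper for illustration only. Plain paths that exist
    are passed through unchanged, since glob.glob() returns them as-is.
    """
    if isinstance(inputs, str):
        inputs = [inputs]

    expanded: List[str] = []
    for entry in inputs:
        # recursive=True lets "**" match nested directories.
        matches = sorted(glob.glob(entry, recursive=True))
        if not matches:
            # Assumption: fail loudly rather than silently skip an entry.
            raise FileNotFoundError(f"No files or directories match: {entry}")
        expanded.extend(matches)
    return expanded
```

The returned list can contain both files and directories, so it could be fed straight into the isfile/isdir split in the proposed __call__.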
