Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add inputs.txt to handle internal file dependencies #24

Open
lucamlouzada opened this issue Sep 26, 2024 · 7 comments
Open

Add inputs.txt to handle internal file dependencies #24

lucamlouzada opened this issue Sep 26, 2024 · 7 comments
Assignees

Comments

@lucamlouzada
Copy link
Collaborator

This issue is part of an effort to implement substantive improvements to the lab template, as discussed in #16.

In this issue, the goal is to work on the way the template handles input files. The approach to be adopted per the decision in plans for next steps is:

  • Each module should have a input.txt file in which users should list the files that are required as inputs
  • make.sh will read through these files and add symlinks to the input folder

I am assigning myself to work on this. I will consider whether the creation of the symlinks should be done in make.sh itself or in an additional script stored in lib/shell. I will also investigate whether there is a way to scan over all scripts and automate the identification of paths.

@lucamlouzada lucamlouzada self-assigned this Sep 26, 2024
lucamlouzada added a commit that referenced this issue Sep 26, 2024
@lucamlouzada
Copy link
Collaborator Author

lucamlouzada commented Sep 26, 2024

I have created the inputs.txt files in 276cb0c. There should also be changes in make.sh. To prevent conflicts with the changes in other branches, I will list the changes here, but will wait to push them until #18 and #20 have been closed.

The changes in make.sh should look as follows:

- # Copy and/or symlink input files to local /input/ directory
- # (Make sure this section is updated to pull in all needed input files!)
+ # Add symlink input files to local /input/ directory
+ # (Make sure inputs.txt is updated to pull in all needed input files!)
rm -rf "${MAKE_SCRIPT_DIR}/input"
mkdir -p "${MAKE_SCRIPT_DIR}/input"
- # cp my_source_files "${MAKE_SCRIPT_DIR}/input/"
+ if [[ -f "inputs.txt" ]]; then # check if inputs.txt exists
+ links_created=false
+ while IFS= read -r file_path; do
+ if [[ -n "$file_path" && "$file_path" != \#* ]]; then  # skip empty or commented out lines 
+       if [[ -f "$file_path" ]]; then  # check if the file_path is valid
+       file_name=$(basename "$file_path")
+       abs_path=$(realpath "$file_path") # get absolute path
+       ln -sf "$abs_path" "${MAKE_SCRIPT_DIR}/input/$file_name" # create symlink in the input folder
+       links_created=true
+       else
+         echo "Error: $file_path does not exist or is not a valid file path." >&2
+       false # trigger error handler
+       fi
+       fi
+       done < "inputs.txt"
+       if [[ "$links_created" == true ]]; then
+       echo -e "\nAll input links were created!"
+       else
+         echo -e "\n\033[0;34mNote:\033[0m There were no input links to create."
+       fi
+       else
+         echo -e "\nError: No inputs.txt file found in the module." | tee -a "${LOGFILE}"
+       false # trigger error handler
+       fi

Note that this change will also require updating the template documentation and the instructions on how to run the example scripts.

@lucamlouzada
Copy link
Collaborator Author

I have also done some research regarding whether it would be possible to implement automated scanning of file dependencies. One alternative is using a DAG structure with software like SCons, Snakemake, or Nextflow, but this seems like a more complex step. We could also write ad hoc scripts to do it manually, but it would require some creative combination of regex and searching for functions such as "load" or "read_csv", which does not seem very practical and would increase the complexity of the template without large returns. My sense is that we can stick with manually adding input files for now, either with the inputs.txt approach introduced in this issue, or with the previous approach in which users manually added the files in make.sh with cp my_source_files "${MAKE_SCRIPT_DIR}/input/".

@gentzkow
Copy link
Owner

@lucamlouzada Thanks for the great work here.

If we were to go this route, we'd want to abstract the code for scanning inputs.txt into a helper function rather than having it directly in make.sh. It's too clunky, plus it would be a lot of redundancy to have this repeated across directories.

More fundamentally, I'm a little worried looking at it that writing our own parser like this is going to create a potential source of errors/complexity down the line. E.g., cases where people get the syntax in inputs.txt wrong.

Here are two possible alternatives.

(1) link_inputs.sh: Instead of inputs.txt, we have a shell file that contains the bash simlink commands for the various input directories directly.

(2) input_paths.do, input_paths.r, etc.: An even simpler solution would be to have separate scripts by language that define paths that can then be called as inputs by the respective scripts. E.g., input_paths.r might look something like

raw_data <- "../../0_raw/"
clean_data <- "../../1_data/output"

These could be stored in /source/.

@lucamlouzada
Copy link
Collaborator Author

Thanks @gentzkow. Agreed on making this a separate helper script if we go down this route.

More fundamentally, I'm a little worried looking at it that writing our own parser like this is going to create a potential source of errors/complexity down the line.

What kind of other errors/complexity did you have in mind? The current approach implements error handling for the paths in the shell script, so that errors in the txt files will be shown to the users as warning messages. But these alternatives are both good options, my sense is it depends a bit on what our priorities will be.

(1) link_inputs.sh: Instead of inputs.txt, we have a shell file that contains the bash simlink commands for the various input directories directly.

By this, do you mean users would edit the shell file instead of the txt file? We could have a section in the shell script for users to edit (similarly to what is done in the current template version with the cp source_files lines) and handle errors within the same script, removing the need for a separate txt file. Is this what you had in mind? I think it would still be possible to have issues if users get the syntax wrong - and my sense is that users would be more used to editing txt files than shell files.

(2) input_paths.do, input_paths.r, etc.: An even simpler solution would be to have separate scripts by language that define paths that can then be called as inputs by the respective scripts.

This is indeed simpler and could be easy to implement. The downside is having users edit more than one file if inputs for different languages are not concentrated in the same script.

@gentzkow
Copy link
Owner

gentzkow commented Oct 5, 2024

What kind of other errors/complexity did you have in mind?

Just that it's another failure point. I know you've implemented some good error handling, but it's always possible that there's a gap in that or that a user does something unexpected.

This is indeed simpler and could be easy to implement. The downside is having users edit more than one file if inputs for different languages are not concentrated in the same script.

I'm leaning toward this approach. It would remove one big layer of complexity. I think the case where a given module has scripts in multiple languages is relatively rare, and in those cases it's not obviously a bad thing to make clear which inputs are being used by the different categories of scripts. Let's proceed w/ this approach.

@lucamlouzada
Copy link
Collaborator Author

Hi @gentzkow, looking back into this now.

input_paths.do, input_paths.r, etc.: An even simpler solution would be to have separate scripts by language that define paths that can then be called as inputs by the respective scripts.

I realized that this approach would require users to add code on top of their scripts to source the paths from the input_paths scripts. For example, users would have to add source("source/input_paths.r") on top of their wrangle_data.r script. I think this is something that we discussed when we were debating the alternative for make_externals.sh and we concluded we did not want users to always have to add something at the beginning of their scripts.

Two potential solutions (let me know if you think of others; I can also keep looking for alternatives):

  1. We have the run_xxx.sh scripts automatically source the input_paths.xx files for each language before running, if they exist.
  2. We have the input_paths.xx scripts create symlinks or copy the files instead of declaring paths as global variables.

The problem with (1) is that it makes the run_xxx scripts a little more complex (and could still lead to issues if users get the syntax wrong). The problem with (2) is that it is basically the same as the previous approach using inputs.txt, only with extra files.

I think (1) is preferable, but still don't know whether it is better thaninputs.txt- it is essentially very similar, with the difference that the complexity is in run_xx instead of in make.sh and users edit different scripts in each language instead of a single inputs.txt file.

@gentzkow
Copy link
Owner

gentzkow commented Oct 9, 2024

Thanks @lucamlouzada. I'm actually not strongly opposed to having those source commands at the top of the scripts. That's actually what I was imagining.

One downside of this approach is that there's no way to tell from within a given r script what global variables are being created. So someone could come upon a command like load(file.path(myfolder, "mydata.rdat")) and not be able to figure out where the myfolder global variable came from. That seems like a fairly small cost, though.

This approach still seems more robust to me than either (1) or (2), since it preserves the property that someone can clone the repo and then run the individual r etc. scripts out of the box.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

When branches are created from issues, their pull requests are automatically linked.

2 participants