Some issues running pipeline #14
Comments
Thanks, Jared! Following Snakemake best practice (see here), I would recommend partitioning the current global environment into several environments, one per rule. This would likely eliminate many of the dependency conflicts.
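To make the suggestion concrete, here is a minimal sketch of a per-rule environment. The `conda:` directive is real Snakemake (activated with `--use-conda`), but the input path, env file location, and script name below are hypothetical, not the pipeline's actual ones:

```python
# Snakefile fragment (sketch): the rule points at its own small env file,
# which snakemake builds and activates when run with --use-conda.
rule count_variants:
    input:
        "results/codon_variant_table.csv",  # hypothetical input path
    output:
        "results/variant_counts.csv",
    conda:
        "environment/count_variants.yml"    # small, rule-specific env
    script:
        "scripts/count_variants.py"         # hypothetical script name
```

Each rule-level yaml would then pin only what that rule needs, which is what keeps the solver from having to reconcile the whole pipeline's dependencies at once.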
This pipeline has some more issues to solve before I would consider it reproducible for publication. Digging into the issues outlined above, here are the things that need to be solved before I would feel comfortable with the results from this pipeline.

Missing Data
So this is a slightly more complex problem than I initially thought. The overarching problem here is that this repository does not include the files necessary to run the upstream rules (i.e. process_ccs and count_variants). Upon initial execution,
I'm not exactly sure where this behavior comes from, or how to fix it, especially given that this is quite an old version of the R environment. It's not the most reproducible setup when the environment needed for half the pipeline only exists as a module on a single institution's cluster. We need to be able to reproduce this environment in a portable manner.
Thanks, Jared! The README notes
so can we configure snakemake to raise an error if the data paths are not supplied (in a config file or on the command line)? Additionally, we could add a flag (call it "raw") to run the pipeline starting from the raw sequencing files, whereas without this flag, the pipeline takes

You mentioned the snakemake version is very old—should we upgrade to the latest? Re the R environment, should we specify a conda env with R (https://docs.anaconda.com/working-with-conda/packages/using-r-language/)?
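For the "raise an error if the data paths are not supplied" idea, a sketch of the kind of guard this could be, as plain Python that could sit at the top of the Snakefile right after `configfile:` is read. The config key names here are made up for illustration, not the pipeline's real ones:

```python
# Sketch: fail fast if required data paths are missing from the config.
# The key names ("ccs_fastq_dir", "barcode_runs") are hypothetical examples.
def check_required_config(config, required=("ccs_fastq_dir", "barcode_runs")):
    missing = [key for key in required if not config.get(key)]
    if missing:
        raise ValueError(
            f"config is missing required data path(s): {', '.join(missing)}; "
            "supply them in the config file or via --config on the command line"
        )
```

Calling `check_required_config(config)` at parse time would stop the run with a clear message instead of a MissingInputException deep in the DAG.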
Yes - I added this to the README on our topic branch.
Yeah, this seems reasonable to me. Happy to do so.
Sure - happy to do so! Hopefully the syntax and API are backwards compatible - but I think
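If we do upgrade, it may be worth pinning snakemake explicitly in the environment file so everyone resolves the same version. A fragment sketch (the version number below is a placeholder, not a tested pin):

```yaml
# environment.yml fragment (sketch): pin snakemake so the whole team
# resolves the same version.  "7.32" is a placeholder, not a tested pin.
dependencies:
  - snakemake-minimal=7.32
```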
I've already refactored the pipeline to use environments defined here. But currently I'm spending a lot of time debugging the R environment; e.g., the latest problem I've run into is:
Note that this specific problem arises in the old Kd fitting steps. I could go ahead and hack that out of the pipeline, but then I'd have to modify all the downstream R code. I will keep trying to make the R code work with conda. Once it does, I'll likely feel most comfortable if we containerized things like this.
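For the containerization idea, one possible shape is baking the resolved conda environment into an image; a minimal Dockerfile sketch (the base image and paths are illustrative, and this assumes environment.yml actually solves with conda):

```dockerfile
# Sketch: bake the pipeline's conda environment into an image.
FROM condaforge/mambaforge:latest
COPY environment.yml /tmp/environment.yml
RUN conda env create -f /tmp/environment.yml -p /opt/env \
    && conda clean --all --yes
# Put the environment's binaries (python, R, snakemake) on PATH.
ENV PATH=/opt/env/bin:$PATH
```

If I remember right, newer snakemake versions also ship a `--containerize` helper that generates a Dockerfile from the per-rule conda envs, which might be even less work once the per-rule envs are in place.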
I do think we want to remove
Yeah, that will require modification of the downstream R, obviously, but I agree. I think I've finally solved the environment stuff, so hopefully the rest goes fairly quickly.
Overview
First, great pipeline, I like the way the key files and pages are automatically generated.
As I'm running the pipeline to work in the changes described in improved_Kd_fitting, I've run into a few issues that may be worth either patching, or at least documenting for future users who may run into similar issues.
conda environment

I ran into issues that took me a while to figure out when attempting to create the environment defined in environment.yml. I'm a mamba user, and the given README command

conda env create -f environment.yml -p ./env

gave me package conflicts and hung unexpectedly. I was eventually able to get the environment built, but that required completely nuking my mamba install and re-installing miniconda. This is fine, but a problem arises because snakemake recommends using mamba ... so when executing the pipeline, the first thing one sees is:

The simple fix is obviously adding --conda-frontend conda to run_Hutch_cluster.sh, but a more robust fix would involve solving the dependency conflicts such that the environment can indeed be built directly with mamba, with more exactly pinned dependencies, and built as an effect of snakemake run rather than by the user. I'm happy to take a swing at that if we think it would be helpful.

Running on the cluster using sbatch
I'm a little confused how the cluster execution is happening here, but I have less experience farming jobs from snakemake (and it seems the approach has changed drastically across versions). Anyhow, it failed on the first attempt. The failed jobs wrote log information to "slurm-.out" files, which revealed the underlying cause of the failure above, and it is related to the same environment issue described above:

So it seems adding the --conda-frontend conda argument to the command in run_Hutch_cluster.sh would solve things ... but when I add that, I seem to get the same error? This could potentially be solved by choosing an earlier version of snakemake that defaults to the conda frontend instead of mamba, or by specifying in the README that the user should configure the snakemake inside the environment to use that frontend. I have not tried either of these yet, as everything seems to work just fine without batch submission for the time being.

missing input data
The README states that this repository contains all the necessary input data to run the pipeline, yet in the first rule run of the pipeline, count_variants.ipynb, I get the following error because the pipeline is searching for files I don't have access to ...

It seems this step need not be run, given that its output files are already stored in the results, but nonetheless this is what it's giving me. I don't think you need to store all the NGS data ... but maybe specify at which intermediate step the user is able to run the pipeline without needing to download from the SRA or the like?
SOLVED: it seems the pipeline (or possibly me, by mistake) somehow deleted the output files from count_variants.ipynb: variant_counts.csv and count_variants.md. Not sure how that happened, but when I reset those files, the pipeline does indeed skip this step and proceeds without error.
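Related to the above, a small sketch of how the pipeline could check explicitly whether the upstream rules can be skipped, so a user finds out up front rather than via a missing-input failure. Only variant_counts.csv and count_variants.md are real file names from this issue; the "results" directory layout is an assumption:

```python
import os

# Sketch: decide whether the upstream (raw-data) rules can be skipped
# because the intermediate outputs are already present.  The file names
# come from this issue; the "results" directory is an assumption.
def can_skip_upstream(results_dir="results"):
    intermediates = ["variant_counts.csv", "count_variants.md"]
    return all(
        os.path.exists(os.path.join(results_dir, name))
        for name in intermediates
    )
```

A check like this, paired with a clear error message, would also have caught the accidental deletion described above immediately.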