Allow running of metadata subworkflow on multiple specimen IDs #114

Closed
BethYates opened this issue Jun 13, 2024 · 2 comments · Fixed by #132

Comments

@BethYates
Collaborator

BethYates commented Jun 13, 2024

Description of feature

A genome note provides metadata for the specimen used to produce the genome assembly, the specimen used to generate HiC data and the specimen used to produce RNA-Seq data. These may all be different specimens. The genome note pipeline should be able to take in each of these IDs and run the metadata subworkflow on each, recording the relevant data for use in the publication.

@BethYates BethYates added the enhancement Improvement of the existing features label Jun 13, 2024
@BethYates
Collaborator Author

The genome_metadata subworkflow will be introduced in version 2.0 of the genome note pipeline and is currently only present on the public_dev branch of the repository. To work on this issue you will need to create a feature branch from the public_dev branch rather than the dev branch. Pushing development for the 2.0 release to the public_dev branch allows us to keep the dev branch clean in case we need to push some bug fixes from there to the main release branch.

@reichan1998 reichan1998 self-assigned this Jul 3, 2024
@BethYates
Collaborator Author

To close this issue:

  1. Rename the biosample parameter to biosample_wgs and add two additional parameters, biosample_hic and biosample_rna, to nextflow.config. The values of these should be set to null.
  2. Update test.config and test_full.config to contain values for the parameters that you have added or changed. For the test profile, biosample_hic="SAMEA7520846" and biosample_rna="SAMEA7521081"; for the test_full profile, biosample_hic="SAMEA7519968" and biosample_rna=null.
  3. Modify genome_metadata.nf so that, for each of the biosample parameters, all of the files in ch_file_list that contain a "BIOSAMPLE_ACCESSION" are added to the file_list channel. In some cases (as in the test_full profile) biosample_rna will be null and should be ignored; the code needs to handle this.
  4. Modify the metadata in genome_metadata.nf to include a biosample_type. The value should be "WGS", "HIC" or "RNA", or "" if the file is not related to a biosample.
  5. Modify run_wget.nf to include the biosample_type in the output file name where the biosample_type is not an empty string.
  6. Modify parse_metadata.nf to include the biosample_type in the output file name where the biosample_type is not an empty string.
  7. Modify parse_xml_ena_biosample.py to extract the biosample_type from the output file name passed to the script. For the HiC and RNA-Seq biosample accessions, use this biosample_type to prefix the parameter names written to the output file (e.g. for biosample_hic, IDENTIFIER would become HIC_IDENTIFIER; for biosample_rna, SPECIMEN_ID would become RNA_SPECIMEN_ID).
  8. Update docs/usage.md and nextflow_schema.json to include the new/renamed parameters.
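
As a rough illustration of the null handling in step 3 and the prefixing in step 7, here is a minimal Python sketch. It is not the pipeline's actual code: the helper names (`biosample_type_from_filename`, `prefix_param_name`, `active_biosamples`) are hypothetical, and it assumes the biosample_type appears in the output file name as its own token separated by `_` or `.`, which the issue does not specify.

```python
import re

# The three biosample types named in the issue; "" means "not a biosample file".
BIOSAMPLE_TYPES = ("WGS", "HIC", "RNA")


def biosample_type_from_filename(file_name: str) -> str:
    """Return "WGS", "HIC" or "RNA" if the name carries that token, else "".

    Assumption: the type is a separate token delimited by "_" or "."
    (for example "SAMEA7520846_HIC_ena_biosample.csv").
    """
    base_name = file_name.rsplit("/", 1)[-1]  # drop any directory part
    for token in re.split(r"[_.]", base_name):
        if token.upper() in BIOSAMPLE_TYPES:
            return token.upper()
    return ""


def prefix_param_name(name: str, biosample_type: str) -> str:
    """Prefix HIC and RNA parameter names; leave WGS and untyped names as-is."""
    if biosample_type in ("HIC", "RNA"):
        return f"{biosample_type}_{name}"
    return name


def active_biosamples(wgs, hic, rna):
    """Pair each accession with its type, skipping null (None or empty) values."""
    candidates = [("WGS", wgs), ("HIC", hic), ("RNA", rna)]
    return [(btype, acc) for btype, acc in candidates if acc]


if __name__ == "__main__":
    # The file name below is an illustrative assumption, not a real pipeline output.
    btype = biosample_type_from_filename("SAMEA7520846_HIC_ena_biosample.csv")
    print(prefix_param_name("IDENTIFIER", btype))
    # test_full-style input: biosample_rna is null and is skipped.
    print(active_biosamples("WGS_ACCESSION_PLACEHOLDER", "SAMEA7519968", None))
```

Keeping the WGS names unprefixed follows the wording of step 7, which only asks for prefixes on the HiC and RNA-Seq accessions.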
