cleaning_gEVAL_data.py by dp24

In-depth changes are documented in the Wiki section.

Usage instructions

./cleaning_gEVAL_data.py {FTP} {SAVE} {FTP_TYPE [ncbi|ens]} {cds, cdna, pep, rna} [-NAME] [-d, --debug] [-c, --clean] [-ep, --override_entryper] [-seq, --seqclean-override]

-NAME is required when using ncbi data due to their naming scheme.

Ensembl Example:

./cleaning_gEVAL_data.py ftp://ftp.ensembl.org/pub/release-99/fasta/mesocricetus_auratus/cdna/Mesocricetus_auratus.MesAur1.0.cdna.all.fa.gz ./ ens cdna

NCBI Example:

./cleaning_gEVAL_data.py https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/184/155/GCF_000184155.1_FraVesHawaii_1.0/GCF_000184155.1_FraVesHawaii_1.0_protein.faa.gz ./ ncbi pep -NAME Fragaria_Vesca
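The two examples show why -NAME matters: the Ensembl file name begins with the organism, while the NCBI file name begins with an accession. The sketch below illustrates that difference. It is not taken from the script; `organism_from_url` is a hypothetical helper, and splitting the Ensembl name on its first dot is an assumption based only on the example above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def organism_from_url(url, ftp_type, name=None):
    """Return an organism label for a download (hypothetical helper)."""
    filename = PurePosixPath(urlparse(url).path).name
    if ftp_type == "ens":
        # Assumption: Ensembl file names start with the organism, e.g.
        # Mesocricetus_auratus.MesAur1.0.cdna.all.fa.gz
        return filename.split(".", 1)[0]
    if ftp_type == "ncbi":
        # NCBI file names start with an accession (GCF_...), so the
        # organism cannot be read from the name and must be supplied.
        if not name:
            raise ValueError("-NAME is required for ncbi data")
        return name
    raise ValueError(f"unknown FTP type: {ftp_type}")
```

Called with the Ensembl address above and `"ens"`, this returns `Mesocricetus_auratus`; for the NCBI address the name has to come from -NAME.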

What does this script do?

1, The script takes an input FASTA file (cdna, cds, pep or rna) from Ensembl or NCBI.

2, The file is downloaded and unzipped.

3, If the data type is cdna, seqclean is called (this may need to be modified for data sets with a high N count).

4, The file is read, and the headers are split from the sequences and rewritten into an easy-to-read format.

5, The new headers and sequences are merged and counted. Once the count reaches a set number (which depends on the data type), a file is produced.

6, Finally, folders can be cleaned and debug logs can be read if needed.
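Steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names are hypothetical, and the header rule used here (keep only the first whitespace-delimited token) is an assumption standing in for whatever format the script really produces.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def simplify_header(header):
    # Assumption: the "easy read" header is just the first token.
    parts = header.split()
    return parts[0] if parts else ""


def split_fasta(lines, entries_per_file):
    """Yield the text of each output file, entries_per_file records at a time."""
    records = read_fasta(lines)
    while True:
        batch = list(islice(records, entries_per_file))
        if not batch:
            return
        yield "".join(f">{simplify_header(h)}\n{s}\n" for h, s in batch)
```

Each string yielded by `split_fasta` would then be written to its own numbered file in the save location.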

Positional Arguments:

ARGUMENT	EXPLANATION
FTP	The address of the gzipped FASTA file to download (an FTP address, or HTTPS as in the NCBI example)
SAVE	Save location for the downloaded files
{ens,ncbi}	The source of the FTP address: Ensembl (ens) or NCBI (ncbi)
{cds,cdna,pep,rna}	The type of data contained in the file

Optional Arguments:

ARGUMENT	EXPLANATION
-h, --help	Show this help message and exit
-NAME NAME, --organism_ncbi NAME	If using an NCBI FTP address, the organism's name must be provided because of how NCBI names its files
-v, --version	Show the program's version number and exit
-c, --clean	Cleans up all unnecessary files after use
-d, --debug	Enables debug prints and creates a log file documenting everything the script does
-ep, --override_entryper	Overrides the hard-coded number of entries per output file used to split each data type (defaults are cdna/5000, pep/200 and 3000 for everything else)
-seq, --seqclean-override	Overrides the seqclean function for cdna data. This is particularly useful for shotgun data and data with a high N count, which would otherwise break this module. BE AWARE: the resulting data will need significant cleaning!
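The interface described in the two tables could be declared with `argparse` along these lines. This is a sketch under assumptions, not the script's real parser: the `dest` names and the `entries_per_file` helper are hypothetical, and it assumes that -ep takes an integer and that -seq is a simple on/off flag, neither of which the tables state explicitly.

```python
import argparse

# Defaults quoted in the -ep description above.
DEFAULT_ENTRIES = {"cdna": 5000, "pep": 200}
FALLBACK_ENTRIES = 3000


def build_parser():
    parser = argparse.ArgumentParser(prog="cleaning_gEVAL_data.py")
    parser.add_argument("FTP")
    parser.add_argument("SAVE")
    parser.add_argument("FTP_TYPE", choices=["ens", "ncbi"])
    parser.add_argument("DATA_TYPE", choices=["cds", "cdna", "pep", "rna"])
    parser.add_argument("-NAME", "--organism_ncbi", dest="name")
    parser.add_argument("-c", "--clean", action="store_true")
    parser.add_argument("-d", "--debug", action="store_true")
    # Assumption: the override is a single integer entry count.
    parser.add_argument("-ep", "--override_entryper", type=int, dest="entryper")
    # Assumption: the seqclean override is a boolean switch.
    parser.add_argument("-seq", "--seqclean-override", action="store_true",
                        dest="skip_seqclean")
    return parser


def entries_per_file(args):
    """Pick the split size: the -ep override if given, else the default."""
    if args.entryper is not None:
        return args.entryper
    return DEFAULT_ENTRIES.get(args.DATA_TYPE, FALLBACK_ENTRIES)
```

With no -ep given, `entries_per_file` falls back to 5000 for cdna, 200 for pep and 3000 for cds or rna.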

Contacts

If you have any questions then contact:

[email protected]

or

[email protected]

Alternatively leave an issue on this repo.

Acknowledgements

Seqclean was not written by me; it was produced by the Dana-Farber Cancer Institute and is used in GRIT operations.
