GeneMark-EP and -EP+: automatic eukaryotic gene prediction supported by spliced aligned proteins

gatech-genemark/GeneMark-EP-plus

GeneMark-EP+

Reference

GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins
Tomas Bruna, Alexandre Lomsadze, Mark Borodovsky
NAR Genomics and Bioinformatics, 2020 Jun (NAR GB, PubMed)
Georgia Institute of Technology, Atlanta, Georgia, USA

Overview

GeneMark-EP+ is a semi-supervised eukaryotic gene prediction tool which utilizes protein hints to improve unsupervised parameter estimation and predictions.

Protein hints are generated by ProtHint, a fast protein mapping pipeline that predicts and scores introns, start codons, and stop codons in the genome of interest from any number of proteins of unknown evolutionary distance (we recommend using all eukaryotic proteins in OrthoDB).

Due to its semi-supervised nature and ability to incorporate proteins of any evolutionary distance, GeneMark-EP+ is an optimal tool to predict genes in a novel genome without the need for a curated training set or a set of closely related proteins.

Downloads

Tools

Experiments

Usage example

First, run ProtHint to generate protein hints (see the ProtHint repository for installation and usage details):

prothint.py genome.masked.fasta proteins.fasta --workdir ProtHintDir
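Before moving on, it can help to check what ProtHint produced. The sketch below counts hints per feature type in a GFF file, relying only on the standard GFF layout (nine tab-separated columns, with the feature type in column 3). The function name `count_hint_types` is hypothetical, and the exact feature labels ProtHint writes are not assumed here.

```shell
# Hypothetical helper: tally GFF records by feature type (column 3).
# Comment lines (starting with '#') and malformed lines are skipped.
count_hint_types() {
    awk -F '\t' '!/^#/ && NF >= 9 { n[$3]++ }
                 END { for (t in n) print t "\t" n[t] }' "$1" | sort
}
```

For example, `count_hint_types ProtHintDir/prothint.gff` prints one line per feature type followed by its count.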

Then run GeneMark-EP+ with the hints generated by ProtHint:

gmes_petap.pl --EP ProtHintDir/prothint.gff --evidence ProtHintDir/evidence.gff --seq genome.masked.fasta --soft_mask 1000 --verbose
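The two steps above can be chained in a small wrapper. This is a minimal sketch, not part of the GeneMark-EP+ distribution: the function names `run_step` and `run_pipeline` and the `DRY_RUN` variable are hypothetical, while the commands and options are copied from the usage example. It assumes `prothint.py` and `gmes_petap.pl` are on `PATH`.

```shell
#!/usr/bin/env bash
# Sketch of a two-step wrapper; function and variable names are hypothetical.
set -euo pipefail

# Print the command when DRY_RUN=1, otherwise execute it.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: run_pipeline <masked genome FASTA> <proteins FASTA> <ProtHint workdir>
run_pipeline() {
    local genome="$1" proteins="$2" workdir="$3" f

    # Step 1: map proteins to the genome with ProtHint.
    run_step prothint.py "$genome" "$proteins" --workdir "$workdir"

    # Outside dry-run mode, stop early if ProtHint did not write both hint files.
    if [ "${DRY_RUN:-0}" != "1" ]; then
        for f in "$workdir/prothint.gff" "$workdir/evidence.gff"; do
            [ -e "$f" ] || { echo "missing: $f" >&2; return 1; }
        done
    fi

    # Step 2: run GeneMark-EP+ with the hints from step 1.
    run_step gmes_petap.pl --EP "$workdir/prothint.gff" \
        --evidence "$workdir/evidence.gff" \
        --seq "$genome" --soft_mask 1000 --verbose
}
```

Calling `DRY_RUN=1 run_pipeline genome.masked.fasta proteins.fasta ProtHintDir` prints the two commands without running them. Dropping `DRY_RUN=1` executes them in order and aborts if the first step fails.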

Runtime

Runtime of GeneMark-EP+ is linear with respect to genome size.

ProtHint runtime is linear with respect to both genome size (GeneMark-ES is executed to generate initial genome seeds) and to the number of genes in a genome.

Examples

Runs were executed on an 8-CPU machine with 8 GB of RAM. Genomes were masked for repeats with RepeatModeler and RepeatMasker. Proteins from species within the same taxonomic genus were excluded in these experiments.

Drosophila melanogaster (134 Mb and ~14,000 genes) with OrthoDB Arthropoda target proteins:

  • ProtHint: 3h 15m (1h 45m GeneMark-ES, 1h 30m protein mapping)
  • GeneMark-EP+: 1h 30m
  • Total: ~5 hours

Solanum lycopersicum (807 Mb and ~35,000 genes) with OrthoDB Plantae target proteins:

  • ProtHint: 9h (5h GeneMark-ES, 4h protein mapping)
  • GeneMark-EP+: 4h
  • Total: ~13 hours
