This repository has been archived by the owner on Sep 13, 2020. It is now read-only.
Machine learning and bioinformatic workflows for gene discovery.

neocruiser/pipelines

Translational Research: Pipelines in Data Mining and Systems Biology

This repository contains RNA-seq and microarray pipelines for transcriptome assembly, gene expression, gene annotation, genomic mutation calling, and other analyses. Scripts dated between 2015 and 2017 run on XSEDE supercomputers and the LIred cluster in the US (Greenfield, Bridges, and LIred servers). All others run on the H4H cluster at UHN in Toronto, Canada.

Licence: MIT Licence

Algorithms include:

  • Feature engineering & regularization (lasso, ridge, elastic net)
  • Data subsetting, extraction, reformatting & report designs
  • Subsampling, mini-batch sampling & bagging
  • Data splitting (binomial and multiclass)
  • Unsupervised learning (fuzzy, hierarchical clustering)
  • Grid search for normalization & standardization methods
  • Bayesian inferential models
  • Similarity & adjacency matrices
  • Multi-iterative module allocations for gene expressions
  • Weighted genetic networks
  • Supervised learning and grid hyperparameter tuning
  • Bootstrapping and model alpha adjustments
  • Logging & performance metrics (ROC, AUROC, 95% CI, kappa)
  • Various descriptive and performance plotting
  • Nested cross-validation & iterative resampling structures
  • Multi-class area under the ROC curve
  • Feature importance scoring
  • Confusion matrices & multi-prediction validation
  • Redundancy and descriptive analyses
  • Machine learning optimizations
  • Random seeding optimizations
  • Over 20 machine learning models
  • Deep learning (Tensor, Torch, Mxnet, H2O, Keras)
  • Reinforcement learning (batch, temporal abstraction, deep RL)
  • QC for most next generation sequencing platforms
  • Abundance analysis (FPKM, RPKM) multi-tool cross comparison
  • Genome & targeted exome sequencing
  • Variant calling (SNVs, CNVs, Indels)
  • Germline filtering analysis
  • Gene/Variant annotations (model, non-model species) multi-DB cross matches
  • Differential gene expression (microarrays, RNA-seq) in R/Python
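
As an illustration of the abundance metrics listed above, here is a minimal Python sketch of the standard FPKM calculation (fragments per kilobase of transcript per million mapped fragments). It is not taken from the repository's scripts: the function name `fpkm`, the dictionary inputs, and the example gene names and counts are all hypothetical. RPKM uses the same formula with read counts in place of fragment counts.

```python
def fpkm(fragment_counts, gene_lengths_bp):
    """Return FPKM per gene (hypothetical helper, not from the repository).

    fragment_counts: dict of gene -> mapped fragment count
    gene_lengths_bp: dict of gene -> gene length in base pairs
    """
    total_fragments = sum(fragment_counts.values())
    if total_fragments == 0:
        raise ValueError("no mapped fragments")
    # FPKM = count * 1e9 / (gene length in bp * total mapped fragments)
    return {
        gene: count * 1e9 / (gene_lengths_bp[gene] * total_fragments)
        for gene, count in fragment_counts.items()
    }


# Made-up example values, for illustration only.
example_counts = {"geneA": 500, "geneB": 1500}
example_lengths = {"geneA": 1000, "geneB": 3000}
example_fpkm = fpkm(example_counts, example_lengths)
```

In this example geneB has three times the fragments of geneA but is also three times as long, so both genes receive the same FPKM. That length normalization is what makes values comparable across genes.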

Systemic Analytical Pipelines

Scripts can be scaled and adjusted for any sample size. Most are reproducible on other data sets.

  1. hot Machine & Deep learning multi-model analyses for cancer prediction in R (Bassim 2018)
  2. hot Multi-grid search approach with data-driven gene networks in R (Bassim 2018)
  3. Gene ranking in R & multi-output wrangling in bash from transcriptional data (Bassim 2018)
  4. Polygenic & probability distributions reported by batch visualization in R (Bassim 2018)
  5. Microarray sample preprocessing & automation in bash (Bassim 2018)
  6. Microarray multi-contrast & batch logging in R (Bassim 2018)
  7. Aggregated performance metrics for transcriptional analyses in bash (Bassim 2018)
  8. Data restructuring & mining for network automation in bash (Bassim 2018)
  9. Clustering & gene expression with bootstrapped approach & metric aggregation in R (Bassim 2018)
  10. hot Targeted exome calling for variants in bash (Bassim 2018)
  11. Variant clustering for classification of tumor clones in bash and python (Bassim 2018)
  12. Reinforcement, Generative CNN & Deep learning for data embedding in python (Bassim 2017)
  13. Two species reads separation from dual RNA-seq of host & parasite in bash (Bassim 2017)
  14. Sequence abundance and expression from dual RNA-seq in R (Bassim 2017)
  15. hot Aggregating & optimizing data structure for pathway discovery in bash (Bassim 2016)
  16. Reducing transcriptome size by correcting for read abundance in bash (Bassim 2017)
  17. Transcriptome assembly of RNA-seq data optimized on high performance clusters in bash (Bassim 2016)
  18. Genome-guided assembly of RNA-seq data for new gene discovery in bash (Bassim 2016)
  19. hot RNA-seq gene expression multi-approach & metric aggregation in bash and R (Bassim 2016)
  20. Shotgun sequencing pipeline & virus classification or microbes identification in bash (Bassim 2016)
  21. Fast sequence annotation against NCBI databases with Blast & Diamond in bash (Bassim 2016)
  22. Protein sequence annotation & data aggregation with multi-database mining in bash (Bassim 2016)
  23. RNA-seq gene annotation & sequence discovery with multi-database mining in bash (Bassim 2016)
  24. Transcriptional cross-talk & gene networks in bash (Bassim 2016)
  25. Decorrelating network scores from dual RNA-seq data in R (Bassim 2016)
  26. Gene annotation with Hidden Markov models using HMM profiles in bash (Bassim 2015)
  27. Genomic filtering & variant calling with indel correction in bash (Bassim 2015)
  28. Genomic bam processing & variant calling QCs in bash (Bassim 2015)
  29. Regularization & Ensemble learning on top of neural nets, GLMs & bagging in R (Bassim 2014)
  30. Bayesian network inference from microarrays & timeseries data in R (Bassim 2013)
  31. Fuzzy clustering on Agilent microarray timeseries data for biomarker discovery in R (Bassim 2013)
  32. Ordination analyses of timeseries microarray data in R (Bassim 2013)
  33. Agilent timeseries data processing and pattern extraction in R (Bassim 2012)