CHANGELOG.md

Documentation of project progress

March 19th, 2020

  • forgot to update the log so I am doing this today
  • worked on writing the paper and uploaded to git

January 11th, 2020

  • created a visualization of the box plots of each gene per habitat organism collection
  • isolated organism classification namespaces into their own file
  • implemented the visualization of the box plots for organisms binned by bone classification

January 9th, 2020

  • found a way to showcase differences between grouped organisms
  • started implementing the visualizing function of the dnds/visualizer

January 8th, 2020

  • started building the distributor package grouped visualizer method
  • still have to figure out a good way to aggregate the data so that creating subplots from it is straightforward

January 7th, 2020

  • added more significant organisms based on feedback
  • changed the logic of gaps vs. illegal characters in dnds/visualizer
  • ran visualizer for all the significant organisms again
  • modified dnds/loader to account for inf values (possible due to divergence)
  • changed dnds/distributor to create subplots for all new orgs
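The loader change above deals with infinite dN/dS values. A minimal sketch of one way to separate them is shown here; the function name, the dictionary layout, and the choice to set non-finite values aside (rather than cap or replace them) are assumptions for illustration, not the actual dnds/loader code.

```python
import math


def split_finite(scores):
    """Separate finite scores from infinite or NaN ones.

    `scores` maps a (hypothetical) organism name to a float dN/dS score.
    Returns two dicts: the finite scores and the ones that were set aside.
    """
    finite, set_aside = {}, {}
    for organism, value in scores.items():
        if math.isfinite(value):
            finite[organism] = value
        else:
            set_aside[organism] = value
    return finite, set_aside


# Made-up values, only to show the behaviour.
example = {"org_a": 0.21, "org_b": float("inf"), "org_c": 1.05}
finite, set_aside = split_finite(example)
print(finite)
print(sorted(set_aside))
```

Keeping the excluded entries in a second dictionary makes it easy to report which organisms were left out of a plot.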

Dec 29th, 2019

  • implemented the rest of the dnds visualizer
  • fixed the dnds package (inconsistent naming of organisms between alignments and dnds was causing dnds scores to look odd)
  • implemented a distribution visualization for the dN/dS values of all the significant organisms

Dec 28th, 2019

  • implemented the similarity mapper (uses Jaccard's index for similarity)
  • implemented tree visualization for tree similarities
  • reimplemented the dnds package: the dnds library does not work properly, so the ape package in R (has 35 citations) will have to compute the scores instead. The dnds package becomes a parsing and visualizing package, with a pipe that calls the parsing script, calls R on the resulting files, then visualizes the dN/dS results computed by ape
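For reference, the Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|. The sketch below shows the computation on plain Python sets; what the similarity mapper actually compares is not stated in this log, so the inputs here are hypothetical, and returning 1.0 for two empty sets is a convention chosen for this example.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: treated as identical here (an assumption).
        return 1.0
    return len(a & b) / len(union)


# Hypothetical inputs: organisms found in two different gene trees.
tree_one = {"org_a", "org_b", "org_c"}
tree_two = {"org_b", "org_c", "org_d"}
print(jaccard_index(tree_one, tree_two))
```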

Dec 27th, 2019

  • added more taxonomic details to the significant organisms phylogenetic tree
  • performed a comparison against BLAST of organisms that are reported to not have a gene
  • implemented a parser for the JSON files of the BLAST comparison

Dec 26th, 2019

  • chose and added a subset of the organisms that represent the classes of bones/cartilage for the study
  • added the phylogenetic mapper for significant organisms

Dec 3, 2019

  • added first presentation

Nov 28, 2019

  • fixed an inaccurate piece of information regarding gene functions
  • adjusted gene function chart to have a tight bounding box
  • adjusted the gene frequencies chart - sorted organisms based on gene frequency

Nov 18, 2019

  • started building the dnds package

Nov 17, 2019

  • adjusted MSA visualization to start x-axis at 0
  • adjusted grouping of organisms based on taxonomic information

Nov 4, 2019

  • fixed the problem of false negative matches when building phylogeny trees for all the genes

Nov 3, 2019

  • added a phylogeny mapper for building the trees
  • reorganized some data files
  • wrote script for phylogenetic tree visualizations

Nov 2, 2019

  • implemented a taxon mapper for generating a plot of frequency of organisms per clade
  • implemented a gene function visualizer

Nov 1, 2019

  • implemented the MSA parser
  • moved some results files around (plots moved to src/data)

Oct 31, 2019

  • started building the MSAs parser
  • noticed the alignments had a problem caused by the ClustalW format (limited number of characters for the organism name). Performed MSAs again, with Kalign on EBI, and saved in FASTA format
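FASTA headers keep the full organism name on the `>` line, which avoids the name truncation noted above. A minimal sketch of reading such an alignment follows; it is not the project's MSA parser, and the function name and sample records are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {header: sequence}.

    Sequence lines that belong to one record are joined into one string.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: everything after '>' is kept as the name.
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


# Made-up two-record alignment with a gap character.
sample = ">organism_with_a_long_name\nATG-C\nGGA\n>other_organism\nATGAC\nGGA\n"
alignment = parse_fasta(sample)
print(alignment)
```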

Oct 24, 2019

  • made the taxa collector build organism information files (gene frequency and genes per organism)

Oct 13, 2019

  • used EBI's Kalign to perform multiple sequence alignments for all genes
  • added a test parser script for writing MSAs as single lines (will not use yet as they are challenging to interpret if written on single lines)
  • built the taxa collector and collected organisms' information

Oct 3, 2019

  • moved exceptions out of orthologs package
  • created the curator package to parse homology information

Oct 2, 2019

  • curated gene Ensembl IDs list
  • added orthologs package
  • implemented Collector for calls to Ensembl

Sep 21, 2019

  • took the gene list from Brian and stored it in genes.txt
  • used a pipe to clean up the list from the format:
gene1, gene2, ...

to

gene1
gene2

via

echo "gene1, gene2" | tr -s ', ' '\n'
  • looked into setting up a BLAST server and found this resource
  • the objective is to find how a given set of genes (G) relates to cartilage production in multiple organisms (O)
  • O will contain:
  1. Organisms that make bone
  2. Organisms that only produce cartilage
  3. Organisms that produce neither cartilage nor bone
  4. Organisms that used to have bone but now make cartilage (note: this is different from only producing cartilage)
  • see genes.txt for G
  • explored the Ensembl Species Tree to identify what organisms to use in the project (see organisms.txt)
  • found the REST documentation of Ensembl
  • since the objective at the start of the project is to find genomes of species that host the genes of interest, all the project has to do at the start is find the Ensembl IDs of the genes of interest and use GET homology/id/:id to get orthologs (the response includes organisms, which can then be grouped into bone/cartilage/neither/used to have bone)
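The homology lookup described in the last item can be sketched roughly as below. The base URL, the `content-type` query parameter, and the response layout (`data` → `homologies` → `target.species`, with a `type` string starting with `ortholog`) are assumptions about the Ensembl REST service and should be checked against its documentation; the function names and the sample payload are hypothetical, and this is not the project's Collector.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # assumed base URL


def homology_url(gene_id):
    """Build the URL for GET homology/id/:id, asking for JSON."""
    return f"{ENSEMBL_REST}/homology/id/{gene_id}?content-type=application/json"


def fetch_homology(gene_id):
    """Call the REST endpoint and return the decoded JSON (needs network)."""
    with urllib.request.urlopen(homology_url(gene_id)) as response:
        return json.load(response)


def species_of_orthologs(payload):
    """Collect the species names of all ortholog entries in a response."""
    species = set()
    for entry in payload.get("data", []):
        for homology in entry.get("homologies", []):
            if homology.get("type", "").startswith("ortholog"):
                species.add(homology["target"]["species"])
    return sorted(species)


# Hand-written payload in the assumed shape, so no network call is needed.
sample_payload = {
    "data": [
        {
            "id": "GENE_ID",
            "homologies": [
                {"type": "ortholog_one2one", "target": {"species": "mus_musculus"}},
                {"type": "ortholog_one2many", "target": {"species": "danio_rerio"}},
                {"type": "within_species_paralog", "target": {"species": "homo_sapiens"}},
            ],
        }
    ]
}
print(species_of_orthologs(sample_payload))
```

The species names returned this way are what would then be sorted into the bone / cartilage / neither / used-to-have-bone groups.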

Potential programs to write

  1. parse the list of genes to get the orthologs of each one, including species (using an Ensembl API);
  2. use the species information from 1 to query another API (the taxonomy API from NCBI?) to identify vertebrates, Chondrichthyes, etc., and group them accordingly;
  3. further refine 2 by sub-grouping to create subsets of organisms that are close, based on evolution trees, taxa, or something else;
  4. get the dN/dS for all pairs of organisms in the subsets (organism 1 vs. organism 2, 1 vs. 3, 1 vs. 4, etc.)
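The pairwise step in the last item amounts to enumerating every unordered pair inside each subset. A minimal sketch, assuming subsets are given as a dictionary from a group name to a list of organism names (both hypothetical here):

```python
from itertools import combinations


def organism_pairs(subsets):
    """Return every unordered pair of organisms within each subset.

    A subset with n organisms yields n * (n - 1) / 2 pairs.
    """
    return {
        group: list(combinations(sorted(members), 2))
        for group, members in subsets.items()
    }


# Made-up subset names and members.
subsets = {"group_a": ["org3", "org1", "org2"], "group_b": ["org4"]}
pairs = organism_pairs(subsets)
print(pairs["group_a"])
print(pairs["group_b"])
```

Each pair would then be handed to whatever computes dN/dS for two organisms; a subset with a single organism produces no pairs.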