Team members: Nico Chaves, Noam Weinberger and Junjie (Jason) Zhu
We began this project in Spring 2016 as a course project for Stanford CS 341 (Project in Mining Massive Datasets).
/data: metadata for the processed data and example datasets; the full datasets are stored on the server

/preprocessing: Python scripts used to process the expression data downloaded from GTExPortal

/ipython_notebook: IPython notebooks that display the main results of this project interactively
git pull
git add --all
git commit -m "MESSAGE"
git push
We downloaded the transcript RPKM file (GTEx_Analysis_v6_RNA-seq_Flux1.6_transcript_rpkm.txt.gz) and the meta-information file (GTEx_Data_V6_Annotations_SampleAttributesDS.txt) from GTExPortal. The former contains the expression values; the latter contains the donor ID and tissue type of each sample. We then filtered the transcripts as follows:
- Keep only transcripts that map to genes in the GO database (list downloaded from Ensembl BioMart)
- Of those, keep the 10,000 transcripts with the highest variance across all samples in this dataset
| | Downloaded Transcript RPKM | After GO Term Filter | After Variance Filter |
|---|---|---|---|
| Number of Variables | 195,747 | 67,344 | 10,000 |
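
The two filtering steps above can be sketched in Python with pandas. This is a minimal illustration, not the code in /preprocessing: the function name `filter_transcripts`, the argument names, and the assumed layout (one row per transcript, one column per sample) are hypothetical, and the variance is pandas' default sample variance.

```python
import pandas as pd


def filter_transcripts(expr, go_transcript_ids, n_top=10000):
    """Apply the GO filter, then the variance filter.

    expr: DataFrame of expression values (assumed layout: rows are
          transcripts, columns are samples).
    go_transcript_ids: IDs of transcripts mapped to GO-annotated genes.
    n_top: number of highest-variance transcripts to keep.
    """
    # Step 1: keep only transcripts that appear in the GO-mapped list.
    go_filtered = expr.loc[expr.index.isin(go_transcript_ids)]

    # Step 2: compute each transcript's variance across all samples
    # and keep the n_top transcripts with the largest variance.
    variances = go_filtered.var(axis=1)
    top_ids = variances.nlargest(n_top).index
    return go_filtered.loc[top_ids]


# Toy example: four transcripts, three samples.
expr = pd.DataFrame(
    {"s1": [1.0, 5.0, 0.0, 2.0],
     "s2": [1.0, 9.0, 10.0, 2.5],
     "s3": [1.0, 1.0, 20.0, 3.0]},
    index=["t1", "t2", "t3", "t4"],
)
go_ids = {"t1", "t2", "t4"}  # t3 is not GO-mapped, so it is dropped first
filtered = filter_transcripts(expr, go_ids, n_top=2)
print(filtered)
```

In the toy example, t3 has the largest variance but is removed by the GO filter, so the result contains t2 and t4. With the real data, `n_top` would stay at its default of 10,000.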
TODO: Write usage instructions
- Fork it!
- Create your feature branch:
git checkout -b my-new-feature
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Submit a pull request :D
TODO: Write history
TODO: Write credits
TODO: Write license