Final project: Automated Cell Phenotyping via Machine Learning Approaches

Automated Cell Phenotyping via Machine Learning Approaches: An Evaluation on Single-cell Transcriptomics Data

Getting Started

Data

Download the three datasets
Majority of the intermediate files are provided in this repo. However, Github doesn't allow uploading files over 100MB

Directory name	Description
scripts	All the scripts executed in the study, containning Juptyer Notebook file, Python and bash script
data	Original data collected from open source
data_selected	Pre-processed and feature selected data
clean_data	After PCA, split the data into train and test datasets
clean_data_nopca	In order to test the impact of PCA
clean_data_pca_cv	After PCA, make the cross vaildation data for supervised learning

Steps

Feature selection
PCA
Evaluate classic ML models (Kmeans, SVM, Logistic regression and XGboost) on three different human scRNA-seq datasets.
Use Final_Project_Results file to draw the performance comparison plots.

Executing program

First run the Final_Project_fs_cv_kmeans.ipynb. Tips:
- In order to make sure the code can be exeuted normally, please keep your data folder name consistent with this repo
- For the logistic regression, simply run the run_lr_cv.sh to automatelly run batch cross vaildation (Already included these commands in the notebook)
```
bash run_lr_cv.sh baron
```
- If you want to call the logistic regression model separately, follow this example:
```
python lr_final.py ../clean_data_pca/baron/fold_1/train_features.csv ../clean_data_pca/baron/fold_1/train_labels.csv ../clean_data_pca/baron/fold_1/valid_features.csv ../clean_data_pca/baron/fold_1/valid_labels.csv metrics.txt 50 0.01
```
Second, run Final_Project_SVM_XGB_ActiveLearning.ipynb
Finally, run Final_Project_Results.ipynb to get visualization

Methodology

Evaluation

Figure1 Performance comparison of models for cell identification using different scRNA-seq datasets. Heatmap of the average F1-scores across different cell populations per model (rows) per dataset (columns).

Figure2 Test Accuracy across sampling dataset until 33% with error bar using Uncertainty Batch Sampling and random batch sampling with base learner XGBoost.

Figure3 Training and Test Errors for SVM and XGBoost across different scRNA-seq datasets.

Conclusion

We compared the cell type classification performance of classic machine learning models based on different scRNA-seq datasets.
Generally, all supervised classifiers perform better compared to the unsupervised clustering method.
Taken together, we recommend the use of the SVM model, since it has the best performance.
Our evaluation results could be used as a reference for bioinformatics or experimental researchers to choose an appropriate machine learning-based classifier models depending on the dataset setup at hand.

Authors

Yuning Zheng (yuningz3)

Zhuoqun Li (zhuoqun3)

Mercury Liu (zhishanl)

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
.ipynb_checkpoints		.ipynb_checkpoints
LR_results		LR_results
clean_data		clean_data
clean_data_nopca		clean_data_nopca
clean_data_pca_cv		clean_data_pca_cv
data		data
data_selected		data_selected
figs		figs
scirpts		scirpts
.DS_Store		.DS_Store
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Final project: Automated Cell Phenotyping via Machine Learning Approaches

Getting Started

Data

Steps

Executing program

Methodology

Evaluation

Conclusion

Authors

About

Releases

Packages

Contributors 4

Languages

igemiracle/02620_final_Automated-Cell-Phenotyping

Folders and files

Latest commit

History

Repository files navigation

Final project: Automated Cell Phenotyping via Machine Learning Approaches

Getting Started

Data

Steps

Executing program

Methodology

Evaluation

Conclusion

Authors

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages