Commit
add links to manual review instructions
1 parent 2b9cb16 · commit 4824e27
Showing 2 changed files with 2 additions and 2 deletions.
@@ -1 +1 @@
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"private_outputs":true,"authorship_tag":"ABX9TyP2URes9rnJ78RXYHVhHADH"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"},"gpuClass":"standard"},"cells":[{"cell_type":"markdown","source":["# Running Training and Prediction Pipeline\n","---\n","This notebook provides all the commands to reproduce the results of training the models, and prediction on the full corpus.\n","\n","This process does not have to be done to update the inventory, but simply to reproduce the reported results, (this is the process used to produce them in the first place).\n","\n","This pipeline has the following steps:\n","\n","* Split the manually curated datasets\n","* Train all models on the classificaiton and NER tasks\n","* Select the best model for each task\n","* Evaluate all models for each task on their test sets\n","* Perform classification of full corpus\n","* Run NER model on predicted biodata resource papers\n","* Extract URLs from predicted positives\n","* Process the predicted names\n","* Perform automated initial deduplication\n","* Flag the inventory for selective manual review\n","\n","### ***Warning***:\n","\n","Running the full pipeline trains many models, and their \"checkpoint\" files are quite large (~0.5GB per model, ~15GB in total). Simply running prediction requires much less resources, including storage space.\n","\n","### Other use-cases\n","\n","If you want to compare a new model to the previously compared models, you can add another row to `config/models_info.tsv`. This pipeline will train this model and compare it to the others. If the other trained model checkpoint files are still present from a previous run, they will not be re-trained during the process.\n","\n","# Setup\n","---\n","### Mount Drive\n","\n","First, mount Google Drive to have access to files necessary for the run:\n"],"metadata":{"id":"x4whPVjZZa7x"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"BmwESzXcjXTb"},"outputs":[],"source":["from google.colab import drive\n","drive.mount('/content/drive')\n","%cd /content/drive/MyDrive/GitHub/inventory_2022/"]},{"cell_type":"markdown","source":["Run the make target to install Python and R dependencies."],"metadata":{"id":"6a7pMnIVbKXE"}},{"cell_type":"code","source":["! make setup"],"metadata":{"id":"iBMUW3C0YIz4"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# Obtaining Fine-tuned models\n","\n","All fine-tuned models have been archived. They can be optionally downloaded using the following cell. This cell also splits the training data first so that Snakemake will not automatically retrain the models (the training data is an input to the models, so if it is split after downloading, Snakemake will think the models are out of date)."],"metadata":{"id":"UXM8YkuDMOCI"}},{"cell_type":"code","source":["# Split the labeled data sets\n","! snakemake -s snakemake/train_predict.smk --configfile config/train_predict.yml -c 1 --until split_classif_data\n","! snakemake -s snakemake/train_predict.smk --configfile config/train_predict.yml -c 1 --until split_ner_data\n","\n","# Create output directory\n","! mkdir -p out/\n","\n","# Download models (may take several minutes)\n","! git lfs install\n","! git clone https://huggingface.co/globalbiodata/inventory_2022_all_models\n","\n","# Move models to proper directory and delete unused files\n","! mv inventory_2022_all_models/classification_models/ out/classif_train_out\n","! 
mv inventory_2022_all_models/ner_models/ out/ner_train_out\n","! rm -rf inventory_2022_all_models\n","! rm -rf out/classif_train_out/best\n","! rm -rf out/ner_train_out/best"],"metadata":{"id":"0UjJBuKpMzCZ"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# Running the pipeline\n","---\n","Now, we are ready to run the pipeline\n","\n","## Previewing what has to be done.\n","\n","The following can be run to get a preview of what has to be done."],"metadata":{"id":"XG8imhT0bms7"}},{"cell_type":"code","source":["! make dryrun_reproduction"],"metadata":{"id":"L6sCA8z9nQWZ"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Run it\n","\n","The following cell will run the entire pipeline described above. It takes a while, even with GPU acceleration. Without GPU it will take a very long time, if it is able to finish at all."],"metadata":{"id":"BIyIBNEGcC_u"}},{"cell_type":"code","source":["! make train_and_predict"],"metadata":{"id":"zFSmOvuUnSPE"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# Selective Manual Review\n","\n","After running the initial pipeline, the inventory has been flagged for selective manual review.\n","\n","The file to be reviewed is located at:\n","\n","`out/original_query/for_manual_review/predictions.csv`\n","\n","Review the flagged columns according to the instruction sheet, then place the manually reviewed file in the following folder:\n","\n","`out/original_query/manually_reviewed/`\n","\n","The file must still be named `predictions.csv`\n"],"metadata":{"id":"eMe39pCwPAoH"}},{"cell_type":"markdown","source":["# Processing Manual Review\n","\n","Next, further processing is performed on the manually reviewed inventory.\n","\n","If you simply want to reproduce the original results (without manually reviewing the inventory or training models), you can copy the files that would have been generated to this point with the following commands. Otherwise, skip this code chunk."],"metadata":{"id":"Cqbody4wPyUM"}},{"cell_type":"code","source":["! mkdir -p out/original_query\n","! mkdir -p out/original_query/manually_reviewed/\n","! cp data/manually_reviewed_inventory.csv out/original_query/manually_reviewed/predictions.csv\n","! mkdir -p out/classif_train_out/combined_train_stats/\n","! cp data/classif_metrics/combined_train_stats.csv out/classif_train_out/combined_train_stats/combined_stats.csv\n","! mkdir -p out/classif_train_out/combined_test_stats/\n","! cp data/classif_metrics/combined_test_stats.csv out/classif_train_out/combined_test_stats/combined_stats.csv\n","! mkdir -p out/ner_train_out/combined_train_stats/\n","! cp data/ner_metrics/combined_train_stats.csv out/ner_train_out/combined_train_stats/combined_stats.csv\n","! mkdir -p out/ner_train_out/combined_test_stats/\n","! cp data/ner_metrics/combined_test_stats.csv out/ner_train_out/combined_test_stats/combined_stats.csv"],"metadata":{"id":"Iya2L4ecM2IG"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["To also skip processing of the manually reviewed file (such as getting metadata from EuropePMC), and just perform data analysis on the original output files, run the following code chunk:"],"metadata":{"id":"Y5ujIDs9NjW8"}},{"cell_type":"code","source":["! mkdir -p out/original_query/processed_countries/\n","! 
cp data/final_inventory_2022.csv out/original_query/processed_countries/predictions.csv"],"metadata":{"id":"ghybzbpKNu_-"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Final analysis of the inventory compares the resources found to those in re3data and FAIRsharing. FAIRsharing requires login credentials to use their API. Before proceeding, please create an account at FAIRsharing, and enter your email address and password in the file `config/fairsharing_login.json`."],"metadata":{"id":"DviDeEJVK-eR"}},{"cell_type":"markdown","source":["After manually reviewing the inventory or running the above code chunks to copy the previous files, final processing is performed with the below code chunk:"],"metadata":{"id":"cPk1Ym3MPMXe"}},{"cell_type":"code","source":["! make process_manually_reviewed_original"],"metadata":{"id":"xPoZb6piP7fd"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# Results\n","---\n","Once the pipeline everythuing is complete, there are a few important output files\n","\n","\n","## Final inventory\n","\n","The final inventory, including names, URLS, and metadata is found in the file:\n","* `out/original_query/processed_countries/predictions.csv`\n","\n","## Model training stats\n","\n","The per-epoch training statistics for all models are in the files:\n","\n","* `out/classif_train_out/combined_train_stats/combined_stats.csv`\n","* `out/ner_train_out/combined_train_stats/combined_stats.csv`\n","\n","## Test set evaluation\n","\n","Performance measures of the trained models on the test set are located in the files:\n","\n","* `out/classif_train_out/combined_test_stats/combined_stats.csv`\n","* `out/ner_train_out/combined_test_stats/combined_stats.csv`\n","\n","## Selected models\n","The name of the best models are in the files:\n","\n","* `out/classif_train_out/best/best_checkpt.txt`\n","* `out/ner_train_out/best/best_checkpt.txt`\n","\n","## Figures and analyses\n","\n","Figures showing the model performances on the validation sets are present:\n","\n","* `analysis/figures/class_val_set_performances.png`\n","* `analysis/figures/class_val_set_performances.svg`\n","* `analysis/figures/ner_val_set_performances.png`\n","* `analysis/figures/ner_val_set_performances.svg`\n","\n","There are tables of all models' performance on the validation and test sets:\n","\n","* `analysis/figures/combined_classification_table.docx`\n","* `analysis/figures/combined_ner_table.docx`\n","\n","Figures of location data are also output:\n","\n","* `analysis/figures/author_countries.png`\n","* `analysis/figures/ip_coordinates.png`\n","* `analysis/figures/ip_countries.png`\n","\n","Figures/table on text mining potential:\n","\n","* `analysis/figures/text_mining_potential_plot.png`\n","* `analysis/figures/text_mining_potential_plot.svg`\n","* `analysis/figures/text_mining_potential.csv`\n","\n","Comparisons to re3data and FAIRsharing:\n","\n","* `analysis/inventory_re3data_fairsharing_summary.csv`\n","* `analysis/venn_diagram_sets.csv`\n","\n","Finally, some stats on the invenotry are saved:\n","\n","* `analysis/analysed_metadata.txt`"],"metadata":{"id":"RsXV-FmxccZN"}}]} | ||
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"private_outputs":true,"authorship_tag":"ABX9TyM0qmq/11ZbcbKfRqcLv4cZ"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"},"gpuClass":"standard"},"cells":[{"cell_type":"markdown","source":["# Running Training and Prediction Pipeline\n","---\n","This notebook provides all the commands to reproduce the results of training the models, and prediction on the full corpus.\n","\n","This process does not have to be done to update the inventory, but simply to reproduce the reported results, (this is the process used to produce them in the first place).\n","\n","This pipeline has the following steps:\n","\n","* Split the manually curated datasets\n","* Train all models on the classificaiton and NER tasks\n","* Select the best model for each task\n","* Evaluate all models for each task on their test sets\n","* Perform classification of full corpus\n","* Run NER model on predicted biodata resource papers\n","* Extract URLs from predicted positives\n","* Process the predicted names\n","* Perform automated initial deduplication\n","* Flag the inventory for selective manual review\n","\n","### ***Warning***:\n","\n","Running the full pipeline trains many models, and their \"checkpoint\" files are quite large (~0.5GB per model, ~15GB in total). Simply running prediction requires much less resources, including storage space.\n","\n","### Other use-cases\n","\n","If you want to compare a new model to the previously compared models, you can add another row to `config/models_info.tsv`. This pipeline will train this model and compare it to the others. If the other trained model checkpoint files are still present from a previous run, they will not be re-trained during the process.\n","\n","# Setup\n","---\n","### Mount Drive\n","\n","First, mount Google Drive to have access to files necessary for the run:\n"],"metadata":{"id":"x4whPVjZZa7x"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"BmwESzXcjXTb"},"outputs":[],"source":["from google.colab import drive\n","drive.mount('/content/drive')\n","%cd /content/drive/MyDrive/GitHub/inventory_2022/"]},{"cell_type":"markdown","source":["Run the make target to install Python and R dependencies."],"metadata":{"id":"6a7pMnIVbKXE"}},{"cell_type":"code","source":["! make setup"],"metadata":{"id":"iBMUW3C0YIz4"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# Obtaining Fine-tuned models\n","\n","All fine-tuned models have been archived. They can be optionally downloaded using the following cell. This cell also splits the training data first so that Snakemake will not automatically retrain the models (the training data is an input to the models, so if it is split after downloading, Snakemake will think the models are out of date)."],"metadata":{"id":"UXM8YkuDMOCI"}},{"cell_type":"code","source":["# Split the labeled data sets\n","! snakemake -s snakemake/train_predict.smk --configfile config/train_predict.yml -c 1 --until split_classif_data\n","! snakemake -s snakemake/train_predict.smk --configfile config/train_predict.yml -c 1 --until split_ner_data\n","\n","# Create output directory\n","! mkdir -p out/\n","\n","# Download models (may take several minutes)\n","! git lfs install\n","! git clone https://huggingface.co/globalbiodata/inventory_2022_all_models\n","\n","# Move models to proper directory and delete unused files\n","! mv inventory_2022_all_models/classification_models/ out/classif_train_out\n","! 
mv inventory_2022_all_models/ner_models/ out/ner_train_out\n","! rm -rf inventory_2022_all_models\n","! rm -rf out/classif_train_out/best\n","! rm -rf out/ner_train_out/best"],"metadata":{"id":"0UjJBuKpMzCZ"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# Running the pipeline\n","---\n","Now, we are ready to run the pipeline\n","\n","## Previewing what has to be done.\n","\n","The following can be run to get a preview of what has to be done."],"metadata":{"id":"XG8imhT0bms7"}},{"cell_type":"code","source":["! make dryrun_reproduction"],"metadata":{"id":"L6sCA8z9nQWZ"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Run it\n","\n","The following cell will run the entire pipeline described above. It takes a while, even with GPU acceleration. Without GPU it will take a very long time, if it is able to finish at all."],"metadata":{"id":"BIyIBNEGcC_u"}},{"cell_type":"code","source":["! make train_and_predict"],"metadata":{"id":"zFSmOvuUnSPE"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# Selective Manual Review\n","\n","After running the initial pipeline, the inventory has been flagged for selective manual review.\n","\n","The file to be reviewed is located at:\n","\n","`out/original_query/for_manual_review/predictions.csv`\n","\n","Review the flagged columns according to the instruction sheet ([doi: 10.5281/zenodo.7768363](https://doi.org/10.5281/zenodo.7768363)), then place the manually reviewed file in the following folder:\n","\n","`out/original_query/manually_reviewed/`\n","\n","The file must still be named `predictions.csv`\n"],"metadata":{"id":"eMe39pCwPAoH"}},{"cell_type":"markdown","source":["# Processing Manual Review\n","\n","Next, further processing is performed on the manually reviewed inventory.\n","\n","If you simply want to reproduce the original results (without manually reviewing the inventory or training models), you can copy the files that would have been generated to this point with the following commands. Otherwise, skip this code chunk."],"metadata":{"id":"Cqbody4wPyUM"}},{"cell_type":"code","source":["! mkdir -p out/original_query\n","! mkdir -p out/original_query/manually_reviewed/\n","! cp data/manually_reviewed_inventory.csv out/original_query/manually_reviewed/predictions.csv\n","! mkdir -p out/classif_train_out/combined_train_stats/\n","! cp data/classif_metrics/combined_train_stats.csv out/classif_train_out/combined_train_stats/combined_stats.csv\n","! mkdir -p out/classif_train_out/combined_test_stats/\n","! cp data/classif_metrics/combined_test_stats.csv out/classif_train_out/combined_test_stats/combined_stats.csv\n","! mkdir -p out/ner_train_out/combined_train_stats/\n","! cp data/ner_metrics/combined_train_stats.csv out/ner_train_out/combined_train_stats/combined_stats.csv\n","! mkdir -p out/ner_train_out/combined_test_stats/\n","! cp data/ner_metrics/combined_test_stats.csv out/ner_train_out/combined_test_stats/combined_stats.csv"],"metadata":{"id":"Iya2L4ecM2IG"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["To also skip processing of the manually reviewed file (such as getting metadata from EuropePMC), and just perform data analysis on the original output files, run the following code chunk:"],"metadata":{"id":"Y5ujIDs9NjW8"}},{"cell_type":"code","source":["! mkdir -p out/original_query/processed_countries/\n","! 
cp data/final_inventory_2022.csv out/original_query/processed_countries/predictions.csv"],"metadata":{"id":"ghybzbpKNu_-"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Final analysis of the inventory compares the resources found to those in re3data and FAIRsharing. FAIRsharing requires login credentials to use their API. Before proceeding, please create an account at FAIRsharing, and enter your email address and password in the file `config/fairsharing_login.json`."],"metadata":{"id":"DviDeEJVK-eR"}},{"cell_type":"markdown","source":["After manually reviewing the inventory or running the above code chunks to copy the previous files, final processing is performed with the below code chunk:"],"metadata":{"id":"cPk1Ym3MPMXe"}},{"cell_type":"code","source":["! make process_manually_reviewed_original"],"metadata":{"id":"xPoZb6piP7fd"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# Results\n","---\n","Once the pipeline everythuing is complete, there are a few important output files\n","\n","\n","## Final inventory\n","\n","The final inventory, including names, URLS, and metadata is found in the file:\n","* `out/original_query/processed_countries/predictions.csv`\n","\n","## Model training stats\n","\n","The per-epoch training statistics for all models are in the files:\n","\n","* `out/classif_train_out/combined_train_stats/combined_stats.csv`\n","* `out/ner_train_out/combined_train_stats/combined_stats.csv`\n","\n","## Test set evaluation\n","\n","Performance measures of the trained models on the test set are located in the files:\n","\n","* `out/classif_train_out/combined_test_stats/combined_stats.csv`\n","* `out/ner_train_out/combined_test_stats/combined_stats.csv`\n","\n","## Selected models\n","The name of the best models are in the files:\n","\n","* `out/classif_train_out/best/best_checkpt.txt`\n","* `out/ner_train_out/best/best_checkpt.txt`\n","\n","## Figures and analyses\n","\n","Figures showing the model performances on the validation sets are present:\n","\n","* `analysis/figures/class_val_set_performances.png`\n","* `analysis/figures/class_val_set_performances.svg`\n","* `analysis/figures/ner_val_set_performances.png`\n","* `analysis/figures/ner_val_set_performances.svg`\n","\n","There are tables of all models' performance on the validation and test sets:\n","\n","* `analysis/figures/combined_classification_table.docx`\n","* `analysis/figures/combined_ner_table.docx`\n","\n","Figures of location data are also output:\n","\n","* `analysis/figures/author_countries.png`\n","* `analysis/figures/ip_coordinates.png`\n","* `analysis/figures/ip_countries.png`\n","\n","Figures/table on text mining potential:\n","\n","* `analysis/figures/text_mining_potential_plot.png`\n","* `analysis/figures/text_mining_potential_plot.svg`\n","* `analysis/figures/text_mining_potential.csv`\n","\n","Comparisons to re3data and FAIRsharing:\n","\n","* `analysis/inventory_re3data_fairsharing_summary.csv`\n","* `analysis/venn_diagram_sets.csv`\n","\n","Finally, some stats on the invenotry are saved:\n","\n","* `analysis/analysed_metadata.txt`"],"metadata":{"id":"RsXV-FmxccZN"}}]} |
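Note: because each notebook revision is stored as a single JSON line, the diff above is hard to read; the substantive change is the DOI link to the manual-review instruction sheet added in the "Selective Manual Review" cell. The sketch below (Python standard library only; the two file paths are hypothetical placeholders for local copies of the old and new revisions) prints exactly which markdown cells differ:

```python
import json
from difflib import unified_diff

# Hypothetical paths to local copies of the two notebook revisions being compared
OLD_NB = "old/training_prediction_pipeline.ipynb"
NEW_NB = "new/training_prediction_pipeline.ipynb"


def markdown_sources(path):
    """Return the joined source text of every markdown cell, in notebook order."""
    with open(path, encoding="utf-8") as fh:
        nb = json.load(fh)
    return ["".join(cell["source"])
            for cell in nb["cells"] if cell["cell_type"] == "markdown"]


# Compare markdown cells pairwise (assumes both revisions have the same cell layout)
# and show a unified diff for any cell whose text changed
for i, (old, new) in enumerate(zip(markdown_sources(OLD_NB), markdown_sources(NEW_NB))):
    if old != new:
        print(f"=== markdown cell {i} ===")
        print("\n".join(unified_diff(old.splitlines(), new.splitlines(),
                                      fromfile="old", tofile="new", lineterm="")))
```

For this commit, the only markdown-cell difference should be the sentence that gains the `[doi: 10.5281/zenodo.7768363]` link; the remaining change is the Colab `authorship_tag` in the notebook metadata.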