Merge pull request #23 from globalbiodata/inventory_2022_dev
Merge dev into main
schackartk authored Mar 24, 2023
2 parents e69afcb + b74774d commit 38e0d5b
Showing 74 changed files with 47,040 additions and 2 deletions.
14 changes: 14 additions & 0 deletions .gitignore
@@ -0,0 +1,14 @@
env/
r_env/
out*/
__pycache__/
.snakemake/
.vscode/
.Rproj.user
.Rhistory
.Rprofile
*.html
data/classif_splits/
data/ner_splits/
config/ia_access_key.txt
config/ia_secret_key.txt
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2022 Chan Zuckerberg Initiative Foundation and Global Biodata Coalition

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
49 changes: 49 additions & 0 deletions Makefile
@@ -0,0 +1,49 @@
.PHONY: dryrun_reproduction setup setup_for_updating test train_and_predict process_manually_reviewed_original update_inventory process_manually_reviewed_update

dryrun_reproduction:
snakemake \
-s snakemake/train_predict.smk -np \
--configfile config/train_predict.yml

setup:
pip install -r requirements.txt
echo "import nltk \nnltk.download('punkt')" | python3 /dev/stdin
pip install --upgrade numpy
Rscript -e 'install.packages("renv", repos="http://cran.us.r-project.org")'
Rscript -e 'renv::restore()'

setup_for_updating:
pip install -r requirements.txt
echo "import nltk \nnltk.download('punkt')" | python3 /dev/stdin
pip install --upgrade numpy

test:
python3 -m pytest -v \
--flake8 --mypy --pylint \
--pylint-rcfile=config/.pylintrc \
src/inventory_utils/*.py \
src/*.py \

train_and_predict:
snakemake \
-s snakemake/train_predict.smk \
--configfile config/train_predict.yml -c1

process_manually_reviewed_original:
snakemake \
-s snakemake/train_predict.smk \
--configfile config/train_predict.yml \
-c 1 \
--until all_analysis

update_inventory:
snakemake \
-s snakemake/update_inventory.smk \
--configfile config/update_inventory.yml -c1

process_manually_reviewed_update:
snakemake \
-s snakemake/update_inventory.smk \
--configfile config/update_inventory.yml \
-c 1 \
--until process_countries
361 changes: 359 additions & 2 deletions README.md

93 changes: 93 additions & 0 deletions analysis/README.md
@@ -0,0 +1,93 @@
# Data Analysis

This directory contains R scripts for some analysis of the inventory conducted in 2022. They are stored here rather than in [src](../src/) because their reuse is likely limited and strictly related to analysis. These scripts are, however, used in the [train and predict Snakemake pipeline](../snakemake/train_predict.smk).

```sh
.
├── comparison.R # Retrieve life sci resources from FAIRsharing and re3data
├── epmc_metadata.R # Retrieve ePMC metadata to determine OA, full text, etc.
├── funders.R # Analyse funder metadata by article and biodata resource
├── funders_geo.R # Analyse top 200 funders by country
├── location_information.R # Generate maps of resource location metadata
├── metadata_analysis.R # Perform high-level metadata analysis
└── performance_metrics.R # Create plots and tables of model performances
```

All R scripts are command-line executable and take output files from the inventory as inputs for analysis. Usage statements are available through the `-h|--help` flag.
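
For example, to print the usage statement for `location_information.R` (any script in this directory can be substituted):

```sh
$ Rscript analysis/location_information.R --help
```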

## `location_information.R`

The final inventory file is supplied as input, the output directory is specified with `-o|--out-dir`, and three maps are generated (see the example invocation after this list):

* `ip_coordinates.png`: IP host coordinates dot plot
* `ip_countries.png`: IP host countries heatmap, with country fill color scaled to country name count
* `author_countries.png`: Author affiliation countries heatmap, with country fill color scaled to country name count
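
A typical run might look like the following, using the 2022 inventory file from the repository (the output directory name is illustrative):

```sh
$ Rscript analysis/location_information.R \
    data/final_inventory_2022.csv \
    -o analysis/figures
```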

## `metadata_analysis.R`

The final inventory file is supplied as input, and various metadata statistics are output to stdout. To save this output, redirect (`>`) it to a file. For example, running from the root of the repository:

```sh
$ Rscript analysis/metadata_analysis.R \
data/final_inventory_2022.csv \
> analysis/analysed_metadata.txt
```

In this case, no output will be seen in the terminal, but the output will be present in `analysis/analysed_metadata.txt`.

Information included in this analysis:

* Number of unique articles
* Number of resources with at least 1 URL returning 2XX or 3XX
* Number of resources with at least 1 WayBack URL
* Number of resources with grant agency data

## `performance_metrics.R`

This script analyses the model performance metrics on the validation and test sets. The output directory is specified with `-o|--out-dir`. Four files are needed as input:

* `-cv|--class-train`: Classification training and validation set statistics
* `-ct|--class-test`: Classification test set statistics
* `-nv|--ner-train`: NER training and validation set statistics
* `-nt|--ner-test`: NER test set statistics

The defaults for these arguments are the files stored in the repository, which are the results of the inventory conducted in 2022.
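
Because the defaults already point at the repository's files, a minimal run may only need the output directory (the directory name below is illustrative):

```sh
$ Rscript analysis/performance_metrics.R -o analysis/figures
```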

Six files are output:

* `class_val_set_performance.svg` and `class_val_set_performance.png`: Bar chart showing the performance of all article classification models on the validation set. Metrics include *F*1-score, precision, and recall. Models are in decreasing order of precision.
* `ner_val_set_performance.svg` and `ner_val_set_performance.png`: Bar chart showing the performance of all NER models on the validation set. Metrics include *F*1-score, precision, and recall. Models are in decreasing order of *F*1-score.
* `combined_classification_table.docx`: A Microsoft Word doc with a table showing the performance of all article classification models on the validation and test sets. Models are in decreasing order of precision on the validation set.
* `combined_ner_table.docx`: A Microsoft Word doc with a table showing the performance of all NER models on the validation and test sets. Models are in decreasing order of *F*1-score on the validation set.

## `epmc_metadata.R`

The final inventory file is supplied as input, and the Europe PMC API is queried to determine whether each article has a CC license, is open access, has full text available, has text-mined terms, and has text-mined accession numbers. Note that all but full-text availability are found by querying the PMIDs in the final inventory file; for full text, the original query was restricted to articles flagged as OA with full text available across the entire corpus, and those PMIDs were then matched against the PMIDs in the final inventory.

1 file is output:
* `text_mining_potential.csv`: A summary table of article counts (Y (Yes) or N (No)) and percentages
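
A sketch of a typical run, assuming the script takes the inventory file and an output directory like the others (the output directory is illustrative):

```sh
$ Rscript analysis/epmc_metadata.R \
    data/final_inventory_2022.csv \
    -o analysis/epmc
```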

## `comparison.R`

Inputs are retrieved by querying the records available from the re3data.org API and the FAIRsharing API. Returns are filtered to life science resources and then compared against the resources identified in the final inventory. The resources in these two repositories are also compared against one another and the inventory to get a sense of the overlap.

2 files are output:

* `inventory_re3data_fairsharing_summary.csv`: Number of overlapping resources in the inventory, re3data, and FAIRsharing.
* `venn_diagram_set.csv`: Intersection set sizes between resources in the inventory, re3data, and FAIRsharing.
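
If the script follows the same pattern as the others, an invocation might look like this (the inventory argument and output directory are assumptions; check `--help` for the exact interface):

```sh
$ Rscript analysis/comparison.R \
    data/final_inventory_2022.csv \
    -o analysis/comparison
```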

## `funders.R`

The final inventory file is supplied as input, and the Europe PMC API is queried to retrieve "agency" metadata from individual articles (note that biodata resources in the inventory have concatenated "grantID" and "agency" values for resources with more than one article). This script retrieves "agency" for each article, when present, to analyse the supporting funding organizations identified.

1 file is output:
* `inventory_funders.csv`: Deduplicated funder names with total unique article count, total unique biodata resource count, associated article PMIDs (list) and associated biodata resources (list).
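
A sketch of a typical run, again assuming the common inventory-in, directory-out pattern (paths are illustrative):

```sh
$ Rscript analysis/funders.R \
    data/final_inventory_2022.csv \
    -o analysis/funders
```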

## `funders_geo.R`

The output file from `funders.R` (`inventory_funders_2023-01-20.csv`) was manually curated to determine countries for funders mentioned more than twice, and these were mapped to ISO 3166-1 alpha-3 country codes. The resulting file, `funders_geo_200.csv`, is used as the input for this script, which groups funders by country to produce summary statistics. Note that agency names carry some ambiguity, whether from unclear parent-child relationships (e.g. NIH vs. NIGMS) or from inconsistent naming (e.g. National Key Research and Development Program vs. National Key Research Program of China).

2 files are output:
* `funders_geo_counts.csv`: Per-country summary with counts of unique agency names and unique biodata resources, plus agency names (list) and biodata resource names (list).
* `funder_countries.png`: A (heat)map showing the number of biodata resources funded by at least one agency from a given country.
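
A sketch of a run using the curated input described above (the location of `funders_geo_200.csv` is an assumption; check `--help` for the exact interface):

```sh
$ Rscript analysis/funders_geo.R \
    analysis/funders_geo_200.csv \
    -o analysis/funders_geo
```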