Merge pull request #23 from globalbiodata/inventory_2022_dev
Merge dev into main
schackartk authored Mar 24, 2023
2 parents e69afcb + b74774d commit 38e0d5b
Showing 74 changed files with 47,040 additions and 2 deletions.
14 changes: 14 additions & 0 deletions .gitignore
@@ -0,0 +1,14 @@
env/
r_env/
out*/
__pycache__/
.snakemake/
.vscode/
.Rproj.user
.Rhistory
.Rprofile
*.html
data/classif_splits/
data/ner_splits/
config/ia_access_key.txt
config/ia_secret_key.txt
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2022 Chan Zuckerberg Initiative Foundation and Global Biodata Coalition

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
49 changes: 49 additions & 0 deletions Makefile
@@ -0,0 +1,49 @@
.PHONY: dryrun_reproduction setup setup_for_updating test train_and_predict process_manually_reviewed_original update_inventory process_manually_reviewed_update

dryrun_reproduction:
snakemake \
-s snakemake/train_predict.smk -np \
--configfile config/train_predict.yml

setup:
pip install -r requirements.txt
echo "import nltk \nnltk.download('punkt')" | python3 /dev/stdin
pip install --upgrade numpy
Rscript -e 'install.packages("renv", repos="http://cran.us.r-project.org")'
Rscript -e 'renv::restore()'

setup_for_updating:
pip install -r requirements.txt
echo "import nltk \nnltk.download('punkt')" | python3 /dev/stdin
pip install --upgrade numpy

test:
python3 -m pytest -v \
--flake8 --mypy --pylint \
--pylint-rcfile=config/.pylintrc \
src/inventory_utils/*.py \
src/*.py \

train_and_predict:
snakemake \
-s snakemake/train_predict.smk \
--configfile config/train_predict.yml -c1

process_manually_reviewed_original:
snakemake \
-s snakemake/train_predict.smk \
--configfile config/train_predict.yml \
-c 1 \
--until all_analysis

update_inventory:
snakemake \
-s snakemake/update_inventory.smk \
--configfile config/update_inventory.yml -c1

process_manually_reviewed_update:
snakemake \
-s snakemake/update_inventory.smk \
--configfile config/update_inventory.yml \
-c 1 \
--until process_countries
361 changes: 359 additions & 2 deletions README.md

93 changes: 93 additions & 0 deletions analysis/README.md
@@ -0,0 +1,93 @@
# Data Analysis

This directory contains R scripts for some analysis of the inventory conducted in 2022. They are stored here rather than in [src](../src/) because their reuse is likely limited and strictly related to analysis. These scripts are, however, used in the [train and predict Snakemake pipeline](../snakemake/train_predict.smk).

```sh
.
├── comparison.R # Retrieve life sci resources from FAIRsharing and re3data
├── epmc_metadata.R # Retrieve ePMC metadata to determine OA, full text, etc.
├── funders.R # Analyse funder metadata by article and biodata resource
├── funders_geo.R # Analyse top 200 funders by country
├── location_information.R # Generate maps of resource location metadata
├── metadata_analysis.R # Perform high-level metadata analysis
└── performance_metrics.R # Create plots and tables of model performances
```

All R scripts are command-line executable and take output files from the inventory as inputs for analysis. Usage statements are available through the `-h|--help` flag.
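
For example, to print the usage statement for `location_information.R` (any script in this directory can be substituted):

```sh
$ Rscript analysis/location_information.R --help
```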

## `location_information.R`

The final inventory file is supplied as input, the output directory is specified with `-o|--out-dir`, and three maps are generated (see the example invocation after this list):

* `ip_coordinates.png`: IP host coordinates dot plot
* `ip_countries.png`: IP host countries heatmap, with country fill color scaled to country name count
* `author_countries.png`: Author affiliation countries heatmap, with country fill color scaled to country name count
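
A typical run might look like the following, using the 2022 inventory file from the repository (the output directory name is illustrative):

```sh
$ Rscript analysis/location_information.R \
    data/final_inventory_2022.csv \
    -o analysis/figures
```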

## `metadata_analysis.R`

The final inventory file is supplied as input, and various metadata statistics are output to stdout. To save this output, redirect (`>`) it to a file. For example, running from the root of the repository:

```sh
$ Rscript analysis/metadata_analysis.R \
data/final_inventory_2022.csv \
> analysis/analysed_metadata.txt
```

In this case, no output will be seen in the terminal, but the output will be present in `analysis/analysed_metadata.txt`.

Information included in this analysis:

* Number of unique articles
* Number of resources with at least 1 URL returning 2XX or 3XX
* Number of resources with at least 1 WayBack URL
* Number of resources with grant agency data

## `performance_metrics.R`

This script analyses the model performance metrics on the validation and test sets. The output directory is specified with `-o|--out-dir`. Four files are needed as input:

* `-cv|--class-train`: Classification training and validation set statistics
* `-ct|--class-test`: Classification test set statistics
* `-nv|--ner-train`: NER training and validation set statistics
* `-nt|--ner-test`: NER test set statistics

The defaults for these arguments are the files stored in the repository, which are the results of the inventory conducted in 2022.
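
Because the defaults already point at the repository's files, a minimal run may only need the output directory (the directory name below is illustrative):

```sh
$ Rscript analysis/performance_metrics.R -o analysis/figures
```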

Six files are output:

* `class_val_set_performance.svg` and `class_val_set_performance.png`: Bar chart showing the performance of all article classification models on the validation set. Metrics include *F*1-score, precision, and recall. Models are in decreasing order of precision.
* `ner_val_set_performance.svg` and `ner_val_set_performance.png`: Bar chart showing the performance of all NER models on the validation set. Metrics include *F*1-score, precision, and recall. Models are in decreasing order of *F*1-score.
* `combined_classification_table.docx`: A Microsoft Word doc with a table showing the performance of all article classification models on the validation and test sets. Models are in decreasing order of precision on the validation set.
* `combined_ner_table.docx`: A Microsoft Word doc with a table showing the performance of all NER models on the validation and test sets. Models are in decreasing order of *F*1-score on the validation set.

## `epmc_metadata.R`

The final inventory file is supplied as input, and the Europe PMC API is queried to determine whether each article has a CC license, is open access, has full text available, has text-mined terms, and has text-mined accession numbers. Note that all but full-text availability are found by querying the PMIDs in the final inventory file; for full text, the original query was restricted to articles flagged as OA with full text available across the entire corpus, and those PMIDs were then matched against the PMIDs in the final inventory.

1 file is output:
* `text_mining_potential.csv`: A summary table of article counts (Y (Yes) or N (No)) and percentages
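
A sketch of a typical run, assuming the script takes the inventory file and an output directory like the others (the output directory is illustrative):

```sh
$ Rscript analysis/epmc_metadata.R \
    data/final_inventory_2022.csv \
    -o analysis/epmc
```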

## `comparison.R`

Inputs are retrieved by querying the records available from the re3data.org API and the FAIRsharing API. Returns are filtered to life science resources and then compared against the resources identified in the final inventory. The resources in these two repositories are also compared against one another and the inventory to get a sense of the overlap.

2 files are output:

* `inventory_re3data_fairsharing_summary.csv`: Number of overlapping resources in the inventory, re3data, and FAIRsharing.
* `venn_diagram_set.csv`: Intersection set sizes between resources in the inventory, re3data, and FAIRsharing.
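
If the script follows the same pattern as the others, an invocation might look like this (the inventory argument and output directory are assumptions; check `--help` for the exact interface):

```sh
$ Rscript analysis/comparison.R \
    data/final_inventory_2022.csv \
    -o analysis/comparison
```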

## `funders.R`

The final inventory file is supplied as input, and the Europe PMC API is queried to retrieve "agency" metadata from individual articles (note that biodata resources in the inventory have concatenated "grantID" and "agency" values for resources with more than one article). This script retrieves "agency" for each article, when present, to analyse the supporting funding organizations identified.

1 file is output:
* `inventory_funders.csv`: Deduplicated funder names with total unique article count, total unique biodata resource count, associated article PMIDs (list) and associated biodata resources (list).
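
A sketch of a typical run, again assuming the common inventory-in, directory-out pattern (paths are illustrative):

```sh
$ Rscript analysis/funders.R \
    data/final_inventory_2022.csv \
    -o analysis/funders
```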

## `funders_geo.R`

The output file from `funders.R` (`inventory_funders_2023-01-20.csv`) was manually curated to determine countries for funders mentioned more than twice, and these were mapped to ISO 3166-1 alpha-3 country codes. The resulting file, `funders_geo_200.csv`, is used as the input for this script, which groups funders by country to produce summary statistics. Note that agency names carry some ambiguity, whether from unclear parent-child relationships (e.g. NIH vs. NIGMS) or from inconsistent naming (e.g. National Key Research and Development Program vs. National Key Research Program of China).

2 files are output:
* `funders_geo_counts.csv`: Per-country summary with counts of unique agency names and unique biodata resources, plus agency names (list) and biodata resource names (list).
* `funder_countries.png`: A (heat)map showing the number of biodata resources funded by at least one agency from a given country.
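
A sketch of a run using the curated input described above (the location of `funders_geo_200.csv` is an assumption; check `--help` for the exact interface):

```sh
$ Rscript analysis/funders_geo.R \
    analysis/funders_geo_200.csv \
    -o analysis/funders_geo
```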