Merge pull request #22 from schackartk/inventory_2022_dev
Inventory 2022 dev
schackartk authored Mar 24, 2023
2 parents 3c34dae + 4824e27 commit b74774d
Showing 16 changed files with 87 additions and 53 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2022 anaistrate
Copyright (c) 2022 Chan Zuckerberg Initiative Foundation and Global Biodata Coalition

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
45 changes: 36 additions & 9 deletions README.md
@@ -2,7 +2,11 @@

This code repository represents work done as part of a collaborative effort between the Chan Zuckerberg Initiative (CZI) and Global Biodata Coalition (GBC) to create an inventory of biodata resources found in scientific articles. CZI Research Scientist Ana-Maria Istrate designed the machine learning framework for the project and wrote the code to implement and evaluate the NLP models used to classify articles and extract individual resources. Ana’s code was used by GBC consultant Ken Schackart as the starting point for a pipeline to create an ML-predicted preliminary inventory, which is then further refined with code that includes steps for deduplication, processing for selective manual review, and augmentation with additional attributes to create the final inventory of biodata resources.

## Overview of Methods
## Motivation

GBC initiated this project with the objective of gaining an understanding of the global infrastructure of biological data resources. While registries of data resources exist (such as [re3data](https://www.re3data.org/) and [FAIRsharing](https://fairsharing.org/)), their scopes differ from that intended by GBC. This project was therefore undertaken to create an inventory of the global biodata resource infrastructure using reproducible methodologies, so that the inventory can be periodically updated.

## Overview of methods

EuropePMC is queried to obtain titles and abstracts of scientific articles. A BERT model is used to classify those articles as describing or not describing a biodata resource. A BERT model is also used to perform named entity recognition to extract the resource name from those articles that are predicted to describe a biodata resource. Resource URLs are extracted using a regular expression.
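To make the URL-extraction step concrete, below is a minimal sketch of the idea. The pattern and function names here are illustrative assumptions, not the pipeline's actual implementation (which lives in this repository's `src/` code):

```python
import re

# Illustrative pattern only; the pipeline defines its own regular expression.
URL_PATTERN = re.compile(r'https?://[^\s"\'<>),;]+', re.IGNORECASE)

def extract_urls(text: str) -> list:
    """Return URL-like strings found in an article title or abstract."""
    return URL_PATTERN.findall(text)

print(extract_urls('The database is freely available at http://example.org/db.'))
# ['http://example.org/db.'] -- trailing punctuation may need trimming in practice
```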

@@ -14,6 +18,18 @@ The final inventory gives a list of biodata resources, the PMIDs of the articles

Snakemake is used as a workflow manager to automate these processes.

## Intended uses

The code and pipelines here have been designed with a few intended use cases:

- **Reproduction**: We intend for the results of this study to be directly reproducible using the pipelines presented here. That includes fine-tuning the models on the manually curated datasets, selecting the best model, using the models for prediction, and all downstream processes.

- **Updating**: It should be possible to get an updated inventory with minimal changes to the code. A pipeline was developed that allows the user to provide a new publication date range, and then the fine-tuned models are used to process the new data to yield an updated list of resources and their associated metadata.

- **Generalization**: With some extra work, it should be possible for a future user to manually curate new training data and use the existing pipelines to fine-tune the models and perform all downstream analysis. We note that while much of the existing code would be useful, some changes to the code would likely be necessary in this case.

To help with the usability of this code, it has been tested on Google Colab. If you would like to run the code on Colab, [this protocol](https://dx.doi.org/10.17504/protocols.io.5jyl89o36v2w/v3) provides instructions on how to set up Colab and clone this project there. Note that Google and GitHub accounts are required to follow those instructions.

# Workflow overview

## Data curation
@@ -103,7 +119,7 @@ graph TD

The process up to this point is run without human intervention. As a quality control measure, the inventory must be manually reviewed for articles that are potentially duplicate descriptions of a common resource, or potential false positives based on a low name probability score.

During manual review, the inventory is annotated to determine which potential duplicates should be merged, and which low-probability articles should be removed.
During manual review, the inventory is annotated to determine which potential duplicates should be merged, and which low-probability articles should be removed. Instructions for this process are available on Zenodo ([doi: 10.5281/zenodo.7768363](https://doi.org/10.5281/zenodo.7768363)).

## Final Processing

@@ -170,9 +186,13 @@ affiliation_countries | list(string) | Country codes of countries mentioned in a
└── updating_inventory.ipynb
```

# Installation
# Systems

The code for this project was developed using [WSL2](https://learn.microsoft.com/en-us/windows/wsl/install) running an [Ubuntu 20.04](https://releases.ubuntu.com/focal/) distribution. It has also been run on [Google Colaboratory](https://colab.research.google.com/). Compatibility with other systems may vary. In particular, certain functionality (like GNU Make) may not work on Windows.

There are several ways to install the dependencies for this workflow. The workflow has been developed and tested on Linux (Ubuntu 20.04 via Windows System for Linux) and Google Colaboratory.
If you would like to run the code on a Windows machine, we recommend using WSL2. [This protocol](https://www.protocols.io/view/install-wsl-and-vscode-on-windows-10-q26g78e1klwz/v1) may be helpful for getting that set up.

# Installation

## Pip

@@ -236,6 +256,14 @@ $ python3
>>> nltk.download('punkt')
```
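As a quick, optional check that the download succeeded (a minimal example, not part of the pipeline), the `punkt`-backed tokenizer should now work:

```python
from nltk.tokenize import word_tokenize

# word_tokenize depends on the punkt model downloaded above.
print(word_tokenize('BERT models classify articles describing biodata resources.'))
```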

# Allow for execution

To avoid file permission problems, run the following (on Linux) to allow execution of the scripts:

```sh
$ chmod +x src/*.py analysis/*.py analysis/*.R
```

# Running Tests

A full test suite is included to help ensure that everything is running as expected. To run the full test suite, run:
@@ -269,8 +297,6 @@ $ snakemake -s snakemake/train_predict.smk --configfile config/train_predict.yml

The above commands run the Snakemake pipeline. If you wish to run the steps manually, see [src/README.md](src/README.md#training-and-prediction).



## Updating the inventory

Before running the automated pipelines, first update the configuration file [config/update_inventory.yml](config/update_inventory.yml):
@@ -321,12 +347,13 @@ Configurations regarding model training parameters are stored in [config/models_

The EuropePMC query string is stored in [config/query.txt](config/query.txt).

# How to Cite

# Associated publications

The primary article for the biodata resource inventory can be found at https://doi.org/10.5281/zenodo.7768416.
A case study describing the efforts taken to make this project reproducible and to uphold code and data standards can be found at https://doi.org/10.5281/zenodo.7767794.

# Authorship

* [Dr. Heidi Imker]([email protected]), Global Biodata Coalition
* [Kenneth Schackart]([email protected]), Global Biodata Coalition
* [Dr. Kenneth Schackart]([email protected]), Global Biodata Coalition
* [Ana-Maria Istrate]([email protected]), Chan Zuckerberg Initiative
22 changes: 11 additions & 11 deletions requirements.txt
@@ -1,23 +1,23 @@
datasets == 1.18.3
kaleido == 0.2.1
nltk == 3.6.1
numpy == 1.22
numpy == 1.19
pandas == 1.3.5
plotly == 5.1.0
pyyaml
scikit-learn == 0.24.1
seqeval == 1.2.2
snakemake
snakemake == 7.1.1
torch == 1.9.0
transformers == 4.16.2
tqdm == 4.63.0
pycountry == 22.3.5
pytest
flake8
pylint
mypy
pytest-flake8
pytest-pylint
pytest-mypy
requests
urllib3
pytest == 6.2.4
flake8 == 3.9.2
pylint == 2.8.2
mypy == 0.812
pytest-flake8 == 1.0.7
pytest-pylint == 0.18.0
pytest-mypy == 0.8.1
requests == 2.27.1
urllib3 == 1.26.8
2 changes: 1 addition & 1 deletion running_pipeline.ipynb

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions snakemake/train_predict.smk
@@ -46,8 +46,8 @@ rule all_analysis:
rule query_epmc:
output:
query_results=config["query_out_dir"] + "/query_results.csv",
date_file1=config["query_out_dir"] + "/last_query_date.txt",
date_file2=config["last_date_dir"] + "/last_query_date.txt",
date_file1=config["query_out_dir"] + "/last_query_dates.txt",
date_file2=config["last_date_dir"] + "/last_query_dates.txt",
params:
out_dir=config["query_out_dir"],
begin_date=config["initial_query_start"],
4 changes: 2 additions & 2 deletions snakemake/update_inventory.smk
@@ -10,8 +10,8 @@ rule all:
rule query_epmc:
output:
query_results=config["query_out_dir"] + "/query_results.csv",
date_file1=config["query_out_dir"] + "/last_query_date.txt",
date_file2=config["last_date_dir"] + "/last_query_date.txt",
date_file1=config["query_out_dir"] + "/last_query_dates.txt",
date_file2=config["last_date_dir"] + "/last_query_dates.txt",
params:
out_dir=config["query_out_dir"],
query=config["query_string"],
4 changes: 2 additions & 2 deletions src/README.md
@@ -63,7 +63,7 @@ If the query has no placeholders, the `--from-date` and `--to-date` arguments ar

Once the query is completed, two files are created in `--out-dir`:

* `last_query_date.txt`: File with the `--to-date`, defaulting to today's date
* `last_query_dates.txt`: File with the date range used in the query for later reference (formatted as `from_date`-`to_date`)
* `new_query_results.csv`: Containing IDs, titles, abstracts, and first publication dates from query
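For illustration, here is a minimal sketch of that dates-file format; the values are made up, and the real logic lives in `src/query_epmc.py`:

```python
# Illustrative values only; the pipeline derives these from its arguments.
from_date = '2022-01-01'
to_date = '2022-12-31'

# Record the queried date range for later reference (from_date-to_date).
with open('last_query_dates.txt', 'wt') as fh:
    print(f'{from_date}-{to_date}', file=fh)  # writes: 2022-01-01-2022-12-31
```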

# Data Generation
@@ -192,7 +192,7 @@ Articles that have the same URL are marked in the `duplicate_urls` column. The v

## Processing Manually Reviewed Inventory

Once the flagged inventory has been manually reviewed, the determinations made during review are executed (*e.g.* removing certain rows, merging duplicates) by `process_manual_review.py`.
Once the flagged inventory has been manually reviewed according to the instructions on Zenodo ([doi: 10.5281/zenodo.7768363](https://doi.org/10.5281/zenodo.7768363)), the determinations made during review are executed (*e.g.* removing certain rows, merging duplicates) by `process_manual_review.py`.

There are quite a few validations to ensure that the manual review was conducted in a way that can be properly processed. If any errors are discovered during this evaluation, an error message will be given with the ID values of the bad rows, as well as a description of the problem(s).

6 changes: 4 additions & 2 deletions src/check_urls.py
@@ -572,12 +572,14 @@ def test_check_url(testing_session: requests.Session) -> None:
# Bad URLs
url_status = check_url('http://google.com', testing_session)
assert url_status.url == 'http://google.com'
assert url_status.status == 301
assert isinstance(url_status.status, int)
assert url_status.status >= 300

url_status = check_url('https://www.amazon.com/afbadffbaefbnaegn',
testing_session)
assert url_status.url == 'https://www.amazon.com/afbadffbaefbnaegn'
assert url_status.status == 404
assert isinstance(url_status.status, int)
assert url_status.status >= 400
assert url_status.country == ''

# Runtime exception
2 changes: 1 addition & 1 deletion src/class_predict.py
@@ -174,7 +174,7 @@ def main() -> None:

# Predict labels
df = pd.read_csv(open(args.infile.name, encoding='ISO-8859-1'), dtype=str)
df.fillna('', inplace=True)
df = df.fillna('')
df = df[~df.duplicated('id')]
df = df[df['id'] != '']
predicted_labels = predict(model, dataloader, class_labels, device)
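A note on the `fillna` change above: reassignment keeps the data flow explicit and avoids `inplace=True`, which mutates the frame and returns `None`. A small, self-contained illustration with toy data (not from the project):

```python
import pandas as pd

# Toy frame; the real pipeline reads article metadata from a CSV.
df = pd.DataFrame({'id': ['1', None, '1'], 'title': ['a', 'b', None]})

df = df.fillna('')             # reassign rather than mutate in place
df = df[~df.duplicated('id')]  # keep the first row for each id
df = df[df['id'] != '']        # drop rows that had no id
print(df)                      # one row remains: id='1', title='a'
```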
5 changes: 0 additions & 5 deletions src/class_train.py
@@ -208,11 +208,6 @@ def train(settings: Settings,
best_train = train_metrics
best_model = copy.deepcopy(model)

# Stop training once validation F1 goes down
# Overfitting has begun
# if val_metrics.f1 < best_val.f1 and epoch > 0:
# break

epoch_row = pd.DataFrame(
{
'epoch': epoch,
13 changes: 9 additions & 4 deletions src/inventory_utils/metrics.py
@@ -10,6 +10,7 @@
import sys
from typing import Any, List, Optional, cast

import numpy as np
import torch
from datasets import load_metric
from torch.functional import Tensor
@@ -76,6 +77,7 @@ def get_ner_metrics(model: Any, dataloader: DataLoader,
Return:
A `Metrics` NamedTuple
"""
# pylint: disable=too-many-locals
calc_seq_metrics = load_metric('seqeval')
total_loss = 0.
num_seen_datapoints = 0
@@ -84,13 +86,16 @@
with torch.no_grad():
outputs = model(**batch)
num_seen_datapoints += len(batch['input_ids'])
predictions = torch.argmax(outputs.logits, dim=-1) # Diff from class
predictions = predictions.detach().cpu().clone().numpy()
predictions = torch.argmax(outputs.logits, dim=-1)
predictions_array = predictions.detach().cpu().clone().numpy()
predictions_array = cast(np.ndarray, predictions_array)

labels = cast(Tensor, batch['labels'])
labels = labels.detach().cpu().clone().numpy()
labels_array = labels.detach().cpu().clone().numpy()
labels_array = cast(np.ndarray, labels_array)

pred_labels, true_labels = convert_to_tags(predictions, labels)
pred_labels, true_labels = convert_to_tags(predictions_array,
labels_array)

calc_seq_metrics.add_batch(predictions=pred_labels,
references=true_labels)
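One note on the `cast` calls above: `typing.cast` is a no-op at runtime and only informs the type checker, so it must wrap the already-converted array rather than the original tensor. A minimal illustration of that behavior:

```python
from typing import cast

import numpy as np
import torch

tensor = torch.tensor([1, 2, 3])
array = tensor.detach().cpu().clone().numpy()

# cast() returns its argument unchanged; it exists purely for static typing.
typed_array = cast(np.ndarray, array)
assert typed_array is array
print(type(typed_array))  # <class 'numpy.ndarray'>
```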
7 changes: 4 additions & 3 deletions src/inventory_utils/wrangling.py
@@ -245,7 +245,7 @@ def preprocess_data(file: TextIO) -> pd.DataFrame:
sys.exit(f'Data file {file.name} must contain columns '
'labeled "title" and "abstract".')

df.fillna('', inplace=True)
df = df.fillna('')
df = df[~df.duplicated('id')]
df = df[df['id'] != '']

@@ -283,8 +283,9 @@ def test_preprocess_data() -> None:


# ---------------------------------------------------------------------------
def convert_to_tags(batch_predictions: array,
batch_labels: array) -> Tuple[TaggedBatch, TaggedBatch]:
def convert_to_tags(
batch_predictions: np.ndarray,
batch_labels: np.ndarray) -> Tuple[TaggedBatch, TaggedBatch]:
"""
Convert numeric labels to string tags
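To make `convert_to_tags` concrete, here is a hypothetical, simplified illustration; the tag names are assumptions (the project defines its own NER label set), and `-100` follows the common Hugging Face convention for padded positions:

```python
import numpy as np

# Hypothetical id-to-tag mapping; not the project's actual labels.
ID2TAG = {0: 'O', 1: 'B-RES', 2: 'I-RES'}

def to_tags(batch: np.ndarray, ignore_id: int = -100) -> list:
    """Convert numeric label ids to string tags, skipping padded positions."""
    return [[ID2TAG[int(i)] for i in row if i != ignore_id] for row in batch]

print(to_tags(np.array([[1, 2, 0, -100]])))  # [['B-RES', 'I-RES', 'O']]
```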
3 changes: 3 additions & 0 deletions src/ner_final_eval.py
@@ -54,6 +54,9 @@ def get_args():

args = parser.parse_args()

if ".pkl" not in args.test_file:
parser.error(f'Invalid input file "{args.test_file}". Must be .pkl')

return Args(args.test_file, args.checkpoint, args.out_dir)


9 changes: 4 additions & 5 deletions src/ner_train.py
@@ -132,6 +132,10 @@ def get_args() -> Args:

args = parser.parse_args()

for infile in [args.train_file, args.val_file]:
if ".pkl" not in infile:
parser.error(f'Invalid input file "{infile}". Must be .pkl')

return Args(args.train_file, args.val_file, args.out_dir, args.metric,
args.model_name, args.learning_rate, args.weight_decay,
args.num_training, args.num_epochs, args.batch_size,
@@ -242,11 +246,6 @@ def train(settings: Settings,
best_train = train_metrics
best_model = copy.deepcopy(model)

# Stop training once validation F1 goes down
# Overfitting has begun
# if val_metrics.f1 < best_val.f1 and epoch > 0:
# break

epoch_row = pd.DataFrame(
{
'epoch': epoch,
10 changes: 6 additions & 4 deletions src/query_epmc.py
@@ -31,7 +31,7 @@ def get_args() -> Args:

parser = argparse.ArgumentParser(
description=('Query EuropePMC to retrieve articles. '
'Saves csv of results and file of today\'s date'),
'Saves csv of results and file of query dates'),
formatter_class=CustomHelpFormatter)

parser.add_argument('query',
@@ -94,7 +94,7 @@ def make_filenames(outdir: str) -> Tuple[str, str]:
'''

csv_out = os.path.join(outdir, 'query_results.csv')
txt_out = os.path.join(outdir, 'last_query_date.txt')
txt_out = os.path.join(outdir, 'last_query_dates.txt')

return csv_out, txt_out

@@ -196,10 +196,12 @@ def main() -> None:
else:
to_date = args.to_date

results = run_query(args.query, args.from_date, to_date)
from_date = args.from_date

results = run_query(args.query, from_date, to_date)

results.to_csv(out_df, index=False)
print(to_date, file=open(date_out, 'wt'))
print(f"{from_date}-{to_date}", file=open(date_out, 'wt'))

print(f'Done. Wrote 2 files to {out_dir}.')
