Merge pull request #22 from schackartk/inventory_2022_dev
Inventory 2022 dev
schackartk authored Mar 24, 2023
2 parents 3c34dae + 4824e27 commit b74774d
Showing 16 changed files with 87 additions and 53 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2022 anaistrate
Copyright (c) 2022 Chan Zuckerberg Initiative Foundation and Global Biodata Coalition

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
45 changes: 36 additions & 9 deletions README.md
@@ -2,7 +2,11 @@

This code repository represents work done as part of a collaborative effort between the Chan Zuckerberg Initiative (CZI) and Global Biodata Coalition (GBC) to create an inventory of biodata resources found in scientific articles. CZI Research Scientist Ana-Maria Istrate designed the machine learning framework for the project and wrote the code to implement and evaluate the NLP models used to classify articles and extract individual resources. Ana’s code was used by GBC consultant Ken Schackart as the starting point for a pipeline to create an ML-predicted preliminary inventory, which is then further refined with code that includes steps for deduplication, processing for selective manual review, and augmentation with additional attributes to create the final inventory of biodata resources.

## Overview of Methods
## Motivation

GBC initiated this project with the objective of gaining an understanding of the global infrastructure of biological data resources. While registries of data resources exist (such as [re3data](https://www.re3data.org/) and [FAIRsharing](https://fairsharing.org/)), their scopes differ from that intended by GBC. This project was therefore undertaken to create an inventory of the global biodata resource infrastructure using reproducible methodologies, so that the inventory can be periodically updated.

## Overview of methods

EuropePMC is queried to obtain titles and abstracts of scientific articles. A BERT model is used to classify those articles as describing or not describing a biodata resource. A BERT model is also used to perform named entity recognition to extract the resource name from those articles that are predicted to describe a biodata resource. Resource URLs are extracted using a regular expression.
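To make the URL-extraction step concrete, below is a minimal sketch of the idea. The pattern and function names here are illustrative assumptions, not the pipeline's actual implementation (which lives in this repository's `src/` code):

```python
import re

# Illustrative pattern only; the pipeline defines its own regular expression.
URL_PATTERN = re.compile(r'https?://[^\s"\'<>),;]+', re.IGNORECASE)

def extract_urls(text: str) -> list:
    """Return URL-like strings found in an article title or abstract."""
    return URL_PATTERN.findall(text)

print(extract_urls('The database is freely available at http://example.org/db.'))
# ['http://example.org/db.'] -- trailing punctuation may need trimming in practice
```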

@@ -14,6 +18,18 @@ The final inventory gives a list of biodata resources, the PMIDs of the articles

Snakemake is used as a workflow manager to automate these processes.

## Intended uses

The code and pipelines here have been designed with a few intended use cases:

- **Reproduction**: We intend for the results of this study to be directly reproducible using the pipelines presented here. That includes fine-tuning the models on the manually curated datasets, selecting the best model, using the models for prediction, and all downstream processes.

- **Updating**: It should be possible to get an updated inventory with minimal changes to the code. A pipeline was developed that allows the user to provide a new publication date range, and then the fine-tuned models are used to process the new data to yield an updated list of resources and their associated metadata.

- **Generalization**: With some extra work, it should be possible for a future user to manually curate new training data and use the existing pipelines to fine-tune the models and perform all downstream analysis. We note that while much of the existing code would be useful, some changes to the code would likely be necessary in this case.

To help with the usability of this code, it has been tested on Google Colab. If you would like to run the code on Colab, [this protocol](https://dx.doi.org/10.17504/protocols.io.5jyl89o36v2w/v3) provides instructions on how to set up Colab and clone this project there. Note that Google and GitHub accounts are required to follow those instructions.

# Workflow overview

## Data curation
@@ -103,7 +119,7 @@ graph TD

The process up to this point is run without human intervention. As a quality control measure, the inventory must be manually reviewed for articles that are potentially duplicate descriptions of a common resource, or potential false positives based on a low name probability score.

During manual review, the inventory is annotated to determine which potential duplicates should be merged, and which low-probability articles should be removed.
During manual review, the inventory is annotated to determine which potential duplicates should be merged, and which low-probability articles should be removed. Instructions for this process are available on Zenodo ([doi: 10.5281/zenodo.7768363](https://doi.org/10.5281/zenodo.7768363)).

## Final Processing

@@ -170,9 +186,13 @@ affiliation_countries | list(string) | Country codes of countries mentioned in a
└── updating_inventory.ipynb
```

# Installation
# Systems

The code for this project was developed using [WSL2](https://learn.microsoft.com/en-us/windows/wsl/install) running an [Ubuntu 20.04](https://releases.ubuntu.com/focal/) distribution. It has also been run on [Google Colaboratory](https://colab.research.google.com/). Compatibility with other systems may vary. In particular, certain functionality (like GNU Make) may not work on Windows.

There are several ways to install the dependencies for this workflow. The workflow has been developed and tested on Linux (Ubuntu 20.04 via Windows System for Linux) and Google Colaboratory.
If you would like to run the code on a Windows machine, we recommend using WSL2. [This protocol](https://www.protocols.io/view/install-wsl-and-vscode-on-windows-10-q26g78e1klwz/v1) may be helpful for getting that set up.

# Installation

## Pip

@@ -236,6 +256,14 @@ $ python3
>>> nltk.download('punkt')
```
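As a quick, optional check that the download succeeded (a minimal example, not part of the pipeline), the `punkt`-backed tokenizer should now work:

```python
from nltk.tokenize import word_tokenize

# word_tokenize depends on the punkt model downloaded above.
print(word_tokenize('BERT models classify articles describing biodata resources.'))
```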

# Allow for execution

To avoid file permission problems, run the following (on Linux) to allow execution of the scripts:

```sh
$ chmod +x src/*.py analysis/*.py analysis/*.R
```

# Running Tests

A full test suite is included to help ensure that everything is running as expected. To run the full test suite, run:
@@ -269,8 +297,6 @@ $ snakemake -s snakemake/train_predict.smk --configfile config/train_predict.yml

The above commands run the Snakemake pipeline. If you wish to run the steps manually, see [src/README.md](src/README.md#training-and-prediction).



## Updating the inventory

Before running the automated pipelines, first update the configuration file [config/update_inventory.yml](config/update_inventory.yml):
@@ -321,12 +347,13 @@ Configurations regarding model training parameters are stored in [config/models_

The EuropePMC query string is stored in [config/query.txt](config/query.txt).

# How to Cite

# Associated publications

The primary article for the biodata resource inventory can be found at https://doi.org/10.5281/zenodo.7768416.
A case study describing the efforts taken to make this project reproducible and to uphold code and data standards can be found at https://doi.org/10.5281/zenodo.7767794.

# Authorship

* [Dr. Heidi Imker]([email protected]), Global Biodata Coalition
* [Kenneth Schackart]([email protected]), Global Biodata Coalition
* [Dr. Kenneth Schackart]([email protected]), Global Biodata Coalition
* [Ana-Maria Istrate]([email protected]), Chan Zuckerberg Initiative
22 changes: 11 additions & 11 deletions requirements.txt
@@ -1,23 +1,23 @@
datasets == 1.18.3
kaleido == 0.2.1
nltk == 3.6.1
numpy == 1.22
numpy == 1.19
pandas == 1.3.5
plotly == 5.1.0
pyyaml
scikit-learn == 0.24.1
seqeval == 1.2.2
snakemake
snakemake == 7.1.1
torch == 1.9.0
transformers == 4.16.2
tqdm == 4.63.0
pycountry == 22.3.5
pytest
flake8
pylint
mypy
pytest-flake8
pytest-pylint
pytest-mypy
requests
urllib3
pytest == 6.2.4
flake8 == 3.9.2
pylint == 2.8.2
mypy == 0.812
pytest-flake8 == 1.0.7
pytest-pylint == 0.18.0
pytest-mypy == 0.8.1
requests == 2.27.1
urllib3 == 1.26.8
2 changes: 1 addition & 1 deletion running_pipeline.ipynb

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions snakemake/train_predict.smk
@@ -46,8 +46,8 @@ rule all_analysis:
rule query_epmc:
output:
query_results=config["query_out_dir"] + "/query_results.csv",
date_file1=config["query_out_dir"] + "/last_query_date.txt",
date_file2=config["last_date_dir"] + "/last_query_date.txt",
date_file1=config["query_out_dir"] + "/last_query_dates.txt",
date_file2=config["last_date_dir"] + "/last_query_dates.txt",
params:
out_dir=config["query_out_dir"],
begin_date=config["initial_query_start"],
4 changes: 2 additions & 2 deletions snakemake/update_inventory.smk
@@ -10,8 +10,8 @@ rule all:
rule query_epmc:
output:
query_results=config["query_out_dir"] + "/query_results.csv",
date_file1=config["query_out_dir"] + "/last_query_date.txt",
date_file2=config["last_date_dir"] + "/last_query_date.txt",
date_file1=config["query_out_dir"] + "/last_query_dates.txt",
date_file2=config["last_date_dir"] + "/last_query_dates.txt",
params:
out_dir=config["query_out_dir"],
query=config["query_string"],
4 changes: 2 additions & 2 deletions src/README.md
@@ -63,7 +63,7 @@ If the query has no placeholders, the `--from-date` and `--to-date` arguments ar

Once the query is completed, two files are created in `--out-dir`:

* `last_query_date.txt`: File with the `--to-date`, defaulting to today's date
* `last_query_dates.txt`: File with the date range used in the query for later reference (formatted as `from_date`-`to_date`)
* `new_query_results.csv`: Containing IDs, titles, abstracts, and first publication dates from query
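For illustration, here is a minimal sketch of that dates-file format; the values are made up, and the real logic lives in `src/query_epmc.py`:

```python
# Illustrative values only; the pipeline derives these from its arguments.
from_date = '2022-01-01'
to_date = '2022-12-31'

# Record the queried date range for later reference (from_date-to_date).
with open('last_query_dates.txt', 'wt') as fh:
    print(f'{from_date}-{to_date}', file=fh)  # writes: 2022-01-01-2022-12-31
```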

# Data Generation
@@ -192,7 +192,7 @@ Articles that have the same URL are marked in the `duplicate_urls` column. The v

## Processing Manually Reviewed Inventory

Once the flagged inventory has been manually reviewed, the determinations made during review are executed (*e.g.* removing certain rows, merging duplicates) by `process_manual_review.py`.
Once the flagged inventory has been manually reviewed according to the instructions on Zenodo ([doi: 10.5281/zenodo.7768363](https://doi.org/10.5281/zenodo.7768363)), the determinations made during review are executed (*e.g.* removing certain rows, merging duplicates) by `process_manual_review.py`.

There are quite a few validations to ensure that the manual review was conducted in a way that can be properly processed. If any errors are discovered during this evaluation, an error message will be given with the ID values of the bad rows, as well as a description of the problem(s).

6 changes: 4 additions & 2 deletions src/check_urls.py
@@ -572,12 +572,14 @@ def test_check_url(testing_session: requests.Session) -> None:
# Bad URLs
url_status = check_url('http://google.com', testing_session)
assert url_status.url == 'http://google.com'
assert url_status.status == 301
assert isinstance(url_status.status, int)
assert url_status.status >= 300

url_status = check_url('https://www.amazon.com/afbadffbaefbnaegn',
testing_session)
assert url_status.url == 'https://www.amazon.com/afbadffbaefbnaegn'
assert url_status.status == 404
assert isinstance(url_status.status, int)
assert url_status.status >= 400
assert url_status.country == ''

# Runtime exception
2 changes: 1 addition & 1 deletion src/class_predict.py
@@ -174,7 +174,7 @@ def main() -> None:

# Predict labels
df = pd.read_csv(open(args.infile.name, encoding='ISO-8859-1'), dtype=str)
df.fillna('', inplace=True)
df = df.fillna('')
df = df[~df.duplicated('id')]
df = df[df['id'] != '']
predicted_labels = predict(model, dataloader, class_labels, device)
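A note on the `fillna` change above: reassignment keeps the data flow explicit and avoids `inplace=True`, which mutates the frame and returns `None`. A small, self-contained illustration with toy data (not from the project):

```python
import pandas as pd

# Toy frame; the real pipeline reads article metadata from a CSV.
df = pd.DataFrame({'id': ['1', None, '1'], 'title': ['a', 'b', None]})

df = df.fillna('')             # reassign rather than mutate in place
df = df[~df.duplicated('id')]  # keep the first row for each id
df = df[df['id'] != '']        # drop rows that had no id
print(df)                      # one row remains: id='1', title='a'
```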
5 changes: 0 additions & 5 deletions src/class_train.py
@@ -208,11 +208,6 @@ def train(settings: Settings,
best_train = train_metrics
best_model = copy.deepcopy(model)

# Stop training once validation F1 goes down
# Overfitting has begun
# if val_metrics.f1 < best_val.f1 and epoch > 0:
# break

epoch_row = pd.DataFrame(
{
'epoch': epoch,
13 changes: 9 additions & 4 deletions src/inventory_utils/metrics.py
@@ -10,6 +10,7 @@
import sys
from typing import Any, List, Optional, cast

import numpy as np
import torch
from datasets import load_metric
from torch.functional import Tensor
@@ -76,6 +77,7 @@ def get_ner_metrics(model: Any, dataloader: DataLoader,
Return:
A `Metrics` NamedTuple
"""
# pylint: disable=too-many-locals
calc_seq_metrics = load_metric('seqeval')
total_loss = 0.
num_seen_datapoints = 0
@@ -84,13 +86,16 @@
with torch.no_grad():
outputs = model(**batch)
num_seen_datapoints += len(batch['input_ids'])
predictions = torch.argmax(outputs.logits, dim=-1) # Diff from class
predictions = predictions.detach().cpu().clone().numpy()
predictions = torch.argmax(outputs.logits, dim=-1)
predictions_array = predictions.detach().cpu().clone().numpy()
predictions_array = cast(np.ndarray, predictions_array)

labels = cast(Tensor, batch['labels'])
labels = labels.detach().cpu().clone().numpy()
labels_array = labels.detach().cpu().clone().numpy()
labels_array = cast(np.ndarray, labels_array)

pred_labels, true_labels = convert_to_tags(predictions, labels)
pred_labels, true_labels = convert_to_tags(predictions_array,
labels_array)

calc_seq_metrics.add_batch(predictions=pred_labels,
references=true_labels)
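One note on the `cast` calls above: `typing.cast` is a no-op at runtime and only informs the type checker, so it must wrap the already-converted array rather than the original tensor. A minimal illustration of that behavior:

```python
from typing import cast

import numpy as np
import torch

tensor = torch.tensor([1, 2, 3])
array = tensor.detach().cpu().clone().numpy()

# cast() returns its argument unchanged; it exists purely for static typing.
typed_array = cast(np.ndarray, array)
assert typed_array is array
print(type(typed_array))  # <class 'numpy.ndarray'>
```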
7 changes: 4 additions & 3 deletions src/inventory_utils/wrangling.py
@@ -245,7 +245,7 @@ def preprocess_data(file: TextIO) -> pd.DataFrame:
sys.exit(f'Data file {file.name} must contain columns '
'labeled "title" and "abstract".')

df.fillna('', inplace=True)
df = df.fillna('')
df = df[~df.duplicated('id')]
df = df[df['id'] != '']

@@ -283,8 +283,9 @@ def test_preprocess_data() -> None:


# ---------------------------------------------------------------------------
def convert_to_tags(batch_predictions: array,
batch_labels: array) -> Tuple[TaggedBatch, TaggedBatch]:
def convert_to_tags(
batch_predictions: np.ndarray,
batch_labels: np.ndarray) -> Tuple[TaggedBatch, TaggedBatch]:
"""
Convert numeric labels to string tags
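To make `convert_to_tags` concrete, here is a hypothetical, simplified illustration; the tag names are assumptions (the project defines its own NER label set), and `-100` follows the common Hugging Face convention for padded positions:

```python
import numpy as np

# Hypothetical id-to-tag mapping; not the project's actual labels.
ID2TAG = {0: 'O', 1: 'B-RES', 2: 'I-RES'}

def to_tags(batch: np.ndarray, ignore_id: int = -100) -> list:
    """Convert numeric label ids to string tags, skipping padded positions."""
    return [[ID2TAG[int(i)] for i in row if i != ignore_id] for row in batch]

print(to_tags(np.array([[1, 2, 0, -100]])))  # [['B-RES', 'I-RES', 'O']]
```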
3 changes: 3 additions & 0 deletions src/ner_final_eval.py
@@ -54,6 +54,9 @@ def get_args():

args = parser.parse_args()

if ".pkl" not in args.test_file:
parser.error(f'Invalid input file "{args.test_file}". Must be .pkl')

return Args(args.test_file, args.checkpoint, args.out_dir)


9 changes: 4 additions & 5 deletions src/ner_train.py
@@ -132,6 +132,10 @@ def get_args() -> Args:

args = parser.parse_args()

for infile in [args.train_file, args.val_file]:
if ".pkl" not in infile:
parser.error(f'Invalid input file "{infile}". Must be .pkl')

return Args(args.train_file, args.val_file, args.out_dir, args.metric,
args.model_name, args.learning_rate, args.weight_decay,
args.num_training, args.num_epochs, args.batch_size,
@@ -242,11 +246,6 @@ def train(settings: Settings,
best_train = train_metrics
best_model = copy.deepcopy(model)

# Stop training once validation F1 goes down
# Overfitting has begun
# if val_metrics.f1 < best_val.f1 and epoch > 0:
# break

epoch_row = pd.DataFrame(
{
'epoch': epoch,
10 changes: 6 additions & 4 deletions src/query_epmc.py
@@ -31,7 +31,7 @@ def get_args() -> Args:

parser = argparse.ArgumentParser(
description=('Query EuropePMC to retrieve articles. '
'Saves csv of results and file of today\'s date'),
'Saves csv of results and file of query dates'),
formatter_class=CustomHelpFormatter)

parser.add_argument('query',
@@ -94,7 +94,7 @@ def make_filenames(outdir: str) -> Tuple[str, str]:
'''

csv_out = os.path.join(outdir, 'query_results.csv')
txt_out = os.path.join(outdir, 'last_query_date.txt')
txt_out = os.path.join(outdir, 'last_query_dates.txt')

return csv_out, txt_out

@@ -196,10 +196,12 @@ def main() -> None:
else:
to_date = args.to_date

results = run_query(args.query, args.from_date, to_date)
from_date = args.from_date

results = run_query(args.query, from_date, to_date)

results.to_csv(out_df, index=False)
print(to_date, file=open(date_out, 'wt'))
print(f"{from_date}-{to_date}", file=open(date_out, 'wt'))

print(f'Done. Wrote 2 files to {out_dir}.')
