Micro-API to query a COVID-19 preprint's publication status

lanbufan/Upload-or-Publish

Upload-or-Publish

Micro-API to query a COVID-19 preprint's publication status

UoP API

The micro-API service, hosted on PythonAnywhere, offers a free, programmatic way for a client to query the publication status of a COVID-19 preprint (or a batch of COVID-19 preprints). The API returns a JSON dictionary with CORD-19's enhanced metadata. If the preprint's publication status is positive, the returned JSON also contains metadata pertaining to the published article. The UoP API, at least in its current implementation, requires no keys or authentication. Please be mindful, however, that this is an unfunded initiative run by a single person.

This tool is currently in its beta-version. It is in active development.

Preprint Coverage

The current version (vbeta) covers three preprint servers: arXiv, bioRxiv, and medRxiv. I use the COVID-19 Open Research Dataset (CORD-19) to match preprints from those three repositories with their final published counterparts. There are dozens upon dozens of online preprint repositories, so one of my first goals is to extend UoP's repository coverage. The NIH iSearch COVID-19 publication database is a great first step in that direction since it includes three preprint servers not covered by the CORD-19 dataset, namely ChemRxiv, SSRN, and ResearchSquare.

Institutional User

Adding your preprint server to UoP

If you are the manager, admin, or developer in charge of a preprint repository and you would like to see the metadata of 'your' COVID-19 preprint manuscripts added to the UoP API, please email me. In a nutshell, what I would need is a CSV containing the metadata of all your COVID-19 preprints (title, authors, DOI, etc.) using the CORD-19 metadata formatting. I am aware that some preprint repositories have API capabilities, but at this point I am NOT planning to extend the UoP coverage by scraping websites or building API pipelines. I have neither the resources nor the time to build that kind of data architecture.

Adding functionalities to UoP

The current querying functions of the API are rather limited (see documentation below). If you have suggestions about a function you would like to use, please feel free to contact me.

Methodology

Context

As the COVID-19 pandemic persists around the world, the scientific community continues to produce and circulate knowledge on the deadly disease at an unprecedented rate. During the early stage of the pandemic, preprints represented nearly 40% of the English-language COVID-19 scientific corpus (6,000+ preprints | 16,000+ articles). As of mid-August 2020, that proportion had dropped to around 28% (13,000+ preprints | 49,000+ articles). Nevertheless, preprint servers remain a key engine in the efficient dissemination of scientific work on this infectious disease. But, given the 'uncertified' nature of the scientific manuscripts curated on preprint repositories, their integration into the global ecosystem of scientific communication does not come without serious tensions. This is especially the case for biomedical knowledge, since the dissemination of bad science can have widespread societal consequences.

Data Etiquette

In the spirit of open science, and especially in the context of the COVID-19 pandemic, I developed a free API. I am running this out of my own pocket. My current plan with PythonAnywhere allows for 100,000 API queries per day. I strongly encourage intelligent and mindful use. Don't be stupid. Don't query the same data point over and over. Don't use overkill parallel processing that will overload the server. If you notice that your requests are not working anymore, just stop your program, ok! Finally, please use a user-agent header that identifies you as a user, including your email. I reserve the right to restrict or block clients that do not follow this etiquette.
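The etiquette above can be sketched as a small client. This is only one possible way to follow it, not an official UoP client: the `USER_AGENT` string, the function names, and the one-second pause are my own assumptions, and the block uses only the standard library so it runs without extra packages (`requests` would work just as well).

```python
import json
import time
import urllib.request

UOP_URL_BASE = "http://heibufan.pythonanywhere.com/json/pp_meta/"
# Hypothetical identification string: replace it with your own name and email.
USER_AGENT = "my-research-script/0.1 (mailto:you@example.org)"


def fetch_record(doi):
    """Query the UoP route for one preprint DOI and return the parsed JSON."""
    request = urllib.request.Request(UOP_URL_BASE + doi,
                                     headers={"User-Agent": USER_AGENT})
    # urlopen raises HTTPError on a failed request, which stops the program
    # instead of hammering a server that is no longer answering.
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def query_politely(dois, fetch=fetch_record, pause_seconds=1.0):
    """Query each distinct DOI once, one at a time, pausing between requests."""
    cache = {}
    for doi in dois:
        if doi in cache:      # never ask for the same data point twice
            continue
        if cache:             # pause before every request after the first
            time.sleep(pause_seconds)
        cache[doi] = fetch(doi)
    return cache
```

Calling `query_politely(['10.1101/2020.03.19.998179'])` would return a dictionary keyed by DOI. The `fetch` argument exists so the loop can be tried out with a stand-in function before any real request is sent.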

I relied on "rxivist.org/docs" to write this section. Please note that Rxivist drew on the Crossref API documentation to craft their own.

How to cite UoP API

If you use UoP data in your research, please cite:

Upload-or-Publish: Lachapelle, F. (2020). COVID-19 Preprints and Their Publishing Rate: An Improved Method. medRxiv. 1-34. doi: https://doi.org/10.1101/2020.09.04.20188771.

CORD-19 Project: Wang, L. L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., ... & Mooney, P. (2020). CORD-19: The Covid-19 Open Research Dataset. ArXiv.

How to use the UoP API {beta}

The current beta-version only allows one route "http://heibufan.pythonanywhere.com/json/pp_meta/doi"

For example, if a client wants to determine the publication status of a specific COVID-19 preprint using its DOI, the URL should be: http://heibufan.pythonanywhere.com/json/pp_meta/10.1101/2020.03.19.998179

The most important returned metadata is match_status: True = the preprint has a peer-reviewed published counterpart; False = the preprint does not have one (see documentation below).

The returned JSON will look like this:

{"result": {"indx_pp": 11811,
            "indx_pr": 26,
            "ti_pp": "molecular characterization of sars-cov-2 in the first covid-19 cluster in france reveals an amino-acid deletion in nsp2 (asp268del)",
            "ti_pr": "molecular characterization of sars-cov-2 in the first covid-19 cluster in france reveals an amino acid deletion in nsp2 (asp268del)",
            "fuzz_score": 99,
            "no_fuzz_test": 1,
            "no_fuzz_test_above": 1,
            "prop_au_match": 1.0,
            "z_fuzzy_test_history": [],
            "au_pp": "Bal, Antonin; Destras, Gr\u00e9gory; Gaymard, Alexandre; Bouscambert-Duchamp, Maude; Valette, Martine; Escuret, Vanessa; Frobert, Emilie; Billaud, Genevi\u00e8ve; Trouillet-Assant, Sophie; Cheynet, Val\u00e9rie; Brengel-Pesce, Karen; Morfin, Florence; Lina, Bruno; Josset, Laurence",
            "au_pr": "Bal, A.; Destras, G.; Gaymard, A.; Bouscambert-Duchamp, M.; Valette, M.; Escuret, V.; Frobert, E.; Billaud, G.; Trouillet-Assant, S.; Cheynet, V.; Brengel-Pesce, K.; Morfin, F.; Lina, B.; Josset, L.",
            "source_x_pp": "biorxiv",
            "source_x_pr": "pmc",
            "journal_pp": "bioRxiv",
            "journal_pr": "Clin Microbiol Infect",
            "pub_time_pp": "3/21/2020",
            "pub_time_pr": "3/28/2020",
            "cord_uid_pp": "wnh6h9f0",
            "cord_uid_pr": "4c0zwhdh",
            "sha_pp": NaN,
            "sha_pr": NaN,
            "pmcid_pp": NaN,
            "pmcid_pr": "PMC7142683",
            "pubmedid_pp": NaN,
            "pubmedid_pr": 32234449.0,
            "doi_pp": "10.1101/2020.03.19.998179",
            "doi_pr": "10.1016/j.cmi.2020.03.020",
            "diff_day": 7,
            "internal_method": "fuzzy",
            "match_status": true,
            "cord_19_version": "2020_08_12",
            "fuzzy_matching_date": "2020_08_12"}}
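
Note that missing values in the response above appear as NaN, which is not part of strict JSON. Python's json module accepts it by default, but other parsers may not. The sketch below shows one way to parse such a response and turn the NaN placeholders into None. The sample text is an abridged copy of the response above, and `clean_record` is a hypothetical helper name, not part of the API.

```python
import json
import math

# Abridged copy of the response shown above (most fields omitted).
sample_text = '''{"result": {"doi_pp": "10.1101/2020.03.19.998179",
                             "doi_pr": "10.1016/j.cmi.2020.03.020",
                             "pmcid_pp": NaN,
                             "pubmedid_pr": 32234449.0,
                             "match_status": true}}'''


def clean_record(text):
    """Parse a UoP response and replace NaN placeholders with None."""
    record = json.loads(text)["result"]  # json.loads accepts the NaN literal
    return {key: None if isinstance(value, float) and math.isnan(value) else value
            for key, value in record.items()}


record = clean_record(sample_text)
if record["match_status"]:
    print("Published as", record["doi_pr"])
else:
    print("No published counterpart found")
```

After parsing, the JSON `true` becomes the Python boolean `True`, so `match_status` can be tested directly.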

How to use the UoP API {beta} - Simple Code Example in Python

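import json
import requests

UoP_url_base = 'http://heibufan.pythonanywhere.com/json/pp_meta/'

# DOIs of the preprints to query; add further DOIs (doi2, doi3, etc.) as strings
l_pp_to_query = ['10.1101/2020.03.19.998179']

for pp_doi in l_pp_to_query:

    # Build the query URL by appending the preprint DOI to the base route
    url_query = f'{UoP_url_base}{pp_doi}'
    raw_data = requests.get(url_query)
    # Stop on any non-200 answer rather than retrying
    if raw_data.status_code != 200:
        raise Exception("HTTP code " + str(raw_data.status_code))
    # Parse the response body; the metadata sits under the "result" key
    json_data = json.loads(raw_data.text)
    print(pp_doi, json_data["result"]["match_status"])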

Documentation

The most important returned metadata is match_status: True = the preprint has a peer-reviewed published counterpart; False = the preprint does not have one.

'pp' stands for preprint
'pr' stands for peer-review
'(c)' indicates that the metadata comes from the CORD-19 dataset

"indx_pp": internal working id for admin
"indx_pr": internal working id for admin
"ti_pp": title of preprint (c)
"ti_pr": title of published/peer-review counterpart (c)
"fuzz_score": fuzzy matching score obtained by comparing both titles
"no_fuzz_test": total number (raw) of fuzzy matching scores produced (see methods)
"no_fuzz_test_above": total number of fuzzy matching scores produced that were above the cut-off point of 0.60
"prop_au_match": proportion of the preprint's authors' last names that were found in the list of authors of the pr article
"z_fuzzy_test_history": list/array of results of all fuzzy matching tests performed if >1
"au_pp": list of authors (preprint) (c)
"au_pr": list of authors (published version) (c)
"source_x_pp": bibliometric source where CORD-19 got metadata from (c)
"source_x_pr": ibid.
"journal_pp": journal venue (c)
"journal_pr": ibid.
"pub_time_pp": date of preprint upload (c); note: a preprint can have multiple uploaded versions. I still need to validate that CORD-19 always uses the v1 date
"pub_time_pr": date of peer-reviewed article's publication (c)
"cord_uid_pp": preprint's id (c) - note: not a unique id
"cord_uid_pr": peer-review's id (c) - note: not a unique id
"sha_pp": id and name of pdf json of preprint (c)
"sha_pr": id and name of pdf json of article (c)
"pmcid_pp": pub med central id (c)
"pmcid_pr": ibid.
"pubmedid_pp": pub med id (c)
"pubmedid_pr": pub med id (c)
"doi_pp": digital object identifier (DOI); note: arXiv does not automatically generate a DOI for the preprint manuscripts it hosts (see methods)
"doi_pr": ibid.
"diff_day": difference in days between preprint upload and final publication
"internal_method": (see methods)
"match_status": True: pp has a pr, False: pp has no pr
"cord_19_version": version of CORD-19 dataset used for matching algo.
"fuzzy_matching_date": date when the fuzzy matching code was performed
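
To make a few of the fields above more concrete, here is a rough sketch of how values like prop_au_match, diff_day, and fuzz_score could be recomputed from a returned record. This is not the matching code behind UoP: the function names are hypothetical, the date format is inferred from the example response, and difflib from the standard library is only a stand-in because the fuzzy matching library is not named here.

```python
from datetime import datetime
from difflib import SequenceMatcher


def last_names(authors):
    """Lower-cased last names from a 'Last, First; Last, First' string."""
    return {name.split(",")[0].strip().lower()
            for name in authors.split(";") if name.strip()}


def prop_au_match(au_pp, au_pr):
    """Share of the preprint's last names also present in the published list."""
    pp, pr = last_names(au_pp), last_names(au_pr)
    return len(pp & pr) / len(pp) if pp else 0.0


def diff_day(pub_time_pp, pub_time_pr):
    """Days between preprint upload and publication (dates as m/d/yyyy)."""
    fmt = "%m/%d/%Y"
    delta = datetime.strptime(pub_time_pr, fmt) - datetime.strptime(pub_time_pp, fmt)
    return delta.days


def title_similarity(ti_pp, ti_pr):
    """Stand-in similarity score on a 0-100 scale, using difflib."""
    return round(100 * SequenceMatcher(None, ti_pp, ti_pr).ratio())


# Values abridged from the example response shown earlier.
print(prop_au_match("Bal, Antonin; Lina, Bruno; Josset, Laurence",
                    "Bal, A.; Lina, B.; Josset, L."))   # 1.0
print(diff_day("3/21/2020", "3/28/2020"))               # 7
```

Because the stand-in scorer differs from whatever produced fuzz_score, its numbers should not be expected to match the API's values exactly.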

Contributing

Please feel free to email me if you have any questions or if you are interested in contributing.

Authors

  • Francois Lachapelle - subFIELD.lab | PhD Cand. University of British Columbia, Vancouver, Canada

License

This project is licensed under the Apache License 2.0 - see the LICENSE.md file for details

Acknowledgments

At its foundation, this project is a Record Linkage initiative. Therefore, it would not be possible without the great work of researchers at:

CORD-19 Project: Wang, L. L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., ... & Mooney, P. (2020). CORD-19: The Covid-19 Open Research Dataset. ArXiv.

For advisory support:

  • Adam Howe - Statistics Canada | UBC

For inspirational support (as in, ah ok, building an API is cool and doable)

  • Elian Carsenat - NAMSOR
  • Abdill RJ, Blekhman R. - Rxivist

For moral support:

  • Heather Thom - Simon Fraser University
  • Philippe Lachapelle - Centre Hospitalier Universitaire de Quebec
