SciCrunch data

Data processing script

This directory contains human-curated data from SciCrunch relating software to organizations, plus a script that populates each entry's most likely organizations and ROR IDs. To run the script:

$ python3 enrich_sci_crunch_csv.py --input=scicrunch_working_file_software.csv --format=minimal
# This will output a file called "scicrunch_working_file_minimal.csv" that can be fed into the consolidate_links.py
# script.  To produce an enriched version of the original csv, use --format=full and it outputs "scicrunch_working_file_enriched.csv"

Ground truth

RRID to ROR Software Mapping Data. Please cite as: Bandrowski 2023, Zenodo, DOI:10.5281/zenodo.10048228

scicrunch_working_file_software.csv is a file that has been extracted from the SciCrunch Registry, accessible on the web: https://scicrunch.org/resources

The data are human curated and filtered by the "Software" additional resource type. These data are openly available, and any resource can be freely accessed through the SciCrunch resolver or another resolving service without an API key (please limit how often you access it, or you may be banned). Example: https://scicrunch.org/resolver/[RRID]; for bots, add .json at the end.
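The resolver access described above could be scripted roughly as follows. This is a minimal sketch, not part of this repository: the helper names `resolver_url` and `fetch_records`, the one-second pause, and the placeholder identifier `RRID:SCR_000000` are all assumptions for illustration. The source only says to be kind with the number of requests; it does not state a rate limit, and the structure of the returned JSON is not described here, so the sketch returns it unparsed beyond `json.load`.

```python
import json
import time
import urllib.request

RESOLVER_BASE = "https://scicrunch.org/resolver/"


def resolver_url(rrid: str, as_json: bool = True) -> str:
    """Build the resolver URL for an RRID; append .json for machine access."""
    url = RESOLVER_BASE + rrid.strip()
    return url + ".json" if as_json else url


def fetch_records(rrids, delay_seconds: float = 1.0):
    """Yield (rrid, parsed_json) pairs, pausing between requests.

    The delay is an assumed courtesy value, not a documented limit.
    """
    for i, rrid in enumerate(rrids):
        if i:
            time.sleep(delay_seconds)  # wait before every request after the first
        with urllib.request.urlopen(resolver_url(rrid), timeout=30) as resp:
            yield rrid, json.load(resp)


# Hypothetical placeholder identifier, used only to show the URL shape.
print(resolver_url("RRID:SCR_000000"))
```

Because `fetch_records` is a generator, no request is sent until it is iterated, which makes it easy to stop early or to wrap it in additional caching.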

The data contains the columns described below:

| Field name | Description | Field type |
| --- | --- | --- |
| scr_id | SciCrunch registry identifier; can be used to pull metadata via n2t.net/RRID:SCR_$### | text |
| original_id | original identifier | text |
| type | organization or resource | text |
| parent_organization_id | typically only parents come from ROR entities (e.g., a university, not a program) | text |
| Resource_Name | unique across the registry for all curated items; resource names that are the same follow rules specifying how to augment the name, usually the university name or vendor name goes first (e.g., Graphpad Prism) | longtext |
| Defining_Citation | a manuscript written about the resource | longtext |
| Supercategory | "Resource" for all of these data | longtext |
| Species | not usually used for software, but the main species that is covered by the resource | longtext |
| Related_Disease | not usually used for software, but the main disease that is covered by the resource | longtext |
| Additional_Resource_Types | type hierarchy: https://bioportal.bioontology.org/ontologies/NIFSTD?p=classes&conceptid=http%3A%2F%2Furi.neuinfo.org%2Fnif%2Fnifstd%2Fnlx_res_20090101 | longtext |
| Synonyms | comma-delimited list | longtext |
| Abbreviation | comma-delimited list | longtext |
| Keywords | comma-delimited list | longtext |
| Resource_URL | current URL | longtext |
| Availability | license information if an explicit license is not available; may also indicate that the resource is not in service | longtext |
| Related_Application | typically for biological applications | longtext |
| Funding_Information | ignore; there probably are not enough of these in software to worry about | longtext |
| Publication_Link | link to defining citation field | longtext |
| Twitter_Handle | Twitter handle without the at sign | longtext |
| Alternate_URLs | other URLs that may be documentation or other instances of the tool | longtext |
| Terms_Of_Use_URLs | URL to the terms of use | longtext |
| Old_URLs | URLs that are no longer used | longtext |
| Alternate_IDs | any identifiers that are not RRIDs; can be resolved by the RRID resolver; comma-delimited | longtext |
| Comment | curation comment | longtext |
| Social_URLs | social media URLs | longtext |
| Supporting_Agency | the funding agency that supports the resource | longtext |
| Editorial_Note | ignore | longtext |
| Canonical_ID | the RRID in CURIE syntax | longtext |
| License | explicit license | longtext |
| relationship_strings | bar-delimited list of relationship labels for the entries in resources_scr_ids | mediumtext |
| resources_scr_ids | bar-delimited list of other resources that are related to this RRID via the relationship listed in the relationship_strings field | text |
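The delimited columns described above can be unpacked with a few lines of Python. The following is a sketch under stated assumptions: it treats "bar" as the `|` character, and it assumes the entries of relationship_strings and resources_scr_ids line up by position, which the column descriptions suggest but do not state outright. The helper names, the sample row, and the relationship labels in it are invented for illustration and do not come from the registry.

```python
import csv
import io


def split_list(value, sep):
    """Split a delimited cell into trimmed, non-empty items."""
    return [part.strip() for part in (value or "").split(sep) if part.strip()]


def parse_row(row):
    """Unpack the list-valued columns of one CSV row."""
    labels = split_list(row.get("relationship_strings"), "|")
    targets = split_list(row.get("resources_scr_ids"), "|")
    return {
        "scr_id": row["scr_id"],
        "synonyms": split_list(row.get("Synonyms"), ","),
        # Assumption: label i describes the link to target i.
        "relationships": list(zip(labels, targets)),
    }


# Hypothetical sample data with only the columns used above.
SAMPLE = io.StringIO(
    "scr_id,Synonyms,relationship_strings,resources_scr_ids\n"
    'SCR_000001,"Tool A, ToolA",is related to|is listed by,SCR_000002|SCR_000003\n'
)

records = [parse_row(r) for r in csv.DictReader(SAMPLE)]
print(records[0])
```

To run this against the real file, replace `SAMPLE` with `open("scicrunch_working_file_software.csv", newline="", encoding="utf-8")`; the encoding is an assumption. Note that `split_list` drops empty items, so if a row has blank entries between bars the positional pairing would need a stricter split.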