DOE OSTI DOIs for input4MIPs #177
FYI to self, 210 DOIs were issued by the CMIP6-era citation service for input4MIPs, so maybe we need to bump up the ~100 number we've discussed - see https://www.wdc-climate.de/ui/statistics?type=cmip6_doi_registration. Also relevant is the CMIP6 Data Citation and Long-Term Archival wiki - https://redmine.dkrz.de/projects/cmip6-lta-and-data-citation/wiki |
Hi @jitendra-kumar. Just circling back on this task: is there any progress to report? We have a project meeting tomorrow, so I was keen to update the data providers about the status and timing |
Here's a summary of fields we need information for to register with OSTI. Much (but not all) of this information exists within the JSONs in this repo, so we can pull the information together from the existing JSONs and create a new JSON with everything needed to register a DOI for each dataset. Among the fields needed:
Product Description:
Dataset Location:
|
@jitendra-kumar, that's great. What is the best/easiest format for this info to be collated, considering this first pass is going to be manual copy-and-paste — text files or another format? |
We should put the information together in a JSON, and that would allow us to automate the process at a later date. And even for the short term, I can extract everything needed from that for manual entry. |
We will need to create .html landing pages. The .json could be used to render those. We would then need to put together a template, then push those pages to gh-pages. This could be done with GitHub Actions. |
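For what it's worth, a minimal sketch of that rendering step (Python; assumes one JSON file per dataset in a hypothetical doi_metadata/ directory and a stub HTML template, with field names following the schema proposal below):

import json
from pathlib import Path
from string import Template

# Stub template; a real one would live in the repo and be fleshed out properly
PAGE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><p>$description</p><p>Authors: $authors</p></body></html>"
)

def render_landing_pages(metadata_dir: Path, output_dir: Path) -> None:
    # One landing page per dataset JSON; a GitHub Actions step would then push output_dir to gh-pages
    output_dir.mkdir(parents=True, exist_ok=True)
    for metadata_file in sorted(metadata_dir.glob("*.json")):
        record = json.loads(metadata_file.read_text())
        page = PAGE.substitute(
            title=record["dataset_title"],
            description=record["description"],
            authors=", ".join(f"{a['first_name']} {a['last_name']}" for a in record["authors"]),
        )
        (output_dir / f"{metadata_file.stem}.html").write_text(page)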
Do you have any ideas for the schema @jitendra-kumar? E.g. do certain fields need to be strings/booleans/lists etc.? I think that is the key. Once we have the schema, writing data to match it is relatively trivial. Even just something like the below.
Schema proposal:
from attrs import define
@define
class Author:
first_name: str
last_name: str
orcid: str # would we also validate this, probably a good idea if easy
affiliation: str
affiliation_ror: str | None # optional for anyone whose institute isn't registered
@define
class Product:
dataset_title: str
authors: list[Author]
related_dois: list[str] # should validate that these are DOIs
originating_research_organisation: str # I find this field a bit weird, given most things have multiple authors therefore source organisations and the authors have affiliations anyway
publication_date: str # YYYY-MM-DD I guess?
sponsoring_organisation: str # as above re needing multiple and info already being in author info. Also unclear to me what the difference from the other orgs is so I would suggest making this optional if we can
keywords: list[str]
geolocation: tuple[float, float] # what do we put here? Lat/lon co-ords? Would suggest making this optional or dropping if we can
description: str
dataset_location: tuple[float, float] # what do we put here? Lat/lon co-ords? Would suggest making this optional or dropping if we can. Or do I misunderstand this field?
@define
class Dataset:
url: str # validate this is a URL
extension: str
size: float # in bytes I guess? |
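To make the proposal concrete, here is a sketch of filling the schema and dumping it to JSON (all values are placeholders, not real dataset metadata; attrs.asdict handles the nested Author instances):

import json
from attrs import asdict

product = Product(
    dataset_title="Placeholder input4MIPs dataset",
    authors=[
        Author(
            first_name="Jane",
            last_name="Doe",
            orcid="0000-0000-0000-0000",
            affiliation="Example Institute",
            affiliation_ror=None,
        )
    ],
    related_dois=[],
    originating_research_organisation="Example Institute",
    publication_date="2025-01-01",
    sponsoring_organisation="Example Funder",
    keywords=["input4MIPs", "CMIP7"],
    geolocation=(0.0, 0.0),
    description="Short description that links back to existing documentation.",
    dataset_location=(0.0, 0.0),
)
# Serialise for whatever the OSTI registration step ends up needing
print(json.dumps(asdict(product), indent=2))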
@znichollscr working on this schema to be consistent with what OSTI wants. Will have something to share soon. |
@jitendra-kumar just FYI, our first CMIP7 final dataset has just been published with no DOI (see here). So if we have a very quick solution to implement, now is the time! |
@durack1 I looked at the three published datasets and can get the published metadata from the ESGF catalog. I am also able to extract most of the fields necessary to register the DOI from the input4MIPs CVs (in this repository). However, the published datasets are not reflected in the main branch of this repository; can the appropriate branches/forks with the CMIP7 mip_era be merged into main? Two outstanding attributes are still needed --
Or add affiliation and ORCID to the current format. ORCID can be an optional field:
Need something that I can parse. So if the CVs can be updated for title and author fields, I will be able to get us a DOI very quickly. Also, as a quick fix for these three datasets published so far, you can send me Title and Authors information directly via email and I can register DOIs for them and add them as a new |
We also need a dataset description, or a way to assemble one from the CVs. This was the description used for past input4MIPs datasets. @durack1 can you provide an updated description to use? E.g. from https://www.wdc-climate.de/ui/cmip6?input=input4MIPs.CMIP6.RFMIP.UColorado.UColorado-RFMIP-0-4
|
Done now. To set your expectations for the future: in general, the lag is 1-2 days. In this case it was a week because I was on leave. We're on the fence about whether this should be a fully automated process or not (see e.g. the prototype here, which we decided not to pursue right now #115). Before continuing, a bit of background on the rationale for this, which will help explain the recommendations below.
Background/justification for the below
Some datasets are already minting their own DOIs (CEDS, GHGs, aerosol simple plumes). Hence, we can't create a workflow that only works if the DOI comes from this process (which is why the proposed quick fix isn't the way to go: we can't put a DOI in for each source ID, as some providers mint DOIs at a higher level of resolution than source IDs). This is a fallback, so I would keep it simple. Let's mint DOIs at the mip-era / source-ID level. The combination of MIP era and source ID should be unique, so this is a totally fine way that will let us keep track of things without having to mint heaps of DOIs. This then makes the title easy to create and easy for a human to read. I would also keep the description short. We have lots of information already. Let's not duplicate it (which would then force us to think about how to update it). Rather, let's link back to existing information sources that we already have as much as possible. Re the authors, I had wondered when our unstructured contacts would bite us. Looks like that day has come. I'll fix that now.
Recommendations
Dataset description (I would start with this and update it once we know what information users who hit the DOI are actually interested in, rather than spending lots of time trying to anticipate this in advance, because I think we really don't know what context someone who lands on this DOI page will already have):
|
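As a sketch of how simple the title becomes at the mip-era / source-ID level (the exact wording here is only an assumption, not an agreed format):

def doi_title(mip_era: str, source_id: str) -> str:
    # Unique as long as the (mip_era, source_id) combination is unique
    return f"input4MIPs {mip_era} forcing dataset: {source_id}"

print(doi_title("CMIP7", "SOLARIS-HEPPA-CMIP-4-6"))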
@znichollscr : where should I look for the |
Ah, that's a good point: it's not a 1:1 mapping. You could scrape it from the ESGF data and then just put in a list of values. Honestly, I would probably just drop that sentence from the description. |
Also, is/will there be a CMIP table/database for CMIP7, equivalent to the CMIP6Plus URL in your example above? |
Good pick up. I'll add one tomorrow. |
@jitendra-kumar just circling back on this, we already have one CMIP7 dataset published into ESGF, SOLARIS-HEPPA-CMIP-4-6, and so we've missed being able to assign a DOI within the published files. We have volcanic, land use, greenhouse gas, anthropogenic emissions, biomass burning, and several other datasets being prepared right now, so what're the chances we can get ~10 DOIs issued now so these can be written to the files pre-publication? |
|
@durack1 : I need a few additional pieces of information added/updated in the CVs, per my last comment above. Else, send me the list of
I have a DOI minted for
@jitendra-kumar thinking about this and seeing how it is evolving: can you tell us what you need? (I.e. give me a schema to fill out, and I'll auto-generate it. Having you parse the CVs isn't going to be efficient, I don't think.) Including everything in the CVs is going to be a pain for us and likely not helpful. It's going to be much faster for me to just compile what you need than for us to pollute our CVs the way we're going.
Thanks, good pick up.
Given we can update the metadata after the DOI is minted, can we just mint the DOIs and we'll update the metadata afterwards? |
You can auto-fill the schema for input4MIPs but we won't be able to do that with other projects. We need to be able to do this for projects other than input4MIPs and will be using the CVs as our source of information. The CVs already have most of what we need. We prefer not to have to hard code
For minting DOIs, yes, we can update the metadata later, but we need some minimal information (dataset name, authors -- which can be edited later) to populate the draft. As long as the dataset is in |
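A rough sketch of pulling the minimal fields from the source-ID CVs, assuming a flat mapping from source ID to metadata with the key names shown in the entries later in this thread (the file path and title wording are assumptions):

import json
from pathlib import Path

def doi_draft_fields(cvs_path: Path, source_id: str) -> dict:
    # Extract the subset of CV metadata needed to populate an OSTI draft record
    entries = json.loads(cvs_path.read_text())
    entry = entries[source_id]
    return {
        "title": f"input4MIPs {entry['mip_era']} forcing dataset: {source_id}",
        "authors": entry.get("authors", []),
        "contact": entry.get("contact"),
        "further_info_url": entry.get("further_info_url"),
        "license": entry.get("license_id"),
    }

fields = doi_draft_fields(Path("CVs/input4MIPs_source_id.json"), "SOLARIS-HEPPA-CMIP-4-6")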
Let me make a demo. I think we're on the same page. My point is just that we don't need to have all of this in the source ID CVs; we can grab it from a few places. Anyway, I'll do a demo and then we can see.
Grab the following:
They're obviously old, but the information will be mostly correct and applicable for the next version. As long as we can update metadata, then they're the perfect starting point |
@jitendra-kumar one other question. Can you make this for SOLARIS-HEPPA-CMIP-4-7 ? We don't republish data as far as I understand (that can create complete confusion so we either publish something new or don't change the data). |
@sashakames @climate-dude may be able to comment on best way to retract/republish. But yes I can change SOLARIS-HEPPA-CMIP-4-6 to SOLARIS-HEPPA-CMIP-4-7 in the DOI metadata which is still in draft form. |
Thanks. We're comfortable with the process and work with Sasha a lot already. We just don't do it, to avoid confusion. For input4MIPs, it's either a new version or nothing, to avoid there being multiple versions of the same thing out there. |
Great to know, thanks |
So, thinking about it more, this is the process I imagine we'll end up with:
|
Is this new data or a replica of existing data? If the files do not change we don't want to cut a new source_id; otherwise you would need to start over -- or will that happen at LLNL? |
Sorry @sashakames, shouldn't have pulled you in with no context. Anyway, now you're here: we would only cut a new source ID if we create new files. The question is whether we do that or not, and on that we haven't decided yet (is it worth creating new files just to get a DOI attribute? We'll probably just ask the data provider once we've worked out the various issues here). |
Thanks. If there are new files, we need to decide whether to do the original publish at ORNL vs LLNL. Also, we may want to populate a citation_url with the landing page and an xlink record with the .json so Metagrid points at something. |
Sounds good, but I don't fully follow. Is this something that we do on the ESGF side or that you're suggesting we put in the file? |
Trying to keep track of things, the demo is in #200 |
@jitendra-kumar sorry, just catching up: SOLARIS-HEPPA-CMIP-4-6 is our first mip_era=CMIP7 or "final" dataset that will be used in the production CMIP7 simulations, see input4MIPs_CVs/CVs/input4MIPs_source_id.json lines 486 to 504 in 2a1439f.
In the ESGF SOLR index, and in the file:
$ ncdump -h input4MIPs/CMIP7/CMIP/SOLARIS-HEPPA/SOLARIS-HEPPA-CMIP-4-6/atmos/mon/multiple/gn/v20250219/multiple_input4MIPs_solar_CMIP_SOLARIS-HEPPA-CMIP-4-6_gn_185001-202312.nc | grep mip_era
:mip_era = "CMIP7" ;
Previous versions were prototype mip_era=CMIP6Plus data, e.g., input4MIPs_CVs/CVs/input4MIPs_source_id.json lines 467 to 485 in 2a1439f.
EDIT: ah, ok, so there was a metadata issue in our CVs, https://github.com/PCMDI/input4MIPs_CVs/pull/198/files |
Here are DOIs for these three datasets. We would need a better way to track these. Create a separate markdown table? I am having to add them to the CVs, but these are only placeholders with old source IDs.
|
We'll write them in the files ideally, then they'll be captured and appear in various places, e.g. https://input4mips-controlled-vocabularies-cvs.readthedocs.io/en/stable/database-views/input4MIPs_source-id_CMIP7.html, and I can also pull them out so they appear in other auto-generated places, e.g. https://input4mips-controlled-vocabularies-cvs.readthedocs.io/en/stable/dataset-overviews/anthropogenic-slcf-co2-emissions/. They don't need to go in the CVs: we will capture them in other ways. |
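If we do end up writing the DOI into the files before publication, a minimal sketch with netCDF4 (the global attribute name "doi" is an assumption here, not a confirmed input4MIPs convention):

import netCDF4

def add_doi_attribute(path: str, doi: str) -> None:
    # Append a DOI global attribute in place, before the file is published to ESGF
    with netCDF4.Dataset(path, mode="a") as ds:
        ds.setncattr("doi", doi)

add_doi_attribute(
    "multiple_input4MIPs_solar_CMIP_SOLARIS-HEPPA-CMIP-4-6_gn_185001-202312.nc",
    "10.25981/ESGF.input4MIPs.CMIP7/2522675",
)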
Sure, that's fine. Let me know as and when you have more datasets ready for minting DOIs, and when there is updated information for these three. |
Cool I fixed this up in #205.
10.25981/ESGF.input4MIPs.CMIP7/2522675 should be updated to:
"SOLARIS-HEPPA-CMIP-4-6":{
"activity_id": "input4MIPs",
"authors": [
{
"name": "Bernd Funke",
"email": "[email protected]",
"affiliations": [
"Instituto de Astrofísica de Andalucía, CSIC, Granada, Spain"
],
"orcid": "0000-0003-0462-4702"
}
],
"contact":"[email protected]",
"dataset_category":["solar"],
"further_info_url":"http://solarisheppa.geomar.de/cmip7",
"institution_id":"SOLARIS-HEPPA",
"license_id":"CC BY 4.0",
"mip_era":"CMIP7",
"target_mip":"CMIP",
"source_version":"4.6"
},
10.25981/ESGF.input4MIPs.CMIP7/2522673 should be updated to:
"UOEXETER-CMIP-2-0-0":{
"activity_id": "input4MIPs",
"authors": [
{
"name": "Thomas Aubry",
"email": "[email protected]",
"affiliations": [
"Faculty of Environment, Science and Economy, University of Exeter, Exeter, EX4 4QF, UK"
],
"orcid": "0000-0002-9275-4293"
}
],
"contact":"[email protected]",
"dataset_category":["aerosolProperties", "emissions"],
"further_info_url":"https://input4mips-controlled-vocabularies-cvs.readthedocs.io/en/latest/dataset-overviews/stratospheric-volcanic-so2-emissions-aod/",
"institution_id":"uoexeter",
"license_id":"CC BY 4.0",
"mip_era":"CMIP7",
"source_version":"2.0.0"
},
Thanks! |
From #197, this will be the entry for 10.25981/ESGF.input4MIPs.CMIP7/2524040:
"DRES-CMIP-BB4CMIP7-2-0":{
"activity_id": "input4MIPs",
"authors": [
{
"name": "Margreet van Marle",
"email": "[email protected]",
"affiliations": [
"Deltares, Delft, the Netherlands"
],
"orcid": "0000-0001-7473-5550"
},
{
"name": "Guido van der Werf",
"email": "[email protected]",
"affiliations": [
"Wageningen University and Research, Meteorology and Air Quality, Wageningen, Netherlands"
],
"orcid": "0000-0001-9042-8630"
}
],
"contact":"[email protected], [email protected]",
"dataset_category":["emissions"],
"further_info_url":"https://www.globalfiredata.org",
"institution_id":"DRES",
"license_id":"CC BY 4.0",
"mip_era":"CMIP7",
"source_version":"2.0"
}, |
Just looping back from an email thread, we currently have the status below (I will update this table to reflect the current status); this HTML view is the latest one here. @jitendra-kumar @znichollscr ping
|
@durack1 @znichollscr @climate-dude I had not realized that the For minting DOIs for list below..
|
Yep, the easiest would be to watch this repo for "Releases". We're trying to behave ourselves to make sure that any data that is published into ESGF is followed promptly by a webpage/repo update and a release. |
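If watching for releases, a small sketch of polling the GitHub releases API for this repo (a webhook or a scheduled Action would work just as well):

import requests

def latest_release_tag(repo: str = "PCMDI/input4MIPs_CVs") -> str:
    # The most recent release marks a good point to mint or refresh DOI metadata
    response = requests.get(f"https://api.github.com/repos/{repo}/releases/latest", timeout=30)
    response.raise_for_status()
    return response.json()["tag_name"]

print(latest_release_tag())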
@jitendra-kumar that's our fault, sorry. We've been so focussed on getting data sets out that minting DOIs has fallen off our plate.
@durack1 as a note, this isn't ideal, because by the time this happens it's too late to mint a DOI that can actually go in the file. Summarising what we need next:
Other notes:
Correct, it will not be reflected until #213 is merged (here is the preview https://input4mips-controlled-vocabularies-cvs--213.org.readthedocs.build/en/213/database-views/input4MIPs_source-id_CMIP7.html). For what it's worth, https://doi.org/10.25981/ESGF.input4MIPs.CMIP7/2521499 does not resolve for me, so I think there is something wrong. @jitendra-kumar can you please also address the requests made here: #177 (comment) |
Regarding #177 comment:
For now we will stick to linking to Metagrid, and within ESGF we are discussing potential ways to offer better landing pages. These two DOIs are finalized and will be active shortly. UofMD-landState-3-1: https://doi.org/10.25981/ESGF.input4MIPs.CMIP7/2521499 is still in draft stage and is not finalized yet, since this dataset is not in the CVs yet. I can do that once it appears, I assume with the merge of #213. |
Then, back to something we were discussing: autogenerate the landing page from the CV info and host it in GitHub Pages. Autogenerate the Metagrid links as part of the page generation using the URL syntax via the "Copy Search" feature. |
Read the Docs / via mkdocs, but yes, I agree.
Great, thanks |
@jitendra-kumar can you try pointing the DOI entries' landing page to a URL of the form https://input4mips-cvs.readthedocs.io/en/stable/source-id-landing-pages/{source_id}/? e.g. https://input4mips-cvs.readthedocs.io/en/stable/source-id-landing-pages/CR-CMIP-1-0-0/. That should provide us with a landing page that has more metadata than the raw ESGF page. |
@jitendra-kumar as an FYI, UofMD-landState-3-1-1 (#229) used the same DOI as UofMD-landState-3-1 so no need for a new DOI (reusing a DOI isn't ideal, but in this case it's ok because the difference is so small) |
Just adding a placeholder issue, so we can centralize information about what the DOI OSTI service requires from authors to get a DOI issued.
We can then update the source_id and institution_id registration info, with the additional fields
ping @jitendra-kumar @sashakames