software identifiers #69

Closed
mr-c opened this issue Jan 16, 2017 · 15 comments

mr-c commented Jan 16, 2017

Hello,

@matuskalas and I have been talking about the best way to represent identifiers for software, such as DOIs, CPEs or RRIDs.

CPE identifiers are used to track security vulnerabilities.

We think the DOI field should be replaced with a list of identifiers, each entry consisting of the name of the identifier and the identifier itself.

identifiers:
 - type: DOI
   value: 10.5281/zenodo.208182
 - type: RRID
   value: SCR_001156
 - type: CPE
   value: cpe:2.3:a:basespace_ruby_sdk_project:basespace_ruby_sdk:*:*:*:*:*:ruby:*:*

Thoughts?
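
For illustration only, here is a minimal Python sketch of how a consumer might hold and query such a list. The `IdentifierType` enum, the `Identifier` class and the `values_of` helper are hypothetical names made up for this example; they are not part of any schema.

```python
from dataclasses import dataclass
from enum import Enum


class IdentifierType(Enum):
    """Identifier kinds named in the proposal (hypothetical enum)."""
    DOI = "DOI"
    RRID = "RRID"
    CPE = "CPE"


@dataclass(frozen=True)
class Identifier:
    """One entry of the proposed list: the identifier's type and its value."""
    type: IdentifierType
    value: str


# The three example entries from the proposal above.
identifiers = [
    Identifier(IdentifierType.DOI, "10.5281/zenodo.208182"),
    Identifier(IdentifierType.RRID, "SCR_001156"),
    Identifier(
        IdentifierType.CPE,
        "cpe:2.3:a:basespace_ruby_sdk_project:basespace_ruby_sdk:*:*:*:*:*:ruby:*:*",
    ),
]


def values_of(kind: IdentifierType, entries: list[Identifier]) -> list[str]:
    """Return all values of one identifier type; there may be zero or many."""
    return [entry.value for entry in entries if entry.type is kind]
```

Because the container is a list rather than a single DOI field, adding another identifier kind would only mean adding an enum member.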

@joncison
Member

joncison commented Jan 16, 2017

+1

I like this a lot. However, it would be a breaking change, and I promised folks we'd restrict those to once a year (i.e. it could go in the 3.x.x / Dec 2017 release).

joncison self-assigned this Nov 2, 2017
@joncison
Member

joncison commented Nov 2, 2017

Heads-up @mr-c, @matuskalas and @ekry

I will (in biotools_dev.xsd) create summary->identifier (0...many) with identifier having the (mandatory) elements of value (with appropriate regexes) and type (an enum). I propose the following:

  1. Refactor existing summary->doi element (breaking change)
  2. Refactor existing summary->toolID (the bio.tools toolID) (breaking change) : @ekry @hans are there any implications here (beyond it being a breaking change generally) ?
  3. Add RRID and CPE as suggested : can someone please tell me appropriate XSD-compatible regexes ??
  4. Any other identifiers I should add ?

A valid value will have to be specified (otherwise bio.tools will choke) - so help on the regexes for any identifiers would be much appreciated.

@joncison
Member

joncison commented Nov 2, 2017

PS. Currently there must be exactly one toolID, but in the new proposed model summary->identifier would be optional, i.e. 0...many. So I need a steer here from @ekry and @hansioan on how / whether bio.tools would cope with that.

@joncison
Member

joncison commented Nov 2, 2017

PPS: as for regexes I have:

  • RRID:.+ (it needs the prefix, right? as SciCrunch aren't the only provider, at least in principle, then it can't be RRID:SCR_ ... otherwise please advise )
  • cpe:.+
    i.e. basically the prefix, then any old crap. Can I be more restrictive?
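
A rough way to try these patterns out, sketched in Python. The two loose patterns are the ones proposed above. `CPE_STRICT` is a hypothetical tighter pattern added purely for illustration: it assumes the CPE 2.3 formatted-string layout (the `cpe:2.3:` prefix, a part of `a`, `h` or `o`, then ten more colon-separated fields) and does not handle fields containing escaped colons. XSD patterns match the whole value, hence `fullmatch`.

```python
import re

# Permissive patterns as proposed in the comment above.
RRID_LOOSE = re.compile(r"RRID:.+")
CPE_LOOSE = re.compile(r"cpe:.+")

# Hypothetical stricter CPE pattern (an assumption, not taken from the schema):
# "cpe:2.3:", a part of a/h/o, then ten further colon-separated fields.
# Simplification: fields containing escaped colons are not handled.
CPE_STRICT = re.compile(r"cpe:2\.3:[aho](:[^:]+){10}")


def is_valid(pattern: re.Pattern, value: str) -> bool:
    """Check the whole value against the pattern, as an XSD pattern facet would."""
    return pattern.fullmatch(value) is not None


example_cpe = "cpe:2.3:a:basespace_ruby_sdk_project:basespace_ruby_sdk:*:*:*:*:*:ruby:*:*"
print(is_valid(CPE_LOOSE, example_cpe), is_valid(CPE_STRICT, example_cpe))
print(is_valid(CPE_LOOSE, "cpe:anything"), is_valid(CPE_STRICT, "cpe:anything"))
```

The loose pattern accepts both values, while the stricter sketch rejects `cpe:anything`.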

@joncison
Member

joncison commented Nov 2, 2017

So the proposed model is:
[image: capture]

you'll notice I've moved version there, i.e. the version label can now be associated with a specific identifier. I think this could be useful for various integration scenarios.

joncison pushed a commit that referenced this issue Nov 2, 2017
implements #69, #73 and partially implements #82
@mr-c
Author

mr-c commented Nov 3, 2017

(Tagging in @stain to correct anything I get wrong)

In CWL we model software identifiers as URIs (really, a list of strings) which has the following advantages:

  • We don't have to update our model to add another provider or registry
  • Anyone can determine that we are referring to a particular identifier by pattern matching on the full URI without having to know anything about our file format
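
To make the pattern-matching point concrete, here is a small Python sketch. The `URI_PATTERNS` table and the `classify` function are hypothetical, and the resolver prefixes (identifiers.org for RRIDs, doi.org for DOIs) are assumptions chosen for the example, not a list defined by CWL.

```python
import re
from typing import Optional

# Illustrative URI patterns; the resolver prefixes are assumptions made for
# this sketch, not something CWL itself prescribes.
URI_PATTERNS = {
    "RRID": re.compile(r"https?://identifiers\.org/rrid/RRID:.+"),
    "DOI": re.compile(r"https?://doi\.org/10\..+"),
}


def classify(uri: str) -> Optional[str]:
    """Return the identifier kind a URI appears to belong to, or None."""
    for kind, pattern in URI_PATTERNS.items():
        if pattern.fullmatch(uri):
            return kind
    return None


print(classify("https://identifiers.org/rrid/RRID:SCR_005829"))
print(classify("https://doi.org/10.5281/zenodo.208182"))
```

A consumer that does not recognise a URI can still keep it as an opaque string, which is why no model change is needed for a new registry.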

@joncison
Member

joncison commented Nov 3, 2017

Thanks Michael, that sounds sensible, and is pretty much what we have above, except that value is just an xs:token and can be explicitly typed. I just wonder whether it is / will always be the case that tool identifiers can be disambiguated via regex? Anyhow, I could make type optional above (making it a bit easier to use). cc @ekry @hansioan : what do you think?

@joncison
Member

joncison commented Nov 4, 2017

type now optional, thus:
[image: capture]

@matuskalas
Member

matuskalas commented Nov 10, 2017

Reopening, as this needs at least a little bit of fixing.

A couple of thoughts:

  • One nice example of these various IDs working together can be seen in Debian Med (thx @tillea and @smoe), visible e.g. at https://blends.debian.org/med/tasks/bio (see the colourful links in the bottom left corner of a tool record) and recorded e.g. in https://anonscm.debian.org/cgit/debian-med/bowtie2.git/tree/debian/upstream/metadata.

  • @mr-c @stain Could you please share 1 or more representative examples of tool IDs in CWL?

  • All these ID types should be included into EDAM under Identifier, with regexes, and a new "URI scheme" attribute. Todo: open an issue there.

  • bio.tools ID can be modelled in biotoolsSchema as either URI or a plain ID value. Both make sense and no strong opinions about either option. It should be defined as an xs:simpleType with Linked-Data- and therefore xs:NCName-compatible regex (see Update facets of IDs #79 (comment) and Update facets of IDs #79 (comment)). This needs fixing, ideally now, and also in the bio.tools data.

  • For consideration is whether biotoolsSchema should model biotoolsIDs separately from "external" IDs, or not. Both options have pros and cons.

  • Other, "external" IDs: either kept open (as URIs (software identifiers #69 (comment)), as idType+idValue (which works for URI-less IDs such as RRID), or best as an xs:choice of the two), or defined as separate xs:simpleTypes with regexes and xs:appinfos (appinfos incl. URI schemes, definitions, ...). Even then, keeping the model open for other IDs would be good.

  • DOIs correspond to releases, not to tools as a whole (the brand). N.B. they are not typically assigned by SW developers or service providers (https://github.com/bio-tools/biotoolsSchema/blob/68248eb7a55b41e246f6710d05bd77bccc8b5753/biotools_dev.xsd line 179). Can be fixed now.

  • Question to @smoe @mr-c @stain @tillea: Are RRIDs or CPEs specific to a release, i.e. a version of a tool? DOIs are, but Zenodo also assigns an "overall", i.e. version-unspecific, DOI, which however resolves to the most recent tool version and its DOI. OMICtools IDs aren't version-specific, and neither are SEQwiki pages (which will hopefully be up again soon). This point implies that

matuskalas reopened this Nov 10, 2017
@joncison
Member

  • For modelling of bio.tools ID (and other ID types) see recent changes and comments in Update facets of IDs #79 (comment). I've gone down the route of insisting on CURIEs for all identifiers, i.e. a syntax of "prefix:value" (e.g. "biotools:signalp"), which are then (or at least should be) trivially mappable either to something resolvable (e.g. https://bio.tools/signalp) or to the vanilla and xs:NCName-compatible identifier (e.g. "signalp").

  • For now bio.tools IDs are lumped with the others. Indeed pros and cons, I'm not too fussed.

  • Comment on identifier changed to "A unique identifier of the software, typically assigned by an ID-assignment authority."

  • Generally, IDs may or may not be version-specific; it's possible to specify version in model above (but see Consider associating version with publication #76)

So, I close this again, for now, until we get any specific additional changes for biotoolsSchema.
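
The CURIE mapping described in the first bullet could look roughly like this in Python. The `RESOLVERS` table and both function names are hypothetical; only the bio.tools base URL comes from this thread, and the doi.org entry is an assumption added for illustration.

```python
# Hypothetical prefix-to-resolver table; only the bio.tools base URL is taken
# from this thread, the doi.org entry is an assumption for illustration.
RESOLVERS = {
    "biotools": "https://bio.tools/",
    "doi": "https://doi.org/",
}


def split_curie(curie: str) -> tuple[str, str]:
    """Split "prefix:value" at the first colon; the value may contain colons."""
    prefix, sep, value = curie.partition(":")
    if not sep or not prefix or not value:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, value


def to_url(curie: str) -> str:
    """Map a CURIE to a resolvable URL using the table above."""
    prefix, value = split_curie(curie)
    return RESOLVERS[prefix.lower()] + value


prefix, bare_id = split_curie("biotools:signalp")
print(bare_id)                      # the vanilla identifier
print(to_url("biotools:signalp"))   # the resolvable form
```

Splitting at the first colon only matters for values such as CPEs, which contain further colons after the prefix.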

@mr-c
Author

mr-c commented Nov 10, 2017

To answer @matuskalas's questions:

CWL example of a software identifier:

hints:
  SoftwareRequirement:
    packages:
      interproscan:
        specs: [ "https://identifiers.org/rrid/RRID:SCR_005829" ]
        version: [ "5.21-60" ]

https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/blob/master/tools/InterProScan5.21-60.cwl#L25

Software RRIDs are not assigned to specific releases, but to the idea of a particular piece of software

@joncison
Member

UPDATE
biotoolsID and version information (for the bio.tools entry itself) are in their own elements. Various discussions concluded this would lead to less confusion / fewer problems. Thus:
[image: capture]

@matuskalas
Member

Element type is redundant, as these are in CURIE form.

<xs:element name="type" minOccurs="0">

I suggest its removal, unless there is a use case of keeping it.

I'd say either value+type or curie. If resolvable HTTP(S) URIs have been ruled out, that is.

@joncison
Member

@hans, @ekry & I discussed this yesterday. We could only safely remove it if we assume all future otherIDs are indeed CURIEable. We weren't 100% sure, so we left it in (and also because it's useful to have an explicit enum of the types). I think we could support HTTP(s) URIs in the existing model, if needed, because they have a prefix with a colon, i.e. "https:" or "http:".

So I think we leave it as-is for now. If specified in a payload to PUT or POST the otherID->type value will simply be ignored.
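
As a sketch of why a supplied type can be ignored when values are in CURIE form, the following Python derives the type from the value's prefix. The `normalise` function and `KNOWN_PREFIXES` set are hypothetical; how bio.tools actually handles payloads is not shown in this thread.

```python
# The three identifier kinds discussed in this thread, as lowercase prefixes.
KNOWN_PREFIXES = {"doi", "rrid", "cpe"}


def normalise(entry: dict) -> dict:
    """Return a copy of an otherID entry with type derived from the value.

    Any client-supplied "type" is discarded, mirroring the behaviour described
    above; this is an illustration, not the real service's code.
    """
    value = entry["value"]
    prefix = value.partition(":")[0].lower()
    return {"value": value, "type": prefix if prefix in KNOWN_PREFIXES else None}


print(normalise({"value": "RRID:SCR_001156", "type": "doi"}))
print(normalise({"value": "https://bio.tools/signalp"}))
```

An HTTP(S) URI splits the same way, giving the prefix "https", so it fits the "prefix:value" shape even though it is not one of the known kinds here.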

@matuskalas
Member

Ok, @joncison. By leaving the type in, we allow for further backwards-compatible additions of un-CURIE-able and un-URI-able IDs :-)

However, as a note, allowing any other types of IDs and|or HTTP(s) URIs will also need a (backwards-compatible) change of the current schema, as it only allows the 3 patterns (doi, rrid, cpe).
