OHE attributes with existing GSC validation #858

Open
lschriml opened this issue Sep 10, 2024 · 3 comments

@lschriml
Member

Requested in the Google group [gensc-cig] by Ruth Timme (GSC Board member)

Issue raised via email:
https://groups.google.com/g/gensc-cig/c/POQWoEXEP2c/m/ToqrFMLJAAAJ?utm_medium=email&utm_source=footer&pli=1

Hi NCBI BioSample,

I'm including Lynn Schriml here, for awareness on this question regarding existing GSC validation on the following attributes, also included in the One Health Enteric BioSample package.

indoor_surf [cabinet|ceiling|counter top|door|shelving|vent cover|window|wall]
surf_material [adobe|carpet|cinder blocks|concrete|hay bales|glass|metal|paint|plastic|stainless steel|stone|stucco|tile|vinyl|wood]

We'd like to start curating our own picklists for these attributes, specific for our use cases. While our use cases overlap significantly with what's included here, we need to expand beyond this term list. We'd also like to perform our own validation.

Can we remove NCBI validation on these attributes when they are included in OHE package submissions?

How shall we proceed here? This question is triggered by a recent validation error we received on SUB14608130. We discussed this a few years back (see below), but I never followed up on solving it.

==========================================================================

From: Timme, Ruth <[email protected]>
Sent: Tuesday, May 10, 2022 4:18 PM
To: Pennerman, Kayla * <[email protected]>; Anderson, John B (NIH) <[email protected]>
Cc: Barrett, Tanya (NIH) <[email protected]>
Subject: Re: [EXTERNAL] OHE examples

Hi John, replying with more specific answers to your questions:

NEW ISSUE: Sorry we didn’t notice this previously, but the One Health Enteric package includes the following attributes that are already in our system and already use picklists.

[1] These attributes were originally provided by the GSC. We now see that some of the terms in your picklists differ from the GSC picklists, which are:
building_setting [GSC] [urban|suburban|exurban|rural]
indoor_surf [GSC] [cabinet|ceiling|counter top|door|shelving|vent cover|window|wall]
surf_material [GSC] [adobe|carpet|cinder blocks|concrete|hay bales|glass|metal|paint|plastic|stainless steel|stone|stucco|tile|vinyl|wood]

We’ll default to the GSC picklists for now on these three attributes, and I’ll follow up with GSC + Lynn Schriml to get the picklists for these terms expanded.

[2] And host_gender currently has a picklist defined by the INSDC:

host_gender [INSDC] [male|female|pooled male and female|neuter|hermaphrodite|intersex]

We could possibly edit this list to at least match GSC host_gender picklist, which I believe is:
host_gender [GSC] [female|hermaphrodite|non-binary|male|neuter|transgender|transgender (female to male)|transgender (male to female)|undeclared]

We’ll revert to using host_sex, which has a more applicable working picklist, and I’ll work with the GSC to harmonize going forward.

Do you think you might be able to work with the GSC to harmonize your lists? As it is, any values you supply that don’t match existing picklists will fail submission validation.

One way to stop this delaying submissions in the short term would be to omit these fields from uploads, and update at a later date once picklists have been harmonized.

Thanks,
John

===================================================================

From: "Pennerman, Kayla *" <[email protected]>
Date: Tuesday, May 10, 2022 at 4:11 PM
To: "Anderson, John B (NIH)" <[email protected]>
Cc: "Timme, Ruth" <[email protected]>, "Barrett, Tanya (NIH)" <[email protected]>
Subject: RE: [EXTERNAL] OHE examples

Hi John,

We removed the ontological accessions from all picklists and excess terms from the specified picklists. Please let us know if there are still issues to address.

Thank you,
Kayla

========================================================

From: Anderson, John (NIH/NLM/NCBI) [E] <[email protected]>
Sent: Thursday, May 5, 2022 8:36 AM
To: Pennerman, Kayla * <[email protected]>
Cc: Timme, Ruth <[email protected]>; Barrett, Tanya (NIH) <[email protected]>
Subject: RE: [EXTERNAL] OHE examples

Hi Kayla,

Thanks for sending us your latest reference guide, template and vocabulary picklists.

We hadn’t considered stripping them from all the picklists. We are modeling our package off the SARS-CoV-2 clinical package.

To pass validation this side, the ontology information will only have to be stripped from fields where NCBI supports picklists. The values provided for these fields have to be an exact match with values in our picklists, regardless of the package they are submitted under (including SARS-CoV-2). Stripping ontology information is not necessary for free text fields. The full list of attributes recognized by our system is provided at https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/ (this page is not yet updated with new OneHealthEnteric attributes). The exact picklist values we support are listed in the ‘Format’ field (note, no ontology information is included within those picklist values). Note that attributes are package-agnostic at NCBI - they follow the same rules regardless of which package they participate in.
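The exact-match rule described above can be sketched roughly as follows. This is only an illustration, not NCBI's validator: the regular expression, the function names, and the case-sensitive comparison are assumptions, and the `EXAMPLE:0000001` accession in the usage note is made up. The picklist is the surf_material list quoted in this thread.

```python
import re

# surf_material picklist as quoted in this thread.
SURF_MATERIAL = {
    "adobe", "carpet", "cinder blocks", "concrete", "hay bales", "glass",
    "metal", "paint", "plastic", "stainless steel", "stone", "stucco",
    "tile", "vinyl", "wood",
}

# Assumed shape of an appended ontology accession: " [PREFIX:digits]" at the end.
ONTOLOGY_SUFFIX = re.compile(r"\s*\[[A-Za-z_]+:\d+\]\s*$")


def strip_ontology_id(value: str) -> str:
    """Remove one trailing ontology accession, e.g. ' [FOODON:03510026]'."""
    return ONTOLOGY_SUFFIX.sub("", value).strip()


def passes_picklist(value: str, picklist: set[str]) -> bool:
    """Exact match against the picklist (case sensitivity is an assumption)."""
    return value in picklist
```

With this sketch, `passes_picklist("glass [EXAMPLE:0000001]", SURF_MATERIAL)` is false, while the same value passes once `strip_ontology_id` has been applied. Free-text fields would skip the check entirely.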

NEW ISSUE: Sorry we didn’t notice this previously, but the One Health Enteric package includes the following attributes that are already in our system and already use picklists.

[1] These attributes were originally provided by the GSC. We now see that some of the terms in your picklists differ from the GSC picklists, which are:
building_setting [GSC] [urban|suburban|exurban|rural]
indoor_surf [GSC] [cabinet|ceiling|counter top|door|shelving|vent cover|window|wall]
surf_material [GSC] [adobe|carpet|cinder blocks|concrete|hay bales|glass|metal|paint|plastic|stainless steel|stone|stucco|tile|vinyl|wood]

[2] And host_gender currently has a picklist defined by the INSDC:

host_gender [INSDC] [male|female|pooled male and female|neuter|hermaphrodite|intersex]

We could possibly edit this list to at least match GSC host_gender picklist, which I believe is:
host_gender [GSC] [female|hermaphrodite|non-binary|male|neuter|transgender|transgender (female to male)|transgender (male to female)|undeclared]

Do you think you might be able to work with the GSC to harmonize your lists? As it is, any values you supply that don’t match existing picklists will fail submission validation.

One way to stop this delaying submissions in the short term would be to omit these fields from uploads, and update at a later date once picklists have been harmonized.

Thanks,
John

=================================================

Hi Kayla,

Thanks for the file. It’s great. I tested it and found a few things we need to clear up:

  1. We need the finalized picklist for "indoor surface". Did you settle on the exact term for the pooled sample?

We also need finalized picklists for "building setting", "indoor surface", "surface material", and "host_gender".

  2. The new attribute 'sequenced by' is optional, right?

  3. In your test input, you still have the Ontology IDs appended to some picklist values, e.g., human as food consumer [FOODON:03510026]. Ruth said it wouldn't be a problem for you to strip these out before submitting, so all future submissions will not have those, correct?

  4. We noticed that your template omits 3 optional attributes. Should these attributes still be included in our template?

indoor_surf_subpart

serovar

surface_orientation

  5. Regarding your green and orange test items, here’s what I saw:

food_processing_method = "food (ground) [FOODON:00002713];food (frozen) [FOODON:03302148]" passed validation. We're not validating that attribute at all, so it’s OK to have multiple entries.

host_age 13 passed. We don't require units.

latitude and longitude 50 N 20 N failed
latitude and longitude USA:WI failed
latitude and longitude 2 3 passed because it was automatically converted to 2 N 3 E

also one you didn't have in orange:
latitude and longitude 120 S 90 W failed because latitude 120 is impossible
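A rough sketch of a check that reproduces the four latitude/longitude outcomes above. The accepted syntax, the function names, and the rule that a bare non-negative number defaults to N or E are assumptions inferred from these results, not the actual NCBI implementation.

```python
import re
from typing import Optional

# Assumed syntax: "<lat> [N|S] <lon> [E|W]"; hemisphere letters are optional.
LAT_LON = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*([NS])?\s+(-?\d+(?:\.\d+)?)\s*([EW])?\s*$"
)


def _axis(text: str, letter: Optional[str], positive: str,
          negative: str, limit: float) -> Optional[str]:
    """Validate one coordinate and return it as '<number> <hemisphere>'."""
    value = float(text)
    if letter is None:
        # Bare number: infer the hemisphere from the sign (assumption).
        letter = negative if value < 0 else positive
        value = abs(value)
    elif value < 0:
        return None  # a sign plus a hemisphere letter is treated as invalid
    if value > limit:
        return None  # e.g. a latitude of 120 is impossible
    return f"{text.lstrip('-')} {letter}"


def normalize_lat_lon(value: str) -> Optional[str]:
    """Return a normalized 'lat N|S lon E|W' string, or None if invalid."""
    match = LAT_LON.match(value)
    if match is None:
        return None
    lat_text, ns, lon_text, ew = match.groups()
    lat = _axis(lat_text, ns, "N", "S", 90.0)
    lon = _axis(lon_text, ew, "E", "W", 180.0)
    if lat is None or lon is None:
        return None
    return f"{lat} {lon}"


print(normalize_lat_lon("2 3"))         # 2 N 3 E
print(normalize_lat_lon("50 N 20 N"))   # None: second value needs E or W
print(normalize_lat_lon("USA:WI"))      # None: not a coordinate pair
print(normalize_lat_lon("120 S 90 W"))  # None: latitude out of range
```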

cult_isol_date 13/27/2022 passed because we're not validating that attribute

collection date March failed
collection date 2022-03-04 passed - not sure why you had this in orange. That's a valid format.
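The two collection date results are consistent with a format check along these lines. This sketch accepts only an assumed ISO-style subset (year, year-month, year-month-day); the real validator may accept other formats, and the function name is hypothetical.

```python
from datetime import datetime

# Assumed subset of accepted formats; the real validator may allow more.
ISO_FORMATS = ("%Y-%m-%d", "%Y-%m", "%Y")


def is_valid_collection_date(value: str) -> bool:
    """True if the value parses as one of the assumed ISO-style formats."""
    for fmt in ISO_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


print(is_valid_collection_date("2022-03-04"))  # True
print(is_valid_collection_date("March"))       # False
print(is_valid_collection_date("13/27/2022"))  # False
```

Under the same check, the cult_isol_date value 13/27/2022 would be rejected; per the results above, it passed only because that attribute is not validated.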

reference_material Star*Reads passed because we're not validating that attribute

The following organisms were flagged for curator review:

bacteria: this is always flagged. We require a valid tax name. In rare cases “bacterium” would be allowed.
Listeria sp.: 'sp.' names without an appended strain name are always flagged
Escherichia coli serovar O157: this is in the Taxonomy database as “Escherichia coli O157”
Salmonella enterica subsp. enterica servoar Dublin: "serovar" is misspelled

Thanks,
John

=======================================================

@lschriml
Member Author

Email this week from NCBI:
Hi Ruth and Lynn,

John Anderson passed along your request, Ruth, to remove the validation for the following fields in the OHE BioSample package.

indoor_surf [cabinet|ceiling|counter top|door|shelving|vent cover|window|wall]
surf_material [adobe|carpet|cinder blocks|concrete|hay bales|glass|metal|paint|plastic|stainless steel|stone|stucco|tile|vinyl|wood]

We validate based on a set list of terms, as indicated in the list above. We would be happy to make the changes you requested, but we cannot do so without the approval of the GSC as well. These terms and validations appear in the GSC packages as well as the OHE package, so any changes to the validation would need to be approved by all stakeholders. We cannot simply remove the validation for one package (OHE), as that is not how our system works. Lynn, we would need the GSC to approve removing the validations for these packages and making them free text. My understanding from Ruth's request is that OHE plans to add a new picklist with specific terms, and some of those terms are not in the current validation list, so the new values will fail the existing validation. Alternatively, if the GSC and OHE could provide us with an updated list of agreed-upon terms that should be used for validation, we could make that change as well.

Please let us know what decision OHE and the GSC can come to and then we can begin the work to implement the needed changes.

Best Regards,

Linda

Linda Frisse, PhD
Genomes Team Lead and BioSample SME
NIH/NLM/NCBI/IEB
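
For the second option in the email above (one agreed-upon list), a small sketch of how the two groups might compare picklists before sending an update. The indoor_surf list is the one quoted in this thread; the two OHE additions ("drain", "floor") and the function names are invented for illustration only.

```python
# indoor_surf picklist as quoted in this thread.
INDOOR_SURF_GSC = {
    "cabinet", "ceiling", "counter top", "door",
    "shelving", "vent cover", "window", "wall",
}

# Hypothetical OHE picklist: the GSC terms plus two made-up additions.
PROPOSED_OHE = INDOOR_SURF_GSC | {"drain", "floor"}


def terms_failing_validation(proposed, current):
    """Terms that would be rejected by the current fixed-list validation."""
    return sorted(set(proposed) - set(current))


def merged_picklist(*picklists):
    """Union of all picklists, as a candidate for one shared list."""
    return sorted(set().union(*picklists))


print(terms_failing_validation(PROPOSED_OHE, INDOOR_SURF_GSC))  # ['drain', 'floor']
print(len(merged_picklist(INDOOR_SURF_GSC, PROPOSED_OHE)))      # 10
```

Because the attributes are described as package-agnostic, a merged list like this would apply to every package that uses the attribute, not only OHE.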

@lschriml
Member Author

Hello Linda,
It would be preferable to have the two groups coordinate and provide an updated list of terms.
Ruth, we would be happy to coordinate and update the list.
I have cc'd Chris Hunter, as he is leading the working group to make these changes.

Cheers,
Lynn

@turbomam
Member

thanks for sharing this @lschriml
