Tracking OBIS Submission for WBTS Calanus Data #102

Dylan-Pugh · 2022-03-22T19:49:00Z

I'm creating this issue to track the OBIS submission process for the WBTS Calanus dataset. I've opened a PR which contains the conversion script I used, as well as the three output files: #101.

Tagging @albenson-usgs here for help/guidance on using the IPT!

Please let me know if you have any questions, or see any issues.

MathewBiddle · 2022-03-29T18:59:22Z

@albenson-usgs, you can find the final Darwin Core files at https://github.com/ioos/bio_data_guide/tree/main/datasets/WBTS_MBON/data/processed

A description of the process is at https://github.com/ioos/bio_data_guide/tree/main/datasets/WBTS_MBON

albenson-usgs · 2022-03-29T21:35:34Z

Dylan-Pugh · 2022-03-30T13:45:57Z

Dylan-Pugh · 2022-03-31T16:15:27Z

I've made a number of updates here, and opened a new PR #103 with the updated script & output files.

A quick question about the measurementRemarks field:

The data in measurementRemarks seems like some of it might be better in the event or occurrence file. For instance "Comments: 74 total pteropods in light box" seems like it would be better in occurrenceRemarks and "Comments: GOMCES station bt0401" seems like it would be better for eventRemarks.

Currently the data in this field comes from the COMMENTS field in the dataset. These comments can be anything, so I don't see a way to programmatically parse them into either occurrenceRemarks or eventRemarks based on the content of a given comment. I'm thinking the three options would be:

Put all comments into the occurrenceRemarks field
Put all comments into the eventRemarks field
Omit this data

Do you have a sense of which of those is preferable?

Metadata

We can use "WBTS_CFIN_2004_2017" as the short name for this dataset.

I've also attached the cruise report for the dataset here, let me know if this format is workable, or if you need something else!

GoM_WBTS_CruiseReport_WBTS_DMAC version_8SEP21.docx

albenson-usgs · 2022-03-31T16:53:18Z

Great! Let's put it all in occurrenceRemarks.

For the metadata:

Should I have Jeff Runge as the only Resource Contact and put all Co-PIs and Mesozooplankton collection and enumeration as Resource Creators? Should I put you as the Metadata Creator? I will add you as an associated party as the processor so folks know who aligned the data to Darwin Core.
Should we add the sampling protocol into the event file? I know we have "Mesh net cast" in there right now and I think it would be good to keep that but a specific protocol is referenced "Atlantic Zone Monitoring Program (AZMP) established by Fisheries and Oceans Canada (Mitchell et al. 2002)" https://publications.gc.ca/collections/collection_2007/dfo-mpo/Fs97-18-223-2002E.pdf which might be helpful to include. Wish there was a DOI. I tried looking for it in OBPS but doesn't look like it's in there yet. Maybe amend samplingProtocol = "Mesh net cast; Mitchell et al. 2002 https://publications.gc.ca/collections/collection_2007/dfo-mpo/Fs97-18-223-2002E.pdf"?
What is the license for the data? CC-0?
Is there a preferred citation? It's ok if not, I can autogenerate one.

Dylan-Pugh · 2022-03-31T19:14:31Z

Thanks Abby! I reached out to Jeff Runge for some clarity on the contact info, license, and preferred citation. I also moved all the comments into the occurrenceRemarks field, and updated the samplingProtocol value.

While reviewing I noticed that there were some duplicate occurrenceIDs due to the way the script was generating them. I've corrected the issue so each occurrence now has a unique ID.
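A minimal sketch of one way to keep occurrenceIDs unique and repeatable, for anyone following along. This is not the code from the PR. It assumes the ID is derived from the eventID plus the life stage and sex of the record, and the namespace URL, function names, and sample rows are all hypothetical.

```python
import uuid
from collections import Counter

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/wbts")


def make_occurrence_id(event_id, life_stage, sex):
    """Derive a repeatable ID from the fields that distinguish one occurrence."""
    key = f"{event_id}|{life_stage}|{sex}"
    return str(uuid.uuid5(NAMESPACE, key))


def find_duplicates(ids):
    """Return the IDs that appear more than once, sorted."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


# Illustrative rows only (eventID, lifeStage, sex); not taken from the dataset.
rows = [
    ("event1", "CI", ""),
    ("event1", "CII", ""),
    ("event1", "", "F"),
]
occurrence_ids = [make_occurrence_id(*r) for r in rows]
duplicates = find_duplicates(occurrence_ids)
```

Running `find_duplicates` on the full occurrenceID column before writing the output file would catch a repeat of this problem early.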

Changes can be seen in this PR #104.

I'll update here once I hear back from Jeff!

albenson-usgs · 2022-03-31T22:26:52Z

After reviewing the newest files-

This eventID "GC120604WBWB-72" has no associated occurrences. Just want to make sure that's correct.
There seems to be one occurrence that doesn't have any measurements "2e4c7daa-abe4-4ba8-92fa-20b48371dbd6"
The nice thing about the eMoF is that you can have measurements that link only to events, so you don't have to repeat information for each occurrence. For instance, the sampling equipment (http://vocab.nerc.ac.uk/collection/L05/current/22/, .75DRing) doesn't need to be repeated for each occurrence, only for each event. It would then appear in the eMoF file 178 times (the number of events), with an eventID and no occurrenceID. Happy to hop on a call if it would be easier to discuss over the phone.
measurementType needs to be unique for each measurement type. I see repeats of sampling protocol, weight, sampling equipment but with different measurementTypeIDs. I created a file in Notepad++ to show what I think these should be:
emof_measurementTypes_and_IDs.txt

Dylan-Pugh · 2022-04-01T17:31:55Z

Thanks Abby, sorry for my misunderstanding about the measurementType field! Looking at your file I've updated the mappings, let me know if this looks correct to you:

Origin Term	measurementTypeID	measurementType
Net_Type	http://vocab.nerc.ac.uk/collection/L05/current/22/	net type
Mesh_Size	http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/	mesh size
Plankton_Net_Area	http://vocab.nerc.ac.uk/collection/Q01/current/Q0100017/	plankton net area
Volume_Filtered	http://vocab.nerc.ac.uk/collection/P25/current/VOL/	volume filtered
Dilution_Factor		dilution factor
Sample_Split		sample split
TOTAL_DILFACTOR_CFIN		total dilution factor CFIN
NET_DEPTH	http://vocab.nerc.ac.uk/collection/P01/current/DXPHPRST/	net depth
Sample_Dry_Weight	http://vocab.nerc.ac.uk/collection/P01/current/ODRYBM01/	sample dry weight
DW_G_M_2	http://vocab.nerc.ac.uk/collection/P01/current/ODRYBM01/	biomass per area

I checked the source data, and the eventID "GC120604WBWB-72" has no occurrences, so that's correct.

However, there's something strange going on with the occurrenceID: 2e4c7daa-abe4-4ba8-92fa-20b48371dbd6 as you mentioned. It seems to be out of order, so I'm going to investigate further.

Thanks for the clarification on the emof file - that concept makes sense (little by little!), and I'm just trying to think of how to achieve that separation programmatically. I'll open a new PR once I implement the changes and check back here.
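One possible way to achieve that separation, sketched under assumptions: the mapping reuses two rows from the table above, while the function name, the sample events, and the mesh size value are hypothetical and only for illustration. Each event contributes one row per measurement, with eventID filled in and occurrenceID left empty.

```python
# source column -> (measurementType, measurementTypeID), from the table above
EVENT_MEASUREMENTS = {
    "Net_Type": ("net type", "http://vocab.nerc.ac.uk/collection/L05/current/22/"),
    "Mesh_Size": ("mesh size", "http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/"),
}


def event_level_emof(events, mapping):
    """Build one eMoF row per event per measurement, with no occurrenceID."""
    rows = []
    for event in events:
        for column, (mtype, mtype_id) in mapping.items():
            value = event.get(column)
            if value in (None, ""):
                continue  # nothing recorded for this event, so no row
            rows.append({
                "eventID": event["eventID"],
                "occurrenceID": "",  # event-level measurement
                "measurementType": mtype,
                "measurementTypeID": mtype_id,
                "measurementValue": value,
            })
    return rows


# Illustrative events only; values are not taken from the dataset.
events = [
    {"eventID": "event1", "Net_Type": ".75DRing", "Mesh_Size": 200},
    {"eventID": "event2", "Net_Type": ".75DRing", "Mesh_Size": ""},
]
emof = event_level_emof(events, EVENT_MEASUREMENTS)
```

With this shape the row count grows with the number of events rather than the number of occurrences.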

Dylan-Pugh · 2022-04-04T15:08:34Z

Circling back here - I'm wondering how best to handle the event/occurrence split for the MoF file.

I think the thing that's tripping me up is that in the original dataset each row corresponds to a sampling event, and several (between 0 and 8) occurrences. During processing I'm expanding each original row into 8 new rows, each representing an occurrence.

Because of this, all of the data contained in the MoF file is identical for each expanded row - the only thing that changes is data related to an actual organism (sex, lifestage, individualCount, occurrenceStatus). So it seems to me that everything in the MoF should actually be tied to the eventID, because in this case none of the MoF fields are unique to an occurrenceID.
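To make the split concrete, here is a rough sketch rather than the actual conversion script. It assumes the eight organism columns are named N, CI, CII, CIII, CIV, CV, F, and M, and that each input row already carries an eventID. The function name and the sample row are hypothetical.

```python
# Assumed names of the per-stage and per-sex columns in the source data.
COUNT_COLUMNS = ["N", "CI", "CII", "CIII", "CIV", "CV", "F", "M"]


def expand_row(row):
    """Split one wide sampling row into an event record and its occurrences."""
    # Everything that is not an organism column describes the sampling event.
    event = {k: v for k, v in row.items() if k not in COUNT_COLUMNS}
    # Each organism column becomes its own candidate occurrence.
    occurrences = [
        {"eventID": row["eventID"], "sourceColumn": col, "value": row[col]}
        for col in COUNT_COLUMNS
        if col in row
    ]
    return event, occurrences


# Illustrative row only; values are not taken from the dataset.
row = {"eventID": "event1", "Mesh_Size": 200,
       "N": 0, "CI": 12, "CII": 7, "CIII": 0, "CIV": 3, "CV": 1, "F": 2, "M": 0}
event, occurrences = expand_row(row)
```

The event dictionary holds the fields that are identical across the expanded rows, which is why they can hang off the eventID alone.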

Does that make sense at all? I'd also be available for a quick call tomorrow to discuss!

Here's a diagram of how each input row becomes the 8 output rows:

albenson-usgs · 2022-04-04T15:33:23Z

Ok I think I understand this better now and I think you are right. In this case all the eMoFs are event level eMoFs and data in the columns for the different life stages (N, CI, CII, CIII, CIV, CV) and sexes (F, M) is best in individualCount if it's truly a count of the number of individuals- if it's some other type of quantity then we should use organismQuantity and organismQuantityType. My only concern is about having "absences" for different life stages and sexes. Maybe it would be better to only have absences for when Calanus finmarchicus is never found (e.g. that one event GC120604WBWB-72)?

Dylan-Pugh · 2022-04-05T15:44:21Z

Great - I think switching to organismQuantity and organismQuantityType makes sense. Looking back at the cruise report, the counts for a given stage are defined as:

Number of stage CI per m2

So my thought would be to have organismQuantity be whatever the count is, while organismQuantityType is "individuals per m2" - does that seem reasonable?

As far as recording absences - there are some records which are blank (or NaN) and others that actually record "0" in the Calanus columns. I'm going to reach out to the PI for clarification, because my understanding is that those would actually be treated differently:

0 -> looked for the organism, but it wasn't there
blank -> some kind of error/anomaly in the reporting?

albenson-usgs · 2022-04-05T16:14:34Z

So my thought would be to have organismQuantity be whatever the count is, while organismQuantityType is "individuals per m2" - does that seem reasonable?

Yes that makes sense to me.

For the absences- yes good to get clarification from the PI for sure about that difference. However, I'm still wondering if it makes sense to have absences for males vs. females or different life stages. Note that the definition of occurrenceStatus is "A statement about the presence or absence of a Taxon at a Location." Given this definition I'm just not sure if it makes sense to have:

eventID	occurrenceID	scientificName	sex	lifeStage	occurrenceStatus
event1	event1_occ1	Calanus finmarchicus	M		present
event1	event1_occ2	Calanus finmarchicus	F		absent
event1	event1_occ3	Calanus finmarchicus		nauplii	absent
event1	event1_occ3	Calanus finmarchicus		copepodite	present
event1	event1_occ3	Calanus finmarchicus		adult	absent

In the example I have above Calanus finmarchicus (taxon) is present at the location (event1) it's just that not all sexes or life stages are present. I would not include the sex / life stages "absences" so it would look like this:

eventID	occurrenceID	scientificName	sex	lifeStage	occurrenceStatus
event1	event1_occ1	Calanus finmarchicus	M		present
event1	event1_occ3	Calanus finmarchicus		copepodite	present

The time when I would include a row for absent is when absolutely no Calanus finmarchicus are found. But I'm curious what others think about this so I will put this question over into the Slack for more discussion.
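A small sketch of that rule, assuming the counts for one event have already been parsed into numbers keyed by (lifeStage, sex). The function name and sample values are hypothetical, and returning nothing for an event with no counted categories is an assumption of this sketch, not something decided here.

```python
QUANTITY_TYPE = "individuals per m2"
TAXON = "Calanus finmarchicus"


def occurrences_for_event(event_id, counts):
    """Emit present rows for non-zero categories, or one taxon-level absence."""
    if not counts:
        return []  # assumption: nothing was counted, so nothing is reported
    present = [
        {
            "eventID": event_id,
            "scientificName": TAXON,
            "lifeStage": stage,
            "sex": sex,
            "occurrenceStatus": "present",
            "organismQuantity": value,
            "organismQuantityType": QUANTITY_TYPE,
        }
        for (stage, sex), value in counts.items()
        if value > 0
    ]
    if present:
        return present
    # Every counted category was zero: record a single absence for the taxon.
    return [{
        "eventID": event_id,
        "scientificName": TAXON,
        "lifeStage": "",
        "sex": "",
        "occurrenceStatus": "absent",
        "organismQuantity": 0,
        "organismQuantityType": QUANTITY_TYPE,
    }]


# Illustrative values only.
mixed = occurrences_for_event("event1", {("", "M"): 5.0, ("", "F"): 0, ("copepodite", ""): 3.0})
none_found = occurrences_for_event("event2", {("", "M"): 0, ("", "F"): 0})
```

Zero-valued sexes and life stages simply drop out when the taxon is present, matching the second table above.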

Dylan-Pugh · 2022-04-05T20:38:44Z

That makes a lot of sense, and I think you're right about the absences. The fact that the definition specifically mentions Taxon definitely helps clarify that in my mind.

For the moment I've gone ahead and updated the script to ignore "missing" sex and life stage records, and the output now looks like your example above. I'll keep an eye on the discussion in Slack and can amend this if need be.

I also verified that blank records mean that an organism was not counted. From the cruise report:

Table cell showing NaN indicates not counted.

so those records are now ignored.
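Roughly, the distinction could be expressed like this. It is a sketch and not the code in the PR: the function name and status labels are hypothetical, and it assumes blank cells arrive as None, an empty string, the text "NaN", or a float NaN.

```python
import math


def classify_count(raw):
    """Return 'not_counted', 'absent', or 'present' for one raw cell value."""
    if raw is None:
        return "not_counted"
    if isinstance(raw, str):
        raw = raw.strip()
        if raw == "" or raw.lower() == "nan":
            return "not_counted"
        raw = float(raw)  # raises ValueError on anything non-numeric
    if math.isnan(raw):
        return "not_counted"
    # A recorded zero means the organism was looked for but not found.
    return "absent" if raw == 0 else "present"


statuses = [classify_count(v) for v in ["", "NaN", "0", "12.5", 0, float("nan")]]
```

Only cells classified as "not_counted" are skipped outright; a recorded 0 still carries information.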

I've opened a new PR with those changes and some additional corrections: #105

Metadata

The PI also responded to my earlier questions about metadata:

Resource Contacts: Jeffrey Runge, Lee Karp Boss
Metadata Creator: Dylan Pugh
Citation: no preference, feel free to generate

He was unsure about the license question, so I'm following up with a few other people on that. Is this in reference to the source data's existing license, or the license which will be applied to the DwC files?

albenson-usgs · 2022-04-05T20:47:31Z

Does this page help with the licenses question? It's the license that will be applied to the DwC files. Note that they must select one of three licenses or the data cannot be published to OBIS and GBIF: CC-0, CC-BY, CC-BY-NC.

Dylan-Pugh · 2022-05-06T16:23:53Z

Hi @albenson-usgs - just wanted to circle back here! I heard back from the PI, and we'd like to use CC-BY for the license.

albenson-usgs · 2022-05-06T16:33:34Z

Does that mean we're a go for publishing? Should I load what's here into the OBIS-USA IPT and publish to OBIS and GBIF?

Dylan-Pugh · 2022-05-06T16:36:59Z

Sure thing - that sounds good to me!

albenson-usgs · 2022-05-06T19:41:38Z

Dylan I'm just doing a quick recheck before I publish. Previously there was only one event with no occurrences but now there are 88 events with no occurrences- is that accurate?

Also the negative sign is missing from 13 of the longitude values.
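Both checks are easy to script. The sketch below assumes the event and occurrence files have been read into lists of dictionaries, and that every station lies in the western hemisphere, so any positive decimalLongitude is suspect. The function names and sample coordinates are hypothetical.

```python
def events_without_occurrences(events, occurrences):
    """List eventIDs that no occurrence record points to."""
    used = {o["eventID"] for o in occurrences}
    return [e["eventID"] for e in events if e["eventID"] not in used]


def positive_longitudes(events):
    """List eventIDs whose longitude is positive (likely a missing minus sign)."""
    return [e["eventID"] for e in events if float(e["decimalLongitude"]) > 0]


# Illustrative records only; coordinates are not taken from the dataset.
events = [
    {"eventID": "event1", "decimalLongitude": "-69.86"},
    {"eventID": "event2", "decimalLongitude": "69.86"},
    {"eventID": "event3", "decimalLongitude": "-69.86"},
]
occurrences = [{"eventID": "event1"}, {"eventID": "event2"}]

empty_events = events_without_occurrences(events, occurrences)
suspect_events = positive_longitudes(events)
```

Comparing the length of `empty_events` between versions of the files is a quick way to spot a jump like the one from 1 to 88.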

Dylan-Pugh · 2022-05-06T20:27:27Z

Thanks Abby - I'm correcting the longitude values in the source data now, and will also verify the missing occurrences.

Dylan-Pugh · 2022-05-19T17:56:21Z

Hi @albenson-usgs - sorry for the super long delay in getting back to you. I've corrected the issue with the missing negative signs, and confirmed that there are 88 events with no occurrences - so the DwC files should be correct.

I've opened a PR with the changes here: #108

Hopefully we'll be all set once it's merged in!

MathewBiddle · 2022-05-19T17:58:03Z

I went ahead and merged it in.

albenson-usgs · 2022-05-19T18:53:38Z

@Dylan-Pugh is this a one off and will never be updated or will this be updated with more observations in the future? I'm trying to decide if I should include the dates in the title of the resource (Wilkinson Basin Time Series Station (WBTS): MESOZOOPLANKTON 2004-2017) or not (Wilkinson Basin Time Series Station (WBTS): MESOZOOPLANKTON)

Dylan-Pugh · 2022-05-19T19:39:05Z

This should be a one off! I don't think any new data will be added in the future.

albenson-usgs · 2022-05-19T19:48:37Z

Published! https://www1.usgs.gov/obis-usa/ipt/resource?r=gom_wbts_mesozooplankton

Dylan-Pugh · 2022-05-19T19:58:35Z

Thanks Abby & Matt!

albenson-usgs · 2022-05-19T19:59:40Z

Thank you! Teamwork!

albenson-usgs · 2022-05-19T21:09:54Z

Here's the dataset in OBIS https://obis.org/dataset/5ef55cd8-05a1-4569-8e17-ceb224e40f59 :-)

MathewBiddle · 2022-05-20T12:28:10Z

And GBIF! https://www.gbif.org/dataset/29651377-23c8-4f45-b439-693a1a23cee1

Dylan-Pugh mentioned this issue Mar 31, 2022

WBTS DwC Conversion Revisions #103

Merged

Dylan-Pugh mentioned this issue Mar 31, 2022

WBTS DwC Conversion Revisions Round 2 #104

Merged

albenson-usgs closed this as completed May 19, 2022

albenson-usgs mentioned this issue Nov 21, 2022

Create dataset review issue template #134

Merged
