taxonRank and aggregators #1338

dustymc · 2017-11-30T17:03:57Z

Without taxonRank, iDigBio "fixes" various taxonomy terms to random values which are sometimes completely unrelated to the original ID.

taxonRank is not a required field in DWC.

Arctos does not require taxa to be ranked (which is an accurate representation of taxonomy itself). Some identifications do not use taxa at all, others use multiple taxa, all of it may be ranked or not.

iDigBio's suggestion is to "add[] taxonRank as a required field in Arctos" which isn't possible or practical for many reasons.

When ranked taxonomy is available we fill out the appropriate "columns" in DWC - "Family," "Order" etc. From that we could find the most specific term which is ranked, but not "The taxonomic rank of the most specific name in the scientificName" (as specified in the Standard). The lowest ranked term also does not necessarily appear in the scientificName at all.

I can't quite see what we could do before exporting the DWC that wouldn't just be wrong in some instances. As always, I'm open to suggestions.

tucotuco · 2017-11-30T18:05:13Z

I recommend that, if the identification is to a single name at a single rank (majority of cases), provide it, otherwise leave it blank.
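
A minimal sketch of that rule, for illustration only. It assumes a hypothetical in-memory shape in which each taxon used by an identification carries the list of ranks its classification assigns to it. The function name and data layout are assumptions, not Arctos code.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical shape: each taxon in an identification is (name, ranks), where
# ranks lists every rank the classification assigns to that name.
Taxon = Tuple[str, List[str]]


def taxon_rank_for_identification(taxa: Sequence[Taxon]) -> Optional[str]:
    """Return a rank only for a single name at a single rank; otherwise None."""
    if len(taxa) != 1:
        # No taxa at all, or several taxa in one ID: leave taxonRank blank.
        return None
    _name, ranks = taxa[0]
    if len(ranks) != 1:
        # Unranked name, or a name ranked more than one way: leave it blank.
        return None
    return ranks[0]


# Illustrative inputs only.
print(taxon_rank_for_identification([("Sorex cinereus", ["species"])]))
print(taxon_rank_for_identification([("Sorex", ["genus"]), ("Mus", ["genus"])]))
print(taxon_rank_for_identification([("Sorex", [])]))
```

Returning `None` for everything that is not exactly one name with exactly one rank keeps the export from asserting a rank the data cannot support.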

atrox10 · 2017-11-30T18:08:14Z

I would fill in something, even just Animalia or Plantae, if there’s no genus, otherwise I think both GBIF and iDigBio will put something crazy in for identification.


dustymc · 2017-11-30T18:19:43Z

single name at a single rank

Seems possible, if perhaps somewhat expensive. I'll explore.

fill in something

I am REALLY hesitant to ignore published Standards, and even if we do our data will not always resolve to a singular "something." There may be a useful default, but I'm going to need explicit instructions for finding it.

iDigBio will put something crazy in for identification

This is obviously a bug in iDigBio.

atrox10 · 2017-11-30T19:50:54Z

I agree about not wanting to ignore published standards, but then why are iDigBio and GBIF filling something in for this? That's what Joanna told me, if there is nothing in that field, then iDigBio fills in something and GBIF requires it too. John or Dusty - can you find out from someone at GBIF if they are doing the same thing as iDigBio (requiring something in this field and if it's not there, filling in identifications with random stuff)?


dustymc · 2017-11-30T20:20:16Z

why are iDigBio and GBIF filling something in for this?

That is the question!

GBIF if they are doing the same thing as iDigBio

See https://www.idigbio.org/portal/records/24cc3e24-0cac-4a54-9877-5f458a191e18 (Decapoda: Animalia > Arthropoda > Insecta > Orthoptera > Tettigoniidae) vs https://www.gbif.org/occurrence/1145113729 (Decapoda: Animalia Arthropoda Malacostraca). GBIF is behaving predictably here. (But see #1291 (comment). GBIF has no idea how to handle ISO8601 dates for some reason. Manipulation by "portals" seems to always cause problems which users likely interpret as "Arctos is broken.")

I wrote a simple script to return rank for "single name at a single rank." It's running against accepted IDs in Arctos now and it should have done something in a few days. (I think it would be usably-fast in prod, I've got it throttled heavily to prevent any unanticipated problems for now). I'll post whatever falls out when it's done; perhaps there will be some solution to the more complicated situations evident from those data.

tucotuco · 2017-11-30T20:21:43Z

GBIF's suggestions (most of their "requirements" are not actually required in practice) can be found at http://www-old.gbif.org/publishing-data/quality. The taxonRank field is only "strongly recommended". iDigBio uses the GBIF taxonomic backbone as a data source against which to validate taxa, but they do not use the same process to determine the valid classification. You can see this in the GBIF record of the same specimen:

GBIF: https://www.gbif.org/occurrence/1145113729
iDigBio: https://www.idigbio.org/portal/records/24cc3e24-0cac-4a54-9877-5f458a191e18


dustymc · 2017-12-04T18:46:56Z

There's a first-pass attempt at getting taxonRank at https://github.com/ArctosDB/DDL/blob/master/functions/getTaxonRank.sql.

Here's the result:

UAM@ARCTOS> select taxon_rank || ' @ ' || count(*) from temp_test_taxon_rank group by taxon_rank order by taxon_rank;

TAXON_RANK||'@'||COUNT(*)
------------------------------------------------------------------------------------------------------------------------
author_text @ 1
canonical name @ 40
canonical_name @ 1
class @ 24240
error!: ORA-01403: no data found @ 645026
error!: ORA-01422: exact fetch returns more than requested number of rows @ 13692
family @ 156720
forma @ 11
genus @ 53231
hyporder @ 574
infraclass @ 4
infraorder @ 1
kingdom @ 3777
order @ 93419
phylum @ 17896
species @ 1911536
subclass @ 536
subdivision @ 1
subfamily @ 5862
suborder @ 3212
subphylum @ 20
subpspecies @ 55
subspecies @ 577788
superfamily @ 2576
superorder @ 82
tribe @ 552
variety @ 5133

27 rows selected.

Adjusting the script to use only classification terms from https://arctos.database.museum/info/ctDocumentation.cfm?table=CTTAXON_TERM or something would clean up a few things, but would also slow down the script. Perhaps we should clean up our taxonomy instead?

I don't see a pathway to

the ranks used have to be (major) Linnean ranks: kingdom, phylum, class, order, family, genus, species.

@ http://www-old.gbif.org/publishing-data/quality#dcTaxonRank2 - will violating that break something else?

"Error" data attached. OK, they're not because it's too big and #1345. I'll email it by request.

How should I proceed, if I should proceed?

tucotuco · 2017-12-05T13:25:14Z

"The taxonomic rank of the most specific name in the scientificName. Recommended best practice is to use a controlled vocabulary." Anything else is a myth. :-) The majority of records use ranks that iDigBio will understand. The likelihood of a mis-classification such as the one that started this will be reduced immensely. Given that it does not happen often anyway, we can probably call it vanishingly small (which in turn just means that it'll take an extra week to find one).


dustymc · 2017-12-05T17:41:26Z

an extra week to find one

That's part of my concern - we do something random, {whoever} turns it into some sort of "users think Arctos is broken" garbage, we don't notice because it's ONLY a few tens of thousands of records....

A way of knowing what's going on in portals would be really great. "We don't like something somewhere in this record for reasons we're not going to share" is difficult to work with.

dustymc · 2017-12-05T18:16:23Z

From AWG meeting:

Fill this in when not 'error....'
Check data in iDigBio in a couple months????

?????????

@atrox10

ekrimmel · 2017-12-05T20:08:40Z

FWIW, links to TDWG's ideas (https://github.com/tdwg/bdq/blob/master/tg2/README.md) for standardizing data quality tests (https://docs.google.com/spreadsheets/d/1td7zJ9GH3WWhu0Pa1X-1fkaWk71U8qqr54-kkbfwbfE/edit#gid=339716286) across aggregators. I don't know how far along in practice any of this is.

tucotuco · 2017-12-05T21:06:42Z

There will be a meeting 16-19 January in Gainesville to finalize these tests and assertions and build out pseudo-code and create test data sets for these. The activity is urgent, as ALA, iDigBio, VertNet, Kurator, and GBIF all seek to implement the same algorithms, providing the same results on given input data.

Jegelewicz · 2018-07-31T19:43:52Z

Re-invigorating this thread as I will be talking about it at SPNHC.

Would it not be possible to have our names table include a field that was "taxon rank"? So:

scientific name, taxon rank
Aves, class
Bufo americanus, species

and so on. This way whatever identification is with the specimen would also tell iDigBio, GBIF that the ID is referring to a specific rank, regardless of what the Arctos classification has in it.

See also #1607

dustymc · 2018-07-31T20:51:33Z

names table

Names are not consistently ranked. Diptera is a genus and order, for example. We already (optionally) have ranks in classifications.

I'm not sure how #1607 is related.

Jegelewicz · 2018-07-31T22:40:28Z

This is what is causing the Aves/Avus problem. As we are not passing a rank, they are assuming that Aves is a genus and that we are misspelling Avus.

Names are not consistently ranked. Diptera is a genus and order, for example. We already (optionally) have ranks in classifications.

And that is a problem for anyone who uses "Diptera" by itself (even with sp.) as an identification, because that is showing up at iDigBio and they don't care what we put in the classification. If we don't tell them via taxonrank that we are talking about the order, they will just assume it is the genus. Maybe names should be a pair. Name + taxon rank.

So we would have:
Diptera, order
Diptera, genus

and that would allow each to have its own, proper classification.

dustymc · 2018-07-31T22:54:10Z

they are assuming

I can't really do anything about that.

anyone who uses "Diptera" by itself (even with sp.) as an identification,

That's why we deal in data objects rather than strings.

http://arctos.database.museum/name/Diptera#Arctos
http://arctos.database.museum/name/Diptera#ArctosPlants

talking about the order

We provide that information. It's not always simple enough to pick a rank.

UAM@ARCTOS> select distinct phylclass from flat where scientific_name='Aves';

PHYLCLASS
------------------------------------------------------------------------------------------------------------------------
Aves

1 row selected.

Maybe names should be a pair. Name + taxon rank.

"Echidna, genus" is an eel and a snake and a mammal and some other stuff - that does not clarify anything in a great number of cases. It would also require ranks, which is not a useful taxonomy model.

Jegelewicz · 2018-07-31T23:13:05Z

they are assuming

I can't really do anything about that.

According to them, we can. All we have to do is provide the taxon_rank. We should figure out a way to do so.

dustymc · 2018-07-31T23:55:13Z

I'm certainly up for ideas! Mine's above - I'm not sure what could be done with the ~half-million misses.

Jegelewicz · 2018-08-01T19:20:51Z

How about adding a field like ID_Name_Rank with a controlled vocabulary that includes the ranks that iDigBio is looking for? We could populate the field for stuff already in Arctos from the lowest rank on the taxonomic classification, if there is one, attached to the taxon name. This information wouldn't have to be presented to the public. Like encumbered information, perhaps it only needs to be visible to operators.

For non-biological collections, this might be useful in other ways, but for now, they could use a term such as "not applicable" or "not provided".

campmlc · 2018-08-01T19:41:25Z

I support Teresa's suggestion. Since iDigBio can't seem to fix it on their end without input from us, we need to fix it. I don't want our data to look bad from the aggregators' portals, since for so many people that is all they look at to search for records.


dustymc · 2018-08-01T20:04:47Z

ID_Name_Rank

I'm not following - where would you store this?

ranks that iDigBio is looking for

That's another problem.

Names should be scientific (latin) names at major Linnean ranks, like “Animalia” (kingdom) or “Rosaceae” (family). Not: common names (“animals”), abbreviations (“Rosac.”), intermediate rank levels (“Tetrapoda” (superclass)), or polyphyletic or non-taxonomic groupings (“algae”, “herbivora”).

Our "lowest ranks" do include things like superclass.

Jegelewicz · 2018-08-01T20:45:25Z

ID_Name_Rank would go with Identification fields:

TAXON_NAME	ID_NAME_RANK	ID_MADE_BY_AGENT	MADE_DATE	NATURE_OF_ID	IDENTIFICATION_REMARKS

Jegelewicz · 2018-08-01T20:48:19Z

ID_NAME_RANK the level of classification to which the TAXON_NAME belongs

Values would include, but not be limited to:

Kingdom
Phylum
Class
Order
Family
Genus
Species
Subspecies

dustymc · 2018-08-01T21:04:02Z

to which the TAXON_NAME

I think I understand, but that's not quite the right verbiage. (And I can get that from the classification data when things are that simple.) IDs are two levels away from taxon names - they implicitly include classification data, and may be comprised of zero or many taxa. I think I'd suggest being more straightforward, unless there's some use case I'm not seeing.

Term: taxonRank
Definition: value to provide for DWC:taxonRank

Subspecies

I'd guess that falls in "intermediate rank levels" which iDigBio seems to not recognize, but there's no documentation that I can find so I don't really know.

Jegelewicz · 2018-08-01T21:30:08Z

So does that seem like a workable solution?

dustymc · 2018-08-15T22:34:06Z

Thanks!

https://repository.si.edu/bitstream/handle/10088/23079/SMC_7_Meek_1864_8_ii-40.pdf?sequence=1&isAllowed=y (and others) strongly suggests that Heteroceras angulatum (Meek & Hayden) is an ammonite. OK to delete the Arthropoda (phylum) classification?

anna-chinn · 2018-08-15T22:35:31Z

OK to delete the Arthropoda (phylum) classification?

Yes!

dustymc · 2018-08-15T22:36:32Z

thx/done

dustymc · 2018-08-16T01:10:30Z

broken

https://www.idigbio.org/portal/records/18805a8a-84d0-4ecd-8e56-013ce3a0b21f

"dwc:phylum": "Chordata",
"dwc:class": "Aves",
"dwc:scientificName": "Aves"

( "dwc:preparations": "long bone",)

?

Jegelewicz · 2018-08-16T15:15:57Z

iDigBio receives many identifications with incomplete or non-existent classification data. We are as guilty as anyone of sending them such data (for an example, see http://arctos.database.museum/name/Childonias%20niger ). In an attempt to clarify and make these identifications searchable via means of higher taxa, iDigBio has created scripts to fill in or create the missing pieces. As with any script (we should know) there are things that will not turn out as expected. One of the scripts says that an identification that is monomial will be assumed to be a genus UNLESS Taxon_Rank says otherwise. In the case of the Aves, we are not giving iDigBio a Taxon_Rank, so the script first assumes we have left out part of the classification (order and family) and because there is no genus Aves, it then assumes we have misspelled Avus and assigns the classification you see as "broken" above.

While we can debate the merits of iDigBio logic all day, we could just fix this issue by providing Taxon_Rank. It isn't really all that difficult for us to do and it will make our data look good.

Can we just do this please?

Jegelewicz · 2018-08-16T15:19:07Z

All of these records are being altered in some way when they reach iDigBio. This is only from collections to which I have access....

dustymc · 2018-08-16T15:27:32Z

I need details of WHAT and HOW, and I think we need input from at least @atrox10 if the "how" involves manually setting a value for each ID.

You are proposing only this:

add identification.taxon_rank to the model and UI
push any values that appear there to DWC

Correct?

Jegelewicz · 2018-08-16T17:14:52Z

add identification.taxon_rank to the model and UI

Yes

push any values that appear there to DWC

Yes to dwc:taxonrank

details of WHAT and HOW

I think this should either be a bulkloader field to be completed along with the identification or we should make this a term associated with taxon name, required when creating a new name (will not solve the homonym issue).
For everything already in Arctos, we could populate with "species" if a bi or trinomial and use the lowest classification possible for monomials with an attached classification.
NULL should be a valid entry, at least for now since that is what we are sending for everything at this point.
The rest will need to be figured out by humans and any of those we auto-populate may end up doing dumb stuff until someone figures them out, something the Taxonomy Committee could help with.
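
The backfill proposed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Arctos implementation: `RANK_ORDER`, `backfill_rank`, and the dictionary form of a classification (rank to term) are all hypothetical. Following the proposal, binomials and trinomials both get "species"; the sketch makes no attempt to tell subspecies or varieties apart.

```python
from typing import Dict, Optional

# Assumed rank order, highest to lowest. A real list would come from a
# controlled vocabulary, which this sketch does not query.
RANK_ORDER = ["kingdom", "phylum", "class", "order", "family", "genus"]


def backfill_rank(name: str,
                  classification: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Guess a rank for an existing name; None means 'leave it NULL'."""
    words = name.split()
    if len(words) in (2, 3):
        # Binomial or trinomial: "species", as proposed in the comment.
        return "species"
    if len(words) == 1 and classification:
        # Monomial: use the lowest rank present in the attached classification.
        present = [rank for rank in RANK_ORDER if rank in classification]
        return present[-1] if present else None
    # Anything else is left for a human to decide.
    return None


# Illustrative inputs only.
print(backfill_rank("Bufo americanus"))
print(backfill_rank("Aves", {"kingdom": "Animalia", "phylum": "Chordata", "class": "Aves"}))
print(backfill_rank("Aves"))
```

Note that the word-count test is naive: any string with two or three whitespace-separated tokens is treated as a binomial or trinomial, so the results would still need human review.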

dustymc · 2018-08-16T17:36:21Z

be a bulkloader field

That's possible, but slightly less than trivial.

term associated with taxon name

Here's where I get lost. If there's a taxon name (and classification, via collection preferences) used in a straightforward manner and that classification contains some ranks, I can get taxon rank from it. If there's a term used in a less-straightforward manner, I can get the rank of the term, but it's not necessarily strictly the same "level" as the ID (a "Sorex cinereus ?" ID would return "species" - the assertion is more like "probably species." Better than "IDK, maybe a bug?"!) I'm not sure what you're hoping to accomplish with this addition - it might be useful in fringe cases (I'm not likely to find a defensible rank for "Mus musculus domesticus and Siphonaptera"), but this seems like a lot of work for that, and I don't think you can explicitly assert a defensible taxon rank to that either.

required when creating a new name

That would be a major change to the model and definitely require discussion.

solve the homonym issue

I'm still not real clear on what this is either.

For everything already in Arctos, we could populate with "species" if a bi or trinomial and use the lowest classification possible for monomials with an attached classification.

I can do that on the fly with no data additions.

NULL should be a valid entry, at least for now since that is what we are sending for everything at this point.

My scripts would return NULL for "Mus musculus domesticus and Siphonaptera" - and I think that's the only defensible value for something like that.

The rest will need to be figured out by humans

See above - I doubt that's always going to be possible.

any of those we auto-populate may end up doing dumb stuff

Aside from the whole '"Sorex ?" is taxonRank genus' thing, the only way my scripts are going to find dumb stuff is if there's dumb stuff in the classification data. (In which case GIGO applies.)

Taxonomy Committee

Strictly speaking, this is an identification problem. (Or it's a bit of both, maybe.)

I'm still unclear regarding what an explicit assertion can do that we can't pull from the data.

Jegelewicz · 2018-08-16T21:57:24Z

I am fine with just pulling from the data. Anything else can be part of a larger taxonomy discussion.

dustymc · 2018-08-31T17:25:42Z

A first-pass is completed in production. Please let me know if you see anything the scripts should have figured out and I can update them. I'll start pushing these data to FLAT, where they'll be available to DWC - hopefully that'll be done and maintaining itself by early next week.

Attached are IDs for which I could not determine taxon rank. Most are A {string} (and not having a taxon is the point of that formula) but a fair number are just bad data in Arctos.

http://arctos.database.museum/name/Trivia%20nivea has no classification data at all

http://arctos.database.museum/name/Urosaurus%20bicarinatus%20anonymorphus has minimal data

http://arctos.database.museum/name/Myriapoda has a term ranked two ways

http://arctos.database.museum/name/Aves#Arctos (what started this?!) claims to be a genus....

Ditto http://arctos.database.museum/name/Phocidae

And there's a lot of standardization to be had in A {string} names - looks like there are about a dozen ways of saying "flake." Time to figure out formal non-Linnean taxonomies? @sjshirar

temp_notaxrank.csv.zip

Jegelewicz · 2018-09-03T15:11:21Z

http://arctos.database.museum/name/Aves#Arctos (what started this?!) claims to be a genus.... Ditto http://arctos.database.museum/name/Phocidae

This is what I complained about before. I am not able to delete that genus assertion no matter how hard I try. When I delete it, it looks like it is gone, but when I leave and come back, it is there again.

Jegelewicz · 2018-09-03T15:33:02Z

http://arctos.database.museum/name/Trivia%20nivea has no classification data at all

Also has no specimens - delete?

Jegelewicz · 2018-09-03T15:38:24Z

http://arctos.database.museum/name/Urosaurus%20bicarinatus%20anonymorphus has minimal data

Filled in basic higher taxa.

Jegelewicz · 2018-09-03T15:42:24Z

http://arctos.database.museum/name/Myriapoda has a term ranked two ways

Removed the class term because Myriapoda is a subphylum. https://en.wikipedia.org/wiki/Myriapoda

But check it, because I get the flaky genus auto-suggest that I, again, cannot seem to delete....

dustymc · 2018-09-03T16:28:49Z

delete that genus assertion

It's a suggestion, not an assertion (until someone clicks save anyway!) - #1188 (comment)

no specimens - delete

http://arctos.database.museum/SpecimenResults.cfm?anyTaxId=10942232 is a specimen using it (if your login allows that). The ONLY reason to delete a taxon is you're sure it's not validly published (and maybe not then if it's useful - http://handbook.arctosdb.org/documentation/taxonomy.html)

campmlc · 2018-09-03T17:16:29Z

I have been confused by that as well. Maybe we shouldn't make inaccurate suggestions that require an action on the part of the user to get rid of them, e.g. "delete this row". If there is no genus, could we have a pop-up suggestion box that says something like "You are submitting a classification without a genus name. Click here to add a genus or click here to accept the current classification and taxon ranks."?


dustymc · 2018-09-03T18:28:02Z

#1188 (comment).

The current behavior seems generally less-evil than any alternatives I've found, and it arose from data problems. My increasingly-strong preference remains to deprecate the single-record edit option altogether and pipe everything through the bulkloader (which hopefully would mostly be an extension of the hierarchical editor, but that approach does not constrain us to local hierarchical data).

@ejbrock has been making lots of bird taxonomy edits via the hierarchical tool, and it's likely that this has been making things harder to find than they need to be. It's absolutely not possible to be consistent when editing 2.5 million records one by one, and it's absolutely not possible to be inconsistent within a hierarchical structure. The single-record edits which introduce inconsistency fracture hierarchies, which both hides specimens from users and makes future large-scale edits difficult. (What should be a single hierarchy becomes one - often a very short one - for each inconsistency.)

Jegelewicz · 2018-09-03T20:11:55Z

It's a suggestion, not an assertion (until someone clicks save anyway!) - #1188 (comment)

So, when someone goes to add the author name, this "suggestion" will get saved if they don't really know to delete it every time. That seems a little evil to me.

Jegelewicz · 2018-09-03T20:13:17Z

I am still really unclear on how this bulkloader works, especially if I just need to add one taxon. This is something that anyone with taxonomy access needs to be trained on.

dustymc · 2018-09-04T15:57:42Z

a little evil

Absolutely, but less evil than the alternative. (Most monomials are genera - I'm just playing the odds, and always up for better ideas.)

bulkloader ... add one taxon

For that use case, probably sort of a pain. I think we get to pick our poison - is requiring a lot of work for a "simple" task less-evil than providing a path which consistently introduces inconsistent data? I'm leaning that way; tiny bits of garbage find ways to propagate out into huge messes.

Jegelewicz · 2018-09-05T21:15:35Z

Would it be possible to add TAXON_RANK in the Non-Classification table that you could then use to modify your suggestions? If I had "Class" in the TAXON_RANK field, then you would know not to suggest anything below Class. Suggested controlled vocabulary for this field:

Kingdom
Phylum
Class
Order
Family
Genus
Species

Like this

campmlc · 2018-09-05T21:34:16Z

I agree with Teresa. Most monomials in parasite collections include genera, phyla, classes, and families. There is no safe assumption there.


dustymc · 2018-09-05T22:01:54Z

That's possible, but...

I'm not sure I see the point - would it ever be anything other than the lowest term in the hierarchy?
That isn't really useful for the complex/problem IDs, and IDs are what we share via DWC.

assumptions

There are 159,746 monomials in Arctos. 150,646 of them are ranked 'genus' in a local classification. ~94% of the time, the "monomials are genera" assumption is correct (I assume that's the reference??).
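The ~94% figure can be checked directly from the two counts quoted above. The variable names below are only for illustration.

```python
# Counts quoted in the comment above.
total_monomials = 159_746
ranked_genus = 150_646

share = ranked_genus / total_monomials
not_genus = total_monomials - ranked_genus

print(f"{share:.1%} of monomials are ranked 'genus'")
print(f"{not_genus} monomials are not ranked 'genus' in a local classification")
```

So the "monomials are genera" guess holds for about 94.3% of monomials across Arctos as a whole, and misses for the remaining 9,100. That overall rate says nothing about how the misses are distributed among collections.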

dustymc · 2018-09-05T22:54:44Z

This is now implemented in production, and the data are available as taxonRank in the IPT view.

This also turned out to be the straw that broke the ~~camel's~~ DBMS_COMPARISON's back; we had to significantly change the way in which public data are processed, and that should now happen more or less in real-time.

Here are the data from FLAT:

UAM@ARCTOS> select taxon_rank || ' @ ' || count(*) from flat group by taxon_rank order by count(*);

TAXON_RANK||'@'||COUNT(*)
------------------------------------------------------------------------------------------------------------------------
unranked clade @ 1
subdivision @ 1
infraorder @ 3
epifamily @ 4
infraclass @ 4
subpspecies @ 4
forma @ 13
subphylum @ 22
superorder @ 93
subclass @ 552
hyporder @ 574
tribe @ 633
superclass @ 1936
superfamily @ 2665
suborder @ 3301
variety @ 5729
subfamily @ 7321
kingdom @ 10831
phylum @ 16500
class @ 23381
order @ 94951
genus @ 128756
family @ 164410
 @ 561031
subspecies @ 564074
species @ 1931137

26 rows selected.

And a bit of funky data that made it in.

select taxon_rank,scientific_name,guid from flat where taxon_rank not in (select TAXON_TERM from CTTAXON_TERM where IS_CLASSIFICATION=1) order by taxon_rank,scientific_name;

TAXON_RANK
------------------------------------------------------------------------------------------------------------------------
SCIENTIFIC_NAME
------------------------------------------------------------------------------------------------------------------------
GUID
------------------------------------------------------------------------------------------------------------------------
subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:17947

subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:26718

subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:26806

subpspecies
Ficus papyratia lindae
DMNS:Inv:16814

unranked clade
Merriamosauria
UAM:ES:2437

@sharpphyl @KatherineLAnderson
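For readers who don't use SQL, the check in that second query amounts to an anti-join against the controlled vocabulary. Here is a minimal sketch, assuming records are plain dictionaries. The contents of CTTAXON_TERM are not shown in this thread, so `classification_terms` is a hypothetical subset, and the `EXAMPLE:` records are invented for illustration.

```python
def unexpected_ranks(records, classification_terms):
    """Return records whose rank is set but is not a classification term.

    Like SQL's NOT IN, records with no rank (None) are skipped rather
    than reported.
    """
    allowed = set(classification_terms)
    bad = [
        r for r in records
        if r["taxon_rank"] is not None and r["taxon_rank"] not in allowed
    ]
    # Same ordering as the query: by rank, then by scientific name.
    return sorted(bad, key=lambda r: (r["taxon_rank"], r["scientific_name"]))


# Hypothetical subset of the classification terms.
classification_terms = ["genus", "species", "subspecies"]

records = [
    {"taxon_rank": "unranked clade", "scientific_name": "Merriamosauria", "guid": "UAM:ES:2437"},
    {"taxon_rank": "species", "scientific_name": "Example species", "guid": "EXAMPLE:1"},
    {"taxon_rank": None, "scientific_name": "Example unranked", "guid": "EXAMPLE:2"},
    {"taxon_rank": "subpspecies", "scientific_name": "Eclogavena quadrimaculata thielei", "guid": "DMNS:Inv:17947"},
]

flagged = unexpected_ranks(records, classification_terms)
for r in flagged:
    print(r["taxon_rank"], "|", r["scientific_name"], "|", r["guid"])
```

Only the misspelled "subpspecies" record and the "unranked clade" record are flagged. The record with no rank is skipped, which matches how the 561,031 blank-rank rows stay out of the query result above.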

sharpphyl · 2019-04-15T21:16:22Z

subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:17947

subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:26718

subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:26806

subpspecies
Ficus papyratia lindae
DMNS:Inv:16814

It looks like switching from Arctos to WoRMS (via Arctos) has solved the above issue as both taxa have complete classifications now from WoRMS. Let me know if I'm missing anything.

KatherineLAnderson · 2019-04-15T22:49:52Z

unranked clade
Merriamosauria
UAM:ES:2437
@sharpphyl @KatherineLAnderson

Merriamosauria is a valid but unranked clade. Its classification in Arctos taxonomy is correct.

dustymc added Function-ExternalLinks Function-Taxonomy/Identification Help wanted I have a question on how to use Arctos labels Nov 30, 2017

dustymc added this to the Needs Discussion milestone Nov 30, 2017

dustymc closed this as completed Sep 5, 2018

dustymc mentioned this issue May 14, 2019

ID formula A sp. #1304

Closed
