taxonRank and aggregators #1338

Closed
dustymc opened this issue Nov 30, 2017 · 75 comments
Labels
Function-Taxonomy/Identification Help wanted I have a question on how to use Arctos

Comments

@dustymc
Contributor

dustymc commented Nov 30, 2017

Without taxonRank, iDigBio "fixes" various taxonomy terms to random values which are sometimes completely unrelated to the original ID.

taxonRank is not a required field in DWC.

Arctos does not require taxa to be ranked (which is an accurate representation of taxonomy itself). Some identifications do not use taxa at all, others use multiple taxa, all of it may be ranked or not.

iDigBio's suggestion is to "add[] taxonRank as a required field in Arctos" which isn't possible or practical for many reasons.

When ranked taxonomy is available we fill out the appropriate "columns" in DWC - "Family," "Order" etc. From that we could find the most specific term which is ranked, but not "The taxonomic rank of the most specific name in the scientificName" (as specified in the Standard). The lowest ranked term also does not necessarily appear in the scientificName at all.

I can't quite see what we could do before exporting the DWC that wouldn't just be wrong in some instances. As always, I'm open to suggestions.

@tucotuco

I recommend that, if the identification is to a single name at a single rank (majority of cases), provide it, otherwise leave it blank.
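
A minimal sketch of that rule, assuming an identification can be reduced to a list of (name, rank) pairs. The function name and the data shape are hypothetical and are not Arctos code:

```python
from typing import Optional, Sequence, Tuple

# Hypothetical shape: each taxon used in an identification is a
# (name, rank) pair; rank is None when the classification is unranked.
Taxon = Tuple[str, Optional[str]]


def taxon_rank_for_identification(taxa: Sequence[Taxon]) -> Optional[str]:
    """Return a rank only for a single name at a single rank, else None."""
    if len(taxa) != 1:
        # Zero taxa (free-text ID) or several taxa: leave taxonRank blank.
        return None
    _name, rank = taxa[0]
    # An unranked or empty rank also results in a blank value.
    return rank or None


print(taxon_rank_for_identification([("Aves", "class")]))
print(taxon_rank_for_identification([("Sorex", "genus"), ("Siphonaptera", "order")]))
```

The first call yields a rank and the second yields None, which would be exported as an empty taxonRank.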

@atrox10

atrox10 commented Nov 30, 2017 via email

@dustymc
Contributor Author

dustymc commented Nov 30, 2017

single name at a single rank

Seems possible, if perhaps somewhat expensive. I'll explore.

fill in something

I am REALLY hesitant to ignore published Standards, and even if we do our data will not always resolve to a singular "something." There may be a useful default, but I'm going to need explicit instructions for finding it.

iDigBio will put something crazy in for identification

This is obviously a bug in iDigBio.

@atrox10

atrox10 commented Nov 30, 2017 via email

@dustymc
Contributor Author

dustymc commented Nov 30, 2017

why are iDigBio and GBIF filling something in for this?

That is the question!

GBIF if they are doing the same thing as iDigBio

See https://www.idigbio.org/portal/records/24cc3e24-0cac-4a54-9877-5f458a191e18 (Decapoda: Animalia > Arthropoda > Insecta > Orthoptera > Tettigoniidae) vs https://www.gbif.org/occurrence/1145113729 (Decapoda: Animalia Arthropoda Malacostraca). GBIF is behaving predictably here. (But see #1291 (comment). GBIF has no idea how to handle ISO8601 dates for some reason. Manipulation by "portals" seems to always cause problems which users likely interpret as "Arctos is broken.")

I wrote a simple script to return rank for "single name at a single rank." It's running against accepted IDs in Arctos now and it should have done something in a few days. (I think it would be usably-fast in prod, I've got it throttled heavily to prevent any unanticipated problems for now). I'll post whatever falls out when it's done; perhaps there will be some solution to the more complicated situations evident from those data.

@tucotuco

tucotuco commented Nov 30, 2017 via email

@dustymc
Contributor Author

dustymc commented Dec 4, 2017

There's a first-pass attempt at getting taxonRank at https://github.com/ArctosDB/DDL/blob/master/functions/getTaxonRank.sql.

Here's the result:

UAM@ARCTOS> select taxon_rank || ' @ ' || count(*) from temp_test_taxon_rank group by taxon_rank order by taxon_rank;

TAXON_RANK||'@'||COUNT(*)
------------------------------------------------------------------------------------------------------------------------
author_text @ 1
canonical name @ 40
canonical_name @ 1
class @ 24240
error!: ORA-01403: no data found @ 645026
error!: ORA-01422: exact fetch returns more than requested number of rows @ 13692
family @ 156720
forma @ 11
genus @ 53231
hyporder @ 574
infraclass @ 4
infraorder @ 1
kingdom @ 3777
order @ 93419
phylum @ 17896
species @ 1911536
subclass @ 536
subdivision @ 1
subfamily @ 5862
suborder @ 3212
subphylum @ 20
subpspecies @ 55
subspecies @ 577788
superfamily @ 2576
superorder @ 82
tribe @ 552
variety @ 5133

27 rows selected.

Adjusting the script to use only classification terms from https://arctos.database.museum/info/ctDocumentation.cfm?table=CTTAXON_TERM or something would clean up a few things, but would also slow down the script. Perhaps we should clean up our taxonomy instead?

I don't see a pathway to

the ranks used have to be (major) Linnean ranks: kingdom, phylum, class, order, family, genus, species.

@ http://www-old.gbif.org/publishing-data/quality#dcTaxonRank2 - will violating that break something else?

"Error" data attached. OK, they're not because it's too big and #1345. I'll email it by request.

How should I proceed, if I should proceed?

@tucotuco

tucotuco commented Dec 5, 2017 via email

@dustymc
Contributor Author

dustymc commented Dec 5, 2017

an extra week to find one

That's part of my concern - we do something random, {whoever} turns it into some sort of "users think Arctos is broken" garbage, we don't notice because it's ONLY a few tens of thousands of records....

A way of knowing what's going on in portals would be really great. "We don't like something somewhere in this record for reasons we're not going to share" is difficult to work with.

@dustymc
Contributor Author

dustymc commented Dec 5, 2017

From AWG meeting:

  • Fill this in when not 'error....'
  • Check data in iDigBio in a couple months????

?????????

@atrox10

@ekrimmel

ekrimmel commented Dec 5, 2017

FWIW, links to TDWG's ideas for standardizing data quality tests across aggregators. I don't know how far along in practice any of this is.

@tucotuco

tucotuco commented Dec 5, 2017 via email

@Jegelewicz
Member

Jegelewicz commented Jul 31, 2018

Re-invigorating this thread as I will be talking about it at SPNHC.

Would it not be possible to have our names table include a field that was "taxon rank"? So:

scientific name, taxon rank
Aves, class
Bufo americanus, species

and so on. This way whatever identification is with the specimen would also tell iDigBio, GBIF that the ID is referring to a specific rank, regardless of what the Arctos classification has in it.

See also #1607

@dustymc
Contributor Author

dustymc commented Jul 31, 2018

names table

Names are not consistently ranked. Diptera is a genus and order, for example. We already (optionally) have ranks in classifications.
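
A small sketch of why a single "taxon rank" column on the names table would not work. It only encodes name/rank examples mentioned in this thread, and the list-of-pairs structure is hypothetical, not the Arctos model:

```python
from collections import defaultdict

# (name, rank) pairs; only examples mentioned in this thread.
ranked_usages = [
    ("Aves", "class"),
    ("Bufo americanus", "species"),
    ("Diptera", "order"),
    ("Diptera", "genus"),
]

# Collect every rank under which each name is used.
ranks_by_name = defaultdict(set)
for name, rank in ranked_usages:
    ranks_by_name[name].add(rank)

# Names that cannot be given one rank from the name string alone.
ambiguous = {
    name: sorted(ranks)
    for name, ranks in ranks_by_name.items()
    if len(ranks) > 1
}
print(ambiguous)  # {'Diptera': ['genus', 'order']}
```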

I'm not sure how #1607 is related.

@Jegelewicz
Member

Jegelewicz commented Jul 31, 2018

This is what is causing the Aves/Avus problem. As we are not passing a rank, they are assuming that Aves is a genus and that we are misspelling Avus.

Names are not consistently ranked. Diptera is a genus and order, for example. We already (optionally) have ranks in classifications.

And that is a problem for anyone who uses "Diptera" by itself (even with sp.) as an identification, because that is showing up at iDigBio and they don't care what we put in the classification. If we don't tell them via taxonrank that we are talking about the order, they will just assume it is the genus. Maybe names should be a pair. Name + taxon rank.

So we would have:
Diptera, order
Diptera, genus

and that would allow each to have its own, proper classification.

@dustymc
Contributor Author

dustymc commented Jul 31, 2018

they are assuming

I can't really do anything about that.

anyone who uses "Diptera" by itself (even with sp.) as an identification,

That's why we deal in data objects rather than strings.

http://arctos.database.museum/name/Diptera#Arctos
http://arctos.database.museum/name/Diptera#ArctosPlants

talking about the order

We provide that information. It's not always simple enough to pick a rank.

UAM@ARCTOS> select distinct phylclass from flat where scientific_name='Aves';

PHYLCLASS
------------------------------------------------------------------------------------------------------------------------
Aves

1 row selected.

Maybe names should be a pair. Name + taxon rank.

"Echidna, genus" is an eel and a snake and a mammal and some other stuff - that does not clarify anything in a great number of cases. It would also require ranks, which is not a useful taxonomy model.

@Jegelewicz
Member

they are assuming

I can't really do anything about that.

According to them, we can. All we have to do is provide the taxon_rank. We should figure out a way to do so.

@dustymc
Contributor Author

dustymc commented Jul 31, 2018

I'm certainly up for ideas! Mine's above - I'm not sure what could be done with the ~half-million misses.

@Jegelewicz
Member

How about adding a field like ID_Name_Rank with a controlled vocabulary that includes the ranks that iDigBio is looking for? We could populate the field for stuff already in Arctos from the lowest rank on the taxonomic classification, if there is one, attached to the taxon name. This information wouldn't have to be presented to the public. Like encumbered information, perhaps it only needs to be visible to operators.

For non-biological collections, this might be useful in other ways, but for now, they could use a term such as "not applicable" or "not provided".

@campmlc

campmlc commented Aug 1, 2018 via email

@dustymc
Contributor Author

dustymc commented Aug 1, 2018

ID_Name_Rank

I'm not following - where would you store this?

ranks that iDigBio is looking for

That's another problem.

Names should be scientific (latin) names at major Linnean ranks, like “Animalia” (kingdom) or “Rosaceae” (family). Not: common names (“animals”), abbreviations (“Rosac.”), intermediate rank levels (“Tetrapoda” (superclass)), or polyphyletic or non-taxonomic groupings (“algae”, “herbivora”).

Our "lowest ranks" do include things like superclass.

@Jegelewicz
Member

ID_Name_Rank would go with Identification fields:

TAXON_NAME ID_NAME_RANK ID_MADE_BY_AGENT MADE_DATE NATURE_OF_ID IDENTIFICATION_REMARKS

@Jegelewicz
Member

Jegelewicz commented Aug 1, 2018

ID_NAME_RANK the level of classification to which the TAXON_NAME belongs

Values would include, but not be limited to:

Kingdom
Phylum
Class
Order
Family
Genus
Species
Subspecies

@dustymc
Contributor Author

dustymc commented Aug 1, 2018

to which the TAXON_NAME

I think I understand, but that's not quite the right verbiage. (And I can get that from the classification data when things are that simple.) IDs are two levels away from taxon names - they implicitly include classification data, and may be comprised of zero or many taxa. I think I'd suggest being more straightforward, unless there's some use case I'm not seeing.

Term: taxonRank
Definition: value to provide for DWC:taxonRank

Subspecies

I'd guess that falls in "intermediate rank levels" which idigbio seems to not recognize, but there's no documentation that I can find so I don't really know.

@Jegelewicz
Member

So does that seem like a workable solution?

@dustymc
Contributor Author

dustymc commented Aug 15, 2018

Thanks!

https://repository.si.edu/bitstream/handle/10088/23079/SMC_7_Meek_1864_8_ii-40.pdf?sequence=1&isAllowed=y (and others) strongly suggests that Heteroceras angulatum (Meek & Hayden) is an ammonite. OK to delete the Arthropoda (phylum) classification?

@anna-chinn

anna-chinn commented Aug 15, 2018

OK to delete the Arthropoda (phylum) classification?

Yes!

@dustymc
Contributor Author

dustymc commented Aug 15, 2018

thx/done

@dustymc
Contributor Author

dustymc commented Aug 16, 2018

broken

https://www.idigbio.org/portal/records/18805a8a-84d0-4ecd-8e56-013ce3a0b21f

"dwc:phylum": "Chordata",
"dwc:class": "Aves",
"dwc:scientificName": "Aves"

( "dwc:preparations": "long bone",)

screen shot 2018-08-15 at 5 21 45 pm

?

@Jegelewicz
Member

iDigBio receives many identifications with incomplete or non-existent classification data. We are as guilty as anyone of sending them such data (for an example, see http://arctos.database.museum/name/Childonias%20niger ). In an attempt to clarify and make these identifications searchable via means of higher taxa, iDigBio has created scripts to fill in or create the missing pieces. As with any script (we should know) there are things that will not turn out as expected. One of the scripts says that an identification that is monomial will be assumed to be a genus UNLESS Taxon_Rank says otherwise. In the case of the Aves, we are not giving iDigBio a Taxon_Rank, so the script first assumes we have left out part of the classification (order and family) and because there is no genus Aves, it then assumes we have misspelled Avus and assigns the classification you see as "broken" above.

While we can debate the merits of iDigBio logic all day, we could just fix this issue by providing Taxon_Rank. It isn't really all that difficult for us to do and it will make our data look good.

Can we just do this please?

@Jegelewicz
Member

image

All of these records are being altered in some way when they reach iDigBio. This is only from collections to which I have access....

@dustymc
Contributor Author

dustymc commented Aug 16, 2018

I need details of WHAT and HOW, and I think we need input from at least @atrox10 if the "how" involves manually setting a value for each ID.

You are proposing only this:

  • add identification.taxon_rank to the model and UI
  • push any values that appear there to DWC

Correct?

@Jegelewicz
Member

add identification.taxon_rank to the model and UI

Yes

push any values that appear there to DWC

Yes to dwc:taxonrank

details of WHAT and HOW

  • I think this should either be a bulkloader field to be completed along with the identification or we should make this a term associated with taxon name, required when creating a new name (will not solve the homonym issue).
  • For everything already in Arctos, we could populate with "species" if a bi or trinomial and use the lowest classification possible for monomials with an attached classification.
  • NULL should be a valid entry, at least for now since that is what we are sending for everything at this point.
  • The rest will need to be figured out by humans and any of those we auto-populate may end up doing dumb stuff until someone figures them out, something the Taxonomy Committee could help with.
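
As a rough illustration of the backfill rule in the second bullet, the sketch below guesses a rank from the shape of the name. The function name and the ordered rank list are hypothetical, and treating every trinomial as "species" simply follows the bullet as written:

```python
from typing import Optional, Sequence


def backfill_rank(scientific_name: str,
                  classification_ranks: Sequence[str] = ()) -> Optional[str]:
    """Guess a rank for an existing identification.

    classification_ranks: ranks present in the attached classification,
    ordered from highest to lowest (hypothetical input shape).
    """
    words = scientific_name.split()
    if len(words) in (2, 3):
        # Binomial or trinomial: "species", per the proposal.
        return "species"
    if len(words) == 1 and classification_ranks:
        # Monomial: lowest rank available in its classification.
        return classification_ranks[-1]
    # Anything else stays NULL for a human to resolve.
    return None


print(backfill_rank("Bufo americanus"))
print(backfill_rank("Aves", ["kingdom", "phylum", "class"]))
print(backfill_rank("Mus musculus domesticus and Siphonaptera"))
```

Counting words is naive: names carrying qualifiers such as "sp." or "?" would need extra handling before a rule like this could be trusted.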

@dustymc
Contributor Author

dustymc commented Aug 16, 2018

be a bulkloader field

That's possible, but slightly less than trivial.

term associated with taxon name

Here's where I get lost. If there's a taxon name (and classification, via collection preferences) used in a straightforward manner and that classification contains some ranks, I can get taxon rank from it. If there's a term used in a less-straightforward manner, I can get the rank of the term, but it's not necessarily strictly the same "level" as the ID (a "Sorex cinereus ?" ID would return "species" - the assertion is more like "probably species." Better than "IDK, maybe a bug?"!) I'm not sure what you're hoping to accomplish with this addition - it might be useful in fringe cases (I'm not likely to find a defensible rank for "Mus musculus domesticus and Siphonaptera"), but this seems like a lot of work for that, and I don't think you can explicitly assert a defensible taxon rank to that either.

required when creating a new name

That would be a major change to the model and definitely require discussion.

solve the homonym issue

I'm still not real clear on what this is either.

For everything already in Arctos, we could populate with "species" if a bi or trinomial and use the lowest classification possible for monomials with an attached classification.

I can do that on the fly with no data additions.

NULL should be a valid entry, at least for now since that is what we are sending for everything at this point.

My scripts would return NULL for "Mus musculus domesticus and Siphonaptera" - and I think that's the only defensible value for something like that.

The rest will need to be figured out by humans

See above - I doubt that's always going to be possible.

any of those we auto-populate may end up doing dumb stuff

Aside from the whole '"Sorex ?" is taxonRank genus' thing, the only way my scripts are going to find dumb stuff is if there's dumb stuff in the classification data. (In which case GIGO applies.)

Taxonomy Committee

Strictly speaking, this is an identification problem. (Or it's a bit of both, maybe.)

I'm still unclear regarding what an explicit assertion can do that we can't pull from the data.

@Jegelewicz
Member

I am fine with just pulling from the data. Anything else can be part of a larger taxonomy discussion.

@dustymc
Contributor Author

dustymc commented Aug 31, 2018

A first-pass is completed in production. Please let me know if you see anything the scripts should have figured out and I can update them. I'll start pushing these data to FLAT, where they'll be available to DWC - hopefully that'll be done and maintaining itself by early next week.

Attached are IDs for which I could not determine taxon rank. Most are A {string} (and not having a taxon is the point of that formula) but a fair number are just bad data in Arctos.

http://arctos.database.museum/name/Trivia%20nivea has no classification data at all

http://arctos.database.museum/name/Urosaurus%20bicarinatus%20anonymorphus has minimal data

http://arctos.database.museum/name/Myriapoda has a term ranked two ways

http://arctos.database.museum/name/Aves#Arctos (what started this?!) claims to be a genus....

Ditto http://arctos.database.museum/name/Phocidae

And there's a lot of standardization to be had in A {string} names - looks like there are about a dozen ways of saying "flake." Time to figure out formal non-Linnean taxonomies? @sjshirar

temp_notaxrank.csv.zip

@Jegelewicz
Member

Jegelewicz commented Sep 3, 2018

http://arctos.database.museum/name/Aves#Arctos (what started this?!) claims to be a genus.... Ditto http://arctos.database.museum/name/Phocidae

This is what I complained about before. I am not able to delete that genus assertion no matter how hard I try. When I delete it it looks like it is gone, but when I leave and come back, it is there again.

@Jegelewicz
Member

Jegelewicz commented Sep 3, 2018

http://arctos.database.museum/name/Trivia%20nivea has no classification data at all

Also has no specimens - delete?

@Jegelewicz
Member

http://arctos.database.museum/name/Urosaurus%20bicarinatus%20anonymorphus has minimal data

Filled in basic higher taxa.

@Jegelewicz
Member

http://arctos.database.museum/name/Myriapoda has a term ranked two ways

Removed the class term because Myriapoda is a subphylum. https://en.wikipedia.org/wiki/Myriapoda

But check it, because I get the flaky genus auto-suggest that, again, I cannot seem to delete....

@dustymc
Contributor Author

dustymc commented Sep 3, 2018

delete that genus assertion

It's a suggestion, not an assertion (until someone clicks save anyway!) - #1188 (comment)

[four screenshots, 2018-09-03, 9:23 am]

no specimens - delete

http://arctos.database.museum/SpecimenResults.cfm?anyTaxId=10942232 is a specimen using it (if your login allows that). The ONLY reason to delete a taxon is you're sure it's not validly published (and maybe not then if it's useful - http://handbook.arctosdb.org/documentation/taxonomy.html)

@campmlc

campmlc commented Sep 3, 2018 via email

@dustymc
Contributor Author

dustymc commented Sep 3, 2018

#1188 (comment).

The current behavior seems generally less-evil than any alternatives I've found, and it arose from data problems. My increasingly-strong preference remains to deprecate the single-record edit option altogether and pipe everything through the bulkloader (which hopefully would mostly be an extension of the hierarchical editor, but that approach does not constrain us to local hierarchical data).

@ejbrock has been making lots of bird taxonomy edits via the hierarchical tool, and it's likely that this has been making things harder to find than they need to be. It's absolutely not possible to be consistent when editing 2.5 million records one by one, and it's absolutely not possible to be inconsistent within a hierarchical structure. The single-record edits which introduce inconsistency fracture hierarchies, which both hides specimens from users and makes future large-scale edits difficult. (What should be a single hierarchy becomes one - often a very short one - for each inconsistency.)

@Jegelewicz
Member

It's a suggestion, not an assertion (until someone clicks save anyway!) - #1188 (comment)

So, when someone goes to add the author name, this "suggestion" will get saved if they don't really know to delete it every time. That seems a little evil to me.

@Jegelewicz
Member

I am still really unclear on how this bulkloader works, especially if I just need to add one taxon. This is something that anyone with taxonomy access needs to be trained with.

@dustymc
Contributor Author

dustymc commented Sep 4, 2018

a little evil

Absolutely, but less evil than the alternative. (Most monomials are genera - I'm just playing the odds, and always up for better ideas.)

bulkloader ... add one taxon

For that use case, probably sort of a pain. I think we get to pick our poison - is requiring a lot of work for a "simple" task less-evil than providing a path which consistently introduces inconsistent data? I'm leaning that way; the tiny bits of garbage find ways to propagate out into huge messes.

@Jegelewicz
Member

Would it be possible to add TAXON_RANK in the Non-Classification table that you could then use to modify your suggestions? If I had "Class" in the TAXON_RANK field, then you would know not to suggest anything below Class. Suggested controlled vocabulary for this field:

Kingdom
Phylum
Class
Order
Family
Genus
Species

Like this
capture

@campmlc

campmlc commented Sep 5, 2018 via email

@dustymc
Contributor Author

dustymc commented Sep 5, 2018

That's possible, but...

  1. I'm not sure I see the point - would it ever be anything other than the lowest term in the hierarchy?
  2. That isn't really useful for the complex/problem IDs, and IDs are what we share via DWC.

assumptions

There are 159,746 monomials in Arctos. 150,646 of them are ranked 'genus' in a local classification. ~94% of the time, the "monomials are genera" assumption is correct (I assume that's the reference??).
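
A quick check of that figure, using only the two counts quoted above:

```python
monomials = 159_746      # all monomials in Arctos, as quoted above
ranked_genus = 150_646   # monomials ranked 'genus' in a local classification

genus_share = ranked_genus / monomials
print(f"genus: {genus_share:.1%}")          # genus: 94.3%
print(f"not genus: {1 - genus_share:.1%}")  # not genus: 5.7%
```

So roughly 9,100 monomials (about 1 in 18) would be mislabeled by a blanket "monomials are genera" default.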

@dustymc
Contributor Author

dustymc commented Sep 5, 2018

This is now implemented in production, and the data are available as taxonRank in the IPT view.

This also turned out to be the straw that broke the camel's DBMS_COMPARISON's back; we had to significantly change the way in which public data are processed, and that should now happen more or less in real-time.

Here are the data from FLAT:

UAM@ARCTOS> select taxon_rank || ' @ ' || count(*) from flat group by taxon_rank order by count(*);

TAXON_RANK||'@'||COUNT(*)
------------------------------------------------------------------------------------------------------------------------
unranked clade @ 1
subdivision @ 1
infraorder @ 3
epifamily @ 4
infraclass @ 4
subpspecies @ 4
forma @ 13
subphylum @ 22
superorder @ 93
subclass @ 552
hyporder @ 574
tribe @ 633
superclass @ 1936
superfamily @ 2665
suborder @ 3301
variety @ 5729
subfamily @ 7321
kingdom @ 10831
phylum @ 16500
class @ 23381
order @ 94951
genus @ 128756
family @ 164410
 @ 561031
subspecies @ 564074
species @ 1931137

26 rows selected.

And a bit of funky data that made it in.

select taxon_rank,scientific_name,guid from flat where taxon_rank not in (select TAXON_TERM from CTTAXON_TERM where IS_CLASSIFICATION=1) order by taxon_rank,scientific_name;

TAXON_RANK
------------------------------------------------------------------------------------------------------------------------
SCIENTIFIC_NAME
------------------------------------------------------------------------------------------------------------------------
GUID
------------------------------------------------------------------------------------------------------------------------
subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:17947

subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:26718

subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:26806

subpspecies
Ficus papyratia lindae
DMNS:Inv:16814

unranked clade
Merriamosauria
UAM:ES:2437

@sharpphyl @KatherineLAnderson

dustymc closed this as completed Sep 5, 2018
@sharpphyl

subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:17947

subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:26718

subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:26806

subpspecies
Ficus papyratia lindae
DMNS:Inv:16814

It looks like switching from Arctos to WoRMS (via Arctos) has solved the above issue as both taxa have complete classifications now from WoRMS. Let me know if I'm missing anything.

@KatherineLAnderson

unranked clade
Merriamosauria
UAM:ES:2437


@sharpphyl @KatherineLAnderson

Merriamosauria is a valid but unranked clade. Its classification in Arctos taxonomy is correct.

9 participants