Classification Cloning #1641

Closed
dperriguey opened this issue Aug 7, 2018 · 61 comments
Labels
Function-Taxonomy/Identification
NeedsDocumentation: When the issue is resolved in Arctos repository, this should be moved to the Documentation-wiki repo
Priority-Low (Wish list): I don't want to forget this, but it doesn't need to be done immediately

Comments

@dperriguey

dperriguey commented Aug 7, 2018

I normally use the Paleobiology Database classification trees, when they are available through Arctos, to edit my specimen classifications. Most of the time there is an author associated with the name:
image

After cloning, however, the author is not automatically filled in on the "edit classification" screen:
image
image

Is there a way to fix this, @dustymc?

@dustymc
Contributor

dustymc commented Aug 9, 2018

Terms carried over are limited to https://arctos.database.museum/info/ctDocumentation.cfm?table=CTTAXON_TERM (and you should be getting a big red warning to that effect).

I might be able to guess that "whatever they use" means "our equivalent" - there's no standardization of terms from GN - but in this case I don't think we have a local equivalent, and I don't think I can reliably extract what we need from strings like that, so I suspect this would lead to the introduction of malformed data.

I would still prefer to deprecate that form and push all updates through the hierarchical editor, which produces consistent data.

@campmlc

campmlc commented Aug 9, 2018 via email

@dustymc
Contributor

dustymc commented Aug 9, 2018

documentation

I don't think we have a comparison.

taxonomy meeting/training

Yes!

difference

In short,

  • Taxonomy (as a body of literature - the only useful scope for something like Arctos) is not hierarchical. The core structure reflects that - it supports multiple classifications (family=Muridae + family=Cricetidae), unranked terms ("phylocode-like"), 2 "family" terms in a single classification (probably never a good idea, but it's still there), and I think everything else that anyone's ever thought was "taxonomy."

  • The single-record editor writes to that. Neotoma albigula->family=Cricetidae and Neotoma micropus->family=Muridae works just fine. A user searching "Muridae" under those data won't find the albigula.

  • Hierarchical data are inherently normalized. A term occurs once, and each term has exactly zero-or-one parents.

  • pick one

    • Neotoma
      • Neotoma albigula
      • Neotoma micropus

is the only possible organization. That won't work for all collections and users (some folks NEED the non-hierarchical data - or at least they said they did so we built the structure to support it!) - but when it does work it's easy to be consistent (or impossible to be inconsistent), which leads to users finding what they're looking for.

It's also easier to manage. When "pick one" becomes "pick another" you just update the parent of Neotoma and all children (subfamily, species, subspecies, etc. - ALL of them!) automagically follow along.
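The "each term has exactly zero-or-one parents" idea can be sketched with a plain parent-pointer map. This is only an illustration: the dictionary, the function names, and the choice of Cricetidae as the starting family are assumptions for the example, not the Arctos schema or the hierarchical editor's code.

```python
# Hypothetical in-memory model: each term maps to its single parent (or None).
parent = {
    "Neotoma albigula": "Neotoma",
    "Neotoma micropus": "Neotoma",
    "Neotoma": "Cricetidae",
    "Cricetidae": None,
    "Muridae": None,
}

def lineage(term):
    """Walk parent pointers from a term up to its root."""
    path = []
    while term is not None:
        path.append(term)
        term = parent[term]
    return path

def descendants(ancestor):
    """Every term whose lineage passes through `ancestor`."""
    return sorted(t for t in parent if t != ancestor and ancestor in lineage(t))

before = descendants("Muridae")

# "pick one" becomes "pick another": a single edit to the parent of Neotoma.
parent["Neotoma"] = "Muridae"

after = descendants("Muridae")
print(before)
print(after)
```

Because the species only point at Neotoma, they never need to be touched when the family changes, which is the consistency argument made above.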

The hierarchical tool is NOT so easy to set up. Inconsistencies in the data - like those inevitably introduced by the single-record tool - make import difficult, it's (very purposefully) not possible to edit single records, etc. If all local edits went through the hierarchical tool, those would mostly be one-time issues. As long as we have the single-record tool, those inconsistencies will continue to reappear.

@dperriguey
Author

I guess my only question was if we could automate the non-classification terms. In my example the name string contains "(Morton 1842)". If that cannot be pulled, I would have to manually enter the author_text, correct?

@dustymc
Contributor

dustymc commented Sep 7, 2018

automate the non-classification terms.

I see no local equivalents to "name string," so I'm not sure what you're asking.

@campmlc

campmlc commented Sep 7, 2018 via email

@dustymc
Contributor

dustymc commented Sep 7, 2018

Thanks, AWG discussion seems very useful. Perhaps we should extend an invitation to anyone who wants it?

More at #1609

Here's my take:

  • classifications aren't necessary to catalog specimens - nothing we do here can interfere with data entry.
  • editing records singly (so eg, it's possible to have an infinite number of pathways between the terms "Animalia" and "Sorex") provides a mechanism to hide specimens, and to hide the terms from tools which can make them consistent.
  • editing records singly introduces funky data which breaks aggregators and makes Arctos look bad (Arctos and iDigBio - my SPNHC presentation will make Arctos look bad, please help #1607)

I see a lot of negatives and few real positives to keeping the single-record editor around. Being able to quickly/easily add data of limited utility doesn't seem terribly useful to me. I'm not looking at the world from the perspective of a collection manager either.

There are currently 2,450,310 names in Arctos. 11,252 of them have no local classification data whatsoever, 183,429 have no term ranked "kingdom," 170,968 have no class term, and 88,856 have no order term. I have no idea what percentage of classifications are inconsistent, but it's significant. I don't think I've ever tried to use the hierarchical tool without finding a few outliers.

The data from GlobalNames provides a path to most (??) of the specimens which would otherwise be obscured by our messy local data, which I think makes this much less critical than it would be otherwise. But that also only works from one search field ("Any taxon"). It won't help a user who searches for eg, a genus (which will probably work) and a class (eg, to disambiguate homonyms): there's a decent chance at least a couple of species in the genus won't have consistent class data and so would be excluded from that search.
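A toy version of that genus-plus-class search shows how a record with a missing term drops out. The records and the `search` helper are made up for illustration (which species lacks a class term is invented here); this is not how Arctos search is implemented.

```python
# Hypothetical flat classification records; the second one has no "class" term.
records = [
    {"name": "Neotoma albigula", "genus": "Neotoma", "class": "Mammalia"},
    {"name": "Neotoma micropus", "genus": "Neotoma"},
]

def search(records, terms):
    """Return names of records matching every requested rank=term pair."""
    return [
        r["name"]
        for r in records
        if all(r.get(rank) == value for rank, value in terms.items())
    ]

by_genus = search(records, {"genus": "Neotoma"})
by_genus_and_class = search(records, {"genus": "Neotoma", "class": "Mammalia"})
print(by_genus)
print(by_genus_and_class)
```

The genus-only search finds both species, while adding the class term silently excludes the record that has no class data.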

@Jegelewicz
Member

I have a (crazy?) suggestion. You can create taxa downloads from PBDB. Dusty is correct that classifications are not necessary to identify specimens, but perhaps an occasional list of names without classifications from UNM ES could be generated, and then the classifications downloaded from PBDB and uploaded to Arctos. It is pretty easy to find the author text, so that could be uploaded along with the classification.
PBDB_test.txt
Actually, I don't think it would be too hard to arrange this download so that it would fit into the taxon bulkload tool....

You can include the author text. One sample from the attached list is:
"orig_no","taxon_no","record_type","flags","taxon_rank","taxon_name","taxon_attr","common_name","difference","accepted_no","accepted_rank","accepted_name","parent_no","ref_author","ref_pubyr","reference_no","is_extant","n_occs","phylum","class","order","family","genus","type_taxon"

"15403","15403","txn","B","genus","Placenticeras","Meek 1876","","","15403","genus","Placenticeras","61786","Sepkoski","2002","6930","extinct","206","Mollusca","Cephalopoda","Ammonitida","Placenticeratidae","Placenticeras",""
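As a rough sketch of pulling the author text out of that download, the sample row can be read with Python's standard `csv` module. The PBDB column names come from the header above; the output keys (`scientific_name`, `author_text`) and the assumption that `taxon_attr` is what belongs in author_text are illustrative guesses, not the actual bulkloader format.

```python
import csv
import io

HEADER = ('"orig_no","taxon_no","record_type","flags","taxon_rank","taxon_name",'
          '"taxon_attr","common_name","difference","accepted_no","accepted_rank",'
          '"accepted_name","parent_no","ref_author","ref_pubyr","reference_no",'
          '"is_extant","n_occs","phylum","class","order","family","genus","type_taxon"')
ROW = ('"15403","15403","txn","B","genus","Placenticeras","Meek 1876","","",'
       '"15403","genus","Placenticeras","61786","Sepkoski","2002","6930",'
       '"extinct","206","Mollusca","Cephalopoda","Ammonitida","Placenticeratidae",'
       '"Placenticeras",""')

RANK_COLUMNS = ["phylum", "class", "order", "family", "genus"]

def to_local_record(row):
    """Map one PBDB row to a hypothetical name/author/classification record."""
    record = {
        "scientific_name": row["taxon_name"],
        "author_text": row["taxon_attr"],  # assumed mapping
    }
    # Keep only the ranked terms that are actually filled in.
    record.update({rank: row[rank] for rank in RANK_COLUMNS if row[rank]})
    return record

rows = list(csv.DictReader(io.StringIO(HEADER + "\n" + ROW + "\n")))
record = to_local_record(rows[0])
print(record)
```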

@dustymc
Contributor

dustymc commented Dec 13, 2018

Doesn't seem so crazy to me. I've always been a fan of scooping up anything that looks like taxonomy for Arctos - that makes things easier for the folks who'll need those names, helps keep them consistent with similar or related names, gets them involved in our updates (#1761), gives us a chance to find the garbage (and tune our garbage-filters), lets us add relationships that help users find specimens, etc.

Why not just grab everything? Arctos has a name-loader too.

@Jegelewicz
Member

I say go for it, but I will let others weigh in...

@DerekSikes

DerekSikes commented Dec 14, 2018 via email

@Jegelewicz
Member

Grabbing everything from Paleobiology Database to create taxonomy in Arctos.

@DerekSikes

DerekSikes commented Dec 14, 2018 via email

@Jegelewicz
Member

@dustymc what do we need to do to make this happen?

@Nicole-Ridgwell-NMMNHS

I like the idea of grabbing all the taxonomy from the PBDB and pulling it into Arctos. It is a very well vetted resource.

@sharpphyl

I did a test run and it looks very helpful for our collection.

@dustymc
Contributor

dustymc commented Mar 18, 2019

Anyone can do this; Arctos provides tools. I can smash buttons if nobody else wants to, but this would be better done by someone who intends to use the data.

@Jegelewicz
Member

Are there instructions somewhere?

@dustymc dustymc added the NeedsDocumentation When the issue is resolved in Arctos repository, this should be moved to the Documentation-wiki repo label Mar 18, 2019
@dustymc
Contributor

dustymc commented Mar 18, 2019

Are there instructions somewhere?

Not really, but there should be.

  • get it into the classification loader format
  • pull distinct scientific names
  • check what exists, somehow deal with them
  • create names
  • load classifications
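The middle steps of that list (pull distinct names, check what exists, decide what to create) could look something like the sketch below. The `scientific_name` key, the `plan_load` function, and the in-memory set of existing names are all assumptions for the example; getting data into the loader format and the actual loading are left out.

```python
def plan_load(source_rows, existing_names):
    """Split the distinct source names into ones that exist and ones to create."""
    # Step: pull distinct scientific names (trimming stray whitespace).
    distinct = sorted({row["scientific_name"].strip() for row in source_rows})
    # Step: check what exists, so those can be dealt with separately.
    already_exists = [n for n in distinct if n in existing_names]
    # Step: whatever is left is a candidate for name creation.
    to_create = [n for n in distinct if n not in existing_names]
    return {"already_exists": already_exists, "to_create": to_create}

source_rows = [
    {"scientific_name": "Placenticeras"},
    {"scientific_name": "Microtus cautus"},
    {"scientific_name": "Placenticeras "},  # duplicate with trailing space
    {"scientific_name": "Neotoma micropus"},
]
existing_names = {"Neotoma micropus"}

plan = plan_load(source_rows, existing_names)
print(plan)
```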

@Jegelewicz
Member

Jegelewicz commented Mar 18, 2019

HMMMM, I know how to do all that, I need to figure out how to get it out of PBDB....without it being one HUGE file.

@dustymc
Contributor

dustymc commented Mar 29, 2019

Actually never mind, let it cook as long as you need to, I'll just delete my mess and pull it again if necessary.

I added 290861 'valid' names so GN can start chewing on them, and they should be available for cataloging now.

FYI - the counts aren't perfect because there are some duplicates but fairly close.

UAM@ARCTOS> select name_status, count(*) from temp_pdbd group by name_status;

NAME_STATUS                    COUNT(*)
-----------------------------  --------
Double spaces detected                2
valid                            293117
already_got_one                   65817
Invalid characters.               24726
"sp" is not a valid name-part         1


5 rows selected.
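For anyone preparing their own list, a pre-check that produces the same kind of status tally might look like this. The status strings are copied from the output above, but the rules behind them are guesses (the real name-loader checks are not shown in this thread), so treat the regular expression and the order of the checks as assumptions.

```python
import re
from collections import Counter

# Assumed rule: letters, spaces, periods, and hyphens only.
ALLOWED = re.compile(r"[A-Za-z .\-]+")

def name_status(name, existing_names):
    """Classify a candidate name with statuses like those in the tally above."""
    if "  " in name:
        return "Double spaces detected"
    if not ALLOWED.fullmatch(name):
        return "Invalid characters."
    if any(part.rstrip(".") == "sp" for part in name.split()):
        return '"sp" is not a valid name-part'
    if name in existing_names:
        return "already_got_one"
    return "valid"

existing = {"Placenticeras"}
candidates = [
    "Microtus cautus",
    "Placenticeras",
    "Neotoma  micropus",
    "Sorex sp",
    "Sorex #1",
]

counts = Counter(name_status(name, existing) for name in candidates)
for status, n in counts.items():
    print(status, n)
```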

@reminders reminders bot removed the reminder label Apr 5, 2019
@reminders

reminders bot commented Apr 5, 2019

👋 @Jegelewicz, check back on this by

@Jegelewicz
Member

/remind @Jegelewicz discuss with paleo people 04/16/2019

@reminders reminders bot added the reminder label Apr 7, 2019
@reminders

reminders bot commented Apr 7, 2019

@Jegelewicz set a reminder for Apr 16th 2019

@Jegelewicz
Member

I made that name and pulled from GN - http://arctos.database.museum/name/Microtus%20cautus#ThePaleobiologyDatabase. Looks like the PBDB data will all magic in if I just create the names.

I suppose my inclination is to just ignore everything that doesn't fit nicely in a local classification, and create Arctos classifications when there isn't one for the names in PBDB. I'm not very enthusiastic about that, and those of you working on making consistent data are probably less enthusiastic about cleaning up ~300K new messes.

OK, so I have experienced this first hand as I added a bunch of classifications for OWU this week. The PBDB stuff isn't compatible with the usual Kingdom, Phylum... order of the taxonomic hierarchy and you are correct that it will make inconsistent classifications and probably hide specimens. Having the names in Arctos is a start - one less thing that needs to be done, but I don't know about those classifications. I'll put this back on the Taxonomy Committee agenda for discussion.

Also, @dperriguey @Nicole-Ridgwell-NMMNHS @KatherineLAnderson @mbprondzinski may have comments or suggestions.

@reminders reminders bot removed the reminder label Apr 16, 2019
@reminders

reminders bot commented Apr 16, 2019

👋 @Jegelewicz, discuss with paleo people

@dustymc dustymc added Priority-Low (Wish list) I don't want to forget this, but it doesn't need to be done immediately and removed Priority-Critical (Arctos is broken) Critical because it is breaking functionality. labels Apr 18, 2019
@Jegelewicz Jegelewicz reopened this Jun 17, 2019
@campmlc

campmlc commented Aug 22, 2019

What about the idea of "add all of that stuff to our existing data"? This could be done for vertebrates, anyway, at least for some of the taxonomic ranks. Not a solution, but maybe make things a bit more discoverable across time scales and collection types.

@dustymc
Contributor

dustymc commented Aug 22, 2019

@campmlc I think it would be useful, but compare eg

species = Microtus cautus
genus = Microtus
tribe = Arvicolini
family = Arvicolinae
family = Cricetidae
superfamily = Muroidea
infraorder = Myodonta
phylorder = Rodentia
phylorder = Glires
unranked clade = Euarchontoglires
unranked clade = Placentalia
subclass = Eutheria
infraclass = Tribosphenida
class = Mammalia
unranked clade = Mammaliaformes
unranked clade = Mammaliamorpha
unranked clade = Probainognathia
infraorder = Eucynodontia
unranked clade = Epicynodontia
family = Cynodontia
superorder = Therapsida
subclass = Synapsida
unranked clade = Amniota
suborder = Cotylosauria
subclass = Batrachosauria
unranked clade = Anthracosauria
unranked clade = Reptiliomorpha
subphylum = Tetrapoda
unranked clade = Tetrapodomorpha
subclass = Dipnotetrapodomorpha
unranked clade = Sarcopterygii
class = Osteichthyes
superclass = Gnathostomata
subphylum = Vertebrata
phylum = Chordata
unranked clade = Deuterostomia
unranked clade = Nephrozoa
unranked clade = Triploblastica
unranked clade = Animalia
unranked clade = Opisthokonta
kingdom = Eukaryota
unranked clade = Life

and https://arctos.database.museum/name/Microtus#Arctos

I think a person would have to go through the existing ~3 million records, make sure they're not adding Arvicolinae to Microtus-the-treefern, etc.

And there's the whole "not sure I'm ready to admit that I'm a fish" thing - do we really want searches for "Osteichthyes" turning up mice? Maybe we should, but it would still overwhelm some users, make it difficult to discover what most think of as fish, etc.
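One concrete way to see why the PBDB lineage for Microtus cautus doesn't drop into a one-term-per-rank classification is to count the ranks that carry more than one term. The pairs below are an excerpt of the list quoted in this comment; the function name and the decision to skip "unranked clade" are choices made for this sketch.

```python
from collections import defaultdict

# Excerpt of the PBDB lineage for Microtus cautus, as (rank, term) pairs.
lineage = [
    ("species", "Microtus cautus"),
    ("genus", "Microtus"),
    ("tribe", "Arvicolini"),
    ("family", "Arvicolinae"),
    ("family", "Cricetidae"),
    ("superfamily", "Muroidea"),
    ("infraorder", "Myodonta"),
    ("phylorder", "Rodentia"),
    ("phylorder", "Glires"),
    ("unranked clade", "Euarchontoglires"),
    ("class", "Mammalia"),
    ("infraorder", "Eucynodontia"),
    ("family", "Cynodontia"),
    ("class", "Osteichthyes"),
    ("phylum", "Chordata"),
    ("kingdom", "Eukaryota"),
]

def repeated_ranks(pairs, ignore=("unranked clade",)):
    """Return ranks that appear with more than one term, with their terms."""
    terms_by_rank = defaultdict(list)
    for rank, term in pairs:
        if rank not in ignore:
            terms_by_rank[rank].append(term)
    return {rank: terms for rank, terms in terms_by_rank.items() if len(terms) > 1}

conflicts = repeated_ranks(lineage)
for rank, terms in conflicts.items():
    print(rank, terms)
```

Even in this excerpt, family, infraorder, phylorder, and class each carry more than one term, so any merge into existing local data would need a rule for which term wins.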

@Jegelewicz
Member

We are all fish...

@Jegelewicz
Member

I say we close this - there doesn't seem to be an easy way to bring PBDB classifications into Arctos.

@dustymc dustymc mentioned this issue Apr 10, 2020
@Nicole-Ridgwell-NMMNHS

Can we revisit this? Now that we can have multiple sources can we add PBDB as a new, externally maintained, automatically updated source? I think this would really help the taxonomy for our collection.

@dustymc
Contributor

dustymc commented Dec 16, 2020

revisit

Sure - there's no longer any reason to try to make it consistent with anything else.

automatically updated

So is this just a request to allow Sources from GlobalNames to be preferred by collections?

@Nicole-Ridgwell-NMMNHS

So is this just a request to allow Sources from GlobalNames to be preferred by collections?

Yes, I suppose so.

@Jegelewicz
Member

So is this just a request to allow Sources from GlobalNames to be preferred by collections?

This would be a good demo of that.


8 participants