local taxonomy editor #1000

Closed
dustymc opened this issue Dec 14, 2016 · 26 comments
Assignees
Labels
Enhancement I think this would make Arctos even awesomer! Function-Taxonomy/Identification Priority-High (Needed for work) High because this is causing a delay in important collection work..

Comments

@dustymc
Contributor

dustymc commented Dec 14, 2016

Since the GNITE project seems to be dead, there is a very basic prototype/proof of concept of a potential local hierarchical taxonomy editor running at TEST/fix/taxonomyTree.cfm.

The data model is purely hierarchical; there is a unique index on names, all names have one-or-zero parents, etc. This is NOT acceptable data modeling at the scope of Arctos (eg, we must deal with homonyms), but I think (especially after seeing the data rejected by #891!) that it's a useful approach to limited-scope classifications - data used to catalog specimens in one collection (or several related collections etc. - it might work for "[in]vertebrates"). There are lots of examples where a hierarchical model is insufficient even at the collection level, but I think SOMEHOW dealing with the rare homonym etc. is less evil than trying to manage data in our "do everything" core model. (I don't foresee changing the core model - homonyms, alternate classifications, etc., could still be dealt with there. This tool is intended only as a way to manage subsets of "Arctos taxonomy," which is already a merger from many sources.)

The tool is extremely fragile:

  • there is no error handling
  • there are no status indicators
  • it'll happily add copies of things that have already been added to the tree
  • it uses computationally expensive and limited (1000 terms) SQL, which would need to be rewritten for a production environment

If you try something and nothing happens, just reload and try something else. Please comment on this Issue if something weird and not already documented happens.

  • It should load with root-level terms (currently two superkingdoms).
  • Doubleclick any node to expand. Nodes currently all look the same, and if there are no children nothing obvious will happen.
  • Search will replace the entire tree with the search results, even when that result set is empty. (Searching the local tree could be enabled by buying an upgrade to the library - a few hundred dollars.)
  • Drag anything to anything else to rearrange. You'll just get an alert of what would happen if that was functional.

There is no provision for non-classification term-level metadata (authors etc.) - that would need to be developed if we pursue this.

The unique index on term makes this "single user" - we'd have to add a half-key to that index in order to allow eg, [hemi]homonyms (plant and insect collections, etc.) to be managed simultaneously.
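
A minimal sketch of the "half-key" idea, using Python's built-in sqlite3 module rather than the Arctos database. The table and column names (`hier_term`, `dataset`, `term`, `parent_id`) and the genus "Aus" are hypothetical placeholders, not the prototype's actual schema or data. The unique key covers (dataset, term), so a term can appear only once per dataset but may repeat across datasets:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hier_term (
        id        INTEGER PRIMARY KEY,
        dataset   TEXT NOT NULL,
        term      TEXT NOT NULL,
        parent_id INTEGER REFERENCES hier_term(id),  -- NULL for root terms
        UNIQUE (dataset, term)                        -- hypothetical half-key
    )
""")

def add_term(dataset, term, parent_id=None):
    """Insert a term; return its id, or None if the unique key rejects it."""
    try:
        cur = conn.execute(
            "INSERT INTO hier_term (dataset, term, parent_id) VALUES (?, ?, ?)",
            (dataset, term, parent_id),
        )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        return None

plants_root = add_term("plants", "Plantae")
insects_root = add_term("insects", "Animalia")

first = add_term("plants", "Aus", plants_root)       # accepted
duplicate = add_term("plants", "Aus", plants_root)   # rejected: same dataset
homonym = add_term("insects", "Aus", insects_root)   # accepted: other dataset
```

With a unique index on `term` alone, the third insert would also be rejected, which is the "single user" limitation described above.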

The usage mode would be, more or less:

  • Get some data from somewhere. The test data (~15K random terms) came from the Arctos classification. (Data which cannot be made hierarchical will not fit - http://arctos-test.tacc.utexas.edu/name/Falculifer%20rostratus#Arctos uses "Acari" twice so isn't included, for example.)
  • Edit. That would all need to be developed; there are lots of possibilities for how it could work. (Check a checkbox to see what would likely become the edit form.)
  • Add/Delete - not yet handled, we'd need to discuss how to do that without breaking anything else.
  • Push back to the "real" Arctos taxonomy model - also likely needs discussion (eg, conflict prevention). That could work continuously in the background, or after some flag is set, or WHATEVER.

Note that there's no hard connection between what's edited in this tool and where those data originate or end up - you could pull and edit "NCBI classification where Family=Somefamily" and push back to the "Arctos Plants" classification, updating only the records you've edited, for example.

All of this - including the idea that we need a simpler, limited-scope editor - is entirely open to discussion. I've built only enough to believe that this approach is viable with current technology.

If this doesn't seem like a crazy idea, and if it actually works in some limited scope, I'd like to see it (somehow, eventually) replace all other classification editing tools (which rely on a lot of client-side 'suggestions' and still demonstrably fail to produce the quality of data we want) in Arctos.

@dustymc dustymc added Enhancement I think this would make Arctos even awesomer! Function-Taxonomy/Identification labels Dec 14, 2016
@dustymc dustymc added this to the Needs Discussion milestone Dec 14, 2016
@dustymc dustymc mentioned this issue Mar 14, 2017
@dustymc dustymc added the Priority-High (Needed for work) High because this is causing a delay in important collection work.. label Mar 30, 2017
@dustymc
Contributor Author

dustymc commented Mar 30, 2017

carla says "go"! - changing status.

Goal: make the save button as close to live as possible.

@dustymc
Contributor Author

dustymc commented Mar 30, 2017

The import (from Arctos to the hierarchy) scripts ignore all but one hierarchy.

For example, http://arctos-test.tacc.utexas.edu/name/Veronica%20pallida has family Asteraceae. All other Veronica have family Scrophulariaceae. The import works top-down by name, and so any conflicts are ignored. The final data as imported include....

screen shot 2017-03-30 at 2 48 00 pm

the conflicting family assertion, now childless, and...

screen shot 2017-03-30 at 2 48 49 pm

Veronica pallida placed under Scrophulariaceae, because it's under Veronica which was already under Scrophulariaceae (from some earlier import) and the processing is by name.

Note that this is not a consensus - the first time a term is encountered the term-->parent_term relationship is established, and all children of "term" will simply follow.
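
A minimal sketch of the "first sighting wins" behavior, not the actual import scripts. It assumes each record arrives as a list of terms ordered root to leaf; the function name, the `ignored` list, the "Plantae" root, and the first sample record are illustrative assumptions:

```python
def import_classifications(records):
    """Build a term -> parent map from lineages ordered root to leaf.

    Returns (parent_of, ignored). parent_of maps each term to its parent
    (None for roots). ignored lists (term, rejected_parent, kept_parent)
    for every later assertion that conflicts with an earlier one.
    """
    parent_of = {}
    ignored = []
    for lineage in records:
        parent = None
        for term in lineage:
            if term not in parent_of:
                parent_of[term] = parent  # first sighting wins
            elif parent_of[term] != parent:
                ignored.append((term, parent, parent_of[term]))
            parent = term
    return parent_of, ignored

records = [
    ["Plantae", "Scrophulariaceae", "Veronica", "Veronica officinalis"],
    ["Plantae", "Asteraceae", "Veronica", "Veronica pallida"],
]
parent_of, ignored = import_classifications(records)

# Terms left without children, like the conflicting family assertion.
childless = sorted(set(parent_of) - {p for p in parent_of.values() if p})
```

Here "Asteraceae" is imported but ends up childless, and "Veronica pallida" follows "Veronica" into Scrophulariaceae. Recording the ignored assertions, rather than dropping them silently, is one way such conflicts could be surfaced for review.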

I can't find a more-transparent functional solution which doesn't involve simply rejecting the entire dataset. Please let me know as soon as possible if this is not acceptable behavior.

@dustymc
Contributor Author

dustymc commented Apr 5, 2017

I need instructions for handling duplicate (within a "dataset" - 'birds' 'insects' or whatever convenient group of taxa might be managed in the hierarchical editor) terms which do not share rank. The decisions made here may require extensive code rewriting, so I'm temporarily elevating this Issue as well. @campmlc @ccicero @DerekSikes please help!

I've set the hierarchy up to disallow duplicate terms within a dataset, which is necessary for consistency. Polyphaga is given as an order of http://arctos.database.museum/name/Leiopsammodius (and ~40 more names) and a suborder of http://arctos.database.museum/name/Abraeomorphus (and thousands more) and, if I correctly understand the data, that inconsistency just ensures users cannot successfully find what they're looking for. The situation is very common (in test) - I don't think I've found any group of more than a few hundred records without similar data.
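
A rough sketch of how terms used at more than one rank might be found before import. It assumes each classification is available as a dict mapping rank to term, which is a simplification and not the Arctos data structure; the function name is hypothetical and the sample rows are trimmed to the ranks mentioned above:

```python
from collections import defaultdict

def terms_with_conflicting_ranks(classifications):
    """Return {term: sorted ranks} for terms that appear at more than one rank."""
    ranks_by_term = defaultdict(set)
    for classification in classifications:
        for rank, term in classification.items():
            ranks_by_term[term].add(rank)
    return {
        term: sorted(ranks)
        for term, ranks in ranks_by_term.items()
        if len(ranks) > 1
    }

sample = [
    {"order": "Polyphaga", "genus": "Leiopsammodius"},
    {"suborder": "Polyphaga", "genus": "Abraeomorphus"},
]
conflicts = terms_with_conflicting_ranks(sample)
```

For this sample, `conflicts` contains only "Polyphaga", flagged as both "order" and "suborder".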

However, there are several instances where two terms within a single taxon's classification seem to be "valid," most commonly in subgenus. (The rest may or may not be similar to Polyphaga - a taxonomist will have to make that determination individually.) Handling for this situation cannot co-exist with enforcing consistency.

There are a few possible solutions:

  1. Drop the unique key on (name, dataset), add a unique key on (name, rank, dataset). This allows Polyphaga to sometimes be an Order and sometimes a Suborder, guarantees inconsistent results for anyone searching by the term, and generally evades the most compelling reasons to work with hierarchical data in the first place (guaranteed consistency). To me, this is the Most Evil Option.

  2. Drop subgenus when it's the same string as genus. I'm not sure how acceptable this might be from the standpoint of a taxonomist, but it is my preferred approach (since it maintains all the power of a hierarchical data structure).

  3. Do something to make subgenus unique - perhaps wrap the NAME (and classification term) in parens - "(Subgenus)" instead of "Subgenus." This would require relaxing some rules regarding the structure of taxonomy, and there would be no way to force that usage only for subgenus - I could not prevent someone creating "(Animalia)." (Names exist before classifications can, Names may be used by many classifications, and weirder things are common!)
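
Option 2 could be sketched as a small pre-import filter. This assumes a classification is a dict of rank to term; the function name and the placeholder names "Aus", "Cus", and "Aus bus" are hypothetical:

```python
def drop_redundant_subgenus(classification):
    """Return a copy without 'subgenus' when it repeats the genus string."""
    genus = classification.get("genus")
    subgenus = classification.get("subgenus")
    if subgenus is not None and subgenus == genus:
        return {k: v for k, v in classification.items() if k != "subgenus"}
    return dict(classification)

same = drop_redundant_subgenus(
    {"genus": "Aus", "subgenus": "Aus", "species": "Aus bus"}
)
different = drop_redundant_subgenus(
    {"genus": "Aus", "subgenus": "Cus", "species": "Aus bus"}
)
```

Only the exact-match case is dropped; a subgenus that differs from its genus is kept and remains subject to the unique key like any other term.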

Help!

@dustymc dustymc self-assigned this Apr 5, 2017
@dustymc dustymc added Blocked: Needs Discussion Priority-Critical (Arctos is broken) Critical because it is breaking functionality. and removed Priority-High (Needed for work) High because this is causing a delay in important collection work.. labels Apr 5, 2017
@campmlc

campmlc commented Apr 5, 2017 via email

@dustymc
Contributor Author

dustymc commented Apr 5, 2017

I'm not sure I understand the question, so please clarify if I miss your point.

variable higher taxonomic categories

These cannot exist in a simple hierarchy. They're no problem in the "core" model (and handling them is no small part of why we use the data model we do). If having some species of genus Bla in family Blah and other species of genus Bla in family Blagh is important in your classification, you probably cannot use the hierarchical editor and you should probably not be sharing a classification with anyone who does use the hierarchical editor (because they'll eventually suck in "your" data and consolidate Blah/Blagh).

If dealing with that type of data (or genus and subgenus sharing a string) is critical in managing classifications, we should probably schedule a call and discuss whether a hierarchical editor is worth further pursuit at all. (I don't think there's a clear-cut answer to that, but building a complex model so curatorial users don't have to use the OTHER complex model deserves some serious discussion!)

For scale on what I think may be the core of the problem, there are currently 2,431,313 taxa managed by Arctos+Arctos Plants, 6,090 of them have a "subgenus" term, and for 5,832 of those genus and subgenus match.
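
The proportions implied by those counts can be worked out directly. The variable names are illustrative, and the three counts are the ones quoted above:

```python
total_taxa = 2_431_313
with_subgenus = 6_090
subgenus_equals_genus = 5_832

# Taxa whose subgenus differs from the genus string.
distinct_subgenus = with_subgenus - subgenus_equals_genus
share_matching = 100 * subgenus_equals_genus / with_subgenus
share_of_all = 100 * with_subgenus / total_taxa

print(f"{distinct_subgenus} taxa have a subgenus that differs from the genus")
print(f"{share_matching:.1f}% of subgenus terms repeat the genus")
print(f"{share_of_all:.2f}% of all taxa carry a subgenus term")
```

That is 258 taxa with a distinct subgenus, about 95.8% of subgenus terms repeating the genus, and about 0.25% of all taxa carrying a subgenus term at all.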

@dustymc dustymc added Priority-High (Needed for work) High because this is causing a delay in important collection work.. and removed Blocked: Needs Discussion Priority-Critical (Arctos is broken) Critical because it is breaking functionality. labels Apr 7, 2017
@dustymc
Contributor Author

dustymc commented Apr 7, 2017

Dropping priority given the scope of the subgenus problem.

This is ready for further testing. There are 4 datasets which can be used to test, or create your own. 3 are small (<50K names), "bugs" contains ~800K terms (and seems to work fine, although import took ~16h).

  • Search is still extremely limited; working on that now.
  • Repatriation is still fully manual; probably will not work on that until non-test data have been processed. (Repatriation will use the classification bulkloader, which is proven - the process will work, HOW the process works - how much we can automate, what needs manual review, etc. - is the open question.)

@dustymc
Contributor Author

dustymc commented Apr 7, 2017

Search is now fully functional

@dustymc
Contributor Author

dustymc commented Apr 11, 2017

search for something that doesn't exist does not behave properly

@dustymc
Contributor Author

dustymc commented Apr 11, 2017

Test data make useful testing very difficult, so moving development to production - please test at PROD/tools/taxonomyTree.cfm (linked under Enter/Batch, for some reason...).

Birds are imported and "real Arctos" taxonomy cleaned (minus a few, need Carla feedback). The import will not work on very messy data and I have not discovered any workarounds - the edit classification form (which autosuggests many of the fixes) seems most efficient, although it is not very efficient.

The semi-silent ignoring of "alternate hierarchies" is too cryptic - adding a 'more info' tool (or ??)

Added tool and JS-injected summary to edit form

@dustymc
Contributor Author

dustymc commented Apr 11, 2017

There are many names (about 800 used by bird collections) which have no classification data, or no useful classification data. (Presumably there are many more which SHOULD be in Aves, but I see absolutely no hope of finding them.) If there are terms which share term-parts (eg, something that looks like a species with something that looks like the genus-part in another name), and if they're processed in a convenient order, I can sometimes guess at them. (For birds, that worked for about 700 of the 800 terms). If I can't guess at them, I'm just inserting them as a top-level term, equivalent to kingdom. That's a little messy and will involve a lot of drag-dropping to clean, but it should lead to more consistent data and I have no better ideas. Suggestions greatly appreciated.
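
A rough sketch of that guessing step, assuming the only signal is the first word of a multi-word name matching a term already in the hierarchy. The function name and sample names are hypothetical ("Xus yus" and "Zus" are placeholders), and the real processing may use other cues:

```python
def place_unclassified(names, known_terms):
    """Return {name: parent or None}; None means insert as a top-level term."""
    placement = {}
    for name in names:
        parts = name.split()
        if len(parts) > 1 and parts[0] in known_terms:
            placement[name] = parts[0]  # genus-part matches an existing term
        else:
            placement[name] = None      # no guess possible
    return placement

known_terms = {"Aves", "Corvus"}
placement = place_unclassified(["Corvus corax", "Xus yus", "Zus"], known_terms)
```

The sketch checks against a fixed set of known terms, so a genus added later would not rescue names processed earlier. That is the "convenient order" dependency mentioned above.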

"Used by bird collections" also brings along some strangeness - http://arctos.database.museum/guid/MSB:Bird:24548 is a coyote cataloged in a bird collection, for example. It's not yet clear how we can catch the very low-quality taxa (behind which specimens are effectively hidden) and avoid making partial updates (eg, introducing inconsistent data) to mostly-unrelated collections when those data are repatriated.

@dustymc
Contributor Author

dustymc commented Apr 13, 2017

Finding missed seed data with webforms is not working very well; there are just too many variations of "birds not in class Aves" (and probably very close to all similar queries). Added some SQL code (and will continue adding) and a recommendation to request DBA help to the forms.

@dustymc
Contributor Author

dustymc commented May 1, 2017

Notes/comments/ramblings from the first real-data run:

  1. How do we intend to use this thing? For this use, we pulled in a large (~50K) dataset, cleaned up a few (~300) records, and pushed only the 300 back to Real Arctos. Do we need a "load this and its children" option to support that, or can we "seed" only what we intend to re-load, or ????

  2. The classification bulkloader and cttaxon_term had drifted apart. Need a way to keep that synchronized IF we're going to use the classification bulkloader to repatriate hierarchical data.

  3. Multiple non-classification terms (eg, two remarks) now get concatenated. I don't think this is a problem.

  4. We need a more-integrated way of dealing with things that ultimately cannot load (eg, missing nomenclatural_code, which is "required" by forms).

  5. An answer to #1086 (Can we auto-generate display name?; was: Auto-create display names in taxonomy) would simplify some data checks.

  6. Can the hierarchical editor replace the classification bulkloader and/or the single-record edit forms?

@DerekSikes

DerekSikes commented May 1, 2017 via email

@dustymc
Contributor Author

dustymc commented May 1, 2017

Can you tell me what problem it solves? For example, if I worry that some of my taxonomy data have become inconsistent

^^ that. It's a hierarchy - there's a unique key on term (so "Animalia" can exist only once) and every term has exactly one parent. "Genus species1" and "Genus species2" are structural children of "Genus" - the only way to introduce inconsistency is to misspell something.

Structural consistency opens the door to mass-edit tools - the things that have plagued us in other models cannot exist in a hierarchy, so you can, eg, drag a genus to a new family (term to another term - neither ranks nor direction matter), which automagically updates the 9000 children (species and subspecies) of the genus as well.
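
A minimal sketch of why one move covers every descendant, assuming a parent-pointer structure in which each term stores only its parent. The `Hierarchy` class and the names "FamilyA" and "FamilyB" are hypothetical, not the editor's code:

```python
class Hierarchy:
    """Each term appears once and stores only its parent (None for roots)."""

    def __init__(self):
        self.parent_of = {}

    def add(self, term, parent=None):
        if term in self.parent_of:
            raise ValueError(f"duplicate term: {term}")
        if parent is not None and parent not in self.parent_of:
            raise KeyError(f"unknown parent: {parent}")
        self.parent_of[term] = parent

    def lineage(self, term):
        """Walk parent pointers upward; return the path from root to term."""
        path = []
        while term is not None:
            path.append(term)
            term = self.parent_of[term]
        return list(reversed(path))

    def move(self, term, new_parent):
        """Re-parent one term; descendants follow because they point at it."""
        if term not in self.parent_of:
            raise KeyError(f"unknown term: {term}")
        if term in self.lineage(new_parent):
            raise ValueError("cannot move a term beneath itself")
        self.parent_of[term] = new_parent

tree = Hierarchy()
tree.add("FamilyA")
tree.add("FamilyB")
tree.add("Genus", "FamilyA")
tree.add("Genus species1", "Genus")
tree.add("Genus species2", "Genus")

tree.move("Genus", "FamilyB")  # one edit, regardless of how many children
moved_lineage = tree.lineage("Genus species2")
```

After the move, the lineage of "Genus species2" runs through "FamilyB" even though only the record for "Genus" changed.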

check & see if these problems exist

Sorta. See #1000 (comment) - those problems CAN'T exist in the hierarchy, so the import currently just makes everything follow the first pattern it finds. The only safe alternative I could find was to do nothing (eg, reject everything) and it's pretty simple to put things where they belong (drag/drop/done).

@sharpphyl

What is the status of developing a hierarchical taxonomy? I'm faced with hundreds and more likely thousands of changes (mostly reassignment of genera to new families) that currently I can only do manually one-by-one.

If I continue that way (rather than being able to move the entire genus to the new family), I'm leaving behind totally inconsistent taxa with a portion of the genus being in one family (the new correct family on the species that I have in my collection) and the rest in another family (the previous no-longer-valid family). See Teinostoma - The eight species in our collection have been updated to Tornidae. The rest are still in Vitrinellidae which is no longer the valid family (see http://www.marinespecies.org/aphia.php?p=taxdetails&id=153704).

If I can experiment with making this change for all species in this genus with your test tool, can you provide a better link to it? Thx.

@sharpphyl

I moved the above comment to Issue #891. Not sure which is the best place for it, but the chain of comments on #891 seems more aligned with the issue.

@dustymc
Contributor Author

dustymc commented Aug 26, 2017

AFAIK the hierarchical editor is fully-functional and being used. @ccicero can this issue be closed and documentation written to the current iteration of the editor?

#1000 (comment)

screen shot 2017-08-26 at 7 18 39 am

I went as far as I can in #891; the hierarchical editor with a taxonomist driving can easily get whatever I missed.

My inclination is to deprecate the single-record editors in favor of the hierarchical, to avoid re-creating that problem (and a bunch of others) if collections are willing to be locked into hierarchical structures.

The editor will certainly work to move a species to a new family - that's a drag+drop, should literally take a second or so - but it's also sort of a lot of set-up for that. I think you'll find it more productive to work at larger scales - maybe family or above.

I seeded 33K Mollusca - it'll take a while to fully process, but you can play with it at https://arctos.database.museum/tools/taxonomyTree.cfm?action=manageDataset&dataset_name=clams. Feel free to delete that if you don't use it.

https://arctos.database.museum/taxonomy.cfm?taxon_name=&taxon_term=%3DM&term_type=&source=&common_name= jumped out pretty quickly - they both have a taxon term "M".

@sharpphyl

I'll give it a try. Thanks for giving me these links.

I cleaned up the two terms with "M" as a taxon term. Both were typos that weren't caught.

@sharpphyl

I LOVE IT!!!! THANKS! THANKS!

@sharpphyl

sharpphyl commented Aug 29, 2017 via email

@dustymc
Contributor Author

dustymc commented Aug 29, 2017

There will be soon, I hope... In the meantime:

  1. Find the parent node of what you want to repatriate.
  2. Edit.
  3. Seed Export.
  4. Wait for the email.
  5. Download.
  6. Review.
  7. Upload with the bulkload classification tool.
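
For reviewing a download, it may help to picture roughly what an export of one node does. This is a sketch under stated assumptions, not the Seed Export code: it flattens a parent-pointer hierarchy into one row per term with one column per rank in that term's lineage. The function names, the column names (`scientific_name` and the rank names), and the sample data are all hypothetical and are not the bulkloader's actual template:

```python
import csv
import io

def export_subtree(parent_of, rank_of, root):
    """Return one row per term at or below root.

    Each row holds the term plus a column for every rank in its lineage,
    including ancestors above root.
    """
    children = {}
    for term, parent in parent_of.items():
        children.setdefault(parent, []).append(term)

    rows = []
    stack = [root]
    while stack:
        term = stack.pop()
        row = {"scientific_name": term}
        node = term
        while node is not None:  # walk up to the root of the hierarchy
            row[rank_of[node]] = node
            node = parent_of[node]
        rows.append(row)
        stack.extend(children.get(term, []))
    return rows

def to_csv(rows):
    """Write rows as CSV text, using the union of keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

parent_of = {
    "Animalia": None,
    "FamilyA": "Animalia",
    "Genus": "FamilyA",
    "Genus species1": "Genus",
}
rank_of = {
    "Animalia": "kingdom",
    "FamilyA": "family",
    "Genus": "genus",
    "Genus species1": "species",
}
rows = export_subtree(parent_of, rank_of, "Genus")
csv_text = to_csv(rows)
```

When reviewing, a useful check is that every rank present in the tree has a filled column in the rows beneath it; a rank that is visible in the tool but empty in the CSV is worth reporting.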

@sharpphyl

Dusty,

After using the hierarchical tool for several weeks, I noticed a bug today. Ok, we know it could be operator error, but here's what's happening. I edited and went to repatriate Heterodonta and found that none of the ORDERs were on the spreadsheet. To work with a smaller set of data, I exported Archiheterodonta which looks like this in the tool.

screen shot 2017-10-02 at 4 29 15 pm

The ORDER Carditida doesn't appear anywhere in the CSV download. I assume it should appear under PHYLORDER. There is only one ORDER in Archiheterodonta so all the taxa should fall under Carditida.

screen shot 2017-10-02 at 4 47 49 pm

Let me know if I've missed a step somewhere, but it appears this occurred on several other large sets that I repatriated recently.

@dustymc
Contributor Author

dustymc commented Oct 3, 2017

@sharpphyl that should be fixed and patched in - thanks!

@sharpphyl

sharpphyl commented Oct 3, 2017 via email

@dustymc dustymc closed this as completed Mar 15, 2018
@Jegelewicz
Member

Is there any documentation yet?

@dustymc
Contributor Author

dustymc commented Jul 17, 2018
