local taxonomy editor #1000
Comments
carla says "go"! - changing status. Goal: make the save button as close to live as possible. |
The import (from Arctos to the hierarchy) scripts ignore all but one hierarchy. For example, http://arctos-test.tacc.utexas.edu/name/Veronica%20pallida has family Asteraceae. All other Veronica have family Scrophulariaceae. The import works top-down by name, and so any conflicts are ignored. The final data as imported include.... the conflicting family assertion, now childless, and... Veronica pallida placed under Scrophulariaceae, because it's under Veronica which was already under Scrophulariaceae (from some earlier import) and the processing is by name. Note that this is not a consensus - the first time a term is encountered the term-->parent_term relationship is established, and all children of "term" will simply follow. I can't find a more-transparent functional solution which doesn't involve simply rejecting the entire dataset. Please let me know as soon as possible if this is not acceptable behavior. |
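A minimal sketch of the "first relationship wins" behavior described above, in Python. This is not the actual import script; the function name, the list-of-lineages input format, and the second Veronica species are assumptions made for illustration only.

```python
def import_classifications(records):
    """Build a term -> parent map, top-down; the first parent seen for a term wins."""
    parent_of = {}   # term -> parent term (None for a top-level term)
    conflicts = []   # (term, rejected_parent, kept_parent)
    for lineage in records:          # each lineage is ordered from highest rank down
        parent = None
        for term in lineage:
            if term not in parent_of:
                parent_of[term] = parent
            elif parent_of[term] != parent:
                # A later, conflicting assertion is ignored, not merged or voted on.
                conflicts.append((term, parent, parent_of[term]))
            parent = term
    return parent_of, conflicts


# Hypothetical input: the first record establishes Veronica under Scrophulariaceae.
records = [
    ["Scrophulariaceae", "Veronica", "Veronica officinalis"],
    ["Asteraceae", "Veronica", "Veronica pallida"],
]
parent_of, conflicts = import_classifications(records)
```

With this input, Asteraceae is created but ends up childless, and Veronica pallida lands under Veronica (and so under Scrophulariaceae), which matches the outcome described in the comment. The `conflicts` list is one way such ignored assertions could be surfaced for review.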
I need instructions for handling duplicate (within a "dataset" - 'birds' 'insects' or whatever convenient group of taxa might be managed in the hierarchical editor) terms which do not share rank. The decisions made here may require extensive code rewriting, so I'm temporarily elevating this Issue as well. @campmlc @ccicero @DerekSikes please help! I've set the hierarchy up to disallow duplicate terms within a dataset, which is necessary for consistency. Polyphaga is given as an order of http://arctos.database.museum/name/Leiopsammodius (and ~40 more names) and a suborder of http://arctos.database.museum/name/Abraeomorphus (and thousands more) and, if I correctly understand the data, that inconsistency just ensures users cannot successfully find what they're looking for. The situation is very common (in test) - I don't think I've found any group of more than a few hundred records without similar data. However, there are several instances where two terms within a single taxon's classification seem to be "valid," most commonly in subgenus. (The rest may or may not be similar to Polyphaga - a taxonomist will have to make that determination individually.) Handling for this situation cannot co-exist with enforcing consistency. There are a few possible solutions:
Help! |
The following solution may invalidate constraints that make classifications
hierarchical, but they would remove constraints making data discoverable,
correct? This is the only solution that would solve problems with variable
higher taxonomic categories. So this would solve an issue that plagues
collections that deal with taxa that are not phylogenetically well defined,
e.g. parasites, arthropods, birds, etc?
Drop the unique key on (name, dataset), add a unique key on (name, rank,
dataset). This allows Polyphaga to sometimes be an Order and sometimes a
Suborder, guarantees inconsistent results for anyone searching by the term,
and generally evades the most compelling reasons to work with hierarchical
data in the first place (guaranteed consistency). To me, this is the Most
Evil Option.
…On Wed, Apr 5, 2017 at 2:06 PM, dustymc ***@***.***> wrote:
There are a few possible solutions:
1. Drop the unique key on (name, dataset), add a unique key on (name, rank, dataset). This allows Polyphaga to sometimes be an Order and sometimes a Suborder, guarantees inconsistent results for anyone searching by the term, and generally evades the most compelling reasons to work with hierarchical data in the first place (guaranteed consistency). To me, this is the Most Evil Option.
2. Drop subgenus when it's the same string as genus. I'm not sure how acceptable this might be from the standpoint of a taxonomist, but it is my preferred approach (since it maintains all the power of a hierarchical data structure).
3. Do something to make subgenus unique - perhaps wrap the NAME (and classification term) in parens - "(Subgenus)" instead of "Subgenus." This would require relaxing some rules regarding the structure of taxonomy, and there would be no way to force that usage only for subgenus - I could not prevent someone creating "(Animalia)." (Names exist before classifications can, Names may be used by many classifications, and weirder things are common!)
Help!
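To make the difference between the two unique keys in option 1 concrete, here is a rough sketch using Python's built-in sqlite3 module as a stand-in database. The table name `hier_term` and the column names are hypothetical, not the real Arctos schema.

```python
import sqlite3


def try_insert(unique_cols):
    """Insert Polyphaga at two ranks under the given unique key; return the ranks accepted."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE hier_term (name TEXT, term_rank TEXT, dataset TEXT, "
        f"UNIQUE ({unique_cols}))"
    )
    rows = [("Polyphaga", "order", "bugs"), ("Polyphaga", "suborder", "bugs")]
    accepted = []
    for row in rows:
        try:
            con.execute("INSERT INTO hier_term VALUES (?, ?, ?)", row)
            accepted.append(row[1])
        except sqlite3.IntegrityError:
            pass  # the unique key rejected this row
    con.close()
    return accepted


strict = try_insert("name, dataset")               # current key: one Polyphaga per dataset
relaxed = try_insert("name, term_rank, dataset")   # option 1: one Polyphaga per rank
```

Under the (name, dataset) key only the first Polyphaga row is kept; under (name, rank, dataset) both rows are accepted, so a search by the term alone would return two differently ranked results.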
|
I'm not sure I understand the question, so please clarify if I miss your point.
These cannot exist in a simple hierarchy. They're no problem in the "core" model (and handling them is no small part of why we use the data model we do). If having some species of genus Bla in family Blah and other species of genus Bla in family Blagh is important in your classification, you probably cannot use the hierarchical editor and you should probably not be sharing a classification with anyone who does use the hierarchical editor (because they'll eventually suck in "your" data and consolidate Blah/Blagh). If dealing with that type of data (or genus and subgenus sharing a string) is critical in managing classifications, we should probably schedule a call and discuss whether a hierarchical editor is worth further pursuit at all. (I don't think there's a clear-cut answer to that, but building a complex model so curatorial users don't have to use the OTHER complex model deserves some serious discussion!) For scale on what I think may be the core of the problem, there are currently 2,431,313 taxa managed by Arctos+Arctos Plants, 6,090 of them have a "subgenus" term, and for 5,832 of those genus and subgenus match. |
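As a rough illustration of what "drop subgenus when it's the same string as genus" could look like, and how much of the subgenus problem it might cover given the counts above, here is a small Python sketch. The dict-based classification format and the function name are assumptions, not how Arctos stores these data.

```python
def drop_redundant_subgenus(classification):
    """Return a copy of the classification without 'subgenus' when it repeats the genus."""
    cleaned = dict(classification)
    subgenus = cleaned.get("subgenus")
    if subgenus is not None and subgenus == cleaned.get("genus"):
        del cleaned["subgenus"]
    return cleaned


# Counts quoted in the comment above.
with_subgenus = 6090
genus_matches_subgenus = 5832
share_resolved = genus_matches_subgenus / with_subgenus   # about 0.958
```

On those numbers, roughly 96% of the taxa carrying a subgenus term would stop colliding with their own genus; the remaining 258 would still need a decision from a taxonomist.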
Dropping priority given the scope of the subgenus problem. This is ready for further testing. There are 4 datasets which can be used to test, or create your own. 3 are small (<50K names), "bugs" contains ~800K terms (and seems to work fine, although import took ~16h).
|
Search is now fully functional |
|
Test data make useful testing very difficult, so moving development to production - please test at PROD/tools/taxonomyTree.cfm (linked under Enter/Batch, for some reason...). Birds are imported and "real Arctos" taxonomy cleaned (minus a few, need Carla feedback). The import will not work on very messy data and I have not discovered any workarounds - the edit classification form (which autosuggests many of the fixes) seems most efficient, although it is not very efficient.
Added tool and JS-injected summary to edit form |
There are many names (about 800 used by bird collections) which have no classification data, or no useful classification data. (Presumably there are many more which SHOULD be in Aves, but I see absolutely no hope of finding them.) If there are terms which share term-parts (eg, something that looks like a species with something that looks like the genus-part in another name), and if they're processed in a convenient order, I can sometimes guess at them. (For birds, that worked for about 700 of the 800 terms). If I can't guess at them, I'm just inserting them as a top-level term, equivalent to kingdom. That's a little messy and will involve a lot of drag-dropping to clean, but it should lead to more consistent data and I have no better ideas. Suggestions greatly appreciated. "Used by bird collections" also brings along some strangeness - http://arctos.database.museum/guid/MSB:Bird:24548 is a coyote cataloged in a bird collection, for example. It's not yet clear how we can catch the very low-quality taxa (behind which specimens are effectively hidden) and avoid making partial updates (eg, introducing inconsistent data) to mostly-unrelated collections when those data are repatriated. |
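The guessing step described above might look something like this in Python. It is a simplified sketch, not the real import code: the function name is made up, the example names are illustrative, and "genus-part" is reduced here to "first word of a multi-word name".

```python
def guess_parents(unclassified, known_terms):
    """Place a name under its first word when that word is already a known term.

    Names that cannot be guessed are returned separately so they can be
    inserted as top-level terms and dragged into place later.
    """
    placed, top_level = {}, []
    for name in unclassified:
        parts = name.split()
        if len(parts) > 1 and parts[0] in known_terms:
            placed[name] = parts[0]
        else:
            top_level.append(name)
    return placed, top_level


known = {"Aves", "Corvidae", "Corvus"}   # terms already in the hierarchy
placed, top_level = guess_parents(["Corvus corax", "Canis latrans", "Mystery"], known)
```

Here "Corvus corax" is placed under Corvus, while "Canis latrans" (a coyote in a bird dataset) and the single-word "Mystery" fall through to top level. Processing order matters because a genus must already be in `known_terms` before its species can be guessed.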
Finding missed seed data with webforms is not working very well; there are just too many variations of "birds not in class Aves" (and probably very close to all similar queries). Added some SQL code (and will continue adding) and a recommendation to request DBA help to the forms. |
Notes/comments/ramblings from the first real-data run:
|
D,
fyi I've been waiting to try to use this but haven't found a reason to yet.
Can you tell me what problem it solves? For example, if I worry that some
of my taxonomy data have become inconsistent (eg some species have ranks
indicated that other species in the same genus don't have)... can I use
this tool to check & see if these problems exist & then fix them?
I'm getting tantalizing hints that this might be able to do wonderful
things and I'll love using it... but just not enough information /
circumstances have accumulated for me yet.
…-D
On Mon, May 1, 2017 at 3:13 PM, dustymc ***@***.***> wrote:
Notes/comments/ramblings from the first real-data run:
1. How do we intend to use this thing? For this use, we pulled in a large (~50K) dataset, cleaned up a few (~300) records, and pushed only the 300 back to Real Arctos. Do we need a "load this and its children" option to support that, or can we "seed" only what we intend to re-load, or ????
2. The classification bulkloader and cttaxon_term had drifted apart. Need a way to keep that synchronized IF we're going to use the classification bulkloader to repatriate hierarchical data.
3. Multiple non-classification terms (eg, two remarks) now get concatenated. I don't think this is a problem.
4. We need a more-integrated way of dealing with things that ultimately cannot load (eg, missing nomenclatural_code, which is "required" by forms).
5. An answer to #1086 (can we auto-generate display name?) would simplify some data checks.
6. Can the hierarchical editor replace the classification bulkloader and/or the single-record edit forms?
--
Derek S. Sikes, Chief Curator, Curator of Insects
University of Alaska Museum
|
^^ that. It's a hierarchy - there's a unique key on term (so "Animalia" can exist only once) and every term has exactly one parent. "Genus species1" and "Genus species2" are structural children of "Genus" - the only way to introduce inconsistency is to misspell something. Structural consistency opens the door to mass-edit tools - the things that have plagued us in other models cannot exist in a hierarchy, so you can eg drag a genus to a new family (term to another term - neither ranks nor direction matter) which automagically updates the 9000 children (species and subspecies) of the genus as well, for example.
Sorta. See #1000 (comment) - those problems CAN'T exist in the hierarchy, so the import currently just makes everything follow the first pattern it finds. The only safe alternative I could find was to do nothing (eg, reject everything) and it's pretty simple to put things where they belong (drag/drop/done). |
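For anyone wondering why dragging a genus "automagically" updates its children, here is a minimal Python sketch of the idea. The class and method names are hypothetical and this is not the editor's code; it only shows that when each term stores a single parent, a move is one update and every descendant's lineage follows.

```python
class Hierarchy:
    """Each term appears once and points at one parent (None for a top-level term)."""

    def __init__(self):
        self.parent = {}

    def add(self, term, parent=None):
        if term in self.parent:
            raise ValueError(f"duplicate term: {term}")      # the unique key on term
        if parent is not None and parent not in self.parent:
            raise KeyError(f"unknown parent: {parent}")
        self.parent[term] = parent

    def move(self, term, new_parent):
        # Walk up from the target to make sure we are not creating a loop.
        node = new_parent
        while node is not None:
            if node == term:
                raise ValueError("cannot move a term under its own descendant")
            node = self.parent[node]
        self.parent[term] = new_parent                       # the only stored change

    def lineage(self, term):
        path = []
        while term is not None:
            path.append(term)
            term = self.parent[term]
        return path[::-1]                                    # highest term first


h = Hierarchy()
h.add("FamilyA")
h.add("FamilyB")
h.add("Genus", "FamilyA")
h.add("Genus species1", "Genus")
h.add("Genus species2", "Genus")
h.move("Genus", "FamilyB")   # the "drag": one parent pointer changes
```

After the move, the lineage of both species runs through FamilyB without either species record being touched, which is also why two species of one genus cannot end up in different families in this structure.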
What is the status of developing a hierarchical taxonomy? I'm faced with hundreds and more likely thousands of changes (mostly reassignment of genera to new families) that currently I can only do manually one-by-one. If I continue that way (rather than being able to move the entire genus to the new family), I'm leaving behind totally inconsistent taxa with a portion of the genus being in one family (the new correct family on the species that I have in my collection) and the rest in another family (the previous no-longer-valid family). See Teinostoma - The eight species in our collection have been updated to Tornidae. The rest are still in Vitrinellidae which is no longer the valid family (see http://www.marinespecies.org/aphia.php?p=taxdetails&id=153704). If I can experiment with making this change for all species in this genus with your test tool, can you provide a better link to it? Thx. |
AFAIK the hierarchical editor is fully-functional and being used. @ccicero can this issue be closed and documentation written to the current iteration of the editor? I went as far as I can in #891; the hierarchical editor with a taxonomist driving can easily get whatever I missed. My inclination is to deprecate the single-record editors in favor of the hierarchical, to avoid re-creating that problem (and a bunch of others) if collections are willing to be locked into hierarchical structures. The editor will certainly work to move a species to a new family - that's a drag+drop, should literally take a second or so - but it's also sort of a lot of set-up for that. I think you'll find it more productive to work at larger scales - maybe family or above. I seeded 33K Mollusca - it'll take a while to fully process, but you can play with it at https://arctos.database.museum/tools/taxonomyTree.cfm?action=manageDataset&dataset_name=clams. Feel free to delete that if you don't use it. https://arctos.database.museum/taxonomy.cfm?taxon_name=&taxon_term=%3DM&term_type=&source=&common_name= jumped out pretty quickly - they both have a taxon term "M". |
I'll give it a try. Thanks for giving me these links. I cleaned up the two terms with "M" as a taxon term. Both were typos that weren't caught. |
I LOVE IT!!!! THANKS! THANKS! |
Are there instructions anywhere for how to repatriate taxa that we've
updated in the "tree for clams"?
|
There will be soon, I hope... In the meantime: Find the parent node of what you want to repatriate. |
Dusty, After using the hierarchical tool for several weeks, I noticed a bug today. Ok, we know it could be operator error, but here's what's happening. I edited and went to repatriate Heterodonta and found that none of the ORDERs were on the spreadsheet. To work with a smaller set of data, I exported Archiheterodonta which looks like this in the tool. The ORDER Carditida doesn't appear anywhere in the CSV download. I assume it should appear under PHYLORDER. There is only one ORDER in Archiheterodonta so all the taxa should fall under Carditida. Let me know if I've missed a step somewhere, but it appears this occurred on several other large sets that I repatriated recently. |
@sharpphyl that should be fixed and patched in - thanks! |
Thanks, Dusty.
|
Is there any documentation yet? |
Since the GNITE project seems to be dead, there is a very basic prototype/proof of concept of a potential local hierarchical taxonomy editor running at TEST/fix/taxonomyTree.cfm.
The data model is purely hierarchical; there is a unique index on names, all names have one-or-zero parents, etc. This is NOT acceptable data modeling at the scope of Arctos (eg, we must deal with homonyms), but I think (especially after seeing the data rejected by #891!) that it's a useful approach to limited-scope classifications - data used to catalog specimens in one collection (or several related collections etc. - it might work for "[in]vertebrates"). There are lots of examples where a hierarchical model is insufficient even at the collection level, but I think SOMEHOW dealing with the rare homonym etc. is less evil than trying to manage data in our "do everything" core model. (I don't foresee changing the core model - homonyms, alternate classifications, etc., could still be dealt with there. This tool is intended only as a way to manage subsets of "Arctos taxonomy," which is already a merger from many sources.)
The tool is extremely fragile:
If you try something and nothing happens, just reload and try something else. Please comment on this Issue if something weird and not already documented happens.
There is no provision for non-classification term-level metadata (authors etc.) - that would need to be developed if we pursue this.
The unique index on term makes this "single user" - we'd have to add a half-key to that index in order to allow eg, [hemi]homonyms (plant and insect collections, etc.) to be managed simultaneously.
The usage mode would be, more or less:
Note that there's no hard connection between what's edited in this tool and where those data originate or end up - you could pull and edit "NCBI classification where Family=Somefamily" and push back to the "Arctos Plants" classification, updating only the records you've edited, for example.
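The loose coupling described above (pull from one place, edit, push only the edited records somewhere else) could be sketched like this in Python. Everything here is hypothetical: the `EditSession` class, the name-to-parent dicts standing in for source and target classifications, and the placeholder taxon names.

```python
class EditSession:
    """Work on a pulled copy of a classification and track which records changed."""

    def __init__(self, pulled):
        self.records = dict(pulled)   # name -> parent, copied from some source
        self.edited = set()           # names changed during this session

    def set_parent(self, name, parent):
        if self.records.get(name) != parent:
            self.records[name] = parent
            self.edited.add(name)

    def push(self, target):
        """Write only the edited records into the target; return how many were written."""
        for name in self.edited:
            target[name] = self.records[name]
        return len(self.edited)


# Pull from one classification, push to a different one.
source = {"SpeciesA": "Somefamily", "SpeciesB": "Somefamily"}
target = {"SpeciesA": "Oldfamily", "SpeciesB": "Otherfamily", "SpeciesC": "Thirdfamily"}

session = EditSession(source)
session.set_parent("SpeciesA", "Newfamily")
pushed = session.push(target)
```

Only SpeciesA is written to the target; SpeciesB and SpeciesC keep whatever the target already said, even though SpeciesB was pulled and differs from the source. That is the "updating only the records you've edited" behavior in its simplest form.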
All of this - including the idea that we need a simpler, limited-scope editor - is entirely open to discussion. I've built only enough to believe that this approach is viable with current technology.
If this doesn't seem like a crazy idea, and if it actually works in some limited scope, I'd like to see it (somehow, eventually) replace all other classification editing tools (which rely on a lot of client-side 'suggestions' and still demonstrably fail to produce the quality of data we want) in Arctos.