Force refresh of WoRMS (via Arctos) #3512

Closed
sharpphyl opened this issue Mar 11, 2021 · 31 comments
Labels
Enhancement I think this would make Arctos even awesomer!

Comments

@sharpphyl

@dustymc
Are you able to force a refresh of a genus or family in WoRMS (via Arctos) at my request? The genus Alycaeus is in the family Alycaeidae. But an unknown portion of the WoRMS (via Arctos) classifications of this genus (and Dicharax, etc.) are still in Cyclophoridae. The only way I've been able to get them all in Alycaeidae is to manually refresh the ones I'm using, which leaves WoRMS (via Arctos) inconsistent. I didn't know it until I printed labels and they showed up in two different families.

Here's an example.

Alycaeus conformis - WoRMS

Alycaeus conformis WoRMS (via Arctos)

I don't think the hierarchical tool works for externally managed sources. Can I force a refresh of a family or genus with some type of taxon bulkload without having to manually enter all the aphiaIDs and names?

This relates to #3311 and the question of whether to use WoRMS (via Arctos), which is slow to update, or WoRMS (via GlobalNames), which is only updated every 60 days and cannot be updated directly from WoRMS with an aphiaID. In the meantime, it would be helpful not to have to manually refresh (and add) WoRMS names and classifications.

@sharpphyl sharpphyl added the Enhancement I think this would make Arctos even awesomer! label Mar 11, 2021
@dustymc
Contributor

dustymc commented Mar 11, 2021

It should be understood that #3311 and talking directly to WoRMS' API are not incompatible. We can do both.

#3311 could have two impacts on this.

  1. If we can find a way to get WoRMS to give, and GN to accept, whatever you need then this - and maybe the other few hundred sources that go through GN - is trivial to handle in the future. Going through GN rather than talking directly to WoRMS seems a lot more sustainable to me, and so I think we should if we can, but if we can't make GN work then we can still go straight to the source (but it does require additional time, CPU, code, etc.).

  2. Some of the complexity is "translating." We have to figure out how to skip that if #3311 (Allow non-local taxonomy sources to be preferred by collections) is going to work, and whatever we do there should also work for any other "non-local" source, wherever it comes from. Even if we keep talking directly to WoRMS, #3311 is likely to simplify doing so by eliminating the need to interpret.

The below should be set to refresh; that should happen in the next 800 minutes or so if nothing else pops up.

    -- List WoRMS (via Arctos) names in one genus, with their aphiaID and current family
    select
      ap.term,
      f.term,
      scientific_name
    from
      taxon_name
      inner join taxon_term ap on taxon_name.taxon_name_id=ap.taxon_name_id and ap.source='WoRMS (via Arctos)' and ap.term_type='aphiaid'
      inner join taxon_term g on taxon_name.taxon_name_id=g.taxon_name_id and g.source='WoRMS (via Arctos)' and g.term_type='genus'
      inner join taxon_term f on taxon_name.taxon_name_id=f.taxon_name_id and f.source='WoRMS (via Arctos)' and f.term_type='family'
    where
      g.term='Alycaeus'

@sharpphyl
Author

I'll check it later today. Thanks for the SQL. @Jegelewicz can we add this to our cheat sheet?

@dustymc I'm not sure I understand what "translating" is so I leave that to your magic.

@dustymc
Contributor

dustymc commented Mar 11, 2021

not sure I understand what "translating" is

I have code that tries to make WoRMS data align with https://arctos.database.museum/info/ctDocumentation.cfm?table=cttaxon_term. There are lots of gaps in various places in that process, and it's all hard-coded - when we add or change something the API call needs rebuilt, which often gets skipped. #3311 would (somehow) only require "our" terms for "locally-managed" taxonomy - it would stop you from typing "speeceez," but it would still accept that from "remote" sources, however we pull their data in.
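
One way to picture the "translating" step: a remote classification arrives as ranked names, and each rank either maps to a local term type or falls through a gap. The sketch below is only an illustration, not the Arctos code. The nested `rank` / `scientificname` / `child` shape and the `RANK_TO_TERM_TYPE` table are assumptions made for the example, and the mapping is deliberately incomplete so that one rank ends up unmapped.

```python
# Hypothetical mapping from remote rank labels to local term types.
# Deliberately incomplete: any rank missing here is a "gap".
RANK_TO_TERM_TYPE = {
    "Kingdom": "kingdom",
    "Phylum": "phylum",
    "Class": "class",
    "Order": "order",
    "Family": "family",
    "Genus": "genus",
    "Species": "species",
}


def flatten_classification(node):
    """Walk an assumed nested {rank, scientificname, child} structure
    from the top down and return a list of (rank, name) pairs."""
    pairs = []
    while node:
        pairs.append((node["rank"], node["scientificname"]))
        node = node.get("child")
    return pairs


def translate(node):
    """Split remote ranks into locally recognised terms and leftovers."""
    mapped, unmapped = [], []
    for rank, name in flatten_classification(node):
        term_type = RANK_TO_TERM_TYPE.get(rank)
        if term_type is None:
            unmapped.append((rank, name))
        else:
            mapped.append((term_type, name))
    return mapped, unmapped


# Toy input built from names mentioned in this thread.
sample = {
    "rank": "Superfamily", "scientificname": "Buccinoidea",
    "child": {
        "rank": "Family", "scientificname": "Chauvetiidae",
        "child": {
            "rank": "Genus", "scientificname": "Chauvetia",
            "child": {"rank": "Species", "scientificname": "Chauvetia retifera"},
        },
    },
}

mapped, unmapped = translate(sample)
print("mapped:", mapped)
print("unmapped:", unmapped)
```

Every time a rank is added or renamed on either side, a hard-coded table like this has to be edited, which is the maintenance cost described above.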

#3498 is a first-pass attempt at making those less-predictable data available from the catalog record.

@Jegelewicz
Member

Added to cheat sheet as

Update WoRMS (via Arctos) for a single genus

@sharpphyl
Author

sharpphyl commented Mar 11, 2021

@Jegelewicz Thanks for adding to the cheat sheet. It might be better understood as "refresh" as that's the (current) term.

@Jegelewicz
Member

Changed!

@dustymc
Contributor

dustymc commented Mar 11, 2021

For clarity, that's just a select. Writing to cf_worms_refreshed calls for a refresh. These seem to have caught up.
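
To make the select-versus-write distinction concrete, here is a minimal sketch using an in-memory SQLite database in place of Arctos. The `refresh_queue` table is a hypothetical stand-in: the real columns of `cf_worms_refreshed` are not shown in this thread, and the toy rows only borrow names and aphiaIDs mentioned elsewhere in it.

```python
import sqlite3

SOURCE = "WoRMS (via Arctos)"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxon_name (taxon_name_id INTEGER PRIMARY KEY, scientific_name TEXT);
    CREATE TABLE taxon_term (taxon_name_id INTEGER, source TEXT, term_type TEXT, term TEXT);
    -- Hypothetical queue table; the real cf_worms_refreshed layout is an unknown here.
    CREATE TABLE refresh_queue (taxon_name_id INTEGER PRIMARY KEY, aphiaid TEXT);
""")
conn.executemany("INSERT INTO taxon_name VALUES (?, ?)", [
    (1, "Chauvetia affinis"), (2, "Chauvetia brunnea"), (3, "Lachesis"),
])
conn.executemany("INSERT INTO taxon_term VALUES (?, ?, ?, ?)", [
    (1, SOURCE, "aphiaid", "138879"), (1, SOURCE, "genus", "Chauvetia"),
    (2, SOURCE, "aphiaid", "138880"), (2, SOURCE, "genus", "Chauvetia"),
    (3, SOURCE, "aphiaid", "512176"), (3, SOURCE, "genus", "Lachesis"),
])

# Same idea as the genus query above: names in one genus that carry an aphiaid.
FIND = """
    SELECT tn.taxon_name_id, ap.term
    FROM taxon_name tn
    JOIN taxon_term ap ON ap.taxon_name_id = tn.taxon_name_id
        AND ap.source = :src AND ap.term_type = 'aphiaid'
    JOIN taxon_term g ON g.taxon_name_id = tn.taxon_name_id
        AND g.source = :src AND g.term_type = 'genus'
    WHERE g.term = :genus
"""
params = {"src": SOURCE, "genus": "Chauvetia"}

# Step 1: the select only reads. Nothing is queued afterwards.
found = conn.execute(FIND, params).fetchall()
queued_after_select = conn.execute("SELECT COUNT(*) FROM refresh_queue").fetchone()[0]

# Step 2: writing the same rows into the queue is what asks for a refresh.
conn.execute("INSERT OR IGNORE INTO refresh_queue " + FIND, params)
queued_after_insert = conn.execute("SELECT COUNT(*) FROM refresh_queue").fetchone()[0]

print(len(found), queued_after_select, queued_after_insert)
```

The write step is the part that needs more database access than the read step.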

@Jegelewicz
Member

For clarity, that's just a select. Writing to cf_worms_refreshed calls for a refresh.

Sooooo - mere mortals can't really DO anything with it?

@sharpphyl
Author

sharpphyl commented Mar 12, 2021

For clarity, that's just a select. Writing to cf_worms_refreshed calls for a refresh. These seem to have caught up.

So I just tried the SQL with Dicharax and it gave me a nice list of 73 records. I'll give them the same 800 minutes. But are you saying that I still have to manually refresh them or will this nudge them?

@dustymc
Contributor

dustymc commented Mar 12, 2021

Yup, no mortals. That's still just a select - nudging requires more access. That could probably be made more accessible, but it's just a symptom of some other problem so not sure how I feel about that....

Check back in 140 minutes.

@dustymc
Contributor

dustymc commented Mar 17, 2021

I think we're done here?

@dustymc dustymc closed this as completed Mar 17, 2021
@sharpphyl
Author

No. We may be done with this specific issue since there doesn't seem to be a way to refresh more than one taxon name at a time. The bigger issue remains - WoRMS (via Arctos) is not what is in WoRMS (marinespecies.org) and what is in WoRMS (via Global Names) is not what is in WoRMS.

Evidently we don't have the processing power to actually have the WoRMS database available within a reasonable time frame (1 week? 1 month?) in WoRMS (via Arctos) and using Global Names doesn't catch us up. I have the time to manage the taxonomy and it's not nearly as much work as before we added WoRMS (via Arctos), but we shouldn't promote Arctos as having the WoRMS taxonomy "built-in."

Right now, for the genus Chamalycaeus, we have 47 names in the family Cyclophoridae (former) and 59 in Alycaeidae (correct), for a total of 106. WoRMS has 225 because they include those with a subgenus and we don't get any of them. Global Names hasn't caught up with the change in Dicharax from Cyclophoridae to Alycaeidae. The last change was made in November 2020.
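
A per-family count like the one above can be produced with a grouped query. The sketch below runs against a toy in-memory SQLite table rather than Arctos. The table layout follows the `taxon_term` columns used in the queries in this thread, and the rows are illustrative only: they borrow Chauvetia names from this thread, and the family assigned to each is made up for the example rather than a statement about the database.

```python
import sqlite3

SOURCE = "WoRMS (via Arctos)"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon_term (taxon_name_id INTEGER, source TEXT, term_type TEXT, term TEXT)"
)

# Toy data: one genus whose names are split across an old and a new family.
toy = {
    "Chauvetia decorata": "Chauvetiidae",
    "Chauvetia retifera": "Chauvetiidae",
    "Chauvetia affinis": "Buccinidae",
    "Chauvetia brunnea": "Buccinidae",
    "Chauvetia crassior": "Buccinidae",
}
rows = []
for name_id, family in enumerate(toy.values(), start=1):
    rows.append((name_id, SOURCE, "genus", "Chauvetia"))
    rows.append((name_id, SOURCE, "family", family))
conn.executemany("INSERT INTO taxon_term VALUES (?, ?, ?, ?)", rows)

# Count names per family within one genus and one source.
counts = conn.execute(
    """
    SELECT f.term AS family, COUNT(*) AS names
    FROM taxon_term g
    JOIN taxon_term f ON f.taxon_name_id = g.taxon_name_id
        AND f.source = g.source AND f.term_type = 'family'
    WHERE g.source = ? AND g.term_type = 'genus' AND g.term = ?
    GROUP BY f.term
    ORDER BY names DESC, family
    """,
    (SOURCE, "Chauvetia"),
).fetchall()
total = sum(n for _, n in counts)

print(counts, total)
```

More than one row in the result means the genus is currently split across families in that source.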

If there's not much we can do about any of this, then, yes, we can close this issue.

[Image: Screen Shot 2021-03-17 at 11 10 07 AM]

[Image: Screen Shot 2021-03-17 at 12 22 56 PM]

@dustymc
Contributor

dustymc commented Mar 17, 2021

Thanks, reopening. I think much of this needs dedicated Issues - it's scattered around, we hit on symptoms here and there, but I don't think we really have a place to get at the core of the issue.

don't have the processing power

It's more than CPU - it needs time, probably ongoing.

Global Names

A central question is if we can spend whatever we need on or through GN (which Arctos can easily talk to, and which would address related Issues in the future), or if we're just going to have to figure out how to maintain the connection ourselves, or ??? I don't think it's so much if but how we can best do this.

Chamalycaeus

Set to refresh. (That could probably be an app, but again I'd really like to get at the core of the problem instead of just treating symptoms.)

subgenus

There's an old issue somewhere that could be revived. From here this looks like a "research grade" problem - doing crazy "traditional" things to names and providing "research grade" data do not seem compatible to me. That doesn't mean we can't do something, but I'm not (yet, I hope) sure what that might be. "Get GlobalNames to figure it out" would be pretty cool....

@Jegelewicz
Member

nudging requires more access. That could probably be made more accessible, but it's just a symptom of some other problem so not sure how I feel about that....

@dustymc would it be possible to build a "little" WoRMS (via Arctos) widget so that people with manage_taxonomy could request a refresh in bulk, maybe at family or genus level?

So Phyllis could just select:

Please refresh all classifications in taxonomy source = WoRMS (via Arctos) with taxon_term = family and taxon_term_value = Taxon

@dustymc
Contributor

dustymc commented Sep 15, 2021

Yep that's "(That could probably be an app, but again I'd really like to get at the core of the problem instead of just treating symptoms.)"

I think "core of the problem" is this:

Best: Can we get WoRMS and GlobalNames to play nice? I know this is probably a daunting task, depending on what exactly "play nice" means to be useful, but it would make WoRMS data available to everyone (but again, maybe we are "everyone" as far as GN is concerned), make WoRMS work like everything else in Arctos, bla bla bla - good stuff.

Not-so-best: If we're going with the "sustainable service" approach, I'd like to turn off the automation, or reduce the webservice to refreshing names from a list (which could be built via that new app). I suspect this is the most realistic, does that all sound right if so?

I'm assuming the "get more resources" option is right out, but I'd really like to be wrong about that so if it's not please let me know!

@Jegelewicz
Member

It isn't that they can't "play nice", it is that GN does not have the resources to do what we want.

So

turn off the automation, or reduce the webservice to refreshing names from a list (which could be built via that new app). I suspect this is the most realistic, does that all sound right if so?

Might be a good choice. I will let @sharpphyl weigh in.

@sharpphyl
Author

turn off the automation, or reduce the webservice to refreshing names from a list

Our collection would be well served if we only refreshed the phylum Mollusca - 148,813 names in WoRMS out of 576,574 names. We can manually update the few taxa we have in other phyla.

Or update one class of the 8 in Mollusca each night on a rolling basis.

That may not work for other collections using WoRMS (via Arctos) as their primary source. You know which collections would be impacted.

I'll close #2808 if we can have the auto refresh function in days or weeks instead of many months.

@dustymc
Contributor

dustymc commented Sep 16, 2021

one class of the 8

That's probably less sustainable than what we're doing now!

each night

I don't want to set up any unrealistic expectations.

I'm proposing a form which accepts two inputs

  1. rank (optional), and
  2. term (required)

That will flag all records

  • in WoRMS (via Arctos)
  • with aphiaid
  • with {term} [={rank}]

to refresh.
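
The selection rule above can be sketched as a small query builder. This is a sketch under assumptions, not the form's actual code: `build_flag_query` and `flag` are hypothetical names, the table layout copies the `taxon_term` columns used in the queries in this thread, and the demo runs on toy rows in an in-memory SQLite database.

```python
import sqlite3

SOURCE = "WoRMS (via Arctos)"


def build_flag_query(term, rank=None):
    """Build (sql, params) for names to flag: in SOURCE, with an aphiaid,
    and with `term` at any rank, or only at `rank` when one is given."""
    if not term:
        raise ValueError("term is required")
    sql = (
        "SELECT DISTINCT ap.taxon_name_id, ap.term "
        "FROM taxon_term ap "
        "JOIN taxon_term t ON t.taxon_name_id = ap.taxon_name_id "
        "AND t.source = ap.source "
        "WHERE ap.source = ? AND ap.term_type = 'aphiaid' AND t.term = ?"
    )
    params = [SOURCE, term]
    if rank:
        sql += " AND t.term_type = ?"
        params.append(rank)
    return sql + " ORDER BY ap.taxon_name_id", params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon_term (taxon_name_id INTEGER, source TEXT, term_type TEXT, term TEXT)"
)
conn.executemany("INSERT INTO taxon_term VALUES (?, ?, ?, ?)", [
    (1, SOURCE, "aphiaid", "138879"), (1, SOURCE, "genus", "Chauvetia"),
    (1, SOURCE, "family", "Buccinidae"),
    (2, SOURCE, "aphiaid", "138880"), (2, SOURCE, "genus", "Chauvetia"),
    (2, SOURCE, "family", "Buccinidae"),
    # Toy row with no aphiaid, so it can never be flagged.
    (3, SOURCE, "genus", "Chauvetia"), (3, SOURCE, "family", "Buccinidae"),
])


def flag(term, rank=None):
    """Return the taxon_name_ids that would be flagged for refresh."""
    sql, params = build_flag_query(term, rank)
    return [row[0] for row in conn.execute(sql, params)]


by_family = flag("Buccinidae", "family")
any_rank = flag("Chauvetia")
wrong_rank = flag("Chauvetia", "family")
print(by_family, any_rank, wrong_rank)
```

Leaving rank out matches the term wherever it appears, which is convenient but can pull in homonyms used at a different rank.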

If you try with something too big, it'll time out. I don't know what "too big" means, and whatever it is now it won't be the same later. (I suspect it'll generally be happy with a few tens of thousands.)

The refresh won't work within any particular timeframe or in any particular order, and whatever that turns out to be initially, it'll probably change. (Realistically, this is probably at least tens of thousands per day but I won't know until I can get it unwrapped from the other stuff.)

You will almost certainly be able to feed that faster than it can process, in which case it might just refresh the same small group of names over and over. (But probably not, just trying to get all the limitations spelled out.)

I don't think any of that will be in any way limiting, I just don't want this to seem like maybe it's going to be something it's not actually going to be.

days or weeks

The core of this is just UI to #3512 (comment) - send me data. (And I think the dev will be pretty quick, but I've also got a long list of things that I have to prioritize when they become available, so I don't want to promise anything I can't deliver on.)

@sharpphyl
Author

Sounds positive but leaves some questions. Here are some candidates for data to test.

There is a new Family (rank) Chauvetiidae (term, aphiaID 1522424). I'm hoping refresh will update all the species within that family. It's tiny - only one genus Chauvetia and 54 species in this family.

A bigger refresh would be the superfamily level (Buccinoidea) which contains Chauvetiidae and four other new families and lots of moves of genera from one family to another.

The test should answer my question as to whether it can only refresh what's already in WoRMS (via Arctos) with an aphia ID or can it also add the new families and any other taxon names in that superfamily.

For example, Retimohniidae (aphiaID 1522369) is another new family that I haven't yet added to WoRMS (via Arctos). I refreshed the genus Fusipagoda so it shows this as the family, but that family doesn't yet exist in WoRMS (via Arctos) so the higher classification is incomplete.

Basically, how much manual maintenance will still need to be done or can the system find and update these terms within the term/rank we refresh?

Let me know if you need more data to test.

@dustymc
Contributor

dustymc commented Sep 18, 2021

What I've proposed is by-record, entirely within Arctos.

select 
      f.term,
      scientific_name
    from
      taxon_name
      inner join taxon_term f on taxon_name.taxon_name_id=f.taxon_name_id and f.source='WoRMS (via Arctos)' and f.term_type='family'
    where
    f.term='Chauvetiidae'


 
     term     |   scientific_name
--------------+---------------------
 Chauvetiidae | Lachesis helenae
 Chauvetiidae | Donovaniella
 Chauvetiidae | Chauvetia retifera
 Chauvetiidae | Chauvetia
 Chauvetiidae | Nesaea
 Chauvetiidae | Lachesis
 Chauvetiidae | Chauvetiella
 Chauvetiidae | Syntagma
 Chauvetiidae | Folineaea
 Chauvetiidae | Chauvetiidae
 Chauvetiidae | Donovania
 Chauvetiidae | Chauvetia mamillata
 Chauvetiidae | Chauvetia decorata
 Chauvetiidae | Chauvetia lamyi


That's the entirety of the names where "family=Chauvetiidae according to WoRMS (via Arctos)." If some other names should be included, then we need a different query to find them, or something we can use to find them would need bulkloaded, or SOME source of additional information would need created and/or identified.

Basically, how much manual maintenance

  1. Create the names
  2. Provide enough classification information to
    • Find the records (remark=plz_refresh or family=Chauvetiidae - anything that can be used to identify term-at-rank within source - will suffice), and
    • an aphiaid, the link to WoRMS

And someone should add "Lachesis" to the Official Nonexistent List Of Homonyms...

@sharpphyl
Author

These are all the Chauvetia (with aphiaID) already in Arctos that need to be refreshed - 57 names less the four that I already refreshed. WoRMS shows 54 direct children of Chauvetia plus the subspecies, so I think we have them all already in Arctos.

A few genera have been refreshed - maybe with the nightly upload. The remainder are probably still in Buccinidae. Do you want to try your refresh code with Buccinidae? It sounds like we would have to use the "old" family rather than the new family in the code since you're working "entirely within Arctos."

Chauvetia | 137702
Chauvetia affinis | 138879
Chauvetia austera | 475241
Chauvetia balgimae | 488191
Chauvetia bartolomeoi | 456795
Chauvetia borgesi | 475242
Chauvetia brunnea | 138880
Chauvetia canarica | 490890
Chauvetia candidissima | 138881
Chauvetia candidissima canarica | 751466
Chauvetia crassior | 138882
Chauvetia dentifera | 488189
Chauvetia distans | 475243
Chauvetia edentula | 475244
Chauvetia elongata | 490891
Chauvetia errata | 475245
Chauvetia gigantea | 458003
Chauvetia gigantissima | 475249
Chauvetia giunchiorum | 138884
Chauvetia hernandezi | 475246
Chauvetia inopinata | 853972
Chauvetia javieri | 458004
Chauvetia joani | 458005
Chauvetia lefebvrei | 138885
Chauvetia lefebvrii | 478532
Chauvetia lineolata | 138886
Chauvetia luciacuestae | 458006
Chauvetia maroccana | 488188
Chauvetia mauritania | 1304115
Chauvetia megastoma | 475247
Chauvetia merianae | 1304116
Chauvetia minima | 751349
Chauvetia minima fasciata | 751472
Chauvetia multilirata | 458007
Chauvetia obliqua | 138887
Chauvetia pardacuta | 458008
Chauvetia pardofasciata | 458009
Chauvetia peculiaris | 475248
Chauvetia pellisphocae | 490893
Chauvetia pelorcei | 458010
Chauvetia poseidonae | 1304117
Chauvetia procerula | 138888
Chauvetia recondita | 138889
Chauvetia robustalba | 458011
Chauvetia soni | 490894
Chauvetia submamillata | 138891
Chauvetia taeniata | 488190
Chauvetia tenebrosa | 458012
Chauvetia tenuisculpta | 490895
Chauvetia turritellata | 138892
Chauvetia ventrosa | 138893
Chauvetia vulpecula | 751264
Chauvetia vulpecula attenuata | 751521

Chauvetia decorata | 138883 - OK
Chauvetia retifera | 138890 - OK
Chauvetia mamillata | 478535 - OK
Chauvetia lamyi | 490892 - OK

The Lachesis in WoRMS is aphiaid: 512176, author: Risso, 1826 and it's invalid now per WoRMS: "invalid: junior homonym of Lachesis Daudin, 1803 [Reptilia]; Donovania and Syntagma are replacement names." I added the author to the Arctos metadata for clarity. If you limit a taxon search to WoRMS (via Arctos) source, then you don't get the reptile ones and same if you search for Current Reptile Classification source. Is there a better way to alert users of the (now invalid) homonym status?

@dustymc
Contributor

dustymc commented Sep 19, 2021

rank='family', term='Buccinidae' is set to run, but it's still mixed in with the existing code so who knows what will happen or when.

I really need a decision here.

Lachesis

Meh, call it "for amusement purposes." The system is broken, there are thousands of these things, but somehow I'm still surprised when I find a strange critter wearing a familiar name....

added the author to the Arctos metadata for clarity

And I probably just un-did that by calling for a refresh, which starts by deleting everything. "Automatically maintained" needs clarified. (Or maybe you meant the "Arctos" classification? That's potentially useful, but it's a BIG step away for many users, given that tens of thousands of "alternate classifications" refer to wildly different things rather than alternate views.)

Relationships are taxon-level and survive anything we'll ever automate - that's the best/only way to add clarity.

alert users

If anyone's lost here they don't need to be told they're lost, they need a path out (and relationships do that). This is a great example - Google thinks Donovania is an STD-causing bacteria and Syntagma is a linguistic unit. If Arctos doesn't help, that might be a hard dead end for the casual user.

@sharpphyl
Author

added the author to the Arctos metadata for clarity

Yes, I added the author to the Arctos (valid) classification since it's being used by a reptile collection.

The WoRMS (via Arctos) classification is purely historical and invalid. Is there some way to leave it in for information (e.g., the remark "invalid: junior homonym of Lachesis Daudin, 1803 [Reptilia]; Donovania and Syntagma are replacement names") but mark it as "unusable"? Or is it better to delete the WoRMS (via Arctos) classification totally and add that comment to the Arctos classification?

This doesn't solve the homonym issue if multiple collections want to use the same term with different classifications but in this case the homonymy has been resolved in favor of the reptiles.

If someone wants to use the invalid marine invertebrate version, they could use a string.

Is there something the Taxonomy Committee can do about it or is it an Arctos structural issue?

@dustymc
Contributor

dustymc commented Sep 20, 2021

Is there something the Taxonomy Committee can do about it or is it an Arctos structural issue?

I don't think it's either, it's a "taxonomy is broken, it's hard not to notice at scale" issue.

in this case the homonymy has been resolved

"Some committee says pretend those publications don't exist" won't ever seem entirely synonymous with "resolved" to me...

If someone wants to use the invalid marine invertebrate version, they could use a string.

I suspect whoever holds the type material wouldn't agree with that, but either way works fine in Arctos.

@dustymc
Contributor

dustymc commented Oct 1, 2021

Next release has limited API (refresh on demand only), and

[Image: Screen Shot 2021-09-30 at 5 02 08 PM]

for the demanding.

@Jegelewicz
Member

Are you saying @sharpphyl is demanding? :-)

@sharpphyl
Author

Who me? Yep. @dustymc THANK YOU!!!

I made my first refresh request. Teinostoma has moved from Tornidae to Teinostomidae. Does it take several hours or days or minutes to do the refresh? Any size limit you recommend? I'll check back in the morning. Thanks again.

@sharpphyl
Author

WooHoo! It worked. Thanks. I think I'm the only one in the demanding category, so I'll close this issue. "Nevertheless, she demanded" t-shirts will be available soon.

@Jegelewicz
Member

I'll buy one!

@dustymc
Contributor

dustymc commented Oct 5, 2021

several hours or days or minutes

Yes, maybe! Lots of variability in there and the server is being very polite and does not need a t-shirt at the moment, but I suspect it's still faster than you can click - at least over a period of weeks. (Scripts don't sleep much....)

size limit

Nope - if it doesn't eat your browser it should be fine, if it does just break it into chunks (or ping me). Bigger batches will take longer, but still shouldn't plug the toobs.
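
If a request ever does need breaking up, splitting a list of aphiaIDs into fixed-size batches is enough. A minimal sketch follows: `chunked` is a hypothetical helper, the batch size of 3 is arbitrary, and the IDs are copied from the Chauvetia list earlier in this thread.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# aphiaIDs taken from the Chauvetia list posted above.
aphia_ids = [137702, 138879, 475241, 488191, 456795, 475242, 138880]

batches = list(chunked(aphia_ids, 3))
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {batch}")
```

Each batch can then be submitted as its own refresh request.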

@sharpphyl
Author

sharpphyl commented Oct 5, 2021

Most of the time we need just small chunks so I doubt I'll challenge my browser, but I'll let you know if I set my computer on fire. And, yes, it's definitely much faster than I can click.
