
repeatability of results: online (gnames apis) vs standalone tools (gndiff & gnparser) #87

Open
abubelinha opened this issue Feb 17, 2022 · 10 comments

Comments

@abubelinha

abubelinha commented Feb 17, 2022

Over the last few weeks I have opened many issues and questions about gnverifier / gnames / resolver / gnparser / gndiff, trying to tune them for my use cases ... which I hope are similar to those of many other users.

As those issues get solved, the results returned by the APIs keep improving.
But improving also implies changing, and I think this is a major issue for some use cases as well.

When it comes to publishing scientific results (theses, reports, papers, whatever), repeatability is a must.
If I need to publish a curated list of scientific names, I can describe my protocol (e.g. #85), and I can also provide my data sources as attached files ... but there is no way I can provide the software I used to process those data following my protocol, because it was an API running on a remote server.
And there is no way around this, since old APIs and servers eventually need to be retired, and their name data sources need to be updated, so even the same API version might return different results because of those name updates.

On the other hand, as far as I can tell, if I do my work with a particular release version of gnparser and gndiff, my results will be 100% repeatable in the future because they work completely offline, am I right?

I am currently using the online resolver/verifier for several use cases.
For many of them a changing, up-to-date online API is the best option (e.g. daily checking of the names of new specimens entering a collection).
But for published works, a protocol where I download a given version of a data source and process it offline is a much better option.

In this sense, I see gndiff+gnparser as the most important gnames tools for scientific publications.
I am opening this issue not only to encourage you to develop them further, but also to raise the question of what to do (as of today) if I want to publish some work and describe a protocol based on results returned by a current or past gnames API version.

Is there any way of citing "I used this version of the gnverifier API" and also providing some kind of link (GitHub? edit: I have seen a couple of Zenodo links cited here) which exactly reflects its code at that time, so whoever wants to repeat my results in the future can download that exact version from GitHub and install everything needed to reproduce my work? In other words, an exact replica of the gnames services at a given moment in time (given, of course, that I also downloaded the gnames database dump current at that time, stored it in a permanent repository somewhere, and provided a link in my publication).

I know that would imply a lot of work and nobody would take the time to do it in practice. But in theory, would it be possible?
As of today, can we state that a work which used gnverifier API results is "in theory" repeatable ten years from now, or is that not possible?

And if it is, I would suggest not only documenting the how-to, but also having the APIs return that how-to information on request (some sort of citation parameters, providing the necessary links to GitHub, database dumps, etc.).
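For illustration, here is a minimal sketch of the kind of citation record I have in mind. The version endpoint path, its response fields and the file names below are assumptions, not a documented part of the GNames API, so treat this as an idea rather than working instructions:

```python
# Hypothetical sketch: record the service version and database dump used for a
# verification run, so the "how-to" can be cited alongside the published results.
# The /api/v1/version endpoint and its response fields are assumptions here.
import json
import datetime
import urllib.request

VERSION_URL = "https://verifier.globalnames.org/api/v1/version"  # assumed endpoint
DUMP_LISTING = "http://opendata.globalnames.org/dumps/"          # dump listing mentioned in this thread


def record_provenance(path="provenance.json"):
    """Save a small citation record next to the published results."""
    with urllib.request.urlopen(VERSION_URL, timeout=30) as resp:
        service_info = json.load(resp)  # e.g. {"version": "...", "build": "..."}
    record = {
        "retrieved": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "service": service_info,
        "database_dump_listing": DUMP_LISTING,  # plus the exact dump file you archived
        "code": "https://github.com/gnames/gnames",
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)
    return record


if __name__ == "__main__":
    print(record_provenance())
```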

@abubelinha abubelinha changed the title repeatability of matching results: online tools (gnames apis) vs standalone tools (gndiff & gnparser) repeatability of results: online (gnames apis) vs standalone tools (gndiff & gnparser) Feb 17, 2022
@dimus
Member

dimus commented Feb 17, 2022

It is indeed a problem. And it is not only the code, because the database evolves as well, although it has mostly stayed backward compatible so far. However, nothing prevents a situation where an important feature would break that backward compatibility. So I guess a solution would be:

  1. Figure out how to monitor database versioning (the database is actually defined by this internal package, https://github.com/gnames/gnidump, which is the equivalent of walking around the house alone in pajamas: no docs, bad architecture, no versions). It would need to be improved, brought to v1, and every time there is a breaking change in the database the major version number would be increased to v2, v3, etc.
  2. Add a version number to the SQL dump file at http://opendata.globalnames.org/dumps/
  3. gnames should return its own version plus the version of gnmatcher
  4. Every major version of the database dumps should have one latest file (something like dump-v1.3.6, dump-v2.0.2)

That gives a theoretical possibility of putting together a verification system using a particular version of gnames + gnmatcher + database.

It does not solve the problem of the data changing all the time, but I think that for most data sources the data change is cumulative, so the results should be close, albeit sometimes not identical.
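A minimal sketch of items 2 and 4 above, assuming the proposed dump-vX.Y.Z naming scheme (which does not exist yet at http://opendata.globalnames.org/dumps/; the file names are illustrative):

```python
# Sketch: parse a semantic version out of a versioned dump file name and check
# whether an archived dump is major-version compatible with what a tool expects.
import re

DUMP_NAME_RE = re.compile(r"dump-v(\d+)\.(\d+)\.(\d+)")


def dump_version(filename: str):
    """Extract (major, minor, patch) from a versioned dump file name."""
    m = DUMP_NAME_RE.search(filename)
    if not m:
        raise ValueError(f"no version found in {filename!r}")
    return tuple(int(x) for x in m.groups())


def compatible(dump_file: str, expected_major: int) -> bool:
    """Breaking changes bump the major version, so only the major must match."""
    return dump_version(dump_file)[0] == expected_major


print(dump_version("dump-v1.3.6.sql.gz"))   # (1, 3, 6)
print(compatible("dump-v2.0.2.sql.gz", 1))  # False: major version changed
```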

@abubelinha
Author

Quite a lot of work.

So, to be realistic, I think we are much closer to a day when I can create a replicable protocol using this combination:

  • my own draft list of problematic names
  • my own set of checklist data sources (e.g. a DwC dump of my preferred sources, extracting the needed columns from them to feed gndiff/gnparser; see the sketch below)
  • the gndiff+gnparser CLI

All of these are versionable, downloadable, easily citable and executable standalone.
I will closely follow gndiff evolution ;)
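For illustration, a rough sketch of the column-extraction step from the second bullet above, assuming a standard Darwin Core Archive taxon.txt. The file name, delimiter and column header are illustrative (check the archive's meta.xml), and nothing here asserts what gndiff's exact input format is:

```python
# Sketch: pull the scientific-name column out of a DwC-A core file so it can be
# fed to gndiff/gnparser as a plain list of names, one per line.
import csv


def extract_names(dwc_core_path: str, out_path: str,
                  name_column: str = "scientificName") -> int:
    """Write one name per line; return the number of names written."""
    written = 0
    with open(dwc_core_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        for row in reader:
            name = (row.get(name_column) or "").strip()
            if name:
                dst.write(name + "\n")
                written += 1
    return written


# e.g. extract_names("taxon.txt", "checklist-names.txt")
```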

@dimus
Member

dimus commented Feb 17, 2022

Usually I use the formula: work / users_num

I think it is something that everybody who publishes their results would need, so I think it is not so much work in the end. I'll keep this issue open and close it when the system is in place.

@abubelinha
Author

abubelinha commented Feb 18, 2022

I think it is something that everybody who publishes their results would need, so I think it is not so much work in the end. I'll keep this issue open and close it when the system is in place.

Great. I am not sure whether you mean the gnverifier option, the gndiff option, or both, but any advance would be good as far as "theoretical" repeatability goes.

As for truly practical repeatability, I think the gndiff approach is the only good one (it would be easy to replicate something as long as you use the same offline tools, whereas hardly anybody would take on the task of replicating the whole gnames services as they were at some point in the past just to review the soundness of a small experiment or checklist).

@dimus
Member

dimus commented Feb 18, 2022

For gndiff it should be easy: it has no remote dependencies, so its version alone defines the result.

@abubelinha
Author

abubelinha commented Feb 18, 2022

Yes, I agree.
Its version plus a given combination of request parameters, since it would be best to give users the option to define the matching behaviour as much as possible (of course with default values for everything, to avoid undesired CLI complexity).

Either that or using an editable default config file, so users can see the default values and modify them as needed.
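For example, something along these lines as a generic illustration of the "editable defaults plus user overrides" idea (none of the option names below are real gndiff settings):

```python
# Purely illustrative: ship documented defaults, let the user override them
# from a config file they can read and edit.
import json
from pathlib import Path

DEFAULTS = {
    "fuzzy_matching": True,      # hypothetical option
    "max_edit_distance": 1,      # hypothetical option
    "compare_authorship": False  # hypothetical option
}


def load_config(user_config: str = "gndiff.json") -> dict:
    """Start from the defaults and apply the user's overrides, if any."""
    config = dict(DEFAULTS)
    path = Path(user_config)
    if path.exists():
        config.update(json.loads(path.read_text(encoding="utf-8")))
    return config


print(load_config())  # defaults unless gndiff.json overrides them
```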

@abubelinha
Author

Somewhat related, but a bit off-topic.
I have seen some Zenodo links related to your work (e.g. https://doi.org/10.5281/zenodo.5111543). A couple of questions:

  • As that link points back to GitHub, I understand you prefer the Zenodo link to be cited. Correct?
  • Does Zenodo contain a full backup of the GitHub project files as of that time?
    I wonder whether you needed to upload them all to Zenodo yourself (or perhaps there is some "auto-Zenodo" tool for GitHub projects that you can tell me about?)
  • When a project is not yet in Zenodo, which would you say is the best way to cite a GitHub project?
    I am a bit lost because the above Zenodo URL links back to https://github.com/gnames/gnverifier/tree/v0.3.3 , but I am not sure what "tree" and "v0.3.3" mean in this context. What is the difference between tree v0.3.3 and release v0.3.3? https://github.com/gnames/gnverifier/releases/tag/v0.3.3

Just looking for advice, so I can decide whether to use GitHub and/or Zenodo for versioning a checklist in the future.

Thanks a lot in advance

@dimus
Member

dimus commented Feb 26, 2022

Someone wanted to cite gnames, so I created a Zenodo link for that purpose. Being lazy, I prefer to avoid unnecessary work, so I decided not to update these links until someone requests a change again :)

@abubelinha
Author

abubelinha commented Feb 26, 2022

OK. I thought you used some kind of automatic backup from GitHub to Zenodo.

As for the difference between a GitHub tree vX.X.X and a release vX.X.X, do you have any opinion?

@dimus
Member

dimus commented Feb 26, 2022

I think tree/vX.X.X and vX.X.X mean the same thing. In the case of GitHub links I usually use something like
https://github.com/gnames/gnverifier/releases/tag/v0.8.2
