add function to query citations #22

Open
wants to merge 15 commits into
base: master

Conversation

arw36

@arw36 arw36 commented Jan 4, 2020

Currently, references are provided as a URL for each occurrence. This makes it difficult to synthesize the primary literature for each interaction and likely leads to helminthR users citing only LMNH and helminthR, rather than the data publishers (a problem shared with other data aggregation platforms such as GBIF; see Escribano et al. 2018). This function allows a user to input a previous occurrence query and output the relevant primary literature.

An outstanding issue is that the LMNH website paginates references at 30 articles per page, so the function currently synthesizes only the first 30 references per occurrence. I'm hoping this was an issue for the other helminthR functions, and you might have a solution already?

I think there could be several improvements to this, for instance linking primary literature back to specific interactions rather than full search queries. For now, I think this is a good first step.

Escribano N, Galicia D, Ariño AH. The tragedy of the biodiversity data commons: a data impediment creeping nigher? Database. 2018 Apr 9;2018:bay033.

@taddallas
Member

This looks fantastic! Thanks for your work on this. I don't think it's quite ready to merge into the package now, but I think it'll be a really nice addition. A couple of things that need to be worked out:

  • remove library calls and edit the namespace to call in needed functions.
  • change the dplyr call (which I think is just for the filter function?) to a which statement so we can remove the dplyr added dependency.
  • Process the citation text a bit to make it more readable (lots of \r and \n symbols that could be removed, e.g., gsub("\r|\n", "", cites[3,])).
  • I'm curious about the output format. It might be better if each unique host-helminth interaction had its own references, so using something like a list instead of a data.frame? I could also see the argument for a tibble here, but I don't know if I want to add one more dependency and a bit of a headache for some users.
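
As a rough illustration of the last three points, here is a minimal base R sketch. The `cites` data frame, its column names, and its contents are hypothetical placeholders rather than the real output of the function in this PR. Note that it replaces runs of whitespace with a single space instead of deleting `\r` and `\n` outright, so that words on either side of a line break do not get glued together.

```r
# Hypothetical scraped output: one row per reference, raw whitespace included.
cites <- data.frame(
  Host      = c("Host A", "Host A", "Host B"),
  Parasite  = c("Parasite X", "Parasite X", "Parasite Y"),
  Reference = c("Author A.\r\n (1990). Title one.\r\n",
                "Author B. (2001).\n Title two.",
                "Author C.\r (1985). Title three."),
  stringsAsFactors = FALSE
)

# Replace each run of whitespace (including \r and \n) with one space, then trim.
cites$Reference <- trimws(gsub("[[:space:]]+", " ", cites$Reference))

# Base R row filter with which(), in place of dplyr::filter().
host_a <- cites[which(cites$Host == "Host A"), ]

# One character vector of references per unique host-parasite pair.
pair <- paste(cites$Host, cites$Parasite, sep = " - ")
refs_by_pair <- split(cites$Reference, pair)
```

`refs_by_pair` is a plain named list, so something like `refs_by_pair[["Host A - Parasite X"]]` would return the references for that one interaction without adding a tibble dependency.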

Let me know what you think about this, and let me know how I can help. I can look into the 30 citation limit, but this may not be a large problem if each host-helminth interaction is queried separately (as associations tend to be based on 1-5 citations).

Thanks again for your work on this. :)

@arw36
Author

arw36 commented Jan 6, 2020

Thanks for the feedback. I'll work on those edits this week.

One solution for interactions with > 30 references could simply be to include a warning that some references are cut off and that the user can go to the URL to retrieve them manually. This would only apply to those uncommon cases (e.g. foxes, pigs).
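
A minimal sketch of such a warning, assuming the references for one interaction arrive as a data frame. The helper name `warn_if_truncated`, its arguments, and the example URL are all hypothetical. One caveat: an interaction with exactly 30 references would also trigger the warning, since the row count alone cannot distinguish a full page from a truncated one.

```r
# Hypothetical helper: warn when a reference table has reached the page size,
# which suggests (but does not prove) that further references were cut off.
warn_if_truncated <- function(refs, url, page_size = 30L) {
  if (nrow(refs) >= page_size) {
    warning("Only the first ", page_size, " references were retrieved; ",
            "see ", url, " for the full list.", call. = FALSE)
  }
  invisible(refs)
}

# Usage with placeholder data: 5 rows pass silently.
few <- warn_if_truncated(data.frame(Reference = letters[1:5]),
                         url = "http://example.org/references")
```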

@taddallas
Member

I also just noticed a workaround for the 30 citations bit. The structure of the call can match the existing find functions, with some minor modifications.

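    # Same query structure as findParasite; only the base URL changes.
    url <- "http://www.nhm.ac.uk/research-curation/scientific-resources/taxonomy-systematics/host-parasites/database/references.jsp;"
    # A very large dbfnsRowsPerPage is intended to return all rows on one page.
    # group, subgroup, genus, species, location and hostState are the
    # arguments of the enclosing find function.
    args <- list(dbfnsRowsPerPage = "500000", x = "13", y = "5", 
        paragroup = group, fmsubgroup = "Contains", subgroup = subgroup, 
        fmparagenus = "Contains", paragenus = genus, fmparaspecies = "Contains", 
        paraspecies = species, fmhostgenus = NULL, hostgenus = NULL, 
        fmhostspecies = NULL, hostspecies = NULL, location = location, 
        hstate = hostState, pstatus = NULL, showparasites = "on", 
        showhosts = "on", showrefs = "on", groupby = "parasite", 
        search = "Search")
    # GET() is assumed to be httr::GET, as used by the other find functions.
    hp <- GET(url, query = args)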

I haven't checked, but I think the above code should pull the information for all associated citations for a given query (host/parasite info). The above example is pulled from the findParasite function, just with the base URL changed.
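
If findCitations were to take the same arguments as the other find functions, the query list could be assembled along these lines. This is only a sketch: `build_ref_query` is a hypothetical helper that is not part of helminthR, it includes only a subset of the parameters from the snippet above, the genus value is just an example input, and whether the server actually honours `dbfnsRowsPerPage` on the references page is untested.

```r
# Hypothetical helper (not part of helminthR): build the query list from
# find-style arguments and drop any that were left as NULL.
build_ref_query <- function(genus = NULL, species = NULL, group = NULL,
                            subgroup = NULL, location = NULL,
                            hostState = NULL, rows = "500000") {
  args <- list(
    dbfnsRowsPerPage = rows,           # large value to avoid pagination
    paragroup = group,
    fmsubgroup = "Contains", subgroup = subgroup,
    fmparagenus = "Contains", paragenus = genus,
    fmparaspecies = "Contains", paraspecies = species,
    location = location, hstate = hostState,
    showrefs = "on", search = "Search"
  )
  # Keep only the entries that were actually supplied.
  Filter(Negate(is.null), args)
}

q <- build_ref_query(genus = "Toxocara")
names(q)
```

The resulting list could then be passed as `query = q` to the same `GET(url, query = ...)` call shown above.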

If possible, can we also get around the new imports (e.g., tidyr and reshape2)? The package requires a bunch of dependencies already, I think due to rvest requiring a bunch of tidyverse-esque stuff, but I'm not certain.

Thanks again for your work on this. Sorry I didn't notice the similar call for references earlier. Hopefully this helps, though I don't know if it's best to have findCitations take the same arguments as the other find functions or the set of interactions (as your code currently does).

@arw36
Author

arw36 commented Jan 7, 2020

I removed the reshape2 and stringr dependencies. I'm not sure what the base equivalents of the tidyr long-to-wide conversion are. These tidyr functions were added to filter by the reference comments, which hold some important annotations, such as whether a reference is a non-original source.
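
One base R option for the long-to-wide step is `stats::reshape()` with `direction = "wide"`. The sketch below assumes a long table with one row per reference field; the column names (`ref_id`, `field`, `value`), the contents, and the "non-original source" comment text are hypothetical placeholders, not the actual structure used in this PR.

```r
# Hypothetical long-format table: one row per (reference, field) pair.
long <- data.frame(
  ref_id = c(1, 1, 2, 2),
  field  = c("citation", "comment", "citation", "comment"),
  value  = c("Author A (1990)", "non-original source", "Author B (2001)", NA),
  stringsAsFactors = FALSE
)

# Long to wide in base R: one row per ref_id, one column per field.
wide <- reshape(long, idvar = "ref_id", timevar = "field", direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix.
names(wide) <- sub("^value\\.", "", names(wide))

# Keep references that are not flagged as non-original, using which().
keep <- which(is.na(wide$comment) | wide$comment != "non-original source")
original <- wide[keep, ]
```

This avoids the tidyr import at the cost of slightly clumsier column naming, which the `sub()` call tidies up.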

I'll have to play around with querying the URL directly. I'd prefer that the references be linked to a previous query's interactions, as that ties the outputs together more directly.


2 participants