Multiple author institutions lost from works #50

Open
zilch42 opened this issue Nov 22, 2022 · 8 comments
Comments

@zilch42

zilch42 commented Nov 22, 2022

Hi there, oa2df() appears to drop all but the first institutional affiliation for each author when returning works.

See this example:
https://explore.openalex.org/works/W2898962279

The third author, Tim McVicar, is affiliated with both the Australian Research Council and CSIRO Land and Water.

The raw JSON from oa_request() includes both affiliations:

library(openalexR)
library(dplyr)
oa_query(identifier = "W2898962279") %>% oa_request()

(output below is just the relevant subset because it's long)

$authorships[[3]]
$authorships[[3]]$author_position
[1] "middle"

$authorships[[3]]$author
$authorships[[3]]$author$id
[1] "https://openalex.org/A2013114412"

$authorships[[3]]$author$display_name
[1] "Tim R. McVicar"

$authorships[[3]]$author$orcid
[1] "https://orcid.org/0000-0002-0877-8285"


$authorships[[3]]$institutions
$authorships[[3]]$institutions[[1]]
$authorships[[3]]$institutions[[1]]$id
[1] "https://openalex.org/I1337719021"

$authorships[[3]]$institutions[[1]]$display_name
[1] "Australian Research Council"

$authorships[[3]]$institutions[[1]]$ror
[1] "https://ror.org/05mmh0f86"

$authorships[[3]]$institutions[[1]]$country_code
[1] "AU"

$authorships[[3]]$institutions[[1]]$type
[1] "government"


$authorships[[3]]$institutions[[2]]
$authorships[[3]]$institutions[[2]]$id
[1] "https://openalex.org/I4210161554"

$authorships[[3]]$institutions[[2]]$display_name
[1] "CSIRO Land and Water"

$authorships[[3]]$institutions[[2]]$ror
[1] "https://ror.org/057xz1h85"

$authorships[[3]]$institutions[[2]]$country_code
[1] "AU"

$authorships[[3]]$institutions[[2]]$type
[1] "facility"



$authorships[[3]]$raw_affiliation_string
[1] "Australian Research Council Centre of Excellence for Climate System Science, Sydney, Australia"

But when using oa_fetch() the flattening process appears to lose CSIRO Land and Water.

oa_fetch("W2898962279")$author[[1]]$institution_display_name
[1] "Princeton University"        "ETH Zurich"                  "Australian Research Council" "Princeton University"        "Princeton University"        "Princeton University"   

[Image: author table in RStudio]

@trangdata
Collaborator

Hi @zilch42 — thanks for raising this issue. 🌈

You're right: by default, oa_fetch returns a "flattened" dataframe/tibble as output (of potentially many entities/rows). And in the flattening process, we have decided to only keep one institution to simplify the author column of the tibble.

If you'd like the original nested list without any simplification, you could do oa_query |> oa_request as you mentioned, or oa_fetch(output = "list") as below. I will try to make this argument clearer in the documentation.

library(openalexR)
dat <- oa_fetch("W2898962279", output = "list")
do.call(rbind.data.frame, dat$author[[3]]$institutions)
#>                                 id                display_name
#> 1 https://openalex.org/I1337719021 Australian Research Council
#> 2 https://openalex.org/I4210161554        CSIRO Land and Water
#>                         ror country_code       type
#> 1 https://ror.org/05mmh0f86           AU government
#> 2 https://ror.org/057xz1h85           AU   facility

Created on 2022-11-21 with reprex v2.0.2

TL;DR: we intentionally keep only one institution for each author in the author column of the tibble output. But we would love to hear other ideas on this simplification. 🌱
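To make the simplification concrete, here is a minimal base-R sketch of "keep the first institution" versus "keep all institutions" for one authorship entry. The list is hand-built from the raw output quoted earlier in this thread, and the helpers `first_institution()` and `all_institutions()` are hypothetical names for illustration; this is not the actual oa2df code.

```r
# Mock of one authorship entry, copied from the raw output quoted above.
authorship <- list(
  author = list(
    id = "https://openalex.org/A2013114412",
    display_name = "Tim R. McVicar"
  ),
  institutions = list(
    list(id = "https://openalex.org/I1337719021",
         display_name = "Australian Research Council"),
    list(id = "https://openalex.org/I4210161554",
         display_name = "CSIRO Land and Water")
  )
)

# Hypothetical helper: keep only the first institution name
# (the kind of simplification described above).
first_institution <- function(a) {
  inst <- a$institutions
  if (length(inst) == 0) return(NA_character_)
  inst[[1]]$display_name
}

# Hypothetical helper: keep every institution name.
all_institutions <- function(a) {
  vapply(a$institutions, function(i) i$display_name, character(1))
}

first_institution(authorship)
all_institutions(authorship)
```

The first call returns a single name, while the second returns both, which is the difference between the tibble output and the list output discussed here.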

@zilch42
Author

zilch42 commented Nov 22, 2022

Thanks @trangdata , glad to know there is a method for getting at the data.

To my mind, it would be more intuitive to flatten to the lowest possible level and accept duplication rather than drop data. In the author table, I would have expected two rows for Tim McVicar, each with a different institution. I can appreciate that this might confuse other users, who would then have to deduplicate on au_id if authors were what they were interested in. In my opinion, though, that would still be easier than needing a list approach for one case and a tibble approach for another (I'm not very familiar with lists 😄).
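A rough sketch of that "long" layout in base R, assuming a hand-built nested list shaped like the raw output above. The function `flatten_long()` is a hypothetical name, the second author is an invented placeholder with no institutions, and the column names other than au_id and institution_display_name are assumptions rather than a description of the package's output.

```r
# Mock authorships: the McVicar entry from the raw output above, plus an
# invented placeholder author with no institutions (illustration only).
authorships <- list(
  list(
    author = list(id = "https://openalex.org/A2013114412",
                  display_name = "Tim R. McVicar"),
    institutions = list(
      list(id = "https://openalex.org/I1337719021",
           display_name = "Australian Research Council"),
      list(id = "https://openalex.org/I4210161554",
           display_name = "CSIRO Land and Water")
    )
  ),
  list(
    author = list(id = "placeholder-author-id",
                  display_name = "Placeholder Author"),
    institutions = list()
  )
)

# Hypothetical helper: one row per author-institution pair.
flatten_long <- function(authorships) {
  rows <- lapply(authorships, function(a) {
    inst <- a$institutions
    # Authors without institutions still get one row, filled with NA.
    if (length(inst) == 0) {
      inst <- list(list(id = NA_character_, display_name = NA_character_))
    }
    data.frame(
      au_id = a$author$id,
      au_display_name = a$author$display_name,
      institution_id = vapply(inst, function(i) i$id, character(1)),
      institution_display_name =
        vapply(inst, function(i) i$display_name, character(1)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

long <- flatten_long(authorships)

# Deduplicate on au_id when only the authors are of interest.
authors_only <- long[!duplicated(long$au_id), ]
```

Here `long` has two rows for Tim R. McVicar, and `authors_only` keeps the first row per au_id, which is the extra deduplication step mentioned above.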

It would be good to have some more detailed documentation, particularly on this page
https://massimoaria.github.io/openalexR/articles/About-the-tibble-output.html

Correct me if I'm wrong, but Example 2 on that page, which is about finding the institutions associated with a group of works, isn't actually correct: it uses the tibble output rather than the list output and therefore takes only the first institution for each author. That page mentions output = "list" only as a matter of comfort and familiarity, not as something that can affect the completeness of the data returned.

It would be great to have a list of any other tables one needs to be careful with, where records may be dropped using oa_fetch(output="tibble"), necessitating list output instead if they are the main point of interest.

trangdata added a commit that referenced this issue Nov 24, 2022
@trangdata
Collaborator

Thank you for this explanation @zilch42. 🌻

The purpose of the "About the tibble output" vignette was to show a few different ways for the user to extract the data. But you're right. I have added more information and clarified the assumption we made in 77b37b8.

A table of specific simplifications makes sense. I came to this project later and these simplifications were already there, but we should revisit and write up more clearly what is being done in oa2df. 👍🏽

@zilch42
Author

zilch42 commented Nov 24, 2022

Thanks @trangdata. The updated documentation is definitely clearer.

@zilch42
Author

zilch42 commented Nov 25, 2022

Hi @trangdata, I have modified works2df to allow all institutions to be returned using output="tibble" for my own use at least. Are you open to a pull request on this?

@trangdata
Collaborator

trangdata commented Nov 27, 2022

@zilch42 I'm happy to look at what you'd like to change, but I'm not sure if 2 rows for one author in the author column is intuitive. I was hoping to keep only one unique OpenAlex ID for each row. @massimoaria what do you think?

@massimoaria
Collaborator

@trangdata I fully agree with you.
The format used in the oa2df function follows a standard for bibliographic datasets that serve as input files for major science mapping software.
Creating multiple rows for each item, following a "long" logic, risks producing huge files that are hard to handle, not intuitive, and have non-unique ids.
The best way to get at the original data is to use the list format and manipulate it at will.

@mariusbommert

I would be interested in an option for getting all institution data in works2df, as mentioned above. I updated works2df in https://github.com/mariusbommert/openalexR/blob/main/R/oa2df.R with 2 additional parameters, use_first_institution and use_first_affiliation_string, to allow retrieving multiple affiliations. If both parameters are TRUE (the default), you get the same result as in the original version of works2df. If one or both parameters are FALSE, multiple institutions are considered and the corresponding information is available as a tibble. There is still only one row per author; only the institution and/or raw_affiliation are changed. Is there any chance that such a feature will be added/merged?
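For illustration only, a minimal base-R sketch of the "one row per author, institutions nested in a column" idea. It is not the code in the linked fork: the input is a hand-built mock shaped like the raw output quoted earlier, the second author is an invented placeholder, and the column layout is an assumption.

```r
# Mock authorships: the McVicar entry from the raw output above, plus an
# invented placeholder author with no institutions (illustration only).
authorships <- list(
  list(
    author = list(id = "https://openalex.org/A2013114412",
                  display_name = "Tim R. McVicar"),
    institutions = list(
      list(id = "https://openalex.org/I1337719021",
           display_name = "Australian Research Council"),
      list(id = "https://openalex.org/I4210161554",
           display_name = "CSIRO Land and Water")
    )
  ),
  list(
    author = list(id = "placeholder-author-id",
                  display_name = "Placeholder Author"),
    institutions = list()
  )
)

# One row per author, so au_id stays unique.
authors <- data.frame(
  au_id = vapply(authorships, function(a) a$author$id, character(1)),
  au_display_name =
    vapply(authorships, function(a) a$author$display_name, character(1)),
  stringsAsFactors = FALSE
)

# List column: each cell holds a small data frame with all of that
# author's institutions (zero rows when there are none).
authors$institutions <- lapply(authorships, function(a) {
  inst <- a$institutions
  data.frame(
    id = vapply(inst, function(i) i$id, character(1)),
    display_name = vapply(inst, function(i) i$display_name, character(1)),
    stringsAsFactors = FALSE
  )
})

# Both institutions for the first author remain available.
authors$institutions[[1]]$display_name
```

This keeps a unique id per row, which was the concern raised earlier in the thread, while leaving every affiliation reachable from the nested column.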


4 participants