Multiple author institutions lost from works #50

zilch42 · 2022-11-22T01:50:05Z

Hi there, oa2df() appears to be dropping subsequent institutional affiliations from authors when returning works.

See this example:
https://explore.openalex.org/works/W2898962279

The third author, Tim McVicar, is affiliated with both the Australian Research Council and CSIRO Land and Water.

The raw JSON from oa_request() includes both affiliations:

library(openalexR)
library(dplyr)
oa_query(identifier = "W2898962279") %>% oa_request()

(output below is just the relevant subset because it's long)

$authorships[[3]]
$authorships[[3]]$author_position
[1] "middle"

$authorships[[3]]$author
$authorships[[3]]$author$id
[1] "https://openalex.org/A2013114412"

$authorships[[3]]$author$display_name
[1] "Tim R. McVicar"

$authorships[[3]]$author$orcid
[1] "https://orcid.org/0000-0002-0877-8285"


$authorships[[3]]$institutions
$authorships[[3]]$institutions[[1]]
$authorships[[3]]$institutions[[1]]$id
[1] "https://openalex.org/I1337719021"

$authorships[[3]]$institutions[[1]]$display_name
[1] "Australian Research Council"

$authorships[[3]]$institutions[[1]]$ror
[1] "https://ror.org/05mmh0f86"

$authorships[[3]]$institutions[[1]]$country_code
[1] "AU"

$authorships[[3]]$institutions[[1]]$type
[1] "government"


$authorships[[3]]$institutions[[2]]
$authorships[[3]]$institutions[[2]]$id
[1] "https://openalex.org/I4210161554"

$authorships[[3]]$institutions[[2]]$display_name
[1] "CSIRO Land and Water"

$authorships[[3]]$institutions[[2]]$ror
[1] "https://ror.org/057xz1h85"

$authorships[[3]]$institutions[[2]]$country_code
[1] "AU"

$authorships[[3]]$institutions[[2]]$type
[1] "facility"



$authorships[[3]]$raw_affiliation_string
[1] "Australian Research Council Centre of Excellence for Climate System Science, Sydney, Australia"

But when using oa_fetch(), the flattening process appears to lose CSIRO Land and Water:

oa_fetch("W2898962279")$author[[1]]$institution_display_name

[1] "Princeton University"        "ETH Zurich"                  "Australian Research Council" "Princeton University"        "Princeton University"        "Princeton University"

[Screenshot: author table in RStudio]

trangdata · 2022-11-22T02:38:14Z

Hi @zilch42 — thanks for raising this issue. 🌈

You're right: by default, oa_fetch returns a "flattened" dataframe/tibble as output (of potentially many entities/rows). And in the flattening process, we have decided to only keep one institution to simplify the author column of the tibble.

If you'd like the original nested list without any simplification, you could do oa_query |> oa_request as you mentioned, or oa_fetch(output = "list") as below. I will try to make this argument clearer in the documentation.

library(openalexR)
dat <- oa_fetch("W2898962279", output = "list")
do.call(rbind.data.frame, dat$author[[3]]$institutions)
#>                                 id                display_name
#> 1 https://openalex.org/I1337719021 Australian Research Council
#> 2 https://openalex.org/I4210161554        CSIRO Land and Water
#>                         ror country_code       type
#> 1 https://ror.org/05mmh0f86           AU government
#> 2 https://ror.org/057xz1h85           AU   facility

Created on 2022-11-21 with reprex v2.0.2

TL;DR: we were intentional in keeping only one institution for each author in the author column of the tibble output. But we would love to hear other ideas on this simplification. 🌱
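To make the simplification concrete, here is a minimal base R sketch of what "keep only one institution" means for the third author. It uses a hand-built mock of the nested structure shown earlier in this thread rather than a live API call, and the helpers first_institution() and all_institutions() are hypothetical names for illustration, not functions from openalexR.

```r
# Mock of one authorship entry, built from the values shown earlier in this thread.
authorship <- list(
  author = list(
    id = "https://openalex.org/A2013114412",
    display_name = "Tim R. McVicar"
  ),
  institutions = list(
    list(id = "https://openalex.org/I1337719021",
         display_name = "Australian Research Council"),
    list(id = "https://openalex.org/I4210161554",
         display_name = "CSIRO Land and Water")
  )
)

# Hypothetical helper: keep only the first institution name (NA if there is none).
first_institution <- function(a) {
  inst <- a$institutions
  if (length(inst) == 0) return(NA_character_)
  inst[[1]]$display_name
}

# Hypothetical helper: return every institution name for the author.
all_institutions <- function(a) {
  vapply(a$institutions, function(i) i$display_name, character(1))
}

kept <- first_institution(authorship)
dropped <- setdiff(all_institutions(authorship), kept)
```

Comparing kept with dropped shows which affiliations a first-only flattening would leave out for a given author.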

zilch42 · 2022-11-22T03:46:44Z

Thanks @trangdata , glad to know there is a method for getting at the data.

It would be intuitive in my mind to flatten to the lowest possible level and therefore include duplication rather than drop data. So in the case of the author table I would have expected to see 2 rows for Tim McVicar, each with a different institution, but I can appreciate how that may confuse other folks as you would then have to deduplicate on au_id if authors were what you were interested in. In my personal opinion though, that would still be easier than needing to use a list approach for one case and a tibble approach for another (me being not very familiar with lists 😄).
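The "long" layout described here could look roughly like the base R sketch below: one row per author-institution pair, followed by deduplication on au_id when only the authors are of interest. The data are a hand-built mock; the first author's ID and name are placeholders, and the column names are assumptions chosen for illustration rather than the exact oa2df() output.

```r
# Mock authorships: the first entry uses placeholder author details,
# the second uses the values shown earlier in this thread.
authorships <- list(
  list(
    author = list(id = "https://openalex.org/A0000000000",  # placeholder ID
                  display_name = "Placeholder Author"),     # placeholder name
    institutions = list(
      list(id = "I-placeholder", display_name = "Princeton University")
    )
  ),
  list(
    author = list(id = "https://openalex.org/A2013114412",
                  display_name = "Tim R. McVicar"),
    institutions = list(
      list(id = "https://openalex.org/I1337719021",
           display_name = "Australian Research Council"),
      list(id = "https://openalex.org/I4210161554",
           display_name = "CSIRO Land and Water")
    )
  )
)

# Build one row per (author, institution) pair.
author_long <- do.call(rbind, lapply(authorships, function(a) {
  inst <- a$institutions
  if (length(inst) == 0) {
    # Keep authors without any institution as a single row of NAs.
    inst <- list(list(id = NA_character_, display_name = NA_character_))
  }
  data.frame(
    au_id = a$author$id,
    au_display_name = a$author$display_name,
    institution_id = vapply(inst, function(i) i$id, character(1)),
    institution_display_name = vapply(inst, function(i) i$display_name, character(1)),
    stringsAsFactors = FALSE
  )
}))

# If only authors matter, deduplicate on au_id afterwards.
authors_unique <- author_long[!duplicated(author_long$au_id),
                              c("au_id", "au_display_name")]
```

The trade-off is visible in the row counts: author_long has more rows than there are authors, so au_id is no longer unique until the deduplication step is applied.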

It would be good to have some more detailed documentation, particularly on this page
https://massimoaria.github.io/openalexR/articles/About-the-tibble-output.html

Correct me if I'm wrong, but Example 2 on that page, which is about finding the institutions associated with a group of works, isn't actually correct, because it uses the tibble output rather than the list output and therefore takes only the first institution for each author? That page only mentions using output="list" as a matter of comfort and familiarity, rather than as something that can affect the completeness of the data returned.

It would be great to have a list of any other tables one needs to be careful with, where records may be dropped using oa_fetch(output="tibble"), necessitating list output instead if they are the main point of interest.

trangdata · 2022-11-24T01:40:02Z

Thank you for this explanation @zilch42. 🌻

The purpose of the "About the tibble output" vignette was to show a few different ways for the user to extract the data. But you're right. I have added more information and clarified the assumption we made in 77b37b8.

A table of specific simplifications makes sense. I came to this project later and these simplifications were already there, but we should revisit and write up more clearly what is being done in oa2df. 👍🏽

zilch42 · 2022-11-24T05:09:47Z

Thanks @trangdata. The updated documentation is definitely clearer.

zilch42 · 2022-11-25T06:19:30Z

Hi @trangdata, I have modified works2df to allow all institutions to be returned using output="tibble" for my own use at least. Are you open to a pull request on this?

trangdata · 2022-11-27T01:21:42Z

@zilch42 I'm happy to look at what you'd like to change, but I'm not sure if 2 rows for one author in the author column is intuitive. I was hoping to keep only one unique OpenAlex ID for each row. @massimoaria what do you think?

massimoaria · 2022-11-27T10:57:06Z

@trangdata I fully agree with you.
The format used in the oa2df function follows a standard in bibliographic datasets used with input files of major science mapping software.
Creating multiple rows for each item, following a "long" logic, risks creating huge files that are not easy to handle, not intuitive, and with non-unique ids.
The best way to get at the original data is to use the list format and manipulate it at will.

mariusbommert · 2024-04-29T11:11:13Z

I would be interested in an option for getting all institution data in works2df, as mentioned above. I updated works2df in https://github.com/mariusbommert/openalexR/blob/main/R/oa2df.R with two additional parameters, use_first_institution and use_first_affiliation_string, to make it possible to get multiple affiliations. If both parameters are TRUE (the default), you get the same result as in the original version of works2df. If one or both parameters are FALSE, multiple institutions are considered and the corresponding information is available as a tibble. There is still only one row per author; only the institution and/or raw_affiliation columns change. Is there any chance that such a feature will be added/merged?
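As a rough illustration of the "one row per author, all institutions retained" idea, the base R sketch below stores the institution names in a list column when a flag is switched off. This is not the code from the linked fork: flatten_authors() is a hypothetical helper, only the parameter name use_first_institution is borrowed from the comment above, and the first author's details are placeholders.

```r
# Mock authorships: the first entry uses placeholder author details,
# the second uses the values shown earlier in this thread.
authorships <- list(
  list(
    author = list(id = "https://openalex.org/A0000000000"),  # placeholder ID
    institutions = list(list(display_name = "Princeton University"))
  ),
  list(
    author = list(id = "https://openalex.org/A2013114412"),
    institutions = list(
      list(display_name = "Australian Research Council"),
      list(display_name = "CSIRO Land and Water")
    )
  )
)

# Hypothetical helper: always one row per author.
# use_first_institution = TRUE  -> character column with the first name only
# use_first_institution = FALSE -> list column holding every name
flatten_authors <- function(authorships, use_first_institution = TRUE) {
  out <- data.frame(
    au_id = vapply(authorships, function(a) a$author$id, character(1)),
    stringsAsFactors = FALSE
  )
  inst_names <- lapply(authorships, function(a) {
    vapply(a$institutions, function(i) i$display_name, character(1))
  })
  if (use_first_institution) {
    out$institution_display_name <- vapply(
      inst_names,
      function(x) if (length(x) > 0) x[[1]] else NA_character_,
      character(1)
    )
  } else {
    out$institution_display_name <- inst_names  # list column, one element per row
  }
  out
}

first_only <- flatten_authors(authorships)
all_kept <- flatten_authors(authorships, use_first_institution = FALSE)
```

Both results have one row per author, so au_id stays unique; the difference is that all_kept$institution_display_name[[2]] holds both affiliations instead of only the first.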

trangdata added a commit that referenced this issue Nov 24, 2022

addresses #50

77b37b8
