Add function to flag similar strings #75

allaway · 2022-10-21T21:06:36Z

We currently do not standardize PI or institution names. It would be helpful to do this on a semi-regular basis.

It would be great if we could have a function that flags similar strings in the Studies table, and add it as, say, a weekly or quarterly job. It would probably require manual intervention to actually fix the data.

allaway · 2022-10-21T21:26:47Z

Quick and dirty example:

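```r
library(stringdist)
library(dplyr)
library(tibble)
library(synapser)
synLogin()

# Pull every distinct PI name from the Studies table
foo <- synTableQuery('select distinct unnest(studyLeads) as pi from syn16787123')$asDataFrame()

# Pairwise Jaro-Winkler distances, as a square tibble
dist <- stringdist::stringdistmatrix(foo$pi, method = "jw") %>%
  as.matrix() %>%
  as_tibble()

# Quick visual check of the distance matrix
pheatmap::pheatmap(dist)

# Name the columns after the PIs and add a column holding each row's PI
colnames(dist) <- foo$pi
dist["pi_1"] <- foo$pi

# One row per pair; drop zero-distance pairs and sort closest first
tidy_names <- tidyr::gather(dist, !contains("pi_1"), key = "pi_2", value = "dist") %>%
  filter(dist != 0) %>%
  arrange(dist)
```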

Which yields:

Interestingly, one of the more prevalent issues appears to be trailing/leading whitespace, probably from older manual copy-pasting...
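
Since stray whitespace accounts for many of the near-duplicates, trimming before computing distances would probably remove a lot of noise. A minimal sketch in base R; `normalize_names` is a hypothetical helper and the sample names are made up, not taken from the Studies table:

```r
# Hypothetical helper: trim leading/trailing whitespace, collapse internal
# runs of whitespace to a single space, then drop exact duplicates.
normalize_names <- function(x) {
  x <- trimws(x)              # strip leading/trailing whitespace
  x <- gsub("\\s+", " ", x)   # collapse repeated internal whitespace
  unique(x)
}

# Made-up example values
pis <- c("Jane Doe", " Jane Doe", "Jane  Doe ", "John Smith")
cleaned <- normalize_names(pis)
print(cleaned)
```

Names that differ only by whitespace collapse to a single entry, so they no longer show up as low-distance pairs.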

Pairs with a Jaro-Winkler distance above 0.2 seem to be truly distinct, whereas pairs below 0.2 seem to deserve closer inspection.
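
The 0.2 cutoff could be wrapped into the kind of flagging function this issue asks for. This is only a sketch: `flag_similar_strings` and its arguments are hypothetical names, and the threshold is the rough value observed above rather than a tuned one. Note that `stringdist`'s `"jw"` method uses `p = 0` by default, which gives the plain Jaro distance.

```r
library(stringdist)

# Hypothetical helper: return each unordered pair of distinct strings whose
# distance is below `threshold`, closest pairs first.
flag_similar_strings <- function(x, threshold = 0.2, method = "jw") {
  x <- unique(x[!is.na(x)])
  if (length(x) < 2) {
    return(data.frame(string_1 = character(), string_2 = character(),
                      dist = numeric()))
  }
  d <- as.matrix(stringdist::stringdistmatrix(x, method = method))
  # The upper triangle keeps each pair once and skips the diagonal
  idx <- which(upper.tri(d) & d < threshold, arr.ind = TRUE)
  out <- data.frame(
    string_1 = x[idx[, "row"]],
    string_2 = x[idx[, "col"]],
    dist = d[idx],
    stringsAsFactors = FALSE
  )
  out[order(out$dist), , drop = FALSE]
}

# Made-up example values
flagged <- flag_similar_strings(c("Jane Doe", "Jane Doe ", "John Smith"))
print(flagged)
```

With the PI query above, `flag_similar_strings(foo$pi)` would return the candidate pairs for manual review; a scheduled job could then report that table instead of the full matrix.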

allaway · 2022-10-21T21:32:09Z

Similar for institutions:

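```r
# Same approach, applied to the distinct institution names
foo <- synTableQuery('select distinct unnest(institutions) as inst from syn16787123')$asDataFrame()

# Pairwise Jaro-Winkler distances, as a square tibble
dist <- stringdist::stringdistmatrix(foo$inst, method = "jw") %>%
  as.matrix() %>%
  as_tibble()

pheatmap::pheatmap(dist)

# Name the columns after the institutions and add a column holding each row's institution
colnames(dist) <- foo$inst
dist["inst_1"] <- foo$inst

# One row per pair; drop zero-distance pairs and sort closest first
tidy_names <- tidyr::gather(dist, !contains("inst_1"), key = "inst_2", value = "dist") %>%
  filter(dist != 0) %>%
  arrange(dist)
```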

yields:

However, this isn't as easy to scan manually, because the many high-similarity "University of ..." matches hide the true matches that need correction. Can you spot them here? ;)
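
One possible way to cut down on that noise would be to strip uninformative tokens before computing distances, so that the comparison is driven by the distinguishing part of the name. This is an untested idea sketched under assumptions: `strip_common_tokens` and the `stop_tokens` list are hypothetical and would need tuning against the real data, and the institution names below are only illustrative.

```r
library(stringdist)

# Assumed list of uninformative tokens; adjust for the real data
stop_tokens <- c("university", "of", "the", "institute", "college")

# Hypothetical helper: lowercase, remove stop tokens, tidy whitespace
strip_common_tokens <- function(x, tokens = stop_tokens) {
  x <- tolower(trimws(x))
  pattern <- paste0("\\b(", paste(tokens, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x))
}

# Illustrative values; the second one contains a deliberate typo
insts <- c("University of Minnesota", "University of Minesota",
           "University of Michigan")
keys <- strip_common_tokens(insts)

# Distances on the stripped keys rather than on the full names
d <- as.matrix(stringdist::stringdistmatrix(keys, method = "jw"))
dimnames(d) <- list(insts, insts)
print(round(d, 3))
```

On the stripped keys, the misspelled pair stays close while the Minnesota/Michigan pair moves well past the 0.2 mark, which should make the genuine candidates easier to pick out. The flagged pairs should still be reported with their original names.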

allaway · 2022-10-31T22:39:38Z

PI names in screenshot above have been standardized. I picked whichever one was more recent as the "standard."

anngvu added this to NF-OSI Sprints Jul 14, 2023
