r-polars #48
Better: use map.
Reading the fixed part of the RSA is now 10 times faster: 0.257 sec elapsed for 140,000 RSA records. The same approach carries over to actes/diags/rsa_um.
library(polars)
parse_pmsi_fwf <- function(df, champ1 = "mco", table1 = "rsa", an1 = "20") {
  # Look up the fixed-width layout for this field / year / table in pmeasyr::formats
  f_rsa <- pl$DataFrame(pmeasyr::formats)$filter(
    (pl$col('champ') == champ1) &
      (pl$col('an') == an1) &
      (pl$col('table') == table1)
  )$to_data_frame()
  column_names <- f_rsa$nom
  # Build one str$slice() expression per field and add them all in a single with_columns() call
  df <- df$with_columns(purrr::map(seq_len(nrow(f_rsa)), function(k) {
    # 0-based offset, then length; a missing length means "up to the end of the line"
    slice_tuple <- c(f_rsa$position[k] - 1,
                     ifelse(is.na(f_rsa$longueur[k]), 1e6, f_rsa$longueur[k]))
    pl$col('column_1')$str$slice(slice_tuple[1], slice_tuple[2])$alias(column_names[k])
  }))
  df
}

tictoc::tic()
# Each line is read as a single string column (column_1), then processed lazily
u <- pl$read_csv('~/Documents/data/mco/290000017.2022.12.rsa', has_header=FALSE, skip_rows=0)$lazy()
u <- parse_pmsi_fwf(u, "mco", "rsa", "22")
u$drop(c('column_1', 'ZA', 'FILLER6'))$collect()
tictoc::toc()
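To make the slicing logic easier to follow without polars installed, here is a minimal base R sketch of the same idea: one substring per row of a layout table, with a missing length meaning "up to the end of the line". The layout, the field names and the sample lines are made up for illustration and are not taken from pmeasyr::formats; `slice_fixed_width` is a hypothetical helper, not part of pmeasyr or polars.

```r
# Toy layout table (hypothetical, for illustration only)
layout <- data.frame(
  nom      = c("FINESS", "ANNEE", "RESTE"),
  position = c(1L, 10L, 14L),  # 1-based start of each field
  longueur = c(9L, 4L, NA),    # NA means "up to the end of the line"
  stringsAsFactors = FALSE
)

# Two made-up fixed-width lines
lines <- c("2900000172022ABCDEF", "2900000182021XY")

slice_fixed_width <- function(lines, layout) {
  fields <- lapply(seq_len(nrow(layout)), function(k) {
    start <- layout$position[k]
    # substr() takes 1-based inclusive bounds, so no "- 1" is needed here
    end_pos <- if (is.na(layout$longueur[k])) {
      nchar(lines)
    } else {
      start + layout$longueur[k] - 1L
    }
    substr(lines, start, end_pos)
  })
  names(fields) <- layout$nom
  as.data.frame(fields, stringsAsFactors = FALSE)
}

parsed <- slice_fixed_width(lines, layout)
print(parsed)
```

The polars version above subtracts 1 from the position because it passes an offset plus a length to str$slice(), whereas substr() expects a 1-based start and end.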
This is getting interesting compared with the current pmeasyr and with vroom/readr.
Test on 2.1 M RSA records: 3.9 seconds to read the fixed part.
tictoc::tic()
u <- pl$read_csv('~/Documents/data/mco/290000018.2022.12.rsa', has_header=FALSE, skip_rows=0)$lazy()
u <- parse_pmsi_fwf(u, "mco", "rsa", "22")
u$drop(c('column_1', 'ZA', 'FILLER6'))$collect()
tictoc::toc()
On the other hand, converting the result to a data.frame / tibble afterwards is costly: v <- u$to_data_frame() |> tibble::as_tibble() takes as long as the slicing itself.
Test: 2.6 seconds vs 5.8 seconds.