r-polars #48

Open
GuillaumePressiat opened this issue Apr 30, 2023 · 3 comments

Comments

@GuillaumePressiat
Owner

Test:

library(polars)

tictoc::tic()
u <- pl$read_csv('~/Documents/data/mco/290000017.2020.12.rsa', 
            has_header=FALSE,
            skip_rows=0)$lazy()

class(u)
formats <- pmeasyr::formats
library(dplyr)
f_rsa <- formats %>% filter(champ == "mco", an == "20", table == "rsa")
column_names <- f_rsa$nom
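# One (offset, length) pair per field: str$slice() takes a 0-based offset,
# and a missing length (NA) is replaced by 1e6 to mean "to the end of the line".
slice_tuples <- list()
for (i in seq_len(nrow(f_rsa))){
  slice_tuples[[i]] <- c(f_rsa$position[i]-1,ifelse(is.na(f_rsa$longueur[i]), 1e6, f_rsa$longueur[i]))
}
j <- seq_len(nrow(f_rsa))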

# u$with_columns(pl$col('column_1')$str$slice(slice_tuple[1], slice_tuple[2])$alias(column_names[k]),
#                     pl$col('column_1')$str$slice(slice_tuple[1], slice_tuple[2])$alias(column_names[k]))$collect()

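# Add one sliced column per with_columns() call, named after the field.
for (k in j){
  slice_tuple <- slice_tuples[[k]]
  u <- u$with_columns(pl$col('column_1')$str$slice(slice_tuple[1], slice_tuple[2])$alias(column_names[k]))
}
# Drop the raw line plus ZA and FILLER6, then execute the lazy query.
u$drop(c('column_1', 'ZA', 'FILLER6'))$collect()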
tictoc::toc()

2.6 seconds
vs.

tictoc::tic()
pmeasyr::irsa(290000017, 2020, 12, '~/Documents/data/mco', typi = 1)$rsa
tictoc::toc()

5.8 seconds
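
The slice tuples built in the loop above can also be computed without a loop. Below is a minimal sketch on a made-up three-field layout (the `fmt` table and the sample `line` are hypothetical; the real layout comes from `pmeasyr::formats`). It only illustrates the position/length arithmetic and cross-checks it with base R `substr()`; it does not call polars.

```r
# Hypothetical three-field layout; the real one comes from pmeasyr::formats.
fmt <- data.frame(
  nom      = c("NOFINESS", "NOVRSA", "ZA"),
  position = c(1, 10, 13),   # 1-based start column
  longueur = c(9, 3, NA),    # NA: the field runs to the end of the line
  stringsAsFactors = FALSE
)

# Vectorised equivalent of the loop: 0-based offset, then length
# (1e6 is the same "until the end" sentinel as in the thread).
offsets <- fmt$position - 1
lens <- ifelse(is.na(fmt$longueur), 1e6, fmt$longueur)
slice_tuples <- Map(c, offsets, lens)
names(slice_tuples) <- fmt$nom

# Cross-check on a made-up line: substr() takes a 1-based start and stop,
# and silently truncates a stop that lies beyond the end of the string.
line <- "290000017227REST-OF-LINE"
fields <- substr(rep(line, nrow(fmt)), fmt$position, fmt$position + lens - 1)
names(fields) <- fmt$nom
print(fields)
```

The only subtle point is the off-by-one: the format table counts columns from 1, while the slice offset counts from 0.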

@GuillaumePressiat
Owner Author

GuillaumePressiat commented Dec 2, 2023

Better:

Using map:

  • to generate the formats (the start/length tuples for the slices)
  • to build a list of all the columns to create in a single with_columns call, which an r-polars update has made possible

Reading the fixed part of the RSA is now 10 times faster: 0.257 sec elapsed for 140,000 RSA.

And the approach carries over to actes/diags/rsa_um.

library(polars)

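# Slice the single raw column of a PMSI fixed-width file into named fields,
# using the layout that pmeasyr::formats gives for (champ1, table1, an1).
parse_pmsi_fwf <- function(df, champ1 = "mco", table1 = "rsa", an1 = "20") {
  f_rsa <- pl$DataFrame(pmeasyr::formats)$filter((pl$col('champ') == champ1) &
                                                   (pl$col('an') == an1) &
                                                   (pl$col('table') == table1))$to_data_frame()
  column_names <- f_rsa$nom

  # Build one slice expression per field and pass the whole list
  # to a single with_columns() call.
  df <- df$with_columns(purrr::map(seq_len(nrow(f_rsa)), 
     function(k){
       # 0-based offset and length; an NA length means "to the end of the line".
       slice_tuple <- c(f_rsa$position[k]-1,ifelse(is.na(f_rsa$longueur[k]), 1e6, f_rsa$longueur[k]))
       
       pl$col('column_1')$str$slice(slice_tuple[1], 
                                    slice_tuple[2])$alias(column_names[k])
     })
  )
  
  df
}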

tictoc::tic()
u <- pl$read_csv('~/Documents/data/mco/290000017.2022.12.rsa', has_header=FALSE, skip_rows=0)$lazy()
u <- parse_pmsi_fwf(u, "mco", "rsa", "22")
u$drop(c('column_1', 'ZA', 'FILLER6'))$collect()
tictoc::toc()
shape: (141_901, 88)
┌───────────┬────────┬────────────┬────────┬───┬────────┬────────┬──────┬───────┐
│ NOFINESS  ┆ NOVRSA ┆ CLE_RSA    ┆ NOVRSS ┆ … ┆ DP     ┆ DR     ┆ NDAS ┆ NA    │
│ ---       ┆ ---    ┆ ---        ┆ ---    ┆   ┆ ---    ┆ ---    ┆ ---  ┆ ---   │
│ str       ┆ str    ┆ str        ┆ str    ┆   ┆ str    ┆ str    ┆ str  ┆ str   │
╞═══════════╪════════╪════════════╪════════╪═══╪════════╪════════╪══════╪═══════╡
│ 290000017 ┆ 227    ┆ 000xxxxxxx ┆ 121    ┆ … ┆ Z515   ┆ D761   ┆ 0010 ┆ 00002 │
│ 290000017 ┆ 227    ┆ 000xxxxxxx ┆ 121    ┆ … ┆ D352   ┆        ┆ 0113 ┆ 00090 │
│ 290000017 ┆ 227    ┆ 000xxxxxxx ┆ 121    ┆ … ┆ P073   ┆        ┆ 0073 ┆ 00754 │
│ 290000017 ┆ 227    ┆ 000xxxxxxx ┆ 121    ┆ … ┆ G621   ┆        ┆ 0015 ┆ 00020 │
│ …         ┆ …      ┆ …          ┆ …      ┆ … ┆ …      ┆ …      ┆ …    ┆ …     │
│ 290000017 ┆ 227    ┆ 000xxxxxxx ┆ 121    ┆ … ┆ C12    ┆        ┆ 0008 ┆ 00028 │
│ 290000017 ┆ 227    ┆ 000xxxxxxx ┆ 121    ┆ … ┆ Z4588  ┆        ┆ 0000 ┆ 00002 │
│ 290000017 ┆ 227    ┆ 000xxxxxxx ┆ 121    ┆ … ┆ Z511   ┆ C349   ┆ 0002 ┆ 00000 │
│ 290000017 ┆ 227    ┆ 000xxxxxxx ┆ 121    ┆ … ┆ C100   ┆        ┆ 0027 ┆ 00025 │
└───────────┴────────┴────────────┴────────┴───┴────────┴────────┴──────┴───────┘
> tictoc::toc()
0.257 sec elapsed

This is becoming attractive compared with the current pmeasyr and with vroom/readr.
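
For checking the output of parse_pmsi_fwf on a handful of lines, a plain base R reference can be handy. The sketch below mirrors the same idea (build every column in one pass over the format table), but it is not the pmeasyr implementation and does not use polars; `parse_fwf_reference`, the toy `fmt` table and the sample `lines` are all hypothetical names and data.

```r
# Base R reference parser: one lapply() pass builds every column.
# `fmt` is assumed to have the columns nom, position (1-based) and longueur.
parse_fwf_reference <- function(lines, fmt) {
  cols <- lapply(seq_len(nrow(fmt)), function(k) {
    start <- fmt$position[k]
    # NA length: take everything up to the end of each line.
    end <- if (is.na(fmt$longueur[k])) nchar(lines) else start + fmt$longueur[k] - 1
    substr(lines, start, end)
  })
  names(cols) <- fmt$nom
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Made-up layout and lines, for illustration only.
fmt <- data.frame(
  nom      = c("NOFINESS", "NOVRSA", "ZA"),
  position = c(1, 10, 13),
  longueur = c(9, 3, NA),
  stringsAsFactors = FALSE
)
lines <- c("290000017227AAAA", "290000018228BB")

ref <- parse_fwf_reference(lines, fmt)
print(ref)
```

Comparing a few rows of `ref` with the polars result (after converting it to a data.frame) would catch off-by-one errors in the offsets.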

@GuillaumePressiat
Owner Author

GuillaumePressiat commented Dec 2, 2023

Test on 2.1 M RSA: 3.9 seconds to read the fixed part.

tictoc::tic()
u <- pl$read_csv('~/Documents/data/mco/290000018.2022.12.rsa', has_header=FALSE, skip_rows=0)$lazy()
u <- parse_pmsi_fwf(u, "mco", "rsa", "22")
u$drop(c('column_1', 'ZA', 'FILLER6'))$collect()
shape: (2_128_515, 88)
┌───────────┬────────┬────────────┬────────┬───┬────────┬────────┬──────┬───────┐
│ NOFINESS  ┆ NOVRSA ┆ CLE_RSA    ┆ NOVRSS ┆ … ┆ DP     ┆ DR     ┆ NDAS ┆ NA    │
│ ---       ┆ ---    ┆ ---        ┆ ---    ┆   ┆ ---    ┆ ---    ┆ ---  ┆ ---   │
│ str       ┆ str    ┆ str        ┆ str    ┆   ┆ str    ┆ str    ┆ str  ┆ str   │
╞═══════════╪════════╪════════════╪════════╪═══╪════════╪════════╪══════╪═══════╡
│ 290000017 ┆ 227    ┆ yyyyyyyyyy ┆ 121    ┆ … ┆ Z515   ┆ D761   ┆ 0010 ┆ 00002 │
│ 290000017 ┆ 227    ┆ yyyyyyyyyy ┆ 121    ┆ … ┆ D352   ┆        ┆ 0113 ┆ 00090 │
│ 290000017 ┆ 227    ┆ yyyyyyyyyy ┆ 121    ┆ … ┆ P073   ┆        ┆ 0073 ┆ 00754 │
│ 290000017 ┆ 227    ┆ yyyyyyyyyy ┆ 121    ┆ … ┆ G621   ┆        ┆ 0015 ┆ 00020 │
│ …         ┆ …      ┆ …          ┆ …      ┆ … ┆ …      ┆ …      ┆ …    ┆ …     │
│ 290000017 ┆ 227    ┆ yyyyyyyyyy ┆ 121    ┆ … ┆ C12    ┆        ┆ 0008 ┆ 00028 │
│ 290000017 ┆ 227    ┆ yyyyyyyyyy ┆ 121    ┆ … ┆ Z4588  ┆        ┆ 0000 ┆ 00002 │
│ 290000017 ┆ 227    ┆ yyyyyyyyyy ┆ 121    ┆ … ┆ Z511   ┆ C349   ┆ 0002 ┆ 00000 │
│ 290000017 ┆ 227    ┆ yyyyyyyyyy ┆ 121    ┆ … ┆ C100   ┆        ┆ 0027 ┆ 00025 │
└───────────┴────────┴────────────┴────────┴───┴────────┴────────┴──────┴───────┘
> tictoc::toc()
3.946 sec elapsed
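
A quick way to see how the two runs compare is to work out the throughput from the figures reported above. This is only arithmetic on two single timings (different files, one run each), so it is an indication rather than a benchmark.

```r
# Row counts and elapsed times as reported in this thread.
runs <- data.frame(
  rows    = c(141901, 2128515),
  elapsed = c(0.257, 3.946)   # seconds, from tictoc
)

runs$rows_per_sec <- runs$rows / runs$elapsed
runs$scale_rows   <- runs$rows / runs$rows[1]        # size relative to the first file
runs$scale_time   <- runs$elapsed / runs$elapsed[1]  # time relative to the first run
print(runs)
```

The second file has 15 times as many rows and takes roughly 15 times as long, so on these two data points the cost looks close to linear in the number of rows, at a little over half a million rows per second.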

@GuillaumePressiat
Owner Author

However, the subsequent conversion to a data.frame / tibble is costly.

v <- u$to_data_frame() |> tibble::as_tibble()

takes as long as the slicing itself.
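
To compare the two costs, it may help to time each stage on its own. The helper below is a small sketch (`time_stage` is a hypothetical name, not part of polars or tictoc), demonstrated on stand-in computations so that it runs without any data. The commented lines show how it might be applied to the calls from this thread, assuming the collected DataFrame is first stored in a variable so that the conversion is timed separately from the slicing.

```r
# Time one expression, report it in a tictoc-like format, and return its value.
time_stage <- function(label, expr) {
  # `expr` is evaluated lazily, inside system.time(), so only this stage is timed.
  elapsed <- system.time(result <- expr)[["elapsed"]]
  message(sprintf("%s: %.3f sec elapsed", label, elapsed))
  invisible(result)
}

# Stand-in stages (no polars needed to run this part).
x <- time_stage("stage 1 (stand-in for the slicing)", sort(runif(1e5)))
y <- time_stage("stage 2 (stand-in for the conversion)", as.data.frame(x))

# Possible use with the calls from this thread (not run here):
# collected <- time_stage("slice + collect",
#                         u$drop(c('column_1', 'ZA', 'FILLER6'))$collect())
# v <- time_stage("convert",
#                 collected$to_data_frame() |> tibble::as_tibble())
```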
