Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add serializer and parser for excel #973

Closed
r2evans opened this issue Jan 23, 2025 · 6 comments · Fixed by #975
Closed

add serializer and parser for excel #973

r2evans opened this issue Jan 23, 2025 · 6 comments · Fixed by #975
Assignees

Comments

@r2evans
Copy link
Contributor

r2evans commented Jan 23, 2025

See #959 (comment)

@r2evans
Copy link
Contributor Author

r2evans commented Jan 23, 2025

@schloerke looking for simple guidance/preferences, there are two questions below.

The default behavior of readxl::read_xlsx is to read only the first worksheet, and the documented way to read in multiple is to manually loop around the indices or names. I suggest it is easy enough to detect sheet=NA (normally not accepted by readxl::read_xlsx) and internally loop over the discovered names.

I suggest we can support the following annotations:

### all of the following result in the same thing
#* @parser xlsx
#* @parser xlsx list(sheet=NULL)
#* @parser xlsx list(sheet=1)

### extension
#* @parser xlsx list(sheet=NA)

The latter call is intercepted by the parser. It'll wrap internally, returning a named list of frames (names are the worksheet names).

As it stands, internally I'm planning

if (anyNA(sheet)) sheet <- readxl::excel_names(tmpfile)
lapply(sheet, function(sht) readxl::read_xlsx(path = tmpfile, sheet = sht, ...))

This is simple enough and it works.

However, without guardrails, the user can use

#* @parser xlsx list(sheet=c(1,2,5))

which will still iterate and fail if there are fewer than 5 sheets. Clearly I can guard against this failure as well. While I have no personal use-case where this would be useful, I also don't feel it is necessary to force against this utility, as it just adds code, and the error messages tend to be clear/unambiguous.

Question 1: do you care enough about this that we should actively prohibit this behavior, or do we allow the user to put themselves in this position? Said differently, do you feel strongly enough that we need to enforce sheet= be length-1? (I think we should not enforce it.)

Now the simple point of what to return ... options:

  1. always return a list of frames (optionally named), even if list(sheet=NULL) or similar;
  2. return a simple frame if sheet= is length-1 and not NA, otherwise a list of frames

Question 2: Do you prefer "always list of frames"? I personally feel that while option 2 does present a polymorphic parser (of sorts) -- discouraged by some people -- the return of anything other than a simple frame requires explicit action by the user, so it should be unambiguous.

@schloerke
Copy link
Collaborator

If a file has a single sheet... should it return a data.frame() or a list() of a single data.frame()? This fuzzy area makes me want to have two parsers... excel and excel_sheet? excel_sheets and excel? excel_list and excel_df? One for a list of data frames (one per sheet) and the second for a single data frame (from the first sheet). It is much easier to document and maintain consistency with the two methods.

This being the case, I'm up for a single method that always returns a list of data frames as an excel file always has sheets that contain data. (excel?)

(Is it possible to use readxl::read_excel()? I'm hoping the file name is being posted for the temp file. Then we're not tied to a single file type of .xls vs .xlsx)


How should we safeguard the sheet= parameter?

Is it possible to wrap the reading of the sheet in a tryCatch and throw an error with context? Something like:

# untested

ret <- lapply(sheet, function(sht) {
  tryCatch({
    readxl::read_excel(path = tmpfile, sheet = sht, ...)
  }, error = function(e) {
    res$status <- 422L
    rlang::abort(
      paste0("Error reading sheet: ", sht, "),
      parent = e
    )
  }
})

This way plumber isn't making any assumptions on what sheet value can be. Only that we can iterate of them.


Related:

Sometimes I've seen an attempt to gracefully handle and item's error processing by setting the item to NULL if it fails to parse.

I don't think we should set NULL for a sheet's value if it is failing to parse. Instead, the user should receive a parsing error with more context if any of the sheets fail to parse. (Probably with a proper parsing error code of 422? **Added in example above)

@thomasp85
Copy link
Collaborator

I think it is paramount to have type stability in the parsers so I support the idea of always returning a list even in the case of the xlsx file only containing a single sheet

@r2evans
Copy link
Contributor Author

r2evans commented Jan 23, 2025

I prefer the notion of "always a list of frames" (agree with @thomasp85 on "type-stability", that was the term I needed instead of polymorph). I prefer a single parser overloading the sheet= argument as demonstrated above.

Having said that, to address your points @schloerke :

  • I'm good with read_excel instead of read_xlsx, I must have been thinking of times when reading binary .xls files was less stable.

  • I'm good with @parser excel instead of @parser xlsx.

  • If you're not comfortable with intercepting list(sheet=NA), then I suggest excel_sheet/excel_sheets:

    • excel_sheet will enforce length-1 sheet= if passed, and will always return a single frame
    • excel_sheets will allow `length(sheet) > 0, and will always return a list of one or more frames
  • The current error on reading a sheet seems fairly reasonable. If I have a 2-worksheet file, then lapply(1:3, read_excel, path=tmpfile) errors with:

    lapply(1:3, function(sht) readxl::read_excel(tmpfile, sheet=sht, ...))
    # Error: Can't retrieve sheet in position 3, only 2 sheet(s) found.
    

    If you prefer, I can wrap it and append something, but I don't know what else we can add that adds context other than perhaps the value of sheet= as passed to the parser. (I don't think that's worth the squeeze, but can add it if it's a blocker.)

Are you good with @serializer excel instead of adding the _sheet(s)? For clarity, writexl::write_excel is much saner here: if you pass it a single frame, the file has one worksheet; if you pass it a list of n-frames, it returns a file with n-worksheets, so that behavior is straight-forward (even if inconsistent between readxl/writexl).

@r2evans
Copy link
Contributor Author

r2evans commented Jan 23, 2025

To add some "inconsistency", however ... it's writexl::write_xlsx, and it only supports .xlsx not .xls. For consistency, though, we'll still us @serializer excel despite the lack of multi-format writer.

@r2evans r2evans changed the title add serializer and parser for xlsx add serializer and parser for excel Jan 23, 2025
r2evans pushed a commit to r2evans/plumber that referenced this issue Jan 23, 2025
@r2evans
Copy link
Contributor Author

r2evans commented Jan 23, 2025

btw, the tests in files/source_values.R look for a variable named count, which does technically exist when dplyr is loaded in the search path; I suggest that test (at least) have its name changed to something else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants