add serializer and parser for excel #973

r2evans · 2025-01-23T17:53:38Z

r2evans · 2025-01-23T18:42:12Z

@schloerke looking for simple guidance/preferences, there are two questions below.

The default behavior of readxl::read_xlsx is to read only the first worksheet, and the documented way to read in multiple is to manually loop around the indices or names. I suggest it is easy enough to detect sheet=NA (normally not accepted by readxl::read_xlsx) and internally loop over the discovered names.

I suggest we can support the following annotations:

### all of the following result in the same thing
#* @parser xlsx
#* @parser xlsx list(sheet=NULL)
#* @parser xlsx list(sheet=1)

### extension
#* @parser xlsx list(sheet=NA)

The latter call is intercepted by the parser. It'll wrap internally, returning a named list of frames (names are the worksheet names).

As it stands, internally I'm planning

if (anyNA(sheet)) sheet <- readxl::excel_names(tmpfile)
lapply(sheet, function(sht) readxl::read_xlsx(path = tmpfile, sheet = sht, ...))

This is simple enough and it works.

However, without guardrails, the user can use

#* @parser xlsx list(sheet=c(1,2,5))

which will still iterate and fail if there are fewer than 5 sheets. Clearly I can guard against this failure as well. While I have no personal use-case where this would be useful, I also don't feel it is necessary to force against this utility, as it just adds code, and the error messages tend to be clear/unambiguous.

Question 1: do you care enough about this that we should actively prohibit this behavior, or do we allow the user to put themselves in this position? Said differently, do you feel strongly enough that we need to enforce sheet= be length-1? (I think we should not enforce it.)

Now the simple point of what to return ... options:

always return a list of frames (optionally named), even if list(sheet=NULL) or similar;
return a simple frame if sheet= is length-1 and not NA, otherwise a list of frames

Question 2: Do you prefer "always list of frames"? I personally feel that while option 2 does present a polymorphic parser (of sorts) -- discouraged by some people -- the return of anything other than a simple frame requires explicit action by the user, so it should be unambiguous.

schloerke · 2025-01-23T19:18:08Z

If a file has a single sheet... should it return a data.frame() or a list() of a single data.frame()? This fuzzy area makes me want to have two parsers... excel and excel_sheet? excel_sheets and excel? excel_list and excel_df? One for a list of data frames (one per sheet) and the second for a single data frame (from the first sheet). It is much easier to document and maintain consistency with the two methods.

This being the case, I'm up for a single method that always returns a list of data frames as an excel file always has sheets that contain data. (excel?)

(Is it possible to use readxl::read_excel()? I'm hoping the file name is being posted for the temp file. Then we're not tied to a single file type of .xls vs .xlsx)

How should we safeguard the sheet= parameter?

Is it possible to wrap the reading of the sheet in a tryCatch and throw an error with context? Something like:

# untested

ret <- lapply(sheet, function(sht) {
  tryCatch({
    readxl::read_excel(path = tmpfile, sheet = sht, ...)
  }, error = function(e) {
    res$status <- 422L
    rlang::abort(
      paste0("Error reading sheet: ", sht, "),
      parent = e
    )
  }
})

This way plumber isn't making any assumptions on what sheet value can be. Only that we can iterate of them.

I don't think we should set NULL for a sheet's value if it is failing to parse. Instead, the user should receive a parsing error with more context if any of the sheets fail to parse. (Probably with a proper parsing error code of 422? **Added in example above)

thomasp85 · 2025-01-23T19:25:14Z

I think it is paramount to have type stability in the parsers so I support the idea of always returning a list even in the case of the xlsx file only containing a single sheet

r2evans · 2025-01-23T20:00:51Z

I prefer the notion of "always a list of frames" (agree with @thomasp85 on "type-stability", that was the term I needed instead of polymorph). I prefer a single parser overloading the sheet= argument as demonstrated above.

Having said that, to address your points @schloerke :

I'm good with read_excel instead of read_xlsx, I must have been thinking of times when reading binary .xls files was less stable.
I'm good with @parser excel instead of @parser xlsx.
If you're not comfortable with intercepting list(sheet=NA), then I suggest excel_sheet/excel_sheets:
- excel_sheet will enforce length-1 sheet= if passed, and will always return a single frame
- excel_sheets will allow `length(sheet) > 0, and will always return a list of one or more frames
The current error on reading a sheet seems fairly reasonable. If I have a 2-worksheet file, then lapply(1:3, read_excel, path=tmpfile) errors with:
```
lapply(1:3, function(sht) readxl::read_excel(tmpfile, sheet=sht, ...))
# Error: Can't retrieve sheet in position 3, only 2 sheet(s) found.
```
If you prefer, I can wrap it and append something, but I don't know what else we can add that adds context other than perhaps the value of sheet= as passed to the parser. (I don't think that's worth the squeeze, but can add it if it's a blocker.)

Are you good with @serializer excel instead of adding the _sheet(s)? For clarity, writexl::write_excel is much saner here: if you pass it a single frame, the file has one worksheet; if you pass it a list of n-frames, it returns a file with n-worksheets, so that behavior is straight-forward (even if inconsistent between readxl/writexl).

r2evans · 2025-01-23T20:09:59Z

To add some "inconsistency", however ... it's writexl::write_xlsx, and it only supports .xlsx not .xls. For consistency, though, we'll still us @serializer excel despite the lack of multi-format writer.

r2evans · 2025-01-23T20:42:44Z

btw, the tests in files/source_values.R look for a variable named count, which does technically exist when dplyr is loaded in the search path; I suggest that test (at least) have its name changed to something else.

schloerke assigned r2evans Jan 23, 2025

r2evans changed the title ~~add serializer and parser for xlsx~~ add serializer and parser for excel Jan 23, 2025

r2evans pushed a commit to r2evans/plumber that referenced this issue Jan 23, 2025

close rstudio#973, add excel serializer/parser

538cb6f

schloerke mentioned this issue Jan 27, 2025

add excel serializer/parser #975

Merged

3 tasks

schloerke closed this as completed in #975 Jan 29, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add serializer and parser for excel #973

add serializer and parser for excel #973

r2evans commented Jan 23, 2025

r2evans commented Jan 23, 2025

schloerke commented Jan 23, 2025

thomasp85 commented Jan 23, 2025

r2evans commented Jan 23, 2025

r2evans commented Jan 23, 2025

r2evans commented Jan 23, 2025

add serializer and parser for excel #973

add serializer and parser for excel #973

Comments

r2evans commented Jan 23, 2025

r2evans commented Jan 23, 2025

schloerke commented Jan 23, 2025

thomasp85 commented Jan 23, 2025

r2evans commented Jan 23, 2025

r2evans commented Jan 23, 2025

r2evans commented Jan 23, 2025