-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parse_data: Fix compatibility with R 4.4 #588
Conversation
I don't know the internals well enough to comment further than: looks promising!!! thank you 🙌 |
Your investigation makes sense thanks, do you think you could add a test for the |
get_parse_data extracts lines from the input srcfile object and feeds them to split_on_line_directives, which expects the lines to be a concatenation of all the package R files, separated by #line directives. With how get_parse_data is currently called, that expectation is met. get_parse_data is called only if utils::getParseData returns NULL, and getParseData doesn't return NULL for any of the cases where the input does _not_ have line directives (i.e. entry points other than package_coverage). An upcoming commit is going to move the get_parse_data call in front of the getParseData call, so update split_on_line_directives to detect the "no directives" case. Without this guard, the mapply call in split_on_line_directives would error under an R version before 4.2; with R 4.2 or later, split_on_line_directives returns empty.
split_on_line_directives breaks the input at #line directives and returns a named list of lines for each file. For a package with a single file under R/, there is one directive. The bounds calculation is still correct for that case. However, the return value is incorrectly a matrix rather than a list because the mapply call simplifies the result. At this point, this bug is mostly [*] unexposed because this code path is only triggered if utils::getParseData returns NULL, and it should always return a non-NULL result for the single-file package case. The next commit will reorder things, exposing the bug. Tell mapply to not simplify the result. [*] The simplification to a matrix could also happen for multi-file packages in the unlikely event that all files have the same number of lines.
utils::getParseData has a longstanding bug: for an installed package, parse data is available only for the last file [1]. To work around that, the get_tokens helper first calls getParseData and then falls back to custom logic that extracts the concatenated source lines, splits them on #line directives, and calls getParseData on each file's lines. The getParseData bug was fixed in R 4.4.0 (r84538). Unfortunately that change causes at least two issues (for some subset of packages): a substantial performance regression [2] and an error when applying exclusions [3]. Under R 4.4, getParseData always returns non-NULL as a result of that change when calculating package coverage (in other words, the get_parse_data fallback is _not_ triggered). The slowdown is partially due to the parse data no longer being cached across get_tokens calls. Another relevant aspect, for both the slowdown and the error applying exclusions, is likely that the new getParseData returns data for the entire package rather than the per-file parse data the downstream covr code expects. One solution would be to adapt covr's caching and handling of the getParseData when running under R 4.4.0 or later. Instead go with a simpler and more minimal fix. Reorder the calls so that the get_parse_data call, which we know has been the primary code path for package coverage before R 4.4.0, is the first call tried. Leave getParseData as the fallback to handle the non-package coverage cases. [1] r-lib#154 https://bugs.r-project.org/show_bug.cgi?id=16756 [2] As an extreme case, calling package_coverage on R.utils goes from under 15 minutes to over 6 hours. [3] nanotime (v0.3.10) and diffobj (v0.3.5) are two examples of packages that hit into this error. Closes r-lib#576 Closes r-lib#579 Re: r-lib#567
@jimhester thanks for taking a look!
Sure, I've pushed an update. (The issues are triggered by existing tests, but I agree that a direct/explicit test makes sense.) |
I've tried the patch and it brings the time for running coverage on Luminescence back to 7 minutes, which is roughly what we get on 4.3.3 (as opposed to 80m with the version on CRAN with 4.4.1). 🎉 |
Thanks again for the work and thank you @mcol for reporting on the improved runtime. |
This series resolves two issues with calculating package coverage under R 4.4:
for some packages, there is a substantial performance regression compared to R 4.3 (related to segfault recently in codecov() #567)
some packages hit into an error when applying exclusions (Error in exclude() 'from must be a finite number' for seq() usage #576, Applying exclusions raise error due to missing srcref lines #579)
In addition to the links above, there are cases demonstrating both these issues at https://github.com/kyleam/covr-r44-issues.
Here's a summary of timings (seconds, wall time) to give a sense of how severe the slowdown can be:
("patch" refers to a run with a covr patched with the equivalent of the changes proposed here.)
Both the slowdown and error bisect to R's 990fe3d8a1 (getParseData() - for first-file objects ..., 2023-06-14). And the associated bug report was actually connected to a covr PR (gh-154).
As discussed in the third commit's message, I think covr is getting tripped up by the new getParseData returning parse data for the entire package.
example
Under R 4.4.1,
getParseData
returns the same result fornanaduration
andnanoival
, which are in separate files. The parse data appears to be for all files of the package.I haven't worked out the details of how that leads to the
NA
values that trigger exclusions error.My proposed solution is to reorder the
utils::getParseData
andcovr:::get_parse_data
calls so thatget_parse_data
is used for R 4.4 or greater, just as it is for R 4.3 or below. As a result, theget_tokens
call returns the expected per-file parse data.The
utils::getParseData
call is retained as the fallback. That's needed when not calculating package coverage (e.g., when callingfile_coverage
).Putting
get_parse_data
in front of thegetParseData
call exposes two issues insplit_on_line_directives
. Those are resolved in the first two commits.Closes #576
Closes #579
cc: @dgkf @jimhester @krlmlr @mcol @MichaelChirico