-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-42143: [R] Sanitize R metadata #41969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-42143: [R] Sanitize R metadata #41969
Conversation
jonkeane
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this!
r/R/metadata.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's not strictly necessary, but would asserting that this is ARROW be a bit more obvious?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually about how base::serialize() works, signifying that it is ASCII:
The format consists of a single line followed by the data: the first line contains a single character: X for binary serialization and A for ASCII serialization, followed by a new line.
https://stat.ethz.ch/R-manual/R-devel/library/base/html/serialize.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AAAAAH Maybe a comment X for binary serialization and A for ASCII serialization there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the comment on the line above not enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe it is. Though I did read it when reviewing and thought we were testing that the string started with ARROW so it wasn't when I was reading it last night. Not a huge deal either way, I think if someone needs to know this, they would poke at it more
r/R/metadata.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as.raw(c(31, 139)) is magic, indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gzip's magic number is 1f 8b, that's it in integers. However, I removed this check while debugging the test failure in the backwards-compat tests. In any case, I'd expect memDecompress() to error if it's not valid gzip, so it's probably not needed.
r/R/metadata.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| stop("Serialized data contains a promise object") | |
| stop("Invalid serialized data: Serialized data contains a promise object") |
Up for other suggestions, but it would be good to make it clear that Serialized data containing a promise is problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently it doesn't matter because this error gets swallowed in https://github.com/apache/arrow/pull/41969/files#diff-659e9fa6b66e5a72b4e3f9ac79ffddf08f92d9ea3d7aa45bd8c73b9a022fa2e5R52 and in the end the user sees an opaque "Invalid metadata$r" warning. This is a holdover from how we're currently doing the deserialization, any errors are just trapped if it fails to deserialize and we return NULL with that warning. Happy to revisit that though if there's interest.
r/R/metadata.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on_save is to make if safe_r_metadata is saving metadata or loading metadata, yeah? It would be nice to either have doc strings (even just as a comment) explaining that, or maybe saving = FALSE is slightly more transparent to me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, can do. The meaning is described in a comment lower in the code but I can clarify up top too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 1057b78
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing this!
r/R/metadata.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # By capturing the data in a list, we can inspect it for promises without | |
| # triggering their evaluation. | |
| # By capturing the data in a list, we can minimize the possibility | |
| # that R internals will evaluate any promises present before it | |
| # can be inspected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is more accurate--I can obj <- deserialize(charToRaw()) the data in https://github.com/apache/arrow/pull/41969/files#diff-0386351ec2a20934987de3d32d4aee6fc609fbfbe3af3bf287a66941e8d563a7R121-R141 and the promise doesn't evaluate; it only evaluates if I touch obj. (This is on R 4.3.)
|
Are there any additional changes we need to make here before merging? |
|
Needs a NEWS bullet, but other than that I'm not aware of anything. |
…is safe in older R
1057b78 to
e64b85f
Compare
|
@github-actions crossbow submit test-r-arrow-backwards-compatibility |
|
Revision: e64b85f Submitted crossbow builds: ursacomputing/crossbow @ actions-e588e9e142
|
|
|
|
After merging your PR, Conbench analyzed the 8 benchmarking runs that have been run so far on merge-commit 801de2f. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
arrowuses Rserialize()/unserialize()to store additional metadata in the Arrow schema. This PR adds some extra checking and sanitizing in order to make the reading of this metadata robust to data of unknown provenance.What changes are included in this PR?
options(arrow.unsafe_metadata = TRUE), the full metadata including disallowed types is returned, also with a warning. This option is an escape hatch in case we are too strict with dropping types when reading files produced by older versions of the package that did not filter them out.unserialize()is called in a way that prevents promises contained in the data from being automatically invoked. This technique works on all versions of R: it is not dependent on the patch for RDS reading that was included in 4.4.Are these changes tested?
Yes
Are there any user-facing changes?
For most, no. But:
This PR contains a "Critical Fix".
Without this patch, it is possible to construct an Arrow or Parquet file that would contain code that would execute when the R metadata is applied when converting to a data.frame. If you are using an older version of the package and are reading data from a source you do not trust, you can read into a
Tableand use its internal$to_data_frame()method, likeread_parquet(..., as_data_frame = FALSE)$to_data_frame(). This should skip the reading of the R metadata.