GH-42143: [R] Sanitize R metadata #41969

nealrichardson · 2024-06-05T00:09:24Z

Rationale for this change

arrow uses R serialize()/unserialize() to store additional metadata in the Arrow schema. This PR adds some extra checking and sanitizing in order to make the reading of this metadata robust to data of unknown provenance.

What changes are included in this PR?

When writing metadata, we strip out all but simple types: strings, numbers, boolean, lists, etc. Objects of other types, such as environments, external pointers, and other language types, are removed.
When reading metadata, the same filter is applied. If there are types that are not in the allowlist, one of two things happens. By default, they are removed with a warning. If you set options(arrow.unsafe_metadata = TRUE), the full metadata including disallowed types is returned, also with a warning. This option is an escape hatch in case we are too strict with dropping types when reading files produced by older versions of the package that did not filter them out.
unserialize() is called in a way that prevents promises contained in the data from being automatically invoked. This technique works on all versions of R: it is not dependent on the patch for RDS reading that was included in 4.4.
Other sanity checks make us stricter about only reading back something of the form we wrote out: we assert that the data is ASCII-serialized and, if it is compressed, that it is gzip, matching what we do on serialization. It's not clear that this is necessary, but it doesn't hurt to be extra strict here.
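To make the allowlist idea concrete, here is a minimal sketch in base R. It is not the code in this PR: `allowed_types` and `keep_simple_types()` are hypothetical names, the type list is an assumption, and the sketch ignores attributes and does not emit the warning described above.

```r
# Hypothetical helper, NOT the arrow implementation: recursively keep only
# values whose typeof() is on a small allowlist (the list itself is assumed).
allowed_types <- c("character", "double", "integer", "logical", "list", "NULL")

keep_simple_types <- function(x) {
  if (!(typeof(x) %in% allowed_types)) {
    # Environments, closures, external pointers, language objects, etc.
    return(NULL)
  }
  if (is.list(x)) {
    # Decide which elements survive, then filter the survivors recursively.
    # Note: lapply() keeps names but drops other attributes such as class.
    keep <- vapply(x, function(el) typeof(el) %in% allowed_types, logical(1))
    x <- lapply(x[keep], keep_simple_types)
  }
  x
}

meta <- list(
  a = "text",
  b = 1:3,
  env = new.env(),
  nested = list(ok = TRUE, fn = function() 1)
)
clean <- keep_simple_types(meta)
names(clean)         # the environment is gone
names(clean$nested)  # the closure is gone
```

Filtering with the same function on both the write and the read path is what lets the reader reject anything the writer would never have produced.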

Are these changes tested?

Yes

Are there any user-facing changes?

For most, no. But:

This PR contains a "Critical Fix".

Without this patch, it is possible to construct an Arrow or Parquet file that would contain code that would execute when the R metadata is applied when converting to a data.frame. If you are using an older version of the package and are reading data from a source you do not trust, you can read into a Table and use its internal $to_data_frame() method, like read_parquet(..., as_data_frame = FALSE)$to_data_frame(). This should skip the reading of the R metadata.

GitHub Issue: [R] Sanitize R metadata #42143

jonkeane

Thanks for this!

jonkeane · 2024-06-05T02:55:13Z

r/R/metadata.R

I know it's not strictly necessary, but would asserting that this is ARROW be a bit more obvious?

This is actually about how base::serialize() works, signifying that it is ASCII:

The format consists of a single line followed by the data: the first line contains a single character: X for binary serialization and A for ASCII serialization, followed by a new line.

https://stat.ethz.ch/R-manual/R-devel/library/base/html/serialize.html

AAAAAH. Maybe a comment like "X for binary serialization and A for ASCII serialization" there?

Is the comment on the line above not enough?

Yeah, maybe it is. Though I did read it when reviewing and still thought we were testing that the string started with ARROW, so it wasn't enough when I was reading it last night. Not a huge deal either way; I think if someone needs to know this, they would poke at it more.
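For readers following this thread, the header byte being discussed can be seen with base R alone. The sketch below only demonstrates the documented behavior of base::serialize(); `is_ascii_serialized()` is a hypothetical helper for illustration, not a function from the PR.

```r
# ASCII serialization starts with "A\n"; the default binary form with "X\n".
payload_ascii <- serialize(list(a = 1), connection = NULL, ascii = TRUE)
payload_binary <- serialize(list(a = 1), connection = NULL)

rawToChar(payload_ascii[1:2])
rawToChar(payload_binary[1:2])

# Hypothetical check: does a raw vector look like ASCII-serialized R data?
is_ascii_serialized <- function(x) {
  is.raw(x) && length(x) >= 2 && identical(x[1:2], charToRaw("A\n"))
}

is_ascii_serialized(payload_ascii)
is_ascii_serialized(payload_binary)
```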

jonkeane · 2024-06-05T02:57:34Z

r/R/metadata.R

as.raw(c(31, 139)) is magic, indeed.

gzip's magic number is 1f 8b; 31 and 139 are those two bytes written as decimal integers. However, I removed this check while debugging the test failure in the backwards-compat tests. In any case, I'd expect memDecompress() to error if it's not valid gzip, so it's probably not needed.
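A quick way to convince yourself of the decimal/hex equivalence is sketched below. `has_gzip_magic()` is a hypothetical helper, not the check that was removed from the PR, and this sketch makes no claim about which header bytes any particular R compression function actually emits.

```r
# 31 and 139 in decimal are 0x1f and 0x8b in hex.
gzip_magic <- as.raw(c(31, 139))
identical(gzip_magic, as.raw(c(0x1f, 0x8b)))
as.character(gzip_magic)

# Hypothetical check on the first two bytes of a raw vector.
has_gzip_magic <- function(x) {
  is.raw(x) && length(x) >= 2 && identical(x[1:2], gzip_magic)
}

has_gzip_magic(as.raw(c(0x1f, 0x8b, 0x08)))
has_gzip_magic(charToRaw("A\n"))
```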

jonkeane · 2024-06-05T03:00:27Z

r/R/metadata.R

Suggested change

- stop("Serialized data contains a promise object")
+ stop("Invalid serialized data: Serialized data contains a promise object")

Up for other suggestions, but it would be good to make it clear that Serialized data containing a promise is problematic.

Currently it doesn't matter because this error gets swallowed in https://github.com/apache/arrow/pull/41969/files#diff-659e9fa6b66e5a72b4e3f9ac79ffddf08f92d9ea3d7aa45bd8c73b9a022fa2e5R52 and in the end the user sees an opaque "Invalid metadata$r" warning. This is a holdover from how we're currently doing the deserialization: if it fails to deserialize, any error is trapped and we return NULL with that warning. Happy to revisit that though if there's interest.
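The trap-and-warn pattern described in this reply looks roughly like the sketch below. `read_metadata_or_null()` is a hypothetical wrapper written for illustration, not the function in r/R/metadata.R; it shows why the wording of the inner stop() message never reaches the user.

```r
# Hypothetical wrapper: any deserialization error is replaced by a generic
# warning and a NULL result, so the original error message is discarded.
read_metadata_or_null <- function(raw_bytes) {
  tryCatch(
    unserialize(raw_bytes),
    error = function(e) {
      # conditionMessage(e) is available here but deliberately not surfaced.
      warning("Invalid metadata$r", call. = FALSE)
      NULL
    }
  )
}

good <- serialize(list(a = 1), connection = NULL, ascii = TRUE)
read_metadata_or_null(good)

# Invalid input: returns NULL and raises only the generic warning.
bad <- charToRaw("not serialized")
suppressWarnings(read_metadata_or_null(bad))
```

Surfacing the detail would mean including conditionMessage(e) in the warning, which is the kind of revisit offered above.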

jonkeane · 2024-06-05T03:04:16Z

r/R/metadata.R

on_save is to mark whether safe_r_metadata is saving metadata or loading metadata, yeah? It would be nice to either have doc strings (even just as a comment) explaining that, or maybe saving = FALSE is slightly more transparent to me?

Sure, can do. The meaning is described in a comment lower in the code but I can clarify up top too.

Done in 1057b78

paleolimbot

Thank you for doing this!

paleolimbot · 2024-06-05T14:48:59Z

r/R/metadata.R

Suggested change

-  # By capturing the data in a list, we can inspect it for promises without
-  # triggering their evaluation.
+  # By capturing the data in a list, we can minimize the possibility
+  # that R internals will evaluate any promises present before it
+  # can be inspected.

I'm not sure this is more accurate: I can run obj <- unserialize(charToRaw()) on the data in https://github.com/apache/arrow/pull/41969/files#diff-0386351ec2a20934987de3d32d4aee6fc609fbfbe3af3bf287a66941e8d563a7R121-R141 and the promise doesn't evaluate; it only evaluates if I touch obj. (This is on R 4.3.)
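The "only evaluates if I touch obj" behavior is ordinary promise laziness, which can be illustrated in base R without any serialized payload. The sketch below uses delayedAssign() as a stand-in; it is an assumption that this mirrors the test data linked above, and it does not show how the PR inspects for promises.

```r
# Track whether the promise body has run, without relying on global state.
state <- new.env()
state$evaluated <- FALSE

# Bind `obj` to a promise; the braces are not executed yet.
delayedAssign("obj", {
  state$evaluated <- TRUE
  "payload"
})

before <- state$evaluated  # promise exists but has not been forced
value <- obj               # touching obj forces the promise
after <- state$evaluated

c(before = before, after = after)
```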

paleolimbot · 2024-06-07T18:07:46Z

Are there any additional changes we need to make here before merging?

nealrichardson · 2024-06-07T20:36:42Z

Needs a NEWS bullet, but other than that I'm not aware of anything.

nealrichardson · 2024-06-13T13:20:08Z

@github-actions crossbow submit test-r-arrow-backwards-compatibility

github-actions · 2024-06-13T13:22:26Z

Revision: e64b85f

Submitted crossbow builds: ursacomputing/crossbow @ actions-e588e9e142

Task	Status
test-r-arrow-backwards-compatibility

github-actions · 2024-06-13T13:23:49Z

⚠️ GitHub issue #42143 has been automatically assigned in GitHub to PR creator.

conbench-apache-arrow · 2024-06-16T01:54:12Z

After merging your PR, Conbench analyzed the 8 benchmarking runs that have been run so far on merge-commit 801de2f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

nealrichardson requested review from paleolimbot and thisisnic as code owners June 5, 2024 00:09

github-actions bot added Component: R, awaiting review labels Jun 5, 2024

jonkeane reviewed Jun 5, 2024

github-actions bot added awaiting changes, awaiting change review and removed awaiting review, awaiting changes, awaiting change review labels Jun 5, 2024

paleolimbot approved these changes Jun 5, 2024

github-actions bot added awaiting merge, awaiting changes and removed awaiting changes, awaiting merge labels Jun 5, 2024

nealrichardson added 10 commits June 13, 2024 09:04

Add some protections and options around unserialize()

67ce55e

Just do it in R

b46db7e

Add test for data with promise in it. With workaround, we can make th…

3f0773f

…is safe in older R

moar test

df7075f

Report which types have been dropped

7f0271c

Fix for backwards compat tests

860333b

Make VctrsExtensionType safe too

f1c4cdd

MORE SAFETY

07155c8

More commenting

d69c316

news

e64b85f

nealrichardson force-pushed the safe-unserialize-r branch from 1057b78 to e64b85f Compare June 13, 2024 13:17

github-actions bot added the Component: Documentation label Jun 13, 2024

github-actions bot added awaiting change review and removed awaiting changes labels Jun 13, 2024

nealrichardson mentioned this pull request Jun 13, 2024

[R] Sanitize R metadata #42143

Closed

nealrichardson changed the title ~~[R] Sanitize R metadata~~ GH-42143: [R] Sanitize R metadata Jun 13, 2024

apache deleted a comment from github-actions bot Jun 13, 2024

nealrichardson merged commit 801de2f into apache:main Jun 14, 2024

nealrichardson deleted the safe-unserialize-r branch June 14, 2024 20:09

jonkeane mentioned this pull request Jun 20, 2024

[R]: vctrs custom class errors on roundtrip #42220

Closed

tanho63 mentioned this pull request Aug 19, 2024

[R] write_parquet() has infinite recursion error when writing packageVersion() attributes #43748

Closed

jonkeane mentioned this pull request Jan 26, 2025

GH-45300: [R] Remove data.table from class attribute in metadata #45346

Closed
