Skip to content

Conversation

@nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented Jun 5, 2024

Rationale for this change

arrow uses R serialize()/unserialize() to store additional metadata in the Arrow schema. This PR adds some extra checking and sanitizing in order to make the reading of this metadata robust to data of unknown provenance.

What changes are included in this PR?

  • When writing metadata, we strip out all but simple types: strings, numbers, boolean, lists, etc. Objects of other types, such as environments, external pointers, and other language types, are removed.
  • When reading metadata, the same filter is applied. If there are types that are not in the allowlist, one of two things happen. By default, they are removed with a warning. If you set options(arrow.unsafe_metadata = TRUE), the full metadata including disallowed types is returned, also with a warning. This option is an escape hatch in case we are too strict with dropping types when reading files produced by older versions of the package that did not filter them out.
  • unserialize() is called in a way that prevents promises contained in the data from being automatically invoked. This technique works on all versions of R: it is not dependent on the patch for RDS reading that was included in 4.4.
  • Other sanity checking to be stricter about only reading back in something of the form we wrote out: assert that the data is ASCII-serialized, and if it is compressed, it is gzip, the same way we do on serialization. It's not clear that it's necessary, but it's not bad to be extra strict here.

Are these changes tested?

Yes

Are there any user-facing changes?

For most, no. But:

This PR contains a "Critical Fix".

Without this patch, it is possible to construct an Arrow or Parquet file that would contain code that would execute when the R metadata is applied when converting to a data.frame. If you are using an older version of the package and are reading data from a source you do not trust, you can read into a Table and use its internal $to_data_frame() method, like read_parquet(..., as_data_frame = FALSE)$to_data_frame(). This should skip the reading of the R metadata.

Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this!

r/R/metadata.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's not strictly necessary, but would asserting that this is ARROW be a bit more obvious?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually about how base::serialize() works, signifying that it is ASCII:

The format consists of a single line followed by the data: the first line contains a single character: X for binary serialization and A for ASCII serialization, followed by a new line.

https://stat.ethz.ch/R-manual/R-devel/library/base/html/serialize.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AAAAAH Maybe a comment X for binary serialization and A for ASCII serialization there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the comment on the line above not enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe it is. Though I did read it when reviewing and thought we were testing that the string started with ARROW so it wasn't when I was reading it last night. Not a huge deal either way, I think if someone needs to know this, they would poke at it more

r/R/metadata.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as.raw(c(31, 139)) is magic, indeed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gzip's magic number is 1f 8b, that's it in integers. However, I removed this check while debugging the test failure in the backwards-compat tests. In any case, I'd expect memDecompress() to error if it's not valid gzip, so it's probably not needed.

r/R/metadata.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
stop("Serialized data contains a promise object")
stop("Invalid serialized data: Serialized data contains a promise object")

Up for other suggestions, but it would be good to make it clear that Serialized data containing a promise is problematic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it doesn't matter because this error gets swallowed in https://github.com/apache/arrow/pull/41969/files#diff-659e9fa6b66e5a72b4e3f9ac79ffddf08f92d9ea3d7aa45bd8c73b9a022fa2e5R52 and in the end the user sees an opaque "Invalid metadata$r" warning. This is a holdover from how we're currently doing the deserialization, any errors are just trapped if it fails to deserialize and we return NULL with that warning. Happy to revisit that though if there's interest.

r/R/metadata.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on_save is to make if safe_r_metadata is saving metadata or loading metadata, yeah? It would be nice to either have doc strings (even just as a comment) explaining that, or maybe saving = FALSE is slightly more transparent to me?

Copy link
Member Author

@nealrichardson nealrichardson Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, can do. The meaning is described in a comment lower in the code but I can clarify up top too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1057b78

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting review Awaiting review awaiting changes Awaiting changes awaiting change review Awaiting change review labels Jun 5, 2024
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this!

r/R/metadata.R Outdated
Comment on lines +79 to +80
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# By capturing the data in a list, we can inspect it for promises without
# triggering their evaluation.
# By capturing the data in a list, we can minimize the possibility
# that R internals will evaluate any promises present before it
# can be inspected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is more accurate--I can obj <- deserialize(charToRaw()) the data in https://github.com/apache/arrow/pull/41969/files#diff-0386351ec2a20934987de3d32d4aee6fc609fbfbe3af3bf287a66941e8d563a7R121-R141 and the promise doesn't evaluate; it only evaluates if I touch obj. (This is on R 4.3.)

@github-actions github-actions bot added awaiting merge Awaiting merge awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting merge Awaiting merge labels Jun 5, 2024
@paleolimbot
Copy link
Member

Are there any additional changes we need to make here before merging?

@nealrichardson
Copy link
Member Author

Needs a NEWS bullet, but other than that I'm not aware of anything.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 13, 2024
@nealrichardson
Copy link
Member Author

@github-actions crossbow submit test-r-arrow-backwards-compatibility

@github-actions
Copy link

Revision: e64b85f

Submitted crossbow builds: ursacomputing/crossbow @ actions-e588e9e142

Task Status
test-r-arrow-backwards-compatibility GitHub Actions

@nealrichardson nealrichardson changed the title [R] Sanitize R metadata GH-42143: [R] Sanitize R metadata Jun 13, 2024
@apache apache deleted a comment from github-actions bot Jun 13, 2024
@github-actions
Copy link

⚠️ GitHub issue #42143 has been automatically assigned in GitHub to PR creator.

@nealrichardson nealrichardson merged commit 801de2f into apache:main Jun 14, 2024
@nealrichardson nealrichardson deleted the safe-unserialize-r branch June 14, 2024 20:09
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 8 benchmarking runs that have been run so far on merge-commit 801de2f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants