Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialize decorator data at the end #1531

Merged

Conversation

yasonk
Copy link
Contributor

@yasonk yasonk commented Oct 13, 2024

Describe your changes

Moving all decorator data to the end in order to enable truncating it later. See issue #1489

Contains a minor refactor to make writing decorator info more direct by reducing interconnected layers of indirection.
Also extracted some functions from the Serializer code to make it easer follow.

Checklist before requesting a review

  • Repo forked and branch created from next according to naming convention.
  • Commit messages and codestyle follow conventions.
  • Relevant issues are linked in the PR description.
  • Tests added for new functionality. (No new tests)
  • Documentation/comments updated according to changes. (No Doc changes)

) -> Self {
let variant = EncodedDecoratorVariant::from(decorator);
let decorator_data_offset =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making things a bit more direct. To calculate decorator offset, this used to make a call to an argument passed into this function using two other arguments passed into this function.

*MAGIC, magic
)));
}
read_and_validate_magic(source)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a single line of code for each item being read, improves readability of this method

Copy link
Contributor

@PhilippGackstatter PhilippGackstatter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Looks good to me in general.

I think I have two comments:

  • How do we actually support dropping the decorator data (see inline comment)? And if this is non-trivial, do we perhaps need to add a public utility function fn mast_forest_drop_decorators(serialized_mast_forest: &mut Vec<u8>) to do this? (An honest question - I'm not sure).
  • Can you add one or more tests to ensure that we can serialize a MastForest, drop its decorator data and deserialize it successfully and that all its decorator data is gone? It seems to me that the serialization tests are somewhat rare, so adding more seems like a good idea.

Comment on lines 93 to 97
// decorator & node counts
target.write_usize(self.decorators.len());
// node counts
Copy link
Contributor

@PhilippGackstatter PhilippGackstatter Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure in which context we intend to actually drop the decorator data from the serialized representation, but I'm wondering: How do we actually do that? We only have a Vec<u8> so wouldn't we need the length of the decorator data at some fixed offset to read it and then be able to drop the appropriate amount of bytes from the end? Since we can't retroactively write at an offset at the beginning of the section, perhaps we could add this offset at the very end of the layout.

<MAGIC>
<VERSION>
...
<DECORATOR_DATA_OFFSET_FROM_BACK: u64>

This way we could read the u64 (or whatever makes sense, it just needs a fixed size) from the back of the Vec<u8>. Then we Vec::truncate the appropriate amount.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PhilippGackstatter Not sure what the best mechanism would be, but your proposal seems reasonable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure in which context we intend to actually drop the decorator data from the serialized representation, but I'm wondering: How do we actually do that?

I think the general idea is that producers, e.g. the compiler, will emit a fully-fleshed out package containing all the things that could be useful for interacting with that package (e.g. debug info, documentation, sources in some caes, etc.), and the MAST contained in the package would have all of the decorators (naturally required to support debugging, etc.). There are contexts where all of that stuff is simply dead weight though, or you want to optimize for binary size, in which case you can strip all of that, resulting in a minimal package that only contains exactly what is needed to execute what it contains. One such potential use case, would be publishing packages on-chain, which would only be feasible if they are as small as possible.

My assumption was that we stored decorators in a separate section, whose offset and size is recorded in the header of the binary format. Decorator references are simply relative offsets to the decorator data stored in the decorator section (i.e. the offset is relative to the base offset of the decorator section). Thus, truncating the decorators conceptually involves splitting the binary at the boundaries of the decorator section, then stitching the parts we're keeping back together, and setting the decorator section size and offset to 0. Decorator references would remain, but would be ignored since there is no decorator section to resolve them with.

Stripping both the decorator data and the decorator references, probably requires reconstructing the MAST from scratch, since I don't think we have any means by which to keep both references and the data separate from everything else. One slightly less drastic approach might be to visit all the decorator references and zero them out, so that compressing the binary format essentially erases any overhead associated with leaving them behind.

I think fundamentally though, the goal is to provide a fast and efficient means of stripping optional bits from the MAST, without rebuilding it from scratch. Optional sections would be kept separate as described above, so that you don't have to really know anything about what is stored in them to remove them.

That's my take anyway!

Copy link
Contributor

@PhilippGackstatter PhilippGackstatter Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation on where this is useful!

Stripping both the decorator data and the decorator references, probably requires reconstructing the MAST from scratch, since I don't think we have any means by which to keep both references and the data separate from everything else.

Maybe I'm missing something but I think that is possible. Currently the before-enter and after-exit decorator data is already completely separate. It's a Vec<(MastNodeId, Vec<DecoratorId>)> where for each mast node id we add the vector of decorator ids. If that is missing, we simply wouldn't add the decorators to the corresponding mast nodes.
For BasicBlocks we currently store the DecoratorList alongside the basic block data, but since it is just referencing ops in the block and decorators, (it's a Vec<(usize, DecoratorId)>) we could move it to the decorator section and turn it into a Vec<(MastNodeId, Vec<(usize, DecoratorId)>)>, so that it would work conceptually the same as the other decorators for deserialization: If that list is missing, we simply don't add it to the basic block.

I think that would be the optimal layout as it would allow us to remove all decrator data. Taking your comments into account also, the layout would then be:

//! The serialization format of MastForest is as follows:
//!
//! (Metadata)
//! - MAGIC
//! - VERSION
//!
//! (sections metadata)
//! - nodes length (`usize`)
//! - decorator data section offset (`usize`)
//! - decorators length (`usize`)
//!
//! (procedure roots section)
//! - procedure roots (`Vec<MastNodeId>`)
//!
//! (basic block data section)
//! - basic block data (previously called "node data", but I think it is more precise now)
//!
//! (node info section)
//! - MAST node infos (`Vec<MastNodeInfo>`)
//!
//! (decorator data section)
//! - Decorator data
//! - String table
//!
//! (decorator info section)
//! - decorator infos (`Vec<DecoratorInfo>`)
//!
//! (basic block decorator lists section)
//! - basic block decorator lists (`Vec<(MastNodeId, Vec<(usize, DecoratorId)>)>`)
//! 
//! (before enter and after exit decorators section)
//! - before enter decorators (`Vec<(MastNodeId, Vec<DecoratorId>)>`)
//! - after exit decorators (`Vec<(MastNodeId, Vec<DecoratorId>)>`)

@yasonk So for this PR we would have to change:

  • Add decorator data section offset to the header. Contrary to what I said before this is actually possible because we serialize the various data into Vec<u8>s first and then write those to the target, so we can inspect it for its length and add it to the header before writing those data.
  • Serialize DecoratorList of BasicBlocks similarly to before-enter/after-exit decorators.

I think we also should add a MastForest::truncate_decorators(serialized_mast_forest: &mut Vec<u8>) -> Result<(), MastForestError> function which validates the MAGIC and VERSION and attempts to read the sections metadata. If it is all valid it drops the decorator data from the end and updates the decorator data section offset and decorators length.

Finally, we should add tests as mentioned before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PhilippGackstatter Do you think it may be helpful to split this work into 2 PRs?

  1. Place all decorator data at the end. (Good catch with the decorator list inside basic blocks)
  2. Truncate decorators.

For (1), I would leave the current tests that make sure the data is serialized and deserialized correctly. Also include a fix for extracting decorator lists that are inside node_data.

For (2), we would implement the truncate_decorators along with adding state/logic needed to make that happen. E.g. updating decorator data section offset and decorators_length.
This would obviously require new tests to check that the decorators are properly removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it may be helpful to split this work into 2 PRs?

Whatever you prefer, I think both works. The truncate_decorators function shouldn't be a very large addition. Adding some tests for it and making sure that deserilization still works fine would be more code to add I would guess. Doing 2 PRs is also fine. After (1) the code should also be in a consistent state overall, that's the important part.

Btw re-reading the layout again I proposed, the basic block decorator lists should be a vector - I edited my comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PhilippGackstatter I decided to first get the decorator data grouped. Will create the truncate_decorators function and tests in the next PR.

I think I addressed all items, you suggested. Please check if I missed anything.

@@ -198,6 +196,10 @@ impl MastForest {
Some(id_remappings)
}

pub fn set_decorators(&mut self, node_id: MastNodeId, decorator_list: DecoratorList) {
Copy link
Contributor Author

@yasonk yasonk Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created this to keep the code consistent. But this method should probably be replaced by

mast_forest.get_node_by_id(node_id).set_decorators(decorator_list)

Same for the set_before_enter and set_after_exit methods

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can replace it with an explicit match where we use this method, see here: #1531 (comment)

Copy link
Contributor

@PhilippGackstatter PhilippGackstatter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Functionality-wise it seems to work as intended, which is great.

I think we can make use of some existing methods rather than introduce new ones like MastForest::set_decorators for less public methods and more safety (see inline comments). Other comments are mostly nits and optimizations.

I think you may also need to rebase your commits and sign them (if you haven't done this before, this may help: https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits). At least the merge window at the bottom is complaining about unsigned commits.

let node_data = basic_block_data_builder.finalize();
let string_table = string_table_builder.into_table();
node_data.write_into(target);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
node_data.write_into(target);
basic_block_data.write_into(target);

Nit: We should align the name with the serialization format description at the beginning of the file.

let roots: Vec<u32> = Deserializable::read_from(source)?;
let decorator_data: Vec<u8> = Deserializable::read_from(source)?;

// Reading nodes
let node_data: Vec<u8> = Deserializable::read_from(source)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let node_data: Vec<u8> = Deserializable::read_from(source)?;
let basic_block_data: Vec<u8> = Deserializable::read_from(source)?;

Nit: Same here.

@@ -232,35 +229,105 @@ impl Deserializable for MastForest {
mast_forest
};

let basic_block_decorators: Vec<(usize, DecoratorList)> = read_block_decorators(source)?;
for (node_id, decorator_list) in basic_block_decorators {
let node_id = MastNodeId::try_from((node_id, &mast_forest))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Instead of the TryFrom with a tuple, I would add a from_usize_safe method analogous to from_u32_safe. This seems more in line with the existing API and I don't think there's a benefit to implementing the trait.

let basic_block_decorators: Vec<(usize, DecoratorList)> = read_block_decorators(source)?;
for (node_id, decorator_list) in basic_block_decorators {
let node_id = MastNodeId::try_from((node_id, &mast_forest))?;
mast_forest.set_decorators(node_id, decorator_list);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mast_forest.set_decorators(node_id, decorator_list);
match &mut mast_forest[node_id] {
MastNode::Block(basic_block) => {
basic_block.init_decorators(decorator_list);
},
other => {
return Err(DeserializationError::InvalidValue(format!(
"expected mast node with id {node_id} to be a basic block, found {other:?}"
)))
},
}

I would prefer doing something like this rather than implement set_decorators on MastForest which really only supports basic blocks. If we have invalid serialized bytes and the node_id would point to a Join node for example, then set_decorators would silently ignore that this is an error. So we should explicitly handle it so deserialization fails if we encounter something like that.

Comment on lines 303 to 307
fn read_decorator_infos<R: ByteReader>(
source: &mut R,
decorator_count: usize,
) -> Result<Vec<DecoratorInfo>, DeserializationError> {
let mut decorator_infos: Vec<DecoratorInfo> = Vec::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we wouldn't need to read this to a Vec first. Where this is called, we could iterate over an iterator that reads the next DecoratorInfo from source and returns it. But I think this requires a custom iterator which is probably overkill.

As an alternative optimization, let's use Vec::with_capacity(decorator_count) here so we only allocate once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to implement the iterator since it isn't that much more code. Also implemented a iterator for reading MastNodeInfos

source: &mut R,
node_count: usize,
) -> Result<Vec<MastNodeInfo>, DeserializationError> {
let mut mast_node_infos: Vec<MastNodeInfo> = Vec::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut mast_node_infos: Vec<MastNodeInfo> = Vec::new();
let mut mast_node_infos: Vec<MastNodeInfo> = Vec::with_capacity(node_count);

let mut inner_vec: Vec<(usize, DecoratorId)> = Vec::with_capacity(decorator_vec_len);
for _ in 0..decorator_vec_len {
let op_id: usize = source.read()?;
let decorator_id: DecoratorId = source.read()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do this analogous to read_before_after_decorators and use let decorator_id = DecoratorId::from_u32_safe(source.read()?, mast_forest)?; so we have the additional safety check that the DecoratorId we're reading points to a valid decorator and is not out of bounds.
That way we don't need impl Deserializable for DecoratorId as well.


/// Used to initialize decorators for the [`BasicBlockNode`]. Replaces the existing decorators
/// with the given ['DecoratorList'].
pub fn init_decorators(&mut self, decorator_list: DecoratorList) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn init_decorators(&mut self, decorator_list: DecoratorList) {
pub fn set_decorators(&mut self, decorator_list: DecoratorList) {

Nit: set instead of init seems more idiomatic.

Comment on lines 239 to 241
/// Sets the list of decorators for the [`MastNode`]. Currently, is only supported for the
/// [`Block`] variant.
pub fn set_decorators(&mut self, decorator_list: DecoratorList) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, I would remove this in favor of a match at the call-site so we can avoid adding this public method and we can handle errors in a better way.

Comment on lines 15 to 21
//! (basic block data section)
//! - basic block data
//!
//! (node info structs)
//! - MAST node infos (`Vec<MastNodeInfo>`)
//!
//! (raw decorator data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Maybe we can call all the sections "section", e.g. "node info section", etc.. This makes it clearer that the offsets in sections metadata point to those sections.

E.g.:

//! The serialization format of MastForest is as follows:
//!
//! (Metadata)
//! - MAGIC
//! - VERSION
//!
//! (sections metadata)
//! - nodes length (`usize`)
//! - decorator data section offset (`usize`)
//! - decorators length (`usize`)
//!
//! (procedure roots section)
//! - procedure roots (`Vec<MastNodeId>`)
//!
//! (basic block data section)
//! - basic block data
//!
//! (node info section)
//! - MAST node infos (`Vec<MastNodeInfo>`)
//!
//! (decorator data section)
//! - Decorator data
//! - String table
//!
//! (decorator info section)
//! - decorator infos (`Vec<DecoratorInfo>`)
//!
//! (basic block decorator lists section)
//! - basic block decorator lists (`Vec<(MastNodeId, Vec<(usize, DecoratorId)>)>`)
//! 
//! (before enter and after exit decorators section)
//! - before enter decorators (`Vec<(MastNodeId, Vec<DecoratorId>)>`)
//! - after exit decorators (`Vec<(MastNodeId, Vec<DecoratorId>)>`)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PhilippGackstatter makes sense. I will remove decorator data section offset (usize) per request from @plafer since it will be implemented later.

Is there an GitHub issue for truncating decorators already?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh with #1531 (comment) I initially assumed the decorator data section was implemented already but the comment was not updated - my bad.

I don't believe there's an issue for it, so yes let's just create an issue and maybe update the comment to:

//! - decorator data section offset (`usize`) (not implemented, see issue #X)

//! - decorators length (`usize`)
//! - decorator data section offset (`usize`) (NOT YET IMPLEMENTED)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's remember to update this before we merge

@yasonk
Copy link
Contributor Author

yasonk commented Nov 20, 2024

Rebased and started signing commits. But I don't think I can sign older commits.

Copy link
Contributor

@PhilippGackstatter PhilippGackstatter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks for addressing the comments!

A few nits about avoiding allocations and removing public functionality (Deserializable for DecoratorId) from the public interface that we don't need for now.

You can retroactively sign your commits and then force push. Here's one way: https://stackoverflow.com/questions/41882919/is-there-a-way-to-gpg-sign-all-previous-commits. Without signed commits GitHub won't let us merge this PR.

core/src/mast/mod.rs Outdated Show resolved Hide resolved
core/src/mast/mod.rs Outdated Show resolved Hide resolved
core/src/mast/serialization/mod.rs Outdated Show resolved Hide resolved
@yasonk yasonk force-pushed the serialize-decorator-data-at-the-end branch from 7848ee8 to 2099676 Compare November 21, 2024 03:24
@yasonk
Copy link
Contributor Author

yasonk commented Nov 21, 2024

@PhilippGackstatter I addressed the last 4 comments. Thank you for the great suggestions in all cases!

@PhilippGackstatter
Copy link
Contributor

@yasonk Thanks for the updates and signing commits!

I think just two more things to make CI happy:

  • Replace uses of std::iter::from_fn with core::iter::from_fn and import Vec from alloc::vec::Vec instead of std.
  • Add an entry in CHANGELOG.md

You can try locally if the no-std build works without issues using make build-no-std.

@plafer Should we go ahead and merge after that or do you want to take another look?

@bobbinth
Copy link
Contributor

@plafer Should we go ahead and merge after that or do you want to take another look?

Let me do a quick review. I'll try to do this on Friday.

@yasonk yasonk force-pushed the serialize-decorator-data-at-the-end branch from c7e0f29 to d5a3eb8 Compare November 21, 2024 20:59
CHANGELOG.md Show resolved Hide resolved
@yasonk yasonk force-pushed the serialize-decorator-data-at-the-end branch from bd21506 to d74ea4a Compare November 22, 2024 21:03
Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you for the great work!

@bobbinth bobbinth merged commit 0868ae8 into 0xPolygonMiden:next Nov 23, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants