Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace message Tree with a topologically sorted varint delimited stream of Directory messages #229

Closed
EdSchouten opened this issue Aug 22, 2022 · 1 comment · Fixed by #230

Comments

@EdSchouten
Copy link
Collaborator

EdSchouten commented Aug 22, 2022

I know that on numerous occasions we've had discussions within the working group on whether the Tree vs. Directory dichotomy is truly necessary, and whether there are ways to eliminate it. Let me make it clear up front that this issue is filed under the assumption that we do want to keep these separate like we have today, which is why I'm not posting this under #159 or #170. It's just low hanging fruit.

The last couple of days I've worked on optimising handling of large Tree objects in bb-clientd. Whereas some of the other Protobuf messages can sometimes become hundreds of kilobytes in size (e.g., Command), it's not impossible for Trees to become dozens of megabytes. While working on optimising this area, I realised that my life would have been a lot easier if Trees were stored as a varint delimited stream of Directory messages (just like Bazel --build_event_binary_file). My suggestion would be to make it topologically sorted (parents before children), so that the root of the Tree comes first. Some advantages:

  1. Because of the topological sorting, you can instantiate the contents of a Tree on a native file system in a streaming manner. Memory consumption of such an extraction tool would only be proportional to the number of referenced child directories that have not yet been observed.
  2. Similarly, you can do one-shot path lookups (for paths without "..") by doing a forward scan against the stream, only unmarshaling directory objects that match the length and hash yielded by the parent.
  3. Right now you see that consumers of Tree first unmarshal the message, and then marshal each of the child directories again, just so that their hash can be computed. This implicitly assumes that the producer and consumer use the same strategy for marshaling. By using a varint delimited stream, consumers have direct access to directories in marshaled form, meaning that both this assumption and the redundant marshaling are eliminated.
  4. In bb-clientd I need to be able to have fast random access to Tree objects. To facilitate this, I'm basically storing the Tree object in the client-side CAS just once, but creating n+1 entries in the index. One for the Tree as a whole, and one for each of the child directories, pointing to byte ranges in the Tree's bytestream that correspond to the individual directory messages. That way these directory messages become directly addressable. This happens to be possible today, only because Protobuf messages are encoded identically regardless of the depth at which they are placed. The format I'm proposing not only allows me to eliminate this assumption, it means I can also compute the n additional entries in a streaming manner.
@EdSchouten
Copy link
Collaborator Author

EdSchouten commented Sep 29, 2022

It looks like I've been able to solve (3) and (4) on the Buildbarn side without making any protocol changes, as it turns out that Protobuf's wire format has some interesting properties:

  • Nested messages use the same encoding as string and bytes fields. This means that nested messages are prefixed with their own length.
  • There is no difference in how singular and repeated fields are encoded. It's just that repeated fields may occur in the output multiple times.

So if you look at an REv2 Tree's binary representation, it's typically little more than this:

  • Root directory:
    • Tag of field number 1 of and type 2 (bytes): (1 << 3) | 2 == 0x0a
    • Varint encoded length of Directory message
    • Directory message
  • Child directory 1:
    • Tag of field number 2 of and type 2 (bytes): (2 << 3) | 2 == 0x12
    • Varint encoded length of Directory message
    • Directory message
  • Child directory 2:
    • Tag of field number 2 of and type 2 (bytes): (2 << 3) | 2 == 0x12
    • Varint encoded length of Directory message
    • Directory message
  • ...

Basically the same as what I proposed above, except for the presence of field tags and a lack of topological sorting.

Anyway, I've successfully managed to patch up Buildbarn to deal with Tree objects in a streaming manner as follows:

With these changes in place, I'm capable of efficiently working with Tree objects that are hundreds of megabytes in size.

I'm still going to leave this issue open, as issues (1) and (2) remain. Furthermore, I'm not too happy about maintaining my own Protobuf message parser. My solution would break as soon as someone adds a simple integer field to the Tree message, as my parser is only capable of processing 'bytes' fields.

EdSchouten added a commit to EdSchouten/remote-apis that referenced this issue Sep 30, 2022
I'm currently trying to improve the performance of handling of large
output directories (Tree messages), having sizes in the order of
hundreds of megabytes. In the process, I have realised that there is a
lot of value in enforcing that the Directory messages contained in them
are topologically sorted. Two practical use cases:

- When instantiating the contents of a Tree on a local file system,
  having the Tree be topologically sorted allows you to immediately
  create files and directories in the right place.

- When needing to resolve the properties of a single file by path, a
  topologically sorted Tree permits resolution by doing a simple forward
  scan.

Especially when other features like compression are taken into account,
it's useful if Tree messages can be processed in a streaming manner.

One practical issue is that most Protobuf libraries don't offer APIs for
processing messages in a streaming manner. This means that implementors
who want to achive these optimisations will need to write their own
message parsers; at least for the Tree tree itself. To make this as
painless as possible, we also require that the Tree is stored in some
normal form.

Fixes: bazelbuild#229
EdSchouten added a commit to EdSchouten/remote-apis that referenced this issue Sep 30, 2022
I'm currently trying to improve the performance of handling of large
output directories (Tree messages), having sizes in the order of
hundreds of megabytes. In the process, I have realised that there is a
lot of value in enforcing that the Directory messages contained in them
are topologically sorted. Two practical use cases:

- When instantiating the contents of a Tree on a local file system,
  having the Tree be topologically sorted allows you to immediately
  create files and directories in the right place.

- When needing to resolve the properties of a single file by path, a
  topologically sorted Tree permits resolution by doing a simple forward
  scan.

Especially when other features like compression are taken into account,
it's useful if Tree messages can be processed in a streaming manner.

One practical issue is that most Protobuf libraries don't offer APIs for
processing messages in a streaming manner. This means that implementors
who want to achive these optimisations will need to write their own
message parsers; at least for the Tree tree itself. To make this as
painless as possible, we also require that the Tree is stored in some
normal form.

Fixes: bazelbuild#229
EdSchouten added a commit to EdSchouten/remote-apis that referenced this issue Sep 30, 2022
I'm currently trying to improve the performance of handling of large
output directories (Tree messages), having sizes in the order of
hundreds of megabytes. In the process, I have realised that there is a
lot of value in enforcing that the Directory messages contained in them
are topologically sorted. Two practical use cases:

- When instantiating the contents of a Tree on a local file system,
  having the Tree be topologically sorted allows you to immediately
  create files and directories in the right place.

- When needing to resolve the properties of a single file by path, a
  topologically sorted Tree permits resolution by doing a simple forward
  scan.

Especially when other features like compression are taken into account,
it's useful if Tree messages can be processed in a streaming manner.

One practical issue is that most Protobuf libraries don't offer APIs for
processing messages in a streaming manner. This means that implementors
who want to achive these optimisations will need to write their own
message parsers; at least for the Tree tree itself. To make this as
painless as possible, we also require that the Tree is stored in some
normal form.

Fixes: bazelbuild#229
@EdSchouten EdSchouten changed the title REv3 idea: Replace message Tree with a topologically sorted varint delimited stream of Directory messages Replace message Tree with a topologically sorted varint delimited stream of Directory messages Sep 30, 2022
EdSchouten added a commit to EdSchouten/remote-apis that referenced this issue Oct 1, 2022
I'm currently trying to improve the performance of handling of large
output directories (Tree messages), having sizes in the order of
hundreds of megabytes. In the process, I have realised that there is a
lot of value in enforcing that the Directory messages contained in them
are topologically sorted. Two practical use cases:

- When instantiating the contents of a Tree on a local file system,
  having the Tree be topologically sorted allows you to immediately
  create files and directories in the right place.

- When needing to resolve the properties of a single file by path, a
  topologically sorted Tree permits resolution by doing a simple forward
  scan.

Especially when other features like compression are taken into account,
it's useful if Tree messages can be processed in a streaming manner.

One practical issue is that most Protobuf libraries don't offer APIs for
processing messages in a streaming manner. This means that implementors
who want to achive these optimisations will need to write their own
message parsers; at least for the Tree tree itself. To make this as
painless as possible, we also require that the Tree is stored in some
normal form.

Fixes: bazelbuild#229
EdSchouten added a commit to EdSchouten/remote-apis that referenced this issue Oct 2, 2022
I'm currently trying to improve the performance of handling of large
output directories (Tree messages), having sizes in the order of
hundreds of megabytes. In the process, I have realised that there is a
lot of value in enforcing that the Directory messages contained in them
are topologically sorted. Two practical use cases:

- When instantiating the contents of a Tree on a local file system,
  having the Tree be topologically sorted allows you to immediately
  create files and directories in the right place.

- When needing to resolve the properties of a single file by path, a
  topologically sorted Tree permits resolution by doing a simple forward
  scan.

Especially when other features like compression are taken into account,
it's useful if Tree messages can be processed in a streaming manner.

One practical issue is that most Protobuf libraries don't offer APIs for
processing messages in a streaming manner. This means that implementors
who want to achive these optimisations will need to write their own
message parsers; at least for the Tree tree itself. To make this as
painless as possible, we also require that the Tree is stored in some
normal form.

Fixes: bazelbuild#229
EdSchouten added a commit to EdSchouten/remote-apis that referenced this issue Oct 2, 2022
I'm currently trying to improve the performance of handling of large
output directories (Tree messages), having sizes in the order of
hundreds of megabytes. In the process, I have realised that there is a
lot of value in enforcing that the Directory messages contained in them
are topologically sorted. Two practical use cases:

- When instantiating the contents of a Tree on a local file system,
  having the Tree be topologically sorted allows you to immediately
  create files and directories in the right place.

- When needing to resolve the properties of a single file by path, a
  topologically sorted Tree permits resolution by doing a simple forward
  scan.

Especially when other features like compression are taken into account,
it's useful if Tree messages can be processed in a streaming manner.

One practical issue is that most Protobuf libraries don't offer APIs for
processing messages in a streaming manner. This means that implementors
who want to achive these optimisations will need to write their own
message parsers; at least for the Tree tree itself. To make this as
painless as possible, we also require that the Tree is stored in some
normal form.

Fixes: bazelbuild#229
EdSchouten added a commit to EdSchouten/remote-apis that referenced this issue Oct 11, 2022
I'm currently trying to improve the performance of handling of large
output directories (Tree messages), having sizes in the order of
hundreds of megabytes. In the process, I have realised that there is a
lot of value in enforcing that the Directory messages contained in them
are topologically sorted. Two practical use cases:

- When instantiating the contents of a Tree on a local file system,
  having the Tree be topologically sorted allows you to immediately
  create files and directories in the right place.

- When needing to resolve the properties of a single file by path, a
  topologically sorted Tree permits resolution by doing a simple forward
  scan.

Especially when other features like compression are taken into account,
it's useful if Tree messages can be processed in a streaming manner.

One practical issue is that most Protobuf libraries don't offer APIs for
processing messages in a streaming manner. This means that implementors
who want to achive these optimisations will need to write their own
message parsers; at least for the Tree tree itself. To make this as
painless as possible, we also require that the Tree is stored in some
normal form.

Fixes: bazelbuild#229
EdSchouten added a commit to EdSchouten/remote-apis that referenced this issue Oct 13, 2022
I'm currently trying to improve the performance of handling of large
output directories (Tree messages), having sizes in the order of
hundreds of megabytes. In the process, I have realised that there is a
lot of value in enforcing that the Directory messages contained in them
are topologically sorted. Two practical use cases:

- When instantiating the contents of a Tree on a local file system,
  having the Tree be topologically sorted allows you to immediately
  create files and directories in the right place.

- When needing to resolve the properties of a single file by path, a
  topologically sorted Tree permits resolution by doing a simple forward
  scan.

Especially when other features like compression are taken into account,
it's useful if Tree messages can be processed in a streaming manner.

One practical issue is that most Protobuf libraries don't offer APIs for
processing messages in a streaming manner. This means that implementors
who want to achive these optimisations will need to write their own
message parsers; at least for the Tree tree itself. To make this as
painless as possible, we also require that the Tree is stored in some
normal form.

Fixes: bazelbuild#229
EdSchouten added a commit to EdSchouten/bazel that referenced this issue Oct 13, 2022
remote-apis PR 230 added a way where producers of Tree messages can
indicate that the directories contained within are stored in topological
order. The advantage of using such an ordering is that it permits
instantiation of such objects onto a local file system in a streaming
fashion. The same holds for lookups of individual paths.

Even though Bazel currently does not gain from this, this change at
least modifies Bazel's REv2 client to emit topologically sorted trees.
This makes it possible for tools such as Buildbarn's bb-browser to
process them more efficiently.

More details:
- bazelbuild/remote-apis#229
- bazelbuild/remote-apis#230
bergsieker pushed a commit that referenced this issue Oct 21, 2022
* Regenerate the Go source code for the Remote Execution protocol

* Add a hint for indicating that a Tree is topologically sorted

I'm currently trying to improve the performance of handling of large
output directories (Tree messages), having sizes in the order of
hundreds of megabytes. In the process, I have realised that there is a
lot of value in enforcing that the Directory messages contained in them
are topologically sorted. Two practical use cases:

- When instantiating the contents of a Tree on a local file system,
  having the Tree be topologically sorted allows you to immediately
  create files and directories in the right place.

- When needing to resolve the properties of a single file by path, a
  topologically sorted Tree permits resolution by doing a simple forward
  scan.

Especially when other features like compression are taken into account,
it's useful if Tree messages can be processed in a streaming manner.

One practical issue is that most Protobuf libraries don't offer APIs for
processing messages in a streaming manner. This means that implementors
who want to achive these optimisations will need to write their own
message parsers; at least for the Tree tree itself. To make this as
painless as possible, we also require that the Tree is stored in some
normal form.

Fixes: #229
EdSchouten added a commit to EdSchouten/bazel that referenced this issue Oct 23, 2022
remote-apis PR 230 added a way where producers of Tree messages can
indicate that the directories contained within are stored in topological
order. The advantage of using such an ordering is that it permits
instantiation of such objects onto a local file system in a streaming
fashion. The same holds for lookups of individual paths.

Even though Bazel currently does not gain from this, this change at
least modifies Bazel's REv2 client to emit topologically sorted trees.
This makes it possible for tools such as Buildbarn's bb-browser to
process them more efficiently.

More details:
- bazelbuild/remote-apis#229
- bazelbuild/remote-apis#230
EdSchouten added a commit to EdSchouten/bazel that referenced this issue Oct 27, 2022
remote-apis PR 230 added a way where producers of Tree messages can
indicate that the directories contained within are stored in topological
order. The advantage of using such an ordering is that it permits
instantiation of such objects onto a local file system in a streaming
fashion. The same holds for lookups of individual paths.

Even though Bazel currently does not gain from this, this change at
least modifies Bazel's REv2 client to emit topologically sorted trees.
This makes it possible for tools such as Buildbarn's bb-browser to
process them more efficiently.

More details:
- bazelbuild/remote-apis#229
- bazelbuild/remote-apis#230
copybara-service bot pushed a commit to bazelbuild/bazel that referenced this issue Oct 27, 2022
remote-apis PR 230 added a way where producers of Tree messages can
indicate that the directories contained within are stored in topological
order. The advantage of using such an ordering is that it permits
instantiation of such objects onto a local file system in a streaming
fashion. The same holds for lookups of individual paths.

Even though Bazel currently does not gain from this, this change at
least modifies Bazel's REv2 client to emit topologically sorted trees.
This makes it possible for tools such as Buildbarn's bb-browser to
process them more efficiently.

More details:
- bazelbuild/remote-apis#229
- bazelbuild/remote-apis#230

Partial commit for third_party/*, see #16463.

Signed-off-by: Sunil Gowroji <[email protected]>
copybara-service bot pushed a commit to bazelbuild/bazel that referenced this issue Nov 9, 2022
remote-apis PR 230 added a way where producers of Tree messages can    indicate that the directories contained within are stored in topological    order. The advantage of using such an ordering is that it permits    instantiation of such objects onto a local file system in a streaming    fashion. The same holds for lookups of individual paths.

Even though Bazel currently does not gain from this, this change at    least modifies Bazel's REv2 client to emit topologically sorted trees.    This makes it possible for tools such as Buildbarn's bb-browser to    process them more efficiently.

More details:
- bazelbuild/remote-apis#229
- bazelbuild/remote-apis#230

Closes #16463.

PiperOrigin-RevId: 487196375
Change-Id: Iafcfd617fc101fec7bfa943552113ce57ab8041b
ShreeM01 pushed a commit to bazelbuild/bazel that referenced this issue Dec 1, 2022
remote-apis PR 230 added a way where producers of Tree messages can    indicate that the directories contained within are stored in topological    order. The advantage of using such an ordering is that it permits    instantiation of such objects onto a local file system in a streaming    fashion. The same holds for lookups of individual paths.

Even though Bazel currently does not gain from this, this change at    least modifies Bazel's REv2 client to emit topologically sorted trees.    This makes it possible for tools such as Buildbarn's bb-browser to    process them more efficiently.

More details:
- bazelbuild/remote-apis#229
- bazelbuild/remote-apis#230

Closes #16463.

PiperOrigin-RevId: 487196375
Change-Id: Iafcfd617fc101fec7bfa943552113ce57ab8041b
ShreeM01 pushed a commit to bazelbuild/bazel that referenced this issue Dec 1, 2022
remote-apis PR 230 added a way where producers of Tree messages can
indicate that the directories contained within are stored in topological
order. The advantage of using such an ordering is that it permits
instantiation of such objects onto a local file system in a streaming
fashion. The same holds for lookups of individual paths.

Even though Bazel currently does not gain from this, this change at
least modifies Bazel's REv2 client to emit topologically sorted trees.
This makes it possible for tools such as Buildbarn's bb-browser to
process them more efficiently.

More details:
- bazelbuild/remote-apis#229
- bazelbuild/remote-apis#230

Partial commit for third_party/*, see #16463.

Signed-off-by: Sunil Gowroji <[email protected]>
meteorcloudy pushed a commit to bazelbuild/bazel that referenced this issue Dec 2, 2022
* Emit Tree objects in topological order

remote-apis PR 230 added a way where producers of Tree messages can    indicate that the directories contained within are stored in topological    order. The advantage of using such an ordering is that it permits    instantiation of such objects onto a local file system in a streaming    fashion. The same holds for lookups of individual paths.

Even though Bazel currently does not gain from this, this change at    least modifies Bazel's REv2 client to emit topologically sorted trees.    This makes it possible for tools such as Buildbarn's bb-browser to    process them more efficiently.

More details:
- bazelbuild/remote-apis#229
- bazelbuild/remote-apis#230

Closes #16463.

PiperOrigin-RevId: 487196375
Change-Id: Iafcfd617fc101fec7bfa943552113ce57ab8041b

* Emit Tree objects in topological order

remote-apis PR 230 added a way where producers of Tree messages can
indicate that the directories contained within are stored in topological
order. The advantage of using such an ordering is that it permits
instantiation of such objects onto a local file system in a streaming
fashion. The same holds for lookups of individual paths.

Even though Bazel currently does not gain from this, this change at
least modifies Bazel's REv2 client to emit topologically sorted trees.
This makes it possible for tools such as Buildbarn's bb-browser to
process them more efficiently.

More details:
- bazelbuild/remote-apis#229
- bazelbuild/remote-apis#230

Partial commit for third_party/*, see #16463.

Signed-off-by: Sunil Gowroji <[email protected]>

Signed-off-by: Sunil Gowroji <[email protected]>
Co-authored-by: Ed Schouten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant