Separate dictionary references to enable dictionary usage for any combination of window size and content size #4017

Open
ekinnear opened this issue Apr 5, 2024 · 1 comment

ekinnear commented Apr 5, 2024

(Couldn't find an existing issue covering this, but happy to redirect conversation elsewhere.)

My understanding, based on @Cyan4973's comment in httpwg/http-extensions#2754, is that Zstd stops being able to use a dictionary once we're more than a single window size "away" from the beginning of the content. In other words, the dictionary is effectively "prepended" to the content for the purposes of being referenceable.

It seems like this entangles the ability to support backreferences within some window size with the ability to use a dictionary at all whenever the window size is smaller than the content itself.

In a browser context (and on mobile devices), we cannot use very large window sizes due to memory constraints.

Today, my understanding is that this means we get no benefit from the dictionary beyond the first window's worth of bytes in the content.

Would it be possible to have a separate mechanism for referring to dictionary content that wasn't dependent on the window size? We're essentially already committing to keeping the dictionary in memory, and we need a way to limit the maximum size of those dictionaries, so it seems quite unfortunate not to be able to benefit from having spent those resources to improve [de]compression.
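
To make the coupling concrete, here is a minimal sketch using the public zstd C API (the wrapper name `compress_with_dict` and the 4 MB window are my own illustrative choices, not a recommendation): the same `ZSTD_c_windowLog` parameter that bounds decoder memory also bounds how far into the input a loaded dictionary remains referenceable.

```c
#include <zstd.h>   /* zstd advanced API, v1.4+ */

/* Compress `src` against a shared dictionary, with decoder memory
 * capped via the window size. */
size_t compress_with_dict(void *dst, size_t dstCap,
                          const void *src, size_t srcSize,
                          const void *dict, size_t dictSize)
{
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    /* 2^22 = 4 MB window: an assumed browser-like memory cap. Past the
     * first ~4 MB of src, matches into `dict` are no longer possible. */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 22);
    ZSTD_CCtx_loadDictionary(cctx, dict, dictSize);
    size_t ret = ZSTD_compress2(cctx, dst, dstCap, src, srcSize);
    ZSTD_freeCCtx(cctx);
    return ret;   /* compressed size, or an error (check ZSTD_isError) */
}
```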

Cyan4973 self-assigned this Apr 5, 2024

Cyan4973 commented Apr 5, 2024

Yes, this is correct.
In Zstandard, a Dictionary only remains valid until the Window Size is reached.

A word of context: dictionary compression was originally designed to serve Small Data compression. For this use case, the "window" size is not a constraint.

But when the dictionary functionality is (ab)used as a delta engine, targeting very large contents and references, the Window Size can become a concern, because keeping the dictionary referenceable typically means the window must be as large as the content to decompress, which could be a lot.

> Would it be possible to have a separate mechanism for referring to dictionary content that wasn't dependent on the window size?

In theory, this is possible.
We could imagine a "trailing dictionary", which remains valid after crossing the Window Size limit.
It's not too difficult to do, but it's not "zero" work either; particular attention must be paid to validation and safety.

The real price to pay for this new capability is incompatibility with existing decoders, and that's a pretty steep one.
There is a "reserved bit" in the frame header that could be employed for this purpose, making it clear to existing decoders that they cannot decode the incoming content. That's a safe way to avoid corruption on the receiver side, but it can still generate confusion if some zstd frames are no longer decodable by existing decoders.

Now that we have an idea of the cost (and confusion is a pretty high cost, paid at ecosystem scale), what would be the benefits?

Let's get back to the example stated in httpwg/http-extensions#2754:
there is a big binary of about ~50 MB that must be updated, and the updated version is also likely ~50 MB.

Because:

> We're essentially already committing to keeping the dictionary in memory,

the memory cost starts at ~50 MB, just for the reference.
Then, because zstd requires the content to be within the Window Size, we need another ~50 MB for the decompressed content. That's about 100 MB, which is a hefty budget.

Now, let's imagine we implement a "trailing dictionary" as described above. We no longer need the entire decompressed content in memory, just the window size (typically between 2 and 8 MB; let's take 4 MB for the sake of argument). Thanks to these savings, we now "only" need ~54 MB.
That's indeed better than 100 MB. But is it really such a big win?
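
Spelled out, the two budgets being compared (all figures from this example):

$$
\underbrace{50}_{\text{reference}} + \underbrace{50}_{\text{window}\ \approx\ \text{content}} \approx 100\ \text{MB}
\qquad\text{vs.}\qquad
\underbrace{50}_{\text{reference}} + \underbrace{4}_{\text{window}} \approx 54\ \text{MB}
$$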

It seems to me that, as long as the mechanism requires by construction the entire reference in memory, we are already paying a pretty big memory budget. Everything else is then within a factor of 2, which is certainly not nothing, but not transformative either.
And the price to pay for this benefit is ecosystem confusion (deliberately ignoring here the development/validation costs and the added security surface). It's really not clear to me that this is a good trade-off.

The situation would be wildly different if the memory saving were more significant, say >10x smaller.
But, as we have just seen, such an objective also requires reducing the memory budget allocated to the reference.

I could definitely imagine an algorithm that offers this property.
The crux of the problem is that we are "abusing" the dictionary mechanism to make it a delta engine. It works, but with the memory-consumption caveat mentioned above: dictionary content can be referenced anywhere during the decompression process, so all of it must remain available.
In contrast, a dedicated delta engine could take advantage of sequential access into the reference, which is the most essential access pattern. Because such a design restricts the choice of segments in the reference, it would not improve compression, and might even make it a bit worse, though likely not by much. The real benefit is a dramatically lower memory budget: the reference is just "streamed" once, which can be done with a few small buffers.
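
To illustrate, here is a minimal sketch of what sequential access buys. The opcode-based delta format below is entirely hypothetical (not anything zstd ships); the point is only that, because copies move forward through the reference, peak memory is one small fixed buffer, regardless of the size of the reference or the output.

```c
/* Hypothetical streaming delta applier, NOT a zstd API.
 * Assumed delta format: a sequence of ('C'|'S'|'I') <u32 little-endian len>,
 * where 'C' copies len bytes from the reference, 'S' skips len reference
 * bytes, and 'I' inserts len literal bytes stored in the delta itself.
 * Copies and skips only move forward, so the reference is streamed once. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static void die(const char *msg) { fprintf(stderr, "%s\n", msg); exit(1); }

static uint32_t read_u32(FILE *f) {
    uint8_t b[4];
    if (fread(b, 1, 4, f) != 4) die("truncated delta");
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

static void apply_delta(FILE *reference, FILE *delta, FILE *out) {
    static uint8_t buf[1 << 16];          /* 64 KB: the whole working memory */
    int op;
    while ((op = fgetc(delta)) != EOF) {
        if (op != 'C' && op != 'S' && op != 'I') die("unknown opcode");
        uint32_t len = read_u32(delta);
        while (len > 0) {
            size_t chunk = len < sizeof buf ? len : sizeof buf;
            FILE *src = (op == 'I') ? delta : reference;  /* forward only */
            if (fread(buf, 1, chunk, src) != chunk) die("bad delta");
            if (op != 'S') fwrite(buf, 1, chunk, out);    /* skip = discard */
            len -= (uint32_t)chunk;
        }
    }
}

int main(int argc, char **argv) {
    if (argc != 4) die("usage: applydelta <reference> <delta> <new-file>");
    FILE *ref = fopen(argv[1], "rb"), *dlt = fopen(argv[2], "rb"),
         *out = fopen(argv[3], "wb");
    if (!ref || !dlt || !out) die("cannot open files");
    apply_delta(ref, dlt, out);
    return 0;
}
```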

That memory difference is probably large enough to deserve some attention.
However, it's also a different mechanism. It's no longer just using an existing dictionary compression algorithm "as is": it requires some kind of preprocessing step, and the resulting components can then be passed to a proper compression engine. That's definitely more work and validation, and unfortunately a longer delay to delivery. I can see the tension here.
That being said, what's the timeline here? If this capability is meant to serve the next ~20+ years of the Internet, it's probably worth a design discussion.
