
Add compression for uploaded documentation #780

Merged (15 commits into rust-lang:master) on Jun 11, 2020
Conversation

@jyn514 (Member) commented May 28, 2020

This needs some tests, but otherwise I think it should be fine as is; it was super simple after #643 :)

This uses zstd level 5, but it's very easy to change the compression algorithm now and reasonably easy to change it after it's merged (we'd just have to change compression from a boolean into an enum and continue supporting zstd).

This compresses transparently, so calling backend.get() automatically decompresses files as needed. Files that were not compressed before uploading are not decompressed when downloading.
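For illustration, a minimal sketch of that transparent scheme, assuming the zstd crate; the names (`CompressionAlgorithm`, `compress`, `decompress`) are illustrative, not the PR's actual API:

```rust
use std::io;

/// Illustrative algorithm tag; the PR discusses turning the compression
/// boolean into an enum like this so other algorithms can be added later.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CompressionAlgorithm {
    Zstd,
}

/// Compress a blob before uploading (zstd level 5, as in the initial PR).
fn compress(content: &[u8]) -> io::Result<Vec<u8>> {
    zstd::encode_all(content, 5)
}

/// Transparently decompress on get(): blobs stored without compression
/// are passed through unchanged.
fn decompress(content: Vec<u8>, algorithm: Option<CompressionAlgorithm>) -> io::Result<Vec<u8>> {
    match algorithm {
        Some(CompressionAlgorithm::Zstd) => zstd::decode_all(content.as_slice()),
        None => Ok(content),
    }
}
```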

Closes #379

cc @namibj
r? @pietroalbini

@jyn514 (Member, Author) commented May 28, 2020

Oh, I just realized this has the wrong content types on S3: it uses the original content types instead of something like zstd. I'll add the original content type as a field in the metadata.

@jyn514 (Member, Author) commented May 28, 2020

Actually, that seems like a lot of work for no benefit; I don't know that we'll ever actually use the zstd encoding for anything.

@jyn514 (Member, Author) commented May 28, 2020

Some benchmarks:

```
compress regex html     time:   [676.22 us 725.12 us 779.81 us]
Found 16 outliers among 100 measurements (16.00%)
  16 (16.00%) high severe

decompress regex html   time:   [55.181 us 57.623 us 60.467 us]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
```
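The output above is Criterion output; for context, a sketch of the kind of benchmark that could produce it, assuming the compression is zstd level 5 over a saved rustdoc page for the regex crate (the fixture path is a placeholder, not from the PR):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Benchmark compressing and decompressing a single rustdoc HTML page.
fn compression_benches(c: &mut Criterion) {
    // Hypothetical fixture: a saved rustdoc page from the regex crate.
    let html = std::fs::read("benches/regex.index.html").expect("sample page");
    let compressed = zstd::encode_all(html.as_slice(), 5).unwrap();

    c.bench_function("compress regex html", |b| {
        b.iter(|| zstd::encode_all(black_box(html.as_slice()), 5).unwrap())
    });
    c.bench_function("decompress regex html", |b| {
        b.iter(|| zstd::decode_all(black_box(compressed.as_slice())).unwrap())
    });
}

criterion_group!(benches, compression_benches);
criterion_main!(benches);
```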

(Resolved review threads on src/db/migrate.rs and src/storage/s3.rs)
@pietroalbini (Member):
The implementation looks good to me!

I agree with @Nemo157 that we should store {algorithm} instead of just compressed, and that we should use S3's native Content-Encoding to store it instead of custom metadata.

Also, based on #379 (comment), the two options I was considering were zstd 9 or brotli 5, leaning towards brotli as it uses 10% less disk space in that benchmark. You're using zstd 5 in the PR.
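To make the Content-Encoding idea concrete, a sketch assuming an async rusoto_s3 client; the bucket name and function are placeholders, not the PR's actual code:

```rust
use rusoto_s3::{PutObjectRequest, S3, S3Client};

// Attach the compression algorithm via S3's native Content-Encoding header
// instead of custom metadata.
async fn upload_compressed(
    client: &S3Client,
    key: String,
    body: Vec<u8>,
) -> Result<(), Box<dyn std::error::Error>> {
    client
        .put_object(PutObjectRequest {
            bucket: "docs-rs-example-bucket".to_string(), // placeholder bucket
            key,
            body: Some(body.into()),
            content_encoding: Some("zstd".to_string()), // records the algorithm
            ..Default::default()
        })
        .await?;
    Ok(())
}
```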

@Nemo157 (Member) commented May 28, 2020

The other option that @namibj was investigating was zstd with a custom dictionary optimized for rustdoc output. I can probably adapt the linked benchmarking script to test that as well, with a dictionary created from the docs I have.

@jyn514 (Member, Author) commented May 28, 2020

Oops, the benchmarks before got cut off. Here are some taken with a desktop instead of a laptop:

```
compress regex html     time:   [715.28 us 715.74 us 716.27 us]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

decompress regex html   time:   [37.410 us 37.438 us 37.470 us]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
```

@Nemo157 (Member) commented May 29, 2020

Another idea might be to store a set of all the encodings used for a particular release in the release table, so we can easily look up basic stats on compression usage without needing to go to S3. (A set, to allow different dictionaries based on file type to be recorded.)

@jyn514 (Member, Author) commented May 29, 2020

> store a set of all the encodings used for a particular release in the release table

This sounds like we'd only use it for metrics? I'm not opposed in principle but storing the same data in two places risks it getting out of sync.

@Kixiron (Member) commented May 30, 2020

Metrics don't need to be a source of complexity; we can just as easily have an IntCounter that we increment with the total bytes saved by each build (plus additional counters if we'd like).
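A sketch of that counter, assuming the prometheus crate; the metric name is hypothetical:

```rust
use prometheus::{IntCounter, Registry};

// Register a single running total of bytes saved by compression,
// incremented once per build.
fn register_bytes_saved(registry: &Registry) -> prometheus::Result<IntCounter> {
    let counter = IntCounter::new(
        "docsrs_compression_bytes_saved_total", // hypothetical metric name
        "Total bytes saved by compressing uploaded documentation",
    )?;
    registry.register(Box::new(counter.clone()))?;
    Ok(counter)
}

// After each build, something like:
// bytes_saved.inc_by(uncompressed_len - compressed_len);
```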

@Nemo157 (Member) commented May 31, 2020

It's not really for metrics like what goes in Grafana, more for future visibility into the usage distribution. Especially if we migrate dictionaries in the future, knowing how many releases use which dictionaries could help determine whether it's feasible to do something like migrate all instances of a certain dictionary to a new one so it can be removed from the application.

Given that releases are basically read-only currently, I don't think there's much chance of this data getting out of sync.

@Kixiron (Member) commented May 31, 2020

How would that interact with us deleting docs in the future (if we ever decided to do that)?

@jyn514 (Member, Author) commented May 31, 2020

> How would that interact with us deleting docs in the future (if we ever decided to do that)?

No effect; we'd just delete the rows from the database as well.

@jyn514 (Member, Author) commented May 31, 2020

> Another idea might be to store a set of all the encodings used for a particular release in the release table, so we can easily look up basic stats on compression usage without needing to go to S3. (A set, to allow different dictionaries based on file type to be recorded.)

@Nemo157 a possible issue with this is that it prevents using different compression algorithms for different files within a release. Does that sound worth it? Otherwise we'd have to store metadata for each file individually in the database, which sort of defeats the point of an S3 bucket.

@Nemo157 (Member) commented May 31, 2020

That's why I mention a set of encodings: we don't need to know exactly which files use which encoding, just that this set of encodings is enough to access all files in the release (and if we want to know in detail, we would have to query the metadata on all the files).

@jyn514 (Member, Author) commented Jun 5, 2020

I added the set of compression algorithms used as a many-to-many table in the database, with a unique constraint so we don't end up with duplicates by accident.

I'm open to suggestions for more tests, but otherwise I think I addressed all the comments.
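A minimal sketch of what that table could look like; the column types and the referenced releases table are assumptions, though the table name and unique constraint match the error quoted later in the thread:

```rust
// Could live in a migration in src/db/migrate.rs; types are assumptions.
const ADD_COMPRESSION_RELS: &str = "
    CREATE TABLE compression_rels (
        release INT NOT NULL REFERENCES releases(id),
        algorithm INT,
        UNIQUE (release, algorithm)
    );
";
```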

@jyn514 (Member, Author) commented Jun 5, 2020

As to using a custom zstd dictionary: it sounds like the Rust zstd crate needs some fairly major changes to make that performant, and I don't want to delay having any compression at all on that. If we want to use brotli instead for the storage savings, that sounds reasonable, but otherwise I think zstd 9 is fine.

@jyn514 changed the title from "[WIP] Add compression for uploaded documentation" to "Add compression for uploaded documentation" on Jun 5, 2020
(Resolved review thread on src/db/migrate.rs)
@Nemo157 (Member) commented Jun 5, 2020

I think it definitely makes sense to start with zstd now and add custom dictionary support later. Going with brotli now and moving to zstd custom dictionary later would require keeping both libraries around indefinitely.

Doing it in two steps also gives us the chance to see any performance impacts of each step.

One last thought I had was whether there should be a config var to disable compression, so if there are any issues in production that can just be toggled rather than having to do a rollback.

@jyn514 (Member, Author) commented Jun 5, 2020

> One last thought I had was whether there should be a config var to disable compression, so if there are any issues in production that can just be toggled rather than having to do a rollback.

It doesn't take very long to revert, about 5 minutes. In general I like the idea of having feature gates instead of "it's compiled in or it's not", but I'm a little wary of toggling features with environment variables; I'd rather come up with something more principled.

(Resolved review thread on src/storage/mod.rs)
Commits pushed:
- Don't require writing a database migration
- Make tests not compile if they aren't exhaustive
@jyn514 (Member, Author) commented Jun 11, 2020

Let me know if there are any more changes that need to be made; I think I addressed all the review comments.

@jyn514 (Member, Author) commented Jun 11, 2020

I just found that this will prevent us from updating builds if we rebuild a crate:

```
thread 'main' panicked at 'Building documentation failed: Error(Db(DbError { severity: "ERROR", parsed_severity: Some(Error), code: SqlState("23505"), message: "duplicate key value violates unique constraint \"compression_rels_release_algorithm_key\"", detail: Some("Key (release, algorithm)=(1, 0) already exists."), hint: None, position: None, where_: None, schema: Some("public"), table: Some("compression_rels"), column: None, datatype: None, constraint: Some("compression_rels_release_algorithm_key"), file: Some("nbtinsert.c"), line: Some(434), routine: Some("_bt_check_unique") }))', src/bin/cratesfyi.rs:321:21
```

I think I need to add an ON CONFLICT DO UPDATE clause.

Previously, postgres would give a duplicate key error when a crate was
built more than once:

```
thread 'main' panicked at 'Building documentation failed: Error(Db(DbError { severity: "ERROR", parsed_severity: Some(Error), code: SqlState("23505"), message: "duplicate key value violates unique constraint \"compression_rels_release_algorithm_key\"", detail: Some("Key (release, algorithm)=(1, 0) already exists."), hint: None, position: None, where_: None, schema: Some("public"), table: Some("compression_rels"), column: None, datatype: None, constraint: Some("compression_rels_release_algorithm_key"), file: Some("nbtinsert.c"), line: Some(434), routine: Some("_bt_check_unique") }))', src/bin/cratesfyi.rs:321:21
```

Now, duplicate keys are discarded.
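A sketch of the kind of insert that discards the duplicates, assuming the postgres crate; the exact statement in the PR may differ:

```rust
use postgres::Client;

// Record a (release, algorithm) pair, silently skipping rows that would
// violate the unique constraint instead of erroring.
fn record_compression(
    conn: &mut Client,
    release_id: i32,
    algorithm: i32,
) -> Result<(), postgres::Error> {
    conn.execute(
        "INSERT INTO compression_rels (release, algorithm)
         VALUES ($1, $2)
         ON CONFLICT DO NOTHING",
        &[&release_id, &algorithm],
    )?;
    Ok(())
}
```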
@jyn514 merged commit 8413226 into rust-lang:master on Jun 11, 2020
@jyn514 deleted the compression branch on June 11, 2020 at 21:18