compressed storage for rustdoc- and source-files #1342
Conversation
I think the priority should be access and atomic updates, updates don't have to be fast. Using a zipfile makes me nervous the updates aren't atomic - I haven't read the code, do you know off the top of your head what happens if a request comes in while you're updating the archive?
What do you mean by "guarantee that"? Why would checking the local cache be faster than a database access?
What's the difference between the zip central directory and the index? Where are each stored?
👍, I don't actually know what we use date_updated for but that seems fine.
We've been running into storage issues lately, but we also have very little storage on the prod machine - @pietroalbini would it be feasible to upgrade the storage from 100 GB to 150/200 or so? That should be enough. @syphar how much storage are we talking about? Anything less than a GB is probably fine and doesn't need changes to the prod machine.
Hmm, my only worry is that if the rebuild goes wrong somehow we'll lose the old docs. Since we build on nightly it may be hard to replicate the old builds, nightly is allowed to have breaking changes to unstable features. Maybe this could be optional? No need to implement it for now if it's a hassle.
Are we using a custom format for the index?
Hmm - what benefit do you see from doing that? We don't have checksums currently, right? I have more questions but I want to skim the diff first :)
an archive-index in storage::archive_index which supports storing file-names, their ranges and compression algorithms. Can be created based on zip, but is flexible if we want to use different formats later, or mix them.
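The shape of such an index can be sketched in a few lines of std-only Rust. All names here are hypothetical, not the actual `storage::archive_index` types — the point is only the mapping from file paths to byte ranges plus a per-file compression algorithm:

```rust
use std::collections::HashMap;

// Hypothetical types; the real index lives in `storage::archive_index`.
#[derive(Clone, Copy, Debug)]
enum Compression {
    None,
    Bzip2,
}

#[derive(Clone, Copy, Debug)]
struct FileEntry {
    /// byte offset of the compressed file data inside the archive
    offset: u64,
    /// length of the compressed data in bytes
    length: u64,
    compression: Compression,
}

#[derive(Default)]
struct ArchiveIndex {
    entries: HashMap<String, FileEntry>,
}

impl ArchiveIndex {
    fn insert(&mut self, path: &str, entry: FileEntry) {
        self.entries.insert(path.to_owned(), entry);
    }

    /// `.exists` queries can be answered locally, without any S3 request.
    fn exists(&self, path: &str) -> bool {
        self.entries.contains_key(path)
    }

    fn get(&self, path: &str) -> Option<FileEntry> {
        self.entries.get(path).copied()
    }
}

fn main() {
    let mut index = ArchiveIndex::default();
    index.insert(
        "dummy/index.html",
        FileEntry { offset: 1024, length: 4096, compression: Compression::Bzip2 },
    );
    assert!(index.exists("dummy/index.html"));
    assert!(!index.exists("missing.html"));
    let entry = index.get("dummy/index.html").unwrap();
    println!("fetch bytes {}..={}", entry.offset, entry.offset + entry.length - 1);
}
```

Because the per-entry compression is stored, the archive can mix algorithms or switch formats later without changing the index shape.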
I think the priority should be access and atomic updates, updates don't have to be fast. Using a zipfile makes me nervous the updates aren't atomic - I haven't read the code, do you know off the top of your head what happens if a request comes in while you're updating the archive?
I misunderstood this the first time - the archive is per-release, so it will never be updated, only replaced.
how much storage are we talking about? Anything less than a GB is probably fine and doesn't need changes to the prod machine.
I measured 949101 bytes (~1 MB) for an index of stm32f4 and 2325 bytes (~2 KB) for regex. So this would be between 250 MB and 90 GB for all crates (and much closer to the low end because stm32f4 is uncommonly large). I do think we should have an LRU cache if it ends up going much over 250 MB, but that will take a long time so I'm fine with not landing it in the initial draft.
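For a sanity check of those bounds: a round, purely hypothetical release count of 100,000 (not stated in the comment) roughly reproduces the quoted 250 MB / 90 GB range from the two measured index sizes:

```rust
fn main() {
    // measured index sizes from the comment above
    let large = 949_101u64; // stm32f4, ~1 MB
    let small = 2_325u64;   // regex, ~2 KB

    // hypothetical release count, chosen only to illustrate the bounds
    let releases = 100_000u64;

    let low = small * releases;  // if every index were regex-sized
    let high = large * releases; // if every index were stm32f4-sized

    println!("low:  {:.1} MB", low as f64 / 1e6);  // ~232.5 MB
    println!("high: {:.1} GB", high as f64 / 1e9); // ~94.9 GB
}
```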
do you see value in adding versioning to the index? (so when the format changes in a backwards incompatible way)
Are we using a custom format for the index?
I didn't realize we were compressing the index itself. I think storing the compression format seems useful, but it looks like we already do that?
```
cratesfyi=# select compression from files where path = 'rustdoc/regex/rustdoc-regex-1.4.0.zip.index';
 compression
-------------
           0
(1 row)
```
These seem like real regressions:
you're right, in my mind I'm always thinking about network roundtrips for databases, which we don't have here.
The ZIP central directory is at the end of the file. It contains all the filenames, and also the byte positions to fetch. I had a working prototype with a kind of virtual buffer that prefetched the end of the archive and then the range for the file. After that worked, it seemed more complex than just using our own index and decompressing the stream ourselves.
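For illustration, finding that directory boils down to scanning the file tail for the end-of-central-directory (EOCD) signature; the EOCD record then points at the central directory itself. A std-only sketch (not the prototype's code) of just the signature scan:

```rust
/// Locate the ZIP end-of-central-directory (EOCD) record by scanning
/// backwards for its signature `PK\x05\x06` (0x06054b50, little-endian).
/// A real reader would then parse the central-directory offset out of the
/// record; this sketch only finds where the record starts.
fn find_eocd(data: &[u8]) -> Option<usize> {
    const SIG: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];
    // The EOCD is at most 22 bytes plus a 65535-byte comment from the end.
    let start = data.len().saturating_sub(22 + 65_535);
    (start..data.len().saturating_sub(3))
        .rev()
        .find(|&i| data[i..i + 4] == SIG)
}

fn main() {
    // A minimal empty ZIP file is just a bare 22-byte EOCD record.
    let empty_zip: [u8; 22] = {
        let mut z = [0u8; 22];
        z[..4].copy_from_slice(&[0x50, 0x4b, 0x05, 0x06]);
        z
    };
    assert_eq!(find_eocd(&empty_zip), Some(0));
}
```

This is why a remote reader needs two prefetches (tail, then file range) before any data arrives, which is the extra round trip the custom index avoids.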
I'll add it to the after this PR list
It would only be a safeguard against potential problems when fetching the ranges. My feeling would be that we're fine without.
I'll experiment a little with using a different format (like CBOR), and/or also compressing the local index (currently it's plain text). Depending on that we can decide if the cleanup is a topic for now or after.
The compression on S3 is definitely already there, also in the content-encoding header on the remote storage. I was more talking about the format (JSON fields etc.). But IMHO we don't really need it.
I could free a little time to continue on this PR.
I changed the format to CBOR which saved some storage. Going by the size of the S3 bucket (~400 million files), we would end up with a maximum of ~6.2 GiB. So I agree we should have a cleanup (LRU or something else) at some point, but we could add this later.
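If an LRU cleanup is added later, the bookkeeping could look roughly like this std-only sketch (entirely hypothetical; the PR does not implement it, and it assumes each path is inserted only once):

```rust
use std::collections::HashMap;

/// Tiny LRU sketch for the local index cache. Keys are index paths,
/// values are the cached index size in bytes plus a last-access tick.
struct LruCache {
    max_bytes: u64,
    used_bytes: u64,
    clock: u64,
    entries: HashMap<String, (u64, u64)>, // path -> (size, last-access tick)
}

impl LruCache {
    fn new(max_bytes: u64) -> Self {
        Self { max_bytes, used_bytes: 0, clock: 0, entries: HashMap::new() }
    }

    /// Mark an index as recently used (e.g. on every archive access).
    fn touch(&mut self, path: &str) {
        self.clock += 1;
        if let Some(e) = self.entries.get_mut(path) {
            e.1 = self.clock;
        }
    }

    /// Insert a new index; evict least-recently-used entries until we fit.
    fn insert(&mut self, path: &str, size: u64) {
        self.clock += 1;
        self.entries.insert(path.to_owned(), (size, self.clock));
        self.used_bytes += size;
        while self.used_bytes > self.max_bytes {
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, &(_, tick))| tick)
                .map(|(k, _)| k.clone())
                .expect("over budget implies non-empty");
            let (size, _) = self.entries.remove(&oldest).unwrap();
            self.used_bytes -= size;
        }
    }

    fn contains(&self, path: &str) -> bool {
        self.entries.contains_key(path)
    }
}

fn main() {
    let mut cache = LruCache::new(100);
    cache.insert("stm32f4.index", 60);
    cache.insert("regex.index", 30);
    cache.touch("stm32f4.index"); // regex is now the LRU entry
    cache.insert("serde.index", 40); // pushes us over budget, evicts regex
    assert!(cache.contains("stm32f4.index"));
    assert!(!cache.contains("regex.index"));
}
```

Eviction only ever costs local disk: a dropped index is re-fetched from S3 on the next access.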
I just saw one thing I changed didn't compile :)
Why not zstd?
I wanted the archive to be a standard archive format which also supports range requests, and for that there is only ZIP. While inside the ZIP archive format we can use different compression algorithms, I wanted to choose an algorithm that is more widely supported by zip decompression tools (see facebook/zstd#1378 for ZIP itself supporting zstd). BZip2 is far better than deflate, and seemed to be a good starting point. In general the idea was also to directly reuse the archives for downloadable docs (see #174), though I still need to validate my assumptions in that direction. There was already some work by @Nemo157 regarding an archive format with zstd (see https://github.com/Nemo157/oubliette), but that would be a completely custom format.
Found https://github.com/martinellimarco/t2sz from facebook/zstd#395
The original genesis of that seems to be here: mxmlnkn/ratarmount#40
Please correct me if I'm wrong, but to my knowledge
I'll dig into your links later, thank you for these. A quick look didn't show me how this would support range requests? I assume we would use
Our goal is to keep the current page speed, which means we are limited to a maximum of one S3 request per rustdoc file to fetch.
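To illustrate that one-request budget: once the index is cached locally, serving a file needs exactly one ranged GET, whose `Range` header comes straight from the index entry. A sketch with a hypothetical helper name (the real code path is `storage::Storage.get_range`):

```rust
/// Build the HTTP `Range` header value for one file inside an archive,
/// given its byte offset and compressed length from the local index.
/// Hypothetical helper, shown only to make the request shape concrete.
fn range_header(offset: u64, length: u64) -> String {
    // HTTP byte ranges are inclusive on both ends (RFC 7233).
    format!("bytes={}-{}", offset, offset + length - 1)
}

fn main() {
    // e.g. a compressed `index.html` stored at offset 1024, 4096 bytes long
    let header = range_header(1024, 4096);
    assert_eq!(header, "bytes=1024-5119");
    println!("GET .../rustdoc-regex-1.4.0.zip with Range: {header}");
}
```

Any scheme that needs extra requests just to locate the data (fetching an archive tail, probing block boundaries) would break this budget on cold paths.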
We also have the option to switch compression formats in the future if it turns out that our storage costs are growing faster than expected. But I agree with @syphar, having the smallest possible archive size is not a goal at the moment.
Alternatively, we could keep the defaults the same and only rebuild the default platform.
Unless I'm mistaken, I think https://github.com/lz4/lz4 supports random access.
Using a container-format like ZIP has the advantage of being self-contained.
If it's compressed with t2sz then, in theory, you could use range requests to get the specific ZSTD block and then decompress that... although I'm not 100% sure.
Not to throw a wrench in the works at this point, but I'm not sure that from an architectural point of view this is a good idea... (See my comment in #174 about the compression ratios of a solid ZSTD archive and of a t2sz converted/indexed archive.)

Indexed GZIP is used widely in the bioinformatics world for storing genetic data -- see https://github.com/samtools/htslib/blob/develop/bgzip.c. BGZF is an indexed GZIP variant that supports random access. (See https://github.com/samtools/hts-specs and http://samtools.github.io/hts-specs/SAMv1.pdf for info on how this is applied.)

Another indexed binary format very widely used in the meteorology world is GRIB2 (Gridded Binary 2) -- the main output format for forecast systems like NOAA/NWS's GFS (Global Forecast System). See https://nomads.ncep.noaa.gov/txt_descriptions/fast_downloading_grib_doc.shtml for an example. The GRIB2 format uses a simple index that looks like this:
The format is:
I would also look at THREDDS (https://github.com/Unidata/tds) and Siphon (https://github.com/Unidata/siphon), which allow for the subsetting/indexing of geospatial gridded forecast data (see https://unidata.github.io/python-training/gallery/500hpa_hght_winds/). I know these might seem unrelated, but they're pretty complete (and complex) examples of ways that the problem of retrieving a small amount of data from a very large file has been successfully solved.
Yes, that would make sense to me - a directory/index would seemingly be required...
And this point is where I question the architecture... Because it seems like there's an opportunity to maybe reduce the number of S3 requests overall... I see a couple of theoretical options:
Option 3 offers some interesting possibilities -- if the entire compressed crate documentation is sent to the client and that client views more than one page, you're saving S3 requests, are you not? Likewise, if that compressed data is cached client-side and someone frequently accesses it... no requests required (other than maybe a "hey, is my locally cached copy still up-to-date?" query). (Sorry this got kinda long, but I wanted to be clear and complete 😁)
After digging in a bit further, I was able to (pretty easily, given that I'm pretty much a Rust novice) add an archival step (outputting a tar.zst) to the build process in src/docbuilder/rustwide_builder.rs. I did not realize, however, that the output is not flat and complete at that point -- it relies on the web server component of docs.rs to actually render valid HTML. (But perhaps there's an easy way to make static docs... I guess the point is that storing the generated doctree (e.g. the files in

Random question: does the DB/webserver component of docs.rs really offer a huge benefit in terms of end-user functionality that couldn't be handled with static files/client-side AJAX? Most documentation generators (e.g. Sphinx, etc.) generate static HTML, and then that's what gets served up on github.io or wherever... (Not saying that's the only way, but there is a major advantage: speed.)

Maybe the direction for docs.rs should be to move the web serving to a web server (take your pick) and have docs.rs generate content and/or act as an API for the database stuff. I recognize that's a pretty big structural change, but it would also offer some flexibility/scalability options. (I don't know that having clients access S3 directly is always a good idea [S3 is OK as a CDN I guess...] ... but the performance of apache/nginx when serving static files is incredible compared to dynamic content... I also have to believe it's way faster than processing the requests through docs.rs's webserver.)
@jyn514 some notes on this merge: since the last review was some time ago, it's perhaps best not to look through the changes but to review the whole PR again -- that's up to you, though. Here are some things that changed (from memory, so probably incomplete):
Some of the rustdoc tests are failing, I'll dig into this.
No, there is not an easy way to make it static: the header is dynamic and changes whenever a new version is published.
Well, I personally would be very disappointed if the 'go to latest version' link went away and you had to explicitly click through to /crate to see what the latest version is. People have enough trouble finding /crate already, we don't need to move more core functionality there. If we do this through AJAX only, that means people can no longer disable javascript. See #845 (comment) for an idea I had about how to avoid that, but it's completely separate from compression.
This is only feasible if the crate docs are small enough to send to the client. We have several crates with millions of pages and multiple gigabytes of docs. We've discussed caching the whole archive on the build server in the past, it should be feasible as long as the archive is small enough: #1004 (comment). Please move discussion about rethinking the entire docs.rs website to the tracking issue (#1004) instead of the PR, it makes it harder to find what progress is being made.
I have a bunch of implementation details I commented on but this looks like a good approach :) I tested it locally and it works great. I think we're getting close.
r=me with the nit fixed :) tests are failing for some reason though:
Hmm, they work for me locally - let me rerun it and see if that helps.
I'm still working through the test failures; I wasn't able to do as much as I wanted in the last few weeks.
@syphar I did some debugging - the target-redirect page is still working correctly; the problem is that this lookup (line 351 in 44e68e4) sees `path = x86_64-apple-darwin/dummy/index.html`, `name = dummy`, `version = 0.2.0` (whether or not `archive_storage` is true). Which is wrong, because that file is still being uploaded.

In particular, this query (docs.rs/src/storage/database.rs, line 62 in 44e68e4) is returning no rows for `path = rustdoc/dummy/0.2.0/x86_64-apple-darwin/dummy/index.html`.
Note to myself, I found it (not fixed yet). Problem is that in
So, only a test-fake issue, likely not in prod.
@jyn514 tests are green (🎉), this is ready for another review. I would propose that before merging I squash this into a single commit. I have some open points in the potential next steps, but as I understood it, I'll tackle them once this is merged.
I worked a little on the compressed storage topic, and this is a first design draft.
It's working, so you can run builds, see the archives, and also click through rustdoc- or source-pages.
Before I finish this up, I would love to hear your ideas / your feedback, so I know I'm going in the right direction.
I squashed the commits since in reality I was jumping back and forth between topics all the time, so the commits wouldn't help you. So when you look at the code, the best way to follow what I thought is to go in this order:

- `storage::compression` (+ tests + bench)
- `storage::s3`, `storage::file`, also in `storage::Storage.get_range` (some tests in `storage::tests`)
- an archive-index in `storage::archive_index` which supports storing file-names, their ranges and compression algorithms. Can be created based on zip, but is flexible if we want to use different formats later, or mix them.
- `storage::Storage.get_index_for` handles fetching the index for a certain archive-path, and also creates a local cache for these index files. `.exists` queries to storage can be answered from that cache.
- `storage::Storage.store_all_in_archive` will take a folder and store it in a zip-file together with a new index (local and remote).
- `storage::Storage.get_from_archive` fetches a remote file from an archive. When the index is not cached, it is fetched first; then we only fetch and decompress the data.
- `storage::Storage.exists_in_archive` for exist-queries. When the index is local, this works without a request to S3.
- `docbuilder::rustwide_builder` starts creating two archives per release instead of uploading folders (one for the sources, one for the doc-build output). It also stores a bool into the release-table so we know the web handlers should look for archives.
- `web::rustdoc::rustdoc_html_server_handler` can fetch files from the archive if the release data stated `archive_storage`.
- `web::source::source_browser_handler` also fetches files from the archive. Currently it does a database fetch to know when to use archive-queries and when not. This could be changed to only checking for the local index cache, if we can guarantee that.

My general thoughts:

- `date_updated` inside an archive is just the `date_updated` of the archive itself
- IMHO everything around storage, compression and the index is in a good place, perhaps only needs some small tests around the errors that can happen.

Potential next steps / things I think have to be finished:

Questions/unsure:

- `archive_storage` attribute in the mix between `MetaData` and `CrateDetails`.
- should we use a totally separate path for the new archives? That would make cleanup easier
- do we want to store/control in more detail where to use archives and where not?
- should I clear the destination folders before uploading the archives? So a rebuild would replace the separate-file storage with the archives
- should we use another storage format for the index? (I saw CBOR being used in oubliette)
- do we want to safeguard against problems by adding a data-hash to the index? (like `CRC32` inside zip files)
- do you see value in adding versioning to the index? (so when the format changes in a backwards-incompatible way)
- should we add some LRU cleanup for the index? Or can we assume that we have enough storage for all of them? (could be added later too)
- do you see other things we need to add for operating this? Command line tools?
- are there any metrics you would like to see?

Potential next steps after this PR here:

- we could replace the `releases.files` column with just getting and using the index

Fixes #1004