compressed storage for rustdoc- and source-files #1342
Conversation
I think the priority should be access and atomic updates, updates don't have to be fast. Using a zipfile makes me nervous the updates aren't atomic - I haven't read the code, do you know off the top of your head what happens if a request comes in while you're updating the archive?
What do you mean by "guarantee that"? Why would checking the local cache be faster than a database access?
What's the difference between the zip central directory and the index? Where are each stored?
👍, I don't actually know what we use date_updated for but that seems fine.
We've been running into storage issues lately, but we also have very little storage on the prod machine - @pietroalbini would it be feasible to upgrade the storage from 100 GB to 150/200 or so? That should be enough. @syphar how much storage are we talking about? Anything less than a GB is probably fine and doesn't need changes to the prod machine.
Hmm, my only worry is that if the rebuild goes wrong somehow we'll lose the old docs. Since we build on nightly it may be hard to replicate the old builds, nightly is allowed to have breaking changes to unstable features. Maybe this could be optional? No need to implement it for now if it's a hassle.
Are we using a custom format for the index?
Hmm - what benefit do you see from doing that? We don't have checksums currently, right? I have more questions but I want to skim the diff first :)
an archive-index in storage::archive_index which supports storing file-names, their ranges and compression algorithms. Can be created based on zip, but is flexible if we want to use different formats later, or mix them.
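The shape of such an index can be sketched in a few lines of std-only Rust. All names here are hypothetical, not the actual `storage::archive_index` types — the point is only the mapping from file paths to byte ranges plus a per-file compression algorithm:

```rust
use std::collections::HashMap;

// Hypothetical types; the real index lives in `storage::archive_index`.
#[derive(Clone, Copy, Debug)]
enum Compression {
    None,
    Bzip2,
}

#[derive(Clone, Copy, Debug)]
struct FileEntry {
    /// byte offset of the compressed file data inside the archive
    offset: u64,
    /// length of the compressed data in bytes
    length: u64,
    compression: Compression,
}

#[derive(Default)]
struct ArchiveIndex {
    entries: HashMap<String, FileEntry>,
}

impl ArchiveIndex {
    fn insert(&mut self, path: &str, entry: FileEntry) {
        self.entries.insert(path.to_owned(), entry);
    }

    /// `.exists` queries can be answered locally, without any S3 request.
    fn exists(&self, path: &str) -> bool {
        self.entries.contains_key(path)
    }

    fn get(&self, path: &str) -> Option<FileEntry> {
        self.entries.get(path).copied()
    }
}

fn main() {
    let mut index = ArchiveIndex::default();
    index.insert(
        "dummy/index.html",
        FileEntry { offset: 1024, length: 4096, compression: Compression::Bzip2 },
    );
    assert!(index.exists("dummy/index.html"));
    assert!(!index.exists("missing.html"));
    let entry = index.get("dummy/index.html").unwrap();
    println!("fetch bytes {}..={}", entry.offset, entry.offset + entry.length - 1);
}
```

Because the per-entry compression is stored, the archive can mix algorithms or switch formats later without changing the index shape.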
I think the priority should be access and atomic updates, updates don't have to be fast. Using a zipfile makes me nervous the updates aren't atomic - I haven't read the code, do you know off the top of your head what happens if a request comes in while you're updating the archive?
I misunderstood this the first time - the archive is per-release, so it will never be updated, only replaced.
how much storage are we talking about? Anything less than a GB is probably fine and doesn't need changes to the prod machine.
I measured 949101 bytes (~1 MB) for an index of stm32f4 and 2325 bytes (~2 KB) for regex. So this would be between 250 MB and 90 GB for all crates (and much closer to the low end because stm32f4 is uncommonly large). I do think we should have an LRU cache if it ends up going much over 250 MB, but that will take a long time so I'm fine with not landing it in the initial draft.
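For a sanity check of those bounds: a round, purely hypothetical release count of 100,000 (not stated in the comment) roughly reproduces the quoted 250 MB / 90 GB range from the two measured index sizes:

```rust
fn main() {
    // measured index sizes from the comment above
    let large = 949_101u64; // stm32f4, ~1 MB
    let small = 2_325u64;   // regex, ~2 KB

    // hypothetical release count, chosen only to illustrate the bounds
    let releases = 100_000u64;

    let low = small * releases;  // if every index were regex-sized
    let high = large * releases; // if every index were stm32f4-sized

    println!("low:  {:.1} MB", low as f64 / 1e6);  // ~232.5 MB
    println!("high: {:.1} GB", high as f64 / 1e9); // ~94.9 GB
}
```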
do you see value in adding versioning to the index? (so when the format changes in a backwards incompatible way)
Are we using a custom format for the index?
I didn't realize we were compressing the index itself. I think storing the compression format seems useful, but it looks like we already do that?
```
cratesfyi=# select compression from files where path = 'rustdoc/regex/rustdoc-regex-1.4.0.zip.index';
 compression
-------------
           0
(1 row)
```
These seem like real regressions:
you're right, in my mind I'm always thinking about network roundtrips for databases, which we don't have here.
The ZIP central directory is at the end of the file. It contains all the filenames, and also the byte positions to fetch. I had a working prototype with a kind of virtual buffer that prefetched the end of the archive and then the range for the file. After that worked, it seemed more complex than just using our own index and decompressing the stream ourselves.
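For illustration, finding that directory boils down to scanning the file tail for the end-of-central-directory (EOCD) signature; the EOCD record then points at the central directory itself. A std-only sketch (not the prototype's code) of just the signature scan:

```rust
/// Locate the ZIP end-of-central-directory (EOCD) record by scanning
/// backwards for its signature `PK\x05\x06` (0x06054b50, little-endian).
/// A real reader would then parse the central-directory offset out of the
/// record; this sketch only finds where the record starts.
fn find_eocd(data: &[u8]) -> Option<usize> {
    const SIG: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];
    // The EOCD is at most 22 bytes plus a 65535-byte comment from the end.
    let start = data.len().saturating_sub(22 + 65_535);
    (start..data.len().saturating_sub(3))
        .rev()
        .find(|&i| data[i..i + 4] == SIG)
}

fn main() {
    // A minimal empty ZIP file is just a bare 22-byte EOCD record.
    let empty_zip: [u8; 22] = {
        let mut z = [0u8; 22];
        z[..4].copy_from_slice(&[0x50, 0x4b, 0x05, 0x06]);
        z
    };
    assert_eq!(find_eocd(&empty_zip), Some(0));
}
```

This is why a remote reader needs two prefetches (tail, then file range) before any data arrives, which is the extra round trip the custom index avoids.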
I'll add it to the after this PR list
It would only be a safeguard against potential problems when fetching the ranges. My feeling would be that we're fine without.
I'll experiment a little with using a different format (like CBOR), and/or also compressing the local index (currently it's plain text). Depending on that we can decide if the cleanup is a topic for now or after.
The compression on S3 is definitely already there, also in the content-encoding header on the remote storage. I was more talking about the format (JSON fields etc.). But IMHO we don't really need it.
I could free a little time to continue on this PR.
I changed the format to CBOR which saved some storage. Going by the size of the S3 bucket (~400 million files), we would end up with a maximum of ~6.2 GiB. So I agree we should have a cleanup (LRU or something else) at some point, but we could add this later.
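If an LRU cleanup is added later, the bookkeeping could look roughly like this std-only sketch (entirely hypothetical; the PR does not implement it, and it assumes each path is inserted only once):

```rust
use std::collections::HashMap;

/// Tiny LRU sketch for the local index cache. Keys are index paths,
/// values are the cached index size in bytes plus a last-access tick.
struct LruCache {
    max_bytes: u64,
    used_bytes: u64,
    clock: u64,
    entries: HashMap<String, (u64, u64)>, // path -> (size, last-access tick)
}

impl LruCache {
    fn new(max_bytes: u64) -> Self {
        Self { max_bytes, used_bytes: 0, clock: 0, entries: HashMap::new() }
    }

    /// Mark an index as recently used (e.g. on every archive access).
    fn touch(&mut self, path: &str) {
        self.clock += 1;
        if let Some(e) = self.entries.get_mut(path) {
            e.1 = self.clock;
        }
    }

    /// Insert a new index; evict least-recently-used entries until we fit.
    fn insert(&mut self, path: &str, size: u64) {
        self.clock += 1;
        self.entries.insert(path.to_owned(), (size, self.clock));
        self.used_bytes += size;
        while self.used_bytes > self.max_bytes {
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, &(_, tick))| tick)
                .map(|(k, _)| k.clone())
                .expect("over budget implies non-empty");
            let (size, _) = self.entries.remove(&oldest).unwrap();
            self.used_bytes -= size;
        }
    }

    fn contains(&self, path: &str) -> bool {
        self.entries.contains_key(path)
    }
}

fn main() {
    let mut cache = LruCache::new(100);
    cache.insert("stm32f4.index", 60);
    cache.insert("regex.index", 30);
    cache.touch("stm32f4.index"); // regex is now the LRU entry
    cache.insert("serde.index", 40); // pushes us over budget, evicts regex
    assert!(cache.contains("stm32f4.index"));
    assert!(!cache.contains("regex.index"));
}
```

Eviction only ever costs local disk: a dropped index is re-fetched from S3 on the next access.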
I just saw one thing I changed didn't compile :)
Why not zstd?
I wanted the archive to be a standard archive format which also supports range requests, and for that there is only ZIP. While inside the ZIP archive format we can use different compression algorithms, I wanted to choose an algorithm that is more widely supported by zip decompression tools (see facebook/zstd#1378 for ZIP itself supporting zstd). BZip2 is far better than deflate, and seemed to be a good starting point. In general the idea was also to directly reuse the archives for downloadable docs (see #174), though I still need to validate my assumptions in that direction. There was already some work by @Nemo157 regarding an archive format with zstd (see https://github.com/Nemo157/oubliette), but that would be a completely custom format.
Found https://github.com/martinellimarco/t2sz from facebook/zstd#395
The original genesis of that seems to be here: mxmlnkn/ratarmount#40
Please correct me if I'm wrong, but to my knowledge
I'll dig into your links later, thank you for these. A quick look didn't show me how this would support range requests? I assume we would use
Our goal is to keep the current page speed, which means we are limited to a maximum of one S3 request per rustdoc file to fetch.
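To illustrate that one-request budget: once the index is cached locally, serving a file needs exactly one ranged GET, whose `Range` header comes straight from the index entry. A sketch with a hypothetical helper name (the real code path is `storage::Storage.get_range`):

```rust
/// Build the HTTP `Range` header value for one file inside an archive,
/// given its byte offset and compressed length from the local index.
/// Hypothetical helper, shown only to make the request shape concrete.
fn range_header(offset: u64, length: u64) -> String {
    // HTTP byte ranges are inclusive on both ends (RFC 7233).
    format!("bytes={}-{}", offset, offset + length - 1)
}

fn main() {
    // e.g. a compressed `index.html` stored at offset 1024, 4096 bytes long
    let header = range_header(1024, 4096);
    assert_eq!(header, "bytes=1024-5119");
    println!("GET .../rustdoc-regex-1.4.0.zip with Range: {header}");
}
```

Any scheme that needs extra requests just to locate the data (fetching an archive tail, probing block boundaries) would break this budget on cold paths.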
We also have the option to switch compression formats in the future if it turns out that our storage costs are growing faster than expected. But I agree with @syphar, having the smallest possible archive size is not a goal at the moment.
Alternatively, we could keep the defaults the same and only rebuild the default platform.
Unless I'm mistaken, I think https://github.com/lz4/lz4 supports random access.
Using a container-format like ZIP has the advantage of being self-contained.
If it's compressed with t2sz then, in theory, you could use range requests to get the specific ZSTD block and then decompress that... although I'm not 100% sure.
Not to throw a wrench in the works at this point, but I'm not sure that from an architectural point of view this is a good idea... (See my comment in #174 about the compression ratios of a solid ZSTD archive and of a t2sz converted/indexed archive.)

Indexed GZIP is used widely in the bioinformatics world for storing genetic data -- see https://github.com/samtools/htslib/blob/develop/bgzip.c. BGZF is an indexed GZIP variant that supports random access. (See https://github.com/samtools/hts-specs and http://samtools.github.io/hts-specs/SAMv1.pdf for info on how this is applied.)

Another indexed binary format very widely used in the meteorology world is GRIB2 (Gridded Binary 2) -- the main output format for forecast systems like NOAA/NWS's GFS (Global Forecast System). See https://nomads.ncep.noaa.gov/txt_descriptions/fast_downloading_grib_doc.shtml for an example. The GRIB2 format uses a simple index that looks like this:
The format is:
I would also look at THREDDS (https://github.com/Unidata/tds) and Siphon (https://github.com/Unidata/siphon), which allow for the subsetting/indexing of geospatial gridded forecast data (see https://unidata.github.io/python-training/gallery/500hpa_hght_winds/). I know these might seem unrelated, but they're pretty complete (and complex) examples of ways that the problem of retrieving a small amount of data from a very large file has been successfully solved.
Yes, that would make sense to me - a directory/index would seemingly be required...
And this point is where I question the architecture... Because it seems like there's an opportunity to maybe reduce the number of S3 requests overall... I see a couple of theoretical options:
Option 3 offers some interesting possibilities -- if the entire compressed crate documentation is sent to the client and that client views more than one page, you're saving S3 requests, are you not? Likewise, if that compressed data is cached client-side and someone frequently accesses it... no requests required (other than maybe a "hey, is my locally cached copy still up-to-date?" query). (Sorry this got kinda long, but I wanted to be clear and complete 😁)
After digging in a bit further, I was able to (pretty easily, given that I'm pretty much a Rust novice) add an archival step (outputting a tar.zst) to the build process in src/docbuilder/rustwide_builder.rs. I did not realize, however, that the output is not flat and complete at that point -- it relies on the web server component of docs.rs to actually render valid HTML. (But perhaps there's an easy way to make static docs... I guess the point is that storing the generated doctree (e.g. the files in

Random question: does the DB/webserver component of docs.rs really offer a huge benefit in terms of end-user functionality that couldn't be handled with static files/client-side AJAX? Most documentation generators (e.g. Sphinx, etc.) generate static HTML, and then that's what gets served up on github.io or wherever... (Not saying that's the only way, but there is a major advantage: speed.)

Maybe the direction for docs.rs should be to move the web serving to a web server (take your pick) and have docs.rs generate content and/or act as an API for the database stuff. I recognize that's a pretty big structural change, but it would also offer some flexibility/scalability options. (I don't know that having clients access S3 directly is always a good idea [S3 is OK as a CDN I guess...] ... but the performance of apache/nginx when serving static files is incredible compared to dynamic content... I also have to believe it's way faster than processing the requests through docs.rs's webserver.)
@jyn514 some notes on this merge: since the last review was some time ago, it's perhaps best not to look through the changes but to review the whole PR again -- that's up to you, though. Here are some things that changed (from memory, so probably incomplete):
Some of the rustdoc tests are failing, I'll dig into this.
No, there is not an easy way to make it static: the header is dynamic and changes whenever a new version is published.
Well, I personally would be very disappointed if the 'go to latest version' link went away and you had to explicitly click through to /crate to see what the latest version is. People have enough trouble finding /crate already, we don't need to move more core functionality there. If we do this through AJAX only, that means people can no longer disable javascript. See #845 (comment) for an idea I had about how to avoid that, but it's completely separate from compression.
This is only feasible if the crate docs are small enough to send to the client. We have several crates with millions of pages and multiple gigabytes of docs. We've discussed caching the whole archive on the build server in the past, it should be feasible as long as the archive is small enough: #1004 (comment). Please move discussion about rethinking the entire docs.rs website to the tracking issue (#1004) instead of the PR, it makes it harder to find what progress is being made.
I have a bunch of implementation details I commented on but this looks like a good approach :) I tested it locally and it works great. I think we're getting close.
r=me with the nit fixed :) tests are failing for some reason though:
Hmm, they work for me locally - let me rerun it and see if that helps.
I'm still working through the test failures; I wasn't able to do as much as I wanted in the last few weeks.
@syphar I did some debugging - the target-redirect page is still working correctly; the problem is that this lookup (line 351 in 44e68e4) sees `path = x86_64-apple-darwin/dummy/index.html`, `name = dummy`, `version = 0.2.0` (whether or not `archive_storage` is true). Which is wrong, because that file is still being uploaded.

In particular, this query (docs.rs/src/storage/database.rs, line 62 in 44e68e4) is returning no rows for `path = rustdoc/dummy/0.2.0/x86_64-apple-darwin/dummy/index.html`.
Note to myself, I found it (not fixed yet). Problem is that in
So, only a test-fake issue, likely not in prod.
@jyn514 tests are green (🎉), this is ready for another review. I would propose that before merging I squash this into a single commit. I have some open points in the potential next steps, but as I understood it, I'll tackle them once this is merged.
I worked a little on the compressed storage topic, and this is a first design draft.
It's working, so you can run builds, see the archives, and also click through rustdoc- or source-pages.
Before I finish this up, I would love to hear your ideas / your feedback, so I know I'm going in the right direction.
I squashed the commits since in reality I was jumping back and forth between topics all the time, so the commits wouldn't help you. So when you look at the code, the best way to follow what I thought is to go in this order:

- `storage::compression` (+ tests + bench)
- `storage::s3`, `storage::file`, also in `storage::Storage.get_range` (some tests in `storage::tests`)
- an archive-index in `storage::archive_index` which supports storing file-names, their ranges and compression algorithms. Can be created based on zip, but is flexible if we want to use different formats later, or mix them.
- `storage::Storage.get_index_for` handles fetching the index for a certain archive-path, and also creates a local cache for these index files. `.exists` queries to storage can be answered from that cache.
- `storage::Storage.store_all_in_archive` will take a folder and store it in a zip-file together with a new index (local and remote).
- `storage::Storage.get_from_archive` fetches a remote file from an archive. When the index is not cached, it is fetched first; then we only fetch and decompress the data.
- `storage::Storage.exists_in_archive` for exist-queries. When the index is local, this works without a request to S3.
- `docbuilder::rustwide_builder` starts creating two archives per release instead of uploading folders (one for the sources, one for the doc-build output). It also stores a bool into the release-table so we know the web handlers should look for archives.
- `web::rustdoc::rustdoc_html_server_handler` can fetch files from the archive if the release data stated `archive_storage`.
- `web::source::source_browser_handler` also fetches files from the archive. Currently it does a database fetch to know when to use archive-queries and when not. This could be changed to only checking for the local index cache, if we can guarantee that.

My general thoughts:

- `date_updated` inside an archive is just the `date_updated` of the archive itself
- IMHO everything around storage, compression and the index is in a good place, perhaps only needs some small tests around the errors that can happen.

Potential next steps / things I think have to be finished:

Questions/unsure:

- `archive_storage` attribute in the mix between `MetaData` and `CrateDetails`.
- should we use a totally separate path for the new archives? That would make cleanup easier
- do we want to store/control in more detail where to use archives and where not?
- should I clear the destination folders before uploading the archives? So a rebuild would replace the separate-file storage with the archives
- should we use another storage format for the index? (I saw CBOR being used in oubliette)
- do we want to safeguard against problems by adding a data-hash to the index? (like `CRC32` inside zip files)
- do you see value in adding versioning to the index? (so when the format changes in a backwards-incompatible way)
- should we add some LRU cleanup for the index? Or can we assume that we have enough storage for all of them? (could be added later too)
- do you see other things we need to add for operating this? Command line tools?
- are there any metrics you would like to see?

Potential next steps after this PR here:

- we could replace the `releases.files` column with just getting and using the index

Fixes #1004