Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Show Contents of Readme on Crate Pages #869

Merged
merged 39 commits into from
Aug 17, 2017

Conversation

kureuil
Copy link
Contributor

@kureuil kureuil commented Jul 9, 2017

Hey everyone, I started working on README rendering earlier and I thought that I should get some feedback now that it is working correctly on my local instance.

I edited the /crates/new endpoint so that when a crate is uploaded to the registry, its README (markdown only) is rendered to HTML and a readmes/{crate}/{crate}-{version}.html file is created on S3 when the crate file is uploaded there.
Then, when visiting a crate page, the Ember application makes a request to {s3}/readmes/{crate}/{crate}-{version}.html: if the file is found, its content is rendered in place, if not nothing happens.
This way, the renderd README is cached and versioned, following @ashleygwilliams recommendations.

On the crate/version route, the README contents is displayed in place of the package description.

The generated HTML is sanitized using Ammonia. This crate removes all tags and attributes that are not whitelisted, making the resulting markup as safe as possible for the end user. Bonus point: is uses the html5ever crate (created by the Servo project) to parse HTML.


Finally, I added a new script in `src/bin` named `render-readmes`. Its goal is to render the README of every uploaded crate version and to upload it on S3. This means that if modify the renderer behavior and we want to retroactively apply it, we're just one command (and a lil' bit of time) away ! (Though I might have left too much `unwrap()`/`expect()` in it)

TODO:

* [x] Security checks in `render::markdown_to_html` (script tag, javascript callbacks, more ?)
* [x] Tests
* [x] Documentation
* [x] ~Edit delete-crate & delete-version scripts to delete README files~ It would seem that the scripts only delete database entries, not stored files
* [x] Script to render published crates README (?)

Fixes #81

Preview:

![screenshot from 2017-07-10 00-22-45](https://user-images.githubusercontent.com/1275463/27998080-f83bad7a-6505-11e7-884e-95ee4f7c2a47.png)

@kureuil kureuil changed the title [WIP] Show Contents of Readme on Crate Pages Show Contents of Readme on Crate Pages Jul 9, 2017
@kureuil kureuil force-pushed the 81-show-crate-readme branch from 225d8b9 to 1308f1c Compare July 14, 2017 23:14
@kureuil
Copy link
Contributor Author

kureuil commented Jul 15, 2017

@carols10cents This PR should be ready for review now :)

Tell me if you need anything concerning this.

Cargo.toml Outdated
@@ -49,6 +50,9 @@ serde_derive = "1.0.0"
serde = "1.0.0"
clippy = { version = "=0.0.142", optional = true }
chrono = "0.4.0"
pulldown-cmark = { version = "0.0.15", default-features = false }
ammonia = "0.5.0"
syntect = "1.7.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For dependency version specification, did we want to leave out the specific patch version e.g. 0.5 compared to 0.5.0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, I just copied and pasted the requirement line from the crates' pages (example). If I understand correctly the cargo documentation, this is just a stylistic issue. I'd be happy to change it if that's what we want, but I'm not sure this is needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we leave out the patch version, we get the latest patch version when the Cargo.lock file gets updated via cargo update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand... The current lock file references diesel 0.14.1 despite having diesel = 0.14.0 in the Cargo.toml. Where is the behavior you describe documented ?

Copy link
Contributor

@vignesh-sankaran vignesh-sankaran Jul 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes you're right, apologies, I should have read the cargo documentation as well.

src/uploaders.rs Outdated
@@ -54,35 +54,54 @@ impl Uploader {
}
}

pub fn readme_location(&self, crate_name: &str, version: &str) -> Option<String> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a documentation comment here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on it ! :)

Copy link
Contributor

@vignesh-sankaran vignesh-sankaran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to get crates.io to cooperate on Debian sid, I'll be able to review this fully during this week.

Copy link
Member

@carols10cents carols10cents left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is amazing!!! So much work, thank you so much!!!!! I have a few small review comments, and I have some overall thoughts I'm going to leave in a non-review comment, but I'm super excited for this!!! ❗️ ❣️ ‼️

src/uploaders.rs Outdated
@@ -82,40 +82,34 @@ impl Uploader {
format!("readmes/{}/{}-{}.html", name, name, version)
}

/// Uploads a file using the configured uploader (either S3, Local or NoOp).
/// It returns a a tuple containing the path of the uploaded file
/// and its checksum.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this refactoring!!!!! 😻 😻 😻 😻

src/uploaders.rs Outdated
response
};
if handle.response_code().unwrap() != 200 {
if let Err(e) = self.delete(req.app().clone(), &crate_path) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I noticed from the old code that's not in the new code is this delete call if the response isn't 200-- I think what this does is makes sure we delete anything on s3 in case, say, the file was partially uploaded. Was there a reason for removing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This delete call is here to remove the uploaded crate if the readme couldn't be uploaded. It technically wasn't removed. In the new code, I'm creating the Bomb for the uploaded crate before uploading the readme so that if the readme upload fails a CargoResult::Error() is returned, triggering the Bomb and deleting the uploaded crate file. I hope I got the Bomb behavior correct.

version.num,
path.display()
));
match file {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL this code was really confusing to me until I looked at the tar docs and saw that tar::Entries iterates over Result<Entry> 😂😂😂

Since everywhere else in find_file_by_path is calling expect and crashing, wdyt about calling expect instead of this match and making this method return just String rather than Option<String>, and removing the unwrap on line 200?

I'm not entirely sure what the error would be here, so I don't know what to say in the expect. I was going to suggest making the unwrap on line 200 be an expect that says "file not found" but I see that there's already an expect that says just that in this method!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I might have done a little bit too much expect/unwrap for a script supposed to be long-running though it would seem that panic'ing in a thread does not affect the main one.

While I'm at it, do we want to refactor the threading model on this script to use a real threadpool and not create a thread for each crate ? It would certainly improve performance but we couldn't expect/unwrap as much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt that we're at the scale where we need to be concerned about performance at that level

}

/// Creates and Arc over an App instance.
fn make_app() -> Arc<App> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplication with src/bin/server.rs is a bit 😿😿😿

Do you think it would be possible to factor out the duplication?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely agree with you ! Do you know where we would want this, as a new function in the app module (app::make) ? As a new method on the App structure (App::from_env) ? Somewhere else ?

src/render.rs Outdated
}

/// Renders the given markdown to HTML using the current settings.
pub fn to_html(&self, text: &str) -> CargoResult<String> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome!!!!!!

@carols10cents
Copy link
Member

Again, this is looking awesome!! Since this is such a big change though (which again, is awesome!!!), I want to do a bit more review and testing before getting this out in production.

  • I'd love to get a skim on this from @alexcrichton and @ashleygwilliams.

    • Alex, this would render a README per crate version as HTML and upload that HTML to s3 and serve it from there. I know s3 space is cheap, but do you have any concerns around that?
    • Ashley, are there any things that npm does with its READMEs now that we've forgotten here? I'm about to reread your blog post again to see if I find anything. Things like the list of tags that are whitelisted-- could you help me find those in npm's code and crossreference them? I'd also be happy to jump on a call to discuss if you'd rather.
  • I want to do some performance testing:

    • Individual publish before and after this change, using the local uploader
    • Individual publish running locally but uploading to s3
    • How long it takes the render-readmes script to run for how many versions with realistic READMEs (I want to know approximately how long this will take before running this on production!!)
  • Depending on the results of those performance tests, that might sway my thoughts about whether we should do these possible enhancements now or whether we can put them off until we see more of a performance hit:

    • Should we move the readme rendering/uploading into a background job and out of the publish request entirely?
    • We have the README markdown for the most recent published version stored in the database since it's used for search. Would it be worth it to have a render-latest-readmes script that could rerender just those? The use case I'm thinking of is a bug of some sort, and that not having to download the tar files would be faster and let us fix a bug on at least the majority of crate landing pages sooner.
    • If the render-readmes task is going to take a long time on production (like over, idk, 10 min?), I wouldn't want to have to keep rerunning it on everything-- it'd be nice to have a way for it to pick up where it left off if it hits a problem and stops. So perhaps store a readme_last_rendered datetime and have a query that gets only the versions whose readmes haven't been rendered since a given time?

@vignesh-sankaran I'm definitely interested in how this affects development-- let's try to get your development environment fixed. Please ping me if you're still having trouble!

@carols10cents
Copy link
Member

carols10cents commented Jul 23, 2017

Oh ummmm one more thing I need to debug before merging: when I run the tests locally on osx with 1.19.0 (and 1.18.0), I get an invalid memory access 😱😱😱 end of output of cargo test --verbose:


     Running `/Users/carolnichols/rust/crates.io/target/debug/deps/all-d0ac81dc8b97766c`

running 132 tests
error: process didn't exit successfully: `/Users/carolnichols/rust/crates.io/target/debug/deps/all-d0ac81dc8b97766c` (signal: 11, SIGSEGV: invalid memory reference)

This doesn't make much sense to me because it worked on travis! What are you running, @kureuil? Have you ever seen this? @vignesh-sankaran once you get your dev env up, please try running the tests and report back!

@kureuil
Copy link
Contributor Author

kureuil commented Jul 23, 2017

@carols10cents I've never seen this during development of this PR. I wrote this code using Rust 1.18 and nightly-2017-07-04 on Fedora 25 and 26.

Do you have any more logs, or a coredump file ?

@vignesh-sankaran
Copy link
Contributor

vignesh-sankaran commented Jul 24, 2017

@carols10cents Yes I have run into that problem in macOS, and I don't know how to figure out what dependency is causing that problem :(. The Cargo.lock file has been updated, I'll try it again. All tests run fine on Linux, specifically Debian Sid.

The problem I'm specifically having is not knowing which part of the fn new() function handling the /crates/new endpoint requires superuser permissions. This could take me a while to figure out since I'll have to log everything happening in that function until I see where the 500 error is being thrown. @kureuil, have you run into this problem? I can't seem to publish a crate locally to the locally hosted backend unless I run it as sudo

@kureuil
Copy link
Contributor Author

kureuil commented Jul 24, 2017

@vignesh-sankaran I did not run into this problem when I was developing, and I even tested it with the local uploader. However, I did not test the local upload after the refactor of Uploader 63c3a5f as all tests were passing. The source of the problem is the path of the crate that get prefixed at line 155.
This was only done for S3 before, and is now done for every uploaders, so it is now trying to write to /crates/{name} instead of ./dist/local_uploads/crates/{name}.
Looking at the source code of the S3 crate, it is removing the leading slash of every path given to it. I really don't get why we're doing it in the uploader, I think I'll just remove the leading slash in src/uploaders.
@carols10cents any thoughts on this ? Did I miss something ?

@vignesh-sankaran
Copy link
Contributor

vignesh-sankaran commented Jul 24, 2017

@kureuil Ah yes that'd also explain the 404 I ran into when trying to run this in the local dev environment, since the directory to upload with READMEs doesn't exist. Specifically, the front end couldn't find the README file to display.

The permissions problem actually exists even on master, so I'll still have to look into this. This gives me a starting point to work with though :).

EDIT: @carols10cents, ok so I'm still running to problems on macOS, it could be one of the 4 new dependencies we're adding to Cargo.toml, which are tar, pulldown-cmark, ammonia, and syntect. We need to find out which of these have a C/C++ dependency or serves as a wrapper around said code, and see if this is a known problem with that dependency.

@kureuil
Copy link
Contributor Author

kureuil commented Jul 24, 2017

Little update on this, I think I found the source of the segfault. The culprits are git2 and syntect and their regex functions. git2 wraps libgit2 which has a regfree function (declared as a weak alias) and syntect uses the onigumura regex library which also defines a regfree function which because it is not declared as a weak alias is overriding libgit2's function. So when libgit2 think it is calling its function it is in reality calling onig's one. I don't even know how it is reliably working on linux...

I'll remove it for now and will make the syntax highlighting client-side for now. I'll try to submit another PR later when this issue will be fixed.

Copy link
Contributor

@vignesh-sankaran vignesh-sankaran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll document private types as well, see if we can get them displayed on the crates.io documentation pages as discussed here

src/uploaders.rs Outdated
fn crate_path(name: &str, version: &str) -> String {
// No slash in front so we can use join
format!("crates/{}/{}-{}.crate", name, name, version)
}

fn readme_path(name: &str, version: &str) -> String {
Copy link
Contributor

@vignesh-sankaran vignesh-sankaran Jul 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc comment please :)

@kureuil
Copy link
Contributor Author

kureuil commented Aug 13, 2017

Yeah, I have the CORS config and a custom bucket policy to make all my uploaded objects public, but I think this one is already in place.

{
    "Version": "2008-10-17",
    "Id": "Policy1397632521960",
    "Statement": [
        {
            "Sid": "Stmt1397633323327",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::kureuil-crates/*"
        }
    ]
}

@kureuil
Copy link
Contributor Author

kureuil commented Aug 14, 2017

So I tried some changes, and the one I prefer is moving the content of the "Links" section in the quick-links list and putting the other sections in the sidebar. What do you think of this layout ?

screenshot from 2017-08-14 22-59-40

@carols10cents
Copy link
Member

So I tried running the render readmes script on staging by using heroku run ./target/release/render-readmes, and it failed with this:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { code: -3, klass: 2, 
message: "failed to resolve path \'/app/tmp/index-checkout\': No such file or directory" }'
...
   9: core::result::unwrap_failed
  10: cargo_registry::app::App::new
  11: render_readmes::main

Indeed, there is no index checkout, because tmp directories aren't shared between dynos, and heroku run spins up a new dyno. So it doesn't have a git checkout.

Good news is the render readmes script doesn't actually need a git checkout! It needs a curl::Easy handle, an uploader, and a database connection. These all happen to be conveniently available if you have an App, but it isn't too terrible to get them without one.

I had to change uploader upload a bit, since we want to use the app's proxy handle for testing, but the render readmes script doesn't need to use the app-configured proxy, so I just passed in the curl::Easy handle rather than the app.

After these changes, I was able to run the render readmes script on staging!

It took like 10 seconds to do 26 versions, most of which don't have readmes, so not a very representative test.

I got the CORS settings worked out with alex and I've got this deployed on staging. Interesting ones:

So it looks like it's working! I'll probably deploy this sometime tomorrow when I'm able to monitor the processing and impact :)

@carols10cents
Copy link
Member

let's queue this up (i'm about to get on a plane so i'm not going to deploy it til later)

bors: r+

bors-voyager bot added a commit that referenced this pull request Aug 17, 2017
869: Show Contents of Readme on Crate Pages r=carols10cents

Hey everyone, I started working on README rendering earlier and I thought that I should get some feedback now that it is working correctly on my local instance.

I edited the `/crates/new` endpoint so that when a crate is uploaded to the registry, its README (markdown only) is rendered to HTML and a `readmes/{crate}/{crate}-{version}.html` file is created on S3 when the crate file is uploaded there.
Then, when visiting a crate page, the Ember application makes a request to `{s3}/readmes/{crate}/{crate}-{version}.html`: if the file is found, its content is rendered in place, if not nothing happens.
This way, the renderd README is cached and versioned, following @ashleygwilliams [recommendations](#81 (comment)).

On the `crate/version` route, the README contents is displayed in place of the package description.

The generated HTML is sanitized using [Ammonia](https://crates.io/crates/ammonia). This crate removes all tags and attributes that are not whitelisted, making the resulting markup as safe as possible for the end user. Bonus point: is uses the html5ever crate (created by the Servo project) to parse HTML.

~~~Codeblocks syntax highlighting is done server-side using [syntect](https://crates.io/crates/syntect). This crate uses Sublime syntax files to highlight code samples, which means that we automatically have access to a lot of pre-existing grammars and we can easily add/create missing ones.~~~ Syntax highlighting is now done client-side using PrismJS due to syntect's dependency onigumura conflicting with git2.

Finally, I added a new script in `src/bin` named `render-readmes`. Its goal is to render the README of every uploaded crate version and to upload it on S3. This means that if modify the renderer behavior and we want to retroactively apply it, we're just one command (and a lil' bit of time) away ! (Though I might have left too much `unwrap()`/`expect()` in it)

TODO:

* [x] Security checks in `render::markdown_to_html` (script tag, javascript callbacks, more ?)
* [x] Tests
* [x] Documentation
* [x] ~Edit delete-crate & delete-version scripts to delete README files~ It would seem that the scripts only delete database entries, not stored files
* [x] Script to render published crates README (?)

Fixes #81

Preview:

![screenshot from 2017-07-10 00-22-45](https://user-images.githubusercontent.com/1275463/27998080-f83bad7a-6505-11e7-884e-95ee4f7c2a47.png)
@bors-voyager
Copy link
Contributor

bors-voyager bot commented Aug 17, 2017

Build succeeded

@bors-voyager bors-voyager bot merged commit e28e48f into rust-lang:master Aug 17, 2017
@carols10cents
Copy link
Member

carols10cents commented Aug 19, 2017

IT'S ALIVE!!!!!! (render-readmes is still running :) but everything is looking great!!!!!)

@alfiedotwtf
Copy link

Is this documented anywhere on crates.io's website? Sorry, I couldn't find any just now

@vignesh-sankaran
Copy link
Contributor

vignesh-sankaran commented Oct 15, 2017

@alfiedotwtf Looks like it's rendering here now, was this the crate in question?. I'm not across how exactly readme rendering works, but I think it takes a little while to render server side? Otherwise it could be an add on preventing the loading of the readme on the client side?

@alfiedotwtf
Copy link

Yeah, I think that's the confusion...

I read about this pull request a while back in "This Week in Rust", and so I added a README.md to my crate this morning, but when I cargo publishd, nothing showed up.

Looking at the cargo.io documentation, I couldn't find anything on how to trigger the rendering on the crate page. So I ended up looking at other crates that did have it working, and found that to trigger this feature I needed to add the "readme" config item in Config.toml :)

@vignesh-sankaran
Copy link
Contributor

Hmm sounds like something that could be made clearer, could you perhaps open an issue up in the cargo repository? doc.crates.io's source code is currently in the cargo repo for some reason :).

@alfiedotwtf
Copy link

Ah, thanks for the pointer... done: rust-lang/cargo#4620

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants