Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new tidy check to ensure that rustdoc DOM IDs are all declared as expected #86178

Closed

Conversation

GuillaumeGomez
Copy link
Member

@GuillaumeGomez GuillaumeGomez commented Jun 9, 2021

We need to declare the IDs in a map in rustdoc to prevent the possibility to have conflicts when generating DOM. Until now, it was checked manually, which was very "unsafe", as the second commit of this PR proves it (second commit was moved into #86819 and merged).

cc @jsha
r? @Mark-Simulacrum

@GuillaumeGomez GuillaumeGomez added T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. labels Jun 9, 2021
@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 9, 2021
Copy link
Member

@Mark-Simulacrum Mark-Simulacrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would strongly prefer to avoid parsing free-form Rust code; rather something like the feature gate declarations which are in an ad-hoc but simple(r) DSL seem preferable, along with the requisite // START-DOM-ID-DECLS or so section, rather than looking for the function name.

I'm wary that the test as-is seems to reach into rustdoc's HTML generation and broadly speaking feels pretty invasive in terms of trying to match things, but at the same time feels pretty unlikely to be "good enough" to actually close out this problem. Do we have cases where not having this tidy check led to problems (e.g., issues, discussions, etc.)? Maybe some broader discussion amongst the rustdoc team that indicates that both a) this is a real problem and b) this solution is going to solve it sufficiently and not cause more pain than necessary?

src/tools/tidy/src/rustdoc_html_ids.rs Outdated Show resolved Hide resolved
src/tools/tidy/src/rustdoc_html_ids.rs Outdated Show resolved Hide resolved
@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 20, 2021
@Mark-Simulacrum
Copy link
Member

The second commit of this PR adds a few declarations in the id map, but it's not clear to me what value that has precisely -- it doesn't seem like the generated HTML today had any problems, so that fix is not fully clear to me as to its purpose (especially when viewed in comparison to the declaration of the id "help" as ignored, rather than adding it to the map)...

@GuillaumeGomez
Copy link
Member Author

It's prevention. It doesn't have issues in the std/core/test docs because we check them, but it very likely has issues in some other crates. The goal of this ID map is to replace duplicated IDs because you can't have the same ID used twice on the same page (it's invalid). My other comment above explain a bit more.

I'll update to fix the comment.

@Mark-Simulacrum
Copy link
Member

I understand that in theory this prevents bugs, but if we've never encountered a bug due to this I'm a little hesitant to start parsing arbitrary Rust code in tidy via regexes to try to catch it.

cc @rust-lang/rustdoc

@jyn514
Copy link
Member

jyn514 commented Jun 20, 2021

I am not super familiar with this part of the code, but I think I lean towards Mark's perspective, if we've never had bugs related to this it seems silly to spend lots of effort on it.

@Manishearth
Copy link
Member

I also lean towards Mark's perspective. Parsing rust code this way seems rather invasive and I don't think that's how we should achieve this, and I'm unsure if this check is necessary in the first place.

@GuillaumeGomez
Copy link
Member Author

I understand that in theory this prevents bugs, but if we've never encountered a bug due to this I'm a little hesitant to start parsing arbitrary Rust code in tidy via regexes to try to catch it.

We did, quite a few times. It's just that I sent PRs to add the ID in the ID map and that stopped there. But it happened. I don't have enough time to start writing checks if I don't think they're useful. ;)

However, maybe the approach here isn't the right one. If you have another idea on how to enforce it, I'd be very happy to hear it! (Parsing rust source code with regex is really not great...)

@GuillaumeGomez GuillaumeGomez force-pushed the rustdoc-id-tidy-check branch from 07e0162 to 6584dbf Compare June 23, 2021 14:22
@GuillaumeGomez
Copy link
Member Author

Updated/rebased: a new ID was created and wasn't added to the ID map (deref-methods). I can provide more explanations why having the same ID present more than once in the same page is a problem:

It's an anchor issue. For a long time, we detected them in the std when the html check we have detected that a page has the same ID more than once. We still do, but it actually doesn't cover everything, far from it. From time to time, someone tells me they found a conflict in IDs. In such case, I simply add the ID to the map and then it's fixed when the version is available to the user.

However, it happened quite a lot of time. At some point we even had two ID maps which were kinda working together but it was a huge mess. I simplified things a bit recently. Then @jsha started reworking the front-end a lot and I realized that IDs have been modified/created (and forgotten to be added to this ID map, creating new potential issues). So to prevent this problem to go unnoticed until someone comes to me directly, I wrote this check.

Now I completely agree that parsing Rust (and now HTML too because of the template) with regex is very suboptimal to say the least, but I didn't find a better idea... We need this test, but like I said, if you have a better idea on how to check the IDs, I'd be very happy to hear about it!

@jyn514
Copy link
Member

jyn514 commented Jul 3, 2021

However, maybe the approach here isn't the right one. If you have another idea on how to enforce it, I'd be very happy to hear it! (Parsing rust source code with regex is really not great...)

https://github.com/rust-lang/rust/pull/86178/files#r649044525 seems like a good idea to me, we could use JSON or something something so it's easier to parse. Parsing a 15 line file once per run doesn't seem too expensive.

@GuillaumeGomez
Copy link
Member Author

But this is half the problem like I said: we still need to check in the rustdoc source code that IDs are actually used, and that IDs in the rustdoc source code are declared inside the map. Making their declaration better is only a small part of the problem.

@jyn514
Copy link
Member

jyn514 commented Jul 3, 2021

@GuillaumeGomez I don't know how to test all the IDs are used, but you can check they're in the map by forcing all the IDs to come from the map, something like this:

// pretend this has once_cell or something
static HTML_IDS: &[&str] = parse_json(include_str!("src/librustdoc/html_ids.json"));
write_str!(buffer, HTML_IDS[html_id::TRAIT_IMPLEMENTATIONS_LIST]);

@GuillaumeGomez
Copy link
Member Author

But unfortunately that doesn't work inside the tera template files. Also, it doesn't prevent someone to use id=... (we can always try to catch them in reviews, but it's not the best fail-safe...).

@jyn514
Copy link
Member

jyn514 commented Jul 5, 2021

Does your PR currently work for tera files?

Also, it doesn't prevent someone to use id=... (we can always try to catch them in reviews, but it's not the best fail-safe...).

Sure, but it's a lot easier to catch "you used an id without going through the html_ids module" than "the hard-coded ID you used needs to be updated in html::markdown" IMO. In particular my idea will catch refactors from Rust to JS and the other way around automatically without needing review.

@GuillaumeGomez
Copy link
Member Author

Does your PR currently work for tera files?

Yes. I simply added the .html files to the extensions. Since it's a regex, it doesn't care much...

Sure, but it's a lot easier to catch "you used an id without going through the html_ids module" than "the hard-coded ID you used needs to be updated in html::markdown" IMO. In particular my idea will catch refactors from Rust to JS and the other way around automatically without needing review.

But that would still require this regex to check it. ;)

Also, we currently don't check in js because there is simply no way to do so...

JohnTitor added a commit to JohnTitor/rust that referenced this pull request Jul 7, 2021
… r=jyn514

Clean up rustdoc IDs

I cherry-picked the commit from rust-lang#86178. It adds missing rustdoc IDs (for the HTML) and remove unused ones.

cc `@camelid`

r? `@jyn514`
@bors
Copy link
Contributor

bors commented Jul 7, 2021

☔ The latest upstream changes (presumably #86920) made this pull request unmergeable. Please resolve the merge conflicts.

@GuillaumeGomez GuillaumeGomez force-pushed the rustdoc-id-tidy-check branch from 708711b to 03ed00e Compare July 7, 2021 09:29
@GuillaumeGomez GuillaumeGomez removed the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Jul 12, 2021
@crlf0710 crlf0710 added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jul 31, 2021
Copy link
Member

@Mark-Simulacrum Mark-Simulacrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still a little uncertain that this is worthwhile to pursue, but if we do, I would like some comments a the top of the tidy file added describing the failure modes and how we're handling them. For example: new way of generating IDs is added to rustdoc. How do we catch that? (Or, if we don't, that is useful to document too).

I think some discussion of the goals would also be useful. At least, embedding

We need to declare the IDs in a map in rustdoc to prevent the possibility to have conflicts when generating DOM. Until now, it was checked manually, which was very "unsafe", as the second commit of this PR proves it.

into the tidy check top-level doc comment would already be useful, but expanding on what is unsafe about it (i.e., what we're trying to prevent) and how this mitigates that would be good, I think. Part of the problem is that the rustdoc ID map is not documented anywhere (that I can find quickly) which means that there's no basis for understanding its purpose without doing some digging around and having background knowledge of HTML's ID "limitations".

src/tools/tidy/src/rustdoc_html_ids.rs Outdated Show resolved Hide resolved
src/tools/tidy/src/rustdoc_html_ids.rs Show resolved Hide resolved
check_id(path, trimmed.split('"').skip(1).next().unwrap(), ids, line_nb, bad);
is_checking_small_section_header = None;
}
} else if trimmed.starts_with("write_small_section_header(") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, it is currently too easy for this function to be renamed or argument order changed -- there's no comments that I can see indicating it's special. I think we should at least have comments to the definition. But still, parsing arbitrary Rust code seems worrisome (easy to get wrong).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GuillaumeGomez Could you add a doc-comment to write_small_section_header() saying that calls to it are parsed by tidy?

Also, this check seems somewhat fragile since it will only check calls that don't have any non-whitespace characters before them. Could you instead use something like this?

Suggested change
} else if trimmed.starts_with("write_small_section_header(") {
} else if trimmed.contains("write_small_section_header(") && !trimmed.contains("fn write_small_section_header(") {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the doc comment and updating the check.

@camelid camelid added the T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) label Sep 13, 2021
@GuillaumeGomez
Copy link
Member Author

Until now, it was checked manually, which was very "unsafe", as the second commit of this PR proves it.

This part of the PR description seems to be out of date...? I don't see any changes to the contents of the ID map.

Good point, I updated the PR description.

@camelid
Copy link
Member

camelid commented Sep 15, 2021

We need to declare the IDs in a map in rustdoc to prevent the possibility to have conflicts when generating DOM.

Can you explain this part to me? In what situations can we have ID conflicts? IIUC, all the IDs in init_id_map are used in static HTML generation, so they should only appear once. Why do they need to be tracked in the IdMap?

I would expect that only "dynamic" IDs (e.g., #tymethod.foo on a struct page since there could be multiple trait impls with the function foo) would need to be in the IdMap, and they can't be in init_id_map because we don't know them ahead-of-time.

@GuillaumeGomez
Copy link
Member Author

It's when we generate HTML from markdown (for titles). In order to avoid IDs to be duplicated, we list the ones used by rustdoc (the "static" ones, not the "dynamic" ones). Basically, the ones that don't go through our ID map if we don't put them ourselves.

@camelid
Copy link
Member

camelid commented Sep 15, 2021

It's when we generate HTML from markdown (for titles). In order to avoid IDs to be duplicated, we list the ones used by rustdoc (the "static" ones, not the "dynamic" ones). Basically, the ones that don't go through our ID map if we don't put them ourselves.

Ah, that makes sense, I forgot about headings 👍

@camelid
Copy link
Member

camelid commented Sep 15, 2021

I just realized that #85833 (comment) is a real-world example of the fragility of the write_small_section_header( check. I'm not sure what to do about this PR; it seems like we should have some kind of tooling to prevent ID conflicts, but this approach doesn't seem robust enough.

@GuillaumeGomez
Copy link
Member Author

Yep... I think it's better than nothing while we look for a better idea.

@camelid camelid added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 15, 2021
@GuillaumeGomez GuillaumeGomez removed the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Sep 29, 2021
@GuillaumeGomez
Copy link
Member Author

ping

@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 19, 2021
@camelid
Copy link
Member

camelid commented Oct 30, 2021

I think this check might not work with some of the IDs that were landed in the scrape-examples PR. Closing and re-opening (to re-run CI) to see for sure.

@camelid camelid closed this Oct 30, 2021
@camelid camelid reopened this Oct 30, 2021
@GuillaumeGomez
Copy link
Member Author

@camelid CI passed but just in case I rebased on master.

@rust-log-analyzer

This comment has been minimized.

@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 5, 2021
@bors
Copy link
Contributor

bors commented Dec 5, 2021

☔ The latest upstream changes (presumably #91356) made this pull request unmergeable. Please resolve the merge conflicts.

@camelid
Copy link
Member

camelid commented Dec 10, 2021

I'm going to close this PR for the reasons I listed on Zulip. Sorry for taking so long to follow up on this.

@camelid camelid closed this Dec 10, 2021
@GuillaumeGomez GuillaumeGomez deleted the rustdoc-id-tidy-check branch August 19, 2024 12:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.