Automate detection of dead doc links #15257

Closed
silverwind opened this issue Sep 8, 2017 · 16 comments
Labels
doc Issues and PRs related to the documentations. stalled Issues and PRs that are stalled.

Comments

@silverwind
Contributor

Links in docs get regularly broken (example), and it should be possible to have a script that iterates over all links in the docs and checks for an HTTP response code of < 400.

Probably not something we want to run as part of the CI, but I could see the script being run on demand regularly.
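
A rough sketch of what such an external check could look like, using only Node core modules and assuming a recent Node version; this is just an illustration, not existing tooling, and redirect handling, retries and concurrency limits are left out:

'use strict';

// Sketch: read URLs (one per line) from stdin and report any that do not
// answer with an HTTP status code < 400.
const http = require('http');
const https = require('https');
const readline = require('readline');

function check(url) {
  const lib = url.startsWith('https:') ? https : http;
  const req = lib.request(url, { method: 'HEAD' }, (res) => {
    if (res.statusCode >= 400) console.log(`${res.statusCode} ${url}`);
    res.resume();
  });
  req.on('error', (err) => console.log(`ERR ${url} (${err.message})`));
  req.end();
}

readline.createInterface({ input: process.stdin })
  .on('line', (line) => { if (line.trim()) check(line.trim()); });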

@silverwind silverwind added the doc Issues and PRs related to the documentations. label Sep 8, 2017
@Trott
Member

Trott commented Sep 8, 2017

I wonder if this might be something for @nodejs/website to figure out.

@phillipj
Member

phillipj commented Sep 8, 2017

IIRC @mikeal once said he created something for crawling every link found on a website?

@tniessen
Member

tniessen commented Sep 8, 2017

> that iterates over all links in the docs and checks for an HTTP response code of < 400.

This is probably not enough: changing headings within a page changes the #hash, so links won't jump to the correct section anymore. We will need to parse the retrieved documents as well.

@vsemozhetbyt
Contributor

Maybe we can use puppeteer for this.

@vsemozhetbyt
Contributor

vsemozhetbyt commented Sep 8, 2017

A strawman with puppeteer for simple detection of wrong hashes (intra-document links only):

script
'use strict';

const { URL } = require('url');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Load the single-page version of the API docs.
  const { href, origin, pathname } = new URL('https://nodejs.org/api/all.html');
  await page.goto(href);

  // Collect same-page links whose #hash does not resolve to any element.
  const wrongLinks = await page.evaluate((mainOrigin, mainPathname) => {
    return [...document.body.querySelectorAll('a[href]')]
           .filter(link => link.origin === mainOrigin &&
                           link.pathname === mainPathname &&
                           link.hash !== '' &&
                           document.body.querySelector(link.hash) === null)
           .map(link => `${link.innerText} : ${link.href}`)
           .join('\n');
  }, origin, pathname);

  console.log(wrongLinks);
  await browser.close();
})();

Currently, it detects these links:

output

cluster.settings : https://nodejs.org/api/all.html#clustersettings
verify.update() : https://nodejs.org/api/all.html#crypto_verifier_update_data_inputencoding
verify.verify() : https://nodejs.org/api/all.html#crypto_verifier_verify_object_signature_signatureformat
verify.update() : https://nodejs.org/api/all.html#crypto_verifier_update_data_inputencoding
verify.verify() : https://nodejs.org/api/all.html#crypto_verifier_verify_object_signature_signatureformat
TCP-based protocol : https://nodejs.org/api/all.html#debugger_tcp_based_protocol
Http2Session and Sockets : https://nodejs.org/api/all.html#http2_http2sesion_and_sockets
ALPN negotiation : https://nodejs.org/api/all.html#alpn-negotiation
ServerRequest : https://nodejs.org/api/all.html#http2_class_server_request
stream.pushStream() : https://nodejs.org/api/all.html#http2_stream-pushstream
readable._destroy : https://nodejs.org/api/all.html#stream_readable_destroy_err_callback
readable._destroy : https://nodejs.org/api/all.html#stream_readable_destroy_err_callback
stream._destroy() : https://nodejs.org/api/all.html#stream_readable_destroy_err_callback


@silverwind
Contributor Author

silverwind commented Sep 8, 2017

I was thinking more of external links when I opened this, but it's good to have checks for those relative links too. For external ones, I think a simple status-code check should be enough.

For the internal ones, I could see a check like the above being part of the CI, but I don't think we can include puppeteer in the repository; it's just too heavy.

@ghaiklor
Contributor

Also, there is a tool called html-proofer that can be used for such things. I use it on some of my static pages to check that all resources (images, stylesheets, etc.) exist, and it also checks for any broken links on your website.

@vsemozhetbyt
Contributor

vsemozhetbyt commented Sep 10, 2017

A more meticulous and tangled variant for internal link checking (for hash-only links and for inter-document links within the doc site). It still uses puppeteer, so it is not suitable for the repo or the CI, but it can be used locally from time to time.

The current run has resulted in #15293 and #15291.

@TimothyGu
Member

Could we use something Node.js-based like jsdom or cheerio instead of Puppeteer? The latter sounds like overkill to me, while cheerio might even be small enough to be bundled in core.
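
For comparison, a cheerio-based equivalent of the hash check above might look roughly like this; it is only a sketch, and the out/doc/api/all.html path plus the id/name-anchor assumption are mine, not something the doctool guarantees:

'use strict';

// Sketch: run the same intra-document hash check without a browser, over a
// locally built all.html (path assumed; adjust to wherever the docs are built).
const fs = require('fs');
const cheerio = require('cheerio');

const $ = cheerio.load(fs.readFileSync('out/doc/api/all.html', 'utf8'));

$('a[href^="#"]').each((i, el) => {
  const hash = $(el).attr('href').slice(1);
  // Accept either an id or a named anchor as the link target.
  if ($(`[id="${hash}"]`).length === 0 && $(`[name="${hash}"]`).length === 0) {
    console.log(`${$(el).text()} : #${hash}`);
  }
});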

@TimothyGu
Member

Or even better, a Markdown-based solution that could possibly be integrated with the doctool.
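
As a very rough idea of the Markdown-based direction, link extraction could start from the sources in doc/api (the path is an assumption and the regex is deliberately naive), with the resulting URLs fed into whatever status-code check is used for external links:

'use strict';

// Sketch: pull every http(s) link out of the Markdown sources so it can be
// piped into an external status-code check. Deliberately simplistic.
const fs = require('fs');
const path = require('path');

const dir = path.join(__dirname, 'doc', 'api'); // assumed location of the docs
const linkRe = /https?:\/\/[^\s)\]>"']+/g;

for (const file of fs.readdirSync(dir)) {
  if (!file.endsWith('.md')) continue;
  const text = fs.readFileSync(path.join(dir, file), 'utf8');
  for (const url of text.match(linkRe) || []) {
    console.log(`${file}: ${url}`);
  }
}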

@bnb
Contributor

bnb commented Sep 25, 2017

@TimothyGu I had someone PR Danger as a CI/CD tool for a markdown-only project of mine to detect broken links. It may be useful to run on docs updates? (See the sketch after the links below.)

http://danger.systems/js/
https://github.com/danger/danger-js
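
If Danger were ever adopted, a dangerfile could at least flag doc-touching PRs as candidates for a link-check run. A very rough sketch, where the doc/ prefix and the warning text are assumptions and the actual link check is left abstract:

// dangerfile.js — sketch only.
// `danger` and `warn` are globals provided by danger-js when this runs in CI.
const docChanges = danger.git.modified_files.filter((f) => f.startsWith('doc/'));

if (docChanges.length > 0) {
  warn(`Docs changed (${docChanges.length} file(s)); consider running the dead-link checker.`);
}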

@jasnell
Member

jasnell commented Aug 12, 2018

There's been zero activity on this in 11 months. I recommend closing.

@jasnell jasnell added the stalled Issues and PRs that are stalled. label Aug 12, 2018
@timaschew

timaschew commented Aug 12, 2018

I wrote a Node-based tool with an API similar to html-proofer's.
I wrote it because html-proofer is too slow for > 1000 pages.
Since my tool uses cheerio, which is built on htmlparser2, it's very fast.

https://github.com/timaschew/link-checker

@vsemozhetbyt
Contributor

FWIW, the internal doc system is checked now (see #21889), so we only need external link validation.

@jasnell
Member

jasnell commented Oct 17, 2018

Still no actual activity on this; should we keep it open?

@silverwind
Contributor Author

silverwind commented Oct 17, 2018

I agree, better to close it then.
