Browser executable-less preprocessor #1469

saschanaz · 2018-02-02T04:27:17Z

ReSpec shouldn't require installing real browsers to use respec2html.js as jsdom supports browser-like DOM manipulation in Node.js. Dropping Nightmare (and so Electron) dependency will allow easier build environment settings on GPU-less CLI environments.

The text was updated successfully, but these errors were encountered:

marcoscaceres · 2018-02-05T08:12:05Z

Definitely. I've hit this on Dreamhost too.

The problem in the past (years ago!) was that JSDOM was super-duper slow for whatever reason - specially on big specs - so it wasn't feasible to use it. That's probably been addressed since then, but someone would need to try it out.

tidoust · 2018-05-12T14:59:31Z

I thought I'd report here because I've been investigating how to make Respec run in JSDOM (for Reffy). Good news is it's mostly doable. Bad news is it's not exactly straightforward...

The tweaks that I needed to make to JSDOM and Respec in Reffy are parked in a drop-respec branch in Reffy for now. See code and comments in util.js. Here is a summary of problems I encountered and workarounds I found so far:

JSDOM does not support URL.createObjectURL:
Respec uses that to create a separate worker for the core/highlight module. Dropping that module solves the problem, but obviously means that the module does not run.
JSDOM does not support document.createRange:
Respec uses that in the core/list-sorter module. Dropping that module solves the problem, with the obvious same consequence as above.
Also note Fix temp ID syntax in list sorter #1665 related to that module: JSDOM throws an error because the generated CSS selector is invalid.
JSDOM's CSS parser is not necessarily up-to-date:
JSDOM does not like the @keyframes and @supports directives in particular and throws an error when parsing the CSS Respec injects. Not sure this really has any real impact, but since I don't need the CSS in Reffy, replacing them with @media directives does the trick.
JSDOM does not support innerText:
Given the use in Respec, replacing occurrences of innerText in Respec by textContent solves the problem.
marked behaves strangely in the presence of <pre> blocks when there are no blank lines after the block.
For instance, if you look at this markdown demo, you'll note that the generated HTML for the <pre> tag includes unexpected (and broken) <p> tags.
Now, the problem is that Respec drops blank lines before it calls that library. An example of spec where this happens is Netinfo. Interestingly, in "real" browsers, the weird tags magically disappear afterwards and the generated HTML looks clean (but the output of markdownToHtml is wrong). I do not know why. In JSDOM, the generated HTML stays ugly, and extracting the IDL afterwards is problematic.
Replacing \n with \n\n in normalizePadding to preserve a blank line fixes the issue.
JSDOM does not support cloning of attributes:
This is needed by HyperHTML. In theory, providing a polyfill is easy but constructors and prototypes are shared in JSDOM, so there is no good way to get back to the underlying document when given an attribute to clone that does not belong to one, or at least I could not find a good solution here. This requires tweaking the code of HyperHTML to call createAttributeNS itself when the attribute it wants to clone does not belong to a document.
JSDOM does not support inserAdjacentElement:
It's easy to come up with a polyfill.
JSDOM does not support closest:
Same thing, it's easy to come up with a polyfill.
JSDOM does not support matchMedia:
Given that Reffy does not need that logic, it's easy to mock the interface
JSDOM does not support fetch:
Actually, that's fine for Reffy, because I want to control network requests as much as possible and force usage of a local cache when possible.
JSDOM does not support a few visual methods on window:
In particular, blur, focus, moveBy, moveTo, resizeBy, resizeTo, scroll, scrollBy, scrollTo are not supported (and a couple of specs call scrollBy). Reffy does not need that logic, so it's easy to mock these methods.
JSDOM does not support IndexedDB:
Respec degrades gracefully when IndexexDB is not supported. It still outputs an obscure error message to say that it cannot call open on undefined (where undefined should have been IndexedDB). Detecting whether IndexedDB exists before trying to access it in Respec would allow to report a more useful error message.
Also worth noting here: the readyPromise gets rejected but, depending on the spec, the promise may be checked in another tick. When that happens, Node.js reports an UnhandledPromiseRejectionWarning, which is, as far I can tell, a false positive.
JSDOM does not seem to support the whatToShow filter in TreeWalker:
As a result, HyperHTML fails to remove the  comments it adds all over the place when it runs. That does not have any consequence.
As far as I can tell, there is no easy way to run code in JSDOM once Respec has started to run, and in particular when the respecIsReady promise is set. As a result, I need to do a bit of polling to wait for the promise to appear so that I can attach to it and detect the end of the generation. Or is there a better way?

With the workarounds and limitations mentioned above, I manage to run Reffy on all W3C specs that Reffy knows about, and the outcome seems good!

saschanaz · 2018-05-12T17:18:11Z

Thanks for sharing your experience!

Yeah, I also tried JSDOM and faced those limits. Maybe we should change the whole logic. Some personal thoughts:

We now have access to ES modules, ideally we should be able to use the same modules in both browsers and Node.js without JSDOM page load or a separate browser executable.
We currently heavily depends on browser native global DOM methods, I think we can still use JSDOM there if we make sure that we pass JSDOM document object rather than using global variables. Or viperHTML?
Workers. I hope we can load it as a module in Node.js, if worker scripts can be modules... (Update: Yes, a worker script can be a module.)

marcoscaceres · 2018-05-14T04:30:35Z

@tidoust, I thought Reffy ran against "static" documents? If we had a Reffy web service, ReSpec could simply post a static snapshot directly from the browser without needing to run any transformations on the server.

To rephrase: what is Reffy's input?

tidoust · 2018-05-14T07:33:52Z

To rephrase: what is Reffy's input?

Reffy's goal is to generate a knowledge graph of specs and an analysis of that graph to create a list of potential anomalies. To generate the knowledge graph, Reffy:

starts from a list of spec URLs (multiple lists, actually: specs-common.json, specs-cg.json, specs-whawg.json.
retrieves information about each spec from the W3C API and Specref, including the URL of the Editor's Draft
fetches the Editor's Draft of the spec
parses the Editor's Draft to extract IDL, references and links. For Editor's Drafts that use ReSpec, this is where Reffy needs to get the generated document, which is currently done using RespecDocWriter.

The input of the Reffy web service that we talked about is slightly different. It would take as input:

the URL of a spec, or a static snapshot of the generated draft
the latest version of the knowledge graph (typically from reffy-reports).

From that, it would simply have to parse the provided spec, and analyses the result against the knowledge graph. No transformation is needed on the server for that (unless the input is the raw source, obviously).

I started to develop such a service (see reffy-service). I haven't had time to test it with POST content, so it's most probably broken for now. It also lacks a README for now. Note the service needs the HTML and the URL of the spec to test. The URL is needed when the spec already exists in the knowledge graph, so that Reffy can remove it from there first before, otherwise the resulting graph would contain two specs that e.g. define the same IDL terms, and analysis would report that as anomalies.

I'm now looking at ways to make the service more robust before we deploy it on one of our servers. The investigation reported here is part of that, as an attempt to reduce the number of dependencies that would have to run on the server.

marcoscaceres · 2018-05-14T07:51:25Z

@tidoust thanks for clarifying. @saschanaz, do you know if puppeteer runs without needing a virtual screen? If yes, then that addresses GPU-less CLI environments problem in one sense... still a fat dependency, but at least we don't need to refactor code significantly to back port ReSpec to JSDOM.

saschanaz · 2018-05-14T08:12:14Z

do you know if puppeteer runs without needing a virtual screen?

puppeteer runs in headless mode by default, so yes.

marcoscaceres · 2018-05-14T08:14:38Z

@tidoust, given that Puppeteer runs without needing the virtual display, is it a viable alternative to JSDOM for Reffy?

tidoust · 2018-05-14T08:45:05Z

@tidoust, given that Puppeteer runs without needing the virtual display, is it a viable alternative to JSDOM for Reffy?

Sure! I was not reporting these issues as requests to change ReSpec. The problem here is not how ReSpec is written, it is JSDOM not yet supporting enough features for ReSpec. I just wanted to document the current limitations somewhere.

Actually, we still have a virtual screen for now on the server, but maybe that's no longer needed with Puppeteer (it was for nightmare). We just haven't tried without. But that was not the reason why I wanted to get rid of RespecDocWriter, it was mostly to simplify dependencies and avoid having to run a full chromeless browser on the server.

marcoscaceres · 2018-05-14T09:07:03Z

I just wanted to document the current limitations somewhere.

Ok cool. Sorry I was confused. I appreciate the documentation above also, so thanks for that!

We just haven't tried without. But that was not the reason why I wanted to get rid of RespecDocWriter, it was mostly to simplify dependencies and avoid having to run a full chromeless browser on the server.

Understood. Ok, if there is anything else we can do, happy to try to help. Reffy looks super cool and if we can integrate somehow directly (via POST or whatever), happy to help with that however we can!

Reffy needs to run ReSpec on Editor's Drafts. It relied on RespecDocWriter until now, which uses Puppeeter, which in turn relies on a headless version of Chrome. That requires downloading 100s of MB of dependencies and is a tad heavy for the task at hands, especially as we consider running Reffy as an HTTP service. With this update, RespecDocWriter is no longer being used. Instead, Reffy makes ReSpec run in JSDOM, which is much more lightweight (and JS-based). Problem is that JSDOM does not yet support enough features to run ReSpec. Concrete issues are documented in: https://github.com/w3c/respec/issues/1469#issuecomment-388560878 This update monkey-patches JSDOM's code to add good-enough implementations of missing features, as well as to regain control over network requests so that all network requests go through our cache. It also monkey-patches ReSpec to: 1. drop a couple of modules that cannot run because they use non-implemented features of JSDOM that cannot be easily patched 2. update the code of a couple of ReSpec dependencies for the same reason The whole thing remains fragile: updates to ReSpec or to its dependencies may well break the regular expressions used to monkey patch the code. Most of the monkey-patching should resist minor changes. A couple of them may not though.

saschanaz · 2019-03-22T05:26:33Z

@marcoscaceres This has always been my ultimate purpose, for binary-less lighter dependencies, a scriptable API, easier test without karma, and easier debug. Could we bake an experimental build with minimal features to check what should be done, although I'm currently not sure which feature to start?

Edit: My vague thought:

We create a structure that represents ReSpec document, which includes Document object, user configuration, definition maps, biblio maps, etc.
Expose an API function that receives Document and user configuration, creates the structure, and passes it to modules. For the current in-HTML script injection use, ReSpec calls it internally with the current document and respecConfig value.
Modules will receive the structure, replacing the current use of global document and the conf argument. Normalizing configuration shouldn't modify the input configuration to prevent unexpected side effect.
Modules that store its state in module level should instead store it in the structure.
Modules that add event listeners, or do anything cannot be stored statically, should instead insert an inline script to do it. (Separate static and dynamic parts #1668)

marcoscaceres · 2019-03-24T15:29:42Z

Excited to see where this ends up!

saschanaz · 2019-03-27T08:10:10Z

I got a blocker from hyperHTML, which cannot be imported on node.js. Maybe we have to fork viperhtml (which is a lot lighter than hyperhtml but only for server) to make it work.

marcoscaceres · 2019-03-27T15:18:13Z

Pinged WebReflection for guidance.

WebReflection · 2019-03-27T16:37:27Z

I got a blocker from hyperHTML, which cannot be imported on node.js

hyperHTML is 100% code covered via node, all you need to do is to write require('basichtml').init() before requiring hyperhtml.

I've used basichtml to serve output even in cloudflare so everything you need should be there.

However, viperHTML is always an alternative if you need just plain SSR 👋

saschanaz · 2019-03-28T00:26:29Z

So it looks like require('basichtml').init() defines document globally? That somehow clashes with jsdom... But thanks, TIL basichtml.

WebReflection · 2019-03-28T11:40:33Z

@saschanaz if you have jsdom, and it pollutes the global scope, hyperHTML should work already.

If you don't you can use an empty .init() or specify various things, including the object to pollute.

i.e.

const window = {};
require('basichtml').init({window});

const {customElements, document, HTMLElement} = window;
customElements.define('hello-world', class extends HTMLElement {
  connectedCallback() {
    this.textContent = 'Hello world!';
  }
});

// workaround to execute with a temporary global document
const $ = fn => {
  const oldDocument = global.document;
  global.document = document;
  const result = fn();
  if (oldDocument)
    global.document = oldDocument;
  else
    delete global.document;
  return result;
};

const {bind} = $(() => require('hyperhtml'));

$(() => {
  bind(document.documentElement)`
  <head>
    <title>This is basicHTML</title>
  </head>
  <body><hello-world/></body>`;
});

console.log(document.toString());
/*
<!DOCTYPE html><html><head>
    <title>This is basicHTML</title>
  </head><body><hello-world>Hello world!</hello-world></body></html>
*/

but then again, if you have already a document everything should just work, you just need to require hyperhtml after 👋

saschanaz · 2019-03-28T11:51:58Z

I mean require("basichtml").init(); require("jsdom"); fails. Maybe jsdom detects browser environment by checking global existence of document or something...

Currently I decided to use viperhtml for template string and parse the result with jsdom again, as I'm currently trying to remove any global or module level states. It's still a draft, so maybe I could migrate to basichtml later if it looks great enough.

WebReflection · 2019-03-28T11:59:38Z

I mean require("basichtml").init(); require("jsdom"); fails.

that is not a real-world use case, but even if you need to do that, you can pass an object instead, so nothing gets polluted, and jsdom will work.

Currently I decided to use viperhtml for template string and parse the result with jsdom again, as I'm currently trying to remove any global or module level states

You are contradicting yourself, since jsdom pollute the global scope, right?

It's still a draft, so maybe I could migrate to basichtml later if it looks great enough.

Sure, whatever works best. I don't even know what is your actual goal (had no time to read this whole thread, I just jumped in at "I got stuck due node" when all I do is using node for everything I develop on the web 😁)

saschanaz · 2019-03-28T12:00:40Z

since jsdom pollute the global scope

AFAIK it doesn't, does it?

WebReflection · 2019-03-28T12:04:41Z

actually my bad, it doesn't. I've seen tests written with global.window = new JSDOM but apparently it doesn't indeed pollute anything by its own.

Then again, neither does basichtml, it's just the handy default that never, personally, hurts.

In regards of hyperHTML it needs to do some feature detection requiring a document in a way or another.

saschanaz · 2019-10-28T04:56:23Z

@tidoust Hi! Unfortunately we found that #2187 introduces too many changes and creates additional maintenance burden to us. (Even when it might give some hypothetical wins 👀)

I want to hear your current need again to decide the future of #2187. Could you use respec2html now that we depend on headless Chromium? If you can't, what would be the blockers?

tidoust · 2019-10-28T09:19:29Z

Hello @saschanaz,

I want to hear your current need again to decide the future of #2187. Could you use respec2html now that we depend on headless Chromium? If you can't, what would be the blockers?

I don't think that there is anything blocking us from using respec2html in practice. It was more a matter of reducing dependencies to large binaries (also not to have to run a virtual screen on the server, but that's no longer needed with a headless Chromium).

The one thing that I do now, and which I don't know if I can easily do if we switch to respec2html, is to hook my own implementation of "fetch" to force the use of cached versions of resources so that I can run crawls offline (for testing purpose). That is not a requirement though, and I'm sure I'll find a way to do that.

I totally understand the maintenance burden argument. I actually considered using a forked version of Respec with #2187 baked in... and decided against it for the very same reason.

+@dontcallmedom who was the one advocating for the switch to JSDOM in the first place.

dontcallmedom · 2019-10-28T09:26:22Z

I agree it'd a nice to have; if it's too costly to maintain, then we'll absorb the cost in running with headless chromium.

saschanaz · 2019-10-28T10:08:41Z

Okay, that means this is now officially low-priority unless we find an easier way.

@marcoscaceres I still want to keep #2187 open to 1) extract some good thing from there 2) maybe just for me to experiment things. Sounds good?

marcoscaceres · 2019-10-28T13:46:50Z

Sounds good. Yes, let’s grab whatever goodies we can from #2187.

Making latest version of Respec run under JSDOM is too much of a maintenance burden in the end: https://github.com/w3c/respec/issues/1469#issuecomment-546790835 This update makes Reffy use respecDocWriter (which uses the headless browser Puppeteer under the hoods) again. This allows to run the latest version of Respec. JSDOM is still used to load specifications that do not use Respec and to detect when a specification uses Respec. Note the code no longer runs external scripts when it loads specifications: that seems unneeded in practice. The code still contains minimal monkey patching for JSDOM so because a number of specifications do have inline scripts that call `window.matchMedia`, which JSDOM does not support. This will fix #134.

saschanaz · 2020-05-19T08:16:19Z

A new possibility: someone made a JS-to-binary compiler that also supports WASM.

Not sure it's any valid, though.

saschanaz mentioned this issue May 12, 2018

Convert respec-worker as an ES module #1667

Open

1 task

tidoust mentioned this issue May 15, 2018

Have ReSpec run in JSDOM, stop using RespecDocWriter w3c/reffy#106

Merged

saschanaz mentioned this issue Dec 19, 2018

Refactor conf.definitionMap as a module #1954

Closed

saschanaz changed the title ~~Use/support jsdom to preprocess~~ Browser executable-less preprocessor Mar 22, 2019

saschanaz mentioned this issue Mar 23, 2019

feat: executable-less API #2187

Closed

saschanaz mentioned this issue Oct 24, 2019

refactor: migrate to nanohtml #2548

Closed

saschanaz added the Low priority Suggestions with no good driving force label Oct 28, 2019

tidoust mentioned this issue Nov 21, 2019

Get back to RespecDocWriter to generate specs w3c/reffy#195

Merged

saschanaz mentioned this issue Mar 11, 2021

refactor(core/base-runner): replace conf.state with run(conf, state) #3373

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Browser executable-less preprocessor #1469

Browser executable-less preprocessor #1469

saschanaz commented Feb 2, 2018

marcoscaceres commented Feb 5, 2018

tidoust commented May 12, 2018

saschanaz commented May 12, 2018 •

edited

Loading

marcoscaceres commented May 14, 2018

tidoust commented May 14, 2018

marcoscaceres commented May 14, 2018

saschanaz commented May 14, 2018

marcoscaceres commented May 14, 2018

tidoust commented May 14, 2018

marcoscaceres commented May 14, 2018

saschanaz commented Mar 22, 2019 •

edited

Loading

marcoscaceres commented Mar 24, 2019

saschanaz commented Mar 27, 2019 •

edited

Loading

marcoscaceres commented Mar 27, 2019

WebReflection commented Mar 27, 2019

saschanaz commented Mar 28, 2019

WebReflection commented Mar 28, 2019 •

edited

Loading

saschanaz commented Mar 28, 2019

WebReflection commented Mar 28, 2019

saschanaz commented Mar 28, 2019 •

edited

Loading

WebReflection commented Mar 28, 2019

saschanaz commented Oct 28, 2019

tidoust commented Oct 28, 2019

dontcallmedom commented Oct 28, 2019

saschanaz commented Oct 28, 2019

marcoscaceres commented Oct 28, 2019

saschanaz commented May 19, 2020

Browser executable-less preprocessor #1469

Browser executable-less preprocessor #1469

Comments

saschanaz commented Feb 2, 2018

marcoscaceres commented Feb 5, 2018

tidoust commented May 12, 2018

saschanaz commented May 12, 2018 • edited Loading

marcoscaceres commented May 14, 2018

tidoust commented May 14, 2018

marcoscaceres commented May 14, 2018

saschanaz commented May 14, 2018

marcoscaceres commented May 14, 2018

tidoust commented May 14, 2018

marcoscaceres commented May 14, 2018

saschanaz commented Mar 22, 2019 • edited Loading

marcoscaceres commented Mar 24, 2019

saschanaz commented Mar 27, 2019 • edited Loading

marcoscaceres commented Mar 27, 2019

WebReflection commented Mar 27, 2019

saschanaz commented Mar 28, 2019

WebReflection commented Mar 28, 2019 • edited Loading

saschanaz commented Mar 28, 2019

WebReflection commented Mar 28, 2019

saschanaz commented Mar 28, 2019 • edited Loading

WebReflection commented Mar 28, 2019

saschanaz commented Oct 28, 2019

tidoust commented Oct 28, 2019

dontcallmedom commented Oct 28, 2019

saschanaz commented Oct 28, 2019

marcoscaceres commented Oct 28, 2019

saschanaz commented May 19, 2020

saschanaz commented May 12, 2018 •

edited

Loading

saschanaz commented Mar 22, 2019 •

edited

Loading

saschanaz commented Mar 27, 2019 •

edited

Loading

WebReflection commented Mar 28, 2019 •

edited

Loading

saschanaz commented Mar 28, 2019 •

edited

Loading