Ability to compile MDAST (instead of Markdown) to JSX #2529
I doubt 70% of docusaurus time from string -> rendered react is parsing markdown.
I don’t believe in this — that it’s faster. Rust being theoretically faster will be lost by using two languages.
Indeed.
Everything’s possible — I seriously doubt it’s a good idea though.
Looks like you already know how to improve performance: not do useless work.
What? What is the
Of course, it makes lots of sense to cache things if you do repeated work! It's the same as TS: you compile TS to JS. You don't compile things every time you run them?
Also, you can. Something along the lines of the following pseudocode:

```js
import {createProcessor} from '@mdx-js/mdx'

const tree = getMdastRootFromSomewhere()
const processor = createProcessor()
const transformedTree = await processor.run(tree)
const result = processor.stringify(transformedTree)

console.log(result)
```
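Part of why this works is that mdast nodes are plain objects, so a tree survives a JSON round trip and could come from a cache file or another process. A minimal stdlib-only illustration (the tree below is hand-written; a real one would come from a parser):

```js
// mdast nodes are plain data, so a tree can cross process or cache
// boundaries as JSON. This hand-written tree stands in for parser output.
const tree = {
  type: 'root',
  children: [
    {type: 'heading', depth: 1, children: [{type: 'text', value: 'Hello'}]},
  ],
}

const serialized = JSON.stringify(tree)
const roundTripped = JSON.parse(serialized)
console.log(roundTripped.children[0].children[0].value) // -> 'Hello'
```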
Hey @slorber! Good to hear from you!
There are still more performance optimizations to be realized in
Potentially.
Crossing the boundary between Rust and JS has trade-offs depending on:
My gut feeling is that crossing right after parsing markdown is early and large enough that it may not realize much benefit. But I don't have data to back that up; it would take benchmarking.
If you're interested, plugin support in the Rust implementation is on the roadmap (wooorm/markdown-rs#32); insights and assistance are welcome!
Thanks for your feedback @wooorm, I'll try to profile better to see if I was wrong but trust you to be right that it's not worth it. Didn't know it was already possible to pass mdast to the processor.
This has always been like that historically. We bundle 2 apps with Webpack, mostly the same but with differences here and there in the output: one for the client (served to browsers) and one for the server. The server app is then run during SSG in Node.js to generate the static files.

Both apps bundle the MDX files with a Webpack loader to render them, but the subtle differences between client/server outputs don't really affect the MDX compilation, so it's useless to do this work multiple times.

I plan to explore whether it's possible to run the Node SSG phase against the client app (and remove the need for a server app), but afaik it's quite common in modern React frameworks to compile things for each env (client/server/rsc). (But I suspect we do not have the same constraints as Next.js, so it might work.)

Webpack's cache-loader was a quite popular loader (now deprecated, superseded by Webpack 5's global persistent caching) that permitted caching loader results on the file system (by default), with the ability to customize: https://v4.webpack.js.org/loaders/cache-loader/
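For reference, wiring cache-loader in front of the MDX loader looked roughly like this. This is a sketch of the deprecated webpack 4 era API; the cache directory name is illustrative:

```js
// webpack.config.js (webpack 4 era; cache-loader is now deprecated)
module.exports = {
  module: {
    rules: [
      {
        test: /\.mdx?$/,
        use: [
          // Caches the output of the loaders below it on the file system,
          // so unchanged MDX files skip recompilation on the next build.
          {loader: 'cache-loader', options: {cacheDirectory: '.mdx-cache'}},
          '@mdx-js/loader',
        ],
      },
    ],
  },
};
```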
I kind of see this from a different angle:
For TS code in your own app using loaders, you re-transpile it every time you restart your bundler, unless your bundler has a caching system. And the Webpack cache is not "shared" between different configs/compilers, because those configs may eventually affect the transformation and lead to different results.

To give a more concrete example based on this repo's mdx loader:

```js
const clientConfig = {
  name: "client",
  entry: "./some-doc.mdx",
  module: { rules: [{ test: /\.mdx?$/, use: ["@mdx-js/loader"] }] },
};

const serverConfig = {
  name: "server",
  entry: "./some-doc.mdx",
  output: { libraryTarget: "commonjs2" }, // Example of a difference
  module: { rules: [{ test: /\.mdx?$/, use: ["@mdx-js/loader"] }] },
};

await compile([clientConfig, serverConfig]);
```

You can see in the simplified example above that we compile the same app twice with a different output target (in practice the client/server entry points are slightly different). In this case the MDX file is compiled twice. I see 2 options:
Do you have a better alternative to suggest in this situation?
@ChristianMurphy I already have a POC of running our own website with Rspack. We see some perf benefits but could improve even more. Using Rsdoctor shows that the mdx and postcss loaders are the remaining bottlenecks. If we migrated to a full Rust build toolchain, I'm afraid our community wouldn't be happy if they couldn't write remark plugins in JS anymore, so I'd prefer to only move certain parts to Rust (by the way, I don't know Rust yet 😅). I may read that cpuprofile wrong, but to me CPU.20240816.154753.69358.0.001.cpuprofile
Agree that there are trade-offs to consider. Instead of exchanging large string payloads, what about sending smaller ones and letting Rust read Markdown from the file system, process it, and write the result to the file system instead of returning it? We could even batch the pre-parsing of all the mdx files we find to pre-populate caches in one pass, which loaders could then use.
And that's fair.
I think there's still a lot that can be done to improve parsing time in JS. If your team could invest resources, whether developer time or funding Titus' time, progress on parser improvements (both in JS and Rust) could be accelerated.
Crossing to the operating system layer, and especially the file system layer, would be more expensive than communicating directly between the JS and Rust processes.
Thanks, didn't know about that. Looking forward to these techniques being backported.
My team is mostly me working 2d/week on Docusaurus 😅. I'd be happy to support MDX financially if I were a Meta stakeholder, but I'm just a freelancer, and spending Meta's money on someone other than me involves bureaucracy that I'm not really aware of. We have money on Open Collective, and the last time I tried to unlock it for one of our contributors my request stalled. I could try again though. As far as I remember, it would require documenting a clear goal/reason for spending that money.
The way I see it, the loader will read from the file system anyway. And if we have a cache and make it persistent, that also means writing it to the file system, potentially earlier, with a different timing. Even if something is more expensive in terms of IO, it might still be worth it.

After all, at the bundling phase we have to read an mdx file from the file system anyway. Instead of reading it, we could read another file: the pre-parsed one. What I mean is that we would be increasing the work done before bundling (but this extra work can be done in parallel to other work we already do) and decreasing the time it takes to bundle. We might transform 4s of CPU work into 5s of IO+CPU work, and it might still be worth it due to parallelization and avoiding a waterfall.
That would be cool! “Continued development and support of MDX”? 🤷♂️
Nope, indeed. There’s also: but maybe you want/have to compile it twice, perhaps there are differences between the server and the client.
I'll ask and see what answer I get 👍 I'd be happy if we could support your work. We could probably find some projects to work on: for example, porting remark-directive to Rust, or enhancing the speed of JS remark-parse?
The only use case we have for different MDX processing in dev/prod is this 😅 Once we share compilation between client/server, this will go away. I don't think our users have different behaviors in client/server considering we only introduced this
Yes! 👍
I wonder about RSC? As I understand it, components will sometimes run on the server, sometimes on the client, hydrating the earlier server stuff. And then webpack is already making “server” bundles, and different “client” bundles, which might both include the same original
🤔 our
I don't think it's related to RSC, but as soon as you do SSR or SSG, you need multiple bundles that are not exactly the same. In our case the 2 bundles are quite similar, but with the introduction of RSC the 2 bundles could become significantly different, because server components are only in the server bundle and client components are only in client bundles (or eventually both). I don't know about Meta's resources and their RSC bundling performance problems, but I think having multiple different bundles is required.

I'm not sure what you want to cache here. We compile the same code in 2 very different ways, so I'm not sure what kind of cache can be shared between these. But to me that's not really the subject for us here, because we don't use RSCs yet.

In an RSC app using MDX, you'd probably render MDX docs on the server only because they are static, and only include a serialized JSON version in the client app. I doubt there's a good use case for adding "use client". Afaik MDX docs do not support directives, and client components are expected to be in separate files, so this is not expected to work, and MDX docs are expected to be packaged as JSX only in the server app:

```mdx
"use client"

# Hello

content

import {useState} from "react"

export function MyClientComponent() {
  "use client"
  const [state, setState] = useState(...)
  // ...
}
```

In a non-RSC app using MDX, you compile and bundle the same MDX doc as JSX for both apps. Even if the 2 apps are different, the JSX version of the MDX doc is very likely to be similar for both. That's what I want to optimize.

Note: we already use Webpack persistent caching, so rebuilding Docusaurus (including MDX) is faster. But that's not the point here: I'm looking to make cold builds (empty Webpack cache) faster by sharing the MDX compiler result across 2 loaders.
Yes, historically we haven't used the message infrastructure from unified, but I plan to in the future. Similarly, I could use processor data instead of file data.
The same remark plugins are run against each MDX file for both client/server compilers (that's the actual problem!). If we remove that condition, then the message will be logged twice for each file.
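For illustration, a plugin that reports through unified's message infrastructure attaches its messages to the file instead of printing them, so the caller can collect and deduplicate them across compilers. The plugin name below is hypothetical, and the minimal `file` object is a stand-in for the VFile that unified would normally pass; real plugins would also use unist-util-visit rather than a shallow walk:

```js
// Hypothetical remark plugin that reports via file.message, not console.log.
function remarkWarnOnTodo() {
  return (tree, file) => {
    // Shallow walk over top-level children, for brevity.
    for (const node of tree.children ?? []) {
      if (node.type === 'text' && node.value.includes('TODO')) {
        file.message('Unresolved TODO in document')
      }
    }
  }
}

// Minimal stand-in for a VFile: just enough to collect messages.
const file = {
  messages: [],
  message(reason) {
    this.messages.push(reason)
  },
}

const tree = {type: 'root', children: [{type: 'text', value: 'TODO: fix'}]}
remarkWarnOnTodo()(tree, file)
console.log(file.messages.length) // -> 1
```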
Right, that’s what I’m going after: there will be different bundles, where webpack will process files differently. If it processes 1 javascript file in 2 ways, then it makes sense that it processes 1 mdx file in 2 ways too.
I am wondering about this from the premature-optimization angle: you can do work to cache MDX now. Then you add RSC (is that not inevitable?), and perhaps then you cannot cache anymore?
Why should MDX be static?
Not yet supported, indeed. But I can imagine someone asking for this feature?
I wonder about this. Perhaps!
Even if the 2 bundles are different, I believe the mdx => jsx conversion will remain the same in both cases. Just that one will be evaluated and not the other.
MDX is content. Even if the RSC payload of a mdx file is static, it's still able to reference interactive React components. You can render MDX docs only on a server and still make interactive content. That's assuming users do not use client-only hooks directly inlined into MDX docs. But even if they do, the mdx => jsx compilation input/output will probably remain the same, no?
That's possible, but we could also decide not to support it in Docusaurus on purpose and ask for client components to always be in separate files.
Initial checklist
Problem
Compiling many MDX documents can be expensive, in particular for an MDX-based static site generator like Docusaurus.
According to profiling, 50-70% of that time is spent on remark-parse, which as far as I understand has been reimplemented in Rust and could help us improve performance:
It looks like the Rust port also supports popular syntax extensions like math and GFM (but not yet directives), so the Rust ecosystem probably has all it needs (apart from directives) for parsing Markdown for most Docusaurus users.
Problem: for flexibility and backward compatibility, we'd like to keep the ability to write remark/rehype plugins in JS, but they are not supported in Rust.
Solution
Use Rust for parsing, and JS for the rest?
What I'd like to be able to do is something like this:
Does it make sense?
Would this improve performance, or is it not worth it (due to the js<->rust cost)?
Could it work if we added a `noParse` option that removes `remark-parse` from the pipeline?

Not 100% related, but we are also currently compiling each MDX doc twice during the bundling phase in Webpack loaders, for client/server environments (and maybe soon an RSC env too?).
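The split I have in mind could be sketched as follows. This is a conceptual sketch only: `parseWithRust` is a hypothetical stand-in for a native binding that would hand back serialized mdast, and the "plugin" is a trivial transformer standing in for the real remark/rehype pipeline that would run on the deserialized tree instead of re-parsing:

```js
// Hypothetical pipeline split: Rust parses and emits mdast as JSON,
// JS deserializes it and runs the plugin transforms, skipping remark-parse.
function parseWithRust(doc) {
  // Stand-in: pretend a native binding produced this serialized mdast.
  return JSON.stringify({
    type: 'root',
    children: [{type: 'paragraph', children: [{type: 'text', value: doc}]}],
  })
}

// Trivial transformer standing in for a JS remark plugin.
function uppercaseText(tree) {
  for (const paragraph of tree.children) {
    for (const child of paragraph.children) {
      if (child.type === 'text') child.value = child.value.toUpperCase()
    }
  }
  return tree
}

const tree = uppercaseText(JSON.parse(parseWithRust('hello world')))
console.log(tree.children[0].children[0].value) // -> 'HELLO WORLD'
```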
Problems:
I'd also like to:
I hope it makes sense?
Alternatives
Keep using the MDX JS implementation, but it will likely become the main Docusaurus bottleneck once we migrate to modern tools (Rspack, LightningCSS...) according to my profiling.