-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parts of xterm.js should run in a web worker #1515
Comments
Any drafts/thoughts yet into this direction? From my limited parser centric view it seems the renderer and the data input + parsing are fighting over cpu time so my first wild guess is to move the all core parts (data input + parser + most of the terminal class) into a webworker and let only the DOM related stuff reside in the main thread. This will require some event translation, but thats already part of the new layered proposal. Tricky part will be to get the buffer data moved to the main thread without much runtime delay (SharedArrayBuffer might solve this in the future without a copy but is currently deactivated in all engines due to Spectre, so we gonna have to do it "the hard way"). |
My thinking is to move core/base into a worker and then ui/base/public runs in the UI thread. How buffer state is synchronized still needs some thinking. |
One idea is to maintain the buffer only in the worker, and then have a more renderer friendly view model in the main thread that can be incrementally updated by the worker. We could also rewrite core in rust as a WebAssembly module, but I think that is out of scope for now 😅 |
Yeah that would be my preference, I think monaco needs a more complete view of the buffer though.
Yeah, seems like quite the effort to undertake 😄. Cleaning up the TS beats this out at least for now imo |
I gave this some more thoughs, here's a rough overview of how it could work. The UI could subscribe a view and can specify the amount of rows it wants to see, and also the offset of rows to either top or bottom. The core will than track buffer changes that affect that viewport and emit updates with update instructions. The goal should be that the UI does not need to store any state, it should simply consume the updates and draw / redraw based on them. // subscribe to a new view
const view = terminal.createViewport({ rows: 20, offset: 0, sticky: true });
// listen for view updates and modify the UI accordingly
view.onUpdate((changeDecorations /*changes*/, terminalState /*infos*/) => {
// TODO:
// specify how changeDecorations could look like, maybe like this:
for (let change of changeDecorations) {
switch (change.type) {
case ChangeType.DELETE: {
// some rows have left the viewport
}
case ChangeType.ADD: {
// some rows have been added to the viewport
}
case ChangeType.MODIFY: {
// exstinig viewport rows have been modified
}
}
}
});
// terminal viewport is scrolled in the UI, update the viewport offset
myViewportElement.addEventListener('wheel', (evt) => {
// calculate the new offset
let offset = ...
// NOTE: this will likely trigger the onUpdate callback if things have changed
view.setOffset({ offset, sticky: offset > 0 });
}) |
@mofux Hmm not sure about this - in theory the inner terminal core does not need to hold any data of lines, that are beyond the actual terminal height, to work properly. Only exception to this is resizing, which will occur not very often. Maybe we can turn this into an advantage and only hold the "hot parts" of the buffer in the core and delegate the scrollback data buffering to the renderer? It will need a backpropagation for resize events though. I have not thought deeper into this, just some random thoughts so far for me. |
@jerch The scrollback has to be maintained somewhere, and I think it is better to have it in the core for this reason: |
@mofux Yupp you are right, totally forgot the addons. From an API design perspective it also seems more convenient to hold scrollback data in the offscreen part. What bugs me with the current buffer design is the fact, that we mostly hold rendering info in the buffer for every cell, even for the scrollback data, which cant be accessed by normal "terminal means" anymore (thus wont change anymore, except resize once it should support reflow). This is a very memory hungry layout. With your second layout sheet we might be able to create some less expensive scrollback buffer (the purple part) that is closer to rendering needs (merged attributes and cells maybe?), while the orange part still gives quick access to all attributes on a cell by cell basis. I am still worried about the amount of data that we have to send between the worker threads to get something new shown. Also not sure yet, how addons will play into this, some gut feeling tells me that we might have to build different API level entries if we want critical parts to be customizable from outside (like a core, pre and post render API). |
One more thought regarding the architecture (and the title of the issue): Instead of just running core in a webworker, it would be nice (and in most cases even more useful) if we consider to run core in a server-side node process (close to where the pty process is), which then talks to the UI (browser) via RPC / websockets. Technically, there shouldn't be much of a difference between communicating with a webworker vs. communicating with a server process, it's only a different transport layer. |
@mofux Imho this a groundshaking thing, there are several pros and cons for it, what comes into my mind: pros:
cons:
Dont take the last point to serious, this is more rant than an objective argument. I still tend to the client core thingy for a simple reason: creating a transport layer for bytestrings is much easier/faster than handling different API states across components with different machine locations, imho. But I might be wrong with that. |
@jerch Thanks for your thoughts, really appreciated! Please consider the ideas mentioned above as a testbed for discussion in order to develop a vision for future enhancements. The main goal I'm seeing is to separate the ui from the core, so we can offload the work performed by the parser to a separate thread, unblocking the UI thread. The second goal is to shape an interface (contract) that allows core and ui to talk to each other via some kind of IPC (be it
I think we misunderstood here. I'm thinking of core as being a DOM independent piece of JS code that can run in the Browser, a WebWorker or Node.js without any porting. It should definitely be up to the developer to decide where core should run (depending on his use-case). My thought was to decouple core and ui as much as possible, so we could potentially support all these scenarios.
It depends. If we only send incremental updates to the ui this shouldn't be much of a problem.
At the moment we are sending all the pty data to the Browser frontend (with the same latency burden), which is really painful in scenarios where xterm.js is running in a browser that is far away from the server. I've seen situations where the data stream from the pty would just spam the websocket so hard that it would eventually disconnect because pings didn't come through anymore. Catching all the data at the server and only sending view updates to the browser (maybe throttled) could improve this situation. The thing is, we can't skip processing parts of the pty data stream because it has to consistently update the buffer - we have to eat it all. But we can skip / throttle updates of a view that reads from a consistent buffer state. And that's where I see the big advantage of running core at the server side. We're not forced to send the whole pty stream to the client anymore, we only send view updates. Commands like |
👍
Yupp. There is a small but though - if we only send incremental updates the frontend part needs a way to hold the old data and merge the new onto. Oh - and the backend part needs some abstraction to filter updated content from old stuff. Maybe we can establish some buffer cache key thing on row level or even better (for amount of comm) / worse (for runtime + memory usage) on cell level (a real "write-through" cell data thingy). About the latency / server-client-comm update thing: Last but not least I think possible changes from #791 should be part of the considerations, some of my suggestions there might raise the burden to get easy and cheap updates delivered (esp. my storage & pointer approaches there might end up being contradictional). |
The canvas renderer already does this for the viewport: xterm.js/src/renderer/TextRenderLayer.ts Line 21 in 5620da4
|
Closing as out of scope in the interest of keeping the issue list small as this probably won't happen for years if at all. |
Reopening this as it would be a nice direction to go and ensure the main thread stays responsive during heavy workloads. I played with web workers a bit lately and I would imagine it work work by using a shared array buffer if it's supported, and if hand the write buffer data over to the worker (and not persist in the main thread). The main challenge imo is how you're meant to easily get embedders to leverage the workers as there are multiple files required then, xterm.js may also get bundled into a different position. |
@Tyriar Some investigations I did in that field: Shared array buffer (SAB) with own atomic read/write locks is the fastest way to get data "moved" between worker threads. Its still loosing 20-30% for thread sync / locks compared to single threaded promise code moving data around (tested without any workload on the data, so those numbers are just for the data "moving" part). Downside of this solution: hard to get done right (and maintain?), may not work in all browsers due to Spectre security patches. Normal object transfer is okish and the only valid fallback if SAB + atomics are not available. Runs ~50% slower than a SAB solution, thus might penalize screen update latency (no clue if this will be perceivable in the end). Since the buffer is made of typed arrays a SAB solution could be easily done. Still a fallback might be needed to cover engines without SABs. Last but not least the worker-mainthread interface will be challenging since we have so many events/callbacks into browser code, that need to be plugged without race conditions. |
Yes we'd definitely need a fallback as Safari doesn't have SAB for example. My explorations in workers and SAB is that they're pretty awesome, you just need to think about how fallbacks and packaging would work (2 build targets? rely on dynamic imports? how do embedders bundle?). Good point with callbacks and events, what exactly we would pull into the worker is not clear cut at all. You can see by zooming into a profile that arguable parsing isn't the expensive bit: So maybe in an ideal world the buffer should also be a shared array buffer owned by the worker thread and only events that indicate explain which ranges changed or something would be sent over to the main thread. Another thing to think about is how does transferControlToOffscreen fit into all this. It would be extremely cool if we have a renderer thread, a parse/buffer thread and the main thread barely does anything, keeping the application extremely responsive even when heavy parsing/rendering is going on. It's definitely clear that this would be a pretty radical change though, no way I'll have time to play with this any time soon but it's always fun to talk about. We can think about this as we shape the architecture and maybe do small parts of it first (like having the webgl renderer optionally run in a worker). I also want an issue to point duplicates at for VS Code dropping many frames when the terminal is busy. |
Indeed. This could even be made lockfree for the normal PRINT action (which in conjunction with Yes the webgl renderer seems to be a perfect fit to test out new paths into this direction as only webgl context is currently allowed for the offcanvas. Not even sure if we can gain anything for the DOM renderer here - Would it be faster to pre-construct the DOM strings in a worker and move the content as a big update batch to the mainthread? Idk, never tested anything like that. Edit: @bufferWorker
@main
class Terminal ... {
// not decorated things get spawned on all thread targets listed on the class
public doXY(...) {...}
// only in bufferWorker
@bufferWorker
public doSomethingOnBuffer(...) {...}
} It is just a rough idea atm, not even sure if decorators can be mis-used for that type of macro-precompilation stuff. It certainly would introduce another precompile step, still it would make those definitions on code level much easier. Oh well, it also would break with IntelliSense, so might not be a good idea at all. Is there anything in TS to do macrolike precompile tasks? |
Since all related issues in vscode repository are closed and they point to this issue, I am going to ask here: is there any workaround for vscode to be able to print long lines without freezing? This problem sometimes makes vscode unusable, as some packages print their debug information in a sequence of a very long lines. |
@arsinclair This is def. a bug in some terminal buffer consuming handler in vscode, xterm.js itself does not have that issue. Imho the right way to fix this would be to address the issue in the slowpoke function. Hacky workaround: |
@arsinclair you're probably seeing microsoft/vscode#100338 |
@arcanis With slowpoke function I mean some code that eats your precious CPU cycles for non obvious reasons. The linkifier would be a candidate, if you have very long auto wrapping lines. It does not matter if there are any links in your output, identifying them itself may take pretty long. |
What about it?! |
Yes, this is exactly the case.
The problem is, once I scroll to the long lines block, the CPU load jumps to 100% and it never goes down. I've tried waiting for several hours to see if it ever finishes. I am neither xterm.js, nor vscode developer, so I am not familiar with the codebase and I don't know why it is happening and how to fix it. |
I was thinking a bit about this today. This is a very high level diagram of what we have today: So what parts would we want to move to the worker? We could for example move Here's my proposal which was covered somewhat above but without pretty pictures: The small buffer line We would allow the parser protocol to also run in the main thread (via dynamic import) when SAB/workers are not supported or not enabled: So with the above we would come up with some strictly defined "parser protocol", and then allow it to be wrapped in a web worker (eg. Using this architecture we could also reduce latency by hooking the threads up directly to the pty: The refactor would involve:
Then later:
* I also explored combining all |
Here's a little proof of concept that uses a worker to change parts of the first buffer line: https://github.com/Tyriar/xterm.js/tree/worker_poc |
Also after thinking about |
Imho to answer the question "what should go into a worker" should be carefully chosen from runtime metrics we see so far. From there we can identify a working goal for xterm.js:
The last question can be answered pretty quick - everything thats needs browser DOM API has to stay on main. Currently thats mostly final IO stuff - "I" for input from keyboard and mouse (thus all DOM event handlers), and "O" for everything that makes its appearance on the screen. Final here means - everything from the first (input) or last (output) stages, while we could slice into the functionalities sooner or later (from callstack perspective) and put things either on worker or main side. (Sidenote - I am not sure whether offscreen canvas is yet in a usable stage to move output completely off from main, my last tests regarding this for the image addon was not so promising, as engines diverge too much in supported features.) Regarding the first question things cannot be answered that easy, as it depends alot on the used sequences in the incoming data. For pretty common
For very data intensive payload (e.g. image data for the image addon) utf8 decoding and escape sequence parsing will jump to ~25%, but meanwhile it stays far below 20% of the workload. From an isolation of concerns perspective both (utf8 decoding and sequence parsing) could be easily made faster by different approaches (up to 4 times faster with wasm), but the overall benefit would be small, if we dont get the "buffer writing" any better, as that causes ~50% of the runtime for common And thats the point where I want to question your schematics above, where the buffer still resides on mainthread side. Imho we wont get much of a relief on the mainthread, if we go that route. Additionally the parser protocol in between will create work on top eating alot of the benefit. (Sidenote: I currently remove the image decoder worker thread for that reason from the image addon sixel path, as it puts too much work on top making everything worse, while sixel in wasm is so fast with much shorter "data lines"). So how about that:
(Note thats only a quick writeup not covering the details yet. And as always - those details kinda would answer the 2nd question, as things moved to the "other side" might get really hard to maintain/expose.)
In the beginning of the buffer rewrite back in 2018 I started with exactly that. This is really much better performance-wise than the current single bufferline abstraction. (Cannot quite remember, but I think the speed benefit was ~2.5x for normal buffer actions, incl. resize.) In the end we did not go that path back then because of the need for strict custom memory management, which felt way too C-ish, and we had no clean story how to ensure that from |
The 2 big things that happen on the main thread are processing input via write buffer/input handler/etc. and rendering. I'm pretty convinced at this point that the rendering should remain on the main thread in order to ensure all rendering occurs in the same frame. For one we have it performing pretty well now, and secondly issues like microsoft/vscode#145801 will happen again if we moved it, but AFAICT be impossible to fix. This class of problem is much more noticable on lower specced hardware but even on good hardware it's too big of a hit on the UX imo not having synchronized rendering. Additionally, I think moving the rendering over to another thread will end up being too much of a and never end up happening because of that. So that just leaves processing of input. Note that a big part of this comes in when the embedder does a lot of other things, such as has many other terminals that may be busy, other UI components fighting for CPU time (monaco, VS Code's workbench, etc.).
Yes, this plan would only make sense if the additional parser protocol would not cause additional overhead. I believe we could pull it off by having most interactions with the buffer (and flags, etc.) done via Let's dig into some of the expensive examples you gave what would live on worker and what would happen on main. Let me know if this doesn't sound right so I can fix my understanding:
100% worker (-3-5% off cpu)
100% worker (-5-8% off cpu)
Worker for everything except:
Worker for everything except:
So given the above what we would communicate back are events like: interface IProtocolMessage {
events: IProtocolEvent<unknown>[];
}
interface IProtocolEvent<T extends ProtocolEvents> {
type: T
payload: ProtocolEventTypes[T]
}
const enum ProtocolEvents {
Scroll,
AddLineToLink.
CursorChange,
A11yChar
}
interface ProtocolEventTypes {
[ProtocolEvents.Scroll]: { lineBuffer: SharedArrayBuffer, more? };
[ProtocolEvents.AddLinkToLine]: { id: number, y: number };
[ProtocolEvents.CursorChange]: undefined;
[ProtocolEvents.A11yChar]: { char: string };
}
I did consider this, but I think that would end up turning a bunch of the API asynchronous?
This is where the proposal above shines as it feels like it could actually happen. If we're only moving the input processing, I don't think there is any impact on the existing API where everything is synchronous and easy to use.
I wrote a nicely abstracted IPC layer for Luna Paint to go to/from the extension host to a webview. That's where I figured out the neat type mapping trick above ( We already dynamically import xterm.js to avoid this on startup unless absolutely required, reducing it will help with UI responsiveness and unblock other features that may be initializing if the terminal was open on launch.
That's a similar speed up to what I was seeing resizing larger drop to when I was trying to estimate impact.
I don't mind going more C-ish for stuff like this as that lets the JIT do its thing much more effectively, plus it's pretty much the only way to squeeze more big gains out of JS without going WASM. We're basically doing this low level stuff all over the place anyway with the buffer management. I think the I put that little exploration down as something interesting we may want to do later, it doesn't need to happen for the web worker stuff as all the worker needs to know about are the mutable ~15+ buffer row arrays. So a message over can be assembled quickly like: postMessage({
rows: [sab, sab, sab, ...]
}); I guess these will typically already be on the worker side anyway so we maybe only need to send them main -> worker on startup? |
@Tyriar Oh well, sorry I totally overread the SAB notion - yes that would make the buffer protocol a no-brainer and allow direct writes. It has another intriguing advantage - in the worker we can use all sync code up to atomic waits, thus the parsing should be much faster again. Currently the stack unwinding is really painful (and the reason why the worker approach in the image addon takes so long), which is not the case anymore with the parser in a worker - we can simply place a "HALT" atomic flag there, which stops at certain positions w'o screwing up stack state. (This could also be used to indicate a pending hook from outside waiting for completion.) |
Yes I agree. To me it still seems to be almost impossible to achieve reliable off main rendering across all engines. (Firefox only supporting webgl ctx type, Safari lacking offscreen canvas in total, and proper webgl in particular). For the interface you stubbed - looks about right. At this point I think we really should invest the time to check if we get The messaging to/from the worker is still somewhat painful - we have to take care, that there is no message congestion possible (doing most via SAB state flags will avoid most of that). And last but not least - we need to adopt parser hooks to the worker concept - a simple straight forward way maybe just places a "HALT" on the parser worker, executes the hooks on mainthread as done currently and unlocks the worker afterwards. A more involved interface would allow to place hooks in the worker context (maybe a future extension). |
Good point with parser hooks. Some ideas:
The hand over from the worker was quite fast when I tested it, like on the order of < 1ms, at least for my very simple PoC (would be worse in a busy event loop). So I'm not actually sure this would impact the UX much with all async hooks, unless you used a lot of hooks. |
Yes, processing foreign hooks in main while is the worker is "HALT"ed might impose bad perf again, if the hooks are often (like for every single line). Though imho it still will be much faster if done with a HALT-atomic on sync worker code, than currently possible with the stack unwinding. Certainly being able to place the hooks into the worker will be best performance-wise, but comes with several restrictions:
This reduces the usability of worker-added hooks to tiny notifiers or pure reentrant inline functions operating only on symbols defined in the worker context - not very useful if you ask me. Imho the HALT-flag and execution on main side is here the more likely use pattern for most foreign hooks. Imho we should measure the impact of a HALT atomic, I think the perf overhead is pretty small (but true, it will create a tiny perf overhead for its synchronization needs). About the HALT flag - this could be implemented schematically as follows: // worker side - within sequence parser:
...
if (customHooksRegistered(currentSeq)) {
notifyMain(currentSeq);
Atomics.store(bufferSab, haltFlagPos, 1);
Atomics.wait(bufferSab, haltFlagPos, 1);
}
...
// mainthread
parserWorker.onSequence(currentSeq => {
// walk all registered hooks and exec
...
// finally
Atomics.store(bufferSab, haltFlagPos, 0);
Atomics.notify(bufferSab, haltFlagPos, 1);
}); Big advantage here - if we ensure to always pull values from the SAB (no local variables in the parser) we dont have to repair any stack variables, thus the blocking/unblocking works directly by issuing these 4 atomics instructions. (Another advantage: a hook can still be async, the final part can also be issued by a thenable) |
@jerch 👍 I haven't played with Some of the events like |
I didn't see an issue tracking this yet.
Blocked by:
Related:
The text was updated successfully, but these errors were encountered: