Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce input lag by 36ms by separating frame rate limiting from update batching #166

Merged
merged 1 commit into from
Jun 12, 2020

Conversation

trishume
Copy link
Contributor

@trishume trishume commented Jun 9, 2020

So I've been doing some latency testing and tracing of xrdp and I found that after it receives a paint the computer sits idle for 40ms before sending the paint to the client. In xterm depending on the monitor and other factors I can see keyboard-to-screen latencies of 120ms with xrdp, so in those cases the 40ms sleep accounts for 1/3rd of the end-to-end latency.

I tracked it down to a timer that serves dual duty as a hard-coded frame rate limiter (25fps) and a way to wait for any more updates from the app before sending off a packet. 40ms is way longer than is needed to wait for more screen paints, so I split it into two separate delays: an inter-frame delay and a "waiting for more updates" delay. I kept a 4ms delay to wait for more updates, which is still probably much more than necessary, but it corresponds to 1/10th the previous delay and is reasonably small, still providing a 36ms unconditional latency improvement.

I've been doing some latency testing and tracing of xrdp and I found
that after it receives a paint the computer sits idle for 40ms before
sending the paint to the client. In xterm depending on the monitor and
other factors I can see keyboard-to-screen latencies of 120ms with xrdp,
so in those cases the 40ms sleep accounts for 1/3rd of the end-to-end
latency.

I tracked it down to this timer that serves dual duty as a hard-coded
frame rate limiter (25fps) and a way to wait for any more updates from
the app before sending off a packet. 40ms is way longer than is needed
to wait for more screen paints, so I split it into two separate delays:
an inter-frame delay and a "waiting for more updates" delay. I kept a
4ms delay to wait for more updates, which is still probably much more
than necessary, but it corresponds to 1/10th the previous delay and is
reasonably small, still providing a 36ms unconditional latency
improvement.
@metalefty
Copy link
Member

Hi, thank you very much for your contribution. It looks good to me at first glance, let me some more time to test.

@trishume
Copy link
Contributor Author

Cool no problem. I've been using it for a day and I've checked that it works in gnome-terminal, xterm, emacs and Chrome. I checked using my latency tester that the latency does indeed improve. I also verified with a high speed camera on testufo.com that the frame limit was still working, although the first time I did that test I realized I got a comparison flipped and had to fix that, hence the force push.

@metalefty
Copy link
Member

Fortunately, perhaps unfortunately, I've never seen 120ms delay so I don't see the latency differences in your fix. I might be insensitive but 120ms is noticeable to me. However, your suggestion is quire right. I'm fine with this. Also, replacing hardcoded magic numbers with macros is very good, thanks.

@jsorg71 Any concerns?

@trishume
Copy link
Contributor Author

Yah the 36ms improvement is only noticeable on its own if you feel for it and compare side by side, but I confirmed it's there with my latency tester, and I think these kinds of differences do subtly affect feel and they add up. If I use my Dell monitor with better latency alongside this patch I get a 30ms difference from the monitor and a 36ms difference from the patch and suddenly the latency going from 120ms to ~60ms is something I can notice and feels great.

This was just an easy thing I noticed as a prelude to trying to improve 4k performance with apps like Chrome, although I'm less sure you'll want to merge my other patches, at least without some cleanup. So far I've improved scrolling and latency in apps like Sublime Text that submit large change rectangles by switching the hashing from CRC to wyhash, and only doing the color conversion if a tile differs: https://github.com/trishume/xorgxrdp/tree/thume-remix

Haven't got around to writing up proper measurements on those changes yet though, but in the worst case I measured on a 4k monitor the capture time went from 120ms to 11ms. The wyhash header also doesn't conform to the old version of C that xorgxrdp uses.

I'm also thinking of trying to parallelize some things with OpenMP or something.

@metalefty
Copy link
Member

xorgxrdp has much room for performance improvement. It has a change get merged but as you say some cleanup will be needed.

@jsorg71
Copy link
Contributor

jsorg71 commented Jun 12, 2020

This looks good to me.

@metalefty metalefty merged commit a144f8b into neutrinolabs:devel Jun 12, 2020
@bolkedebruin
Copy link

@trishume Can I say that this is awesome? We are using vnc now as backend due to the latency xorg is giving us, which is strange of course due the the "nativeness" of xorg. This latency is very noticable. We can compare different systems and this is currently the order that users perceive as the fastest: Citrix, RDP (W10), XRDP (vnc), XRDP (xorg), so any improvement in this area is very welcome.

@trishume
Copy link
Contributor Author

@bolkedebruin Interesting. Well this patch plus the patch I linked in my other comment I haven't submitted yet should hopefully make it similar to or better than the VNC backend. I don't think I can compete with W10 RDP and Citrix yet though. That'd require some parallelizing (more annoying than I hoped) and I'm thinking even that probably won't be enough.

I use a 4k display and the absolute worst thing is scrolling, because there's just so many changed pixels so frequently. I know Citrix has some kind of scrolling detection optimization that allows it to compress by re-using a shifted version of the previous screen area. I think it might be possible to do this in xrdp by detecting areas that are a constant pixel offset in the frame then using xrdp_orders_screen_blt on that, but I haven't yet figured out if that handles overlapping regions properly.

@bolkedebruin
Copy link

@trishume I havent tried your alternative CRC implementation yet if that is what you are referring too. If it reduces (perceived) latency we will do that soon.

Indeed scrolling is the worst so if that could be improved I am all ears ;-). We also see that cpu usage of xrdp goes up significantly when scrolling. It can max out a cpu quite easily. That probably makes sense given all the screen changes, but nevertheless it was surprising to us. We had to make a cpu core available to xrdp per session to allow enough performance for the users.

I hop you continue providing patches in this area. I can offer a significant testing base to which we can roll out a new version pretty quickly.

@trishume
Copy link
Contributor Author

trishume commented Jun 23, 2020

@bolkedebruin so I implemented scrolling detection and it makes a big difference to scrolling. This branch has my patches with my USDT probes cut out including this PR, accelerating large screen updates by switching to wyhash, and scrolling detection: https://github.com/trishume/xorgxrdp/tree/new-patch or https://github.com/trishume/xorgxrdp/tree/new-patch-squashed

Note that scrolling detection has unavoidable tearing at the fringes between where it does the scroll rectangle and the updated tiles. It can also copy small tile compression artifacts making slightly noticeable color fringes. Both of these are only fixable by moving xrdp to use the RemoteFX Progressive protocol, see for instance https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-rdpegfx/0b19d058-fff0-43e5-8671-8c4186d60529 and https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-rdpegfx/f6cbf151-d83e-4f7a-a290-809e376f7722

In order to get the scrolling detection done and have fun doing it in my spare time I completely ignored the xorgxrdp code style. So the patch includes long lines and more modern C where I declare variables where I use them. I also allocate some small temporary buffers during the scroll detection. All of these are things I'm guessing would be desirable to fix if it was to be upstreamed. Fixing the code style of wyhash might be especially annoying if that's required. It could also use some more comments explaining how it works, although that would be very against the xorgxrdp code style 😛

In the mean time people who want dramatically better performance in common situations but don't care about compatibility with super old C versions (i.e. basically everyone) can use my branch. Hopefully I or someone else will bother to clean up my patch for upstreaming at some point.

@trishume
Copy link
Contributor Author

@bolkedebruin note that I've submitted these changes as a work in progress PR so I guess further comments should go on #167

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants