design server / client IPC for sharing framebuffer #4

Closed
raisjn opened this issue Oct 29, 2020 · 16 comments

@raisjn
Collaborator

raisjn commented Oct 29, 2020

background

The rM2's framebuffer is not well understood, but we know it requires a software thread (SWTCON) to drive it. This thread runs with high priority and is responsible for using data from a memory location to drive the contents of the display.

We are able to get access to the framebuffer update APIs by using LD_PRELOAD while loading the remarkable-shutdown binary and hardcoding their memory addresses. The update API is similar to the way the rM1 updates: it takes a rectangle and a few arguments (like waveform_mode) and updates the display.

We propose using a client/server model for interacting with the framebuffer, instead of each application starting its own SWTCON. (The Supernote A6 uses a similar design: only one process interacts with the framebuffer and the rest use IPC to communicate with it.)

proposal

  • have a server process that drives the SWTCON. It opens a shared memory location (/dev/shm/swtfb) and listens over IPC for clients to invoke the update API.
  • client process can write to the shared memory location and then invoke the IPC API to tell the server which region to update and any additional flags.

Underlying technology:

  • use shm_open (POSIX) shared memory
  • use msgrcv/msgsnd (SysV) message queue
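
A minimal sketch of how the two pieces named above might fit together. The message layout (`swtfb_update_msg`) and the helper names are hypothetical and are not taken from the rm2fb source; the shared memory name `/swtfb` is assumed to correspond to `/dev/shm/swtfb` on Linux.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/msg.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wire format: one update request per message. */
struct swtfb_update_msg {
    long mtype;         /* required first field for SysV queues, must be > 0 */
    int x, y, w, h;     /* rectangle to refresh */
    int waveform_mode;  /* passed through to the update API */
};

/* Map the shared framebuffer. `name` would be "/swtfb" for /dev/shm/swtfb.
 * Returns NULL on failure. */
void *swtfb_map(const char *name, size_t size)
{
    int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return mem == MAP_FAILED ? NULL : mem;
}

/* Client side: ask the server to refresh a rectangle. Returns 0 on success. */
int swtfb_send_update(int msqid, int x, int y, int w, int h, int waveform_mode)
{
    assert(w > 0 && h > 0);
    struct swtfb_update_msg msg = { 1, x, y, w, h, waveform_mode };
    /* The size passed to msgsnd excludes the leading mtype field. */
    return msgsnd(msqid, &msg, sizeof msg - sizeof msg.mtype, 0);
}

/* Server side: block until the next request arrives. Returns 0 on success. */
int swtfb_recv_update(int msqid, struct swtfb_update_msg *out)
{
    ssize_t n = msgrcv(msqid, out, sizeof *out - sizeof out->mtype, 0, 0);
    return n < 0 ? -1 : 0;
}
```

In this sketch a client would draw into the mapping returned by `swtfb_map`, then call `swtfb_send_update`; the server loops on `swtfb_recv_update` and hands each rectangle to the real update API.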

remarks

advantages of using a server/client model

  1. the LD_PRELOAD method is fragile and awkward to launch; isolating that technique to one place will make it easier to manage
  2. easier for different languages to interface with - they only need to make IPC calls
  3. with a client/server API for the framebuffer, we are opening up the path to supporting split applications.
  4. with shared memory, the framebuffer can be screenshotted easily. if each application uses its own SWTCON, they have to expose their framebuffer's memory for screenshotting

potential pitfalls

  1. the API across IPC needs to be stable or upgradeable so that rm2fb doesn't have mismatches with client versions
  2. we are creating extra headache by having to create and maintain a server
  3. we might not achieve low latency drawing if we use IPC
  4. what's the point of server/client? what will you do when SDK comes out?
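
One common way to handle the first pitfall is to prefix every IPC message with a version header. The sketch below assumes a hypothetical header layout and a hypothetical compatibility rule (same major version, client minor not newer than the server's); nothing here is taken from rm2fb itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header sent before every request. */
struct swtfb_msg_header {
    uint16_t major;   /* bumped on incompatible changes */
    uint16_t minor;   /* bumped on backwards-compatible additions */
    uint32_t length;  /* payload bytes that follow the header */
};

/* Returns true if a server speaking server_major.server_minor can safely
 * handle a message with header h. */
bool swtfb_header_compatible(const struct swtfb_msg_header *h,
                             uint16_t server_major, uint16_t server_minor,
                             uint32_t max_payload)
{
    assert(h != NULL);
    if (h->major != server_major)
        return false;               /* incompatible protocol generation */
    if (h->minor > server_minor)
        return false;               /* client uses fields we do not know */
    return h->length <= max_payload; /* reject oversized payloads */
}
```

With a rule like this, an older client keeps working against a newer server, while a mismatched major version is rejected up front instead of being misparsed.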

open questions

  1. do all clients share the framebuffer memory or does each client request its own?
  2. do we deal with input sharing now or later?
  3. what does API look like?

prototype

see master branch

@Eeems
Collaborator

Eeems commented Oct 29, 2020

> with shared memory, the framebuffer can be screenshotted easily. if each application uses its own SWTCON, they have to expose their framebuffer's memory for screenshotting

We likely will still have to deal with this for launchers since xochitl uses its own SWTCON, right? Which means that if you use a launcher, this solution will not provide you with a mechanism to just "Screenshot the screen" if xochitl is then started by the launcher.

@raisjn
Collaborator Author

raisjn commented Oct 29, 2020

> > with shared memory, the framebuffer can be screenshotted easily. if each application uses its own SWTCON, they have to expose their framebuffer's memory for screenshotting
>
> We likely will still have to deal with this for launchers since xochitl uses its own SWTCON, right? Which means that if you use a launcher, this solution will not provide you with a mechanism to just "Screenshot the screen" if xochitl is then started by the launcher.

at the minimum, we need to deal with this for xochitl, yes. it would be cool if we can use LD_PRELOAD to expose xochitl's framebuffer memory through shared mem since the APIs use a pointer internally. the ideal case is to write something that searches for the memory location (and is not binary specific) but we can use hardcoded memory addresses for each binary hash of xochitl like reStream if it comes to it.

@raisjn
Collaborator Author

raisjn commented Nov 1, 2020

after playing with rm2fb.so and xochitl and having them run / update the screen, it's possible for the SWTCON to get out of alignment and only one process will be able to draw, while the other process' SWTCON will be misbehaving. after a while, the misbehaving process may be able to draw again, but the other process will then have trouble.

unable to draw = draws lighter than it should or draws more black pixels / lots of contrast.

after talking to bokluk, this behavior seems like a reason to only have one SWTCON running at a time. alternatively, we can figure out what specifically causes this behavior and try to prevent it, allowing multiple SWTCONs to exist.
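
If only one SWTCON should run at a time, cooperating processes could guard its startup with an advisory lock. The sketch below uses `flock` on a lock file; the function name and the idea of a lock file are assumptions for illustration. It would not stop xochitl, which knows nothing about the lock.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Try to become the single SWTCON owner.
 * Returns an open descriptor that holds the lock, or -1 if another holder
 * exists (or the lock file cannot be opened). The caller keeps the
 * descriptor open for as long as it drives the display; closing it, or
 * exiting, releases the lock. */
int swtcon_try_lock(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        close(fd);  /* someone else already owns the SWTCON */
        return -1;
    }
    return fd;
}
```

Because the kernel drops the lock when the holder dies, a crashed owner does not leave a stale lock behind.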

@Eeems
Collaborator

Eeems commented Nov 2, 2020

> after talking to bokluk, this behavior seems like a reason to only have one SWTCON running at a time. alternatively, we can figure out what specifically causes this behavior and try to prevent it, allowing multiple SWTCONs to exist.

So until this is solved, we won't be able to properly port launchers that expect to run before xochitl?

@raisjn
Collaborator Author

raisjn commented Nov 3, 2020

> > after talking to bokluk, this behavior seems like a reason to only have one SWTCON running at a time. alternatively, we can figure out what specifically causes this behavior and try to prevent it, allowing multiple SWTCONs to exist.
>
> So until this is solved, we won't be able to properly port launchers that expect to run before xochitl?

i'm not sure. i think it is easy(ish) to return to the days of a launcher only launching one app (then we never have problems with multiple SWTCON running), but in that case the launcher will have to re-calibrate its own SWTCON when an app it launched dies.

@Eeems
Collaborator

Eeems commented Nov 3, 2020

> > > after talking to bokluk, this behavior seems like a reason to only have one SWTCON running at a time. alternatively, we can figure out what specifically causes this behavior and try to prevent it, allowing multiple SWTCONs to exist.
> >
> > So until this is solved, we won't be able to properly port launchers that expect to run before xochitl?
>
> i'm not sure. i think it is easy(ish) to return to the days of a launcher only launching one app (then we never have problems with multiple SWTCON running), but in that case the launcher will have to re-calibrate its own SWTCON when an app it launched dies.

Restarting the launcher's SWTCON and force redrawing might be an option. It'll be kinda slow, but at least it would work.
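
The "launch one app, then recover" flow discussed here could look roughly like the sketch below. `launcher_run_app` and the `restore` callback are hypothetical names; the callback stands in for whatever restarting the SWTCON and forcing a full redraw would involve, which is not shown.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the real "restart SWTCON and redraw everything" step. */
static int example_restore_calls;
static void example_restore(void) { example_restore_calls++; }

/* Run argv[0] to completion, then call restore().
 * Returns the child's exit status, or -1 if it could not be started or
 * did not exit normally. */
int launcher_run_app(char *const argv[], void (*restore)(void))
{
    assert(argv != NULL && argv[0] != NULL && restore != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)     /* retry only when interrupted by a signal */
            return -1;
    }

    restore();                  /* the app is gone, reclaim the display */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Since the launcher blocks in `waitpid` while the app runs, it never draws at the same time as the app, which is the property this approach relies on.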

@raisjn
Collaborator Author

raisjn commented Nov 4, 2020

> Restarting the launcher's SWTCON and force redrawing might be an option. It'll be kinda slow, but at least it would work.

an interesting converse here (that you made me think of with your comment) is to restart xochitl whenever we return to it. i will try this out to see how it does

@raisjn raisjn pinned this issue Nov 6, 2020
@raisjn raisjn unpinned this issue Nov 6, 2020
@pl-semiotics

I don't have an rM2 (at least at the moment), but I have done some work on rM1 (most notably, I wrote the VNC server which some folks use to effectively stream the framebuffer via snooping on the MXC_SEND_UPDATE ioctls).

I'm trying to understand the rM2 framebuffer in order to, among other things, see if said server can be ported to it while maintaining the good efficiency we have on rM1. I was wondering if I am understanding the rM2 fb architecture correctly:

  • User threads send mxcfb_update structures not to the kernel via ioctl but to another thread
  • This thread does something complicated involving shared memory from some (which? zero-sugar_defconfig doesn't seem to load either mxcfb driver) and possibly REGAL/-D dithering algorithms that we do not understand, in order to produce actual output

Is this approximately correct? I would also be curious whether any of you have had success in reverse engineering the interface between the framebuffer-update thread and the actual hardware (and whichever kernel module this goes through).

I do rather have some concern around this (client/server) approach with regard to latency, as it seems that it unavoidably introduces an extra two context switches due to the involvement of an extra userspace process and potentially an extra copy. If we need a single shared SWTCON thread, I would be interested in whether we could potentially move it into the kernel, replacing xochitl's calls into it with those into the kernel. I do not have a good enough understanding of the back end architecture of the SWTCON thread (based on the files I see here, and without my own device, unfortunately) to be clear on whether or not this would introduce one extra context switch into xochitl, but it seems like it would still be better than trying to run it in another userspace process.

@ddvk
Owner

ddvk commented Nov 12, 2020

@pl-semiotics the architecture is: the app writes to a buffer, then the buffer is copied/mangled/transformed with eink waveforms and written to /dev/fb0 and from there to the screen, the driver being mxsfb (all that was previously done by the epdc).

the client/server approach was chosen because:

  • we have no other way of using the framebuffer atm
  • only 1 app can use it after initialising, as there is some state in the screen

@pl-semiotics

pl-semiotics commented Nov 12, 2020

@ddvk thanks for the elaboration, that matches roughly what I was expecting (unfortunately for me). I'm a bit curious where update region detection and pushing an update to the display happens at the moment, though---as far as I understood, the actual display driver IC would most of the time remain idle, and only actually apply nonzero voltages to the display when it is informed of a region being updated? The mxsfb seems to provide a much more normal framebuffer interface, and doesn't do anything to detect writes to it, so I'm a bit unclear on how the driver knows when to actually apply the update waveforms that are being written to that area of memory. In particular, mxsfb seems to be designed to expose the embedded LCDIF block on i.MX and as best as I can tell from a quick read of the driver, seems to be configuring it in a dotclock mode where indeed a pixel's worth of data will be sent out every pixclock, which doesn't seem at all the appropriate way to drive an e-ink display.

I understand entirely that rationale for the client/server approach. My thought is more: does it perhaps make sense to move the "server" portion into the kernel, for best performance? (Admittedly, this may degrade native xochitl slightly, if the update regions are in fact otherwise being magically DMA'd by the device somehow; but it'd still be better than a userspace-shared-memory solution). And this way we can even provide a compatible update ioctl() with rM1. (I wonder also whether mxc_epdc_v2_fb can be ported to be ~as performant as the binary userspace component, since I thought other i.MX7Dual devices used it).

@raisjn
Collaborator Author

raisjn commented Nov 12, 2020

regarding performance, i would like to be methodical.

that is to say:

  • quantify the cost of context switching on rm
  • quantify ipc cost
  • quantify screen update cost for du mode and small area
  • quantify screen update for large area

i don't want to pursue an avenue for perf benefit until we have a sense of timings and see that the context switches or ipc comm are really a problem.

my sense is that the biggest perf benefit you get is only sending over dirty regions for vnc, which will save large amounts (100-200 ms to save whole buffer vs 5-20 ms for small dirty regions while pen is drawing)
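
To make the dirty-region argument concrete, the sketch below merges dirty rectangles and compares the bytes involved. The 1404x1872 resolution at 2 bytes per pixel used in the usage note is an assumption for illustration, not something stated in this thread, and the helper names are made up.

```c
#include <assert.h>
#include <stddef.h>

struct rect { int x, y, w, h; };

/* Smallest rectangle covering both a and b. An empty rectangle
 * (w <= 0 or h <= 0) contributes nothing. */
struct rect rect_union(struct rect a, struct rect b)
{
    if (a.w <= 0 || a.h <= 0) return b;
    if (b.w <= 0 || b.h <= 0) return a;
    int x0 = a.x < b.x ? a.x : b.x;
    int y0 = a.y < b.y ? a.y : b.y;
    int x1 = a.x + a.w > b.x + b.w ? a.x + a.w : b.x + b.w;
    int y1 = a.y + a.h > b.y + b.h ? a.y + a.h : b.y + b.h;
    struct rect out = { x0, y0, x1 - x0, y1 - y0 };
    return out;
}

/* Bytes of pixel data inside r. */
size_t rect_bytes(struct rect r, size_t bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    if (r.w <= 0 || r.h <= 0)
        return 0;
    return (size_t)r.w * (size_t)r.h * bytes_per_pixel;
}
```

Under the assumed geometry, a full frame is 1404 * 1872 * 2 = 5,256,576 bytes, while a 32x32 pen-sized region is 2,048 bytes, which is why sending only dirty regions should dominate any other saving.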

context switches are on the order of a few microseconds, as is the ipc communication.

I'm not against kernel driver, it makes sense, but i wouldn't want to put qt code in the kernel, so we'd need open impl. before making kernel driver.

re mxsfb: i didn't think the driver is aware of dirty region, rather the swtcon would be.

@pl-semiotics

pl-semiotics commented Nov 12, 2020

@raisjn Sorry, I should have been clearer---my performance concerns are not at all about the VNC solution, for which other (e.g. network) latencies are certainly orders of magnitude beyond those provided by the driver thread situation. (For that, I'm mostly interested in seeing how far down the stack I can hook in so as to maximize compatibility with different userspaces). I am interested, however, in ensuring that xochitl performance is not degraded when e.g. doing windowing things, since even a few ms there may be quite noticeable (I have half designed (if not yet quite implemented due to lack of time) a compositing approach for RM1 based on virtual FBs that intercept the update ioctls, which is able to provide quite good performance and eliminate an extra userspace process from the fastpath for display updates); and I also think it would be pleasant to provide a compatible interface with RM1. My first instinct is also that it feels rather cleaner to put this portion of hardware driving in the kernel; and there is perhaps a certain aesthetic advantage to not dipping back into another userspace thread here.

Yes, I would certainly agree that trying to load the existing binary driver in the kernel would be a bad idea :)

I suppose my question is how swtcon is conveying the dirty region to the hardware, as it certainly does not appear to be doing so via mxsfb, and I'm not seeing any other obvious (enabled) device drivers in the kernel source. I could imagine that certain writes to the mapped address space are doorbells for the epd controller (and this would probably be the one case where I'd argue to not add anything to the kernel; but rather (most likely) to try to figure out how to allow application swtcon threads to work together, perhaps by storing the relevant display state in shared memory), but I don't see any relevant configuration happening in mxsfb, and the memory space seems to come from dma_alloc_writecombine() rather than any hardware provided memory-mapping region.

@raisjn
Collaborator Author

raisjn commented Nov 12, 2020

> (I have half designed (if not yet quite implemented due to lack of time) a compositing approach for RM1 based on virtual FBs that intercept the update ioctls, which is able to provide quite good performance and eliminate an extra userspace process from the fastpath for display updates); and I also think it would be pleasant to provide a compatible interface with RM1

very cool! for our client ipc, we are using ld_preload to shim rm1 apps and pretend to be the mxcfb driver. (see issue #12), and hopefully in the future (rm2fb v2) we can start doing cool things around split screen apps, windowing, etc
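
For readers unfamiliar with the technique, a bare-bones LD_PRELOAD `ioctl` shim might look like the sketch below. It is not the rm2fb client: `SHIM_SEND_UPDATE` is a placeholder value (a real shim would take the request number from the mxcfb headers), `shim_forward_update` is a hypothetical hook, and the fallback uses a raw `syscall` where many shims would instead look up libc's `ioctl` with `dlsym(RTLD_NEXT, "ioctl")`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdarg.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Placeholder request number, NOT the real update ioctl value. */
#define SHIM_SEND_UPDATE 0x1234UL

/* The pointer argument is passed to syscall() in a long-sized slot. */
static_assert(sizeof(void *) == sizeof(long), "pointer must fit in a long");

/* Hypothetical hook: a real shim would translate update_data into an IPC
 * message for the server. Here it only counts calls. */
static int shim_update_count;
static int shim_forward_update(void *update_data)
{
    (void)update_data;
    shim_update_count++;
    return 0;
}

/* When this file is built as a shared object and loaded with LD_PRELOAD,
 * this definition is found before libc's ioctl. */
int ioctl(int fd, unsigned long request, ...)
{
    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    /* A real shim would also check that fd refers to the framebuffer. */
    if (request == SHIM_SEND_UPDATE)
        return shim_forward_update(arg);

    /* Everything else goes to the kernel untouched. */
    return (int)syscall(SYS_ioctl, fd, request, arg);
}
```

Such a file is typically built with `gcc -shared -fPIC` and activated by setting `LD_PRELOAD` to the resulting `.so` before starting the unmodified application.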

do you have a design doc for the compositor that can be shared?

> but rather (most likely) to try to figure out how to allow application swtcon threads to work together, perhaps by storing the relevant display state in shared memory),

this is something we are also thinking about

@Eeems
Collaborator

Eeems commented Nov 12, 2020

> (I have half designed (if not yet quite implemented due to lack of time) a compositing approach for RM1 based on virtual FBs that intercept the update ioctls, which is able to provide quite good performance and eliminate an extra userspace process from the fastpath for display updates); and I also think it would be pleasant to provide a compatible interface with RM

@pl-semiotics When you get around to implementing it, please let me know. I'd love to pull that into Oxide.

@raisjn
Collaborator Author

raisjn commented Nov 18, 2020

I'm closing this out as there is now a server / client implementation that uses LD_PRELOAD hooks and is mostly transparent to rM1 applications.

A new issue can be opened for v2 design questions: per app framebuffers, dirty region emitting, etc

@raisjn raisjn closed this as completed Nov 18, 2020
@pl-semiotics

pl-semiotics commented Jan 3, 2021

@raisjn @Eeems This is perhaps not the right place to mention this, but since you seemed interested in my compositor plan---now that I'm done with the vnc server updates that were taking up most of my remarkable-related time, here are a couple of updates. I was originally going to write my own minimalistic compositor protocol implemented in-kernel with virtual framebuffers and a number of optimizations, but I'm now (seeing that the RM2 does everything in userspace, not using PXP/etc. hardware at all) thinking that it's probably better to just reuse Wayland (especially since one ought to get shared-memory Xwayland ~for free in that environment). If there are performance issues, it may be worth adding some protocol extensions for single-buffering and cropping buffers/matching stride sizes to reduce the number of copies needed, but my plan is to get a MWE by hacking together a wl_shm style compositor that draws to the rM1/2 framebuffer (and wrapper that emulates the RM1/RM2 native interface sufficiently for xochitl to draw to wayland---although for RM2 this is of course probably something that directly replaces the libqsgepaper EPFramebuffer with one backed by wayland).
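
The remark about cropping buffers and matching stride sizes can be illustrated with a plain region copy. This is a generic sketch, not Wayland API code: `copy_region` is a made-up helper, and strides are in bytes. When the two strides differ, every row needs its own `memcpy`; when they match and the region spans whole rows, one call covers the lot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy the w x h pixel rectangle at (x, y) from src to the same position
 * in dst. dst_stride and src_stride are row lengths in bytes and may
 * differ between the two buffers. */
void copy_region(uint8_t *dst, size_t dst_stride,
                 const uint8_t *src, size_t src_stride,
                 size_t x, size_t y, size_t w, size_t h,
                 size_t bytes_per_pixel)
{
    assert(dst != NULL && src != NULL);
    assert((x + w) * bytes_per_pixel <= dst_stride);
    assert((x + w) * bytes_per_pixel <= src_stride);

    size_t row_bytes = w * bytes_per_pixel;

    /* Fast path: identical layout and full-width rows, one contiguous copy. */
    if (dst_stride == src_stride && row_bytes == dst_stride) {
        memcpy(dst + y * dst_stride, src + y * src_stride, h * dst_stride);
        return;
    }

    /* General path: one copy per row, skipping the padding between rows. */
    for (size_t row = 0; row < h; row++) {
        memcpy(dst + (y + row) * dst_stride + x * bytes_per_pixel,
               src + (y + row) * src_stride + x * bytes_per_pixel,
               row_bytes);
    }
}
```

A compositor that can negotiate the same stride as the framebuffer stays on the fast path, which is the saving the proposed protocol extensions would be after.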
