Event notification based version of vc_dispmanx_snapshot? #440
Comments
Stumbled at random chance onto the alkemir/gertCloner repository, which exists to duplicate output to DPI and HDMI displays simultaneously, suggesting that this is not "only a SPI display" problem. Polling If there existed a notification callback version of the |
The framebuffer/GPU updates are driven by the display frequency (in most cases; there is a slight proviso on 90 and 180 degree display rotations), so vsync would probably be the option to use here. I'm not sure there is any other scope for adding an extra callback. If I understand what you are asking for, it's a callback for when the image changes, and I don't think that would be particularly easy or efficient to implement.
I modified gertCloner to use the 'vsync callback' method, which gives an interrupt at 60Hz. The original gertCloner does the same as fbcp and has a dumb sleep. I found that when cloning DPI to HDMI with a dumb sleep you would get visible tearing and stuttering; changing to use the vsync callback produced perfect frames. It seems that the existing callback works great in this case (DPI to HDMI), but I guess for SPI screens the question is 'if nothing changes, why should I send anything', which is what this issue is about?
I think that the current vsync callback is a signal "from the wrong end of the chain", if that makes sense. That is, imagine one has the following kind of native C++ GLES2/EGL style application (a bit pseudo):

    int main()
    {
        initEGL();
        eglSwapInterval(0);
        for(;;)
        {
            updateAndRender(); // (*)
            eglSwapBuffers(); // (**)
        }
    }

where the

In my fbcp-ili9341 driver, I do already implement the vsync callback, but it's not working out great, because it triggers every 16.6667 msecs, whereas if I'm busy spinning to poll new frames in a loop while playing a PAL NES game, I'm obtaining frames at regular 20 msec intervals (50fps). This 50Hz vs 60Hz mismatch then causes jittery latency in the obtained frames, leading to microstuttering. I find I'm getting best results when I busy spin a dedicated thread to do snapshots as fast as possible. This gives me the lowest latency for obtaining newly produced frames from the application's point of view, but the polling then kills CPU usage, so I end up using heuristics to guess the earliest time to start polling for the next frame.
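To make the 50Hz vs 60Hz mismatch described above concrete, here is a minimal sketch. It is not taken from fbcp-ili9341: the function name `vsyncDeliveryIntervalsUs` and the idealized, drift-free timing are assumptions for illustration only. It models content frames produced at a fixed period and a consumer that only looks for a new frame on each vsync tick, and returns the spacing between the ticks at which a new frame was actually picked up.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: content frame k is produced at k * framePeriodUs, and the
// consumer only checks for frames at vsync ticks j * vsyncPeriodUs.
// Returns the time between successive ticks at which a *new* frame was picked up.
std::vector<int64_t> vsyncDeliveryIntervalsUs(int64_t framePeriodUs,
                                              int64_t vsyncPeriodUs,
                                              int numVsyncs)
{
    assert(framePeriodUs > 0 && vsyncPeriodUs > 0);
    std::vector<int64_t> intervals;
    int64_t lastFrameIndex = -1; // no frame picked up yet
    int64_t lastPickupUs = 0;
    for (int j = 0; j < numVsyncs; ++j)
    {
        int64_t nowUs = j * vsyncPeriodUs;
        int64_t newestFrame = nowUs / framePeriodUs; // newest frame produced at or before this tick
        if (newestFrame != lastFrameIndex)
        {
            if (lastFrameIndex >= 0)
                intervals.push_back(nowUs - lastPickupUs);
            lastFrameIndex = newestFrame;
            lastPickupUs = nowUs;
        }
    }
    return intervals;
}
```

Calling `vsyncDeliveryIntervalsUs(20000, 16667, 60)` in this model yields only intervals of one or two vsync periods, never the 20000 us at which the content was actually produced, which is the uneven pacing perceived as microstutter.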
To minimize the bytes transferred on the SPI bus, I do already do per-pixel diffs on the obtained frames to avoid wasting excess byte transfers. That side of the flow is not an issue, but the latency/jitter part is. The same 50hz vs 60hz mismatch jitter will be present also on DPI if using the vsync callback. (though even doubly so if the DPI display is driven at its own vsync as well) Ideally, there would be some way to hook into getting notifications about the "swap chain" (Direct3D lingo) of produced frames that the application side has generated (at
This does remind me of an issue due to fixed sleep times in the original fbcp code. Instead of a fixed
I don't think it's important to have a callback for when the image pixels change, but just for when the application/framebuffer chain has pushed a new frame out for the dispmanx chain to process. If the application produces the same frame hundreds of times in a row, then the callback would fire once for each (per-pixel diffs would not be nice to do in dispmanx or anything like that).
AIUI there's no such thing. What constitutes a new frame in your proposed scheme? Every DispmanX update with a different source (effectively what James proposed)? What if multiple things are contributing DispmanX elements - do they all generate a callback? Nothing stopping me setting up 16 QVGA video clips as a mini-video wall, so you'll now be getting an event off each - prepare for 480 events a second if they're 30fps clips. Also please note that the Linux frame buffer is not double-buffered, therefore you'd never get any updates from that at all. Whilst I can appreciate what you're trying to achieve, I can't think of a suitable scalable mechanism for doing so. |
Thanks for the detailed description! This is very informative.
What happens if I miss adding any images between two vsync events? The display will not show(?) a black screen, so somehow it knows that it should not have presented a "dropped frame" on the screen, suggesting that there is something that would exist to tell "there's now a full set of new frame data to push, please present/swap this on the screen".
If I understand this description correctly, then it would indeed be ideal to receive 90 callbacks in this scenario.
The application always has some kind of API call it uses to tell "whatever I've rendered up to now, it's now a completely formed image, please present that on the screen all at once" (eglSwapBuffers, SDL_Flip, glXSwapBuffers, ...). I admit I'm not familiar enough with the Linux framebuffer API or DispmanX to know how these frame delimiters flow through the internals, but my understanding is that such information does get carried through, because when one calls
I'm not familiar enough with the DispmanX lingo here - is there a built-in primitive to synchronize these DispmanX elements to form a frame? Let's say my application is building up this huge QVGA video wall, and it's taking up 300 msecs to create the initial first frames of each element. My understanding is that the DispmanX compositor would not start presenting half of them on the screen (the remaining half of the screen staying e.g. black) at 16msecs intervals throughout this 300msecs period, but only when I tell DispmanX "I've now produced all elements this frame, please swap/present", the composited contents from all of them are swapped out to the HDMI controller to display? That would be the event that would be ideal to hook onto here. If the API doesn't work like that, and there's no such thing that applies, then perhaps what I would be looking for are 480 events per second here in this scenario if the elements are unbuffered. Searching at https://www.raspberrypi.org/forums/viewtopic.php?t=33615, it mentions a code snippet
perhaps that is the frame synchronization event that the native applications that implement EGL/X11/framebuffer/etc. swapping? In that case, I think that would be the event at which we'd like to hook onto here. Does that sound plausible? And thanks for guiding the conversation! |
Searching further, I find raspberrypi/firmware#355, and perhaps what we're looking for here is to hook in the middle to get an event callback from functions |
Then it retains the previous image passed for the element.
The swapbuffers type calls are effectively making updates to the display list atomic. DispmanX level is vc_dispmanx_update_start and vc_dispmanx_update_submit_sync. BUT that is for only one client. vc_dispmanx_snapshot() in a tight loop is effectively saying "please compose the scene as currently defined to this offline buffer". Yes it is independent of the display VSYNC because it is requesting a new pass through the hardware to the memory buffer.
Yes, that's the crux - it's someone else's DispmanX calls that you are wanting to hook on to. To my mind firmware#355 is unrelated. They were requesting a callback when their own resources actually got updated, and had hit a genuine bug. |
Would DispmanX allow in this kind of scenario each of the 16 programs to have one visible cell of a 4x4 grid on the display, and then each application is driving video at 30fps to its own cell? And then each application would be doing its own In that case, I think it would be desirable to receive the 16*30fps worth of updates in this kind of callback, i.e. since the HDMI display has a chance to produce a finished image at any point in time as the updates to the cells arrive, then also a SPI/DPI-based display would like to do the same. Now of course such a 16x16 grid is a bit contrived as a performance overhead, usually there's only few (in e.g. a fullscreen GLES2/X11 app, only one or a couple?) surfaces, but still as far as the "liveness" of data is concerned and what can visually appear on the HDMI vs on the DPI/SPI screens would be identical. However that does bring up a question: what pumps the GPU to start doing the work of compositing a new frame? I.e. in the regular HDMI out scenario, when any of those 4x4 cells update their contents, does any app calling |
Oh sorry, linking that was not to imply that that bug would be relevant here, except to link to the existence of that function. |
DispmanX absolutely allows multiple clients.
and you should have 4 independent versions of omxplayer playing the video (potentially not at realtime as you may be overloading the codec block depending on the resolution of the source video). Each one is a DispmanX client. Each instance is independent of the display output rate. As previously stated, at the point the display output hits vsync it will take a copy of the current list of items to display. The hardware will then start rendering those elements on-demand to produce the required output pixels (by default the composited output for the displays does NOT get written to memory). It sounds like in the case of your PAL SNES application producing 50Hz updates, it has already failed if the HDMI is running at the default 60Hz. Trying to crowbar in some secondary display copying off a primary running at that incorrect rate by adding callbacks synced to application updates makes little sense. Set the primary to run at the correct rate (tvservice or similar), and your vsyncs are then at the correct rate, so no jitter. |
Thanks, this is a very illustrative example!
This is interesting. How does the hardware know that it will be able to render those elements before the vblank interval is over so that the new pixels make it in to the screen before it runs out of vblank time? Presumably the answer is that it doesn't, so this can cause an extra vertical refresh worth of HDMI latency(?) This sounds like a typical "optimizing for latency vs throughput" tradeoff: an eager driver would immediately start compositing when an update is made, so that the contents would get in to the next earliest vsync. However presumably this is not done because it's a worry that such eagerness could cause redundant work if there are multiple updates that routinely occur per one vsync interval, so to maximize throughput at the expense of latency, the driver just kicks off its work at vsync periods of time(?) This is unfortunate for applications that want to optimize for low latency and don't have a concept of vsync, i.e. use
I expect that most realistic applications are such "single sync source" cases, and there generally exists only one fullscreen surface that is present on screen. Under such conditions, getting a notification of each update does not sound unrealistic, and software regularly is optimized to fall into such category. I'd be surprised if any of the Retropie game emulators running have more than a single visible sync source running when active.
By
The display is not necessarily secondary, but a primary one, and the rate is not incorrect, but it is defined by the application. And this is certainly not crowbarring :)
If there did exist callbacks synced to application updates, it would be possible to develop a push modelled pipeline all the way from the application's Here is the SPI driver in action:
Unfortunately there does not exist a single correct rate, but it is dictated by the application that is currently running, and it can vary depending on the workload of the application. If you look at the second video, Outrun renders at 60Hz, Tyrian renders at 36Hz, Super Mario renders at 50Hz, Prince of Persia renders at 12Hz, and Quake 3 renders at 20-40fps depending on scene complexity. It is not practical to require one to reconfigure the system with tvservice or another program to change the refresh rate, because it can be unknown; the user would not even know what such a refresh rate would be, and this would not work for applications that cannot maintain a fixed rate. And it would be tedious to have to reconfigure whenever switching applications. The videos above perform the busy spinning polling of It sounds like it would be desirable here to have a driver mode where dispmanx would perform eager GPU composition of its layers immediately whenever an update occurs, rather than delaying its work for an extra vsync. This might even improve performance and latency on HDMI connected applications if there existed only few sync sources. From your description it sounds like dozens of layers on screen sounds like a rare case that does not occur in practical applications, and application architectures such as omxplayer and Kodi generally want to optimize for a single sync source anyways. Perhaps this could be a |
Let me come back to the originally posed questions, now I think I understand the original answers from @6by9 better. If I was to design this kind of "eager" compositing mode to DispmanX, it would be nice if DispmanX would for each call to Whenever a rerender finishes, a global event callback would fire to all registered listeners, along with the contents of the composited frame. This way one could still constrain GPU composits to not occur more frequently than given configured frame rate max cap, say at 60Hz, since the GPU would wait for at least 16.666ms until it composited the next time; but still if application produced frames at 50Hz or 36Hz or some other odd amount, all these updates would always immediately trigger an eager composit. This would enable the scenario where application producing at 50Hz would actually fire the events at 50Hz to the SPI display driver, but also the 4x4 wall of videos would cap to only update at 60Hz. This would also potentially improve latency on HDMI(?) Then,
Each finished GPU composit would cause a frame event signal to be produced.
Yes, though if multiple occurred since last GPU composit, those would get buffered up until next composition. (when at least
The callback would fire whenever a GPU composit finishes, and since an internal
This would then create only The advantage of this kind of composition mode would be that instead of waiting for fixed wallclock times (vsync signal) of when a GPU scene recomposit should occur, it would eagerly start if the last composit was too long ago, and as result content that renders at < I hope I did not misunderstand anything too badly that was described earlier, and apologies if so. Perhaps something like this might be workable to think about further? |
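The proposed "eager" composition with a rate cap can be sketched as a small simulation. This is only a model of the idea in the comment above, not a DispmanX API: the function name `eagerCompositeTimesUs` and the parameter `minIntervalUs` are hypothetical, and composition is assumed to take zero time. An update triggers a composition immediately if the cap allows it; otherwise the composition is deferred until the cap expires, and any further updates arriving before then are folded into it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model, not a DispmanX API: given the times at which clients submitted
// updates (sorted ascending), return the times at which an eager compositor would
// recompose the scene if it must wait at least minIntervalUs between compositions.
std::vector<int64_t> eagerCompositeTimesUs(const std::vector<int64_t>& updateTimesUs,
                                           int64_t minIntervalUs)
{
    assert(minIntervalUs >= 0);
    std::vector<int64_t> composites;
    for (int64_t t : updateTimesUs)
    {
        if (composites.empty())
        {
            composites.push_back(t); // first update: compose immediately
        }
        else if (t > composites.back())
        {
            // Compose right away if the cap allows it, otherwise when the cap expires.
            int64_t earliest = composites.back() + minIntervalUs;
            composites.push_back(t > earliest ? t : earliest);
        }
        // else: a composition is already scheduled at or after t, so this update is folded into it.
    }
    return composites;
}
```

In this model, a single 50Hz source with a 16667 us cap is composited at exactly its own update times, while 16 staggered 30fps sources (480 updates per second) collapse to roughly one composition per cap interval.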
You have a hardware pipe of:
There's no point in the GPU being eager to produce the output when you have a defined hardware device that has a specified output pixel rate. The alternative use for the HVS is to write into SDRAM instead of the pixel valve, sometimes referred to as offline. This is what When you called I'm not going to implement a whole new API driving DispmanX in a different manner, mainly as resource is being expended to get onto the mainline KMS driver and deprecate DispmanX and similar. (*) There is a fallback to composing to memory first should the scene complexity exceed the threshold that the hardware can cope with on a just-in-time basis. The threshold isn't an easily defined hard limit as it is heavily dependent on SDRAM access speed, and it is possible to get the pipeline to underflow when it gets the threshold calculation wrong. (**) If you really want to have no display, then you need the application to be calling |
I'm not convinced it's worth spending any more time on this - it would be quite a lot of work for very minimal benefit. |
The framework already exists for sending the vsync callbacks to registered clients. A copy/paste of that and changing the trigger code is relatively straightforward. |
…e when using vsync to work around dispmanx issue raspberrypi/userland#440
…aspberrypi/userland#440) that tries to grab a smooth 60fps video stream, to ease a bit more on the CPU. On Pi Zero W, results in ~30% CPU usage; if this hack is omitted, CPU usage is at around ~16%.
I have been working on porting fbcp-ili9341 to support bigger 480x320 displays and Pi Zero in addition to 320x240/Pi 3, and increasingly find the project juggling between high CPU consumption and stuttering because of the inefficient polling. There is a recent demo video that shows statistics about the CPU consumption aspect: https://www.youtube.com/watch?v=dqOLIHOjLq4 . I wonder if adding an event for updated surfaces might be something to consider revisiting? The benefit would be considerable for SPI displays. |
…spberrypi/userland#440 to be more stable, and utilize a linear increment/geometric fallback type of shootahead mechanism to detect when frame rate of the content has recently increased. (e.g. returning to 60fps updating menu after a 24fps video)
…land#440 : do not be as eager to slow down content frame rate if user plays a game, and spends a few seconds in a static menu, and then continues on 60fps content. Try to go to sleep better by expiring old frame intervals by time in addition to # of samples in the histogram, and try to improve the fast tracking logic by looking at the most recent interval as well.
fbcp-ili9341 has added a statistics option to display a histogram of the amount of stuttering that occurs. The option looks like this in practice: The above picture is from running Read https://github.com/juj/fbcp-ili9341#about-smoothness for details on the different test modes. The video is posted at https://www.youtube.com/watch?v=IqzKT33Rwjc. For example in the 24Hz running Big Buck Bunny video, frame intervals are all over the place when the vsync signal is being used. Snapshotting frames in a dedicated polling thread gives a much smoother and more pleasing result, but unfortunately that approach is not feasible on a Pi Zero. If there was an event callback to get surfaces as they are updated, then I would expect that the frame rate jittering would look much like shown in the video in bottom right (or even better), but with 0% redundant cpu and gpu overhead that would come from having to snapshot excessive frames to discover that nothing had actually updated any surfaces. @6by9 mentioned above that this might be a relatively straightforward addition, and it would make a world of a difference for SPI based displays, many of which could run at a smooth 60fps rate even on a Pi Zero, contrary to what is often stated, based on badly performing software drivers in the past. If this would indeed end up being something relatively easy to do, I think it would definitely be worth the addition. |
The numbers thrown about here for potential update rates seem to be forgetting that the maximum full screen update rate of an SPI-connected ILI9341 is around 30 FPS. The typical ILI9341 is connected by simple ribbon cable or discrete wires and tends to fail above 40MHz. At a higher voltage and with coax or twisted pair, maybe a higher clock rate could function. At 40MHz, a full screen update (320x240 at 16 bits per pixel) would require at least 1228800 bits transmitted (more due to commands and inter-byte delays). If we start with an ideal number of 1228800 bits, this would give us a maximum frame rate of 40000000 / 1228800 ≈ 32 FPS. On the RPI hardware, I believe the maximum workable SPI rate is 31.25MHz because the clock uses a power of 2 divider from a 125MHz master clock. If you do partial frame updates, some games can run at 60fps (see my sg_free game emulator as an example of this). The RPI0 presents a bigger challenge because its single-core CPU doesn't have enough speed to do complex partial-frame update calculations, run the game emulator and transmit the data over the SPI bus. The most practical way to reduce latency is to have the game emulator communicate directly with the SPI display (like my emulator).
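The arithmetic in the comment above can be written out as a short sketch. The function names are made up for illustration, and the calculation is the same idealized one as in the comment: it counts only pixel payload bits and ignores command bytes and inter-byte delays. Which SPI clock rates are actually reachable on a given Pi and display is disputed later in the thread and is not modelled here.

```cpp
#include <cassert>

// Idealized upper bound on full-frame updates per second over SPI:
// pixel payload only, no command overhead, no inter-byte gaps.
double maxFullFrameFps(double spiClockHz, int width, int height, int bitsPerPixel)
{
    assert(spiClockHz > 0 && width > 0 && height > 0 && bitsPerPixel > 0);
    double bitsPerFrame = static_cast<double>(width) * height * bitsPerPixel;
    return spiClockHz / bitsPerFrame;
}

// Inverse question: what SPI clock would a given full-frame rate need, under the same idealization?
double requiredSpiClockHz(double fps, int width, int height, int bitsPerPixel)
{
    assert(fps > 0 && width > 0 && height > 0 && bitsPerPixel > 0);
    return fps * static_cast<double>(width) * height * bitsPerPixel;
}
```

For a 320x240 display at 16 bits per pixel, `maxFullFrameFps(40e6, 320, 240, 16)` is about 32.6 and `maxFullFrameFps(31.25e6, 320, 240, 16)` is about 25.4, while `requiredSpiClockHz(60, 320, 240, 16)` comes to 73.728 MHz for full 60fps updates.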
Thanks for the comment. It is appreciated, however many of the statements about refresh rate there are incorrect, or perhaps apply only to your specific case. E.g. SPI can well run at higher speeds, inter-byte delay can be disabled, a power of 2 divider is not needed, and the master clock is not at 125MHz, so ILI9341s can go up to 60fps with full updates. But that is sidetracking the conversation, so let's not pursue that aspect further (I'll just refer you to https://github.com/juj/fbcp-ili9341 for more information on the above), since this issue is not about frame rate, or getting content to run at 60fps, or "fast", whatever the attained refresh rate may be in a particular scenario. It is actually more visible in the opposite scenario, when you are running content that is not authored for 60Hz. More generally, it is about how to deal with the stuttering, latency and wasted performance that occur at all frame rates.

Agreed, like you mention, Pi Zero is a big challenge, and it is hit much harder by this issue. Current DispmanX does work serviceably (with only a little bit of stutter because of this bug, which we would like to remove) if both the content and the display run at up to 60fps; see the top right corner test case in the video. This issue is prominent in the scenario where the content does not run at 60fps, but at a lower refresh frequency. If you watch the video posted above, it has content that runs at 50Hz (Super Mario), 36Hz (OpenTyrian) and 24Hz (Big Buck Bunny). Many games, like Minecraft, cannot maintain a single fixed update rate, but vary depending on the complexity of the view of the scene. Since there is no way to get events from the display driver, we need to resort to polling in a background thread to get smooth frame rates. For example, if your SPI display was constrained to max 32 FPS, you'd still like to watch 24Hz Big Buck Bunny on it as a smooth 24fps update and not have it stutter.

This issue is about identified wasted CPU work that has to occur. If a feature was added to DispmanX to create such an event based mechanism to obtain frames, that CPU wastage would be avoided, and SPI displays on Pi Zero would be more viable as well. Your work on creating a game emulator that communicates directly with an SPI display is admirable! Such an approach unfortunately does not scale: there are probably thousands of graphical applications developed for the Pi, and you cannot rewrite all of them to talk directly to an SPI display, which is why this bug entry is asking about adding a feature to serve as a general solution.
@juj thanks for the clarification. I guess I have low quality ILI9341 displays because mine all start to fail at around 36Mhz or so. Can you point me to a vendor link for a display that can handle 60FPS? If the RPI0 SPI master clock is not 125Mhz, can you explain how it works in more detail? What clock rates are possible? |
Again, please let's not derail what is already becoming a very lengthy thread with unrelated conversation. Feel free to open a followup at e.g. fbcp-ili9341 tracker, Raspberry Pi Interfacing (DSI, CSI, I2C, etc.) subforum, or shoot me an email at . Thanks. |
Thanks @juj for opening this conversation and all the others bringing in their expertise of the relevant software stacks. I understand that what is required for this issue to be solved is a call back for newly available content. I also understand that the main problem that is seen with that by @6by9 is the possibly unbounded amount of callbacks per second. Maybe a simple solution to this conundrum could be the following: the functionality to register a one-shot callback for new content. As in the example of the SPI driver, most applications will NOT be interested in a possibly unbounded amount of callbacks, but need their own processing time of the content or have a maximum FPS limit. They do not want the above-mentioned 480 callbacks. So to solve this, they could request a one-time notification every time they are ready for new content. Bonus points if the API would allow to ask: "hey, my last frame number was n, is there already a newer frame available? If so, return false, if not, register a one-time callback for when there is a newer frame and return true." |
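The one-shot idea above could look roughly like the following sketch. Everything here is hypothetical: `OneShotFrameNotifier`, `requestNewerThan` and `onFrameProduced` are invented names, no such API exists in DispmanX as far as this thread shows, and the sketch is single-threaded with no locking. The return convention follows the comment: `false` means a newer frame is already available, `true` means a one-time callback was armed.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical one-shot "newer frame" notifier (single-threaded sketch, no locking).
class OneShotFrameNotifier
{
public:
    using Callback = std::function<void(uint64_t frameNumber)>;

    // Returns false if a frame newer than lastSeenFrame already exists (caller can fetch it now).
    // Otherwise arms a one-time callback for the next frame and returns true.
    bool requestNewerThan(uint64_t lastSeenFrame, Callback cb)
    {
        assert(static_cast<bool>(cb));
        if (latestFrame_ > lastSeenFrame)
            return false;
        pending_ = std::move(cb);
        return true;
    }

    // Would be called by the producer side each time a new frame is complete.
    void onFrameProduced()
    {
        ++latestFrame_;
        if (pending_)
        {
            Callback cb = std::move(pending_);
            pending_ = nullptr; // disarm before invoking, so the callback fires at most once
            cb(latestFrame_);
        }
    }

    uint64_t latestFrame() const { return latestFrame_; }

private:
    uint64_t latestFrame_ = 0;
    Callback pending_;
};
```

With this shape, a slow consumer such as an SPI driver sees at most one callback per request regardless of how many frames are produced in the meantime, which bounds the callback rate by the consumer's own processing rate.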
Hi! I'm very interested in this conversation (which is very interesting... :)) as I'm waiting for a cm3 device with an ili9341 screen (freeplay cm3). Is there any news on this subject? For specific systems (cm3/recalbox), I think I may implement this in SDL/SDL2 to get the most out of emulators, but this is not a nice solution.
Hi @juj, My setup:
My Build Command: It's working on Lakka too. Hope It will help someone is using the same screen as me. Regards, |
Currently, software display drivers that show the main framebuffer contents on displays interfaced via the SPI bus resort to periodically polling snapshots of the screen contents using the vc_dispmanx_snapshot() API. This is done for example in the popular rpi-fbcp SPI framebuffer driver and in my fbcp-ili9341 program, and they suffer from suboptimal performance, because one needs to guess when new frames would be available. Snapshotting too often wastes CPU (one would capture the same frame twice or multiple times), and snapshotting too infrequently will cause stuttering or missed frames.

Does there exist, or would it be possible to add into the driver, a variant of this API that would allow one to register direct event callbacks for when a new frame has been produced? In such a callback, one could then either call vc_dispmanx_snapshot(), or perhaps preferably, directly receive a pointer to the newly produced frame (for example if an optional flag to grab these was set when the event callback was registered). This would greatly help these SPI-based display drivers conserve CPU and minimize latency and stuttering.

There does already exist a way to register for vsync notifications, and that can be used as a heuristic in some cases, but this is not quite what is desired, since the GPU might be pushing out frames at different or variable rates.
If such an API already exists, tips towards how to use would be very much appreciated. Or if not, would this be something that was possible and reasonable to add?
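The trade-off described in this report (polling too often captures duplicates, polling too rarely drops frames) can be illustrated with a small simulation. This is a sketch under idealized assumptions: `simulatePolling` and `PollStats` are names invented here, the content is assumed to produce frames at a perfectly fixed period, and no real snapshot call is made.

```cpp
#include <cassert>
#include <cstdint>

struct PollStats
{
    int redundant; // polls that captured a frame already seen
    int missed;    // content frames that were never captured
};

// Hypothetical model: content frame k appears at k * framePeriodUs and a snapshot
// is taken at i * pollPeriodUs. The first poll (i = 0) captures frame 0.
PollStats simulatePolling(int64_t framePeriodUs, int64_t pollPeriodUs, int numPolls)
{
    assert(framePeriodUs > 0 && pollPeriodUs > 0);
    PollStats stats{0, 0};
    int64_t lastSeen = 0;
    for (int i = 1; i < numPolls; ++i)
    {
        int64_t newest = (i * pollPeriodUs) / framePeriodUs; // newest frame at this poll
        if (newest == lastSeen)
            ++stats.redundant;
        else
            stats.missed += static_cast<int>(newest - lastSeen - 1); // frames skipped in between
        lastSeen = newest;
    }
    return stats;
}
```

In this model, polling 50Hz content every 5 ms wastes three out of every four snapshots, while polling it every 50 ms wastes none but drops more frames than it captures. An event fired per produced frame would avoid both outcomes, which is the motivation for the request.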