
Event notification based version of vc_dispmanx_snapshot? #440

Open · juj opened this issue Nov 18, 2017 · 26 comments
@juj commented Nov 18, 2017

Currently, software display drivers that display the main framebuffer contents on displays interfaced via the SPI bus resort to periodically polling snapshots of the screen contents using the vc_dispmanx_snapshot() API. This is done for example in the popular rpi-fbcp SPI framebuffer driver, and in my fbcp-ili9341 program, and both suffer from suboptimal performance, because one needs to guess when new frames will become available. Snapshotting too often wastes CPU (the same frame is captured two or more times), and snapshotting too infrequently causes stuttering or missed frames.
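
For reference, the polling pattern in question looks roughly like the following. This is a minimal sketch against the userland DispmanX API (error handling, the pixel diffing and the actual SPI transfer are omitted), not the exact code of either driver:

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include "bcm_host.h"

int main(void)
{
   bcm_host_init();

   DISPMANX_DISPLAY_HANDLE_T display = vc_dispmanx_display_open(0);
   DISPMANX_MODEINFO_T info;
   vc_dispmanx_display_get_info(display, &info);

   // Offscreen snapshot target; the read-back pitch must be a multiple of 32 bytes.
   uint32_t pitch = (info.width * 2 + 31) & ~31u;
   uint32_t image_handle;
   DISPMANX_RESOURCE_HANDLE_T res =
      vc_dispmanx_resource_create(VC_IMAGE_RGB565, info.width, info.height, &image_handle);
   VC_RECT_T rect;
   vc_dispmanx_rect_set(&rect, 0, 0, info.width, info.height);
   void *pixels = malloc(pitch * info.height);

   for (;;)
   {
      // Ask the GPU to compose the current scene offline into 'res'...
      vc_dispmanx_snapshot(display, res, DISPMANX_NO_ROTATE);
      // ...and copy the result into CPU-visible memory.
      vc_dispmanx_resource_read_data(res, &rect, pixels, pitch);

      // (diff against the previous frame and push changed spans to the SPI display)

      usleep(1000000 / 60); // the guessing part: hope new frames arrive ~60 times/sec
   }
}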

Does there exist, or would it be possible to add to the driver, a variant of this API that allows one to register a direct event callback for when a new frame has been produced? In such a callback, one could then either call vc_dispmanx_snapshot(), or preferably, directly receive a pointer to the newly produced frame (for example if an optional flag requesting this was set when the event callback was registered). This would let these SPI-based display drivers conserve CPU and minimize latency and stuttering.

There does already exist a way to register for vsync notifications, and that can be used as a heuristic in some cases, but it is not quite what is desired, since the GPU might be pushing out frames at different or variable rates.
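
For reference, the vsync registration referred to above is vc_dispmanx_vsync_callback(); a minimal sketch of hooking it up (assuming the usual bcm_host setup, details omitted):

#include <unistd.h>
#include "bcm_host.h"

static void vsync_cb(DISPMANX_UPDATE_HANDLE_T u, void *arg)
{
   // Fires once per display refresh (~16.67 ms at 60 Hz), regardless of
   // whether any application actually produced a new frame. Typically one
   // would signal a worker thread from here to call vc_dispmanx_snapshot().
}

int main(void)
{
   bcm_host_init();
   DISPMANX_DISPLAY_HANDLE_T display = vc_dispmanx_display_open(0);
   vc_dispmanx_vsync_callback(display, vsync_cb, NULL);
   pause(); // callbacks arrive asynchronously on a VCHI thread
}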

If such an API already exists, tips on how to use it would be very much appreciated. If not, would this be something that is possible and reasonable to add?

@juj (Author) commented Jan 10, 2018

Stumbled by chance onto the alkemir/gertCloner repository, which exists to duplicate output to DPI and HDMI displays simultaneously, suggesting that this is not an "only a SPI display" problem. Polling vc_dispmanx_snapshot()s is required for example in Kite's SUPER AIO - HDMI + DPI Screen Cloning Raspberry Pi Zero Retropie handheld device to simultaneously output to DPI and HDMI.

If there existed a notification callback version of the vc_dispmanx_snapshot() API, it could be used here instead of the polling loop, giving lower latency, better performance and lighter power consumption & temps when simultaneously outputting to DPI and HDMI. CC @geebles for awareness.

@JamesH65 (Collaborator) commented:

The framebuffer/GPU updates are driven by the display frequency (in most cases - there is a slight proviso on 90 and 180 degree display rotations), so vsync would probably be the option to use here. Not sure there's any other scope for adding an extra callback. If I understand what you are asking for, it's like a callback for when the image changes, and I don't think that would be particularly easy/efficient to implement.

@kiteretro commented Jan 10, 2018

I modified gertCloner to use the 'vsync callback' method, which gives an interrupt at 60Hz. The original gertCloner does the same as fbcp and has a dumb sleep. I found that when cloning DPI to HDMI with a dumb sleep you would get visible tearing and stuttering; changing to use the vsync callback produced perfect frames:

Video: DPI to HDMI cloning (YouTube)

It seems that the existing callback works great in this case (DPI to HDMI), but for SPI screens I guess 'if nothing changes, why should I send anything?' is what this issue is about?

@juj (Author) commented Jan 10, 2018

The framebuffer/GPU updates are driven by the display frequency (in most cases - there is a slight proviso on 90 and 180 display rotations), so vsync would probably be the option to use here.

I think that the current vsync callback is a signal "from the wrong end of the chain", if that makes sense. That is, imagine one has the following kind of native C++ GLES2/EGL style application (a bit pseudo):

int main()
{
   initEGL();
   eglSwapInterval(display, 0);        // present immediately, never wait for vsync
   for(;;)
   {
      updateAndRender();                // (*) takes a variable amount of time
      eglSwapBuffers(display, surface); // (**) push the finished frame
   }
}

where the (*) starred line could take a variable amount of time, depending on the application. A PAL NES emulator produces frames at 50Hz, whereas other apps could be producing at 60Hz, or, depending on the updateAndRender() workload, at frame rates that vary arbitrarily. Think of the updateAndRender() call in an arbitrary application as being effectively identical to a sleepRandomMSecs() function, i.e. we don't know when the application will have pushed a new frame to be presented.

In my fbcp-ili9341 driver, I do already implement the vsync callback, but it's not working out great, because it triggers at 16.6667 msec intervals, whereas if I'm busy spinning to poll new frames in a loop while playing a PAL NES game, I'm obtaining frames at regular 20 msec intervals (50fps). This 50Hz vs 60Hz mismatch then causes jittery latency in the obtained frames, leading to microstuttering.

I find I'm getting the best results when I busy spin a dedicated thread doing snapshots as fast as possible. This gives the lowest latency for obtaining newly produced frames from the application's point of view, but the polling then kills CPU usage, so I end up doing heuristics to guess the earliest time to start polling for the next frame.

It seems that the existing callback works great in this case (DPI to HDMI), but for SPI screens I guess 'if nothing changes, why should I send anything?' is what this issue is about?

To minimize the bytes transferred on the SPI bus, I already do per-pixel diffs on the obtained frames to avoid wasting excess byte transfers. That side of the flow is not an issue, but the latency/jitter part is. The same 50Hz vs 60Hz mismatch jitter will also be present on DPI if using the vsync callback. (Even doubly so if the DPI display is driven at its own vsync as well.)

Ideally, there would be some way to hook into notifications about the "swap chain" (Direct3D lingo) of frames that the application side has generated (at (**)), and about when the GPU composition of those frames has finished. Then there would be a dispmanx callback that gets called for anyone listening for finished frames, and in that callback, the listener could grab the frame contents (or get the frame contents fed to it directly, perhaps depending on callback registration flags). This way the push model would be preserved with low latency, without regard to vsync, which should only come into play later down the chain when pushing the frames out to the display.

The original gertCloner does the same as fbcp and has a dumb sleep.

This reminds me of an issue caused by fixed sleep times in the original fbcp code. Instead of a fixed sleep(25 msecs), one wants to do a variable sleep that aims ahead at the arrival time of the next frame, factoring in the time the snapshot took, instead of always sleeping a fixed number of msecs (which would then add on top of whatever time vc_dispmanx_snapshot() itself spent waiting). The function call vc_dispmanx_snapshot() is synchronous, but inside it, it waits an async amount of time for an event from the GPU, so calls to vc_dispmanx_snapshot() can take varying amounts of time before returning, and that needs to be adapted to, otherwise there will be jitter. (Though all of this is a bit orthogonal to the issue at hand.)
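
A sketch of that variable sleep idea (a hypothetical helper, assuming the content frame interval has already been measured, e.g. 20 msec for 50Hz content):

#include <stdint.h>
#include <time.h>
#include <unistd.h>

static uint64_t now_us(void)
{
   struct timespec ts;
   clock_gettime(CLOCK_MONOTONIC, &ts);
   return (uint64_t)ts.tv_sec * 1000000ull + (uint64_t)ts.tv_nsec / 1000;
}

// Sleep so that the next vc_dispmanx_snapshot() poll lands right at the
// expected arrival of the next frame, compensating for the (variable) time
// the previous snapshot call itself spent blocked.
static void sleep_until_next_frame(uint64_t prev_frame_arrival_us,
                                   uint64_t frame_interval_us)
{
   uint64_t target = prev_frame_arrival_us + frame_interval_us;
   uint64_t now = now_us();
   if (target > now)
      usleep(target - now);
   // else: already late, poll immediately
}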

@juj (Author) commented Jan 10, 2018

If I understand what you are asking for, it's like a callback for when the image changes, and I don't think that would be particularly easy/efficient to implement.

I don't think it's important to have a callback for when the image pixels change, just one for when the application/framebuffer chain has pushed a new frame out for the dispmanx chain to process. If the application produces the same frame hundreds of times in a row, then the callback would fire once for each (per-pixel diffs would not be nice to do in dispmanx or anything like that).

@6by9 (Contributor) commented Jan 10, 2018

I don't think it's important to have a callback for when the image pixels change, just one for when the application/framebuffer chain has pushed a new frame out for the dispmanx chain to process.

AIUI there's no such thing.
DispmanX is accumulating a list of elements to display for the next frame, as well as holding the list for the current frame.
If you present it a new image to display on an element before the VSYNC, then it will be applied to the list for the next frame. If the old image element is being displayed, then it is released when it is finished with at the VSYNC. If 2 or more updates to an element are submitted between VSYNCs, then the earlier ones that haven't made it to the screen are released immediately. (That is how things like rendering 90fps off the camera work: only 2/3 make it to the screen, the other 1/3 are released immediately.)

What constitutes a new frame in your proposed scheme? Every DispmanX update with a different source (effectively what James proposed)? What if multiple things are contributing DispmanX elements - do they all generate a callback? Nothing stopping me setting up 16 QVGA video clips as a mini-video wall, so you'll now be getting an event off each - prepare for 480 events a second if they're 30fps clips.

Also please note that the Linux frame buffer is not double-buffered, therefore you'd never get any updates from that at all.

Whilst I can appreciate what you're trying to achieve, I can't think of a suitable scalable mechanism for doing so.

@juj (Author) commented Jan 10, 2018

Thanks for the detailed description! This is very informative.

If you present it a new image to display on an element before the VSYNC, then it will be applied to the list for the next frame.

What happens if I miss adding any images between two vsync events? The display will not show(?) a black screen, so somehow it knows that it should not have presented a "dropped frame" on the screen, suggesting that there is something that would exist to tell "there's now a full set of new frame data to push, please present/swap this on the screen".

If 2 or more updates to an element are submitted between VSYNCs, then the earlier ones that haven't made it to the screen are released immediately. (That is how things like rendering 90fps off the camera work: only 2/3 make it to the screen, the other 1/3 are released immediately.)

If I understand this description correctly, then it would indeed be ideal to receive 90 callbacks in this scenario.

What constitutes a new frame in your proposed scheme?

The application always has some kind of API call it uses to say "whatever I've rendered up to now is a complete formed image, please present it on the screen all at once" (eglSwapBuffers, SDL_Flip, glXSwapBuffers, ...). I agree I'm not familiar enough with the Linux framebuffer API or DispmanX to know how these frame delimiters travel through the internals, but my understanding is that such information does get carried through: when one calls vc_dispmanx_snapshot() in a tight loop, it never returns unbuffered tearing data or data from partially processed surfaces as far as I can tell, nor is it locked to vsync, and a program that pushes out frames at 50Hz does indeed generate a 50Hz stream of new frames on the receiving side when polled with vc_dispmanx_snapshot().

Every DispmanX update with a different source (effectively what James proposed)? What if multiple things are contributing DispmanX elements - do they all generate a callback? Nothing stopping me setting up 16 QVGA video clips as a mini-video wall, so you'll now be getting an event off each - prepare for 480 events a second if they're 30fps clips.

I'm not familiar enough with the DispmanX lingo here - is there a built-in primitive to synchronize these DispmanX elements to form a frame? Let's say my application is building up this huge QVGA video wall, and it takes 300 msecs to create the initial first frame of each element. My understanding is that the DispmanX compositor would not start presenting half of them on the screen (the remaining half staying e.g. black) at 16 msec intervals throughout this 300 msec period; rather, only when I tell DispmanX "I've now produced all elements this frame, please swap/present" would the composited contents from all of them be swapped out to the HDMI controller to display? That would be the event that would be ideal to hook onto here.

If the API doesn't work like that, and there's no such thing that applies, then perhaps what I would be looking for are 480 events per second here in this scenario if the elements are unbuffered.

Searching, I found https://www.raspberrypi.org/forums/viewtopic.php?t=33615, which mentions a code snippet:

// finish display update
ret = vc_dispmanx_update_submit_sync( update_handle );
if (ret != 0) {
    return LLGRLIB_ERROR;
}

Perhaps that is the frame synchronization event that native applications implementing EGL/X11/framebuffer/etc. buffer swapping use? In that case, I think that would be the event we'd like to hook onto here. Does that sound plausible? And thanks for guiding the conversation!

@juj (Author) commented Jan 10, 2018

Searching further, I found raspberrypi/firmware#355, and perhaps what we're looking for here is to hook in the middle to get an event callback from the functions vc_dispmanx_update_submit and vc_dispmanx_update_submit_sync that someone else called (plus the frame contents). That is, the native implementations of EGL, SDL etc. all call these functions to present frames(?), and we'd want to hook into getting callbacks when they finish the GPU work.

@6by9 (Contributor) commented Jan 10, 2018

What happens if I miss adding any images between two vsync events? The display will not show(?) a black screen, so somehow it knows that it should not have presented a "dropped frame" on the screen, suggesting that there is something that would exist to tell "there's now a full set of new frame data to push, please present/swap this on the screen".

Then it retains the previous image passed for the element.

The application always has some kind of API call it uses to say "whatever I've rendered up to now is a complete formed image, please present it on the screen all at once" (eglSwapBuffers, SDL_Flip, glXSwapBuffers, ...). I agree I'm not familiar enough with the Linux framebuffer API or DispmanX to know how these frame delimiters travel through the internals, but my understanding is that such information does get carried through: when one calls vc_dispmanx_snapshot() in a tight loop, it never returns unbuffered tearing data or data from partially processed surfaces as far as I can tell, nor is it locked to vsync, and a program that pushes out frames at 50Hz does indeed generate a 50Hz stream of new frames on the receiving side when polled with vc_dispmanx_snapshot().

The swapbuffers type calls are effectively making updates to the display list atomic. At the DispmanX level that is vc_dispmanx_update_start and vc_dispmanx_update_submit_sync. BUT that covers only one client.
My example of the video wall has 16 clients (probably MMAL video_render components running on the GPU, but they could be independent Linux apps) plus the Linux frame buffer. Each makes sure that the changes it makes are atomic, but doesn't care about the overall "scene" updating. Which of those updates counts as the "scene change" and needs to notify you?
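
(In code, one client's atomic update looks roughly like this; a sketch where 'element' is the client's existing on-screen element and 'new_res' a freshly filled resource:

DISPMANX_UPDATE_HANDLE_T update = vc_dispmanx_update_start(0 /* priority */);
// Everything changed between update_start and update_submit_sync is applied
// atomically at the next frame - this is the DispmanX-level "swapbuffers".
vc_dispmanx_element_change_source(update, element, new_res);
vc_dispmanx_update_submit_sync(update);

Each of the 16 clients runs its own loop of these, independently of the others.)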

vc_dispmanx_snapshot() in a tight loop is effectively saying "please compose the scene as currently defined to this offline buffer". Yes it is independent of the display VSYNC because it is requesting a new pass through the hardware to the memory buffer.

perhaps what we're looking for here is to hook in the middle to get an event callback from functions vc_dispmanx_update_submit and vc_dispmanx_update_submit_sync that someone else called (plus the frame contents).

Yes, that's the crux - it's someone else's DispmanX calls that you are wanting to hook on to.
With framebuffer copy type applications you have no idea of what resources are on the screen or what applications they relate to, so adding filtering on who/what is going to be difficult. Without filtering you could well be getting those 480 callbacks a second.

To my mind firmware#355 is unrelated. They were requesting a callback when their own resources actually got updated, and had hit a genuine bug.

@juj (Author) commented Jan 10, 2018

My example of the video wall has 16 clients (probably MMAL video_render components running on the GPU, but could be independent Linux apps) plus the Linux frame buffer. Each is making sure that the changes it makes are atomic, but doesn't care about the overall "scene" updating. Which of those updates counts as the "scene change" and needs to notify you?

Would DispmanX allow, in this kind of scenario, each of the 16 programs to have one visible cell of a 4x4 grid on the display, with each application driving video at 30fps to its own cell? And each application would then make its own vc_dispmanx_update_submit(_sync) calls to tell DispmanX about its own portion of finished frames? Does that mean that, as far as a connected HDMI display is concerned, it would see at 60Hz a random snapshot in time of the frames in each cell as the applications update them at around 30fps? I.e. each individual cell is updated disjointly from the others, unsynchronized?

In that case, I think it would be desirable to receive the 16*30fps worth of updates in this kind of callback; i.e. since the HDMI display has a chance to present a finished image at any point in time as the updates to the cells arrive, an SPI/DPI-based display would like to do the same. Of course such a 16-cell grid is a bit contrived as a performance scenario; usually there are only a few surfaces (in e.g. a fullscreen GLES2/X11 app, only one or a couple?), but still, as far as the "liveness" of the data is concerned, what can visually appear on the HDMI and on the DPI/SPI screens would be identical.

However, that does bring up a question: what pumps the GPU to start the work of compositing a new frame? I.e. in the regular HDMI out scenario, when any of those 4x4 cells updates its contents, does an app calling vc_dispmanx_update_submit(_sync) eagerly trigger the GPU to produce a new recomposited scene pixel buffer immediately, which the HDMI controller then scans out when its next vsync interval occurs? Or is there some kind of lazy aspect to this in the driver?

@juj (Author) commented Jan 10, 2018

To my mind firmware#355 is unrelated. They were requesting a callback when their own resources actually got updated, and had hit a genuine bug.

Oh sorry, linking that was not to imply that that bug is relevant here, only to point at the existence of that function.

@6by9 (Contributor) commented Jan 11, 2018

DispmanX absolutely allows multiple clients.
Find a video clip, and try

omxplayer --no-keys --refresh --loop --win "0 0 640 360" file.mp4 &
omxplayer --no-keys --refresh --loop --win "640 0 1280 360" file.mp4 &
omxplayer --no-keys --refresh --loop --win "0 360 640 720" file.mp4 &
omxplayer --no-keys --refresh --loop --win "640 360 1280 720" file.mp4 &

and you should have 4 independent versions of omxplayer playing the video (potentially not at realtime as you may be overloading the codec block depending on the resolution of the source video). Each one is a DispmanX client.

Each instance is independent of the display output rate. As previously stated, at the point the display output hits vsync it will take a copy of the current list of items to display. The hardware will then start rendering those elements on-demand to produce the required output pixels (by default the composited output for the displays does NOT get written to memory).
If you have only one application updating the screen then you can alter the HDMI output parameters to match the source video - Kodi can do that, as can omxplayer, but it only applies for a single sync source.

It sounds like in the case of your PAL SNES application producing 50Hz updates, it has already failed if the HDMI is running at the default 60Hz. Trying to crowbar in some secondary display copying off a primary running at that incorrect rate by adding callbacks synced to application updates makes little sense. Set the primary to run at the correct rate (tvservice or similar), and your vsyncs are then at the correct rate, so no jitter.

@juj (Author) commented Jan 11, 2018

DispmanX absolutely allows multiple clients.
Find a video clip, and try

omxplayer --no-keys --refresh --loop --win "0 0 640 360" file.mp4 &
omxplayer --no-keys --refresh --loop --win "640 0 1280 360" file.mp4 &
omxplayer --no-keys --refresh --loop --win "0 360 640 720" file.mp4 &
omxplayer --no-keys --refresh --loop --win "640 360 1280 720" file.mp4 &

and you should have 4 independent versions of omxplayer playing the video (potentially not at realtime as you may be overloading the codec block depending on the resolution of the source video). Each one is a DispmanX client.

Thanks, this is a very illustrative example!

As previously stated, at the point the display output hits vsync it will take a copy of the current list of items to display. The hardware will then start rendering those elements on-demand to produce the required output pixels

This is interesting. How does the hardware know that it will be able to render those elements before the vblank interval is over, so that the new pixels make it onto the screen before it runs out of vblank time? Presumably the answer is that it doesn't, so this can cause an extra vertical refresh worth of HDMI latency(?)

This sounds like a typical "latency vs throughput" tradeoff: an eager driver would immediately start compositing when an update is made, so that the contents would make the earliest next vsync. Presumably this is not done out of a worry that such eagerness could cause redundant work if multiple updates routinely occur within one vsync interval, so to maximize throughput at the expense of latency, the driver just kicks off its work at vsync boundaries(?)

This is unfortunate for applications that want to optimize for low latency and don't have a concept of vsync, i.e. use eglSwapInterval(0), but they want to (and can) push frames immediately as they are available.

If you have only one application updating the screen then you can alter the HDMI output parameters to match the source video - Kodi can do that, as can omxplayer, but it only applies for a single sync source.

I expect that most realistic applications are such "single sync source" cases, where there generally exists only one fullscreen surface on screen. Under such conditions, getting a notification of each update does not sound unrealistic, and software is regularly optimized to fall into that category. I'd be surprised if any of the Retropie game emulators have more than a single visible sync source running when active.

It sounds like in the case of your PAL SNES application producing 50Hz updates, it has already failed if the HDMI is running at the default 60Hz.

By "it", do you mean the PAL SNES application? This bug is talking about a context where there likely is no connected HDMI display, and SPI-connected displays cannot have a concept of vsync - they are practically eglSwapInterval(0) style applications that want to push pixels to the screen immediately as they become available to be written. (This is because the bus is the constraining factor.)

Trying to crowbar in some secondary display copying off a primary running at that incorrect rate

The display is not necessarily secondary, but a primary one, and the rate is not incorrect, but defined by the application. And this is certainly not crowbarring :)

adding callbacks synced to application updates makes little sense

If there did exist callbacks synced to application updates, it would be possible to develop a push-modelled pipeline all the way from the application's swapScreen() function to the driver's writes to the SPI display, which would minimize latency and remove the wasted work that the otherwise-required polling loops cause.

Here is the SPI driver in action: [demo videos]

Set the primary to run at the correct rate (tvservice or similar), and your vsyncs are then at the correct rate, so no jitter.

Unfortunately there does not exist a single correct rate, but it is dictated by the application that is currently running, and it can vary depending on the workload of the application. If you look at the second video, Outrun renders at 60Hz, Tyrian renders at 36Hz, Super Mario renders at 50Hz, Prince of Persia renders at 12Hz, and Quake 3 renders at 20-40fps depending on scene complexity. It is not practical to require one to reconfigure the system with tvservice or another program to change the refresh rate, because it can be unknown; the user would not even know what such a refresh rate would be, and this would not work for applications that cannot maintain a fixed rate. And it would be tedious to have to reconfigure whenever switching applications.

The videos above perform the busy-spinning polling of vc_dispmanx_snapshot(), and the results are good, with low latency. However, dedicating one core to polling snapshots is wasteful from a power consumption and latency perspective.

It sounds like it would be desirable here to have a driver mode where dispmanx performs eager GPU composition of its layers immediately whenever an update occurs, rather than delaying the work for an extra vsync. This might even improve performance and latency for HDMI-connected applications if there are only a few sync sources. From your description, dozens of layers on screen sounds like a rare case that does not occur in practical applications, and application architectures such as omxplayer and Kodi generally want to optimize for a single sync source anyway. Perhaps this could be a latency- vs throughput-optimized driver configuration mode, where the latency-optimized mode does this composition eagerly, and then allows firing some kind of globally registerable callback (to any listeners that exist in the system)?

@juj (Author) commented Jan 11, 2018

Let me come back to the originally posed questions, now that I think I understand the original answers from @6by9 better.

If I were to design this kind of "eager" compositing mode for DispmanX, it would be nice if DispmanX would, for each call to vc_dispmanx_update_submit(_sync), immediately kick off a rerender on the GPU if the last rendering of the scene was more than a configured 1000ms/FRAME_RATE ago. If the previous rerender occurred sooner than this, say at time T_prev, then DispmanX would queue that update in its internal list, and set a timer to recomposite at time T_prev+1000ms/FRAME_RATE.

Whenever a rerender finishes, a global event callback would fire to all registered listeners, along with the contents of the composited frame.

This way one could still constrain GPU composites to not occur more frequently than a configured maximum frame rate cap, say 60Hz, since the GPU would wait at least 16.666ms before compositing the next time; but if an application produced frames at 50Hz or 36Hz or some other odd rate, those updates would always immediately trigger an eager composite. This would enable the scenario where an application producing at 50Hz actually fires events at 50Hz to the SPI display driver, while the 4x4 wall of videos would still cap at 60Hz updates. This might also improve latency on HDMI(?)
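
In pseudocode, the proposed firmware-side logic might look like the following (all names here are illustrative, not existing firmware symbols; now_us() is a monotonic microsecond clock):

// Hypothetical rate-limited eager recomposition, e.g. FRAME_RATE = 60.
static uint64_t t_prev_composite_us;

void on_update_submitted(void) // called from vc_dispmanx_update_submit(_sync)
{
   uint64_t min_interval_us = 1000000 / FRAME_RATE;
   uint64_t t = now_us();
   if (t - t_prev_composite_us >= min_interval_us) {
      t_prev_composite_us = t;
      composite_scene_on_gpu();  // eager: kick off GPU composition immediately
      fire_frame_callbacks();    // notify registered listeners with the frame
   } else {
      // Too soon: coalesce this update and recomposite once the cap allows.
      arm_one_shot_timer(t_prev_composite_us + min_interval_us);
   }
}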

Then,

What constitutes a new frame in your proposed scheme?

Each finished GPU composite would cause a frame event signal to be produced.

Every DispmanX update with a different source (effectively what James proposed)?

Yes, though if multiple occurred since the last GPU composite, those would get buffered up until the next composition (when at least 1000ms/FRAME_RATE of time has passed since the previous composition).

What if multiple things are contributing DispmanX elements - do they all generate a callback?

The callback would fire whenever a GPU composite finishes, and since an internal FRAME_RATE field would limit GPU composites to occur at most once every 1000ms/FRAME_RATE msecs, this would cap the number of callbacks.

Nothing stopping me setting up 16 QVGA video clips as a mini-video wall, so you'll now be getting an event off each - prepare for 480 events a second if they're 30fps clips.

This would then create only FRAME_RATE event callbacks per second.

The advantage of this kind of composition mode would be that instead of waiting for fixed wallclock times (the vsync signal) to decide when a GPU scene recomposite should occur, it would eagerly start one if the last composite was too long ago. As a result, content that renders at less than FRAME_RATE would automatically get less jitter, since each of its composites would be performed immediately as it is submitted via vc_dispmanx_update_submit(_sync), rather than the driver holding it until the next vsync.

I hope I did not misunderstand anything too badly that was described earlier, and apologies if so. Perhaps something like this might be workable to think about further?

@6by9 (Contributor) commented Jan 12, 2018

You have a hardware pipe of:
HVS -> FIFO -> Pixel valve -> [ DSI | DPI | HDMI | VEC (analogue video) ]
(HVS = Hardware Video Scaler).

  • The pixel valve is programmed with the pixel frequencies for the end device.
  • The HVS is given a display list by DispmanX. It starts composing the first few lines of the frame into the small FIFO (actually part of the pixel valve, but easier to draw it separate). The Pixel valve feeds the data out at the desired rate. As space becomes available in the FIFO the HVS will compose more of the scene. It's all done just-in-time(*), sometimes referred to as online.
  • At the end of the frame you get the next VSYNC interrupt from the hardware (be it HVS to say frame complete, or pixelvalve to say FIFO empty) and the DispmanX code updates the HVS display list ready for the next frame, and releases any images that have been removed from the scene.

There's no point in the GPU being eager to produce the output when you have a defined hardware device that has a specified output pixel rate.

The alternative use for the HVS is to write into SDRAM instead of the pixel valve, sometimes referred to as offline. This is what vc_dispmanx_snapshot is using with the current display list (which may be different from what is being rendered to the display for the current frame, and can be at an alternate resolution/transform).

When you called vc_dispmanx_display_open you specified the display to use. From the hardware perspective that display is the timing master, even if you have nothing connected to view it. I note in your code you're even asking what resolution that display is via vc_dispmanx_display_get_info.
Anything you are trying to slave off vc_dispmanx_snapshot is a secondary display from the DispmanX's perspective, and you are effectively creating the VSYNCs every time you call vc_dispmanx_snapshot. (**)

I'm not going to implement a whole new API driving DispmanX in a different manner, mainly as resource is being expended to get onto the mainline KMS driver and deprecate DispmanX and similar.
If I get some time I will look at adding a callback off every dispmanx_update_submit, even though that seems the wrong solution to me. All filtering/rate limiting of those events will be up to the registered callback, not the firmware.
And I'll stress again that the Linux framebuffer does NOT use dispmanx_update_submit - it is single buffered and accepts tearing as a risk.

(*) There is a fallback to composing to memory first should the scene complexity exceed the threshold that the hardware can cope with on a just-in-time basis. The threshold isn't an easily defined hard limit as it is heavily dependent on SDRAM access speed, and it is possible to get the pipeline to underflow when it gets the threshold calculation wrong.
It can also be enforced using dispmanx_offline=1 in config.txt.
Either way it has a performance hit, as you've now added a full frame write and a full frame read at the output frame rate to SDRAM - at 1080p60 RGBX32 that's almost 4Gbit/s in each direction.
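
(For reference, the arithmetic behind that figure:

1920 × 1080 pixels × 4 bytes/pixel × 60 frames/s = 497,664,000 B/s ≈ 3.98 Gbit/s

and the same again for the read-back, so roughly 8 Gbit/s of added SDRAM traffic in total.)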

(**) If you really want to have no display, then you need the application to be calling vc_dispmanx_display_open_offscreen so that it is rendering to SDRAM, but I'm not sure if there is a way of then having a separate app calling vc_dispmanx_snapshot on it.

@JamesH65 (Collaborator) commented:

I'm not convinced it's worth spending any more time on this - it would be quite a lot of work for very minimal benefit.

@6by9 (Contributor) commented Jan 12, 2018

I'm not convinced it's worth spending any more time on this - it would be quite a lot of work for very minimal benefit.

The framework already exists for sending the vsync callbacks to registered clients. A copy/paste of that and changing the trigger code is relatively straightforward.
It may help in @juj 's specific case, but I can also see issues over being flooded with updates if using multiple DispmanX clients or using the frame buffer, hence my comments specifically relating to those.

juj added a commit to juj/fbcp-ili9341 that referenced this issue May 20, 2018
juj added a commit to juj/fbcp-ili9341 that referenced this issue May 20, 2018
juj added a commit to juj/fbcp-ili9341 that referenced this issue May 20, 2018
juj added a commit to juj/fbcp-ili9341 that referenced this issue May 24, 2018
…aspberrypi/userland#440) that tries to grab a smooth 60fps video stream, to ease a bit more on the CPU. On Pi Zero W, results in ~30% CPU usage; if this hack is omitted, CPU usage is at around ~16%.
@juj (Author) commented May 26, 2018

I have been working on porting fbcp-ili9341 to support bigger 480x320 displays and Pi Zero in addition to 320x240/Pi 3, and increasingly find the project juggling between high CPU consumption and stuttering because of the inefficient polling. There is a recent demo video that shows statistics about the CPU consumption aspect: https://www.youtube.com/watch?v=dqOLIHOjLq4 .

I wonder if adding an event for updated surfaces might be something to consider revisiting? The benefit would be considerable for SPI displays.

juj added a commit to juj/fbcp-ili9341 that referenced this issue Jun 14, 2018
…spberrypi/userland#440 to be more stable, and utilize a linear increment/geometric fallback type of shootahead mechanism to detect when frame rate of the content has recently increased. (e.g. returning to 60fps updating menu after a 24fps video)
juj added a commit to juj/fbcp-ili9341 that referenced this issue Jun 16, 2018
…land#440 : do not be as eager to slow down content frame rate if user plays a game, and spends a few seconds in a static menu, and then continues on 60fps content. Try to go to sleep better by expiring old frame intervals by time in addition to # of samples in the histogram, and try to improve the fast tracking logic by looking at the most recent interval as well.
@juj (Author) commented Jun 17, 2018

fbcp-ili9341 has added a statistics option to display a histogram of the amount of stuttering that occurs. The option looks like this in practice:

[picture: fbcp-ili9341 frame interval histogram overlay while playing Big Buck Bunny]

The above picture is from running omxplayer bigbuckbunny320p.mp4 obtained from wget http://adafruit-download.s3.amazonaws.com/bigbuckbunny320p.mp4 (see https://learn.adafruit.com/adafruit-pitft-28-inch-resistive-touchscreen-display-raspberry-pi/playing-videos) and tested in four different modes - either using vc_dispmanx_vsync_callback() or a background thread, and related options.

Read https://github.com/juj/fbcp-ili9341#about-smoothness for details on the different test modes. The video is posted at https://www.youtube.com/watch?v=IqzKT33Rwjc.

For example in the 24Hz running Big Buck Bunny video, frame intervals are all over the place when the vsync signal is being used. Snapshotting frames in a dedicated polling thread gives a much smoother and more pleasing result, but unfortunately that approach is not feasible on a Pi Zero.

If there were an event callback to get surfaces as they are updated, then I would expect the frame rate jitter to look much like the bottom right case in the video (or even better), but with none of the redundant CPU and GPU overhead that comes from having to snapshot excess frames only to discover that nothing had actually updated any surfaces.

@6by9 mentioned above that this might be a relatively straightforward addition, and it would make a world of difference for SPI-based displays, many of which could run at a smooth 60fps rate even on a Pi Zero, contrary to what is often stated based on badly performing software drivers of the past. If this would indeed be something relatively easy to do, I think it would definitely be worth the addition.

@bitbank2 commented:

The numbers thrown about here for potential update rates seem to be forgetting that the maximum full screen update rate of an SPI-connected ILI9341 is around 30FPS. The typical ILI9341 is connected by simple ribbon cable or discrete wires and tends to fail above 40MHz. At a higher voltage and with coax or twisted pair, maybe a higher clock rate could work. At 40MHz, a full screen update (320x240 at 16bpp) requires at least 1,228,800 bits transmitted (more due to commands and inter-byte delays). Starting from an ideal number of 1,228,800 bits, this gives a maximum frame rate of 40,000,000 / 1,228,800 ≈ 32FPS. On the RPI hardware, I believe the maximum workable SPI rate is 31.25MHz, because the clock uses a power of 2 divider from a 125MHz master clock. If you do partial frame updates, some games can run at 60fps (see my sg_free game emulator as an example of this). The RPI0 presents a bigger challenge because its single-core CPU doesn't have enough speed to do complex partial-frame update calculations, run the game emulator, and transmit the data over the SPI bus. The most practical way to reduce latency is to have the game emulator communicate directly with the SPI display (like my emulator does).

@juj (Author) commented Jun 21, 2018

Thanks for the comment. It is appreciated; however, many of the statements about refresh rate there are incorrect, or perhaps apply only to your specific case. E.g. SPI can well run at higher speeds, the inter-byte delay can be disabled, a power of 2 divider is not needed, and the master clock is not at 125MHz, so ILI9341s can go up to 60fps with full updates.

But that is sidetracking the conversation, so let's not pursue that aspect further (I'll just defer to https://github.com/juj/fbcp-ili9341 for more detail on the above), since this issue is not about frame rate, or getting content to run at 60fps, or "fast", whatever the attained refresh rate may be in a particular scenario. It is actually more visible in the opposite scenario - what happens when running content that is not authored for 60Hz - and more generally, it is about how to deal with the stuttering, latency and performance wastage that occur at all frame rates. Agreed, like you mention, the Pi Zero is a big challenge, and it is hit much harder by this issue.

Current DispmanX does work serviceably (with only a little bit of stutter because of this bug, which we would like to remove) if both the content and the display run at up to 60fps; see the top right corner test case in the video.

This issue is prominent in the scenario where the content does not run at 60fps, but at a lower refresh frequency. If you watch the video posted above, it has content that runs at 50Hz (Super Mario), 36Hz (OpenTyrian) and 24Hz (Big Buck Bunny). Many games, like Minecraft, cannot maintain a single fixed update rate, but vary depending on the complexity of the scene in view. Since there is no way to get events from the display driver, we need to resort to a background polling thread to get smooth frame rates. For example, if your SPI display were constrained to a max of 32FPS, you'd still like to watch 24Hz Big Buck Bunny on it as a smooth 24fps update and not have it stutter.

This issue is about the identified wasted CPU work that currently has to occur. If an event-based mechanism to obtain frames were added to DispmanX, such CPU wastage would be avoided, and SPI displays on the Pi Zero would be more viable as well.

Your work on creating a game emulator that communicates directly with an SPI display is admirable! Such an approach unfortunately does not scale: there are probably thousands of graphical applications developed for the Pi, and you cannot rewrite all of them to talk directly to an SPI display, which is why this bug entry asks about adding a feature to serve as a general solution.

@bitbank2 commented:

@juj thanks for the clarification. I guess I have low quality ILI9341 displays, because mine all start to fail at around 36MHz or so. Can you point me to a vendor link for a display that can handle 60FPS? If the RPI0 SPI master clock is not 125MHz, can you explain how it works in more detail? What clock rates are possible?

@juj (Author) commented Jun 21, 2018

Again, please let's not derail what is already becoming a very lengthy thread with unrelated conversation. Feel free to open a followup at e.g. the fbcp-ili9341 tracker or the Raspberry Pi Interfacing (DSI, CSI, I2C, etc.) subforum, or shoot me an email at posti. Thanks.

@gjahn commented Jun 15, 2019

Thanks @juj for opening this conversation and all the others bringing in their expertise of the relevant software stacks.

I understand that what is required for this issue to be solved is a callback for newly available content. I also understand that the main problem @6by9 sees with that is the possibly unbounded number of callbacks per second.

Maybe a simple solution to this conundrum could be the following: the ability to register a one-shot callback for new content. As in the example of the SPI driver, most applications will NOT be interested in a possibly unbounded number of callbacks; they need their own time to process the content, or have a maximum FPS limit. They do not want the above-mentioned 480 callbacks. So to solve this, they could request a one-time notification every time they are ready for new content.

Bonus points if the API would allow one to ask: "hey, my last frame number was n, is there already a newer frame available? If so, return false; if not, register a one-time callback for when there is a newer frame, and return true."
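
A hypothetical signature for that bonus-points variant might look like the following (names invented for illustration only; no such call exists in userland today):

// Hypothetical. If a frame newer than 'last_seen_frame' already exists,
// returns 0 ("false") immediately and registers nothing; otherwise registers
// 'cb' to fire exactly once when the next frame is submitted, and returns 1.
int vc_dispmanx_notify_frame_once(DISPMANX_DISPLAY_HANDLE_T display,
                                  uint32_t last_seen_frame,
                                  DISPMANX_CALLBACK_FUNC_T cb,
                                  void *cb_arg);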

@Cpasjuste commented:

Hi!

I'm very interested in this conversation (which is, very interesting... :)) as I'm waiting for a CM3 device with an ILI9341 screen (Freeplay CM3). Is there any news on this subject?

For specific systems (CM3/Recalbox), I think I may implement this in SDL/SDL2 to get the most out of emulators, but that is not a nice solution.

@xblue87 commented Apr 9, 2022

Hi @juj,
Thank you for the driver. It's working very well for me.

[photos of the display in action]

My setup:

My Build Command:
cmake -DILI9486=ON -DGPIO_TFT_DATA_CONTROL=24 -DGPIO_TFT_RESET_PIN=25 -DSTATISTICS=0 -DDISPLAY_ROTATE_180_DEGREES=ON -DSPI_BUS_CLOCK_DIVISOR=6 ..

It's working on Lakka too.
I changed the setting in distroconfig.txt
From:
dtoverlay=vc4-kms-v3d,cma-128
To
dtoverlay=vc4-fkms-v3d,cma-128

[photo: Lakka running on the display]

Hope it will help someone who is using the same screen as me.

Regards,
