Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on wake when outputs have the same geometry #3238

Closed
AlanGriffiths opened this issue Feb 19, 2024 · 13 comments · Fixed by #3248 or #3299
Closed

Crash on wake when outputs have the same geometry #3238

AlanGriffiths opened this issue Feb 19, 2024 · 13 comments · Fixed by #3248 or #3299
Assignees
Labels
bug triaged Triage into JIRA to plan it in

Comments

@AlanGriffiths
Copy link
Collaborator

AlanGriffiths commented Feb 19, 2024

On my X1C1 laptop, with external screen attached (which, by chance, has the same geometry)

  1. login to Miriway (tried edge, beta and edge/mir-pr3237)
  2. Wait until outputs are switched off
  3. Use touchpad, keyboard or mouse to wake system

Expect: desktop session wakes
Actual: Momentary display of desktop followed by return to greeter

I do know that I've not seen this crash in many other scenarios

@AlanGriffiths
Copy link
Collaborator Author

OK, so here's the failure mode:

[2024-02-19 14:28:09.038892] < CRITICAL! > mirserver: terminate_with_current_exception(): ./src/platforms/gbm-kms/server/kms/kms_page_flipper.cpp(69): Throw in function virtual bool mir::graphics::gbm::KMSPageFlipper::schedule_flip(uint32_t, uint32_t, uint32_t)
Dynamic exception type: boost::wrapexcept<std::logic_error>
std::exception::what: Page flip for crtc_id is already scheduled

[2024-02-19 14:28:09.039049] < - debug - > mirserver: Handling Terminated from pid=1848353

@AlanGriffiths
Copy link
Collaborator Author

So, it looks like the necessary condition is for both outputs to be sharing a frame buffer. (On the same card and of the same geometry?)

And, to confirm, with display-config=sidebyside there's no problem.

@AlanGriffiths AlanGriffiths changed the title Crash on wake Crash on wake when cloning outputs with the same geometry Feb 19, 2024
@Saviq Saviq added the triaged Triage into JIRA to plan it in label Feb 19, 2024
@AlanGriffiths AlanGriffiths changed the title Crash on wake when cloning outputs with the same geometry Crash on wake when outputs have the same geometry Feb 19, 2024
@AlanGriffiths
Copy link
Collaborator Author

Further clarification: the display configuration was "static" but placing both outputs with the same geometry

@AlanGriffiths
Copy link
Collaborator Author

A bit more information: the two outputs are (correctly) in a group, and the first is displayed correctly. It is when a page flip is requested for the second output that a problem is detected.

@AlanGriffiths
Copy link
Collaborator Author

...and that is because, after the suspend, both outputs have the same crtc_id.

Before the suspend they are:

[2024-02-20 16:28:56.745592] < - debug - > gbm-kms: **** mgg::KMSPageFlipper::schedule_flip(60, 100, 83)
[2024-02-20 16:28:56.745816] < - debug - > gbm-kms: **** mgg::KMSPageFlipper::schedule_flip(45, 100, 76)

After the suspend:

[2024-02-20 16:30:49.359230] < - debug - > gbm-kms: **** mgg::KMSPageFlipper::schedule_flip(45, 104, 83)
[2024-02-20 16:30:49.366010] < - debug - > gbm-kms: **** mgg::KMSPageFlipper::schedule_flip(45, 104, 76)

I.e. the first output has acquired the second's crtc_id.

@AlanGriffiths AlanGriffiths self-assigned this Feb 20, 2024
@AlanGriffiths
Copy link
Collaborator Author

And for those that wondered: 2.15.0 also fails the same way. (So it isn't a side effect of the platform changes)

@AlanGriffiths
Copy link
Collaborator Author

The crtc_ids are coming coming from mgk::find_crtc_for_connector(drm_fd_, connector) - here's instrumentation of the mgg::RealKMSOutput::ensure_crtc() code:

initially:

[2024-02-20 17:41:12.562652] < - debug - > gbm-kms: **** RealKMSOutput::ensure_crtc() crtc_id=60, id=83, connector_id=83, drm_fd_=11
[2024-02-20 17:41:12.562703] < -warning- > gbm-kms: drmModeCrtcSetGamma failed: Invalid argument
[2024-02-20 17:41:12.562792] < - debug - > gbm-kms: **** RealKMSOutput::ensure_crtc() crtc_id=45, id=76, connector_id=76, drm_fd_=11

After wake:

[2024-02-20 17:42:33.734990] < - debug - > gbm-kms: **** RealKMSOutput::ensure_crtc() crtc_id=45, id=83, connector_id=83, drm_fd_=11
[2024-02-20 17:42:33.735045] < -warning- > gbm-kms: drmModeCrtcSetGamma failed: Invalid argument
[2024-02-20 17:42:34.239211] < - debug - > gbm-kms: **** RealKMSOutput::ensure_crtc() crtc_id=45, id=76, connector_id=76, drm_fd_=11

@AlanGriffiths
Copy link
Collaborator Author

I wonder if this has the same underlying cause as #2661 and #2674 (which were "fixed" by #2684 - however, we were not sure that was right, even though it fixed the symptoms)

@RAOF
Copy link
Contributor

RAOF commented Feb 21, 2024

Oh! Is this a race? The find_crtc_for_connector pulls KMS state, then iterates over it to find a CRTC; that CRTC won't be in use according to KMS until the drmModeSetCrtc call in set_crtc. This is all driven from two different DisplaySinks, which get composited to in parallel, so on resume these calls could race.

On the other hand, at Display::configure() time, or during startup, we sequentially instantiate our DisplaySinks, and they¹ get a CRTC at construction, so those calls have coherent KMS state.

¹: Indirectly - the constructor calls output->has_crtc_mismatch(), which calls ensure_crtc()

@RAOF
Copy link
Contributor

RAOF commented Feb 21, 2024

Possibly the answer is that we should remove all the places where we ensure_crtc() and instead make sure that we go through Display::configure() at any point where we might have lost KMS state.

@AlanGriffiths
Copy link
Collaborator Author

Oh! Is this a race?

Interesting thought, but don't think so: the ensure_crtc() calls that find a ctrc are all on the same thread.

@AlanGriffiths
Copy link
Collaborator Author

The find_crtc_for_connector pulls KMS state, then iterates over it to find a CRTC; that CRTC won't be in use according to KMS until the drmModeSetCrtc call in set_crtc. This is all driven from two different DisplaySinks, which get composited to in parallel, so on resume these calls could race.

This is very close to describing the failure move.

I've hacked find_crtc_and_index_for_connector() to not return the same crtc_id twice in a row. And that does avoid the crash. Now I just need to find a less hacky implementation. (And this could well be the underlying problem for the bugs mentioned above.)

@RAOF
Copy link
Contributor

RAOF commented Feb 22, 2024

Oh! Is this a race?

Interesting thought, but don't think so: the ensure_crtc() calls that find a ctrc are all on the same thread.

Oh, yeah. On a deeper look, I forgot that we still had more than one output per DisplaySink; I thought it'd removed that so each display was driven separately.

github-merge-queue bot pushed a commit that referenced this issue Apr 8, 2024
## Release Notes
mir (2.16.4) UNRELEASED; urgency=medium

  * Bugfix release:
    - Ignoring zero length gamma curves for KMS outputs (Fixes: #3238)
    - Add hardware cursor support (Fixes: #3198)
- Screenshots now respect clipping areas and rotated outputs (Fixes:
#3236)
      (Fixes: #3259)
    - Fixed custom attributes propagation on outputs
      (Fixes: canonical/ubuntu-frame#172)
    - Handle the Ubuntu 64-bit time_t apocalypse (Fixes: #3285)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug triaged Triage into JIRA to plan it in
Projects
None yet
3 participants