Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buttons or performance profile randomly disappear #21577

Open
cockpituous opened this issue Feb 3, 2025 · 16 comments
Open

Buttons or performance profile randomly disappear #21577

cockpituous opened this issue Feb 3, 2025 · 16 comments
Labels
flake unstable test nightly

Comments

@cockpituous
Copy link
Contributor

cockpituous commented Feb 3, 2025

The job fedora-41/daily failed on commit 2933c6d.

Log: https://cockpit-logs.us-east-1.linodeobjects.com/pull-0-2933c6dd-20250203-013113-fedora-41-daily/log.html

Image

@martinpitt martinpitt changed the title Tests failed on 2933c6dd565a64f8a25121f1b3a7e3f1bdceaca1 Buttons or performance profile randomly disappear Feb 3, 2025
@martinpitt martinpitt moved this to detriment in Pilot tasks Feb 3, 2025
@martinpitt
Copy link
Member

We keep having this -- this smells like it was broken by a chromium update a few weeks ago. A lot of tests now randomly fail because primary buttons in dialogs disappear, or now this. Let's make this the official place to discuss and track. @jelly I think this isn't yet tracked anywhere?

I have to say, I have absolutely no idea about this yet. Some ideas:

  • Try to reproduce locally in the tasks container.
  • If it repeats, check if adding an extra sleep(1) or reattempting the screenshot makes any difference
  • Move pixel tests to firefox

@martinpitt martinpitt added the flake unstable test label Feb 3, 2025
@martinpitt
Copy link
Member

example of disappearing button, happens often in TestFiles.testPermissions

@jelly
Copy link
Member

jelly commented Feb 4, 2025

Attached a screenshot of the tuned issue as the s3 url might get lost in the future. For tuned I can see this being a flake, we need to wait on the tuned daemon. So let's check if our test asserts that the HTML exists.

The overview pixel test indeed does not wait on a profile, and I think profiles differ per VM. So let's try to first wait on that to "stabilize" in an new PR, land it and see if it still occurs.

@jelly
Copy link
Member

jelly commented Feb 4, 2025

However looking more carefully, we do have "#tuned-status-button" in the ignore list of b.assert_pixels so why it failing on that pixel.

@jelly
Copy link
Member

jelly commented Feb 4, 2025

This could be a simple TOCTOU issue, we load the page, don't wait tuned to load and then measure the "ignore rects" by grabbing the size of an empty button. After that take a screenshot, text is loaded and test fails.

@jelly
Copy link
Member

jelly commented Feb 4, 2025

Created an experiment in #21584 but this doesn't flake nearly as often so I am not sure if we would catch it.

With these findings I'll investigate the Files testPermissions issue which appears way more often.

@martinpitt
Copy link
Member

This TestSuperuser.testBasic flake is another related example from just now.

@martinpitt
Copy link
Member

martinpitt commented Feb 4, 2025

I get that disappearing button with rm -f Test*; test/check-application TestFiles.testPermissions in two out of three runs. However, all pixel tests in all projects currently fail for me due to some font noise; not sure what's going on there, I need to debug that first.

I get the failures in a completely clean toolbox create --image ghcr.io/cockpit-project/tasks:latest -c cclean (which is the same tag as .cockpit-ci/container's 2025-02-01), with rm -rf /tmp/x; mkdir /tmp/x; HOME=/tmp/x. I also tried rm -f Test*; test/common/run-tests TestFiles.testPermissions (that really shouldn't make a difference, but you never know..)

So I tried in an isolated podman (like in CI):

podman run -it --rm --device=/dev/kvm -v ~/.cache/cockpit-images:/cache/images -e COCKPIT_IMAGES_DATA_DIR=/cache/images ghcr.io/cockpit-project/tasks
git clone https://github.com/cockpit-project/cockpit-files
cd cockpit-files
make prepare-check
test/common/run-tests TestFiles.testPermissions

It does not have the pixel noise and does have the disappearing button. For faster iteration I tried vm-run and --machine/--browser, but that did not reproduce it in 5 attempts. Restarting the VM every time reproduces it fairly reliably.

As compromise/bisection, I ran the above in podman exec -itu 1111 cclean bash, i.e. the toolbox container, but with an isolated user/env. But that's just too painful with all the permission/tmp/user conflicts.

@jelly
Copy link
Member

jelly commented Feb 4, 2025

Looking at what started this all, the files permissions dialog. There is no wait condition for the Change button to show up so it makes sense that it fails. Then again the cancel button is shown and also strangely enough for the TestSuperUser flake we also don't see the primary button being drawn.

This happens in normal and in for example the RTL layout. We do resize the window once, and change layout which could cause a re-render

@martinpitt
Copy link
Member

martinpitt commented Feb 4, 2025

I changed the default wait_delay from 0.5s to 3s and 5s, and at least with 3s I still got a failure. So if this is a race condition, it's a very long one.

@jelly
Copy link
Member

jelly commented Feb 5, 2025

I changed the default wait_delay from 0.5s to 3s and 5s, and at least with 3s I still got a failure. So if this is a race condition, it's a very long one.

So:

wait_animations = True
wait_after_layout_change = False

                    if wait_after_layout_change:
                        time.sleep(wait_delay)
        if wait_animations:
            time.sleep(wait_delay)
            self.wait_js_cond('ph_count_animations(%s) == 0' % jsquote(selector))

So we would only wait on animations

@jelly
Copy link
Member

jelly commented Feb 5, 2025

         open_permissions_dialog('newfile')
+        b.wait_in_text(".pf-v5-c-modal-box button.pf-m-primary", "Change")
         b.assert_pixels(".pf-v5-c-modal-box", "permissions-modal")

Attempting this, makes it still fail. Copied the captureScreenshot code over to have a peek and all I got was:

Image

That uses:

        import base64
        rect = b.call_js_func('ph_element_clip', ".pf-v5-c-modal-box")
        def relative_clips(sels):
            return [(
                    r['x'] - rect['x'],
                    r['y'] - rect['y'],
                    r['x'] - rect['x'] + r['width'],
                    r['y'] - rect['y'] + r['height'])
                    for r in b.call_js_func('ph_selector_clips', sels)]
        rect["type"] = "box"
        ret = b.bidi("browsingContext.captureScreenshot", quiet=True,
                        context=b.driver.top_context,
                        clip=rect)
        png_now = base64.standard_b64decode(ret["data"])
        with open("before.png", 'wb') as f:
            f.write(png_now)

        b.assert_pixels(".pf-v5-c-modal-box", "permissions-modal")

The default layout should be:

        if not self.current_layout and os.environ.get("TEST_SHOW_BROWSER") in [None, "pixels"]:
            self.current_layout = self.layouts[0]
            size = self.current_layout["shell_size"]
            self._set_window_size(size[0], size[1])

where layouts is picked from:

        try:
            with open(f'{TEST_DIR}/browser-layouts.json') as fp:
                self.layouts = json.load(fp)
        except FileNotFoundError:
            self.layouts = default_layouts
        self.current_layout = None

So this should be:

default_layouts: Sequence[BrowserLayout] = (
    {
        "name": "desktop",
        "theme": "light",
        "shell_size": (1920, 1200),
        "content_size": (1680, 1130)
    },

And screenshotting the whole body gives me:

Image

So resolution is indeed set correct.

@jelly
Copy link
Member

jelly commented Feb 5, 2025

So looking into assert_pixels most of the stuff is harmless moving the mouse, then we do a layout change.

This code then walks over the layouts, and calls set_layout unnecessary at first as we are in the first layout (desktop). I've tried to apply a delay:

        b.assert_pixels(".pf-v5-c-modal-box", "permissions-modal", wait_after_layout_change=True, wait_delay=4)

But that still fails now in the dark theme. So adding a delay is not enough.

@allisonkarlitskaya
Copy link
Member

@martinpitt
Copy link
Member

martinpitt commented Feb 12, 2025

I updated the above local podman run recipe to include image caching. It's a bit rough (image-download fails, needs either local chmod 777 or commenting out the call from image-customize), but works.

I tried @jelly 's wait_in_text() on the right place (the test fails on the multiple-files-permissions-modal, not the one you added it to), but it doesn't help either. And both in CI as well as for me locally it always fails on the default resolution.

I tried adding this into assert_pixels_in_current_layout() so that it waits after each layout change:

        if selector == ".pf-v5-c-modal-box":
            self.wait_in_text(".pf-v5-c-modal-box button.pf-m-primary", "Change")

(the if is necessary as that test does a dozen pixel tests, and not all of them on the dialog). So this makes absolutely sure that it's not just a slow dialog or a PF/React bug. It still fails, so I'm now very strongly convinced that this is a browser bug.

For the lols I also re-attempted

        b.assert_pixels(".pf-v5-c-modal-box", "multiple-files-permissions-modal", wait_after_layout_change=True, wait_delay=3)

but as both me and @jelly already determined, it still fails. This also corroborates "browser bug".

I'm at my wit's end here. I'll send a PR that ignores the primary button in the pixel test and make sure it exists: cockpit-project/cockpit-files#949

martinpitt added a commit to cockpit-project/cockpit-files that referenced this issue Feb 12, 2025
cockpit-project/cockpit#21577 has plagued us
for long enough now, and nobody of us is able to figure out the root
cause or proper workaround. This is *not* a race nor a PF/React/etc.
bug, it really must be a Chromium bug. (See the issue for details)

As a bandaid, ignore the button for the pixel tests and assert that it
exists.
allisonkarlitskaya pushed a commit to cockpit-project/cockpit-files that referenced this issue Feb 12, 2025
cockpit-project/cockpit#21577 has plagued us
for long enough now, and nobody of us is able to figure out the root
cause or proper workaround. This is *not* a race nor a PF/React/etc.
bug, it really must be a Chromium bug. (See the issue for details)

As a bandaid, ignore the button for the pixel tests and assert that it
exists.
@martinpitt martinpitt moved this from next to detriment in Pilot tasks Feb 12, 2025
@martinpitt
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flake unstable test nightly
Projects
Status: detriment
Development

No branches or pull requests

4 participants