
Conversation

@BenHenning
Collaborator

@BenHenning BenHenning commented Nov 20, 2025

Fixes #746

This PR does a bunch of things, but most importantly it fixes repeatedly failing OS X tests that ran in CI. These fell into two categories:

  • A single scrolling test that would always fail due to a dropped rendering frame in headless Chrome at an inopportune point in the test.
  • Two alert tests that would reliably cause a different test to time out.

These two fixes required completely different solutions and underwent different investigations. A long discussion of the investigation can be found in the conversation thread in this PR.

Fixing alert dialog problems

These failures were a bit odd. Two flyout tests verify callback logic that the test sets up to open an alert dialog, and on their own these tests pass without issue. However, when either of them runs immediately before the move start tests, the first move start test passes and the second reliably times out in CI on OS X. WebdriverIO sometimes reports errors about focus issues, but not always.

The best guess here is that there's an actual issue in Chrome and/or WebdriverIO specifically around alert dialogs that puts the framework or browser into a bad state. Blockly shares no state across its tests (pages are reloaded and no cookies or local storage are used), yet forcing a session reload between tests (i.e. a full browser reopen) fixes the issue, which seems to confirm that the problem lies in the framework environment and not in Blockly. This seems like a reasonable fix.

Note that session reloading is generally discouraged because it increases test runtime, but CI completion times do not seem much slower than before even with the extra reloads. Locally, headless runs take 2-3x longer. Non-headless runs actually hit real failures with session reloading, but fortunately that combination has to be forced, since the current setup only ever enables session reloading for headless runs. It also seems likely that forcing session reloads like this will improve other behavioral inconsistencies or flakes in the tests.

Session reloading is only enabled for CI runs since, at least from observation, it shouldn't be needed locally (though the behavior can be demonstrated locally with the command: CI=true npm run test:ci).
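
For illustration, here's a minimal sketch of how per-test reloading can be wired up, assuming a Mocha-style afterEach hook and a driver handle to the WebdriverIO browser; the exact placement and condition in the PR may differ.

// Reload the browser session after each test, but only in CI, so that any
// bad state left behind by alert-dialog tests cannot leak into later tests.
afterEach(async function () {
  if (process.env.CI) {
    // reloadSession() ends the current WebDriver session and starts a new
    // one, i.e. a full browser reopen.
    await driver.reloadSession();
  }
});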

Fixing rendering synchronization issues

The rendering issue came down to the fact that Blockly can get into a temporarily broken state if its rendering queue is not run. Since Blockly relies on requestAnimationFrame for this, dropped frames (which are legitimate per the spec, and something we can observe specifically in Mac CI with headless Chrome) mean block positions do not update when the test expects them to, so the test fails. This can be easily reproduced even on Linux locally by linking against a version of Blockly that forces immediate rendering (e.g. pretending JavaFX is enabled).

The fix is to have the tests force a re-render of the workspace at select moments. During investigation it was easiest to do this everywhere execution pauses, since those are exactly the points where the test wants things to 'settle', and that is what this PR changes: every previous pause now routes through a single utility function that also tries to force Blockly to render, bypassing any dropped frames the browser might introduce and making the tests considerably more stable. It seems possible (though unproven) that other tests have also been flaky due to this behavior.

It's worth noting that this might be a bug in core Blockly, but in practice it seems nearly impossible to actually hit: even if a browser dropped or delayed a rendering frame, eventual consistency should kick in, since Blockly renders deterministically and will self-correct on the next render (which is fundamentally why the solution above works as well as it does).

Note also that this fix had a compatibility issue with the dialog fix above and required a special setting to disable synchronization, since WebdriverIO cannot execute browser-side code while a dialog is open (and attempting to detect an open dialog significantly slows CI runtime and produces a large number of console warnings).
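
To make the shape of this concrete, here is a rough sketch of such a pause-and-synchronize helper. The setSynchronizeCoreBlocklyRendering name comes from the PR's tests; everything else (the helper name, the assumption that Blockly is available as a page global, the exact flush logic) is illustrative and may differ from the actual code.

let synchronizeRendering = true;

// Tests that open alert dialogs call this with `false`, since WebdriverIO
// cannot execute browser-side code while a dialog is open.
function setSynchronizeCoreBlocklyRendering(enabled) {
  synchronizeRendering = enabled;
}

// Pause, then force Blockly to render so that a dropped requestAnimationFrame
// in headless Chrome cannot leave block positions stale when the test resumes.
async function pauseAndSynchronize(browser, pauseMs) {
  await browser.pause(pauseMs);
  if (!synchronizeRendering) return;
  await browser.execute(() => {
    const workspace = Blockly.getMainWorkspace();
    workspace.render();
    // Re-render each top-level block directly to flush anything still stuck
    // in the queued-render path (the same idea as the hack discussed in the
    // review thread below).
    for (const block of workspace.getTopBlocks(false)) {
      block.render();
    }
  });
}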

Other test infrastructure improvements

As part of the investigation it was useful to see what the tests were doing when they failed. WebdriverIO supports this with a built-in saveScreenshot function. All test suites have been updated in this PR to automatically check for failures at the end of each test; if a test fails, the current screen state is snapshotted and uploaded as a per-OS zip artifact in GitHub Actions.

This actually did help with determining the root rendering issue because the screenshot showed a correct block layout even though debug logs showed incorrect positions. At first this hinted toward inconsistencies between the data model and actual rendering, but continued investigation eventually revealed the rendering frame being dropped (and the screenshot function was likely forcing rendering and 'correcting' the broken state).

Failures will also be screenshotted for local tests and saved in test/webdriverio/test/failures/ with a filename corresponding to the test's name. These are automatically ignored from being added to Git via .gitignore.
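
As a sketch of the kind of hook involved (the file layout and naming here are assumptions, not the PR's exact code), WebdriverIO's afterTest hook exposes a passed flag, and the built-in saveScreenshot command can capture the failing state:

const fs = require('fs');

// In wdio.conf.js (or equivalent):
exports.config = {
  // ...other configuration...
  afterTest: async function (test, context, { passed }) {
    if (passed) return;
    const dir = 'test/webdriverio/test/failures';
    fs.mkdirSync(dir, { recursive: true });
    // Derive a filename from the test title and snapshot the current page;
    // CI uploads this directory as a per-OS artifact, and local runs keep it
    // out of Git via .gitignore.
    const name = test.title.replace(/[^a-z0-9]+/gi, '_');
    await browser.saveScreenshot(`${dir}/${name}.png`);
  },
};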

Finally, a couple other miscellaneous changes:

  • One additional pause was added to the scroll test. This probably isn't needed, but more pauses generally help with stability.
  • Ditto for sendKeyAndWait, though this was more of a contract correction: if PAUSE_TIME was zero, the helper no longer waited at all, which seemed incorrect. That's been fixed (see the sketch after this list).
  • There are some new string casts that seem necessary after some NPM upgrades, and they're generally innocuous.
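
A rough sketch of that sendKeyAndWait contract correction (the signature and the minimum wait are illustrative, not the PR's exact code; in the PR the wait presumably routes through the shared pause utility described earlier):

// Send a key press and then wait, even when the configured pause time is 0,
// so the helper always gives the browser a chance to process the key.
async function sendKeyAndWait(browser, key, pauseMs) {
  await browser.keys(key);
  // Previously a pauseMs of 0 skipped the wait entirely; clamping to a small
  // minimum restores the "send the key, then wait" contract.
  await browser.pause(Math.max(pauseMs, 1));
}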

And one final note: this PR mainly ensures that all tests generally pass, not that there are no flakes. From observation, tests may still flake on occasion, but they can be restarted and generally expected to pass. The OS X tests ran 14 times in a row without a failure before hitting the 6-hour GitHub Actions runtime limit (they seemed to slow down with each subsequent run, perhaps due to some sort of memory or resource leak). It may be possible to add a "try 3 times" approach to the WebdriverIO tests in CI in the future to significantly reduce flakes and increase confidence that failures indicate real problems, but this PR does not address that.

@BenHenning
Collaborator Author

BenHenning commented Nov 20, 2025

Looking at this scroll test first: Insert scrolls new block into view. It's failing with an error that the controls_if block isn't in the viewport. Comparing my local Linux run of the test vs. the CI run makes it pretty clear:

Linux local:

block bounds: { bottom: 1016, left: 53.34375, right: 213.34375, top: 904 }
viewport: { bottom: 1265.5, left: -10, right: 561, top: 814.5 }

Mac on CI:

block bounds: {
  bottom: 1882,
  left: 130.41592407226562,
  right: 290.4159240722656,
  top: 1770
}
viewport: { bottom: 1166.5, left: -10, right: 571, top: 705.5 }

The block is obviously way below the viewport for some reason. It's also interesting that the viewport sizes differ, which suggests discrepancies in test behavior between environments. This lack of cross-platform/cross-environment determinism will make debugging more difficult.

Attempt to use BiDi and upgrade WIO so that viewport manipulation can be
used instead of window management for better compatibility.
@BenHenning
Collaborator Author

BenHenning commented Nov 20, 2025

Oh hey. It actually passed when using viewport instead of window size. Cool--I didn't expect that. I was guessing there could be some differences but I wasn't expecting there to be enough of one to cause that much of a coordinate discrepancy. I'm surprised but I don't care about the why quite enough to dig on it.
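
Concretely, the swap amounts to something like the following (dimensions are illustrative): setWindowSize sets the outer browser window, so the usable content area varies with platform chrome, while the BiDi-backed setViewport command available in newer WebdriverIO versions pins the content area itself.

// Before: size the browser window; the resulting viewport differs per OS.
await browser.setWindowSize(800, 600);

// After: size the viewport directly over WebDriver BiDi, which keeps block
// and viewport coordinates consistent between Linux and Mac.
await browser.setViewport({ width: 800, height: 600 });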

@BenHenning
Collaborator Author

For reference, here are the results running the changed test on each platform.

Linux (local):

block bounds: { top: 904, bottom: 1016, left: 53.34375, right: 213.34375 }
viewport: { top: 740, bottom: 1340, left: -10, right: 570 }

Mac (CI):

block bounds: {
  top: 904,
  bottom: 1016,
  left: 54.20796203613281,
  right: 214.2079620361328
}
viewport: { top: 740, bottom: 1340, left: -10, right: 571 }

They are basically identical now which is great. :D

Unfortunately other Mac failures are likely a different problem entirely. Will dig on those next.

This test suite seems to now pass consistently between Linux & Mac.
@BenHenning
Collaborator Author

Aha--enabling BiDi caused a bunch of new failures! Interesting.

@BenHenning
Collaborator Author

BenHenning commented Nov 21, 2025

Updating to the latest WebdriverIO seems to fix a few more issues with BiDi and brings us back to roughly the same set of failures as before (I think), plus a new failure affecting Linux (maybe Linux only?).

Edit: Actually interestingly the previously fixed test is either failing again, or is failing when run in conjunction with other tests.

It's failing either again (after WIO update) or when run with the rest
of the test suite.
@BenHenning
Collaborator Author

Either it's failing again after the upgrade or, maybe more likely, it was flaky and incidentally passed earlier. Latest dimensions with the failure:

block bounds: {
  top: 1770,
  bottom: 1882,
  left: 130.41592407226562,
  right: 290.4159240722656
}
viewport: { top: 636, bottom: 1236, left: -10, right: 571 }

This is more or less what we saw earlier. It seems that the BiDi change is probably not needed (and we could downgrade back if I can find fixes for everything without it). The flake seems to be that, for some reason, the block is being put in the wrong place.

@BenHenning
Collaborator Author

Huh. This is surprisingly difficult to break. It seems much more consistent now, but it obviously can still fail per the two repros above.

@BenHenning
Collaborator Author

BenHenning commented Nov 21, 2025

It passed 50 times in a row. Okay...this might be really tricky actually. My current working theory is that the extra browser waits to perform the debugging are actually adding a pause that allows scrolling to stabilize, thus fixing the test.

@BenHenning
Collaborator Author

The extra pause seems to fix it. Removing all debug logs successfully re-failed the test (though we did see a failure earlier w/o needing to remove the later logs in the test, but oh well).

@BenHenning
Collaborator Author

BenHenning commented Nov 21, 2025

Attempting a 50x run of all tests except for the move tests since those time out. I think this might take a while. Will follow up with a rough estimate for completion if none fail.

Edit: Approximately 65-70 seconds per run so I'm guessing about an hour to run through 50 times without failure.

Edit 2: Though that's based on WebdriverIO's reported times. The runner actually took 2.5 minutes to fail so I think there's a lot of overhead not being reported here. That likely means this will take more than 2 hours to run through. I think GitHub has a 6 hour limit on runners so we should be below that.

@BenHenning
Collaborator Author

Interestingly running the whole suite actually caused the earlier test to fail again. Will try re-adding all of the logs and see if we can glean anything interesting.

@BenHenning
Collaborator Author

BenHenning commented Dec 5, 2025

Forcing session reloading fixes it. I may never fully understand the nuances that led to this particular issue, but I'm going to try reenabling the whole suite and also upping the timeout threshold for the Mac tests since there are a few that get close to 10s.

We're also going to see a bit of a slowdown with this change since there's a lot of overhead in reloading the entire browser for every suite, but it should tremendously improve stability, especially on Mac.

@BenHenning
Collaborator Author

Looks like most timeouts are already 30s so only needed to make a few adjustments there.

@BenHenning
Collaborator Author

Looks like everything is passing, and honestly the runtime isn't terrible. I'm going to try kicking off 50 runs and see how far both Linux & Mac get to see how stable they are now.

@BenHenning
Collaborator Author

Funnily enough, the Ubuntu tests failed immediately with a flake, but the Mac tests lasted 14 runs before hitting the 6-hour Actions timeout (interestingly, they took longer and longer each iteration, so presumably there's some sort of memory or resource leak happening here). I don't think we can conclude from this that the Mac tests are more stable than the Ubuntu ones, but 14 runs in a row without failing is a pretty great improvement over before.

@BenHenning BenHenning changed the title Attempt to fix mac tests Fix CI tests on OS X Dec 8, 2025
@BenHenning BenHenning changed the title Fix CI tests on OS X fix: CI tests on OS X Dec 8, 2025
Also, ignore failures locally so they don't appear in the Git working directory.
Previous logic was accidentally removed fully rather than reverted.

Also, remove unnecessary line.
Collaborator Author

@BenHenning BenHenning left a comment

Quick self-review pass.

@BenHenning
Collaborator Author

I want to try and fix the lint warnings before sending this out.

Collaborator Author

@BenHenning BenHenning left a comment

Self-reviewed lint fixes.

@BenHenning BenHenning marked this pull request as ready for review December 9, 2025 00:37
@BenHenning BenHenning requested a review from maribethb December 9, 2025 00:37
}
});
test('callbackkey is activated with enter', async function () {
  setSynchronizeCoreBlocklyRendering(false);
Collaborator

we could potentially change these tests to have the callback do something other than open the alert if that's going to be problematic.

but if not, i would add a comment above this line saying that it needs to be disabled since it opens an alert, otherwise i fear we will lose sight of when this needs to be set to false

Collaborator Author

Yeah, I don't think changing the test would be a problem but it's also fortunately not necessary (plus I'd imagine we may run into alert problems again in the future depending on how we address the create variables issue). Went ahead and added comments to clarify the need for this as suggested.

// If running in CI force a session reload to ensure no browser state can
// leak across test suites (since this can sometimes cause complex combined
// failures in CI).
await driver.reloadSession();
Collaborator

Is this run between each test or between suites? The PR description says between suites, but I think this is run between each test, though I may be mistaken. Not sure which one you intended so just wanted to double check.

Collaborator Author

Great call. You're correct, it is per test. I've found some new data on how it behaves, too, and I've updated the PR description and the comment here accordingly.

workspace.render();
// Flush the rendering queue (this is a slight hack to leverage
// BlockSvg.render() directly blocking on rendering finishing).
const blocks = workspace.getTopBlocks();
Collaborator

You could await the finishQueuedRenders() promise instead, which is designed for this purpose

Collaborator Author

I don't think that's true, unless I'm mistaken. finishQueuedRenders waits on requestAnimationFrame to execute, and the whole reason this issue exists is that the browser isn't running requestAnimationFrame callbacks (which means the render manager will never resolve the promise and this will end up blocking indefinitely).

The pathway here, while very hacky, synchronously flushes the queue directly to ensure that it actually renders rather than hoping it renders via the requestAnimationFrame callback.

Collaborator

Gotcha, that makes sense. I think triggerQueuedRenders might work then, as it cancels the animation frame and manually renders all the blocks in the queue. So it's possible this would potentially only re-render blocks that are "stuck" in the queue from the requestAnimationFrame callback not being called. But if that doesn't make sense to you then this hack is fine.

Collaborator Author

@BenHenning BenHenning left a comment

Self-reviewed comment fixes.

@BenHenning BenHenning merged commit 664ecf4 into RaspberryPiFoundation:main Dec 10, 2025
7 checks passed
@BenHenning BenHenning deleted the attempt-to-fix-mac-tests branch December 10, 2025 22:45
Successfully merging this pull request may close these issues.

OS X WebdriverIO tests have legitimate failures