
Conversation

@BenHenning
Collaborator

@BenHenning BenHenning commented Nov 20, 2025

Fixes #746

This PR does a bunch of things, but most importantly it fixes repeatedly failing OS X tests that ran in CI. These fell into two categories:

  • A single scrolling test that would always fail due to a dropped rendering frame in headless Chrome at an inopportune point in the test.
  • Two alert tests that would reliably cause a different test to time out.

These two fixes required completely different solutions and underwent different investigations. A long discussion of the investigation can be found in the conversation thread in this PR.

Fixing alert dialog problems

These failures were a bit odd. Two flyout tests verify callback logic that the test sets up to open an alert dialog, and on their own these tests pass without issue. However, when either of them runs immediately before the move start tests, the first move start test passes and the second reliably times out in CI on OS X. WebdriverIO sometimes reports errors about focus issues, but not always.

The best guess here is that there's an actual issue in Chrome and/or WebdriverIO specifically around alert dialogs that puts the framework or browser into a bad state. Blockly shares no state across its tests (pages are reloaded and no cookies or local storage are used), yet forcing a session reload between tests (i.e. a full browser reopen) fixes the issue, which seems to confirm that the problem lies in the framework environment and not in Blockly. This seems like a reasonable fix.

Note that session reloading is generally discouraged because it increases test runtime, but CI completion times do not seem much slower than before even with the extra reloads. Locally, headless runs take 2-3x longer. Non-headless runs actually hit real failures with session reloading, but fortunately that combination has to be forced, since the current setup only ever enables session reloading for headless runs. It also seems likely that forcing session reloads like this will improve other behavioral inconsistencies or flakes in the tests.

Session reloading is only enabled for CI runs since, at least from observation, it shouldn't be needed locally (though the behavior can be demonstrated locally with the command: CI=true npm run test:ci).
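
For illustration, here's a minimal sketch of how per-test reloading can be wired up, assuming a Mocha-style afterEach hook and a driver handle to the WebdriverIO browser; the exact placement and condition in the PR may differ.

// Reload the browser session after each test, but only in CI, so that any
// bad state left behind by alert-dialog tests cannot leak into later tests.
afterEach(async function () {
  if (process.env.CI) {
    // reloadSession() ends the current WebDriver session and starts a new
    // one, i.e. a full browser reopen.
    await driver.reloadSession();
  }
});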

Fixing rendering synchronization issues

The rendering issue came down to the fact that Blockly can get into a temporarily broken state if its rendering queue is not run. Since Blockly relies on requestAnimationFrame for this, dropped frames (which are legitimate per the spec, and something we can observe specifically in Mac CI with headless Chrome) mean block positions do not update when the test expects them to, so the test fails. This can be easily reproduced even on Linux locally by linking against a version of Blockly that forces immediate rendering (e.g. pretending JavaFX is enabled).

The fix is to have the tests force a re-render of the workspace at select moments. During investigation it was easiest to do this everywhere execution pauses, since those are exactly the points where the test wants things to 'settle', and that is what this PR changes: every previous pause now routes through a single utility function that also tries to force Blockly to render, bypassing any dropped frames the browser might introduce and making the tests considerably more stable. It seems possible (though unproven) that other tests have also been flaky due to this behavior.

It's worth noting that this might be a bug in core Blockly, but in practice it seems nearly impossible to actually hit: even if a browser dropped or delayed a rendering frame, eventual consistency should kick in, since Blockly renders deterministically and will self-correct on the next render (which is fundamentally why the solution above works as well as it does).

Note also that this fix had a compatibility issue with the dialog fix above and required a special setting to disable synchronization, since WebdriverIO cannot execute browser-side code while a dialog is open (and attempting to detect an open dialog significantly slows CI runtime and produces a large number of console warnings).
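
To make the shape of this concrete, here is a rough sketch of such a pause-and-synchronize helper. The setSynchronizeCoreBlocklyRendering name comes from the PR's tests; everything else (the helper name, the assumption that Blockly is available as a page global, the exact flush logic) is illustrative and may differ from the actual code.

let synchronizeRendering = true;

// Tests that open alert dialogs call this with `false`, since WebdriverIO
// cannot execute browser-side code while a dialog is open.
function setSynchronizeCoreBlocklyRendering(enabled) {
  synchronizeRendering = enabled;
}

// Pause, then force Blockly to render so that a dropped requestAnimationFrame
// in headless Chrome cannot leave block positions stale when the test resumes.
async function pauseAndSynchronize(browser, pauseMs) {
  await browser.pause(pauseMs);
  if (!synchronizeRendering) return;
  await browser.execute(() => {
    const workspace = Blockly.getMainWorkspace();
    workspace.render();
    // Re-render each top-level block directly to flush anything still stuck
    // in the queued-render path (the same idea as the hack discussed in the
    // review thread below).
    for (const block of workspace.getTopBlocks(false)) {
      block.render();
    }
  });
}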

Other test infrastructure improvements

As part of the investigation it was useful to see what the tests were doing when they failed. WebdriverIO supports this with a built-in saveScreenshot function. All test suites have been updated in this PR to automatically check for failures at the end of each test; if a test fails, the current screen state is snapshotted and uploaded as a per-OS zip artifact in GitHub Actions.

This actually did help with determining the root rendering issue because the screenshot showed a correct block layout even though debug logs showed incorrect positions. At first this hinted toward inconsistencies between the data model and actual rendering, but continued investigation eventually revealed the rendering frame being dropped (and the screenshot function was likely forcing rendering and 'correcting' the broken state).

Failures will also be screenshotted for local tests and saved in test/webdriverio/test/failures/ with a filename corresponding to the test's name. These are automatically ignored from being added to Git via .gitignore.
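
As a sketch of the kind of hook involved (the file layout and naming here are assumptions, not the PR's exact code), WebdriverIO's afterTest hook exposes a passed flag, and the built-in saveScreenshot command can capture the failing state:

const fs = require('fs');

// In wdio.conf.js (or equivalent):
exports.config = {
  // ...other configuration...
  afterTest: async function (test, context, { passed }) {
    if (passed) return;
    const dir = 'test/webdriverio/test/failures';
    fs.mkdirSync(dir, { recursive: true });
    // Derive a filename from the test title and snapshot the current page;
    // CI uploads this directory as a per-OS artifact, and local runs keep it
    // out of Git via .gitignore.
    const name = test.title.replace(/[^a-z0-9]+/gi, '_');
    await browser.saveScreenshot(`${dir}/${name}.png`);
  },
};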

Finally, a couple other miscellaneous changes:

  • One additional pause was added to the scroll test. This probably isn't needed, but more pauses generally help with stability.
  • Ditto for sendKeyAndWait, though this was more of a contract correction: if PAUSE_TIME was zero, the helper no longer waited at all, which seemed incorrect. That's been fixed (see the sketch after this list).
  • There are some new string casts that seem necessary after some NPM upgrades, and they're generally innocuous.
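
A rough sketch of that sendKeyAndWait contract correction (the signature and the minimum wait are illustrative, not the PR's exact code; in the PR the wait presumably routes through the shared pause utility described earlier):

// Send a key press and then wait, even when the configured pause time is 0,
// so the helper always gives the browser a chance to process the key.
async function sendKeyAndWait(browser, key, pauseMs) {
  await browser.keys(key);
  // Previously a pauseMs of 0 skipped the wait entirely; clamping to a small
  // minimum restores the "send the key, then wait" contract.
  await browser.pause(Math.max(pauseMs, 1));
}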

And one final note: this PR mainly ensures that all tests generally pass, not that there are no flakes. From observation, tests may still flake on occasion, but they can be restarted and generally expected to pass. The OS X tests ran 14 times in a row without a failure before hitting the 6-hour GitHub Actions runtime limit (they seemed to slow down with each subsequent run, perhaps due to some sort of memory or resource leak). It may be possible to add a "try 3 times" approach to the WebdriverIO tests in CI in the future to significantly reduce flakes and increase confidence that failures indicate real problems, but this PR does not address that.

@BenHenning
Collaborator Author

BenHenning commented Nov 20, 2025

Looking at this scroll test first: Insert scrolls new block into view. It's failing with an error that the controls_if block isn't in the viewport. Comparing my local Linux run of the test vs. the CI run makes it pretty clear:

Linux local:

block bounds: { bottom: 1016, left: 53.34375, right: 213.34375, top: 904 }
viewport: { bottom: 1265.5, left: -10, right: 561, top: 814.5 }

Mac on CI:

block bounds: {
  bottom: 1882,
  left: 130.41592407226562,
  right: 290.4159240722656,
  top: 1770
}
viewport: { bottom: 1166.5, left: -10, right: 571, top: 705.5 }

The block is obviously way below the viewport for some reason. It's also interesting that the viewport sizes differ, which suggests discrepancies in test behavior between environments. This lack of cross-platform/cross-environment determinism will make debugging more difficult.

Attempt to use BiDi and upgrade WIO so that viewport manipulation can be
used instead of window management for better compatibility.
@BenHenning
Collaborator Author

BenHenning commented Nov 20, 2025

Oh hey. It actually passed when using viewport instead of window size. Cool--I didn't expect that. I was guessing there could be some differences but I wasn't expecting there to be enough of one to cause that much of a coordinate discrepancy. I'm surprised but I don't care about the why quite enough to dig on it.
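
Concretely, the swap amounts to something like the following (dimensions are illustrative): setWindowSize sets the outer browser window, so the usable content area varies with platform chrome, while the BiDi-backed setViewport command available in newer WebdriverIO versions pins the content area itself.

// Before: size the browser window; the resulting viewport differs per OS.
await browser.setWindowSize(800, 600);

// After: size the viewport directly over WebDriver BiDi, which keeps block
// and viewport coordinates consistent between Linux and Mac.
await browser.setViewport({ width: 800, height: 600 });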

@BenHenning
Collaborator Author

For reference, here are the results running the changed test on each platform.

Linux (local):

block bounds: { top: 904, bottom: 1016, left: 53.34375, right: 213.34375 }
viewport: { top: 740, bottom: 1340, left: -10, right: 570 }

Mac (CI):

block bounds: {
  top: 904,
  bottom: 1016,
  left: 54.20796203613281,
  right: 214.2079620361328
}
viewport: { top: 740, bottom: 1340, left: -10, right: 571 }

They are basically identical now which is great. :D

Unfortunately other Mac failures are likely a different problem entirely. Will dig on those next.

This test suite seems to now pass consistently between Linux & Mac.
@BenHenning
Collaborator Author

Aha--enabling BiDi caused a bunch of new failures! Interesting.

@BenHenning
Collaborator Author

BenHenning commented Nov 21, 2025

Updating to the latest WebdriverIO seems to fix a few more issues with BiDi and brings us back to roughly the same set of failures as before (I think), plus a new failure affecting Linux (maybe Linux only?).

Edit: Actually interestingly the previously fixed test is either failing again, or is failing when run in conjunction with other tests.

It's failing either again (after WIO update) or when run with the rest
of the test suite.
@BenHenning
Collaborator Author

Either it's failing again after the upgrade or, maybe more likely, it was flaky and incidentally passed earlier. Latest dimensions with the failure:

block bounds: {
  top: 1770,
  bottom: 1882,
  left: 130.41592407226562,
  right: 290.4159240722656
}
viewport: { top: 636, bottom: 1236, left: -10, right: 571 }

This is more or less what we saw earlier. It seems that the BiDi change is probably not needed (and we could downgrade back if I can find fixes for everything without it). The flake seems to be that, for some reason, the block is being put in the wrong place.

@BenHenning
Collaborator Author

Huh. This is surprisingly difficult to break. It seems much more consistent now, but it obviously can still fail per the two repros above.

@BenHenning
Collaborator Author

BenHenning commented Nov 21, 2025

It passed 50 times in a row. Okay...this might be really tricky actually. My current working theory is that the extra browser waits to perform the debugging are actually adding a pause that allows scrolling to stabilize, thus fixing the test.

@BenHenning
Collaborator Author

The extra pause seems to fix it. Removing all debug logs successfully re-failed the test (though we did see a failure earlier w/o needing to remove the later logs in the test, but oh well).

@BenHenning
Collaborator Author

BenHenning commented Nov 21, 2025

Attempting a 50x run of all tests except for the move tests since those time out. I think this might take a while. Will follow up with a rough estimate for completion if none fail.

Edit: Approximately 65-70 seconds per run so I'm guessing about an hour to run through 50 times without failure.

Edit 2: Though that's based on WebdriverIO's reported times. The runner actually took 2.5 minutes to fail so I think there's a lot of overhead not being reported here. That likely means this will take more than 2 hours to run through. I think GitHub has a 6 hour limit on runners so we should be below that.

@BenHenning
Collaborator Author

Interestingly running the whole suite actually caused the earlier test to fail again. Will try re-adding all of the logs and see if we can glean anything interesting.

@BenHenning
Collaborator Author

BenHenning commented Dec 5, 2025

Forcing session reloading fixes it. I may never fully understand the nuances that led to this particular issue, but I'm going to try reenabling the whole suite and also upping the timeout threshold for the Mac tests since there are a few that get close to 10s.

We're also going to see a bit of a slowdown with this change since there's a lot of overhead in reloading the entire browser for every suite, but it should tremendously improve stability, especially on Mac.

@BenHenning
Collaborator Author

Looks like most timeouts are already 30s so only needed to make a few adjustments there.

@BenHenning
Collaborator Author

Looks like everything is passing, and honestly the runtime isn't terrible. I'm going to try kicking off 50 runs and see how far both Linux & Mac get to see how stable they are now.

@BenHenning
Collaborator Author

Funnily enough, the Ubuntu tests failed immediately with a flake, but the Mac tests lasted 14 runs before hitting the 6-hour Actions timeout (interestingly, they took longer and longer each iteration, so presumably there's some sort of memory or resource leak happening here). I don't think we can conclude from this that the Mac tests are more stable than the Ubuntu ones, but 14 runs in a row without failing is a pretty great improvement over before.

@BenHenning BenHenning changed the title Attempt to fix mac tests Fix CI tests on OS X Dec 8, 2025
@BenHenning BenHenning changed the title Fix CI tests on OS X fix: CI tests on OS X Dec 8, 2025
Also, ignore failures locally so they don't appear in the Git working directory.
Previous logic was accidentally removed fully rather than reverted.

Also, remove unnecessary line.
Collaborator Author

@BenHenning BenHenning left a comment

Quick self-review pass.

@BenHenning
Collaborator Author

I want to try and fix the lint warnings before sending this out.

Collaborator Author

@BenHenning BenHenning left a comment

Self-reviewed lint fixes.

@BenHenning BenHenning marked this pull request as ready for review December 9, 2025 00:37
@BenHenning BenHenning requested a review from maribethb December 9, 2025 00:37
}
});
test('callbackkey is activated with enter', async function () {
  setSynchronizeCoreBlocklyRendering(false);
Collaborator

we could potentially change these tests to have the callback do something other than open the alert if that's going to be problematic.

but if not, i would add a comment above this line saying that it needs to be disabled since it opens an alert, otherwise i fear we will lose sight of when this needs to be set to false

Collaborator Author

Yeah, I don't think changing the test would be a problem but it's also fortunately not necessary (plus I'd imagine we may run into alert problems again in the future depending on how we address the create variables issue). Went ahead and added comments to clarify the need for this as suggested.

// If running in CI force a session reload to ensure no browser state can
// leak across test suites (since this can sometimes cause complex combined
// failures in CI).
await driver.reloadSession();
Collaborator

Is this run between each test or between suites? The PR description says between suites, but I think this is run between each test, though I may be mistaken. Not sure which one you intended so just wanted to double check.

Collaborator Author

Great call. You're correct, it is per test. I've found some new data on how it behaves, too, and I've updated the PR description and the comment here accordingly.

workspace.render();
// Flush the rendering queue (this is a slight hack to leverage
// BlockSvg.render() directly blocking on rendering finishing).
const blocks = workspace.getTopBlocks();
Collaborator

You could await the finishQueuedRenders() promise instead, which is designed for this purpose

Collaborator Author

I don't think that's true, unless I'm mistaken. finishQueuedRenders waits on requestAnimationFrame to execute, and the whole reason this issue exists is that the browser isn't running requestAnimationFrame callbacks (which means the render manager will never resolve the promise and this will end up blocking indefinitely).

The pathway here, while very hacky, synchronously flushes the queue directly to ensure that it actually renders rather than hoping it renders via the requestAnimationFrame callback.

Collaborator

Gotcha, that makes sense. I think triggerQueuedRenders might work then, as it cancels the animation frame and manually renders all the blocks in the queue. So it's possible this would potentially only re-render blocks that are "stuck" in the queue from the requestAnimationFrame callback not being called. But if that doesn't make sense to you then this hack is fine.

Collaborator Author

@BenHenning BenHenning left a comment

Self-reviewed comment fixes.

@BenHenning BenHenning merged commit 664ecf4 into RaspberryPiFoundation:main Dec 10, 2025
7 checks passed
@BenHenning BenHenning deleted the attempt-to-fix-mac-tests branch December 10, 2025 22:45
Successfully merging this pull request may close these issues.

OS X WebdriverIO tests have legitimate failures