-
Notifications
You must be signed in to change notification settings - Fork 13
fix: CI tests on OS X #770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: CI tests on OS X #770
Conversation
|
Looking at this scroll test first: Linux local: Mac on CI: The block is obviously way below the viewport for some reason. Also it's interesting to see that the viewport sizes are different which suggests discrepancies in test behaviors. Lack of cross-platform or cross-environment determinism will make this more difficult. |
Attempt to use BiDi and upgrade WIO so that viewport manipulation can be used instead of window management for better compatibility.
|
Oh hey. It actually passed when using viewport instead of window size. Cool--I didn't expect that. I was guessing there could be some differences but I wasn't expecting there to be enough of one to cause that much of a coordinate discrepancy. I'm surprised but I don't care about the why quite enough to dig on it. |
|
For reference, here are the results running the changed test on each platform. Linux (local): Mac (CI): They are basically identical now which is great. :D Unfortunately other Mac failures are likely a different problem entirely. Will dig on those next. |
This test suite seems to now pass consistently between Linux & Mac.
|
Aha--enabling BiDi caused a bunch of new failures! Interesting. |
|
Updating to latest Webdriver seems to fix a few more issues with BiDi and bring us back to basically where we were at before for failures (I think), plus a new failure affecting Linux (maybe only?). Edit: Actually interestingly the previously fixed test is either failing again, or is failing when run in conjunction with other tests. |
It's failing either again (after WIO update) or when run with the rest of the test suite.
|
Either it's failing again after the upgrade or, maybe more likely, it was flaky and incidentally passed earlier. Latest dimensions with the failure: This is more or less what we saw earlier. It seems that the BiDi change is probably not needed (and we could downgrade back if I can find fixes for everything without it). The flake seems to be that, for some reason, the block is being put in the wrong place. |
|
Huh. This is surprisingly difficult to break. It seems much more consistent now, but it obviously can still fail per the two repros above. |
|
It passed 50 times in a row. Okay...this might be really tricky actually. My current working theory is that the extra browser waits to perform the debugging are actually adding a pause that allows scrolling to stabilize, thus fixing the test. |
|
The extra pause seems to fix it. Removing all debug logs successfully re-failed the test (though we did see a failure earlier w/o needing to remove the later logs in the test, but oh well). |
|
Attempting a 50x run of all tests except for the move tests since those time out. I think this might take a while. Will follow up with a rough estimate for completion if none fail. Edit: Approximately 65-70 seconds per run so I'm guessing about an hour to run through 50 times without failure. Edit 2: Though that's based on WebdriverIO's reported times. The runner actually took 2.5 minutes to fail so I think there's a lot of overhead not being reported here. That likely means this will take more than 2 hours to run through. I think GitHub has a 6 hour limit on runners so we should be below that. |
|
Interestingly running the whole suite actually caused the earlier test to fail again. Will try re-adding all of the logs and see if we can glean anything interesting. |
|
Forcing session reloading fixes it. I may never fully understand the nuances that led to this particular issue, but I'm going to try reenabling the whole suite and also upping the timeout threshold for the Mac tests since there are a few that get close to 10s. We also are going to have a bit slowdown with this change since there's a lot of overhead in reloading the entire browser for every suite, but it should tremendously improve stability especially on Mac. |
|
Looks like most timeouts are already 30s so only needed to make a few adjustments there. |
|
Looks like everything is passing, and honestly the runtime isn't terrible. I'm going to try kicking off 50 runs and see how far both Linux & Mac get to see how stable they are now. |
|
Funnily the Ubuntu tests failed immediately with a flake but the Mac tests lasted 14 runs before hitting the 6 hour actions timeout (interestingly they took longer and longer to run each iteration so presumably there's some sort of bad memory or resource leak happening here). I don't think we can assume this means the Mac tests are more stable than the Ubuntu ones, but 14x runs in a row without failing is a pretty great improvement over before. |
Also, ignore failures locally so they don't appear in the Git working direcctory.
Previous logic was accidentally removed fully rather than reverted. Also, remove unnecessary line.
BenHenning
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick self-review pass.
|
I want to try and fix the lint warnings before sending this out. |
BenHenning
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self-reviewed lint fixes.
| } | ||
| }); | ||
| test('callbackkey is activated with enter', async function () { | ||
| setSynchronizeCoreBlocklyRendering(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could potentially change these tests to have the callback do something other than open the alert if that's going to be problematic.
but if not, i would add a comment above this line saying that it needs to be disabled since it opens an alert, otherwise i fear we will lose sight of when this needs to be set to false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't think changing the test would be a problem but it's also fortunately not necessary (plus I'd imagine we may run into alert problems again in the future depending on how we address the create variables issue). Went ahead and added comments to clarify the need for this as suggested.
| // If running in CI force a session reload to ensure no browser state can | ||
| // leak across test suites (since this can sometimes cause complex combined | ||
| // failures in CI). | ||
| await driver.reloadSession(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this run between each test or between suites? The PR description says between suites but this I think this is run between each test, though I may be mistaken. Not sure which one you intended so just wanted to double check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great call. You're correct, it is per test. I've found some new data on how it behaves, too, and I've updated the PR description and the comment here accordingly.
| workspace.render(); | ||
| // Flush the rendering queue (this is a slight hack to leverage | ||
| // BlockSvg.render() directly blocking on rendering finishing). | ||
| const blocks = workspace.getTopBlocks(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could await the finishQueuedRenders() promise instead, which is designed for this purpose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's true unless I'm mistaken. finishQueuedRenders waits on requestAnimationFrame to execute and the whole reason this issue exists is because the browser isn't running requestAnimationFrame callbacks (which means the render manager will never finish the promise and this will end up blocking indefinitely).
The pathway here, while very hacky, synchronously flushes the queue directly to ensure that it actual renders rather than hoping it renders per the requestAnimationFrame callback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, that makes sense. I think triggerQueuedRenders might work then, as it cancels the animation frame and manually renders all the blocks in the queue. So it's possible this would potentially only re-render blocks that are "stuck" in the queue from the requestAnimationFrame callback not being called. But if that doesn't make sense to you then this hack is fine.
BenHenning
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self-reviewed comment fixes.
Fixes #746
This PR does a bunch of things, but most importantly it fixes repeatedly failing OS X tests that ran in CI. These fell into two categories:
alerttests that would reliably cause a different test to timeout.These two fixes required completely different solutions and underwent different investigations. A long discussion of the investigation can be found in the conversation thread in this PR.
Fixing
alertdialog problemsThese failures were a bit odd. Two flyout tests verify callback logic which is set up in the test to open an alert dialog. These tests pass without an issue. However, when either one of these tests is run exactly before move start tests (both of them). The first move start test will pass, the second will reliably time out in CI on OS X. Sometimes WebdriverIO will include errors about focus issues, but this didn't always happen.
The best guess here is that there's an actual issue in Chrome and/or WebdriverIO specifically with alert dialogs that is putting the framework or browser into a bad state. Blockly shares no state across its tests due to page reloads and no cookies or local storage being leveraged, but forcing a session reload between tests (i.e. a full browser reopen) fixes the issue which seems to confirm that it's an issue in the framework environment and not Blockly. This seems like a reasonable fix.
Note that this is sort of discouraged due to session reloading increasing test runtime, but the CI completions do not seem much slower than before even with the extra reloads. Locally it takes 2-3x longer to run when using headless mode. When using non-headless mode it encounters actual failures, but fortunately this needs to be forced since the setup currently is such that session reloading will only ever happen with headless runs. It also seems likely that forcing session reloads like this will actually improve other behavioral inconsistencies or flakes in the tests, as well.
Session reloading has only been enabled for CI runs of tests since it shouldn't be needed locally, at least per observation (though it can be demonstrated locally using the command:
CI=true npm run test:ci).Fixing rendering synchronization issues
The rendering issue came down to the fact that Blockly can get into a temporarily broken state if its rendering queue is not run. Since Blockly relies on
requestAnimationFramefor this to occur, dropped frames (which are legitimate according to the spec and something we can observe specifically in Mac CI with headless Chrome) will lead to block positioning not updating when the test expects, leading to a failing test. This can be easily reproduced even on Linux locally by linking against a version of Blockly that forces immediate rendering (e.g. pretending JavaFX is enabled).The fix was to force the tests to try and trigger a re-render of the workspace at select moments. During investigation it was easiest just to do this everywhere there's a moment to pause execution since that's effectively a time when the test actually wants to wait for things to 'settle', so this is what's actually changed in the PR. All previous moments of pausing execution now route through a specific utility function that also will try to force rendering of Blockly, bypassing any dropped frames that the browser might introduce and ensuring tests are a lot more stable. It seems possible (though it hasn't been proven) that other tests may have also been flaky due to this behavior.
It's important to note that this might be a bug in core Blockly, but the reality is it seems nearly impossible to ever actually occur. Even if a browser dropped or delayed a rendering frame, eventual consistency should kick in since Blockly does deterministically render itself and thus will self-correct on the next render (which is fundamentally why the solution above works as well as it does).
Note also that this fix had a compatibility issue with the dialog fix above and required a special setting to disable synchronization since it's not valid in WebdriverIO to execute browser-side code while a dialog is open (and attempting to detect this results in a significant slowdown in CI runtime and a large increase in console warnings).
Other test infrastructure improvements
As part of the investigation it was useful to try and see what the tests were doing when they failed. WebdriverIO supports this with a built-in
saveScreenshotfunction. All test suites have been updated in this PR to automatically check for failures when they finish each test and, if there's a failure, the current screen state will be snapshotted and uploaded to a zip file (specific to the OS) in GitHub Actions.This actually did help with determining the root rendering issue because the screenshot showed a correct block layout even though debug logs showed incorrect positions. At first this hinted toward inconsistencies between the data model and actual rendering, but continued investigation eventually revealed the rendering frame being dropped (and the screenshot function was likely forcing rendering and 'correcting' the broken state).
Failures will also be screenshotted for local tests and saved in
test/webdriverio/test/failures/with a filename corresponding to the test's name. These are automatically ignored from being added to Git via.gitignore.Finally, a couple other miscellaneous changes:
sendKeyAndWaitthough this was more of a contract correction: ifPAUSE_TIMEwas zero then it wasn't actually waiting anymore which seemed incorrect. That's been fixed.And one final note: this PR mainly ensures all tests generally pass, not that there are no flakes. From observation tests may still flake on occasion, but they can be restarted and generally expected to pass. The OS X tests ran 14 times in a row without a failure before running into the 6 hour GitHub Actions runtime limit (it seems they slowed down each subsequent run, perhaps due to some sort of memory or resource leak). It may be possible to add a "try 3 times" approach to the WebdriverIO tests in CI in the future to significantly reduce flakes and increase the trustworthiness of failures to be actual problems, but this PR does not address that.