e2e pool-load-success test hanging on close #3063
In an attempt to divide & conquer, I made a branch zui-3063-electron-only, which was started at Zui commit 59c4c4a, right before the changes in #3061 were merged.
I studied the problem a bit at the packet layer, capturing successful and failed test runs in the attached files success.pcap.gz and failure.pcap.gz, respectively. Since the problem seems to happen right as the app is being closed at the end of the test, I paid particularly close attention to connection closure activity in the pcaps by applying a Wireshark filter.

The first closure activity always comes from the Chromium user agent connection that initiates a call to the Zed lake service. In the failure case (starting at packet 531), the client does not respond immediately with its own FIN. This results in the server side sending TCP keep-alives every 15 seconds until the 60-second mark, at which time things finally shut down, presumably because Playwright does a hard kill on the client due to the test timing out.

This led me to guess that the root cause might be a timing issue: one or more of these connections still has final bits of work wrapping up right when the test starts trying to gracefully close things down. To test this theory, I did a set of looped runs with a branch zui-3063-sleep where I could add a sleep at the end of the failing e2e test, before the shutdown portion begins. In a baseline loop of 100 iterations without the sleep, I saw 83 successful runs and 17 failed runs. Adding a 5-second sleep, that went to 98 successful runs and 2 failed runs. Then with a 10-second sleep, I saw 400+ successful runs and zero failed runs before I stopped the loop. This does seem to indicate there could be something to that theory.
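For illustration, here is a minimal sketch of the kind of delay used in the zui-3063-sleep experiment, assuming a Playwright-style e2e test that closes the app during its teardown; the test body and helper details shown are placeholders, not Zui's actual test code:

```ts
// Hypothetical sketch of the zui-3063-sleep experiment: give in-flight lake
// connections time to finish before the test starts shutting the app down.
import { test } from "@playwright/test";

test("pool-load-success", async () => {
  // ...exercise the pool load as the real test does (omitted here)...

  // Experimental delay before the shutdown portion of the test begins.
  // In the looped runs above, 5s dropped failures from 17/100 to 2/100,
  // and 10s eliminated them entirely.
  await new Promise((resolve) => setTimeout(resolve, 10_000));

  // ...the test's normal teardown / app close follows...
});
```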
The next round of debugging was to add more logging and compare the output between runs.
In a successful run we see:
In a failed run:
In contemplating the point of hang indicated by this, @jameskerr's web searches turned up nodejs/undici#2026, which looked like it might very well be the root cause of our problems. It showed up with the same transition we made to the newer Electron that uses a newer Node, and users in that issue also describe it as an intermittent problem that goes away when they use a workaround.
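For context, one general mitigation for lingering keep-alive sockets with undici is to shorten the client's keep-alive window so idle connections are released promptly instead of hanging around until a hard kill. A minimal sketch using undici's public API follows; this is illustrative only, and not necessarily the fix Zui adopted or the exact workaround discussed in that issue:

```ts
// Sketch only: shorten undici's keep-alive so idle sockets close quickly
// rather than lingering. The timeout values here are illustrative.
import { Agent, setGlobalDispatcher } from "undici";

setGlobalDispatcher(
  new Agent({
    keepAliveTimeout: 1_000,    // close idle sockets after ~1s
    keepAliveMaxTimeout: 1_000, // cap any server-suggested keep-alive hints
  })
);
```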
Verified on Zui commit 5fa9071. I started the same loop of the e2e test at this commit to check the failure rate.
It made it through 210 successful runs and 0 failures before I stopped the loop. Compared with the 46 successful runs and 10 failures we saw previously, this definitely indicates the problem has been addressed.
Repro is with Zui commit ca047c0.
While working with Zui since the merge of #3061, I noticed one of the e2e tests failing that hadn't been failing before. Here's how the failure looks:
I just did a controlled set of repros and it does indeed seem like the failures are correlated with those changes. I used these repro steps:
Running at commit 59c4c4a, right before the changes in #3061, produced 53 successful runs and 0 failures before I stopped it. At commit ca047c0, the merge of #3061, it produced 46 successful runs and 10 failures.
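For anyone wanting to reproduce similar counts, here is a rough sketch of such a repro loop; the "yarn e2e" command and --grep filter are assumptions about how the e2e suite might be invoked, not the exact steps used here:

```ts
// Hypothetical repro loop: run the e2e test repeatedly and tally results.
// The command and arguments are assumptions about the project's scripts.
import { spawnSync } from "node:child_process";

let pass = 0;
let fail = 0;
for (;;) {
  const result = spawnSync("yarn", ["e2e", "--grep", "pool-load-success"], {
    stdio: "inherit",
    shell: process.platform === "win32", // yarn is a .cmd shim on Windows
  });
  if (result.status === 0) pass++;
  else fail++;
  console.log(`pass=${pass} fail=${fail}`);
}
```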