3.5 UI infinite SSE creation + timeout on Chrome #12626
Comments
So this is specifically after the code-split of Monaco Editor in #12150. But if you're getting a timeout on that, you would've gotten a timeout on every page before that PR.
This sounds like possibly an issue with your device as Firefox is more performant than Chrome. What are the specs of your device? Is it low-end? Also how performant is your network connection -- what is your download speed in Mbps? I can't reproduce this and the UI has a lot of usage, so there isn't much actionable in the issue as-is.
@agilgur5 Thank you for looking into this. I can confirm this is a problem for all my colleagues. Here is the system information
I get a decent download speed of 150 Mbps. Since it's happening with everyone in my company, it seems like a bug.
Huh, you're definitely on a (very) performant machine and performant network connection, more performant than my own in both cases too, where I can't reproduce this.
But it's not happening to all users (otherwise it would have been reported much earlier and have many +1s; also other contributors and I would've noticed). If it's everyone in your company, I wonder if it's a VPN or proxy issue?
My colleagues and I started experiencing this error today. Win 11 Pro, 13900K, 128 GB RAM, 1 Gb down connection. So far it only occurs for us on 1 instance; other team instances appear fine.
The timeout? Did you do an update recently? What version are you on? Again there isn't really anything actionable in the issue as-is. The one timeout error reported so far is for a dependency (Monaco) which is already as split as it can be, and so is no longer in Argo's control to shrink.
Yes, that would further suggest that a specific configuration is causing this and not a generic "UI is slow".
We've been on v3.5.2 for about 2 months. Our first report of the timeout issue came through today. We're still working on trying to figure out what may be the cause, as it's hard to reproduce. Previously it was consistently breaking in Chrome, however it now appears to be occasionally working. What I have observed is that it only appears to happen in Chrome for me, but works fine in Firefox. There appear to be stuck (pending) calls in Chrome, e.g.:
Firefox, on the other hand, looks to be using polling, whereas Chrome appears to be trying to maintain a number of persistent connections. At this point, I suspect it could be Teleport VPN or another security feature preventing too many persistent connections from being opened. Update: Once I figured this out, I was able to replicate it on other instances previously thought to be unaffected: It appears as though there is a hard limit of 6 connections per host in Chrome, and the UI hits that limit after about a minute or so of viewing the We also tested this and were able to replicate the behavior after removing Teleport from the equation to rule out the possibility that it is causing the issue.
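To illustrate the behavior described above, here is a minimal sketch of a per-host connection pool. It is a hypothetical model, not Chrome's or Argo's code: `HostConnectionPool`, `PooledRequest`, and `simulateLeak` are names invented for illustration, and the limit of 6 is the figure reported in this comment (it applies to HTTP/1.1 connections). Once leaked SSE streams occupy every slot, any further request to the same host stays pending until one of the streams is closed.

```typescript
// Hypothetical model of a browser's per-host connection pool (HTTP/1.1).
// The limit of 6 is the figure reported in this thread for Chrome.
const MAX_CONNECTIONS_PER_HOST = 6;

type RequestKind = "sse" | "fetch";

interface PooledRequest {
  id: number;
  kind: RequestKind;
  state: "active" | "pending" | "closed";
}

class HostConnectionPool {
  private nextId = 1;
  private requests: PooledRequest[] = [];
  private limit: number;

  constructor(limit: number = MAX_CONNECTIONS_PER_HOST) {
    this.limit = limit;
  }

  // Start a request: active if a slot is free, otherwise queued as pending.
  open(kind: RequestKind): PooledRequest {
    const state = this.activeCount() < this.limit ? "active" : "pending";
    const req: PooledRequest = { id: this.nextId++, kind, state };
    this.requests.push(req);
    return req;
  }

  // Close a request and hand its slot to the oldest pending request, if any.
  close(req: PooledRequest): void {
    const wasActive = req.state === "active";
    req.state = "closed";
    if (wasActive) {
      const next = this.requests.find((r) => r.state === "pending");
      if (next) next.state = "active";
    }
  }

  activeCount(): number {
    return this.requests.filter((r) => r.state === "active").length;
  }

  pendingCount(): number {
    return this.requests.filter((r) => r.state === "pending").length;
  }
}

// Open one SSE stream per page move and never close it (the leak scenario).
function simulateLeak(pageMoves: number): {
  pool: HostConnectionPool;
  streams: PooledRequest[];
} {
  const pool = new HostConnectionPool();
  const streams: PooledRequest[] = [];
  for (let i = 0; i < pageMoves; i++) {
    streams.push(pool.open("sse"));
  }
  return { pool, streams };
}
```

In this model, after `simulateLeak(6)` an ordinary `pool.open("fetch")` is queued as pending, which matches the stuck (pending) calls seen in the network tab; closing any one stream lets it proceed.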
@agilgur5 I work with @bradleyboveinis and am adding some more info here:
One more caveat in our environment is we run with I know the UI changed between 3.4.x and 3.5.x to move to the unified workflows view. Did this change how the SSEs are managed and how many may be used when viewing the workflows list?
Update here, I seem to have found the main root cause in #12663 (comment). That issue seems duplicative, or at least the root cause seems identical, but I'm not sure if the symptoms are exactly the same without more information from the user. Note that as I wrote there I was only able to partially reproduce this when enabling pagination and moving between pages. I was not able to reproduce this when doing nothing. I also only got two SSEs per page move, that would mostly add up on each other, but would never be more than two per page move. The root cause is likely the same, although I'm not sure how this infinite loop occurred as I couldn't repro it. Big thanks for all the details you both added here @bradleyboveinis and @stefansedich, as well as @alelapi in #12663! Those were all vital to partially reproducing it and figuring out what was going on 🙂
Responding to some questions & comments below:
Yea, that's a super rare limit I've occasionally run into, usually only when you have an app that works with many tabs. There should only be one or two open connections for the Argo UI, so that was surprising to see. Makes sense based on the rest though -- nice job noticing that!
This I could not fully reproduce, as I was only able to have two connections open at a time and only when moving between pages. But some old ones would remain after moving pages, causing a very similar effect of eventually hitting the connection limit etc.
The SSEs not cancelling for some reason seems to be latency sensitive, so your usage might also have made the non-cancellation happen more frequently.
It did not, but I also landed a large refactor to that page in 3.5.0 in #11891. That PR actually fixed a few subtle bugs and optimized a bunch, including significantly reducing the network activity as there were many unnecessary / duplicative requests being made before I refactored it (e.g. many list requests even though there's a Unfortunately, I've also found 3 less-than-one-liner bugs in that refactor, including this one. One was a typo (#12663 (comment)), and then this one and the other one (#12562) were super nuanced, both being ref issues (recursive ref here, stale ref there). The other one also had historical codebase context/non-React usage that I didn't know about, and this one had the infinite loop that I haven't been able to repro as well as the SSEs not being cancelled despite the UI cancellation code running. Really disorienting bugs to root cause, especially when the fixes are only like 10 characters long 😅 I also missed them in testing as they don't always pop up, requiring a certain configuration and potentially a race too. We definitely could use a lot more automated UI tests, though typically networking is mocked in those, so even that still might not catch these kinds of issues. It may require E2E UI tests to catch, but we already have quite a lot of E2E Controller, API, and CLI tests that can take some time to run (~10-25 min per test suite), so I'm a bit hesitant to continue adding E2Es specifically 😕
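As a generic illustration of how streams can pile up even though cleanup code runs, here is a small sketch. It is not the Argo UI code and not the actual fix from the linked PRs: `FakeStream`, `LeakyWatcher`, and `SafeWatcher` are hypothetical names, and `FakeStream.close()` only stands in for something like `EventSource.close()`. The leaky variant keeps a handle to just the latest stream, so cleanup can only ever close that one; the safe variant closes the previous stream before opening the next.

```typescript
// Hypothetical stand-in for an SSE connection; close() mirrors EventSource.close().
class FakeStream {
  closed = false;
  page: number;

  constructor(page: number) {
    this.page = page;
  }

  close(): void {
    this.closed = true;
  }
}

// Leaky variant: each page move overwrites the handle without closing the old stream.
class LeakyWatcher {
  readonly opened: FakeStream[] = [];
  private current: FakeStream | null = null;

  goToPage(page: number): void {
    this.current = new FakeStream(page); // the previous stream is never closed
    this.opened.push(this.current);
  }

  // Cleanup runs, but it can only reach the most recent stream.
  unmount(): void {
    this.current?.close();
    this.current = null;
  }

  liveCount(): number {
    return this.opened.filter((s) => !s.closed).length;
  }
}

// Safe variant: close the previous stream before opening the next one.
class SafeWatcher {
  readonly opened: FakeStream[] = [];
  private current: FakeStream | null = null;

  goToPage(page: number): void {
    this.current?.close();
    this.current = new FakeStream(page);
    this.opened.push(this.current);
  }

  unmount(): void {
    this.current?.close();
    this.current = null;
  }

  liveCount(): number {
    return this.opened.filter((s) => !s.closed).length;
  }
}
```

With the leaky variant, every page move adds one more live stream, so a handful of page moves is enough to reach a small per-host connection limit; the safe variant never holds more than one.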
Regarding root causing and reproducing the infinite loop, I'm curious if there's maybe a stale cache or something causing that? That was something I was looking for more info on in #12663 (comment). In local dev, I clear my caches with some frequency, so I wonder if that's why I haven't been able to repro it. In particular would be
…d not be deleted (Fixes argoproj#12626)
Pre-requisites
:latest
What happened/what did you expect to happen?
The Argo WF GUI is not stable and not responsive. Most of the time it's slow to load pages, and it runs into an error when clicking to show templates or workflows:
This issue is frequent in Chrome, on the other hand Firefox is loading everything fine most of the time.
Expected: Page should load without any error.
Versions:
Version
3.5.4
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
NA
Logs from the workflow controller
Logs from in your workflow's wait container