Postmortem: 0.8.12 => 0.8.17 failed to load on Windows #3946

Closed
mattseddon opened this issue May 21, 2023 · 2 comments
Labels
😇 postmortem Critical issue follow to reflect and improve. Not to blame anyone or point fingers!

@mattseddon

High-level summary

Versions 0.8.12 => 0.8.16 (inclusive) of the extension did not initialize for Windows users.

Timeline

All dates/times are GMT+10.

2023-05-12 - #3856 merged to main
2023-05-16 - 0.8.12 released to the marketplace and open-vsx
2023-05-19 06:30 - User reported on Discord "The extension is currently unable to initialize" on Windows, with VS Code version 1.78.2 and DVC extension v0.8.16
2023-05-19 07:04 - @shcheklein confirmed with the user that he could replicate the behaviour on Windows.
2023-05-19 07:05 - @shcheklein raised the issue with the team in #vs-code-eng
2023-05-19 07:34 - team started to debug
2023-05-19 08:15 - Assumption was made that the issue was caused by importing external ESM packages into the extension code base on Windows
2023-05-19 08:17 - @mattseddon asked the user to try rolling back to 0.8.11 to confirm
2023-05-19 08:20 - #3856 was confirmed as the cause of the issue (user non-responsive)
2023-05-19 08:32 - @mattseddon starts working on a fix¹
2023-05-19 14:43 - Revert-ish merged into main
2023-05-19 14:53 - @mattseddon notified the reporting user on Discord that a fix was being shipped
2023-05-19 15:00 - 0.8.17 released
2023-05-20 05:04 - User responded with "Sorry for the delayed response. I know it's late, but I can confirm that 0.8.11 worked when rolling back. And v0.8.18 is working as well! Thanks y'all for the quick work."

¹ Due to the time of day in the rest of the world, @mattseddon opted to try to fix the issue instead of immediately rolling back. Between starting work on a fix and 0.8.17 being released, various approaches were tried and failed. The process for testing updates was broken, in that @mattseddon had to send each VSIX to @shcheklein on Slack (see thread).

Perf Indicators

Time to notice: 3 days.
Self-healed after: 8.5hrs.
Resolved after: 23hrs.
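
For reference, here is a minimal sketch of how these figures can be reproduced from the timeline. It assumes that "Self-healed after" runs from the first user report to the 0.8.17 release, that "Resolved after" runs from the first user report to the user's confirmation, and that "Time to notice" runs from the 0.8.12 release to the first user report. The timeline gives no time of day for the 0.8.12 release, so midnight is an assumption, and `hoursBetween` is a hypothetical helper.

```typescript
// Timestamps are taken from the timeline above (all GMT+10).
const HOUR_MS = 60 * 60 * 1000

// Hypothetical helper: elapsed hours between two ISO 8601 timestamps.
const hoursBetween = (start: string, end: string): number =>
  (Date.parse(end) - Date.parse(start)) / HOUR_MS

// Assumption: no release time is recorded, so midnight is used.
const released = '2023-05-16T00:00:00+10:00'
const reported = '2023-05-19T06:30:00+10:00'
const fixReleased = '2023-05-19T15:00:00+10:00'
const userConfirmed = '2023-05-20T05:04:00+10:00'

// Whole days between the 0.8.12 release and the first report.
const daysToNotice = Math.floor(hoursBetween(released, reported) / 24)

// Hours from the first report to the 0.8.17 release.
const selfHealedAfter = hoursBetween(reported, fixReleased)

// Hours from the first report to the user's confirmation, rounded.
const resolvedAfter = Math.round(hoursBetween(reported, userConfirmed))

console.log({ daysToNotice, selfHealedAfter, resolvedAfter })
```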

Impact

The following graphs demonstrate how many Windows users we had for the extension during the time of the incident:

[Graph: Total users]

[Graph: Users (with at least one active event)]

[Graph: Engaged Users]

Root cause analysis

Somewhere between the code being transpiled and bundled by Webpack and then further mangled by VSCE, the imports of the split-out/dynamically imported ESM packages are transformed into relative imports. The problem that this creates is detailed in the following issues:

Importing of ESM packages into VS Code extensions is also marked as problematic here: microsoft/vscode#130367
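
As general background, and not confirmed as the exact mechanism in this incident: on Windows, Node's ESM loader rejects absolute file system paths such as `C:\...` passed to `import()` and expects `file://` URLs instead. This is one reason why import specifiers rewritten during bundling can behave differently across platforms. The sketch below shows a defensive pattern for that general case. `toImportSpecifier` and `importEsm` are hypothetical names and are not taken from the extension's code base.

```typescript
import { isAbsolute } from 'node:path'
import { pathToFileURL } from 'node:url'

// Hypothetical helper (not from the extension's code base).
// Bare specifiers such as 'execa' are returned unchanged so that normal
// package resolution still applies. Absolute file system paths are converted
// to file:// URLs, which Node's ESM loader accepts on every platform.
const toImportSpecifier = (specifier: string): string =>
  isAbsolute(specifier) ? pathToFileURL(specifier).href : specifier

// Hypothetical wrapper around dynamic import. This only loads ESM if the
// build preserves `import()`. TypeScript's `module: commonjs` setting and
// some bundler configurations rewrite it into a `require` call.
const importEsm = (specifier: string): Promise<unknown> =>
  import(toImportSpecifier(specifier))
```

A pattern like this does not remove the need to test the packaged VSIX on Windows, because the rewriting described above happens at build time, after the source has been written.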

Personal thoughts

  • The reward-to-risk ratio for this change was not high enough to justify pushing it through. Process-exists has had no updates since being moved to ESM. Execa has had 4 releases but none of them affect any of the features that we use.
  • Revert first is always the best option.
  • Automatic revert was not available from the GH UI.
  • The total amount of time wasted by making this change is really disappointing.

Prevention and next steps

  1. Under these circumstances provisioning of a Windows machine for the person working on the fix should be done as soon as possible.
  2. From a testing point of view there isn't much more that we can do. At the time of writing there are no frameworks that take a compiled VSIX and test it against VS Code.
  3. We (I) can (and should) better assess the risks of making a change and opt out when the benefit does not justify the cost.
  4. Focusing (my) efforts on a smaller, more targeted set of tasks would be beneficial. Time is finite and working on the most impactful things is important.
  5. Whenever an incident occurs we should revert first and look for a more permanent fix as a follow up.
  6. Getting more users will help with the time to notice.
@shcheklein

Thanks @mattseddon !

> Under these circumstances provisioning of a Windows machine for the person working on the fix should be done as soon as possible.

I'm using a virtual Win machine that I'm running on my Mac for this. With full dev environment for DVC, DVCLive, VS Code.

@shcheklein

Closing this. We'll discuss on retro and it's still there to comment and read.
